Should the agent be allowed to issue refunds on its own?

Yes, but only inside a policy envelope you encode as code, not as a prompt. The model proposes the action; a deterministic permission layer decides whether it is allowed. A refund under a dollar threshold, on an order in an eligible state, for a customer in good standing, can run autonomously and be logged. Anything outside those bounds — a larger amount, a flagged account, a second refund in a week — gets routed to a human for approval. Never let the language model be the final authority on money. The model decides what to do; your guardrails decide what it may actually do.

How does RAG fit into a support agent?

RAG (retrieval-augmented generation) is how the agent answers from your knowledge instead of from the model's frozen training data. At question time the agent retrieves the most relevant passages from your help center, policy docs, and past resolved tickets, then drafts an answer grounded in those passages and cites them. This keeps replies current the moment you update an article, lets you show the customer (and your QA team) the source, and sharply cuts the confident-but-wrong answers that destroy trust in support automation. It is the difference between a bot that guesses your refund window and one that quotes it.

When should the agent hand off to a human?

On three triggers. First, low confidence — retrieval came back thin, the question is ambiguous, or the model's own grounding check fails. Second, high stakes — the action exceeds a permission threshold, the account is flagged, or the topic is legal, billing dispute, or safety related. Third, sentiment — the customer is angry, has asked for a human, or the conversation has looped without progress. A good handoff is warm: it passes the full transcript, the agent's proposed action, and a one-line summary so the human starts informed instead of asking the customer to repeat everything.

How long does it take to ship a useful support agent?

A grounded, read-only agent that answers FAQs from your help center can be live in days, because the hard parts — retrieval and a decent prompt — are well understood. The real timeline is in the action layer and evaluation: wiring tools to your helpdesk and CRM, encoding permission rules, building an eval set from real tickets, and tuning until resolution rate is high enough to trust. Plan for a few weeks to go from a deflection assistant to an agent that actually resolves tier-one tickets end to end, then keep iterating with the transcripts it generates.

Blog · Tutorial

How to Build an AI Customer Support Agent (Step by Step)

Q: What is the single best metric for an AI support agent?

Resolution rate — the share of conversations the agent closes correctly without a human ever touching them. It is honest in a way that deflection rate and CSAT are not: a bot can deflect a ticket by frustrating someone into giving up, and CSAT can stay high while the agent quietly fails to actually fix anything. Resolution rate forces you to define what 'resolved' means (the customer's problem is solved, verified by no reopen within a window) and then measure it. Pair it with escalation rate and reopen rate so you can see whether you are buying resolution at the cost of bouncing hard tickets to humans.

Most support 'bots' deflect tickets. This walkthrough builds one that resolves them — grounded in your docs, wired to your helpdesk, allowed to take real actions inside guardrails, and measured by the only number that matters: resolution rate.

11 min read
Tutorial
Updated 2026

Start building free Support agent use case

Anyone can wire a chat window to a language model and call it a support agent. Building one your team actually trusts with customers — and with refunds — is a different job. This is that job, end to end.

I have watched a lot of support automation get shipped, and the failure pattern is almost always the same: teams optimize for deflection (fewer tickets reaching a human) instead of resolution (the customer's problem actually solved). A bot that answers in a confident, friendly tone and resolves nothing is worse than no bot, because it burns the customer's patience before they reach a person who can help.

So we are going to build in a specific order, and the order is the point. Goal and metric first. Then read access — tools that let the agent see the helpdesk, CRM, and knowledge base. Then a grounded reasoning loop with RAG so answers come from your content, not the model's imagination. Only then do we give the agent the power to act — and we put every action behind a permission check. Finally, guardrails, human handoff, and an evaluation loop that keeps the whole thing honest.

If you want the conceptual foundations underneath each step, the how to build AI agents guide covers the architecture; this post is the hands-on build for one concrete, high-value agent.

The shape of it

The six steps, in order

Each step depends on the one before it. Skipping ahead — actions before grounding, automation before measurement — is how support agents go wrong in public.

1 · Goal + metric
Decide exactly which tickets the agent owns and define resolution rate as the number you optimize. Everything downstream is judged against it.
2 · Connect tools (read)
Give the agent eyes: read access to the helpdesk ticket, the customer's CRM record and order history, and a searchable knowledge base.
3 · Reasoning loop + RAG
Run a reason → retrieve → answer loop so every reply is grounded in retrieved passages and can cite its sources.
4 · Actions behind permissions
Let the agent issue refunds, update orders, and reset access — but only inside policy bounds enforced as code, with every action logged.
5 · Guardrails + escalation
Catch low confidence, high stakes, and frustrated customers, and hand off warmly to a human with full context.
6 · Evaluate + iterate
Score real conversations on resolution, grounding, and citation accuracy, then feed failures back into prompts, retrieval, and rules.

Step 1

Define the goal and pick resolution rate

If you cannot name what 'solved' means, you cannot build an agent that solves it — and you certainly cannot tell whether it does.

Start narrow. Pick a slice of tickets the agent will own outright — say, order status, returns within policy, and password resets — and write down what a resolved ticket looks like for each. A resolution is the customer's problem genuinely solved, verified by the ticket not reopening within a window (say 72 hours) and no human follow-up required.

Then commit to resolution rate as the north star. It is harder to game than deflection rate and more honest than CSAT. Deflection rewards a bot for making people give up; CSAT can stay cheerful while nothing actually gets fixed. Resolution rate ties the agent's score to the outcome the business and the customer both want.

Track a few guardrail metrics alongside it so you do not optimize one number into the ground: escalation rate (are you bouncing too much to humans, or too little?), reopen rate (did the "resolution" actually hold?), and time to resolution. Set a target before you build — for example, autonomously resolve 60% of tier-one tickets with a reopen rate under 5% — so you have a finish line to aim at.

Resolution rate

North-star metric

solved, not just deflected

<5%

Reopen rate

did it actually hold?

72h

Verification window

no reopen = resolved

Tier-1

Initial scope

start narrow, expand later

Representative target, not a promise

Numbers like "60% autonomous resolution" are illustrative starting goals, not benchmarks. Your real baseline comes from measuring your own ticket mix. The discipline that matters is picking the target before you build, so the agent is steered by an outcome instead of vibes.

Step 2

Connect the tools the agent needs to see

Before an agent can resolve anything it has to perceive the situation. Give it read access first — actions come later, deliberately separated.

Helpdesk

Read the open ticket, the full conversation history, tags, and channel. This is the question and its context — the agent should never answer blind to what was already said.

CRM + orders

Look up the customer record: plan, lifetime value, account flags, and order history. 'What is the status of my order?' is unanswerable without it, and entitlement checks depend on it.

Knowledge base

A searchable, chunked index of help articles, policy docs, and past resolved tickets. This is the source of truth the agent retrieves from to ground its answers.

Each connection is a tool — a typed function with a clear name, a description the model can reason about, and a strict schema for its inputs and outputs. Keep the read tools and the write tools in separate buckets in your head and in your code; conflating "look up the order" with "refund the order" is how agents take actions they should only have been reading about.

Scope every read with the customer's identity. The agent should only be able to retrieve the record of the person it is talking to, enforced by your auth layer rather than by trusting the model to behave. We will come back to this hard when we add actions — see AI agent security for the threat model around tools that touch customer data.

tools.pypython

1tools = [  // the agent's read surface2    Tool(3        name="get_ticket",4        description="Fetch the open ticket + full history",5        params={"ticket_id": "string"},6    ),7    Tool(8        name="lookup_customer",9        description="CRM record, plan, flags, orders",10        params={"customer_id": "string"},11    ),12    Tool(13        name="search_kb",14        description="Semantic search over help docs + past tickets",15        params={"query": "string", "top_k": "int"},16    ),17]

Read tools defined with explicit schemas. Note search_kb is the retrieval entry point the RAG loop will call.

Step 3

Wire the reasoning loop with RAG

This is the brain. The agent reads the situation, retrieves grounding passages, drafts a cited answer, and checks itself before it speaks.

Perceive

Read ticket, customer, history

Reason

What does the customer need?

Retrieve

Pull top-k grounding passages

Ground-check

Is the answer supported?

Respond / act

Cited reply, or take an action

The support agent's inner loop. Retrieval grounds the answer; the grounding check decides whether to answer, retrieve again, or escalate.

The loop is deliberately boring: perceive, reason, retrieve, check, respond. When a ticket arrives, the agent reads it and the customer record, decides what is actually being asked, and calls search_kb to retrieve the most relevant passages from your knowledge base. RAG is what makes the next token come from your refund policy rather than the model's vague memory of refund policies in general.

The non-negotiable instruction in the prompt is: answer only from the retrieved context, cite the source, and if the context does not contain the answer, say so and escalate instead of inventing. That single rule is the line between a grounded support agent and a liability.

Add a lightweight grounding check after generation: is each claim in the draft supported by a retrieved passage? If retrieval came back thin or the check fails, the agent does not ship the answer — it either retrieves again with a sharper query or routes to a human. This is also exactly where the agent decides whether the right next move is a sentence or an action.

agent_loop.pypython

1def handle(ticket):  // one ticket, one loop2    customer = lookup_customer(ticket.customer_id)3    passages = search_kb(ticket.question, top_k=5)  // RAG retrieval4    if not passages:5        return escalate(ticket, reason="no_grounding")67    draft = llm(answer_prompt(ticket, customer, passages))8    if not grounded(draft, passages):  // faithfulness check9        return escalate(ticket, reason="low_confidence")1011    if draft.proposed_action:  // model wants to DO something12        return run_with_permission(draft.proposed_action, customer)13    return reply(ticket, draft.text, cites=draft.sources)

A trimmed support loop: retrieve, draft a grounded answer, verify, then either respond or hand off.

Step 4

Add actions behind permission checks

An agent that can only talk is a smarter FAQ. The leap to real value is letting it act — and the leap to real risk is letting it act without a gate.

The model is good at deciding what should happen. It is the wrong place to decide whether it is allowed. So we split those jobs. The model proposes an action — "refund order #4821, $38, reason: damaged on arrival" — and a separate, deterministic permission layer evaluates that proposal against policy written as code.

Encode your policy as explicit rules: amount thresholds, allowed order states, account-standing checks, and rate limits. A small refund on an eligible order for a customer in good standing runs autonomously and is logged. Anything outside the envelope — a larger sum, a flagged account, a second refund this week — returns a "needs approval" verdict and routes to a human. The model never touches money directly; it only ever asks the gate.

Make every action idempotent and auditable. Each one writes a record: who (the agent), what (the action and arguments), why (the cited reasoning), and the verdict. When something goes wrong — and it will — that trail is how you debug it and how you keep trust. The security model for tool-using agents lives or dies on this layer.

Model proposes

Reads context and picks the right action
Drafts the arguments (order, amount, reason)
Explains its reasoning for the audit log
Adapts to phrasing and edge cases

Guardrail disposes

Checks amount + rate limits deterministically
Verifies order state and account standing
Approves, denies, or requires human sign-off
Logs every verdict, idempotent by design

permission.pypython

1def run_with_permission(action, customer):  // the gate2    if action.type == "refund":3        if customer.flagged:  // account standing4            return needs_human(action, "flagged_account")5        if action.amount > AUTO_REFUND_LIMIT:  // $50, say6            return needs_human(action, "over_limit")7        if refunds_this_week(customer) >= 1:  // rate limit8            return needs_human(action, "rate_limited")9        result = refund(action.order_id, action.amount)10        audit_log(agent=True, action=action, verdict="auto")11        return result12    return needs_human(action, "unknown_action")

The permission layer is plain, testable code — not a prompt. The model can only ever request; this function decides.

Step 5

Guardrails and human-in-the-loop escalation

The mark of a mature support agent is not that it never needs a human — it is that it knows exactly when it does, and hands off without making the customer start over.

Low confidenceThin retrieval, failed check

High stakesOver limit, flagged, legal

FrustrationAnger or 'get me a human'

Warm handoffFull context to an agent

Three escalation triggers feed one warm handoff: the human receives the transcript, the proposed action, and a one-line summary.

When to hand off

Low confidence — Retrieval was thin, the question is ambiguous, or the grounding check failed.
High stakes — The action exceeds a permission threshold, the account is flagged, or the topic is billing, legal, or safety.
Negative sentiment — The customer is frustrated, has explicitly asked for a person, or the conversation is looping.
Repeated failure — Two attempts have not moved the ticket forward — stop trying and escalate.

What a warm handoff includes

The full transcript — So the human never asks the customer to repeat themselves.
A one-line summary — What the customer wants and what the agent already tried.
The proposed action — If the agent wanted to act but was blocked, surface it for one-click approval.
Cited sources — The passages the agent retrieved, so the human can verify fast.

Guardrails are layers, not a single prompt

Input filtering (PII, prompt-injection attempts in ticket text), output checks (grounding, tone, no leaked internal notes), and action gates (the permission layer) are three separate defenses. A jailbreak that slips past one should still hit the next. Treat the model as a capable but untrusted component and design around that, exactly as you would for any system handling customer data.

Step 6

Evaluate, then iterate forever

A support agent is not a feature you ship and forget. It is a system you measure, debug, and improve from the transcripts it generates every day.

Build an eval set from real, anonymized tickets — a few hundred to start, spanning easy FAQs, edge cases, and tickets that should escalate. Label each with the correct outcome. Now you can score changes instead of guessing whether a prompt tweak helped. Measure the stages separately: retrieval quality (did the right passage come back?), grounding (is every claim supported?), citation accuracy, and action correctness (did the gate approve the right things and block the rest?).

The biggest evaluation trap is grading only the final reply. A good answer can mask broken retrieval, and a bad answer can mask perfect retrieval with a weak prompt. Score the pieces so you fix the part that is actually wrong. Then close the loop: failing transcripts become new eval cases, gaps in the knowledge base get filled, and recurring escalations tell you the next action worth automating.

What to measure	Healthy	Investigate
Resolution rate	Trending up	Flat or falling
Reopen rate	Low + stable	Spiking
Grounding / faithfulness
Citation accuracy
Escalation rate	Calibrated	Too high or too low
Action correctness

That feedback loop is the whole craft. The agent you launch is a draft; the agent six weeks of real tickets later is the product. For the broader patterns behind this — planning, memory, multi-step tool use — the how to build AI agents guide is the companion read, and the customer support use case page has more on where this kind of agent pays off.

FAQ

Building a support agent, answered

Resolution rate — the share of conversations the agent closes correctly without a human ever touching them. It is honest in a way that deflection rate and CSAT are not: a bot can deflect a ticket by frustrating someone into giving up, and CSAT can stay high while the agent quietly fails to actually fix anything. Resolution rate forces you to define what 'resolved' means (the customer's problem is solved, verified by no reopen within a window) and then measure it. Pair it with escalation rate and reopen rate so you can see whether you are buying resolution at the cost of bouncing hard tickets to humans.

Keep reading

Go deeper on each piece of the build

Customer support use caseWhere support agents pay off, with patterns RAGGround answers in your own knowledge How to build AI agentsThe architecture under every step here AI agent toolsDefine helpdesk, CRM, and KB connections AI agent securityThe threat model for acting agents

build customer support agentAI support agent tutorialhow to build a support botRAG support agentcustomer service automationAI agent tutorial

Get started

Build a support agent that resolves, not just deflects

Connect your helpdesk, ground answers in your docs, and let your agent act safely inside guardrails. Start free, or fork a ready-made template.

Start building free Browse templates