AI Agent Evaluation: how to test and eval agents
A demo that works once proves nothing. Real agents take many steps, call tools with side effects, and answer differently every run — so you need evaluation that scores the whole trajectory, not just the last sentence.
- Intermediate
- Updated 2026
AI agent evaluation is how you turn "it worked when I tried it" into a number you can trust — a repeatable measure of whether an agent completes tasks correctly, safely, and affordably across the messy range of inputs it will actually meet.
Evaluating a single language-model completion is already subtle. An agent raises the difficulty sharply: it reasons in a loop, decides which tools to call, threads state across many steps, and produces a different sequence of actions every run. Two transcripts can reach the same correct answer by completely different routes — and a fluent, confident final paragraph can sit on top of a trajectory that called the wrong API, ignored retrieved evidence, or burned ten redundant tool calls getting there.
That is why agent evals measure two things at once: the trajectory (what the agent did) and the outcome (what it produced). This guide covers why agent evaluation is genuinely hard, the split between offline and online evals, the core metrics — task success rate, tool-call accuracy, faithfulness, latency, and cost — how to use LLM-as-a-judge without fooling yourself, how to build a real eval dataset, and how to wire regression gates into CI so quality can't quietly slide.
Why evaluating agents is hard
The same properties that make agents powerful — autonomy, tools, and multi-step reasoning — are exactly what make them hard to grade.
A classic test has one input and one expected output. An agent has a trajectory: a branching sequence of decisions, tool calls, and observations that can unfold many valid ways. There is rarely a single golden path, so you cannot just diff against an expected transcript.
Agents are also non-deterministic. Sampling temperature, tool latency, and changing external data mean the same prompt yields different runs. A single pass that succeeds tells you little; you need repeated runs and a success rate, not a pass/fail bit.
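To make "a success rate, not a pass/fail bit" concrete, here is a minimal sketch that turns repeated runs of one task into a pass rate with a Wilson score interval. The function name and the sample outcomes are illustrative assumptions, not part of any particular eval framework.

```python
import math


def success_rate(outcomes: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Return (rate, low, high): observed pass rate plus a Wilson score interval."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one run")
    p = sum(outcomes) / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical outcomes from 10 repeated runs of the same task.
runs = [True, True, False, True, True, True, False, True, True, True]
rate, low, high = success_rate(runs)
print(f"pass rate {rate:.0%}, 95% interval {low:.0%} to {high:.0%}")
```

With only ten runs the interval is wide, which is the point: a handful of repeats narrows down the true success rate far less than a single headline percentage suggests.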
Tools add side effects and compounding error: a small mistake in step two cascades through every later step, and a tool that writes data can't simply be replayed. Worse, the things you most care about — was the answer actually grounded? was the plan reasonable? — are fuzzy and need either a rubric or a model to judge them. Treating an agent like a deterministic function is the first mistake; good evaluation embraces the uncertainty instead.
Many valid paths
There's no single golden trajectory — two correct runs can take entirely different routes, so exact-match grading breaks down.
Non-determinism
Sampling and live data make every run different. You measure a success rate over repeats, not one pass/fail outcome.
Compounding error
An early wrong step poisons every later one. Long horizons mean small per-step errors multiply into task failure.
Fuzzy success criteria
Grounding, helpfulness, and plan quality resist exact matching — they need rubrics, references, or a judge model.
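The compounding-error point can be illustrated with simple arithmetic. The sketch below assumes every step succeeds independently with the same probability, which is a simplification (real steps are correlated and agents can sometimes recover), so read it as an intuition pump rather than a model of any specific agent.

```python
def task_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent, identical steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical agent whose individual steps succeed 95% of the time.
for steps in (1, 5, 10, 20):
    print(steps, round(task_success_probability(0.95, steps), 3))
```

Under these assumptions, a step that is right 95% of the time yields a 20-step task that succeeds only about 36% of the time.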
Offline vs online evals
Offline evals answer 'is this change safe to ship?'. Online evals answer 'did it actually help?'. You need both, and they feed each other.
| Dimension | Offline evals | Online evals |
|---|---|---|
| Runs against | A fixed, curated dataset | Live production traffic |
| Ground truth | Known, labeled upfront | Often delayed or absent |
| Reproducible | Yes | No |
| Best for | CI gates, A/B of prompts & models | Real-world impact, drift detection |
| Cost to run | Cheap, repeatable | Tied to real usage |
| Catches | Regressions before release | Surprises after release |
| Feedback signal | Pass/fail vs thresholds | User behavior, ratings, outcomes |
Offline evaluation is your lab bench. You assemble a dataset of recorded tasks with known answers, replay the agent against it in an isolated harness, and score every run the same way each time. Because the inputs and the scoring are fixed, even though individual agent runs still vary, it's well suited to comparing two prompts, two models, or two versions of a tool, and to the CI gates we cover below.
Online evaluation is the real world. In production the agent meets inputs no dataset anticipated, and ground truth is murky: you rarely know the "right" answer, so you lean on proxy signals — user thumbs, task completion, escalation rates, downstream conversions — plus sampled human review. The two regimes form a loop: surprising or failed production traces, surfaced through agent observability, become tomorrow's offline test cases.
The layers of agent evaluation
Evaluation isn't one test. It's a stack — from individual tool calls up to end-to-end business outcomes — and a healthy agent is measured at every layer.
Evaluate bottom-up to localize failures
When the top layer fails, the lower layers tell you why. A dropping task-success rate is just an alarm; component and trajectory evals are the diagnostics that point at the broken tool call or the wasteful loop. Without them you're left re-running the whole agent and guessing. This is the same staged-measurement discipline used in agentic workflows.
The metrics that matter
Five families of metric cover almost everything you need: did it succeed, did it use tools correctly, was it grounded, was it fast, and was it cheap.
Task success rate
The headline number: across repeated runs, what fraction of tasks reached a correct, complete outcome? Defined per task type with a clear pass condition.
Tool-call accuracy
Did the agent pick the right tool, with valid arguments, at the right moment? Catches the silent failures behind a plausible-looking answer.
Faithfulness / grounding
Is every claim supported by retrieved context or tool output, with nothing invented? The anti-hallucination metric for tool- and RAG-using agents.
Latency
End-to-end time per task, including every tool round-trip. Track the tail (p95) — agents that loop produce ugly long-tail latency.
Cost
Tokens and tool spend per completed task. The metric that decides whether an agent is economically viable at scale.
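One way to compute the five metric families from a batch of recorded runs is sketched below. The `RunResult` fields and the `scorecard` function are hypothetical names chosen for illustration. The sketch also assumes that tool-call correctness and grounding have already been judged upstream, and it charges failed runs to the cost of completed work.

```python
import math
from dataclasses import dataclass


@dataclass
class RunResult:
    success: bool          # task reached its pass condition
    tool_calls_ok: int     # tool calls judged correct
    tool_calls_total: int  # all tool calls made
    grounded: bool         # answer supported by context or tool output
    latency_s: float       # end-to-end seconds, including tool round-trips
    cost_usd: float        # token spend plus tool spend


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def scorecard(runs: list[RunResult]) -> dict[str, float]:
    if not runs:
        raise ValueError("need at least one run")
    completed = sum(r.success for r in runs)
    calls = sum(r.tool_calls_total for r in runs)
    spend = sum(r.cost_usd for r in runs)
    return {
        "task_success_rate": completed / len(runs),
        "tool_call_accuracy": sum(r.tool_calls_ok for r in runs) / calls if calls else 1.0,
        "faithfulness_rate": sum(r.grounded for r in runs) / len(runs),
        "latency_p95_s": p95([r.latency_s for r in runs]),
        # Total spend (failed runs included) divided by completed tasks.
        "cost_per_completed_task_usd": spend / completed if completed else float("inf"),
    }
```

Dividing total spend by completed tasks, rather than by all runs, makes failed attempts visible in the cost figure instead of letting them flatter it.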
Representative eval scores (illustrative)
- Runs per case: average over repeats, not one pass
- Latency to watch: the tail, not just the mean
- Cost per task: tokens + tool spend, per completed task
No single number is enough. A 90% success rate can still hide a faithfulness problem that erodes trust, and a beautifully grounded agent that costs ten dollars per task may never ship. Read the metrics as a scorecard, and weight them to the job: a customer-support agent leans on faithfulness and refusal calibration, while a coding agent leans on task success and tool-call accuracy. Cost and latency tracking pairs directly with how you deploy and scale the agent in production.
Trajectory vs final-answer evaluation
Grade only the answer and you'll ship lucky guesses. Grade only the path and you'll miss a wrong conclusion. You need both, scored separately.
Final-answer evaluation asks: is the end result correct, complete, grounded, and relevant to the goal? It's what users feel, and it's the easiest to label. But it's blind to how the agent got there — a right answer reached by a wrong route is a failure waiting to recur the moment inputs shift.
Trajectory evaluation asks: did the agent take a sound path? Did it choose the right tools, pass correct arguments, avoid redundant or looping calls, and recover gracefully when a step failed? You can score it against a reference trajectory, by rules (no forbidden tool, no more than k calls), or with a judge that rates the plan's soundness.
The reason to separate them is diagnostic clarity: a good answer hiding a bad trajectory is fragile, and a bad answer hiding a good trajectory usually means the final prompt — not the agent's reasoning — is the thing to fix.
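As a rough sketch of the reference-trajectory and rule-based options above, the functions below check that the reference tools appear in order (extra calls are allowed, since many paths can be valid) and flag forbidden tools or an excessive call count. The function names and the tool names in the usage example are hypothetical.

```python
def follows_reference(actual: list[str], reference: list[str]) -> bool:
    """True if the reference tools appear in order within the actual calls."""
    remaining = iter(actual)
    # Each membership test consumes the iterator, so order is enforced.
    return all(tool in remaining for tool in reference)


def trajectory_violations(actual: list[str], forbidden: set[str], max_calls: int) -> list[str]:
    """Return human-readable rule violations; an empty list means the path passed."""
    problems = []
    used_forbidden = sorted(set(actual) & forbidden)
    if used_forbidden:
        problems.append(f"forbidden tools used: {used_forbidden}")
    if len(actual) > max_calls:
        problems.append(f"{len(actual)} calls exceeds limit of {max_calls}")
    return problems


# Hypothetical trace: a redundant second search is tolerated, the order is not violated.
calls = ["search", "search", "fetch_page", "answer"]
print(follows_reference(calls, ["search", "fetch_page"]))
print(trajectory_violations(calls, forbidden={"delete_record"}, max_calls=3))
```

Matching an ordered subsequence instead of the exact transcript keeps the check tolerant of harmless detours while still catching a skipped or misordered step.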
Final-answer eval — strengths & limits
- Matches what the user actually experiences
- Easy to label with reference answers
- Cheap to compute at scale
- But: rewards lucky guesses and hides fragility
Trajectory eval — strengths & limits
- Exposes wrong tools, bad args, wasteful loops
- Catches fragility before it reaches users
- Pinpoints which step broke
- But: harder to label; many paths can be valid
```python
def eval_run(trace, case):  # trace = recorded agent steps
    # valid_args, judge_success, judge_faithful, and Score are placeholders
    # for graders and a result type you define elsewhere.
    tools = [s.tool for s in trace.steps]

    # 1 - trajectory: right tools, no waste
    tool_ok = set(case.expected_tools).issubset(tools)
    no_loops = len(tools) <= case.max_steps
    args_ok = all(valid_args(s) for s in trace.steps)  # schema check

    # 2 - outcome: success + grounding, scored apart
    success = judge_success(trace.answer, case.reference)
    grounded = judge_faithful(trace.answer, trace.context)  # LLM-as-judge

    return Score(tool_ok, no_loops, args_ok, success, grounded)
```
LLM-as-a-judge, and its caveats
A judge model can grade thousands of cases against a rubric for cents — but only if you treat the judge itself as a system that needs calibration and version control.
A grader you must calibrate
LLM-as-a-judge prompts a separate model to score an output against a rubric: rate faithfulness 1-5, decide if the claim is supported by context, or pick which of two answers is better. It scales the kind of nuanced judgment that exact-match metrics can't capture, and it's the only practical way to grade open-ended agent answers at volume.
But a judge is a model with biases. It tends to reward longer and more confident answers, can favor outputs from its own model family, is sensitive to option order in pairwise comparisons, and silently drifts when you change its prompt or its underlying model. An uncalibrated judge gives you precise-looking numbers that quietly lie.
- Calibrate against a human-labeled set before trusting scores.
- Pin the judge model and prompt; version them like code.
- Prefer reference-based or pairwise grading over lone 1-10 scores.
- Randomize answer order to cancel position bias.
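The order-randomization item in the checklist above can be sketched as follows. The judge is passed in as a plain callable so the example stays independent of any model API. Its signature, and the convention that it returns "first" or "second", are assumptions made for illustration.

```python
import random
from typing import Callable

# Hypothetical judge signature: (task, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def pairwise_preference(task: str, answer_a: str, answer_b: str,
                        judge: Judge, rng: random.Random) -> str:
    """Return "A" or "B", presenting the two answers to the judge in random order."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(task, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected judge verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Map the positional verdict back to the original labels.
    return "B" if picked_first == swapped else "A"


# Stub judge with pure position bias: it always prefers whatever it sees first.
def biased_judge(task: str, first: str, second: str) -> str:
    return "first"


rng = random.Random(0)
wins_a = sum(pairwise_preference("t", "x", "y", biased_judge, rng) == "A" for _ in range(1000))
print(f"A preferred in {wins_a} of 1000 comparisons")
```

With the order randomized, a purely position-biased judge ends up preferring each answer about half the time, so its bias shows up as noise rather than as a systematic win for one side.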
Verbosity & confidence bias
Judges over-reward long, assertive answers. Anchor the rubric to correctness and grounding, not length or tone.
Self-preference bias
A judge can favor text from its own model family. Cross-check with humans, or use a different family as judge.
Drift over time
Change the judge prompt or model and your historical scores stop comparing. Pin and version everything.
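One common way to calibrate a judge against human labels is to measure chance-corrected agreement. The sketch below computes raw agreement and Cohen's kappa for binary pass/fail labels. The labels are made up for illustration, and what counts as "good enough" agreement is a call you have to make for your own use case.

```python
def agreement(human: list[bool], judge: list[bool]) -> float:
    """Fraction of cases where the judge matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[bool], judge: list[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = agreement(human, judge)
    n = len(human)
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        return 1.0  # degenerate case: both raters gave one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail labels on the same eight cases.
human_labels = [True, True, True, False, False, False, True, False]
judge_labels = [True, True, False, False, False, True, True, False]
print(agreement(human_labels, judge_labels), cohens_kappa(human_labels, judge_labels))
```

Re-running this check whenever the judge prompt or model changes gives you a concrete signal for whether historical scores are still comparable.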
Building an eval dataset
An eval is only as good as the cases it runs. A strong dataset is curated, labeled, versioned, and constantly fed by the failures you find in production.
1 · Source real tasks
Start from genuine user requests and production traces, not invented examples. Sampled, anonymized real traffic captures the distribution your agent actually faces.
2 · Cover the edges
Deliberately include hard, ambiguous, adversarial, and 'should refuse' cases. Edge cases — not the happy path — are where regressions hide.
3 · Label with ground truth
Attach a reference answer, expected tools, or a pass rubric to each case. For trajectories, record an acceptable path and any forbidden actions.
4 · Version it
Treat the dataset like code: commit it, review changes, and tag releases so a score from last month still means something today.
5 · Grow from failures
Every production miss surfaced by observability becomes a new labeled case. The dataset compounds, and the same bug can never regress unnoticed twice.
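The five steps above suggest a shape for each dataset record. Here is a minimal sketch using a dataclass and JSON Lines so the dataset can be committed and diffed like code. Every field name is an assumption for illustration, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalCase:
    case_id: str
    prompt: str
    reference_answer: str
    expected_tools: list[str] = field(default_factory=list)
    forbidden_tools: list[str] = field(default_factory=list)
    tags: list[str] = field(default_factory=list)  # e.g. "edge", "should_refuse"
    source: str = "production"                      # where the case came from
    dataset_version: str = "v1"                     # bump when labels change


def to_jsonl(cases: list[EvalCase]) -> str:
    """Serialize one case per line with stable key order, so diffs stay readable."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def from_jsonl(text: str) -> list[EvalCase]:
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Hypothetical case promoted from a failed production trace.
case = EvalCase(
    case_id="refund-0042",
    prompt="Refund my last order",
    reference_answer="Refund issued for the most recent order.",
    expected_tools=["lookup_order", "issue_refund"],
    forbidden_tools=["delete_account"],
    tags=["edge"],
)
assert from_jsonl(to_jsonl([case])) == [case]
```

Keeping one case per line means a review of a dataset change shows exactly which cases were added, relabeled, or removed.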
Beware a stale or leaky dataset
Two failure modes quietly ruin datasets. Staleness: it stops reflecting real traffic, so green scores no longer mean a healthy agent. Leakage: cases drift into prompts or fine-tuning data, and the agent learns the answers instead of the skill. Keep a held-out slice the agent never trains on, and refresh the set as your product and users evolve.
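A held-out slice only works if membership is stable over time. One simple approach, sketched here under the assumption that every case has a permanent ID, is to derive the split from a hash of that ID so a case never migrates between the tuning pool and the held-out pool. The function name and the 20% default are illustrative.

```python
import hashlib


def is_held_out(case_id: str, held_out_fraction: float = 0.2) -> bool:
    """Deterministically assign a case to the held-out slice based on its ID."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < held_out_fraction


# Hypothetical IDs: the assignment is the same on every machine and every run.
ids = [f"case-{i}" for i in range(1000)]
held_out = [c for c in ids if is_held_out(c)]
print(f"{len(held_out)} of {len(ids)} cases held out")
```

Because the split depends only on the ID, adding new cases never reshuffles existing ones, which keeps old held-out scores comparable.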
Regression testing in CI and human review
Evaluation only protects quality if it runs automatically on every change — and if a human still reviews the cases automation can't settle.
- Run evals on every PR — Replay the pinned dataset in an isolated harness with mocked or sandboxed tools so results are reproducible.
- Repeat each case — Agents are non-deterministic, so run each case several times and compare score distributions, not single points.
- Gate on thresholds — Fail the build if task success drops below baseline, cost rises past budget, or any safety check regresses.
- Record tool responses — Seed sampling and capture tool outputs so a failing run can be replayed and debugged deterministically.
- Track scores over time — Trend success, cost, and latency across releases to catch slow drift, not just sudden breaks.
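The "record tool responses" item in the list above can be sketched as a thin wrapper that stores tool results during a live run and serves them back during replay. The `ToolRecorder` class and its interface are hypothetical. The sketch keys results by tool name and arguments, so it assumes arguments are JSON-serializable and that identical calls should return identical results.

```python
import json
from typing import Any, Callable, Optional


class ToolRecorder:
    """Records tool results on a live run and serves them back on replay."""

    def __init__(self, tape: Optional[dict] = None):
        self.replaying = tape is not None
        self.tape: dict = tape if tape is not None else {}

    def call(self, name: str, fn: Callable[..., Any], **kwargs: Any) -> Any:
        key = json.dumps([name, kwargs], sort_keys=True)
        if self.replaying:
            # A KeyError here means the agent diverged from the recorded run.
            return self.tape[key]
        result = fn(**kwargs)
        self.tape[key] = result
        return result


# Hypothetical tool used only for this example.
def get_weather(city: str) -> dict:
    return {"city": city, "temp_c": 21}


live = ToolRecorder()
recorded = live.call("get_weather", get_weather, city="Oslo")

replay = ToolRecorder(tape=live.tape)
assert replay.call("get_weather", get_weather, city="Oslo") == recorded
```

A production harness would likely also record call order and handle tools whose results legitimately change between identical calls. This version only shows the core record-then-replay idea.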
The goal is simple: a change that makes the agent worse should never merge silently. Wire the offline dataset into CI so every pull request produces a scorecard, and set thresholds that block the merge when key metrics regress. Because runs vary, compare distributions over several repeats rather than a single lucky pass.
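A threshold gate of the kind described here might look like the sketch below. The metric names, the two-point tolerance, and the cost budget are all illustrative assumptions. The tolerance exists because runs vary, so a tiny dip in success rate should not block a merge on its own.

```python
def gate(current: dict[str, float], baseline: dict[str, float],
         max_success_drop: float = 0.02, cost_budget_usd: float = 0.50) -> list[str]:
    """Return the reasons to block the merge; an empty list means the gate passes."""
    failures = []
    floor = baseline["task_success_rate"] - max_success_drop
    if current["task_success_rate"] < floor:
        failures.append(
            f"task success {current['task_success_rate']:.2f} is below floor {floor:.2f}"
        )
    if current["cost_per_task_usd"] > cost_budget_usd:
        failures.append(
            f"cost per task ${current['cost_per_task_usd']:.2f} exceeds budget ${cost_budget_usd:.2f}"
        )
    # Safety checks get no tolerance: any regression blocks the merge.
    if current["safety_pass_rate"] < baseline["safety_pass_rate"]:
        failures.append("safety pass rate regressed")
    return failures


# Hypothetical scorecards, each averaged over several repeats per case.
baseline = {"task_success_rate": 0.90, "safety_pass_rate": 1.00}
candidate = {"task_success_rate": 0.84, "cost_per_task_usd": 0.41, "safety_pass_rate": 1.00}
for reason in gate(candidate, baseline):
    print("BLOCK:", reason)
```

In a CI job, the script would exit with a non-zero status whenever the returned list is non-empty, which is what actually fails the build.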
Automation can't grade everything. Human-in-the-loop review stays essential for ambiguous cases, novel failure modes, safety-sensitive outputs, and calibrating your judge model. The durable pattern is a flywheel: automated evals catch the known regressions cheaply on every change, humans review a sampled and flagged slice, and their labels both feed the dataset and re-calibrate the LLM judge. Together with production observability, this turns one-off testing into a continuous quality system that holds up as your agentic workflows grow more complex.
Agent evaluation, answered
AI agent evaluation is the practice of measuring whether an autonomous agent does the right thing across a multi-step task — not just whether its final sentence sounds plausible. It scores both the trajectory (which tools the agent called, with what arguments, in what order) and the outcome (did the task succeed, was the answer grounded, how much did it cost and how long did it take). Unlike grading a single LLM completion, agent evaluation has to account for branching paths, tool side effects, and the fact that two different sequences of actions can both be correct.