Evals · Testing autonomous agents

AI Agent Evaluation: how to test and eval agents

A demo that works once proves nothing. Real agents take many steps, call tools with side effects, and answer differently every run — so you need evaluation that scores the whole trajectory, not just the last sentence.

  • 13 min read
  • Intermediate
  • Updated 2026

AI agent evaluation is how you turn "it worked when I tried it" into a number you can trust — a repeatable measure of whether an agent completes tasks correctly, safely, and affordably across the messy range of inputs it will actually meet.

Evaluating a single language-model completion is already subtle. An agent raises the difficulty sharply: it reasons in a loop, decides which tools to call, threads state across many steps, and produces a different sequence of actions every run. Two transcripts can reach the same correct answer by completely different routes — and a fluent, confident final paragraph can sit on top of a trajectory that called the wrong API, ignored retrieved evidence, or burned ten redundant tool calls getting there.

That is why agent evals measure two things at once: the trajectory (what the agent did) and the outcome (what it produced). This guide covers why agent evaluation is genuinely hard, the split between offline and online evals, the core metrics — task success rate, tool-call accuracy, faithfulness, latency, and cost — how to use LLM-as-a-judge without fooling yourself, how to build a real eval dataset, and how to wire regression gates into CI so quality can't quietly slide.

The core difficulty

Why evaluating agents is hard

The same properties that make agents powerful — autonomy, tools, and multi-step reasoning — are exactly what make them hard to grade.

A classic test has one input and one expected output. An agent has a trajectory: a branching sequence of decisions, tool calls, and observations that can unfold many valid ways. There is rarely a single golden path, so you cannot just diff against an expected transcript.

Agents are also non-deterministic. Sampling temperature, tool latency, and changing external data mean the same prompt yields different runs. A single pass that succeeds tells you little; you need repeated runs and a success rate, not a pass/fail bit.

Tools add side effects and compounding error: a small mistake in step two cascades through every later step, and a tool that writes data can't simply be replayed. Worse, the things you most care about — was the answer actually grounded? was the plan reasonable? — are fuzzy and need either a rubric or a model to judge them. Treating an agent like a deterministic function is the first mistake; good evaluation embraces the uncertainty instead.

Many valid paths

There's no single golden trajectory — two correct runs can take entirely different routes, so exact-match grading breaks down.

Non-determinism

Sampling and live data make every run different. You measure a success rate over repeats, not one pass/fail outcome.

Compounding error

An early wrong step poisons every later one. Long horizons mean small per-step errors multiply into task failure.
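A back-of-the-envelope model puts rough numbers on this. Assume, as a simplification, that each of n steps succeeds independently with probability p. Real steps are correlated and agents sometimes recover, so read this as an illustration rather than a law:

```latex
P(\text{task succeeds}) = p^{n}, \qquad 0.95^{10} \approx 0.60, \qquad 0.95^{20} \approx 0.36
```

Under that assumption, a step that is right 95% of the time yields a ten-step task that finishes correctly only about 60% of the time.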

Fuzzy success criteria

Grounding, helpfulness, and plan quality resist exact matching — they need rubrics, references, or a judge model.

Two evaluation regimes

Offline vs online evals

Offline evals answer "is this change safe to ship?" Online evals answer "did it actually help?" You need both, and they feed each other.

| Dimension | Offline evals | Online evals |
| --- | --- | --- |
| Runs against | A fixed, curated dataset | Live production traffic |
| Ground truth | Known, labeled upfront | Often delayed or absent |
| Reproducible | Yes | No |
| Best for | CI gates, A/B of prompts & models | Real-world impact, drift detection |
| Cost to run | Cheap, repeatable | Tied to real usage |
| Catches | Regressions before release | Surprises after release |
| Feedback signal | Pass/fail vs thresholds | User behavior, ratings, outcomes |

Offline evaluation is your lab bench. You assemble a dataset of recorded tasks with known answers, replay the agent against it in an isolated harness, and score every run the same way each time. Because the dataset, harness, and scoring stay fixed, a difference in scores points at the thing you changed rather than at the test setup. That makes it well suited to comparing two prompts, two models, or two versions of a tool, and to the CI gates covered below.

Online evaluation is the real world. In production the agent meets inputs no dataset anticipated, and ground truth is murky: you rarely know the "right" answer, so you lean on proxy signals — user thumbs, task completion, escalation rates, downstream conversions — plus sampled human review. The two regimes form a loop: surprising or failed production traces, surfaced through agent observability, become tomorrow's offline test cases.

A map of the space

The layers of agent evaluation

Evaluation isn't one test. It's a stack — from individual tool calls up to end-to-end business outcomes — and a healthy agent is measured at every layer.

Component evals

  • Single tool-call correctness
  • Argument & schema validity
  • Retrieval relevance (RAG step)

Trajectory evals

  • Right tools, right order
  • No redundant or looping steps
  • Recovery after a failed action

Task / outcome evals

  • Task success rate
  • Faithfulness & grounding
  • Answer relevance to the goal

Experience & impact evals

  • Latency and cost per task
  • Human review & user ratings
  • Downstream business outcomes

The agent evaluation stack. Lower layers test the building blocks (single steps and tool calls); upper layers test whether the whole task and its real-world impact actually land.

Evaluate bottom-up to localize failures

When the top layer fails, the lower layers tell you why. A dropping task-success rate is just an alarm; component and trajectory evals are the diagnostics that point at the broken tool call or the wasteful loop. Without them you're left re-running the whole agent and guessing. This is the same staged-measurement discipline used in agentic workflows.

What to actually measure

The metrics that matter

Five families of metric cover almost everything you need: did it succeed, did it use tools correctly, was it grounded, was it fast, and was it cheap.

Task success rate

The headline number: across repeated runs, what fraction of tasks reached a correct, complete outcome? Defined per task type with a clear pass condition.
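A minimal sketch of that calculation, assuming your harness emits one (case_id, passed) pair per run. The case names and results below are made up for illustration:

```python
from collections import defaultdict

def success_rates(results):
    """results: iterable of (case_id, passed) pairs, one per run.

    Returns (per_case, overall): the pass fraction for each case and the
    mean of those fractions, so cases with more repeats do not dominate.
    """
    runs = defaultdict(list)
    for case_id, passed in results:
        runs[case_id].append(bool(passed))
    per_case = {cid: sum(r) / len(r) for cid, r in runs.items()}
    overall = sum(per_case.values()) / len(per_case) if per_case else 0.0
    return per_case, overall

# Hypothetical results: two cases, four runs each.
results = [
    ("refund-01", True), ("refund-01", True), ("refund-01", False), ("refund-01", True),
    ("lookup-07", True), ("lookup-07", True), ("lookup-07", True), ("lookup-07", True),
]
per_case, overall = success_rates(results)
print(per_case, overall)
```

Keeping the per-case breakdown next to the headline number makes a flaky case visible instead of letting it hide inside the average.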

Tool-call accuracy

Did the agent pick the right tool, with valid arguments, at the right moment? Catches the silent failures behind a plausible-looking answer.
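One simple way to score this, sketched under the assumption that each case lists the tool calls you expect and that an exact argument match is the right bar. Many teams will want looser matching for free-text arguments. Tool names and arguments here are hypothetical:

```python
def tool_call_accuracy(expected, actual):
    """Fraction of expected tool calls that the agent actually made.

    expected / actual: lists of (tool_name, args_dict) tuples. A call counts
    as correct only if both the tool name and the arguments match exactly.
    Order is ignored; each actual call can satisfy at most one expected call.
    """
    if not expected:
        return 1.0
    remaining = list(actual)
    hits = 0
    for call in expected:
        if call in remaining:
            remaining.remove(call)
            hits += 1
    return hits / len(expected)

expected = [("search_orders", {"email": "a@example.com"}),
            ("issue_refund", {"order_id": "A1", "amount": 20})]
actual = [("search_orders", {"email": "a@example.com"}),
          ("issue_refund", {"order_id": "A1", "amount": 200})]  # wrong amount
score = tool_call_accuracy(expected, actual)
print(score)
```

The refund call used the right tool with the wrong amount, which is exactly the kind of silent failure a final-answer check can miss.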

Faithfulness / grounding

Is every claim supported by retrieved context or tool output, with nothing invented? The anti-hallucination metric for tool- and RAG-using agents.

Latency

End-to-end time per task, including every tool round-trip. Track the tail (p95) — agents that loop produce ugly long-tail latency.
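To see why the tail matters, here is a small sketch using the nearest-rank percentile definition (one of several common definitions). The latencies are invented, with one run that looped:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of
    the data at or below it. pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no latencies recorded")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical end-to-end latencies in seconds; one run looped and took 41 s.
latencies = [3.1, 2.8, 3.4, 2.9, 3.0, 3.3, 2.7, 3.2, 3.1, 41.0]
mean = sum(latencies) / len(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
print(mean, p50, p95)
```

The median looks healthy and the mean is only mildly inflated, while p95 surfaces the looping run directly.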

Cost

Tokens and tool spend per completed task. The metric that decides whether an agent is economically viable at scale.
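A sketch of the arithmetic, assuming each run record carries its token cost, tool cost, and a success flag (the field names and dollar figures are illustrative):

```python
def cost_per_completed_task(runs):
    """runs: list of dicts with 'token_cost', 'tool_cost' (USD) and 'success'.

    Failed runs still cost money, so total spend is divided by the number
    of successful runs, not by the number of attempts.
    """
    total = sum(r["token_cost"] + r["tool_cost"] for r in runs)
    completed = sum(1 for r in runs if r["success"])
    if completed == 0:
        return float("inf")
    return total / completed

# Hypothetical figures for illustration only.
runs = [
    {"token_cost": 0.25, "tool_cost": 0.25, "success": True},
    {"token_cost": 0.50, "tool_cost": 0.00, "success": False},
    {"token_cost": 0.25, "tool_cost": 0.25, "success": True},
    {"token_cost": 0.50, "tool_cost": 0.00, "success": True},
]
cost = cost_per_completed_task(runs)
print(round(cost, 4))
```

Every attempt here costs $0.50, yet each completed task costs about $0.67 because the failed run is paid for by the successes.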

Representative eval scores (illustrative)

  • Task success rate: 88%
  • Tool-call accuracy: 93%
  • Faithfulness / grounding: 79%
  • Citation accuracy: 84%
  • Refusal calibration: 71%

Example only, not measured benchmark results. Use this shape to picture how a scorecard reads: a strong success rate that still hides a faithfulness gap worth fixing.

  • N× runs per case: average over repeats, not one pass
  • p95 latency to watch: the tail, not just the mean
  • $ cost per task: tokens + tool spend, completed work

No single number is enough. A 90% success rate can still hide a faithfulness problem that erodes trust, and a beautifully grounded agent that costs ten dollars per task may never ship. Read the metrics as a scorecard, and weight them to the job: a customer-support agent leans on faithfulness and refusal calibration, while a coding agent leans on task success and tool-call accuracy. Cost and latency tracking pairs directly with how you deploy and scale the agent in production.

Two complementary lenses

Trajectory vs final-answer evaluation

Grade only the answer and you'll ship lucky guesses. Grade only the path and you'll miss a wrong conclusion. You need both, scored separately.

Final-answer evaluation asks: is the end result correct, complete, grounded, and relevant to the goal? It's what users feel, and it's the easiest to label. But it's blind to how the agent got there — a right answer reached by a wrong route is a failure waiting to recur the moment inputs shift.

Trajectory evaluation asks: did the agent take a sound path? Did it choose the right tools, pass correct arguments, avoid redundant or looping calls, and recover gracefully when a step failed? You can score it against a reference trajectory, by rules (no forbidden tool, no more than k calls), or with a judge that rates the plan's soundness.

The reason to separate them is diagnostic clarity: a good answer hiding a bad trajectory is fragile, and a bad answer hiding a good trajectory usually means the final prompt — not the agent's reasoning — is the thing to fix.
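The rule-based option is the easiest to start with. The sketch below assumes a trajectory is an ordered list of (tool_name, args) pairs. The rules, tool names, and default limit are illustrative choices, not a standard:

```python
def check_trajectory(calls, forbidden=(), max_calls=10):
    """calls: ordered list of (tool_name, args_dict). Returns a list of
    human-readable rule violations; an empty list means the path passed."""
    violations = []
    for name, _ in calls:
        if name in forbidden:
            violations.append(f"forbidden tool: {name}")
    if len(calls) > max_calls:
        violations.append(f"too many calls: {len(calls)} > {max_calls}")
    # Flag an identical call issued twice in a row, a common sign of looping.
    for prev, cur in zip(calls, calls[1:]):
        if prev == cur:
            violations.append(f"repeated call: {cur[0]}")
    return violations

calls = [("search", {"q": "order 42"}),
         ("search", {"q": "order 42"}),
         ("delete_account", {"id": 7})]
violations = check_trajectory(calls, forbidden={"delete_account"}, max_calls=5)
print(violations)
```

Returning reasons instead of a single boolean keeps the diagnostic value: the report says which rule broke, not just that something did.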

Final-answer eval — strengths & limits

  • Matches what the user actually experiences
  • Easy to label with reference answers
  • Cheap to compute at scale
  • But: rewards lucky guesses and hides fragility

Trajectory eval — strengths & limits

  • Exposes wrong tools, bad args, wasteful loops
  • Catches fragility before it reaches users
  • Pinpoints which step broke
  • But: harder to label; many paths can be valid
trajectory_eval.py (python)

```python
from collections import namedtuple

# valid_args, judge_success, and judge_faithful are assumed to come from your
# own harness: a schema validator and two LLM-as-judge wrappers.
Score = namedtuple("Score", "tool_ok no_loops args_ok success grounded")

def eval_run(trace, case):  # trace = recorded agent steps
    tools = [s.tool for s in trace.steps]

    # 1 - trajectory: right tools, no waste
    tool_ok = set(case.expected_tools).issubset(tools)
    no_loops = len(tools) <= case.max_steps  # step budget as a loop proxy
    args_ok = all(valid_args(s) for s in trace.steps)  # schema check

    # 2 - outcome: success + grounding, scored apart
    success = judge_success(trace.answer, case.reference)
    grounded = judge_faithful(trace.answer, trace.context)  # LLM-as-judge

    return Score(tool_ok, no_loops, args_ok, success, grounded)
```

A compact trajectory check: confirm the agent took a sound path, then score the final answer separately so each failure mode is visible.
Scaling judgment

LLM-as-a-judge, and its caveats

A judge model can grade thousands of cases against a rubric for cents — but only if you treat the judge itself as a system that needs calibration and version control.

Use it deliberately

A grader you must calibrate

LLM-as-a-judge prompts a separate model to score an output against a rubric: rate faithfulness 1-5, decide if the claim is supported by context, or pick which of two answers is better. It scales the kind of nuanced judgment that exact-match metrics can't capture, and it's the only practical way to grade open-ended agent answers at volume.

But a judge is a model with biases. It tends to reward longer and more confident answers, can favor outputs from its own model family, is sensitive to option order in pairwise comparisons, and silently drifts when you change its prompt or its underlying model. An uncalibrated judge gives you precise-looking numbers that quietly lie.

  • Calibrate against a human-labeled set before trusting scores.
  • Pin the judge model and prompt; version them like code.
  • Prefer reference-based or pairwise grading over lone 1-10 scores.
  • Randomize answer order to cancel position bias.
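For the last point, a deterministic variant of randomizing is to ask the judge twice with the answers swapped and keep only verdicts that agree. The sketch below assumes a judge callable that returns "first" or "second". The two stand-in judges are toys that make the behavior visible without calling a model:

```python
def pairwise_verdict(judge, question, answer_a, answer_b):
    """Ask the judge twice with the answers swapped; accept only a
    consistent winner.

    judge(question, first, second) must return "first" or "second".
    Returns "a", "b", or "tie" when the verdict flips with the order.
    """
    v1 = judge(question, answer_a, answer_b)
    v2 = judge(question, answer_b, answer_a)
    winner1 = "a" if v1 == "first" else "b"
    winner2 = "b" if v2 == "first" else "a"
    return winner1 if winner1 == winner2 else "tie"

# Stand-in judges for illustration; a real one would call a pinned model.
def position_biased_judge(question, first, second):
    return "first"  # always prefers whatever is shown first

def length_judge(question, first, second):
    return "first" if len(first) >= len(second) else "second"

print(pairwise_verdict(position_biased_judge, "q", "x", "y"))
print(pairwise_verdict(length_judge, "q", "long answer", "short"))
```

The swap doubles judge cost, but a high share of ties is itself a useful signal that the judge is reacting to position rather than content.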

Verbosity & confidence bias

Judges over-reward long, assertive answers. Anchor the rubric to correctness and grounding, not length or tone.

Self-preference bias

A judge can favor text from its own model family. Cross-check with humans, or use a different family as judge.

Drift over time

Change the judge prompt or model and your historical scores stop comparing. Pin and version everything.
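One way to notice drift is to re-measure agreement with a fixed human-labeled set whenever the judge prompt or model changes. The sketch below uses Cohen's kappa as the agreement statistic. That is a choice made here for illustration, and the labels are invented:

```python
from collections import Counter

def cohens_kappa(human, judge):
    """Agreement between two label lists, corrected for chance agreement."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[k] * j_counts[k] for k in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical labels on the same eight outputs.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

Raw agreement here is 75%, but kappa is about 0.47 once chance agreement is removed. Storing that number next to the pinned judge version gives you something concrete to compare after each judge change.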

The foundation

Building an eval dataset

An eval is only as good as the cases it runs. A strong dataset is curated, labeled, versioned, and constantly fed by the failures you find in production.

  1. Source real tasks

    Start from genuine user requests and production traces, not invented examples. Sampled, anonymized real traffic captures the distribution your agent actually faces.

  2. Cover the edges

    Deliberately include hard, ambiguous, adversarial, and 'should refuse' cases. Edge cases, not the happy path, are where regressions hide.

  3. Label with ground truth

    Attach a reference answer, expected tools, or a pass rubric to each case. For trajectories, record an acceptable path and any forbidden actions.

  4. Version it

    Treat the dataset like code: commit it, review changes, and tag releases so a score from last month still means something today.

  5. Grow from failures

    Every production miss surfaced by observability becomes a new labeled case. The dataset compounds, and the same bug can never regress unnoticed twice.
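The steps above can be sketched as a small data model. Everything here is hypothetical scaffolding: the EvalCase fields, the shape of the trace dict, and the content hash used as a version stamp. Adapt it to whatever your tracing and storage look like:

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict

@dataclass
class EvalCase:
    case_id: str
    task: str                      # the user request, anonymized
    reference: str                 # ground-truth answer or pass rubric
    expected_tools: list = field(default_factory=list)
    forbidden_tools: list = field(default_factory=list)
    source: str = "curated"        # or "production-failure"

def case_from_failure(trace, reference):
    """Turn a reviewed production miss into a labeled case.

    trace is assumed to be a dict with 'trace_id' and 'user_input';
    a human supplies the reference after review.
    """
    return EvalCase(
        case_id=f"prod-{trace['trace_id']}",
        task=trace["user_input"],
        reference=reference,
        source="production-failure",
    )

def dataset_version(cases):
    """Content hash of the dataset, so a score can be tied to an exact version."""
    ordered = sorted(cases, key=lambda c: c.case_id)
    payload = json.dumps([asdict(c) for c in ordered], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]

trace = {"trace_id": "8f2c", "user_input": "Cancel my order but keep the gift card"}
case = case_from_failure(trace, reference="Order cancelled; gift card balance unchanged.")
v1 = dataset_version([case])
print(case.case_id, v1)
```

Recording the version hash alongside every scorecard is a cheap complement to tagging releases in version control: two scores are comparable only if their hashes match.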

Beware a stale or leaky dataset

Two failure modes quietly ruin datasets. Staleness: it stops reflecting real traffic, so green scores no longer mean a healthy agent. Leakage: cases drift into prompts or fine-tuning data, and the agent learns the answers instead of the skill. Keep a held-out slice the agent never trains on, and refresh the set as your product and users evolve.
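A held-out slice is easier to protect when membership is deterministic. One possible approach, sketched here with an arbitrary salt and a 20% fraction chosen purely for illustration, is to hash each case id:

```python
import hashlib

def is_held_out(case_id, fraction=0.2, salt="heldout-v1"):
    """Deterministically assign a case to the held-out slice by hashing its id.

    The same id always lands in the same slice, so a case cannot drift
    between 'tunable' and 'held-out' as the dataset grows.
    """
    digest = hashlib.sha256(f"{salt}:{case_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < fraction

case_ids = [f"case-{i}" for i in range(1000)]
held_out = [c for c in case_ids if is_held_out(c)]
print(len(held_out))
```

Because assignment depends only on the id and the salt, any script that builds prompts or fine-tuning data can call the same function and skip held-out cases without consulting a shared list.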

Make it continuous

Regression testing in CI and human review

Evaluation only protects quality if it runs automatically on every change — and if a human still reviews the cases automation can't settle.

  • Run evals on every PR: Replay the pinned dataset in an isolated harness with mocked or sandboxed tools so results are reproducible.
  • Repeat each case: Agents are non-deterministic, so run each case several times and compare score distributions, not single points.
  • Gate on thresholds: Fail the build if task success drops below baseline, cost rises past budget, or any safety check regresses.
  • Record tool responses: Seed sampling and capture tool outputs so a failing run can be replayed and debugged deterministically.
  • Track scores over time: Trend success, cost, and latency across releases to catch slow drift, not just sudden breaks.
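Recording and replaying tool responses can be as simple as a wrapper around each tool function. This is a minimal sketch. The ReplayTool class, its "cassette" dict, and the get_weather tool are all invented for illustration, and it assumes keyword arguments that serialize to JSON:

```python
import json

class ReplayTool:
    """Wraps a tool function: records real outputs once, then replays them.

    In 'record' mode the real tool is called and its output cached by
    arguments. In 'replay' mode the tool is never called, so side effects
    cannot fire and a failing run can be re-examined deterministically.
    """
    def __init__(self, fn, mode="record", cassette=None):
        self.fn = fn
        self.mode = mode
        self.cassette = {} if cassette is None else cassette

    def __call__(self, **kwargs):
        key = json.dumps(kwargs, sort_keys=True)
        if self.mode == "replay":
            if key not in self.cassette:
                raise KeyError(f"no recorded response for {key}")
            return self.cassette[key]
        result = self.fn(**kwargs)
        self.cassette[key] = result
        return result

calls = []
def get_weather(city):  # hypothetical tool with an observable side effect
    calls.append(city)
    return {"city": city, "temp_c": 18}

recorder = ReplayTool(get_weather, mode="record")
live = recorder(city="Oslo")
replayer = ReplayTool(get_weather, mode="replay", cassette=recorder.cassette)
replayed = replayer(city="Oslo")
print(live == replayed, calls)
```

Raising on an unrecorded call is deliberate: if a code change makes the agent issue a new tool call, the replay fails loudly instead of silently hitting the live system.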

The goal is simple: a change that makes the agent worse should never merge silently. Wire the offline dataset into CI so every pull request produces a scorecard, and set thresholds that block the merge when key metrics regress. Because runs vary, compare distributions over several repeats rather than a single lucky pass.
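A gate of this kind might look like the sketch below. The scorecard keys, the two-point tolerance, and the budget are placeholders to tune for your own agent, and the scorecard values are assumed to be averaged over several repeats already:

```python
def regression_gate(current, baseline, max_success_drop=0.02, cost_budget=None):
    """Compare a scorecard to the baseline; return reasons to block the merge.

    current / baseline: dicts with 'task_success' (0-1), 'cost_per_task' (USD),
    and 'safety_pass' (0-1). Thresholds here are placeholders to tune.
    """
    failures = []
    if current["task_success"] < baseline["task_success"] - max_success_drop:
        failures.append("task success regressed")
    if cost_budget is not None and current["cost_per_task"] > cost_budget:
        failures.append("cost over budget")
    if current["safety_pass"] < baseline["safety_pass"]:
        failures.append("safety check regressed")
    return failures

baseline = {"task_success": 0.88, "cost_per_task": 0.40, "safety_pass": 1.0}
current = {"task_success": 0.83, "cost_per_task": 0.55, "safety_pass": 1.0}
failures = regression_gate(current, baseline, cost_budget=0.50)
print(failures)
# In CI, a non-empty list would typically become a non-zero exit code,
# for example sys.exit(1 if failures else 0).
```

The small tolerance on task success is there because run-to-run noise would otherwise fail builds at random, while the safety check gets no tolerance at all.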

Automation can't grade everything. Human-in-the-loop review stays essential for ambiguous cases, novel failure modes, safety-sensitive outputs, and calibrating your judge model. The durable pattern is a flywheel: automated evals catch the known regressions cheaply on every change, humans review a sampled and flagged slice, and their labels both feed the dataset and re-calibrate the LLM judge. Together with production observability, this turns one-off testing into a continuous quality system that holds up as your agentic workflows grow more complex.

FAQ

Agent evaluation, answered

What is AI agent evaluation?

AI agent evaluation is the practice of measuring whether an autonomous agent does the right thing across a multi-step task, not just whether its final sentence sounds plausible. It scores both the trajectory (which tools the agent called, with what arguments, in what order) and the outcome (did the task succeed, was the answer grounded, how much did it cost and how long did it take). Unlike grading a single LLM completion, agent evaluation has to account for branching paths, tool side effects, and the fact that two different sequences of actions can both be correct.
