What is the difference between a trace and a span?

A trace is the complete record of one agent run from the user's request to the final answer. A span is a single timed unit of work inside that trace — one LLM call, one tool call, one retrieval, one sub-agent step. Spans nest: a parent span (the overall agent turn) contains child spans for each model call and tool call, which may contain their own children. Together the spans form a tree that shows the order, duration, and parent-child relationship of every operation, so you can see not just that something was slow or wrong but where in the run it happened.

Which signals should I monitor for an AI agent?

Track token usage (prompt and completion separately), latency per step and end to end, cost per run, tool error and retry rates, step or loop counts, and outcome signals like task success and refusal rate. Token usage and cost reveal runaway prompts and expensive loops; latency breakdowns show whether the model or a tool is the bottleneck; tool errors and retries expose flaky integrations; and step counts catch agents stuck looping. Pair these operational signals with quality signals from evaluation so you can tell a fast, cheap run from a fast, cheap, wrong one.

Can I use OpenTelemetry for LLM and agent applications?

Yes. OpenTelemetry's trace-and-span model maps cleanly onto agent runs, and the community has standardized GenAI semantic conventions — attribute names for the model, token counts, prompt and completion content, tool names, and more — so traces are portable across backends. You wrap each LLM call and tool call in a span, attach those attributes, and export the spans to any OTel-compatible backend. Most LLM observability platforms either build on OpenTelemetry or accept OTLP, so instrumenting once lets you switch dashboards without re-instrumenting your code.

How does observability connect to agent evaluation?

Observability is the data layer that evaluation runs on top of. The traces you capture in production become the dataset you replay, score, and turn into regression tests. When you spot a bad run in a dashboard, you save its trace as a test case; when you change a prompt or model, you replay those traces and compare. Faithfulness checks, LLM-as-judge scores, and outcome labels all attach to spans, so a low score points straight at the step that caused it. In short, observability tells you what happened and evaluation tells you whether it was good — see /learn/ai-agent-evaluation for the scoring side.

Observability · Tracing & Monitoring

AI Agent Observability: tracing & monitoring

An agent's logic lives in its model outputs, not your source code — so when a run goes wrong, you can't read the bug, you have to reconstruct it. Observability records every reasoning step, LLM call, and tool call so you can see, measure, and debug what your agent actually did.

13 min read
Intermediate
Updated 2026


Observability is how you turn an opaque, probabilistic agent into a system you can actually inspect, measure, and trust in production.

A conventional service fails in legible ways: a stack trace, an error code, a line number. An AI agent does not. Its behavior emerges from a model deciding, turn by turn, which tool to call and what to say next. When the output is wrong, there is no line of code to blame: the "bug" is a chain of decisions across several LLM calls and tool invocations, any of which could have introduced the error. Agent observability is the discipline of recording that chain in enough detail that you can replay it, measure it, and find the step that broke.

The core artifact is the trace: a structured, time-ordered record of a single run, broken into nested spans for each model call, tool call, and retrieval. Layered on top are the signals you watch — token usage, latency, cost, tool errors, step counts — and the dashboards and alerts that surface trouble before users do. Done well, this same data feeds directly into evaluation and debugging.

This guide covers why agents demand observability, how traces and spans model a run, the signals worth tracking, structured logging and run replay, OpenTelemetry-style instrumentation for LLM apps, alerting and dashboards, and how all of it loops back into making your agent better.

The problem

Why agents are uniquely hard to observe

Agents combine non-determinism, multi-step reasoning, external tools, and natural-language failure — a combination that defeats traditional monitoring.

Classic monitoring assumes deterministic code: the same input yields the same path, errors throw, and a status code tells you what happened. None of those hold for an agent. The same prompt can take a different route on every run, "failure" often looks like a perfectly well-formed answer that is simply wrong, and the real logic lives inside model outputs you never wrote.

Worse, the work is multi-step and distributed. A single user request might fan out into a planning call, three tool calls, a retrieval, and a synthesis call — sometimes across several sub-agents. An error in step two only surfaces as nonsense in step five. Without a recorded trace, you are debugging by re-running and hoping the failure reproduces, which, for a non-deterministic system, it often won't.

Observability answers the questions monitoring can't: Which tool call failed? What exact prompt did the model see? Why did it loop seven times? Where did the cost spike? You can only answer those if you captured the run when it happened.

Non-determinism

Same input, different path. You can't reason about behavior from code alone — you have to observe actual runs.

Multi-step reasoning

Errors compound across planning, tool, and synthesis steps. The visible symptom is rarely where the fault began.

External tools

APIs, retrieval, and code execution introduce latency, flakiness, and failures the model must then react to.

Silent wrongness

A failed run can return HTTP 200 and a fluent, confident, incorrect answer. Uptime tells you nothing.

The core model

Traces and spans: anatomy of a run

Borrowed from distributed tracing and adapted for agents, the trace-and-span model is the foundation everything else builds on.

A trace represents one end-to-end run — a user's request through to the agent's final response. Inside it, every discrete unit of work becomes a span with a name, a start and end time, and a set of attributes. Spans nest into a tree: the top span is the agent turn; its children are the LLM calls and tool calls; a tool call that triggers retrieval has its own child spans. The shape of that tree is the shape of the agent's reasoning.

00:00.000 · root span
Trace start · user request received
Root span opens. Captures the user input, session and run IDs, and the agent + model configuration in scope.

00:00.180 · llm span
LLM call · plan the task
First model span. Records the full prompt, the response, prompt/completion tokens, latency, and which tools the model chose to call.

00:00.940 · tool span
Tool call · search_docs(query)
Tool span with arguments, return value, and status. A nested retrieval span shows the top-k passages and their scores.

00:01.610 · error span
Tool call · get_order(id) → error + retry
Tool span flagged as an error with the exception and stack. A second sibling span shows the automatic retry succeeding.

00:02.300 · llm span
LLM call · synthesize answer
Final model span combines retrieved context and tool results into the answer, recording tokens, cost, and any citations.

00:03.050 · complete
Trace end · response returned
Root span closes. Run-level totals roll up: 5 steps, total tokens, total cost, total latency, and outcome status.
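The nesting in the timeline above can be sketched with a small in-memory tracer. This is a minimal illustration rather than a real tracing SDK: the `Trace` and `Span` classes, the span names, and the simulated `TimeoutError` are all assumptions made for the example.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    kind: str                      # "agent", "llm", "tool", or "retrieval"
    start: float = 0.0
    end: float = 0.0
    status: str = "ok"
    attributes: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


class Trace:
    """Collects nested spans for one run (hypothetical helper, not a real SDK)."""

    def __init__(self):
        self.root = None
        self._stack = []

    @contextmanager
    def span(self, name, kind, **attributes):
        s = Span(name, kind, start=time.perf_counter(), attributes=attributes)
        if self._stack:
            self._stack[-1].children.append(s)   # nest under the currently open span
        else:
            self.root = s
        self._stack.append(s)
        try:
            yield s
        except Exception as exc:
            s.status = "error"                   # flag the span, then let the error propagate
            s.attributes["exception"] = repr(exc)
            raise
        finally:
            s.end = time.perf_counter()
            self._stack.pop()


def outline(span, depth=0):
    """Return the span tree as indented 'kind:name [status]' lines."""
    lines = ["  " * depth + f"{span.kind}:{span.name} [{span.status}]"]
    for child in span.children:
        lines.extend(outline(child, depth + 1))
    return lines


trace = Trace()
with trace.span("agent_turn", "agent", user_input="Where is my order?"):
    with trace.span("plan", "llm"):
        pass
    with trace.span("search_docs", "tool"):
        with trace.span("vector_lookup", "retrieval"):
            pass
    try:
        with trace.span("get_order", "tool"):
            raise TimeoutError("upstream timed out")     # simulated tool failure
    except TimeoutError:
        with trace.span("get_order", "tool", retry=1):   # sibling retry span
            pass
    with trace.span("synthesize", "llm"):
        pass

print("\n".join(outline(trace.root)))
```

The failed tool call and its retry appear as two sibling spans under the agent turn, which is what lets a reader see both the error and the recovery in one tree.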

Capture inputs and outputs, not just timings

A trace that only records durations tells you an agent was slow but never why it was wrong. The most valuable spans store the full prompt and completion, the tool arguments and results, and the retrieved context — the actual content the model saw and produced. That content is what lets you replay and debug the run later, and what feeds evaluation.

What to measure

The signals that matter

Operational signals tell you whether the agent is fast, cheap, and reliable. Pair them with quality signals to know whether it's also right.

Token usage

Prompt and completion tokens, per step and per run. Spikes reveal bloated context, runaway tool output, or an agent re-reading the same data each loop.

Latency

End-to-end and per-span. Breaking latency down by step shows whether the model, a tool, or retrieval is the real bottleneck.

Cost

Derived from token counts and model price, then tracked per run, per user, and per feature. The fastest way to catch an expensive loop before the invoice does.

Tool errors & retries

Error rate and retry count per tool. Flaky integrations force the agent into recovery behavior that inflates steps, latency, and cost.

Step / loop counts

Steps per run and depth of the reasoning loop. A creeping average often means agents getting stuck or thrashing toward the goal.

Outcome quality

Task success, refusal rate, and faithfulness from evaluation. Operational health is necessary but not sufficient — a fast, cheap, wrong agent is still broken.

A representative agent run, by step

Plan task (LLM): 1200 tok
search_docs (retrieval): 350 tok
get_order (tool): 180 tok
Reflect on results (LLM): 900 tok
Synthesize answer (LLM): 4100 tok

Illustrative per-step token spend for one customer-support run. Values are representative, not measured — synthesis dominates because it carries all retrieved context and tool results.
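As a worked version of the arithmetic behind that chart, the sketch below totals the illustrative per-step figures and computes each step's share. The single blended price per 1,000 tokens is a made-up number for the example; real pricing usually differs for prompt and completion tokens and by model.

```python
# Illustrative per-step token counts taken from the chart above.
steps = {
    "Plan task (LLM)": 1200,
    "search_docs (retrieval)": 350,
    "get_order (tool)": 180,
    "Reflect on results (LLM)": 900,
    "Synthesize answer (LLM)": 4100,
}

PRICE_PER_1K_TOKENS = 0.002  # hypothetical blended USD price, not a real rate

total_tokens = sum(steps.values())
shares = {name: tokens / total_tokens for name, tokens in steps.items()}
run_cost = total_tokens / 1000 * PRICE_PER_1K_TOKENS

for name, tokens in steps.items():
    print(f"{name:<28}{tokens:>6} tok  {shares[name]:6.1%}")
print(f"{'Total':<28}{total_tokens:>6} tok  ${run_cost:.5f}")
```

With these numbers the run totals 6,730 tokens and synthesis accounts for roughly 61% of them, so trimming the context passed to the final call is the first place to look.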

Read signals as a system, not in isolation. A latency spike with flat tokens points at a slow tool; a latency spike with a token spike points at a longer reasoning loop. Rising cost with steady success is waste to trim; rising cost with falling success is a regression to chase. The art is correlating operational and quality signals so a number on a dashboard always points at a specific span in a specific trace.
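The four readings above can be written down as a small triage rule. This is a sketch under stated assumptions: the `Window` fields, the 20% spike threshold, and the 5-point success drop are hypothetical values chosen for illustration, and a real system would tune them per agent.

```python
from dataclasses import dataclass

SPIKE = 0.20   # hypothetical: a 20% relative increase counts as a spike
DROP = 0.05    # hypothetical: a 5-point absolute fall in success rate counts as falling


@dataclass
class Window:
    """Aggregated signals for one time window."""
    p95_latency_ms: float
    tokens_per_run: float
    cost_per_run: float
    success_rate: float   # 0.0 to 1.0


def rose(before, after, threshold=SPIKE):
    """True when `after` exceeds `before` by at least `threshold` (relative)."""
    return before > 0 and (after - before) / before >= threshold


def triage(prev, cur):
    """Compare two windows and return the readings described in the text."""
    findings = []
    if rose(prev.p95_latency_ms, cur.p95_latency_ms):
        if rose(prev.tokens_per_run, cur.tokens_per_run):
            findings.append("latency and tokens up: suspect a longer reasoning loop")
        else:
            findings.append("latency up, tokens flat: suspect a slow tool")
    if rose(prev.cost_per_run, cur.cost_per_run):
        if prev.success_rate - cur.success_rate >= DROP:
            findings.append("cost up, success down: regression to chase")
        else:
            findings.append("cost up, success steady: waste to trim")
    return findings


last_week = Window(p95_latency_ms=2000, tokens_per_run=6000, cost_per_run=0.012, success_rate=0.92)
this_week = Window(p95_latency_ms=3100, tokens_per_run=6100, cost_per_run=0.0125, success_rate=0.91)
print(triage(last_week, this_week))
```

Each finding is only a starting hypothesis; the next step is still to open example traces from the window and confirm which span is responsible.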

From logs to replay

Structured logging and replaying runs

Free-text logs are useless at agent scale. Structured, span-linked events turn every run into a record you can query, save, and replay exactly.

Log events, not sentences

Structured logging for agents

Each meaningful event — a model call, a tool call, a retry, a guardrail trip — is logged as a structured record carrying the trace ID, span ID, step number, and typed fields (model, tokens, latency, status). Because every event links to its span, you can pivot instantly from an aggregate metric to the exact run that produced it.

This is also what makes runs replayable. A trace that stores the inputs, the tool results, and the model responses is a complete, deterministic recording. You can step through it after the fact, feed it to a different model to compare, or pin it as a regression test — without re-hitting any live API.

Correlate everything by trace and span IDs.
Typed fields make logs queryable, not grep-able.
Stored inputs/outputs let you replay a run exactly.
Save a bad trace as a test case in one click.
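To make the replay idea concrete, here is a minimal sketch that re-runs an agent against a stored trace instead of live APIs. The trace schema, the `Replayer` class, and the toy `agent` function are hypothetical; they assume the agent takes its model and tool access as an injectable `call` function.

```python
# A stored trace in a hypothetical minimal schema: ordered spans with inputs and outputs.
stored_trace = {
    "trace_id": "run-001",
    "spans": [
        {"kind": "llm", "name": "plan",
         "input": "Where is order 42?", "output": "call get_order(42)"},
        {"kind": "tool", "name": "get_order",
         "input": {"id": 42}, "output": {"status": "shipped"}},
        {"kind": "llm", "name": "synthesize",
         "input": "order 42 -> shipped", "output": "Your order has shipped."},
    ],
}


class Replayer:
    """Serves recorded outputs in order instead of calling live APIs."""

    def __init__(self, trace):
        self._spans = iter(trace["spans"])

    def call(self, kind, name, payload):
        recorded = next(self._spans)
        if (recorded["kind"], recorded["name"]) != (kind, name):
            raise AssertionError(f"diverged: expected {recorded['name']}, got {name}")
        if recorded["input"] != payload:
            raise AssertionError(f"input drift at {name}")
        return recorded["output"]


def agent(call):
    """Toy agent whose model and tool access goes through `call`."""
    call("llm", "plan", "Where is order 42?")
    order = call("tool", "get_order", {"id": 42})
    return call("llm", "synthesize", f"order 42 -> {order['status']}")


answer = agent(Replayer(stored_trace).call)
print(answer)
```

Because the replayer raises as soon as the agent asks for a different step or sends a different input, the same recording doubles as a regression test: a code change that alters the path fails loudly instead of silently hitting a live service.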

How evaluation reuses traces

Capture run: Spans + structured events
Store trace: Inputs, outputs, context
Replay: Step through or re-run offline
Pin as test: Lock in a regression case

The path from a single run to a durable regression test: capture, store, replay, and pin the runs that mattered.

trace.py (Python)

```python
# Assumes `tracer`, `log`, `llm`, `run`, and `prompt` already exist in scope.
# The tracer and logger calls are illustrative, not a specific SDK's API.
with tracer.span("llm.call", trace_id=run.id) as span:  # open a span
    span.set(model="claude", step=run.step)
    resp = llm(prompt)  # the actual model call
    span.set(
        prompt=prompt,
        completion=resp.text,
        prompt_tokens=resp.usage.input,
        completion_tokens=resp.usage.output,
        latency_ms=span.elapsed(),
        status="ok",
    )

log.info("llm.call", **span.fields)  # one structured event, linked to the span
```

A minimal span emitting a structured event. Every field is typed and linked by trace and span IDs, so the run is queryable and replayable later.

Open standards

OpenTelemetry-style instrumentation for LLM apps

You don't need a proprietary format to trace agents. OpenTelemetry's span model fits LLM apps, and its GenAI conventions make traces portable across any backend.

OpenTelemetry (OTel) is the open standard for traces, metrics, and logs. Its span abstraction maps almost one-to-one onto agent steps, so the move to LLM observability is mostly about agreeing on attribute names. The community's GenAI semantic conventions do exactly that: standard keys for the model, prompt and completion content, input and output token counts, the operation type, and the tool name.

The pattern is simple and provider-agnostic: wrap each LLM call and tool call in a span, set the standard GenAI attributes, and export the spans over OTLP to whatever backend you like — a dedicated LLM observability platform, your existing tracing stack, or both. Instrument once and you can swap dashboards without touching application code, which avoids the lock-in of bespoke SDKs.

Many agent frameworks now auto-instrument these spans for you, so a single setup call produces a full nested trace. Even then, it's worth adding custom spans around your own tools and guardrails so the trace reflects your agent, not just the framework's calls.

Trace: The full tree of spans for one agent run, sharing a single trace ID.
Span: One timed operation with a name, parent, status, and attributes.
Attribute: A typed key-value on a span — e.g. gen_ai.usage.input_tokens or gen_ai.request.model.
OTLP: The OpenTelemetry wire protocol used to export spans to a backend.
Exporter: The component that ships spans from your app to one or more observability backends.
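The attribute side of this pattern can be sketched without any dependencies. The helpers below build dictionaries keyed by GenAI semantic-convention names; the function names, the `RecordingSpan` stand-in, and the model name are assumptions for the example. The conventions are still evolving, so check the current specification before relying on any particular key.

```python
def genai_llm_attributes(model, input_tokens, output_tokens, operation="chat"):
    """Attributes for one model call, using GenAI semantic-convention keys."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def genai_tool_attributes(tool_name):
    """Attributes for one tool call."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
    }


class RecordingSpan:
    """Stand-in for a tracing span; it only mirrors a set_attribute(key, value) method."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


def attach(span, attributes):
    """Copy each attribute onto the span, one key at a time."""
    for key, value in attributes.items():
        span.set_attribute(key, value)


llm_span = RecordingSpan()
attach(llm_span, genai_llm_attributes("example-model", input_tokens=1200, output_tokens=85))

tool_span = RecordingSpan()
attach(tool_span, genai_tool_attributes("search_docs"))

print(llm_span.attributes)
print(tool_span.attributes)
```

With the OpenTelemetry Python API, the same `attach` loop would run against the span yielded by `tracer.start_as_current_span(...)`, since real spans expose `set_attribute(key, value)` too; the exporter configuration then decides where those spans are sent.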

Mind the sensitive content

Spans often capture full prompts and completions, which may contain PII or secrets. Treat trace storage as sensitive: redact or hash fields where needed, scope access, and apply retention limits. Good observability shouldn't become a new data-leak surface.
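One way to apply that advice is to scrub span fields before they are stored. The sketch below is deliberately simple and makes several assumptions: the field names (`prompt`, `completion`, `user_id`), the secret pattern, and the salt are hypothetical, and regex redaction will miss many kinds of PII, so treat it as a starting point rather than a complete control.

```python
import hashlib
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical key shape: a short prefix, a separator, then 16+ alphanumerics.
SECRET = re.compile(r"\b(?:sk|key|token)[-_][A-Za-z0-9]{16,}\b")


def redact_text(text):
    """Replace email addresses and key-like strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return SECRET.sub("[SECRET]", text)


def pseudonymize(value, salt):
    """Stable salted hash so runs can still be grouped by user without storing the raw ID."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:12]


def scrub_span(fields, salt="rotate-me"):
    """Return a copy of the span fields that is safer to store."""
    clean = dict(fields)
    for key in ("prompt", "completion"):
        if key in clean:
            clean[key] = redact_text(clean[key])
    if "user_id" in clean:
        clean["user_id"] = pseudonymize(clean["user_id"], salt)
    return clean


raw = {
    "prompt": "Email jane@example.com, key sk-ABCDEFGHIJKLMNOP1234",
    "completion": "Done.",
    "user_id": "user-8841",
    "latency_ms": 640,
}
print(scrub_span(raw))
```

Scrubbing at write time keeps the raw values out of trace storage entirely, which is easier to reason about than filtering them at read time; the trade-off is that redacted prompts can no longer be replayed byte for byte.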

Operate it

Dashboards, alerting, and on-call

Traces explain a single run; dashboards and alerts watch the fleet. The goal is to catch regressions before users feel them — and jump straight to the trace that caused one.

1 · Dashboards
Aggregate the signals over time and slice them by model, tool, agent version, and user segment. Track p50/p95 latency, cost per run, tool error rates, average steps, and success rate side by side.
2 · Alerts
Set thresholds on what hurts: a tool error-rate spike, a cost-per-run jump after a deploy, p95 latency breaching SLA, or success rate dropping. Alert on rates and trends, not single noisy runs.
3 · Drill down
Every alert and chart links back to example traces. On-call clicks from 'cost is up 40%' straight to the runs driving it, then to the span — the bloated prompt or looping tool — responsible.
4 · Close the loop
Save the offending trace as a regression test, ship a fix (prompt, tool, or guardrail), and watch the dashboard confirm that the metric recovers. The fix is verified by the same system that caught it.
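The dashboard and alert steps above reduce to a few aggregate computations over run records. The sketch below assumes a hypothetical run schema (`latency_ms`, `tool_calls`, `tool_errors`), made-up thresholds, and a nearest-rank percentile; production systems usually compute these in the metrics backend rather than in application code.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_alerts(runs, p95_sla_ms, max_tool_error_rate):
    """Alert on fleet-level rates and tails, not on any single run."""
    latencies = [r["latency_ms"] for r in runs]
    tool_calls = sum(r["tool_calls"] for r in runs)
    tool_errors = sum(r["tool_errors"] for r in runs)
    error_rate = tool_errors / tool_calls if tool_calls else 0.0

    alerts = []
    p95 = percentile(latencies, 95)
    if p95 > p95_sla_ms:
        alerts.append(f"p95 latency {p95} ms exceeds SLA {p95_sla_ms} ms")
    if error_rate > max_tool_error_rate:
        alerts.append(f"tool error rate {error_rate:.1%} exceeds {max_tool_error_rate:.1%}")
    return alerts


# Twenty synthetic runs: latency climbs steadily and every fifth run has one tool error.
runs = [
    {"latency_ms": 1000 + 100 * i, "tool_calls": 3, "tool_errors": 1 if i % 5 == 0 else 0}
    for i in range(20)
]

for alert in check_alerts(runs, p95_sla_ms=2500, max_tool_error_rate=0.05):
    print(alert)
```

In practice each alert would also carry a handful of trace IDs from the offending window, so the on-call engineer can go from the aggregate to a specific run in one step.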

Concern	Traditional monitoring	Agent observability
Primary question	Is the service up?	What did the agent do, and was it right?
Unit of analysis	Request / endpoint	Trace of nested spans
Failure looks like	Error code, exception	Fluent but wrong answer (HTTP 200)
Key signals	CPU, memory, status codes	Tokens, cost, steps, tool errors, quality
Debugging method	Read the stack trace	Replay the run, inspect each span
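Reproducible?	Usually, because the code is deterministic	Not by re-running; only by replaying a stored trace
Feeds evaluation	No	Yes, traces become the eval set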

Why it all pays off

How observability feeds evaluation and debugging

Observability isn't an end in itself. Its real value is the flywheel it powers: every captured run makes the next version of your agent measurably better.

Think of observability as the data layer and evaluation as the judgment layer on top of it. Traces are the raw material; scores, labels, and tests are what you build from them. Because every quality score attaches to a span, a low faithfulness score doesn't just say "this answer was wrong"; it points at the exact retrieval or synthesis step that caused it.
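A small sketch of what "scores attach to spans" can look like in data. The span records, metric names, score values, and the 0.5 threshold are all illustrative assumptions, not measured results or a particular platform's schema.

```python
# Spans from one trace, each carrying any evaluation scores attached to it.
spans = [
    {"span_id": "s1", "name": "plan", "kind": "llm", "scores": {}},
    {"span_id": "s2", "name": "search_docs", "kind": "retrieval",
     "scores": {"context_relevance": 0.31}},
    {"span_id": "s3", "name": "synthesize", "kind": "llm",
     "scores": {"faithfulness": 0.88}},
]


def weakest_span(spans, threshold=0.5):
    """Return (span, metric, score) for the lowest score under the threshold, or None."""
    scored = [
        (score, metric, span)
        for span in spans
        for metric, score in span["scores"].items()
    ]
    below = [item for item in scored if item[0] < threshold]
    if not below:
        return None
    score, metric, span = min(below, key=lambda item: item[0])
    return span, metric, score


culprit = weakest_span(spans)
if culprit:
    span, metric, score = culprit
    print(f"{span['name']} ({span['kind']}): {metric} = {score}")
```

Here the final answer may read as unfaithful, but the lowest score sits on the retrieval span, which suggests fixing retrieval before touching the synthesis prompt.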

The loop is concrete. You spot a bad run in a dashboard, open its trace, and see the failing span. You save that trace to your eval set, change a prompt or swap a tool, and replay the saved traces to confirm the fix without a regression elsewhere. Over time your eval set becomes a library of real production failures — far more honest than synthetic test cases.

This is the same hygiene that makes any agent dependable, and it pairs with disciplined deployment: ship behind a version flag, watch the new version's traces against the old, and roll back the moment a signal regresses. Observability is what makes that comparison possible at all.

Trace everything in production — You can't debug or evaluate a run you didn't capture — instrument first.
Store inputs and outputs, not just timings — Content is what makes a trace replayable and scorable.
Turn real failures into tests — Promote bad traces to your eval set so they can't silently return.
Attach quality scores to spans — Let a low score point straight at the step that caused it.
Compare versions on the same traces — Replay a fixed set across model/prompt changes to catch regressions.
Alert on trends, drill to traces — Move from an aggregate signal to the offending span in seconds.
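The "compare versions on the same traces" practice can be sketched as a set comparison over pinned trace IDs. The result format and the pass/fail inputs are hypothetical; in a real pipeline the booleans would come from replaying each pinned trace and scoring the output.

```python
def compare_versions(baseline, candidate):
    """Each argument maps trace_id -> passed (bool) for the same pinned traces."""
    shared = baseline.keys() & candidate.keys()
    regressions = sorted(t for t in shared if baseline[t] and not candidate[t])
    fixes = sorted(t for t in shared if not baseline[t] and candidate[t])
    return {
        "regressions": regressions,
        "fixes": fixes,
        "safe_to_ship": not regressions,
    }


# Illustrative results for three pinned traces under the old and new prompt.
baseline = {"t1": True, "t2": False, "t3": True}
candidate = {"t1": True, "t2": True, "t3": False}

report = compare_versions(baseline, candidate)
print(report)
```

Reporting fixes and regressions separately matters because a change can raise the overall pass rate while still breaking a case that used to work, and that broken case is the one users will notice.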

Trace per run: the unit you replay
Spans per agent turn: LLM + tool + retrieval
Latency to watch: p95 (tails, not averages)
Runs instrumented: 100% (the observability goal)

FAQ

Agent observability, answered

What is AI agent observability?

AI agent observability is the practice of instrumenting an agent so you can see exactly what it did on any run: every LLM call, every tool invocation, the prompts and responses, the tokens spent, the latency, and the decisions that strung them together. Where traditional monitoring tells you a service is up or down, observability lets you ask open-ended questions of a system whose behavior you couldn't fully predict in advance. Because an agent's logic emerges from model outputs rather than fixed code paths, the only way to understand a failure is to reconstruct the run step by step from recorded traces.

Keep learning

Go deeper on running agents in production

AI agent evaluation: Score traces and turn them into tests
AI agent deployment: Ship, version, and roll back safely
AI agent architecture: The structure observability instruments
Platform status: Live uptime and incident history
AI agent tools: The tool calls your spans record
LLM agents: The reason–act loop you trace
