AI Agent Observability: tracing & monitoring

An agent's logic lives in its model outputs, not your source code — so when a run goes wrong, you can't read the bug, you have to reconstruct it. Observability records every reasoning step, LLM call, and tool call so you can see, measure, and debug what your agent actually did.

  • 13 min read
  • Intermediate
  • Updated 2026

Observability is how you turn an opaque, probabilistic agent into a system you can actually inspect, measure, and trust in production.

A conventional service fails in legible ways: a stack trace, an error code, a line number. An AI agent does not. Its behavior emerges from a model deciding, turn by turn, which tool to call and what to say next. When the output is wrong, there is no line of code to blame: the "bug" is a chain of decisions across several LLM calls and tool invocations, any of which could have introduced the error. Agent observability is the discipline of recording that chain in enough detail that you can replay it, measure it, and find the step that broke.

The core artifact is the trace: a structured, time-ordered record of a single run, broken into nested spans for each model call, tool call, and retrieval. Layered on top are the signals you watch — token usage, latency, cost, tool errors, step counts — and the dashboards and alerts that surface trouble before users do. Done well, this same data feeds directly into evaluation and debugging.

This guide covers why agents demand observability, how traces and spans model a run, the signals worth tracking, structured logging and run replay, OpenTelemetry-style instrumentation for LLM apps, alerting and dashboards, and how all of it loops back into making your agent better.

The problem

Why agents are uniquely hard to observe

Agents combine non-determinism, multi-step reasoning, external tools, and natural-language failure — a combination that defeats traditional monitoring.

Classic monitoring assumes deterministic code: the same input yields the same path, errors throw, and a status code tells you what happened. None of those hold for an agent. The same prompt can take a different route on every run, "failure" often looks like a perfectly well-formed answer that is simply wrong, and the real logic lives inside model outputs you never wrote.

Worse, the work is multi-step and distributed. A single user request might fan out into a planning call, three tool calls, a retrieval, and a synthesis call — sometimes across several sub-agents. An error in step two only surfaces as nonsense in step five. Without a recorded trace, you are debugging by re-running and hoping the failure reproduces, which, for a non-deterministic system, it often won't.

Observability answers the questions monitoring can't: Which tool call failed? What exact prompt did the model see? Why did it loop seven times? Where did the cost spike? You can only answer those if you captured the run when it happened.

Non-determinism

Same input, different path. You can't reason about behavior from code alone — you have to observe actual runs.

Multi-step reasoning

Errors compound across planning, tool, and synthesis steps. The visible symptom is rarely where the fault began.

External tools

APIs, retrieval, and code execution introduce latency, flakiness, and failures the model must then react to.

Silent wrongness

A failed run can return HTTP 200 and a fluent, confident, incorrect answer. Uptime tells you nothing.

The core model

Traces and spans: anatomy of a run

Borrowed from distributed tracing and adapted for agents, the trace-and-span model is the foundation everything else builds on.

A trace represents one end-to-end run — a user's request through to the agent's final response. Inside it, every discrete unit of work becomes a span with a name, a start and end time, and a set of attributes. Spans nest into a tree: the top span is the agent turn; its children are the LLM calls and tool calls; a tool call that triggers retrieval has its own child spans. The shape of that tree is the shape of the agent's reasoning.

  1. 00:00.000 · root span

    Trace start · user request received

    Root span opens. Captures the user input, session and run IDs, and the agent + model configuration in scope.

  2. 00:00.180 · llm span

    LLM call · plan the task

    First model span. Records the full prompt, the response, prompt/completion tokens, latency, and which tools the model chose to call.

  3. 00:00.940 · tool span

    Tool call · search_docs(query)

    Tool span with arguments, return value, and status. A nested retrieval span shows the top-k passages and their scores.

  4. 00:01.610 · error span

    Tool call · get_order(id) → error + retry

    Tool span flagged as an error with the exception and stack. A second sibling span shows the automatic retry succeeding.

  5. 00:02.300 · llm span

    LLM call · synthesize answer

    Final model span combines retrieved context and tool results into the answer, recording tokens, cost, and any citations.

  6. 00:03.050 · complete

    Trace end · response returned

    Root span closes. Run-level totals roll up: 5 steps, total tokens, total cost, total latency, and outcome status.
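
The timeline above can be sketched as a small tree of spans. This is an illustrative data structure, not any library's type: the `Span` class, its field names, the span names, the end times, and the token counts are all assumptions. Only the start times come from the timeline.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed unit of work in a run (hypothetical shape, not a library type)."""
    name: str
    kind: str                      # "agent", "llm", "tool", or "retrieval"
    start_ms: int
    end_ms: int
    status: str = "ok"
    tokens: int = 0
    children: list[Span] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Roll token counts up from this span and every descendant.
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def walk(self, depth: int = 0):
        # Yield (depth, span) pairs in depth-first order.
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


# Start times follow the timeline; end times and token counts are made up.
root = Span("agent.turn", "agent", 0, 3050, children=[
    Span("plan", "llm", 180, 940, tokens=1200),
    Span("search_docs", "tool", 940, 1610, children=[
        Span("retrieve_top_k", "retrieval", 960, 1500),
    ]),
    Span("get_order", "tool", 1610, 1900, status="error"),
    Span("get_order.retry", "tool", 1900, 2300),
    Span("synthesize", "llm", 2300, 3050, tokens=4100),
])

for depth, span in root.walk():
    duration = span.end_ms - span.start_ms
    print("  " * depth + f"{span.name} [{span.kind}] {duration} ms {span.status}")

# Names of every span that ended in an error.
failed = [s.name for _, s in root.walk() if s.status == "error"]
```

Walking the tree is enough to answer "which step failed?" and to roll run-level totals up to the root span.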

Capture inputs and outputs, not just timings

A trace that only records durations tells you an agent was slow but never why it was wrong. The most valuable spans store the full prompt and completion, the tool arguments and results, and the retrieved context — the actual content the model saw and produced. That content is what lets you replay and debug the run later, and what feeds evaluation.

What to measure

The signals that matter

Operational signals tell you whether the agent is fast, cheap, and reliable. Pair them with quality signals to know whether it's also right.

Token usage

Prompt and completion tokens, per step and per run. Spikes reveal bloated context, runaway tool output, or an agent re-reading the same data each loop.

Latency

End-to-end and per-span. Breaking latency down by step shows whether the model, a tool, or retrieval is the real bottleneck.

Cost

Derived from token counts and model pricing, then tracked per run, per user, and per feature. The fastest way to catch an expensive loop before the invoice does.

Tool errors & retries

Error rate and retry count per tool. Flaky integrations force the agent into recovery behavior that inflates steps, latency, and cost.

Step / loop counts

Steps per run and depth of the reasoning loop. A creeping average often means agents are getting stuck or thrashing rather than converging on the goal.

Outcome quality

Task success, refusal rate, and faithfulness from evaluation. Operational health is necessary but not sufficient — a fast, cheap, wrong agent is still broken.

A representative agent run, by step

  • Plan task (LLM): 1,200 tok
  • search_docs (retrieval): 350 tok
  • get_order (tool): 180 tok
  • Reflect on results (LLM): 900 tok
  • Synthesize answer (LLM): 4,100 tok

Illustrative per-step token spend for one customer-support run. Values are representative, not measured — synthesis dominates because it carries all retrieved context and tool results.
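
As a rough sketch of how cost falls out of these numbers, the snippet below totals the illustrative per-step tokens and applies a single price. The price is a made-up placeholder, and real providers usually price input and output tokens separately, so treat this as arithmetic under stated assumptions rather than a billing formula.

```python
# Per-step token counts from the illustrative run above.
STEP_TOKENS = {
    "plan_task": 1200,
    "search_docs": 350,
    "get_order": 180,
    "reflect": 900,
    "synthesize": 4100,
}

# Hypothetical blended price; substitute your provider's actual rates.
PRICE_PER_1K_TOKENS_USD = 0.01


def run_cost_usd(step_tokens, price_per_1k):
    """Cost of one run: total tokens times the per-1k-token price."""
    return sum(step_tokens.values()) / 1000 * price_per_1k


total = sum(STEP_TOKENS.values())
share = {step: tokens / total for step, tokens in STEP_TOKENS.items()}
top_step = max(share, key=share.get)
cost = run_cost_usd(STEP_TOKENS, PRICE_PER_1K_TOKENS_USD)

print(f"total tokens: {total}")
print(f"largest step: {top_step} ({share[top_step]:.0%})")
print(f"estimated cost: ${cost:.4f}")
```

Computing the per-step share alongside the total is what makes a cost spike actionable: it names the step to inspect.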

Read signals as a system, not in isolation. A latency spike with flat tokens points at a slow tool; a latency spike with a token spike points at a longer reasoning loop. Rising cost with steady success is waste to trim; rising cost with falling success is a regression to chase. The art is correlating operational and quality signals so a number on a dashboard always points at a specific span in a specific trace.
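
The readings in the paragraph above can be written down as a tiny rule table. This is only a sketch: the function name, the boolean inputs, and the wording of each finding are assumptions, and a real system would derive "up" and "down" from thresholds on time series rather than from flags.

```python
def diagnose(latency_up, tokens_up, cost_up, success_down):
    """Map coarse signal movements to the readings described above."""
    findings = []
    if latency_up and not tokens_up:
        findings.append("slow tool: inspect tool span durations")
    if latency_up and tokens_up:
        findings.append("longer reasoning loop: inspect step counts")
    if cost_up and not success_down:
        findings.append("waste: trim prompts or context")
    if cost_up and success_down:
        findings.append("regression: compare against the previous version")
    return findings


# Latency is up while tokens are flat, and nothing else moved.
print(diagnose(latency_up=True, tokens_up=False, cost_up=False, success_down=False))
```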

From logs to replay

Structured logging and replaying runs

Free-text logs stop being useful at agent scale. Structured, span-linked events turn every run into a record you can query, save, and replay exactly.

Log events, not sentences

Structured logging for agents

Each meaningful event — a model call, a tool call, a retry, a guardrail trip — is logged as a structured record carrying the trace ID, span ID, step number, and typed fields (model, tokens, latency, status). Because every event links to its span, you can pivot instantly from an aggregate metric to the exact run that produced it.

This is also what makes runs replayable. A trace that stores the inputs, the tool results, and the model responses is a complete, deterministic recording. You can step through it after the fact, feed it to a different model to compare, or pin it as a regression test — without re-hitting any live API.

  • Correlate everything by trace and span IDs.
  • Typed fields make logs queryable, not grep-able.
  • Stored inputs/outputs let you replay a run exactly.
  • Save a bad trace as a test case in one click.
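
One way to replay a run without touching a live API is to serve tool results from the stored trace. The sketch below assumes a trace reduced to an ordered list of tool calls. `ReplayTools`, the record layout, and the sample data are hypothetical and not part of any particular platform.

```python
class ReplayTools:
    """Serves recorded tool results instead of calling live APIs."""

    def __init__(self, recorded_calls):
        # Each recorded call looks like {"tool": name, "args": {...}, "result": ...}.
        self._calls = list(recorded_calls)
        self._cursor = 0

    def call(self, tool, **args):
        if self._cursor >= len(self._calls):
            raise RuntimeError(f"replay diverged: unexpected extra call to {tool}")
        expected = self._calls[self._cursor]
        if (expected["tool"], expected["args"]) != (tool, args):
            raise RuntimeError(f"replay diverged at call {self._cursor}: got {tool}{args}")
        self._cursor += 1
        return expected["result"]


# A stored trace, reduced to its tool calls (hypothetical data).
saved_trace = [
    {"tool": "search_docs", "args": {"query": "refund policy"},
     "result": ["Refunds within 30 days."]},
    {"tool": "get_order", "args": {"id": "A-100"},
     "result": {"status": "shipped"}},
]

tools = ReplayTools(saved_trace)
docs = tools.call("search_docs", query="refund policy")
order = tools.call("get_order", id="A-100")
```

Raising on divergence is a deliberate choice here: if a changed prompt makes the agent call different tools, the replay should fail loudly instead of returning stale results.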

How evaluation reuses traces

  1. Capture run: spans + structured events
  2. Store trace: inputs, outputs, context
  3. Replay: step through or re-run offline
  4. Pin as test: lock in a regression case

The path from a single run to a durable regression test: capture, store, replay, and pin the runs that mattered.

trace.py (Python)

```python
# Assumes application-provided objects: `tracer`, `run`, `llm`, `prompt`,
# and a structured logger `log` that accepts keyword fields.
with tracer.span("llm.call", trace_id=run.id) as span:  # open a span
    span.set(model="claude", step=run.step)
    resp = llm(prompt)  # the actual model call
    span.set(
        prompt=prompt,
        completion=resp.text,
        prompt_tokens=resp.usage.input,
        completion_tokens=resp.usage.output,
        latency_ms=span.elapsed(),
        status="ok",
    )

log.info("llm.call", **span.fields)  # one structured event, linked to the span
```

A minimal span emitting a structured event. Every field is typed and linked by trace and span IDs, so the run is queryable and replayable later.
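
The snippet above leaves `tracer` undefined. For readers who want something they can run, here is one possible stand-in built only on the standard library. The `Tracer` and `Span` classes, the `sink` parameter, and `fake_llm` are all assumptions made for illustration. A production setup would more likely use an existing tracing SDK.

```python
import json
import time
import uuid
from contextlib import contextmanager


class Span:
    def __init__(self, name, trace_id):
        self.fields = {"event": name, "trace_id": trace_id,
                       "span_id": uuid.uuid4().hex[:16]}
        self._start = time.perf_counter()

    def set(self, **fields):
        self.fields.update(fields)

    def elapsed(self):
        # Milliseconds since the span opened.
        return round((time.perf_counter() - self._start) * 1000, 3)


class Tracer:
    def __init__(self, sink=print):
        self.sink = sink  # where finished events go; print is a placeholder

    @contextmanager
    def span(self, name, trace_id):
        span = Span(name, trace_id)
        try:
            yield span
        except Exception as exc:
            span.set(status="error", error=repr(exc))
            raise
        finally:
            span.fields.setdefault("latency_ms", span.elapsed())
            self.sink(json.dumps(span.fields))  # one JSON line per span


def fake_llm(prompt):
    # Placeholder for a real model call.
    return {"text": "ok", "input_tokens": len(prompt.split()), "output_tokens": 1}


events = []
tracer = Tracer(sink=events.append)
with tracer.span("llm.call", trace_id="run-1") as span:
    resp = fake_llm("Where is my order?")
    span.set(prompt_tokens=resp["input_tokens"],
             completion_tokens=resp["output_tokens"], status="ok")
```

Emitting the event in `finally` means a span is written even when the wrapped call raises, which is the case you most want a record of.
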
Open standards

OpenTelemetry-style instrumentation for LLM apps

You don't need a proprietary format to trace agents. OpenTelemetry's span model fits LLM apps, and its GenAI conventions make traces portable across any backend.

OpenTelemetry (OTel) is the open standard for traces, metrics, and logs. Its span abstraction maps almost one-to-one onto agent steps, so the move to LLM observability is mostly about agreeing on attribute names. The community's GenAI semantic conventions do exactly that: standard keys for the model, prompt and completion content, input and output token counts, the operation type, and the tool name.

The pattern is simple and provider-agnostic: wrap each LLM call and tool call in a span, set the standard GenAI attributes, and export the spans over OTLP to whatever backend you like — a dedicated LLM observability platform, your existing tracing stack, or both. Instrument once and you can swap dashboards without touching application code, which avoids the lock-in of bespoke SDKs.

Many agent frameworks now auto-instrument these spans for you, so a single setup call produces a full nested trace. Even then, it's worth adding custom spans around your own tools and guardrails so the trace reflects your agent, not just the framework's calls.

Trace
The full tree of spans for one agent run, sharing a single trace ID.
Span
One timed operation with a name, parent, status, and attributes.
Attribute
A typed key-value on a span — e.g. gen_ai.usage.input_tokens or gen_ai.request.model.
OTLP
The OpenTelemetry wire protocol used to export spans to a backend.
Exporter
The component that ships spans from your app to one or more observability backends.
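
A minimal sketch of the wrap-and-annotate pattern with the OpenTelemetry Python API follows. It assumes the `opentelemetry-api` package and falls back to an untraced call if it is missing. `traced_llm_call`, `stub_model`, and the model name are hypothetical. The attribute keys follow the GenAI semantic conventions, which are still evolving, so check the current specification before relying on them. Exporting over OTLP needs the SDK and an exporter to be configured separately, which is not shown.

```python
try:
    from opentelemetry import trace  # provided by the opentelemetry-api package
except ImportError:  # keep the sketch runnable without the dependency
    trace = None


def genai_attributes(model, input_tokens, output_tokens):
    """Build span attributes using GenAI semantic-convention key names."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def traced_llm_call(call_model, prompt, model):
    """Wrap one model call in a span.

    `call_model` is a hypothetical callable returning
    (text, input_tokens, output_tokens).
    """
    if trace is None:
        text, _, _ = call_model(prompt)
        return text
    tracer = trace.get_tracer("agent.sketch")
    with tracer.start_as_current_span(f"chat {model}") as span:
        text, tokens_in, tokens_out = call_model(prompt)
        for key, value in genai_attributes(model, tokens_in, tokens_out).items():
            span.set_attribute(key, value)
        return text


def stub_model(prompt):
    # Stand-in for a provider SDK call.
    return "hello", len(prompt.split()), 1


answer = traced_llm_call(stub_model, "Say hello", model="example-model")
```

With only the API package installed the spans are no-ops, so the same code runs unchanged in tests and starts exporting once an SDK is configured.
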

Mind the sensitive content

Spans often capture full prompts and completions, which may contain PII or secrets. Treat trace storage as sensitive: redact or hash fields where needed, scope access, and apply retention limits. Good observability shouldn't become a new data-leak surface.
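
As one illustration of redacting before storage, the helper below masks a few field types in a span's attributes. The field names, the email pattern, and the truncated hash are assumptions chosen for brevity. An unsalted hash of a guessable ID is weak protection, so a real deployment would likely use a keyed hash and a broader PII detector.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SECRET_FIELDS = {"api_key", "authorization"}  # hypothetical field names


def redact_span_fields(fields):
    """Return a copy safe to store: mask secrets, hash user IDs, mask emails."""
    clean = {}
    for key, value in fields.items():
        if key in SECRET_FIELDS:
            clean[key] = "[REDACTED]"
        elif key == "user_id":
            # A stable hash keeps runs groupable without storing the raw ID.
            clean[key] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        elif isinstance(value, str):
            clean[key] = EMAIL.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean


raw = {
    "prompt": "Contact jane.doe@example.com about order 42",
    "api_key": "sk-not-a-real-key",
    "user_id": "u-1001",
    "prompt_tokens": 9,
}
safe = redact_span_fields(raw)
```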

Operate it

Dashboards, alerting, and on-call

Traces explain a single run; dashboards and alerts watch the fleet. The goal is to catch regressions before users feel them — and jump straight to the trace that caused one.

  1. Dashboards

    Aggregate the signals over time and slice them by model, tool, agent version, and user segment. Track p50/p95 latency, cost per run, tool error rates, average steps, and success rate side by side.

  2. Alerts

    Set thresholds on what hurts: a tool error-rate spike, a cost-per-run jump after a deploy, p95 latency breaching SLA, or success rate dropping. Alert on rates and trends, not single noisy runs.

  3. Drill down

    Every alert and chart links back to example traces. On-call clicks from 'cost is up 40%' straight to the runs driving it, then to the span — the bloated prompt or looping tool — responsible.

  4. Close the loop

    Save the offending trace as a regression test, ship a fix (prompt, tool, or guardrail), and watch the dashboard confirm that the metric recovers. The fix is verified by the same system that caught it.
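
Alerting on a rate rather than on single runs can be sketched with a rolling window. The class name, window size, threshold, and minimum sample count below are illustrative assumptions, not recommended values.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error rate over the last `window` runs exceeds `threshold`."""

    def __init__(self, window=50, threshold=0.2, min_runs=10):
        self.outcomes = deque(maxlen=window)  # True means the run had a tool error
        self.threshold = threshold
        self.min_runs = min_runs

    def record(self, had_error):
        self.outcomes.append(had_error)
        if len(self.outcomes) < self.min_runs:
            return False  # too little data: one noisy run should not page anyone
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate > self.threshold


alert = ErrorRateAlert(window=20, threshold=0.2, min_runs=10)
single_failure = alert.record(True)              # one bad run, no alert
quiet = [alert.record(False) for _ in range(9)]  # 1 error in 10 runs
burst = [alert.record(True) for _ in range(3)]   # errors climb to 4 in 13 runs
```

The `min_runs` guard is what keeps a single failed run from paging anyone, and the bounded deque lets old runs age out so the alert clears once the rate recovers.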

Concern | Traditional monitoring | Agent observability
Primary question | Is the service up? | What did the agent do, and was it right?
Unit of analysis | Request / endpoint | Trace of nested spans
Failure looks like | Error code, exception | Fluent but wrong answer (HTTP 200)
Key signals | CPU, memory, status codes | Tokens, cost, steps, tool errors, quality
Debugging method | Read the stack trace | Replay the run, inspect each span
Reproducible? | |
Feeds evaluation | |

Why it all pays off

How observability feeds evaluation and debugging

Observability isn't an end in itself. Its real value is the flywheel it powers: every captured run makes the next version of your agent measurably better.

Think of observability as the data layer and evaluation as the judgment layer on top of it. Traces are the raw material; scores, labels, and tests are what you build from them. Because every quality score attaches to a span, a low faithfulness score doesn't just say "this answer was wrong"; it points at the exact retrieval or synthesis step that caused it.

The loop is concrete. You spot a bad run in a dashboard, open its trace, and see the failing span. You save that trace to your eval set, change a prompt or swap a tool, and replay the saved traces to confirm the fix without a regression elsewhere. Over time your eval set becomes a library of real production failures — far more honest than synthetic test cases.

This is the same hygiene that makes any agent dependable, and it pairs with disciplined deployment: ship behind a version flag, watch the new version's traces against the old, and roll back the moment a signal regresses. Observability is what makes that comparison possible at all.

  • Trace everything in production: You can't debug or evaluate a run you didn't capture, so instrument first.
  • Store inputs and outputs, not just timings: Content is what makes a trace replayable and scorable.
  • Turn real failures into tests: Promote bad traces to your eval set so they can't silently return.
  • Attach quality scores to spans: Let a low score point straight at the step that caused it.
  • Compare versions on the same traces: Replay a fixed set across model/prompt changes to catch regressions.
  • Alert on trends, drill to traces: Move from an aggregate signal to the offending span in seconds.
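
Comparing versions on the same pinned cases can be as simple as the sketch below. The two agents are stand-in functions, the pinned data is hypothetical, and substring matching is a deliberately crude scorer. A real eval would plug in whatever success or faithfulness check you already trust.

```python
def pass_rate(agent, pinned_cases):
    """Fraction of pinned cases where the answer contains the expected text."""
    passed = sum(1 for case in pinned_cases if case["expect"] in agent(case["input"]))
    return passed / len(pinned_cases)


def has_regressed(old_agent, new_agent, pinned_cases, tolerance=0.0):
    """True if the new version scores lower than the old one beyond `tolerance`."""
    old_score = pass_rate(old_agent, pinned_cases)
    new_score = pass_rate(new_agent, pinned_cases)
    return new_score < old_score - tolerance


# Pinned cases promoted from bad production traces (hypothetical data).
pinned = [
    {"input": "refund window?", "expect": "30 days"},
    {"input": "order A-100 status?", "expect": "shipped"},
]


def agent_v1(question):
    # Stand-in for the old version: only the refund question is answered well.
    if "refund" in question:
        return "Refunds are accepted within 30 days."
    return "I don't know."


def agent_v2(question):
    # Stand-in for the candidate version: both questions are answered.
    if "refund" in question:
        return "Refunds are accepted within 30 days."
    return "Order A-100 has shipped."


print(pass_rate(agent_v1, pinned), pass_rate(agent_v2, pinned))
print("regressed:", has_regressed(agent_v1, agent_v2, pinned))
```
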

  • 1 · Trace per run · the unit you replay
  • 5+ · Spans per agent turn · LLM + tool + retrieval
  • p95 · Latency to watch · tails, not averages
  • 100% · Runs instrumented · the observability goal

FAQ

Agent observability, answered

AI agent observability is the practice of instrumenting an agent so you can see exactly what it did on any run: every LLM call, every tool invocation, the prompts and responses, the tokens spent, the latency, and the decisions that strung them together. Where traditional monitoring tells you a service is up or down, observability lets you ask open-ended questions of a system whose behavior you couldn't fully predict in advance. Because an agent's logic emerges from model outputs rather than fixed code paths, the only way to understand a failure is to reconstruct the run step by step from recorded traces.
