AI Agent Observability: tracing & monitoring
An agent's logic lives in its model outputs, not your source code — so when a run goes wrong, you can't read the bug, you have to reconstruct it. Observability records every reasoning step, LLM call, and tool call so you can see, measure, and debug what your agent actually did.
- 13 min read
- Intermediate
- Updated 2026
Observability is how you turn an opaque, probabilistic agent into a system you can actually inspect, measure, and trust in production.
A conventional service fails in legible ways: a stack trace, an error code, a line number. An AI agent does not. Its behavior emerges from a model deciding, turn by turn, which tool to call and what to say next. When the output is wrong, there is no line of code to blame: the "bug" is a chain of decisions across several LLM calls and tool invocations, any of which could have introduced the error. Agent observability is the discipline of recording that chain in enough detail that you can replay it, measure it, and find the step that broke.
The core artifact is the trace: a structured, time-ordered record of a single run, broken into nested spans for each model call, tool call, and retrieval. Layered on top are the signals you watch — token usage, latency, cost, tool errors, step counts — and the dashboards and alerts that surface trouble before users do. Done well, this same data feeds directly into evaluation and debugging.
This guide covers why agents demand observability, how traces and spans model a run, the signals worth tracking, structured logging and run replay, OpenTelemetry-style instrumentation for LLM apps, alerting and dashboards, and how all of it loops back into making your agent better.
Why agents are uniquely hard to observe
Agents combine non-determinism, multi-step reasoning, external tools, and natural-language failure — a combination that defeats traditional monitoring.
Classic monitoring assumes deterministic code: the same input yields the same path, errors throw, and a status code tells you what happened. None of those hold for an agent. The same prompt can take a different route on every run, "failure" often looks like a perfectly well-formed answer that is simply wrong, and the real logic lives inside model outputs you never wrote.
Worse, the work is multi-step and distributed. A single user request might fan out into a planning call, three tool calls, a retrieval, and a synthesis call — sometimes across several sub-agents. An error in step two only surfaces as nonsense in step five. Without a recorded trace, you are debugging by re-running and hoping the failure reproduces, which, for a non-deterministic system, it often won't.
Observability answers the questions monitoring can't: Which tool call failed? What exact prompt did the model see? Why did it loop seven times? Where did the cost spike? You can only answer those if you captured the run when it happened.
Non-determinism
Same input, different path. You can't reason about behavior from code alone — you have to observe actual runs.
Multi-step reasoning
Errors compound across planning, tool, and synthesis steps. The visible symptom is rarely where the fault began.
External tools
APIs, retrieval, and code execution introduce latency, flakiness, and failures the model must then react to.
Silent wrongness
A failed run can return HTTP 200 and a fluent, confident, incorrect answer. Uptime tells you nothing.
Traces and spans: anatomy of a run
Borrowed from distributed tracing and adapted for agents, the trace-and-span model is the foundation everything else builds on.
A trace represents one end-to-end run — a user's request through to the agent's final response. Inside it, every discrete unit of work becomes a span with a name, a start and end time, and a set of attributes. Spans nest into a tree: the top span is the agent turn; its children are the LLM calls and tool calls; a tool call that triggers retrieval has its own child spans. The shape of that tree is the shape of the agent's reasoning.
- 00:00.000 · root span
Trace start · user request received
Root span opens. Captures the user input, session and run IDs, and the agent + model configuration in scope.
- 00:00.180 · llm span
LLM call · plan the task
First model span. Records the full prompt, the response, prompt/completion tokens, latency, and which tools the model chose to call.
- 00:00.940 · tool span
Tool call · search_docs(query)
Tool span with arguments, return value, and status. A nested retrieval span shows the top-k passages and their scores.
- 00:01.610 · error span
Tool call · get_order(id) → error + retry
Tool span flagged as an error with the exception and stack. A second sibling span shows the automatic retry succeeding.
- 00:02.300 · llm span
LLM call · synthesize answer
Final model span combines retrieved context and tool results into the answer, recording tokens, cost, and any citations.
- 00:03.050 · complete
Trace end · response returned
Root span closes. Run-level totals roll up: 5 steps, total tokens, total cost, total latency, and outcome status.
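The timeline above can be modelled as a small recursive data structure. This is a minimal sketch, not any particular SDK's types: the names `Span`, `walk`, and `rollup` are hypothetical, the span timings follow the timeline where it gives them, and the token counts and retry timing are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One timed unit of work in a trace (hypothetical structure)."""
    name: str
    kind: str                      # "root", "llm", "tool", or "retrieval"
    start_ms: int
    end_ms: int
    status: str = "ok"
    tokens: int = 0                # illustrative; only LLM spans carry tokens here
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    def walk(self):
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def rollup(root: Span) -> dict:
    """Compute run-level totals from the whole tree."""
    spans = list(root.walk())
    return {
        "steps": sum(1 for s in spans if s.kind in ("llm", "tool")),
        "total_tokens": sum(s.tokens for s in spans),
        "latency_ms": root.duration_ms,
        "errors": sum(1 for s in spans if s.status == "error"),
    }


# The example run, expressed as a tree of spans.
run = Span("agent.turn", "root", 0, 3050, children=[
    Span("llm.plan", "llm", 180, 940, tokens=900),
    Span("search_docs", "tool", 940, 1610, children=[
        Span("retrieve.top_k", "retrieval", 960, 1500),
    ]),
    Span("get_order", "tool", 1610, 1900, status="error"),
    Span("get_order.retry", "tool", 1900, 2300),
    Span("llm.synthesize", "llm", 2300, 3050, tokens=1400),
])

totals = rollup(run)
```

Because totals are derived by walking the tree, any run-level number can be traced back to the spans that contributed to it.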
Capture inputs and outputs, not just timings
A trace that only records durations tells you an agent was slow but never why it was wrong. The most valuable spans store the full prompt and completion, the tool arguments and results, and the retrieved context — the actual content the model saw and produced. That content is what lets you replay and debug the run later, and what feeds evaluation.
The signals that matter
Operational signals tell you whether the agent is fast, cheap, and reliable. Pair them with quality signals to know whether it's also right.
Token usage
Prompt and completion tokens, per step and per run. Spikes reveal bloated context, runaway tool output, or an agent re-reading the same data each loop.
Latency
End-to-end and per-span. Breaking latency down by step shows whether the model, a tool, or retrieval is the real bottleneck.
Cost
Derived from token counts and model pricing, then broken down per run, per user, and per feature. The fastest way to catch an expensive loop before the invoice does.
Tool errors & retries
Error rate and retry count per tool. Flaky integrations force the agent into recovery behavior that inflates steps, latency, and cost.
Step / loop counts
Steps per run and depth of the reasoning loop. A creeping average often means agents are getting stuck or thrashing rather than converging on the goal.
Outcome quality
Task success, refusal rate, and faithfulness from evaluation. Operational health is necessary but not sufficient — a fast, cheap, wrong agent is still broken.
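Cost is the one signal here that is computed rather than measured, so a short worked example may help. The sketch below assumes per-million-token pricing; the model name and prices in `PRICES` are placeholders, not real rates.

```python
# Hypothetical prices in USD per million tokens; substitute your provider's real rates.
PRICES = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one LLM call, from its token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def run_cost(calls: list[dict]) -> float:
    """Sum the cost of every LLM call recorded in one run."""
    return sum(
        call_cost(c["model"], c["input_tokens"], c["output_tokens"]) for c in calls
    )


calls = [
    {"model": "example-model", "input_tokens": 1_000, "output_tokens": 200},
    {"model": "example-model", "input_tokens": 4_000, "output_tokens": 500},
]
total = run_cost(calls)  # 0.006 + 0.0195 USD under the placeholder prices
```

Storing the token counts on each span, rather than only the computed cost, lets you re-price historical runs when rates change.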
A representative agent run, by step
Read signals as a system, not in isolation. A latency spike with flat tokens points at a slow tool; a latency spike with a token spike points at a longer reasoning loop. Rising cost with steady success is waste to trim; rising cost with falling success is a regression to chase. The art is correlating operational and quality signals so a number on a dashboard always points at a specific span in a specific trace.
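The correlation rules in the paragraph above can be written down as a small decision function. This is only a sketch of that reasoning: the function name `diagnose`, the 1.25 spike threshold, and the 0.02 tolerance are arbitrary choices for illustration, and a real system would tune them per metric.

```python
def diagnose(latency_ratio: float, token_ratio: float,
             cost_ratio: float, success_delta: float,
             spike: float = 1.25, tolerance: float = 0.02) -> list[str]:
    """Map correlated signal changes to a likely cause.

    Ratios are current / baseline. success_delta is the current success
    rate minus the baseline success rate.
    """
    findings = []
    if latency_ratio >= spike:
        if token_ratio >= spike:
            findings.append("longer reasoning loop")
        else:
            findings.append("slow tool")
    if cost_ratio >= spike:
        if success_delta < -tolerance:
            findings.append("regression to chase")
        elif success_delta <= tolerance:
            findings.append("waste to trim")
    return findings


# Latency up 60% with flat tokens and flat cost.
print(diagnose(1.6, 1.0, 1.0, 0.0))
```

The output is a hint about where to look, not a verdict; the next step is still to open the traces behind the metric.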
Structured logging and replaying runs
Free-text logs break down at agent scale because they can't be reliably queried or correlated. Structured, span-linked events turn every run into a record you can query, save, and replay exactly.
Structured logging for agents
Each meaningful event — a model call, a tool call, a retry, a guardrail trip — is logged as a structured record carrying the trace ID, span ID, step number, and typed fields (model, tokens, latency, status). Because every event links to its span, you can pivot instantly from an aggregate metric to the exact run that produced it.
This is also what makes runs replayable. A trace that stores the inputs, the tool results, and the model responses is a complete, deterministic recording. You can step through it after the fact, feed it to a different model to compare, or pin it as a regression test — without re-hitting any live API.
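Replay can be sketched as swapping live tools for a recording. In the example below, `ReplayTools`, `toy_agent`, and the recorded trace are all hypothetical stand-ins. A real replay would serve recorded model responses the same way, which this sketch leaves out for brevity.

```python
from typing import Callable

# A recorded trace: the user input plus every tool result, in call order.
recorded = {
    "input": "Where is order 42?",
    "tool_results": [
        {"tool": "get_order", "args": {"id": 42}, "result": {"status": "shipped"}},
    ],
}


class ReplayTools:
    """Serve tool calls from a recording instead of hitting live APIs."""

    def __init__(self, results: list[dict]):
        self._results = list(results)

    def call(self, tool: str, **args):
        expected = self._results.pop(0)
        # Fail loudly if the agent diverges from the recorded path.
        if (expected["tool"], expected["args"]) != (tool, args):
            raise AssertionError(f"replay diverged at {tool}({args})")
        return expected["result"]


def toy_agent(user_input: str, call_tool: Callable) -> str:
    """Stand-in for a real agent loop (hypothetical)."""
    order = call_tool("get_order", id=42)
    return f"Order 42 is {order['status']}."


answer = toy_agent(recorded["input"], ReplayTools(recorded["tool_results"]).call)
```

Raising on divergence is a deliberate choice: if a changed prompt makes the agent call a different tool, the replay should say so rather than return stale data.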
- Correlate everything by trace and span IDs.
- Typed fields make logs queryable, not grep-able.
- Stored inputs/outputs let you replay a run exactly.
- Save a bad trace as a test case in one click.
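A structured, span-linked event needs nothing beyond the standard library. The sketch below uses Python's `logging` and `json` modules; the `JsonFormatter` class, the `log_event` helper, and the field names are assumptions chosen for this example, not a required schema.

```python
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"event": record.getMessage(), "level": record.levelname}
        # Fields passed via extra={"fields": {...}} land on the record.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def log_event(event: str, trace_id: str, span_id: str, step: int, **fields) -> None:
    """Emit one structured event tied to a span."""
    logger.info(event, extra={"fields": {
        "trace_id": trace_id, "span_id": span_id, "step": step, **fields,
    }})


trace_id = uuid.uuid4().hex
log_event("tool.call", trace_id, span_id=uuid.uuid4().hex[:16], step=2,
          tool="search_docs", latency_ms=670, status="ok")
```

Every line this emits can be filtered by `trace_id` or `tool` in a log store, which is what makes the pivot from an aggregate metric to a single run possible.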
    # Sketch: tracer, run, llm, prompt, and log are assumed to be defined by your application.
    with tracer.span("llm.call", trace_id=run.id) as span:  # open a span
        span.set(model="claude", step=run.step)
        resp = llm(prompt)  # the actual model call
        span.set(
            prompt=prompt,
            completion=resp.text,
            prompt_tokens=resp.usage.input,
            completion_tokens=resp.usage.output,
            latency_ms=span.elapsed(),
            status="ok",
        )

    log.info("llm.call", **span.fields)  # one structured event, linked to the span

OpenTelemetry-style instrumentation for LLM apps
You don't need a proprietary format to trace agents. OpenTelemetry's span model fits LLM apps, and its GenAI conventions make traces portable across any backend.
OpenTelemetry (OTel) is the open standard for traces, metrics, and logs. Its span abstraction maps almost one-to-one onto agent steps, so the move to LLM observability is mostly about agreeing on attribute names. The community's GenAI semantic conventions do exactly that: standard keys for the model, prompt and completion content, input and output token counts, the operation type, and the tool name.
The pattern is simple and provider-agnostic: wrap each LLM call and tool call in a span, set the standard GenAI attributes, and export the spans over OTLP to whatever backend you like — a dedicated LLM observability platform, your existing tracing stack, or both. Instrument once and you can swap dashboards without touching application code, which avoids the lock-in of bespoke SDKs.
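As a rough illustration of "set the standard GenAI attributes", the helpers below build attribute dictionaries using key names from the GenAI semantic conventions. The helper names are hypothetical and the conventions are still evolving, so check the current specification before relying on any key. The block is stdlib-only; the trailing comment shows how the dictionaries would be applied with the OpenTelemetry Python API if it is installed.

```python
def genai_llm_attributes(model: str, input_tokens: int, output_tokens: int,
                         operation: str = "chat") -> dict:
    """Span attributes for one LLM call, using GenAI convention key names."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


def genai_tool_attributes(tool_name: str) -> dict:
    """Span attributes for one tool call."""
    return {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": tool_name,
    }


# With the OpenTelemetry Python API installed, these would be applied as:
#
#   from opentelemetry import trace
#   tracer = trace.get_tracer("my_agent")  # the name is a placeholder
#   with tracer.start_as_current_span("chat") as span:
#       span.set_attributes(genai_llm_attributes("example-model", 1200, 300))
attrs = genai_llm_attributes("example-model", 1200, 300)
```

Keeping attribute construction in one place means a change in the conventions touches one function rather than every call site.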
Many agent frameworks now auto-instrument these spans for you, so a single setup call produces a full nested trace. Even then, it's worth adding custom spans around your own tools and guardrails so the trace reflects your agent, not just the framework's calls.
- Trace
- The full tree of spans for one agent run, sharing a single trace ID.
- Span
- One timed operation with a name, parent, status, and attributes.
- Attribute
- A typed key-value on a span — e.g. gen_ai.usage.input_tokens or gen_ai.request.model.
- OTLP
- The OpenTelemetry wire protocol used to export spans to a backend.
- Exporter
- The component that ships spans from your app to one or more observability backends.
Mind the sensitive content
Spans often capture full prompts and completions, which may contain PII or secrets. Treat trace storage as sensitive: redact or hash fields where needed, scope access, and apply retention limits. Good observability shouldn't become a new data-leak surface.
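One way to apply the redact-or-hash advice is to scrub span fields before they are exported. The sketch below is deliberately simple: the two regular expressions are illustrative patterns (the `sk-` key shape in particular is an assumption), and the field names `prompt`, `completion`, and `user_id` are hypothetical. Production redaction usually needs a broader detector than this.

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Illustrative pattern for tokens shaped like "sk-..."; adapt to your own secrets.
API_KEY = re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")


def redact_text(text: str) -> str:
    """Replace emails and key-shaped tokens with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return API_KEY.sub("[SECRET]", text)


def hash_value(value: str, salt: str) -> str:
    """Stable pseudonym: the same input always maps to the same short hash."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def scrub_span(fields: dict, salt: str) -> dict:
    """Return a copy of the span fields that is safer to store."""
    clean = dict(fields)
    for key in ("prompt", "completion"):
        if key in clean:
            clean[key] = redact_text(clean[key])
    if "user_id" in clean:
        clean["user_id"] = hash_value(clean["user_id"], salt)
    return clean


scrubbed = scrub_span(
    {"prompt": "Email jane.doe@example.com about key sk-abcdefghijklmnop1234.",
     "user_id": "u-1"},
    salt="rotate-me",
)
```

Hashing the user ID rather than dropping it keeps per-user metrics such as cost per user working without storing the raw identifier.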
Dashboards, alerting, and on-call
Traces explain a single run; dashboards and alerts watch the fleet. The goal is to catch regressions before users feel them — and jump straight to the trace that caused one.
1 · Dashboards
Aggregate the signals over time and slice them by model, tool, agent version, and user segment. Track p50/p95 latency, cost per run, tool error rates, average steps, and success rate side by side.
2 · Alerts
Set thresholds on what hurts: a tool error-rate spike, a cost-per-run jump after a deploy, p95 latency breaching SLA, or success rate dropping. Alert on rates and trends, not single noisy runs.
3 · Drill down
Every alert and chart links back to example traces. On-call clicks from 'cost is up 40%' straight to the runs driving it, then to the span — the bloated prompt or looping tool — responsible.
4 · Close the loop
Save the offending trace as a regression test, ship a fix (prompt, tool, or guardrail), and watch the dashboard confirm the metric recover. The fix is verified by the same system that caught it.
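The dashboard and alert steps above rest on two small computations: a tail percentile and a rate over a window. A minimal sketch follows, assuming nearest-rank percentiles and a fixed window of recent runs. The function names, the `min_runs` default, and the sample data are illustrative.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def error_rate_alert(window: list[bool], threshold: float, min_runs: int = 50) -> bool:
    """Fire only when the error rate over a whole window crosses the threshold.

    window holds one boolean per run (True means a tool error occurred).
    Requiring min_runs keeps a single noisy run from paging anyone.
    """
    if len(window) < min_runs:
        return False
    return sum(window) / len(window) > threshold


latencies_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

recent_runs = [True] * 12 + [False] * 88   # 12 errors in the last 100 runs
should_page = error_rate_alert(recent_runs, threshold=0.10)
```

In practice the alert would also carry the IDs of a few failing runs from the window, so the drill-down step starts from real traces.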
| Concern | Traditional monitoring | Agent observability |
|---|---|---|
| Primary question | Is the service up? | What did the agent do, and was it right? |
| Unit of analysis | Request / endpoint | Trace of nested spans |
| Failure looks like | Error code, exception | Fluent but wrong answer (HTTP 200) |
| Key signals | CPU, memory, status codes | Tokens, cost, steps, tool errors, quality |
| Debugging method | Read the stack trace | Replay the run, inspect each span |
| Reproducible? | Yes | Only by replaying a recorded trace |
| Feeds evaluation | No | Yes |
How observability feeds evaluation and debugging
Observability isn't an end in itself. Its real value is the flywheel it powers: every captured run makes the next version of your agent measurably better.
Think of observability as the data layer and evaluation as the judgment layer on top of it. Traces are the raw material; scores, labels, and tests are what you build from them. Because every quality score attaches to a span, a low faithfulness score doesn't just say "this answer was wrong"; it points at the exact retrieval or synthesis step that caused it.
The loop is concrete. You spot a bad run in a dashboard, open its trace, and see the failing span. You save that trace to your eval set, change a prompt or swap a tool, and replay the saved traces to confirm the fix without a regression elsewhere. Over time your eval set becomes a library of real production failures — far more honest than synthetic test cases.
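Attaching scores to spans makes the "see the failing span" step mechanical. The sketch below assumes each span carries an optional quality score between 0 and 1; the `ScoredSpan` type, the 0.5 threshold, and the example scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredSpan:
    name: str
    kind: str
    score: Optional[float] = None  # quality score in [0, 1]; None if not evaluated


def weakest_span(spans: list[ScoredSpan], threshold: float = 0.5) -> Optional[ScoredSpan]:
    """Return the lowest-scoring span below the threshold, or None if all pass."""
    failing = [s for s in spans if s.score is not None and s.score < threshold]
    return min(failing, key=lambda s: s.score, default=None)


trace = [
    ScoredSpan("llm.plan", "llm", score=0.9),
    ScoredSpan("retrieve.top_k", "retrieval", score=0.3),  # weakly relevant context
    ScoredSpan("get_order", "tool"),                       # not evaluated
    ScoredSpan("llm.synthesize", "llm", score=0.6),
]

culprit = weakest_span(trace)
```

Here the low retrieval score, not the synthesis step, is what gets flagged, which is the distinction the paragraph above is drawing.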
This is the same hygiene that makes any agent dependable, and it pairs with disciplined deployment: ship behind a version flag, watch the new version's traces against the old, and roll back the moment a signal regresses. Observability is what makes that comparison possible at all.
- Trace everything in production — You can't debug or evaluate a run you didn't capture — instrument first.
- Store inputs and outputs, not just timings — Content is what makes a trace replayable and scorable.
- Turn real failures into tests — Promote bad traces to your eval set so they can't silently return.
- Attach quality scores to spans — Let a low score point straight at the step that caused it.
- Compare versions on the same traces — Replay a fixed set across model/prompt changes to catch regressions.
- Alert on trends, drill to traces — Move from an aggregate signal to the offending span in seconds.
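Comparing versions on the same traces, as the checklist suggests, can be as small as a set difference. In this sketch the eval cases, the substring check, and both toy agents are invented for illustration; a real eval set would be built from saved production traces and would usually use a richer scorer.

```python
from typing import Callable

Agent = Callable[[str], str]

# Each case stands in for a saved trace: its input and a check on the answer.
CASES = [
    {"id": "trace-001", "input": "2+2", "expect": "4"},
    {"id": "trace-002", "input": "capital of France", "expect": "Paris"},
]


def failures(agent: Agent, cases: list[dict]) -> set[str]:
    """IDs of the cases whose answer does not contain the expected text."""
    return {c["id"] for c in cases if c["expect"] not in agent(c["input"])}


def regressions(baseline: Agent, candidate: Agent, cases: list[dict]) -> set[str]:
    """Cases the baseline passed but the candidate fails."""
    return failures(candidate, cases) - failures(baseline, cases)


def agent_v1(text: str) -> str:
    return {"2+2": "The answer is 4.", "capital of France": "Paris."}.get(text, "")


def agent_v2(text: str) -> str:
    return {"2+2": "The answer is 4.", "capital of France": "Lyon."}.get(text, "")


regressed = regressions(agent_v1, agent_v2, CASES)
```

Reporting case IDs instead of a single pass rate keeps the link back to the original trace, so a regression can be opened and inspected directly.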
Trace per run
the unit you replay
Spans per agent turn
LLM + tool + retrieval
Latency to watch
tails, not averages
Runs instrumented
the observability goal
Agent observability, answered
AI agent observability is the practice of instrumenting an agent so you can see exactly what it did on any run: every LLM call, every tool invocation, the prompts and responses, the tokens spent, the latency, and the decisions that strung them together. Where traditional monitoring tells you a service is up or down, observability lets you ask open-ended questions of a system whose behavior you couldn't fully predict in advance. Because an agent's logic emerges from model outputs rather than fixed code paths, the only way to understand a failure is to reconstruct the run step by step from recorded traces.