How to Reduce AI Agent Hallucinations in Production

You can't delete hallucinations, but you can engineer them down to a manageable, measurable residue. Here's how — ranked by impact, written for the people who get paged when an agent confidently does the wrong thing.

  • 11 min read
  • Engineering
  • Updated 2026

Every team that ships an agent eventually meets the same demon: the model says something false with total conviction, and the system acts on it. The fix is not a magic prompt. It's a stack of defenses.

A hallucination is any output a model presents as fact that isn't supported by its inputs or by reality — see the hallucination glossary entry for the precise definition. In a plain chatbot that's an embarrassing sentence. In an agent — a model that calls tools, reads results, and takes real actions — it's a wrong database write, an email sent to the wrong person, or a cheerful "All set!" reported over a task that quietly failed.

This post is deliberately honest: you cannot eliminate hallucinations, because the same mechanism that lets a model generalize is the mechanism that lets it confabulate. What you can do is push the rate down hard, contain the damage, and measure the residue so it never silently grows. We'll rank the techniques by real-world impact — grounding, schema-validated tool calls, citations, self-critique, structured outputs, confidence gating, and evaluation — and be clear about what each one does and doesn't buy you.

The root cause

Why agents still hallucinate — and where it shifts

A model is a fluent guesser. It optimizes for plausible text, not for truth, and it has no built-in sense of 'I actually don't know this.' Agents inherit that flaw and add new places for it to hide.

At its core a language model predicts the next token. Fluency and fabrication come from the exact same machinery — there is no separate "honesty circuit" you can flip on. When the model is uncertain, it doesn't go quiet; it produces the most statistically plausible continuation, which is often a confident, well-formed lie. That's the uncomfortable foundation everything else is built on.

For chatbots, the hallucination lives in the prose you can read. Agents move the problem somewhere more dangerous: into the tool calls and the agent's reading of their results. Because the model writes function arguments and interprets outputs, it can fabricate at every junction of the function-calling loop — and those junctions aren't shown to the user.

The single most insidious agentic failure is the false success: a tool returns an error, a partial result, or nothing, and the agent narrates a tidy "Done — I've updated the record" anyway. The user trusts it. Nobody finds out until the data is wrong downstream.

Fabricated tool arguments

The model invents an ID, a date format, or a parameter the API never defined — a plausible-looking value that fails or, worse, hits the wrong record.

Phantom tools & endpoints

It calls a function that doesn't exist or misnames one, confident the capability is there when it isn't.

Misread tool results

A valid result is returned, but the agent extracts the wrong field or inverts the meaning, then reasons on the misreading.

False 'success'

The action failed or partially completed, yet the agent reports a clean success. The most expensive hallucination because it hides itself.

The mental model

Defense in depth, not a silver bullet

No single layer is reliable on its own. Stack them, and each one catches a different slice of error the others miss — the same logic that keeps planes in the air.

Layers, listed from the bottom of the stack to the top:

  • Grounding: RAG retrieval · Authoritative context · Answer-from-source only
  • Validated actions: Schema-checked args · Verify tool success · Reject phantom calls
  • Attribution: Inline citations · Source-check claims · Flag the uncited
  • Self-critique: Reflection pass · Verifier model · Re-try on doubt
  • Confidence gate: Thresholds · Abstain or escalate · Human-in-the-loop
  • Evaluation: Faithfulness scoring · Regression suite · Catch drift

A layered mitigation stack. Each layer narrows the space in which a hallucination can survive; errors that slip past one layer can still be caught by a later one. Lower layers shape the input; upper layers verify the output.

Read the stack from the bottom up. Grounding and validated actions shape what goes into the model and constrain what comes out of its tools — they prevent the most hallucinations. The middle layers — attribution and self-critique — inspect the answer for unsupported claims. The top layers — confidence gating and evaluation — decide what's safe to ship and prove the whole thing is still working. The rest of this post walks them in impact order.

Technique 1 · Highest impact (knowledge)

Grounding with retrieval (RAG)

If the model must answer from a passage you supply — and is told to abstain when the passage is silent — fabricated facts collapse. This is the biggest single lever for knowledge questions.

A model left to its own parametric memory is recalling a blurry, frozen average of public text. Ask it about your refund policy and it guesses. Retrieval-augmented generation fixes this by fetching the relevant passages from your own data at query time and putting them in the prompt, so the answer is conditioned on real evidence rather than memory. Our deep dive on RAG for agents covers the full pipeline.

The grounding instruction is what does the work: answer only from the context, and if it isn't there, say you don't know. That single rule converts the model from a know-it-all into a careful reader. It won't help if retrieval returns the wrong passage — garbage in, grounded garbage out — which is why retrieval quality and grounding are evaluated together.

Grounding is necessary but not sufficient. It addresses the knowledge axis of hallucination; it does nothing for fabricated tool arguments or false success. That's the next layer's job.

ground.py (python)

```python
# retrieve(), llm(), and escalate_or_clarify() are your application's own helpers.

# The grounding contract.
SYSTEM = (
    "Answer ONLY from <context>.\n"
    "If the answer is not in <context>,\n"
    "reply exactly: 'I don't know.'\n"  # abstention is allowed
    "Never use outside knowledge."
)


def answer_grounded(question: str) -> str:
    ctx = retrieve(question, k=4)  # RAG: fetch real passages
    prompt = f"<context>\n{ctx}\n</context>\n\nQ: {question}"
    answer = llm(system=SYSTEM, user=prompt)

    if answer.strip() == "I don't know.":  # no source, so no guess
        escalate_or_clarify(question)
    return answer
```

The grounding contract: pass authoritative context, forbid outside knowledge, and make abstention an explicit, allowed answer.

Grounding is only as good as retrieval

The generator can only ground in what retrieval hands it. Invest in chunking, hybrid search, and re-ranking before you tune the prompt — and measure whether the right passage actually made the top-k. A perfect grounding instruction over a missed passage still produces a confident wrong answer.
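
One way to check whether the right passage actually made the top-k is a hit rate over a small labeled set. The sketch below is illustrative rather than a prescribed metric: `hit_rate_at_k`, the passage ids, and the labels are all hypothetical, and it assumes you can record which passage ids a human judged correct for each query.

```python
from typing import Sequence


def hit_rate_at_k(
    retrieved: Sequence[Sequence[str]],
    relevant: Sequence[set],
    k: int,
) -> float:
    """Fraction of queries whose top-k retrieved ids include a relevant id."""
    if len(retrieved) != len(relevant):
        raise ValueError("retrieved and relevant need one entry per query")
    if not retrieved:
        return 0.0
    hits = sum(
        1 for ids, gold in zip(retrieved, relevant)
        if gold & set(ids[:k])  # any overlap within the first k results
    )
    return hits / len(retrieved)


# Hypothetical labeled set: three queries, ranked retrieval output per query,
# and the passage id a reviewer marked as the correct source.
retrieved_ids = [
    ["p7", "p2", "p9", "p4"],
    ["p1", "p3", "p8", "p5"],
    ["p6", "p2", "p0", "p1"],
]
gold_ids = [{"p2"}, {"p5"}, {"p4"}]

print(hit_rate_at_k(retrieved_ids, gold_ids, k=4))
print(hit_rate_at_k(retrieved_ids, gold_ids, k=2))
```

On this toy data two of three queries have their gold passage in the top 4, and only one does in the top 2. The third query is the case the paragraph above warns about: no grounding instruction can rescue it, because the right passage never reached the prompt.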

Technique 2 · Highest impact (actions)

Schema-validated tool calls & verified success

For agents, the biggest reliability win isn't about prose at all — it's refusing to execute a malformed call and never trusting a tool result you didn't actually confirm.

Don't trust, verify

Validate the call, then confirm the outcome

A model writing function arguments is a model that can hallucinate function arguments. The fix is to treat every tool call as untrusted input: validate it against a strict JSON Schema before it runs. A missing field, a wrong type, or an invented parameter is rejected and fed back to the model as an error to repair — never executed blindly.

Then close the loop on the result. Don't let the agent declare victory off its own narration; check the tool's actual return value or status. If the API returned an error or an empty result, the agent must see that and react, not paper over it with a confident summary. Because the success report is then derived from the tool's status rather than from the model's narration, this discipline directly targets false-success hallucinations.

  • Reject calls that don't match the tool's schema.
  • Feed validation errors back so the model self-corrects.
  • Verify the return value before reporting success.
  • Constrain enums and IDs to known, allowed values.

validate.ts (typescript)

```typescript
// Runs inside an async agent-step function; `model`, `schema`, `tools`,
// and `repair` are supplied by your application.
const call = model.toolCall();  // model-proposed args
const ok = schema.safeParse(call.args);
if (!ok.success) {  // guard 1: shape
  return repair(model, ok.error);
}

const res = await tools[call.name](ok.data);
if (res.status !== "ok") {  // guard 2: real success
  return model.observe(res.error);
}
return model.observe(res.data);  // guard 3: act on truth
```

Validate against the tool schema, run, then confirm the outcome — three guards on one call.

Schema validation also pairs with the broader question of agent security: a call you wouldn't let a hallucinating model make is often the same call you wouldn't let a compromised one make. Whitelisting tools, constraining arguments, and confirming outcomes are reliability and safety controls at the same time.
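
The same guards can be sketched in Python without any validation library, with a check for phantom tools added in front. Everything here is an assumption for illustration: the `update_ticket` tool, the argument spec format, and the `{"status": ...}` return convention are hypothetical stand-ins for your own tool registry and a real JSON Schema validator.

```python
from typing import Any


def update_ticket(ticket_id: str, status: str) -> dict:
    # Stand-in for a real API call; it reports its outcome explicitly.
    if ticket_id != "T-100":
        return {"status": "error", "error": "ticket not found"}
    return {"status": "ok", "data": {"ticket_id": ticket_id, "status": status}}


# Hypothetical registry: tool name -> (argument spec, implementation).
TOOLS = {
    "update_ticket": (
        {
            "ticket_id": {"type": str},
            "status": {"type": str, "allowed": {"open", "closed"}},
        },
        update_ticket,
    ),
}


def validate_args(spec: dict, args: dict) -> list:
    """Return a list of problems; an empty list means the call is well-formed."""
    problems = [f"unknown parameter: {name}" for name in args if name not in spec]
    for name, rule in spec.items():
        if name not in args:
            problems.append(f"missing parameter: {name}")
        elif not isinstance(args[name], rule["type"]):
            problems.append(f"wrong type for {name}")
        elif "allowed" in rule and args[name] not in rule["allowed"]:
            problems.append(f"value not allowed for {name}: {args[name]!r}")
    return problems


def run_tool_call(name: str, args: dict) -> dict:
    """Validate, execute, then confirm the outcome before reporting it."""
    if name not in TOOLS:  # phantom tool: never dispatched
        return {"ok": False, "feedback": f"no such tool: {name}"}
    spec, fn = TOOLS[name]
    problems = validate_args(spec, args)
    if problems:  # malformed call: fed back for repair, not executed
        return {"ok": False, "feedback": "; ".join(problems)}
    result: Any = fn(**args)
    if result.get("status") != "ok":  # the tool ran but did not succeed
        return {"ok": False, "feedback": result.get("error", "unknown tool error")}
    return {"ok": True, "data": result["data"]}


print(run_tool_call("update_ticket", {"ticket_id": "T-100", "status": "done"}))
print(run_tool_call("update_ticket", {"ticket_id": "T-999", "status": "closed"}))
```

The first call is rejected before it runs because `"done"` is not an allowed status. The second passes validation but the tool reports an error, so the agent receives the failure as feedback instead of a result it could narrate as a success.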

Technique 3 · Make claims checkable

Citations and source-checking

Force the model to attribute every claim to a retrieved passage, then verify the citation actually supports it. Unsupported sentences become visible — to your code and to your users.

Citations turn grounding from a hope into a contract. When the model must tag each statement with the source it came from, two good things happen: users can verify the answer themselves, and your system gets a machine-checkable signal. Any sentence that carries no citation, or cites a passage that doesn't actually contain the claim, is a candidate hallucination you can flag, suppress, or route to review.

The subtle failure here is the plausible citation — the model attaches a real-looking source that doesn't support the claim. So citation alone isn't enough; you need a source-check step that confirms the cited passage truly entails the statement. That check can be a cheap entailment model or an LLM-as-judge comparing claim to source.

  • Require a citation per claim: Every factual sentence must point to a retrieved passage by id.
  • Flag uncited sentences: Treat any unattributed claim as suspect and downweight or hide it.
  • Verify entailment: Confirm the cited passage actually supports the claim, not just shares keywords.
  • Surface sources to users: Linked citations let people audit the answer and build warranted trust.
  • Reject invented references: Validate that every cited id exists in the retrieved set.

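
The cheap, mechanical part of these checks can be sketched as follows. This is a minimal illustration under stated assumptions: citations are written as bracketed passage ids such as `[p1]` before the sentence's closing punctuation, sentences are split naively on end punctuation, and `audit_citations` is a hypothetical helper. The entailment check itself is deliberately left out; sentences that pass here still need it.

```python
import re

CITATION = re.compile(r"\[(\w+)\]")


def audit_citations(answer: str, retrieved: dict) -> list:
    """Classify each sentence as uncited, citing an unknown id, or ready
    for the (separate) entailment check."""
    report = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        ids = CITATION.findall(sentence)
        if not ids:
            verdict = "uncited"
        elif any(cid not in retrieved for cid in ids):
            verdict = "invented_reference"
        else:
            verdict = "needs_entailment_check"
        report.append({"sentence": sentence, "ids": ids, "verdict": verdict})
    return report


# Hypothetical retrieved set and model answer.
passages = {"p1": "Refunds are issued within 14 days of the return."}
draft = (
    "Refunds are issued within 14 days [p1]. "
    "Shipping is always free. "
    "Gift cards never expire [p9]."
)

for row in audit_citations(draft, passages):
    print(row["verdict"], "|", row["sentence"])
```

Only the first sentence survives to the entailment step. The second carries no citation and the third cites an id that was never retrieved, so both can be flagged without calling a model at all.
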
Technique 4 · Catch your own errors

Self-critique and reflection passes

A second look — by the model on its own output, or by a separate verifier — catches a meaningful share of errors the first pass made. It costs latency and tokens, so spend it where stakes are high.

  1. Draft

    The agent produces a first answer or plan from the grounded context, exactly as it would without any critique step.

  2. Critique

    A reflection pass (same model or a dedicated verifier) checks each claim against the sources, looks for unsupported statements, and asks 'what would make this wrong?'

  3. Revise or retry

    Flagged claims are removed, corrected, or sent back for re-retrieval. If the verifier can't confirm the answer, the agent abstains instead of shipping a guess.

Reflection works because spotting an error is easier than avoiding it in one shot — a verifier with a narrow job ("is this claim supported by the source, yes or no?") is more reliable than the generator juggling everything at once. Using a separate model or a fresh context for the critique helps, since the original draft's reasoning won't bias the check.

Be honest about the limits: self-critique can rubber-stamp its own mistakes, especially when the same model reviews itself with the same blind spots. It's a real reduction, not a guarantee — best reserved for high-stakes steps where the extra latency and cost are justified.
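
The draft, critique, and revise steps can be wired together as a small loop. This is a sketch, not a reference implementation: `answer_with_reflection` and the three callables are hypothetical, and in a real system `draft`, `critique`, and `revise` would each wrap a model call (ideally with the critique in a fresh context). The stubs below only make the control flow runnable.

```python
from typing import Callable, List, Optional


def answer_with_reflection(
    question: str,
    draft: Callable[[str], str],
    critique: Callable[[str, str], List[str]],
    revise: Callable[[str, str, List[str]], str],
    max_rounds: int = 2,
) -> Optional[str]:
    """Return an answer the critic accepts, or None to signal abstention."""
    answer = draft(question)
    for _ in range(max_rounds):
        issues = critique(question, answer)
        if not issues:
            return answer
        answer = revise(question, answer, issues)
    # One last check on the final revision; abstain if doubts remain.
    return answer if not critique(question, answer) else None


# Stub components standing in for model calls (assumptions for the demo).
def stub_draft(question: str) -> str:
    return "Refunds take 14 days. Shipping is free."


def stub_critique(question: str, answer: str) -> List[str]:
    unsupported = "Shipping is free."
    return [f"unsupported: {unsupported}"] if unsupported in answer else []


def stub_revise(question: str, answer: str, issues: List[str]) -> str:
    return answer.replace(" Shipping is free.", "")


def stubborn_revise(question: str, answer: str, issues: List[str]) -> str:
    return answer  # a reviser that fails to fix anything


print(answer_with_reflection("refund policy?", stub_draft, stub_critique, stub_revise))
print(answer_with_reflection("refund policy?", stub_draft, stub_critique, stubborn_revise))
```

The first run drops the unsupported sentence and returns the rest. The second never satisfies the critic, so the function returns `None`, which the caller should treat as "abstain or escalate" rather than ship the draft.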

Where reflection pays off

  • High-stakes answers where a wrong fact is expensive.
  • Verifying tool plans before any side effect runs.
  • Catching unsupported claims a citation check can score.
  • Letting the agent abstain when the verifier is unsure.

Where it disappoints

  • Adds latency and token cost on every guarded step.
  • Same-model critique can share the original blind spots.
  • Can over-correct and strip out correct, well-grounded claims.
  • No help if the underlying retrieval was wrong to begin with.
Techniques 5 & 6 · Constrain and gate

Structured outputs and confidence thresholds

Two complementary controls: shrink the space the model can hallucinate in, and refuse to act when its confidence is too low.

Constrained decoding and structured outputs remove whole categories of error by construction. If the model must emit JSON matching a schema, it can't return prose where you expected a field. If a value must be one of a fixed enum, it can't invent a category. Constraining the form of the output shrinks the surface where a fabrication can live — you stop spending validation effort on malformed shapes and focus it on the content.

Confidence thresholds with human review accept that some answers shouldn't ship automatically. Estimate confidence — from model signals, retrieval scores, verifier agreement, or self-reported uncertainty — and gate on it. Above the bar, act. Below it, abstain, ask a clarifying question, or escalate to a person. The art is calibration: too strict and the agent is useless, too loose and the gate does nothing.
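
A gate of this kind can be as simple as a function from confidence signals to a decision. The sketch below is illustrative only: the `Signals` fields, the three outcomes, and every threshold value are assumptions that would need calibrating on your own evaluation data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    retrieval_score: float  # top passage relevance, scaled to 0..1
    verifier_agrees: bool   # did a critique pass confirm the answer?
    side_effecting: bool    # would acting change external state?


def gate(
    s: Signals,
    act_at: float = 0.8,
    act_at_side_effect: float = 0.9,  # stricter bar for real-world actions
    clarify_at: float = 0.5,
) -> str:
    """Return 'act', 'clarify', or 'escalate'."""
    if not s.verifier_agrees:
        return "escalate"  # disagreement goes to a person
    bar = act_at_side_effect if s.side_effecting else act_at
    if s.retrieval_score >= bar:
        return "act"
    if s.retrieval_score >= clarify_at:
        return "clarify"  # plausible but not sure: ask the user
    return "escalate"


print(gate(Signals(0.85, True, side_effecting=False)))
print(gate(Signals(0.85, True, side_effecting=True)))
print(gate(Signals(0.30, True, side_effecting=False)))
```

The same score of 0.85 is enough to answer a question but not enough to perform a side-effecting action, which is one way to express "too strict and the agent is useless, too loose and the gate does nothing" as two separate dials.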

| Control | Stops | Cost |
| --- | --- | --- |
| Schema-constrained output | Malformed / off-type results | Low |
| Enum / allowed-value limits | Invented categories & ids | Low |
| Confidence gate | Low-certainty auto-actions | Medium |
| Human-in-the-loop review | High-stakes mistakes | High |

Structure constrains form, not truth

A perfectly schema-valid object can still contain a fabricated value. Structured outputs stop malformed hallucinations, not wrong ones — pair them with grounding and validation so the fields are both well-formed and true.
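
The difference between form and truth is easy to see in code. In this sketch the field names, the allowed actions, and the set of known order ids are all hypothetical; in practice the existence check would be a database or API lookup rather than a hard-coded set.

```python
import json

ALLOWED_ACTIONS = {"refund", "replace", "escalate"}
KNOWN_ORDER_IDS = {"A-1001", "A-1002"}  # stand-in for a real lookup


def parse_decision(raw: str) -> dict:
    """Form check: valid JSON, exactly the expected fields, enum respected."""
    obj = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(obj, dict) or set(obj) != {"action", "order_id"}:
        raise ValueError("expected exactly the fields: action, order_id")
    if not isinstance(obj["action"], str) or obj["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {obj['action']!r}")
    if not isinstance(obj["order_id"], str):
        raise ValueError("order_id must be a string")
    return obj


def is_grounded(decision: dict) -> bool:
    """Truth check: the id must refer to a record that actually exists."""
    return decision["order_id"] in KNOWN_ORDER_IDS


decision = parse_decision('{"action": "refund", "order_id": "A-9999"}')
print(decision, is_grounded(decision))
```

The object parses cleanly and uses an allowed action, yet `A-9999` does not exist. The form check cannot see that; only the lookup against real data can.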

How the levers compare

Representative impact by technique

A rough ordering of how much each layer tends to move the needle. Treat these as illustrative — your numbers depend entirely on your task, data, and baseline.

Illustrative reduction in hallucination rate by technique

  • Grounding with RAG (knowledge questions): 55%
  • Schema-validated tool calls + success checks: 48%
  • Citations + source verification: 35%
  • Self-critique / reflection pass: 28%
  • Structured outputs / constrained decoding: 22%
  • Confidence gate + human review (residual): 18%

Illustrative only — not measured research. Bars show the rough relative reduction each layer tends to contribute on a typical agent task; effects overlap and are not additive. Always measure on your own evaluation set.

The shape matters more than the exact heights. Grounding and validated tools sit at the top of the chart because they prevent hallucinations at the source; the verification layers further down catch a smaller, harder residue. And notice that no combination of bars gets you to zero: confidence gating and human review exist precisely because the layers above them leak. There is always a remainder. Plan for it.

Technique 7 · The one that keeps the rest honest

Evaluation: catch regressions before users do

Every mitigation above is a hypothesis until you measure it. Evaluation is what turns 'we added a critique step' into 'we cut unfaithful answers from 9% to 3% and held it there.'

The trap is shipping a clever prompt or a new model and assuming it helped. Models drift, prompts interact, and a fix for one failure mode can quietly worsen another. The only defense is a standing evaluation harness — a labeled set of real queries with known answers and known traps — scored automatically on every change.

For hallucination specifically, measure faithfulness (is every claim supported by a source?), tool-call validity, and false-success rate. An LLM-as-judge can score grounding at scale; a small set of hand-checked cases keeps the judge honest. Track the numbers over time so a regression shows up in CI, not in a customer complaint.

  • Faithfulness score: Fraction of claims supported by retrieved sources. This is your core hallucination metric.
  • Tool-call validity: Share of tool calls that pass schema validation and target real endpoints.
  • False-success rate: How often the agent reports success on a tool call that actually failed.
  • Abstention calibration: Does the agent say 'I don't know' exactly when it should, neither too often nor too rarely?
  • Regression gate in CI: Block any change that pushes a key metric below its threshold.
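
Three of these metrics and the CI gate can be sketched in a few lines. The record fields, metric names, and threshold values below are assumptions for illustration; the per-claim "supported" labels would come from your judge or hand-checked cases, which this sketch does not implement.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class EvalRecord:
    claims_total: int
    claims_supported: int      # claims a judge found backed by a source
    tool_calls_total: int
    tool_calls_valid: int      # passed schema validation, real endpoint
    tool_failures: int         # tool calls that actually failed
    failures_reported_ok: int  # of those, reported to the user as success


def summarize(records: Iterable[EvalRecord]) -> dict:
    rows = list(records)

    def ratio(num: int, den: int) -> float:
        return num / den if den else 0.0  # empty denominators score 0.0

    return {
        "faithfulness": ratio(
            sum(r.claims_supported for r in rows),
            sum(r.claims_total for r in rows),
        ),
        "tool_call_validity": ratio(
            sum(r.tool_calls_valid for r in rows),
            sum(r.tool_calls_total for r in rows),
        ),
        "false_success_rate": ratio(
            sum(r.failures_reported_ok for r in rows),
            sum(r.tool_failures for r in rows),
        ),
    }


# Hypothetical thresholds: ("min", x) must stay at or above x,
# ("max", x) must stay at or below x.
THRESHOLDS = {
    "faithfulness": ("min", 0.90),
    "tool_call_validity": ("min", 0.98),
    "false_success_rate": ("max", 0.01),
}


def regressions(metrics: dict, thresholds: dict = THRESHOLDS) -> list:
    """Names of metrics that violate their threshold."""
    failed = []
    for name, (kind, limit) in thresholds.items():
        value = metrics[name]
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            failed.append(name)
    return failed


records = [
    EvalRecord(10, 10, 4, 4, 1, 0),
    EvalRecord(10, 9, 6, 5, 1, 1),
]
metrics = summarize(records)
print(metrics)
print(regressions(metrics))
```

In CI, a non-empty list from `regressions` would fail the build (for example by exiting with a non-zero status), so a change that worsens tool-call validity or false-success rate is blocked even when faithfulness looks fine.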

The honest finish line

You will never see a zero. A mature agent has a known, small, monitored hallucination rate — like a known error budget — with each layer carrying its share and evaluation proving the budget holds. That's not failure. That's engineering reliability into a probabilistic system, which is the only kind of reliability available here.

FAQ

Hallucination questions, answered

Can AI agent hallucinations be eliminated entirely?

No, and any vendor who promises zero hallucinations is selling something. A language model is a probabilistic next-token predictor; under the right (or wrong) conditions it will always be able to produce a fluent, confident, false statement. The honest goal is not elimination but suppression and containment: drive the hallucination rate down with grounding and validation, then catch the residue with evaluation and human review before it reaches a user or a side-effecting tool. Engineering reliability here looks like aviation safety: layered defenses, not a single fix.
