Does using a cheaper model always hurt quality?

No — that assumption is what keeps bills high. Most agent steps are easy: classifying intent, extracting a field, deciding which tool to call, formatting a result. A small or distilled model handles those at a fraction of the price with no perceptible quality loss. The trick is model routing: send the easy majority of steps to a cheap model and reserve a frontier model for the genuinely hard reasoning. Quality only suffers when you route a hard step to a model that can't do it, which is why routing should be measured against an evaluation set rather than guessed. Done well, routing cuts spend with quality held flat or even improved, because the strong model is no longer drowning in trivial calls.

Is prompt caching worth setting up?

For agents, almost always. Agents re-send a large, stable prefix — the system prompt, tool definitions, few-shot examples, and often a fixed knowledge block — on every single turn. Prompt caching lets the provider charge a steep discount for that repeated prefix instead of full price each time, and the savings scale with how long and how reused your prefix is. The one rule is ordering: put the stable content first and the volatile content (the latest user turn, fresh retrieval) last, so the cached prefix stays byte-identical across calls. For a multi-step agent that reuses the same instructions dozens of times per task, caching is one of the highest-return changes you can make for a few hours of work.

How do I stop an agent from looping forever and burning money?

Treat the loop like any other unbounded process: give it a budget and a brake. Set a hard maximum on iterations, define explicit stop conditions (a confidence threshold, a 'task complete' signal, repeated identical tool calls), and detect cycles where the agent keeps trying the same failing action. Add a per-task token or dollar ceiling that aborts gracefully and hands off to a human or a fallback. Without these guards a single malformed task or a tool that keeps erroring can spin through hundreds of paid steps before anyone notices — and those runaway tasks, not the median request, are what blow up a monthly bill.

Where should I start if I want quick savings?

Measure first, then pick the cheap wins. Add observability so you can see cost per task and find where the tokens actually go — usually a handful of tasks and steps dominate. Then ship the changes with the best effort-to-savings ratio: turn on prompt caching, trim the context you re-send each turn, cache deterministic tool results, and cap loop iterations. Those four require no model changes and rarely touch quality. Save the deeper work — model routing and distilled models — for after you've proven the easy wins on your own numbers, because routing needs an evaluation set to do safely.

Blog · Engineering

9 Ways to Cut AI Agent Costs Without Losing Quality

Agent bills don't grow because the model is expensive — they grow because a loop re-reads its own history, retrieves too much, and never knows when to stop. Here are nine tactics that shrink the spend while holding output quality flat, each with the trade-off spelled out.

12 min read
Engineering
Updated 2026

Start building free Set up cost observability

An AI agent that works is easy to fall in love with and easy to overpay for. The fix is rarely a cheaper provider — it's understanding where the tokens actually go and refusing to pay for the ones that buy you nothing.

Here is the uncomfortable shape of an agent bill: the model price per token is fixed and small, but an agent multiplies that price by the conversation length and again by the number of loop iterations. Every turn re-sends the system prompt, the tool schemas, the accumulated history, and a pile of retrieved context — then pays again for what it generates. A task the user experiences as one question can quietly cost ten model calls, each fatter than the last.

The good news is that almost none of that growth is buying quality. You can route easy steps to cheap models, stop re-sending stale context, cache the parts that never change, and put a ceiling on the loop — and the user won't notice anything except a faster, cheaper agent. Cost and quality are not the same dial. This guide walks nine tactics, ordered roughly by effort-to-savings ratio, and is honest about the trade-off each one carries. Pair it with agent observability so you can prove the savings on your own numbers rather than ours.

Read the bill first

Where the money actually goes

Before optimizing anything, get a mental model of the cost. In an agent loop, input tokens compound far faster than output tokens — and a few runaway tasks dominate the total.

Picture a five-step agent. On step one it sends a 2,000-token instruction block plus the question. By step five it's re-sending that same block plus four rounds of tool calls, tool results, and reasoning — the prompt has tripled, and you've paid the full input price for it on every step. The model's final sentence is cheap; the context you dragged along to produce it is not.

Two patterns follow from this. First, input tokens are the enemy, and they're the ones you control most directly — by caching, trimming, and retrieving less. Second, cost is wildly skewed: the median task is cheap, but a long tail of tasks that loop too many times or pull in too much context can account for most of the spend. You optimize the bill by attacking the tail, not the median.

That's also why measurement comes first. You cannot cut what you cannot see, and intuition about token cost is almost always wrong. Tactic nine — observability and budgets — is last in this list only because it frames the other eight; in practice it's where you should start.

Step 1Prompt + tools + question

Step 2–4Re-send history + tool results

Growing prefixInput tokens compound each turn

Final answerCheap output, expensive context

Why agent cost compounds: each loop re-sends a growing prompt, so input tokens dominate long before the final answer is generated.

The playbook

Nine ways to cut agent costs

Each card states the move and the why. The trade-offs come right after — none of these are free, but most are close.

1 · Model routing

Send easy steps — classification, extraction, formatting, tool selection — to a cheap model, and reserve a frontier model for hard reasoning. Most agent calls are easy, so most of your spend is overpaying for capability you don't need.

2 · Trim the context

Don't re-send the whole transcript every turn. Summarize older turns, drop tool outputs you've already used, and keep only what the next decision needs. The prompt stops growing linearly with the conversation.

3 · Prompt + response caching

Cache the stable prompt prefix (system, tools, examples) so the provider charges a fraction for it, and cache full responses for repeated identical inputs. Agents reuse the same prefix dozens of times per task.

4 · Cap loops + stop conditions

Set a hard iteration limit, define when the task is 'done', and detect cycles where the agent repeats a failing action. One runaway task can cost more than a thousand normal ones.

5 · Smaller / distilled models

Where a small or distilled model meets your quality bar, use it everywhere — not just on routed easy steps. Distillation bakes a big model's behavior on your task into a cheaper one you can run far more cheaply.

6 · Batch + parallelize tools

When an agent needs several independent tool calls, fire them together instead of one slow serial round-trip each. Fewer model turns to orchestrate the work means fewer paid passes through the loop.

7 · Retrieve less, rank better

Stuffing twenty chunks into the prompt costs tokens and dilutes attention. Retrieve a broad candidate pool cheaply, re-rank, and pass only the best three or four. Better precision, smaller prompt.

8 · Cache tool results

Deterministic or slow-changing tool calls — a product lookup, a pricing table, an embedding — shouldn't be re-run every time. Cache by input so the agent reads the answer instead of paying to recompute it.

9 · Measure + budget

Track cost per task with observability, attribute it to steps and tools, and set per-task and per-tenant budgets that alert or abort. You can't manage — or trust — a number you don't record.

Stack them, don't pick one

These tactics compound. Caching shrinks the prefix; trimming shrinks the variable part; routing makes every remaining token cheaper; capping loops kills the tail. Applied together on a typical agent they routinely take a bill down by more than half — but the order matters. Turn on measurement, then the no-quality-risk wins (caching, trimming, tool-result caching, loop caps), then the changes that need an evaluation set (routing, distillation).

The two biggest levers

Routing and context trimming, up close

Most agents leave the largest savings on these two tables. Both are about paying for capability and context only when you actually use them.

Tactic 1 · Model routing

A cheap model for the easy 80%

An agent loop is a stream of decisions, and most of them are mundane: parse this, classify that, pick the next tool, format the output. A small model nails those for a tiny fraction of frontier pricing. Routing means a lightweight classifier (or even a rule) decides, per step, which model to call — escalating to the frontier model only for the genuinely hard reasoning.

The trade-off is real but manageable: route a hard step to a weak model and quality drops, and the router itself adds a little latency and complexity. The discipline that makes it safe is evaluation — score each candidate model on a labeled set of your real steps so you route on evidence, not vibes. See how the reasoning loop decides what it needs in our guide to LLM agents.

Cheap model handles classification, extraction, formatting.
Frontier model reserved for hard, multi-hop reasoning.
Route on an evaluation set, not on intuition.
Trade-off: a wrong route hurts quality — measure it.

How LLM agents reason

Classify stepEasy or hard?

Cheap modelMost steps land here

Frontier modelHard reasoning only

Same qualityAt a fraction of cost

A router inspects each step and sends easy work to a cheap model, escalating only hard reasoning to a frontier model.

Context trimming (tactic 2) attacks the other axis. A naive agent appends every turn to the prompt forever, so by the end of a long task it's re-sending pages of tool output it will never look at again. Trimming replaces that with a rolling window: keep the system prompt and the last few turns verbatim, summarize the older middle into a compact note, and discard tool results once their conclusion is captured. The agent keeps its working memory; you stop paying to re-read its diary. The trade-off is that an over-aggressive summary can drop a detail the agent later needs — so summarize with a model instruction that preserves decisions, IDs, and open questions, and lean on real retrieval for facts rather than hoarding them in the context window.

The no-quality-risk wins

Caching, loop caps, and cached tools

Tactics 3, 4, and 8 share a property worth gold: they cut cost with essentially zero risk to output quality. Ship these first.

Prompt and response caching (tactic 3) exploits the most wasteful thing agents do — re-sending an identical prefix on every turn. Order your prompt so the stable content comes first (system instructions, tool schemas, few-shot examples, any fixed knowledge) and the volatile content comes last (the newest user turn, fresh retrieval). Then the provider can charge a steep discount for the unchanged prefix on every subsequent call. Layer on a response cache keyed by exact input for the repeated, deterministic prompts that show up in any real workload. The only trade-off is hygiene: a stale cache serves yesterday's answer, so set sensible TTLs and bust the cache when the underlying data changes.

Capping loops and adding stop conditions (tactic 4) is insurance against the tail that wrecks budgets. Give every task a hard iteration limit, an explicit completion signal, and cycle detection so an agent that keeps calling the same erroring tool gets stopped instead of spinning. Add a per-task token ceiling that aborts gracefully. The trade-off — a genuinely hard task occasionally hitting the cap before finishing — is handled with a clean fallback or human hand-off, which is far cheaper than an unbounded loop. This pairs directly with agent orchestration, where budgets and stop logic live.

Caching tool results (tactic 8) finishes the set. Many tool calls are deterministic or slow-changing — a SKU lookup, a tax table, an embedding of a fixed document. Cache them by input and the agent reads the answer for free instead of paying the model to orchestrate a re-fetch. The trade-off is identical to prompt caching: invalidate when the source changes, and never cache anything with side effects or per-user variance.

agent_step.pypython

1messages = [  // stable prefix FIRST so it caches2    cached(system_prompt),  // provider discounts repeat prefix3    cached(tool_schemas),4    *trimmed_history,  // rolling window, not the full log5    user_turn,  // volatile content LAST6]78if task.tokens_spent > task.budget:  // tactic 4: hard ceiling9    return handoff("budget reached")1011if (hit := tool_cache.get(call.key)):  // tactic 8: cached tool result12    result = hit13else:14    result = run_tool(call)15    tool_cache.set(call.key, result, ttl=3600)

Three cheap wins in one step: a cached stable prefix, a per-task budget guard, and a cached deterministic tool.

Fewer, sharper calls

Batch tool calls and slim your retrieval

Tactics 6 and 7 cut cost by reducing both the number of model turns and the size of each prompt — and they often improve quality as a side effect.

Batching and parallelizing tool calls (tactic 6) attacks turn count. When an agent needs three independent lookups, a naive loop does three serial round-trips, each one a paid model turn to decide on and then read the next call. If the calls don't depend on each other, the agent can request them together and a single turn consumes all three results — fewer passes through the loop, lower latency, lower cost. The trade-off is that you must be sure the calls really are independent; parallelizing steps that depend on each other produces wrong results faster, not cheaper.

Retrieving less but ranking better (tactic 7) attacks prompt size. Cramming twenty chunks into context to be safe is a double tax: you pay for the tokens, and the extra noise actually degrades the answer as the model loses the signal in the middle. The fix is a two-stage retrieval stack — recall a broad candidate pool cheaply, re-rank it, and pass only the top three or four passages. You send a smaller prompt and get a sharper answer. Our deep dive on RAG covers the chunking and re-ranking that make this work; the cost lever is simply choosing precision over volume.

Approach	Naive	Optimized
Tool calls	3 serial turns	1 batched turn
Retrieved chunks	20 'just in case'	3–4 re-ranked
Prompt size	Large, noisy	Small, precise
Model turns / task	High	Low
Answer quality
Cost per task	High	Low

Cheaper and better at once

These two are the rare optimizations with no quality trade-off in the typical case — fewer turns and a tighter, better-ranked context usually raise answer quality while cutting spend. They're a free lunch you have to cook yourself.

How the levers compare

Illustrative savings per tactic

A rough ordering of how much each tactic tends to move an agent bill. Treat the heights as illustrative — your real numbers depend entirely on your workload, baseline, and prompt shape.

Illustrative cost reduction by tactic

Model routing (cheap model for easy steps)45%

Prompt + response caching40%

Context trimming / summarization35%

Smaller / distilled models where they suffice32%

Cache tool results25%

Retrieve less, rank better22%

Batch + parallelize tool calls18%

Cap loops + stop conditions (tail control)15%

Illustrative only — not measured research. Bars show the rough relative reduction each tactic tends to contribute on a typical multi-step agent; effects overlap and are not additive. Always measure on your own observability data before and after.

The ranking is about shape, not precision. Routing and caching sit at the top because they touch every step of every task; trimming and distillation cut the size of what's left; tool-result caching and leaner retrieval trim specific hotspots; loop caps look small on average but save you from the catastrophic tail that no average captures. Note these bars overlap — caching and trimming both shrink the prompt, so you can't simply add them. The only way to know your numbers is to measure before and after with observability.

Tactic 9, the one that frames the rest

Measure cost per task, then budget it

Everything above is guesswork until you can see a per-task cost number, attribute it to steps and tools, and enforce a ceiling. This is where you actually start.

The single most expensive mistake in agent engineering is optimizing blind. Token cost is deeply unintuitive — the step you assume is cheap is often the one re-sending a 6,000-token prefix, and the model you assume is the problem is fine. Without measurement you'll tune the wrong thing and congratulate yourself for it.

Instrument every run so you record cost per task, broken down by step, model, and tool, and tag it by tenant or feature so you can see which workloads dominate. Then set budgets: a per-task ceiling that aborts gracefully, alerts when a tenant's spend spikes, and a dashboard that makes regressions obvious after a prompt change or model swap. Budgets turn cost from a monthly surprise into a controlled input. This is the backbone of deploying agents responsibly, and it's covered end to end in our observability guide.

Record cost per task — Sum input + output tokens × price across every step of the run, not just the final call.
Attribute to steps and tools — Break the number down so you can see which step or tool is the hotspot.
Tag by tenant and feature — Find the workloads that dominate — cost is almost always skewed.
Set per-task budgets — A hard ceiling that aborts gracefully and hands off when hit.
Alert on spikes and regressions — Catch a runaway tenant or a prompt change that quietly doubled the bill.
Re-measure after every change — Prove each optimization on your own numbers before trusting it.

10×

Loop multiplier

one question, many paid turns

80%

Steps that are 'easy'

routable to a cheap model

>50%

Typical stacked savings

illustrative, measure your own

Place to start

observability + budgets

FAQ

Cutting agent costs, answered

Tokens, multiplied by trips. Every step of an agent loop sends the system prompt, the accumulated conversation, the tool schemas, and the retrieved context to the model, then pays again for the generated output. Because an agent re-sends most of that context on every turn, a five-step task can cost ten times a single chat completion even though the user asked one question. The biggest line item is almost never the final answer — it's the compounding input tokens of a loop that re-reads its own history, plus the occasional runaway task that loops far more times than it should. Cost control is therefore mostly about shrinking what you re-send and capping how many times you send it.

Keep reading

Go deeper on building lean agents

AI agent observabilityTrace cost per task and find the hotspots RAGRetrieve less and rank better LLM agentsThe reason–act loop you're routing AI agent deploymentShip agents with budgets and guardrails AI agent orchestrationWhere loop caps and stop logic live

reduce AI agent costsLLM cost optimizationAI agent token costsmodel routingcut LLM costsagent cost controlLLMOps cost

Get started

Build agents that are cheap by design

Cache the prefix, route the easy steps, cap the loop, and watch the cost per task in real time. Free to start — no credit card required.

Start building free Read the docs