Deploying AI Agents to Production
A demo that works on your laptop is not a product. Taking an agent to production means running its reasoning loop reliably, persisting state, controlling cost and concurrency, and watching every run — without surprising your users or your finance team.
- 14 min read
- Advanced
- Updated 2026
The gap between an agent that works in a notebook and one that serves thousands of users is not the model — it is everything around the model. Deployment is where prompts, tools, state, cost, and safety stop being someone's afternoon and start being an operational system.
An ordinary web service takes a request, does some work, and returns a response. An AI agent is different in ways that matter for operations: it runs a multi-step loop, it calls tools that mutate real systems, it carries memory across turns, and it spends money on every token it generates. A single user message can fan out into a dozen model calls and tool invocations that take minutes to resolve. None of that fits the tidy request-response assumptions most infrastructure was built around.
That mismatch is why deploying agents has grown into its own discipline — often called LLMOps. It blends classic DevOps with the peculiar demands of non-deterministic, token-priced software: how to host a loop that can run for minutes, how to keep spend bounded, how to give an agent just enough permission to be useful but not enough to be dangerous, and how to know, on every release, whether the thing got better or quietly worse.
This guide walks the whole path: the dev → staging → prod pipeline, hosting options from serverless to queue workers, state and memory persistence, concurrency and cost control, secrets and tool permissions, observability and evals in CI/CD, rollouts and rollbacks, and where humans should stay in the loop.
Dev → staging → production
Every change to an agent should flow through the same gates. The point of the pipeline is to make non-determinism boring: catch regressions in staging, not in front of users.
1 · Development
Build and iterate locally against recorded fixtures and a golden set of cases the agent must pass. Pin the model version, prompt, and tool schema so the run is reproducible — an agent that changes underneath you can't be debugged.
2 · Evals in CI
On every pull request, run the eval suite: task success, grounding, tool-call correctness, latency, and cost. Treat a regression like a failing unit test — it blocks the merge. This is the gate that keeps subjective 'it feels better' out of the release decision.
3 · Staging
Run against production-like tools wired to sandboxes or test tenants, so the agent exercises real integrations without real consequences. Replay anonymized production traffic to surface the long tail of weird inputs your golden set missed.
4 · Canary → production
Promote the bundle to a small slice of live traffic, watch the same metrics in real time, then ramp to 100% if healthy. Keep the previous bundle warm so a rollback is a config flip, not a redeploy.
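The merge-blocking gate from step 2 can be sketched as a plain threshold comparison. This is a minimal illustration rather than a real CI integration: the metric names follow the ones this guide lists for the eval suite, while `THRESHOLDS`, `check_gate`, and the sample scores are assumptions invented for the example.

```python
import operator

# Hypothetical thresholds: metric name -> (comparison, limit).
THRESHOLDS = {
    "task_success": (operator.ge, 0.90),
    "grounding": (operator.ge, 0.95),
    "p95_latency_ms": (operator.le, 8000),
    "cost_per_run_usd": (operator.le, 0.12),
}


def check_gate(scores: dict) -> list:
    """Return the names of metrics that violate their threshold.

    A missing metric counts as a failure so a broken eval run cannot pass.
    """
    failures = []
    for name, (compare, limit) in THRESHOLDS.items():
        value = scores.get(name)
        if value is None or not compare(value, limit):
            failures.append(name)
    return failures


if __name__ == "__main__":
    # Illustrative scores, not real measurements.
    candidate = {
        "task_success": 0.93,
        "grounding": 0.91,
        "p95_latency_ms": 6400,
        "cost_per_run_usd": 0.10,
    }
    failed = check_gate(candidate)
    # A CI job would exit non-zero here to block the merge.
    print("BLOCK" if failed else "PASS", failed)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: an eval job that crashes halfway should block the merge, not slip through.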
The unit of release is a bundle, not a commit
An agent's behavior is the product of its prompt, its tool set, and its model version together. Change any one and behavior can swing. Version those three as a single immutable bundle with an ID, so every eval result, trace, and rollback refers to exactly the same thing. This is the single most useful habit in agent evaluation and operations.
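One way to get an immutable bundle ID is to hash the three parts together. The sketch below assumes the tool schemas are already serialized to strings; `AgentBundle` and its field names are hypothetical, and a 12-character hash prefix is an arbitrary choice for readability.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentBundle:
    """Prompt, tool schemas, and model version released as one unit."""

    prompt: str
    tool_schemas: tuple  # serialized tool definitions, one string per tool
    model_version: str

    @property
    def bundle_id(self) -> str:
        # Canonical JSON so identical content always hashes to the same ID.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]
```

Because the ID is derived from content, changing only the model version yields a different bundle ID, and rebuilding an unchanged bundle yields the same one.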
Hosting: serverless, containers, and queues
The first architectural decision is shaped by one question: how long does a single agent run take? Get that right and the rest of the stack falls into place.
Short, bounded turns — a quick classification, a single retrieve-and-answer step — fit serverless functions beautifully. They autoscale to zero, you pay per invocation, and a spiky workload costs nothing while idle. The catch is execution limits: most serverless platforms cap how long a function can run and how long it can stream, which collides head-on with an agent that loops for minutes.
For those longer runs, reach for long-lived containers or a queue-and-worker pattern. A thin API accepts the request, drops a job on a queue, and returns immediately; a pool of worker containers pulls jobs and executes the full agent loop for as long as it needs, streaming progress over a websocket or storing results for the client to poll. This decouples the user-facing latency budget from the agent's actual wall-clock time and lets you scale workers independently of the API.
Most mature deployments end up hybrid: serverless at the edge for fast turns and cheap fan-out, a durable queue in the middle, and container workers for the heavy, long-running reasoning. Whatever you pick, the workers must be stateless — all run state lives outside the process, which is the subject of the next section.
| Need | Serverless | Containers | Queue + workers |
|---|---|---|---|
| Short bounded turns | |||
| Multi-minute runs | |||
| Scales to zero | |||
| Streams long output | |||
| Smooths traffic spikes | |||
| Retries + back-pressure | |||
API enqueues, workers execute
Separating ingestion from execution is the single most reliable pattern for production agents. The API stays fast and cheap; the workers absorb slow tools and long loops; the queue gives you retries, back-pressure, and a natural place to enforce concurrency limits.
Because workers hold no state, you can add or kill them freely, run them across zones for resilience, and replay a failed run by re-enqueuing its job. Pair this with a deduplication key so a retried message never double-executes a side effect.
- Thin API returns a run ID instantly.
- Durable queue buffers spikes and retries failures.
- Stateless workers scale horizontally.
- Results streamed or polled by run ID.
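The pattern above can be sketched in a single process. This is only an illustration under loud assumptions: an in-memory `queue.Queue` and plain dictionaries stand in for a durable queue and results store, and `submit`, `work_one`, and `run_agent` are hypothetical names.

```python
import queue
import uuid

jobs = queue.Queue()    # stand-in for a durable queue
results = {}            # run_id -> result, stand-in for a results store
executed_keys = set()   # dedup keys whose side effects already ran


def submit(message: str, dedup_key: str) -> str:
    """Thin API: enqueue the job and return a run ID immediately."""
    run_id = str(uuid.uuid4())
    jobs.put({"run_id": run_id, "dedup_key": dedup_key, "message": message})
    return run_id


def run_agent(message: str) -> str:
    """Placeholder for the full agent loop."""
    return f"handled: {message}"


def work_one() -> None:
    """Stateless worker: pull one job and skip it if its dedup key already ran."""
    job = jobs.get()
    try:
        if job["dedup_key"] in executed_keys:
            return  # redelivered or duplicated message: do not repeat side effects
        results[job["run_id"]] = run_agent(job["message"])
        executed_keys.add(job["dedup_key"])
    finally:
        jobs.task_done()
```

In a real deployment the dedup check and the result write would need to be atomic in shared storage, since several workers may receive the same message at once.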
State and memory persistence
A worker can die mid-run, scale away, or hand a conversation to a different instance next turn. If state lives in memory, all of that loses data. Externalize everything.
Production agents carry several kinds of state, and each needs a durable home outside the worker process. Conversation state — the running message history and tool results — must survive a reconnect or a worker swap. Working memory — the scratchpad, plan, and intermediate findings within a single run — should be checkpointed so a crashed run can resume instead of starting over. And long-term memory — facts the agent should recall across sessions — belongs in a database or a vector store the agent queries on demand.
The practical rule is that any worker should be able to pick up any run by loading its state from storage. A relational or document database handles structured run state and checkpoints; an object store holds large artifacts; a vector store backs retrieval and long-term recall. Checkpoint after each step so a failure costs you one step, not the whole run — and so you can insert a human approval pause and resume exactly where you left off.
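Checkpoint-and-resume can be sketched as follows. The dictionary `store` stands in for a database table keyed by run ID, and `fail_at` exists only to simulate a worker crash; all names here are assumptions for illustration.

```python
import json
from typing import Optional

store = {}  # stand-in for a database table keyed by run ID


def save_checkpoint(run_id: str, state: dict) -> None:
    store[run_id] = json.dumps(state)


def load_checkpoint(run_id: str) -> dict:
    raw = store.get(run_id)
    return json.loads(raw) if raw else {"next_step": 0, "findings": []}


def run(run_id: str, steps: list, fail_at: Optional[int] = None) -> dict:
    """Execute steps in order, checkpointing after each one.

    `fail_at` simulates a worker crash just before the given step index.
    """
    state = load_checkpoint(run_id)
    for i in range(state["next_step"], len(steps)):
        if i == fail_at:
            raise RuntimeError(f"worker died before step {i}")
        state["findings"].append(f"done:{steps[i]}")  # placeholder for real work
        state["next_step"] = i + 1
        save_checkpoint(run_id, state)
    return state
```

Calling `run` again with the same run ID after a crash picks up at `next_step`, so completed steps are not repeated. The same resume path could serve a human approval pause: stop after a checkpoint, then call `run` once the approval arrives.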
Treat memory as something to curate, not just append. Summarize and prune old turns to keep context small and cheap, expire stale facts, and keep provenance on stored memories so you can trust and audit what the agent recalls. Read more in the deep dive on observing agent state.
Concurrency, rate limits, and cost control
Agents fail in two expensive directions at once: they hammer downstream APIs into rate limits, and they burn tokens in loops nobody capped. Both need hard guardrails.
Step & budget caps
Cap the maximum reasoning steps and tool calls per run, and set a hard token budget that aborts a runaway loop. An agent that can't stop itself will, eventually, cost you a fortune.
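A hard ceiling can be as small as a counter the loop must charge on every step. The sketch below assumes the agent loop reports token usage after each model call; `RunBudget` and `BudgetExceeded` are hypothetical names.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a run hits its step or token ceiling."""


@dataclass
class RunBudget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one step and its token usage; abort if a cap is exceeded."""
        self.steps_used += 1
        self.tokens_used += tokens
        if self.steps_used > self.max_steps:
            raise BudgetExceeded(f"step cap {self.max_steps} exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget {self.max_tokens} exceeded")
```

The loop would call `budget.charge(tokens)` after each model call and treat `BudgetExceeded` as a terminal state for the run, not as something to retry.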
Concurrency limits
Throttle how many runs and tool calls fire at once, per tenant and globally, so one busy customer can't starve everyone else or trip your provider's rate limits.
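A per-tenant throttle can be sketched with one semaphore per tenant. This assumes an `asyncio`-based worker; `TenantLimiter` is a hypothetical helper, and a global limit would wrap it in one more semaphore shared by all tenants.

```python
import asyncio


class TenantLimiter:
    """Limits how many coroutines run at once for each tenant."""

    def __init__(self, max_per_tenant: int) -> None:
        self._max = max_per_tenant
        self._semaphores = {}

    def _semaphore(self, tenant_id: str) -> asyncio.Semaphore:
        # Create the semaphore lazily, the first time a tenant shows up.
        if tenant_id not in self._semaphores:
            self._semaphores[tenant_id] = asyncio.Semaphore(self._max)
        return self._semaphores[tenant_id]

    async def run(self, tenant_id: str, coro_fn):
        """Await coro_fn() once a slot for this tenant is free."""
        async with self._semaphore(tenant_id):
            return await coro_fn()
```

Runs beyond the limit wait for a slot instead of failing, so a busy tenant slows itself down without affecting anyone else.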
Rate-limit handling
Expect 429s from model and tool APIs. Back off exponentially, retry idempotently, and queue overflow rather than dropping it — the queue is your shock absorber.
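Exponential backoff with jitter can be sketched as below. `RateLimited` is a stand-in for however your client surfaces a 429, and the default delays are illustrative assumptions, not recommendations from any provider.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for an HTTP 429 response from a model or tool API."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on RateLimited with exponential backoff and jitter.

    fn must be idempotent, because it may run more than once.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the caller re-enqueue the job
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))  # jitter avoids synchronized retries
```

When the final attempt still fails, the exception propagates so the worker can put the job back on the queue instead of dropping it.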
Context trimming
Summarize and prune history instead of replaying every turn. Token cost scales with context size on every single step, so a bloated context taxes the whole run.
Prompt caching
Cache stable system prompts and tool definitions so you don't pay to re-read them each call. For long-running agents this is one of the largest, easiest savings.
Model routing
Send easy steps to a smaller, cheaper model and reserve the frontier model for hard reasoning. Routing by difficulty can cut cost sharply with little or no quality loss, provided your evals confirm the cheaper model holds up on the steps you route to it.
- Step cap: abort runaway loops
- Budget ceiling: cap spend by customer
- Rate-limit retries: queue, don't drop
- Runs cost-attributed: trace spend per run
Attribute cost per run and per tenant
You cannot control what you cannot see. Tag every model and tool call with a run ID and tenant ID, then total the spend per run. This turns "the agent is expensive" into "this one tool loops three times for power users on the enterprise plan" — a problem you can actually fix. Cost is a first-class metric in agent observability.
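Per-run and per-tenant attribution comes down to tagging each call and summing. In this sketch the model names and prices in `PRICE_PER_1K` are made up for illustration, and the `calls` list stands in for a trace or metrics backend.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens; real prices depend on the provider.
PRICE_PER_1K = {"small-model": 0.0005, "frontier-model": 0.01}

calls = []  # stand-in for a trace or metrics backend


def record_call(run_id: str, tenant_id: str, model: str, tokens: int) -> None:
    """Tag one model call with its run, tenant, and computed cost."""
    cost = tokens / 1000 * PRICE_PER_1K[model]
    calls.append(
        {"run_id": run_id, "tenant_id": tenant_id, "model": model, "cost_usd": cost}
    )


def spend_by(key: str) -> dict:
    """Total recorded cost grouped by 'run_id', 'tenant_id', or 'model'."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call["cost_usd"]
    return dict(totals)
```

`spend_by("tenant_id")` answers who is expensive, and `spend_by("run_id")` answers which runs made them so.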
Secrets and tool permissions
An agent's tools are its hands, and its secrets are its keys. The blast radius of a confused or manipulated agent is exactly the set of permissions you handed it.
Never put credentials in prompts, code, or model context. Inject them at runtime from a secrets manager, scope them tightly, and rotate them on a schedule. The model should be able to invoke a tool that uses a credential without ever seeing the credential itself — the secret lives in the tool's execution environment, not in the conversation.
Then scope each tool to least privilege. A retrieval tool should read one index, not your whole warehouse; a write tool should touch one tenant's records, not everyone's. Because an agent decides for itself which tools to call — and can be steered by prompt injection from untrusted content — assume any tool it holds can be triggered by an adversary, and grant accordingly. Pair narrow permissions with input validation on tool arguments and allow-lists on destinations.
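Argument validation with a destination allow-list might look like the sketch below. The `fetch_url` tool, the hosts in `ALLOWED_HOSTS`, and the function name are all hypothetical; the point is that the check runs in the tool's own code, outside the model's control.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of hosts this tool may contact.
ALLOWED_HOSTS = {"api.internal.example.com", "docs.example.com"}


def validate_fetch_args(args: dict) -> str:
    """Validate arguments for a hypothetical `fetch_url` tool before it runs.

    Returns the URL if it is acceptable, otherwise raises ValueError.
    """
    url = args.get("url")
    if not isinstance(url, str):
        raise ValueError("url must be a string")
    parsed = urlparse(url)
    if parsed.scheme != "https":
        raise ValueError("only https URLs are allowed")
    if parsed.hostname not in ALLOWED_HOSTS:
        raise ValueError(f"host not on allow-list: {parsed.hostname}")
    return url
```

Checking the parsed hostname, not a substring of the raw URL, matters: a URL such as `https://docs.example.com@evil.test/` contains an allowed name but resolves to a different host.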
Finally, separate read tools from write tools, and gate the dangerous writes. Side-effect-free reads can run freely; irreversible or high-impact actions should require an explicit approval step, covered below. Log every tool call with its arguments and result so you have an audit trail when something goes wrong.
Tool permission done right
- Secrets injected from a vault, never in context
- Each tool scoped to least privilege
- Read and write tools cleanly separated
- Destructive actions gated behind approval
- Every tool call logged with args and result
Anti-patterns to avoid
- A single all-powerful admin key shared by every tool
- Credentials pasted into the system prompt
- Tools that can touch any tenant's data
- Untrusted web content fed straight into a tool
- No record of what the agent actually did
Observability and evals in CI/CD
Non-deterministic software can regress without a single line of code changing — a model update or a prompt tweak is enough. The only defense is to measure every run and every release.
```yaml
eval_gate:                          # runs on every PR
  bundle: agent@${{ git.sha }}      # prompt + tools + model
  suite: ./evals/golden_set.jsonl
  thresholds:
    task_success: ">= 0.90"         # did it finish the job?
    grounding: ">= 0.95"            # claims backed by sources
    tool_call_accuracy: ">= 0.92"
    p95_latency_ms: "<= 8000"
    cost_per_run_usd: "<= 0.12"
  on_regression: block_merge        # fail the build, not prod
```
Trace every run in production
Step-level traces
Capture each reasoning step, tool call, input, and output so you can replay exactly what the agent did when a run goes sideways.
Live metrics
Track task success, latency, error rate, and cost per run as time series, sliced by bundle version and tenant.
Alerts on drift
Page someone when grounding drops, cost spikes, or tool errors climb — model and data drift are silent without alarms.
Close the loop with production data
Offline evals catch known failures; production catches the unknown ones. Sample real runs — especially failures, refusals, and low-confidence answers — and feed them back into the golden set so your eval suite grows toward the cases that actually break. This flywheel, where production traces become tomorrow's tests, is what separates an agent that decays from one that improves.
Lean on an LLM-as-judge and rubric evals for the subjective dimensions humans can't score at scale, and reserve human review for the ambiguous tail. The full methodology lives in our guide to agent observability.
Rollouts, rollbacks, and human-in-the-loop
The last mile is releasing changes without breaking trust, and keeping a person in the loop wherever a mistake would be costly or irreversible.
Because the agent bundle changes far more often than your code, treat rollout as a routine, low-drama event. Ship the new bundle to a canary — a small slice of live traffic — and watch the same metrics you gate on in CI, now against real users. If task success, grounding, latency, and cost all hold, ramp the rollout in stages to 100%. If anything regresses, roll back by flipping traffic to the previous bundle, which you kept warm precisely for this moment. A rollback should be a config change measured in seconds, never a redeploy measured in minutes.
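Canary routing and a config-flip rollback can be sketched like this. The `routing` dictionary stands in for whatever config service you use, and the bundle IDs and 5% slice are illustrative assumptions.

```python
import hashlib

# Hypothetical routing config; in practice this would live in a config service.
routing = {"stable": "bundle-a1", "canary": "bundle-b2", "canary_percent": 5}


def pick_bundle(user_id: str, config: dict) -> str:
    """Deterministically assign a user to the canary or the stable bundle."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    if bucket < config["canary_percent"]:
        return config["canary"]
    return config["stable"]


def rollback(config: dict) -> None:
    """Rollback is a config flip: send all traffic back to the stable bundle."""
    config["canary_percent"] = 0
```

Hashing the user ID keeps each user on the same bundle for the whole ramp, which makes before-and-after comparisons cleaner than random per-request assignment would.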
Human-in-the-loop checkpoints are the safety valve for actions that can't be undone. The agent reasons and prepares the action, then pauses and surfaces a clear approval request before the tool actually fires — a refund, a deletion, an outbound message, a production change. Reversible, low-stakes steps run autonomously; consequential ones wait for a person. As eval scores and trust grow, widen the agent's autonomy deliberately, and always keep an audit trail of who or what approved each action.
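An approval checkpoint can be sketched as a small state machine around the tool call. The tool names in `REQUIRES_APPROVAL` and the `ApprovalQueue` class are hypothetical, and the actual tool execution is left out; only the gating and the audit trail are shown.

```python
from dataclasses import dataclass, field

# Hypothetical set of tool names that need a human decision.
REQUIRES_APPROVAL = {"issue_refund", "delete_record", "send_email"}


@dataclass
class ApprovalQueue:
    pending: dict = field(default_factory=dict)    # request_id -> (tool, args)
    audit_log: list = field(default_factory=list)  # (request_id, tool, outcome)

    def propose(self, request_id: str, tool: str, args: dict) -> str:
        """Let low-stakes tools proceed; park consequential ones for review."""
        if tool not in REQUIRES_APPROVAL:
            self.audit_log.append((request_id, tool, "auto"))
            return "executed"
        self.pending[request_id] = (tool, args)
        return "awaiting_approval"

    def decide(self, request_id: str, approver: str, approved: bool) -> str:
        """Record who decided, then release or drop the parked action."""
        tool, _args = self.pending.pop(request_id)
        outcome = "executed" if approved else "rejected"
        self.audit_log.append((request_id, tool, f"{outcome} by {approver}"))
        return outcome
```

Widening autonomy later then means removing a tool name from the approval set, a change that can itself be reviewed and logged.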
Production-readiness checklist
- Prompt, tools, and model versioned as one bundle — Every trace, eval, and rollback points to a single bundle ID.
- Hosting matches run length — Serverless for short turns, stateless queue workers for long runs.
- All state externalized and checkpointed — Any worker can resume any run from durable storage.
- Step caps and per-tenant budgets enforced — No run can loop or spend without a hard ceiling.
- Concurrency and rate-limit handling in place — Backoff, retries, and queueing absorb 429s gracefully.
- Secrets injected from a vault, never in context — The model invokes tools without ever seeing credentials.
- Tools scoped to least privilege — Read and write separated; destructive writes gated.
- Eval gate blocks regressions in CI — Task success, grounding, latency, and cost thresholds enforced.
- Tracing and alerting live in production — Step-level traces with alarms on drift and cost spikes.
- Canary, rollback, and human approvals wired — Ship to a slice, watch, ramp, and revert in seconds.
Deploying agents, answered
Deploying an AI agent means promoting it from a developer's laptop to a managed environment where real users and real systems depend on it. Unlike a stateless web endpoint, an agent runs multi-step loops, calls tools that touch live data, holds state across turns, and incurs per-token cost on every reasoning step. Production deployment therefore covers hosting and scaling the run loop, persisting memory and state, controlling concurrency and spend, locking down secrets and tool permissions, and wiring observability and evals so you can detect regressions before users do. In short: it is the operational discipline (often called LLMOps) of running an agent reliably, safely, and affordably at scale.