Should I run agents on serverless functions or containers?

It depends on how long a single agent run takes. Short, bounded interactions — a one-shot classification, a quick retrieval-and-answer turn — fit serverless functions well: they autoscale to zero and you pay per invocation. But many agents loop for minutes, call slow tools, and stream tokens, which collides with serverless timeout and execution limits. For those, run a long-lived container or a worker pool that pulls runs off a queue, so a single agent execution can take as long as it needs without holding an HTTP connection open. A common production shape is a thin serverless API that enqueues work and a fleet of container workers that execute the agent loop.

How do I control the cost of agents in production?

Agent cost is driven by tokens, and tokens are driven by loop length, context size, and model choice. Control it on several fronts: cap the maximum number of reasoning steps and tool calls per run; trim and summarize context instead of replaying the full history; use prompt caching for stable system prompts and tool definitions; route easy steps to a smaller, cheaper model and reserve the frontier model for hard reasoning; and set hard per-user and per-tenant budget limits that abort a runaway run. Crucially, attribute spend per run and per tenant in your observability so you can see which agents and which users are expensive, then optimize the worst offenders.

How do canary releases and rollbacks work for agents?

Treat the agent's prompt, tool set, and model version as a single versioned bundle. A canary release routes a small slice of traffic — say five percent — to the new bundle while the rest stays on the known-good version, and you watch evals, task success, latency, and cost on the canary in real time. If the metrics hold, you ramp the rollout; if grounding drops or cost spikes, you roll back instantly by flipping traffic to the previous bundle. Because prompts and configs change far more often than code, version them deliberately and keep the last good bundle one switch away.

Where should humans stay in the loop for production agents?

Insert human-in-the-loop checkpoints wherever an action is irreversible, expensive, or high-risk — issuing refunds, deleting records, sending external messages, executing trades, or modifying infrastructure. The pattern is to let the agent reason and prepare the action, then pause and surface a clear approval request to a person before the tool actually fires. Low-risk, easily reversible steps can run autonomously. As trust and eval scores grow you can widen autonomy, but you should always keep an audit trail of who or what approved each consequential action.

LLMOps · Deployment

Deploying AI Agents to Production

A demo that works on your laptop is not a product. Taking an agent to production means running its reasoning loop reliably, persisting state, controlling cost and concurrency, and watching every run — without surprising your users or your finance team.

14 min read
Advanced
Updated 2026

Deploy your first agent Read the docs

The gap between an agent that works in a notebook and one that serves thousands of users is not the model — it is everything around the model. Deployment is where prompts, tools, state, cost, and safety stop being someone's afternoon and start being an operational system.

An ordinary web service takes a request, does some work, and returns a response. An AI agent is different in ways that matter for operations: it runs a multi-step loop, it calls tools that mutate real systems, it carries memory across turns, and it spends money on every token it generates. A single user message can fan out into a dozen model calls and tool invocations that take minutes to resolve. None of that fits the tidy request-response assumptions most infrastructure was built around.

That mismatch is why deploying agents has grown into its own discipline — often called LLMOps. It blends classic DevOps with the peculiar demands of non-deterministic, token-priced software: how to host a loop that can run for minutes, how to keep spend bounded, how to give an agent just enough permission to be useful but not enough to be dangerous, and how to know, on every release, whether the thing got better or quietly worse.

This guide walks the whole path: the dev → staging → prod pipeline, hosting options from serverless to queue workers, state and memory persistence, concurrency and cost control, secrets and tool permissions, observability and evals in CI/CD, rollouts and rollbacks, and where humans should stay in the loop.

The promotion path

Dev → staging → production

Every change to an agent should flow through the same gates. The point of the pipeline is to make non-determinism boring: catch regressions in staging, not in front of users.

DevPrototype + golden tests

Evals (CI)Gate on task success

StagingReal tools, fake stakes

CanarySmall traffic slice

ProductionFull rollout + monitor

The agent deployment pipeline. A versioned bundle moves through evals and a canary before full rollout — and a rollback path runs back the other way.

1 · Development
Build and iterate locally against recorded fixtures and a golden set of cases the agent must pass. Pin the model version, prompt, and tool schema so the run is reproducible — an agent that changes underneath you can't be debugged.
2 · Evals in CI
On every pull request, run the eval suite: task success, grounding, tool-call correctness, latency, and cost. Treat a regression like a failing unit test — it blocks the merge. This is the gate that keeps subjective 'it feels better' out of the release decision.
3 · Staging
Run against production-like tools wired to sandboxes or test tenants, so the agent exercises real integrations without real consequences. Replay anonymized production traffic to surface the long tail of weird inputs your golden set missed.
4 · Canary → production
Promote the bundle to a small slice of live traffic, watch the same metrics in real time, then ramp to 100% if healthy. Keep the previous bundle warm so a rollback is a config flip, not a redeploy.

The unit of release is a bundle, not a commit

An agent's behavior is the product of its prompt, its tool set, and its model version together. Change any one and behavior can swing. Version those three as a single immutable bundle with an ID, so every eval result, trace, and rollback refers to exactly the same thing. This is the single most useful habit in agent evaluation and operations.

Where the loop runs

Hosting: serverless, containers, and queues

The first architectural decision is shaped by one question: how long does a single agent run take? Get that right and the rest of the stack falls into place.

Short, bounded turns — a quick classification, a single retrieve-and-answer step — fit serverless functions beautifully. They autoscale to zero, you pay per invocation, and a spiky workload costs nothing while idle. The catch is execution limits: most serverless platforms cap how long a function can run and how long it can stream, which collides head-on with an agent that loops for minutes.

For those longer runs, reach for long-lived containers or a queue-and-worker pattern. A thin API accepts the request, drops a job on a queue, and returns immediately; a pool of worker containers pulls jobs and executes the full agent loop for as long as it needs, streaming progress over a websocket or storing results for the client to poll. This decouples the user-facing latency budget from the agent's actual wall-clock time and lets you scale workers independently of the API.

Most mature deployments end up hybrid: serverless at the edge for fast turns and cheap fan-out, a durable queue in the middle, and container workers for the heavy, long-running reasoning. Whatever you pick, the workers must be stateless — all run state lives outside the process, which is the subject of the next section.

Need	Serverless	Containers	Queue + workers
Short bounded turns
Multi-minute runs
Scales to zero
Streams long output
Smooths traffic spikes
Retries + back-pressure

The durable shape

API enqueues, workers execute

Separating ingestion from execution is the single most reliable pattern for production agents. The API stays fast and cheap; the workers absorb slow tools and long loops; the queue gives you retries, back-pressure, and a natural place to enforce concurrency limits.

Because workers hold no state, you can add or kill them freely, run them across zones for resilience, and replay a failed run by re-enqueuing its job. Pair this with a deduplication key so a retried message never double-executes a side effect.

Thin API returns a run ID instantly.
Durable queue buffers spikes and retries failures.
Stateless workers scale horizontally.
Results streamed or polled by run ID.

Explore deployment SDKs

RequestUser or system call

APIValidate + enqueue run

QueueBuffer, retry, throttle

WorkerRun the agent loop

ResultStream / poll by ID

Decouple user latency from agent wall-clock time: enqueue fast, execute in workers, return by run ID.

Nothing in the process

State and memory persistence

A worker can die mid-run, scale away, or hand a conversation to a different instance next turn. If state lives in memory, all of that loses data. Externalize everything.

Production agents carry several kinds of state, and each needs a durable home outside the worker process. Conversation state — the running message history and tool results — must survive a reconnect or a worker swap. Working memory — the scratchpad, plan, and intermediate findings within a single run — should be checkpointed so a crashed run can resume instead of starting over. And long-term memory — facts the agent should recall across sessions — belongs in a database or a vector store the agent queries on demand.

The practical rule is that any worker should be able to pick up any run by loading its state from storage. A relational or document database handles structured run state and checkpoints; an object store holds large artifacts; a vector store backs retrieval and long-term recall. Checkpoint after each step so a failure costs you one step, not the whole run — and so you can insert a human approval pause and resume exactly where you left off.

Treat memory as something to curate, not just append. Summarize and prune old turns to keep context small and cheap, expire stale facts, and keep provenance on stored memories so you can trust and audit what the agent recalls. Read more in the deep dive on observing agent state.

Conversation state

Message historyTool resultsSurvives reconnects

Run checkpoints

Per-step snapshotsResume after crashPause for approval

Long-term memory

Cross-session factsVector + relationalProvenance kept

Artifacts

Files & outputsObject storageReferenced by ID

The persistence stack: nothing the agent needs to survive a restart lives inside the worker.

Keep it bounded

Concurrency, rate limits, and cost control

Agents fail in two expensive directions at once: they hammer downstream APIs into rate limits, and they burn tokens in loops nobody capped. Both need hard guardrails.

Step & budget caps

Cap the maximum reasoning steps and tool calls per run, and set a hard token budget that aborts a runaway loop. An agent that can't stop itself will, eventually, cost you a fortune.

Concurrency limits

Throttle how many runs and tool calls fire at once, per tenant and globally, so one busy customer can't starve everyone else or trip your provider's rate limits.

Rate-limit handling

Expect 429s from model and tool APIs. Back off exponentially, retry idempotently, and queue overflow rather than dropping it — the queue is your shock absorber.

Context trimming

Summarize and prune history instead of replaying every turn. Token cost scales with context size on every single step, so a bloated context taxes the whole run.

Prompt caching

Cache stable system prompts and tool definitions so you don't pay to re-read them each call. For long-running agents this is one of the largest, easiest savings.

Model routing

Send easy steps to a smaller, cheaper model and reserve the frontier model for hard reasoning. Routing by difficulty often cuts cost sharply with no quality loss.

per-run

Step cap

abort runaway loops

per-tenant

Budget ceiling

cap spend by customer

429→ backoff

Rate-limit retries

queue, don't drop

100%

Runs cost-attributed

trace spend per run

Attribute cost per run and per tenant

You cannot control what you cannot see. Tag every model and tool call with a run ID and tenant ID, then total the spend per run. This turns "the agent is expensive" into "this one tool loops three times for power users on the enterprise plan" — a problem you can actually fix. Cost is a first-class metric in agent observability.

Least privilege

Secrets and tool permissions

An agent's tools are its hands, and its secrets are its keys. The blast radius of a confused or manipulated agent is exactly the set of permissions you handed it.

Never put credentials in prompts, code, or model context. Inject them at runtime from a secrets manager, scope them tightly, and rotate them on a schedule. The model should be able to invoke a tool that uses a credential without ever seeing the credential itself — the secret lives in the tool's execution environment, not in the conversation.

Then scope each tool to least privilege. A retrieval tool should read one index, not your whole warehouse; a write tool should touch one tenant's records, not everyone's. Because an agent decides for itself which tools to call — and can be steered by prompt injection from untrusted content — assume any tool it holds can be triggered by an adversary, and grant accordingly. Pair narrow permissions with input validation on tool arguments and allow-lists on destinations.

Finally, separate read tools from write tools, and gate the dangerous writes. Reversible reads can run freely; irreversible or high-impact actions should require an explicit approval step, covered below. Log every tool call with its arguments and result so you have an audit trail when something goes wrong.

Tool permission done right

Secrets injected from a vault, never in context
Each tool scoped to least privilege
Read and write tools cleanly separated
Destructive actions gated behind approval
Every tool call logged with args and result

Anti-patterns to avoid

A single all-powerful admin key shared by every tool
Credentials pasted into the system prompt
Tools that can touch any tenant's data
Untrusted web content fed straight into a tool
No record of what the agent actually did

Know before users do

Observability and evals in CI/CD

Non-deterministic software can regress without a single line of code changing — a model update or a prompt tweak is enough. The only defense is to measure every run and every release.

ci/agent-evals.yamlyaml

1eval_gate:  // runs on every PR2  bundle: agent@${{ git.sha }}  // prompt + tools + model3  suite: ./evals/golden_set.jsonl4  thresholds:5    task_success: ">= 0.90"  // did it finish the job?6    grounding: ">= 0.95"  // claims backed by sources7    tool_call_accuracy: ">= 0.92"8    p95_latency_ms: "<= 8000"9    cost_per_run_usd: "<= 0.12"10  on_regression: block_merge  // fail the build, not prod

An eval gate in CI: every change to the agent bundle must clear the thresholds before it can ship.

Trace every run in production

Step-level traces

Capture each reasoning step, tool call, input, and output so you can replay exactly what the agent did when a run goes sideways.

Live metrics

Track task success, latency, error rate, and cost per run as time series, sliced by bundle version and tenant.

Alerts on drift

Page someone when grounding drops, cost spikes, or tool errors climb — model and data drift are silent without alarms.

Close the loop with production data

Offline evals catch known failures; production catches the unknown ones. Sample real runs — especially failures, refusals, and low-confidence answers — and feed them back into the golden set so your eval suite grows toward the cases that actually break. This flywheel, where production traces become tomorrow's tests, is what separates an agent that decays from one that improves.

Lean on an LLM-as-judge and rubric evals for the subjective dimensions humans can't score at scale, and reserve human review for the ambiguous tail. The full methodology lives in our guide to agent observability.

Ship safely

Rollouts, rollbacks, and human-in-the-loop

The last mile is releasing changes without breaking trust, and keeping a person in the loop wherever a mistake would be costly or irreversible.

Because the agent bundle changes far more often than your code, treat rollout as a routine, low-drama event. Ship the new bundle to a canary — a small slice of live traffic — and watch the same metrics you gate on in CI, now against real users. If task success, grounding, latency, and cost all hold, ramp the rollout in stages to 100%. If anything regresses, roll back by flipping traffic to the previous bundle, which you kept warm precisely for this moment. A rollback should be a config change measured in seconds, never a redeploy measured in minutes.

Human-in-the-loop checkpoints are the safety valve for actions that can't be undone. The agent reasons and prepares the action, then pauses and surfaces a clear approval request before the tool actually fires — a refund, a deletion, an outbound message, a production change. Reversible, low-stakes steps run autonomously; consequential ones wait for a person. As eval scores and trust grow, widen the agent's autonomy deliberately, and always keep an audit trail of who or what approved each action.

Canary5% of live traffic

Watch metricsSuccess, cost, latency

Human gateApprove risky actions

RampStage up to 100%

Rollback readyFlip to last good bundle

A canary rollout with a human gate on consequential actions and a one-flip path back to the last good bundle.

Before you flip the switch

Production-readiness checklist

Prompt, tools, and model versioned as one bundle — Every trace, eval, and rollback points to a single bundle ID.
Hosting matches run length — Serverless for short turns, queue workers for long, stateless runs.
All state externalized and checkpointed — Any worker can resume any run from durable storage.
Step caps and per-tenant budgets enforced — No run can loop or spend without a hard ceiling.
Concurrency and rate-limit handling in place — Backoff, retries, and queueing absorb 429s gracefully.
Secrets injected from a vault, never in context — The model invokes tools without ever seeing credentials.
Tools scoped to least privilege — Read and write separated; destructive writes gated.
Eval gate blocks regressions in CI — Task success, grounding, latency, and cost thresholds enforced.
Tracing and alerting live in production — Step-level traces with alarms on drift and cost spikes.
Canary, rollback, and human approvals wired — Ship to a slice, watch, ramp, and revert in seconds.

FAQ

Deploying agents, answered

Deploying an AI agent means promoting it from a developer's laptop to a managed environment where real users and real systems depend on it. Unlike a stateless web endpoint, an agent runs multi-step loops, calls tools that touch live data, holds state across turns, and incurs per-token cost on every reasoning step. Production deployment therefore covers hosting and scaling the run loop, persisting memory and state, controlling concurrency and spend, locking down secrets and tool permissions, and wiring observability and evals so you can detect regressions before users do. In short: it is the operational discipline (often called LLMOps) of running an agent reliably, safely, and affordably at scale.

Keep learning

Go deeper on running agents in production

AI agent observabilityTrace, measure, and alert on every run AI agent securityPermissions, prompt injection, and blast radius AI agent evaluationGolden sets, judges, and CI eval gates LLM agentsThe reason-act loop you are deploying Platform docsDeploy, scale, and monitor on aiagentics SDKsWire agents into your stack and CI/CD

deploy AI agentsAI agents in productionproduction agentsscaling AI agentsagent infrastructureagent deploymentLLMOpsagent hosting

Get started

Ship an agent your ops team can trust

Hosting, state, evals, canaries, and human approvals — built in, so you deploy agents like real software. Free to start, no credit card required.

Start deploying free Read the docs