What is the difference between inference and training?

Training adjusts the model's weights by processing huge datasets — it is expensive, done once (or periodically), and produces the model. Inference uses those frozen weights to answer a single request. Training builds the brain; inference is the brain thinking a thought. You train rarely and run inference constantly.
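The split can be sketched with a toy one-weight model. Everything here (the model `y = w * x`, the data, the learning rate) is an illustrative assumption and not how a production LLM is trained; the point is only that `train` changes the weight while `predict` reads it.

```python
# Toy model y = w * x. All names and numbers are illustrative assumptions.

def predict(w: float, x: float) -> float:
    """Inference: read the weight, never modify it."""
    return w * x


def train(w: float, data: list[tuple[float, float]],
          lr: float = 0.05, epochs: int = 200) -> float:
    """Training: repeatedly nudge the weight to reduce squared error."""
    for _ in range(epochs):
        for x, y in data:
            error = predict(w, x) - y
            w -= lr * 2 * error * x  # gradient of (w*x - y)^2 with respect to w
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # underlying rule: y = 2x
trained_w = train(0.0, data)                  # expensive step, done rarely

# Inference can now run any number of times against the same frozen weight.
outputs = [predict(trained_w, x) for x in (4.0, 5.0)]
```

Training loops over the whole dataset many times; each inference call is a single cheap read of the result.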

Why does inference cost and latency matter for AI agents?

An agent makes many model calls per task, one for each step in its reasoning loop, so each call's price and delay add up across the task. Inference cost is usually billed per token, and latency is usually measured in two ways: time to first token and time to last token. Optimizing prompt length, model choice, and the number of steps directly controls how fast and affordable an agent is.
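As a rough sketch of the two latency measurements, the helper below times any iterable of tokens. The function name `measure_latency` and the injectable `clock` parameter are assumptions made for illustration; a real streaming client would supply its own token iterator.

```python
import time
from typing import Callable, Iterable, Optional


def measure_latency(stream: Iterable[str],
                    clock: Callable[[], float] = time.perf_counter) -> dict:
    """Time a token stream: seconds until the first and the last token arrive."""
    start = clock()
    first_token_at: Optional[float] = None
    last_token_at: Optional[float] = None
    count = 0
    for _token in stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        count += 1
    return {
        "time_to_first_token": None if first_token_at is None else first_token_at - start,
        "time_to_last_token": None if last_token_at is None else last_token_at - start,
        "tokens": count,
    }


# Deterministic demo: a fake clock replaces real time so the numbers are repeatable.
fake_ticks = iter([0.0, 0.3, 0.4, 0.5])
report = measure_latency(["Hello", ",", " world"], clock=lambda: next(fake_ticks))
```

In an agent, time to last token is what each step waits on before the next call can start, so it is the figure that accumulates across the loop.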

Glossary

Inference

Inference is running a trained model to turn inputs into outputs: the moment a model actually does its job. Every prompt you send to an LLM triggers one inference call.

Updated 2026

Inference is the process of running a trained model on new inputs to generate outputs — the live, production phase of a model, distinct from the training phase that created it.

When you send a prompt to a model, the system performs a forward pass: your tokens flow through the network's frozen weights and out comes a prediction. For an autoregressive large language model, inference is iterative — the model predicts one token, appends it, and predicts the next, repeating until it emits a stop signal. The weights never change during this; inference only reads them.
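The predict, append, repeat loop can be sketched with a toy stand-in for the model. The `NEXT_TOKEN` table below is a hypothetical substitute for trained weights, and it looks only at the last token, whereas a real LLM conditions on the whole sequence; the loop structure is the part being illustrated.

```python
STOP = "<eos>"  # hypothetical stop signal

# Hypothetical "frozen weights": a fixed next-token table standing in for a network.
NEXT_TOKEN = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": STOP,
}


def forward_pass(tokens: list[str]) -> str:
    """One forward pass: read the frozen table and predict the next token."""
    return NEXT_TOKEN.get(tokens[-1], STOP)


def generate(prompt: list[str], max_new_tokens: int = 16) -> list[str]:
    """Autoregressive decoding: predict one token, append it, and repeat."""
    tokens = list(prompt)  # copy so the caller's prompt is left untouched
    for _ in range(max_new_tokens):
        next_token = forward_pass(tokens)
        if next_token == STOP:
            break
        tokens.append(next_token)
    return tokens


result = generate(["the"])
```

Nothing in `generate` writes to `NEXT_TOKEN`, which mirrors the read-only role of weights during inference. The `max_new_tokens` cap is a common safeguard in case the stop signal never appears.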

Why it matters: inference is where the real-world bill is paid. Unlike training, a large cost paid once or at infrequent intervals, inference happens on every single request, so its cost per token and latency define the economics of any AI feature. Longer prompts mean more tokens to process, and a bigger context window filled with history raises both price and response time. Picking a smaller model and trimming the prompt are the most direct levers you have.
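To see how accumulated history inflates the bill, here is a small cost sketch. The prices and token counts are made up for illustration, and it assumes every turn resends the full conversation so far, which is a common pattern but not universal.

```python
def call_cost(prompt_tokens: int, output_tokens: int,
              price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one inference call under per-token billing (hypothetical prices)."""
    return (prompt_tokens / 1000 * price_in_per_1k
            + output_tokens / 1000 * price_out_per_1k)


def conversation_cost(turns: list[tuple[int, int]],
                      price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Total cost when each turn resends the whole history as part of its prompt.

    turns: (new_user_tokens, reply_tokens) for each turn, in order.
    """
    history = 0
    total = 0.0
    for user_tokens, reply_tokens in turns:
        prompt_tokens = history + user_tokens     # history is paid for again
        total += call_cost(prompt_tokens, reply_tokens,
                           price_in_per_1k, price_out_per_1k)
        history = prompt_tokens + reply_tokens    # the reply joins the history
    return total


turns = [(100, 200)] * 3  # three turns: 100 new input tokens, 200 reply tokens each
with_history = conversation_cost(turns, price_in_per_1k=1.0, price_out_per_1k=2.0)
without_history = sum(call_cost(u, r, 1.0, 2.0) for u, r in turns)
```

Under these invented prices the three turns cost about 2.4 units with history resent, against about 1.5 if each turn carried only its own 100-token message. The gap widens with every additional turn.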

A concrete example: an AI agent solving a task in five reasoning steps issues five separate inference calls. Cut the number of steps or route the simple steps to a smaller model, and the agent's cost and wait time drop roughly in proportion, without retraining anything. That is why teams obsess over inference efficiency long after the model itself is finished.
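The five-step example can be put into numbers with a short sketch. The model names, prices, and per-call latencies below are hypothetical, and the steps are assumed to run one after another so their latencies add.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    price_per_1k_tokens: float  # hypothetical blended price per 1,000 tokens
    seconds_per_call: float     # hypothetical average latency of one call


# Hypothetical model tiers; substitute measured figures for real planning.
MODELS = {
    "large": ModelProfile(price_per_1k_tokens=10.0, seconds_per_call=2.0),
    "small": ModelProfile(price_per_1k_tokens=1.0, seconds_per_call=0.5),
}


def agent_totals(steps: list[tuple[str, int]]) -> tuple[float, float]:
    """Sum cost and latency over sequential steps given as (model_name, tokens)."""
    cost = 0.0
    latency = 0.0
    for model_name, tokens in steps:
        profile = MODELS[model_name]
        cost += tokens / 1000 * profile.price_per_1k_tokens
        latency += profile.seconds_per_call
    return cost, latency


# Five steps on the large model versus three of them routed to the small model.
baseline = agent_totals([("large", 1000)] * 5)
routed = agent_totals([("large", 1000)] * 2 + [("small", 1000)] * 3)
```

With these assumed figures the baseline comes to 50 cost units and 10 seconds, while the routed version comes to 23 units and 5.5 seconds. The exact saving depends entirely on the real price and speed gap between the two models.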

Related terms

Terms that surround inference

Large language model: The trained network that inference runs — predicting tokens to generate language. See /glossary/large-language-model.
Context window: The token budget each inference call works within; larger windows raise cost and latency. See /glossary/context-window.
Fine-tuning: Further training a base model on specialized data so later inference is more accurate for your domain. See /glossary/fine-tuning.

FAQ

Inference, answered

What is inference?

Inference is the act of running an already-trained model to produce outputs from new inputs. For a language model, that means feeding in a prompt and getting back generated tokens. It is the 'use' phase of a model, as opposed to training, which is the 'learn' phase.

Keep learning

Go deeper

LLM agents: reasoning + tool use. How agents stack many inference calls into a loop.
Large language model: The model that inference runs.
Fine-tuning: Specialize a model before inference.
