Inference
Inference is running a trained model to turn inputs into outputs: the moment a model actually does its job. Every prompt you send to an LLM triggers an inference run.
- Updated 2026
Inference is the process of running a trained model on new inputs to generate outputs — the live, production phase of a model, distinct from the training phase that created it.
When you send a prompt to a model, the system performs a forward pass: your tokens flow through the network's frozen weights and a prediction comes out. For an autoregressive large language model, inference is iterative: the model predicts one token, appends it to the sequence, and predicts the next, repeating until it emits a stop token or reaches a length limit. The weights never change during this process; inference only reads them.
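The predict, append, repeat loop described above can be sketched in a few lines of Python. This is a toy illustration, not how any production system is implemented: `WEIGHTS`, `forward`, and `generate` are hypothetical names, the "model" is a tiny hand-written lookup table keyed on the last token only, and decoding is greedy (always pick the highest-scoring token). A real LLM conditions on the whole context and usually samples.

```python
from typing import Dict, List

# Hypothetical "frozen weights": for each last token, a score for every
# candidate next token. Nothing in this script ever writes to this table.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "<bos>": {"the": 2.0, "cat": 0.5, "sat": 0.1, "<eos>": 0.0},
    "the":   {"the": 0.0, "cat": 2.0, "sat": 0.3, "<eos>": 0.1},
    "cat":   {"the": 0.1, "cat": 0.0, "sat": 2.0, "<eos>": 0.5},
    "sat":   {"the": 0.2, "cat": 0.1, "sat": 0.0, "<eos>": 2.0},
}


def forward(tokens: List[str]) -> Dict[str, float]:
    """One forward pass: read the weights and return next-token scores."""
    return WEIGHTS[tokens[-1]]


def generate(prompt: List[str], max_new_tokens: int = 10) -> List[str]:
    """Greedy autoregressive decoding: predict, append, repeat."""
    tokens = list(prompt)  # copy so the caller's prompt is not modified
    for _ in range(max_new_tokens):
        scores = forward(tokens)
        next_token = max(scores, key=scores.get)  # highest-scoring token
        if next_token == "<eos>":  # stop signal ends generation
            break
        tokens.append(next_token)
    return tokens


print(generate(["<bos>"]))  # ['<bos>', 'the', 'cat', 'sat']
```

Note that each new token costs one more call to `forward`, which is why longer outputs take longer to produce.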
Why it matters: inference is where the real-world bill is paid. Training is a large, mostly up-front cost, whereas inference runs on every single request, so its cost per token and latency define the economics of any AI feature. Longer prompts mean more tokens to process, and a context window filled with history raises both price and response time. Choosing a smaller model and trimming the prompt are the most direct levers you have.
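To make the per-token arithmetic concrete, here is a minimal sketch of a cost estimate for one request. The function name `request_cost` and every price below are made up for illustration, and the split into separate input and output prices is an assumption about the pricing scheme; substitute whatever rates your provider actually charges.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimated cost in dollars, with prices given per million tokens."""
    total = (
        input_tokens * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    )
    return total / 1_000_000


# Hypothetical prices: $1.00 per million input tokens, $4.00 per million output.
long_prompt = request_cost(8_000, 500, 1.0, 4.0)
trimmed_prompt = request_cost(2_000, 500, 1.0, 4.0)

print(f"long prompt:    ${long_prompt:.4f}")     # $0.0100
print(f"trimmed prompt: ${trimmed_prompt:.4f}")  # $0.0040
```

Under these assumed prices, trimming the prompt from 8,000 to 2,000 tokens cuts the cost of the request by more than half even though the output length is unchanged.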
A concrete example: an AI agent that solves a task in five reasoning steps issues five separate inference calls. Halve the number of steps, or route the simpler steps to a smaller model, and you can cut the agent's cost and wait time roughly in half, all without retraining anything. That is why teams obsess over inference efficiency long after the model itself is finished.
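The five-step agent can be modeled as a list of sequential calls whose costs and latencies simply add up. The sketch below assumes fixed per-call figures for a hypothetical large and small model; `Call`, `LARGE`, `SMALL`, and `run_totals` are illustrative names, and the numbers are invented rather than measured.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Call:
    cost_usd: float   # assumed cost of one inference call
    latency_s: float  # assumed wall-clock time of one call


# Hypothetical per-call figures for two model sizes.
LARGE = Call(cost_usd=0.010, latency_s=2.0)
SMALL = Call(cost_usd=0.002, latency_s=0.5)


def run_totals(calls: List[Call]) -> Tuple[float, float]:
    """Steps run one after another, so cost and latency both sum."""
    total_cost = sum(c.cost_usd for c in calls)
    total_latency = sum(c.latency_s for c in calls)
    return total_cost, total_latency


baseline = run_totals([LARGE] * 5)                          # five large-model steps
shorter = run_totals([LARGE] * 3)                           # fewer steps
routed = run_totals([LARGE, SMALL, SMALL, LARGE, SMALL])    # simple steps on small model

for name, (cost, latency) in [("baseline", baseline), ("shorter", shorter), ("routed", routed)]:
    print(f"{name:8s} cost=${cost:.3f} latency={latency:.1f}s")
```

With these assumed figures, routing three of the five steps to the smaller model brings the run from about $0.050 and 10 seconds down to about $0.026 and 5.5 seconds.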
Inference, answered
Inference is the act of running an already-trained model to produce outputs from new inputs. For a language model, that means feeding in a prompt and getting back generated tokens. It is the 'use' phase of a model, as opposed to training, which is the 'learn' phase.