Why do AI agents need memory if LLMs already have a context window?

A context window is finite — even a 200K-token window can't hold an entire knowledge base, a month of chat history, and a 10-step task trace at once. Stuffing everything into the prompt is also slow and expensive, and models pay less attention to information buried in the middle of a long context. Memory solves this by keeping the prompt small: the agent stores everything externally and retrieves only the few most relevant chunks for each step.

What is RAG (retrieval-augmented generation)?

RAG is a pattern where, instead of relying only on what the model learned during training, you fetch relevant information at query time and inject it into the prompt before generation. The query is embedded into a vector, a vector database returns the most similar stored chunks, and those chunks are added to the prompt as grounded context. RAG is the backbone of long-term memory and the main way agents stay accurate, current, and grounded in private or domain-specific data.

What is a vector database and why does agent memory use one?

A vector database stores embeddings — numeric representations of text where similar meanings sit close together in high-dimensional space — and supports fast nearest-neighbor search. Agent memory uses one because it lets you retrieve by meaning rather than exact keywords: a question about 'cancelling my plan' can surface a document about 'subscription termination' even with no shared words. Popular options include pgvector, Pinecone, Weaviate, Qdrant, Chroma, and Milvus.

When should I use RAG versus fine-tuning or a long context window?

Use RAG when knowledge changes often, must be cited, or is too large to fit in a prompt — it is the default for grounded, up-to-date answers. Use fine-tuning to teach a model a style, format, or skill rather than facts. Use a long context window for one-off tasks where all the relevant material is already at hand. In practice, production agents combine them: a long window for the active task, RAG for durable knowledge, and occasionally fine-tuning for behavior.


AI agent memory, RAG, and vector databases

Without memory, an agent forgets everything the moment its context window fills up. This guide explains how AI agent memory works — short-term vs long-term, the RAG retrieval pipeline, embeddings, and the vector databases that let agents recall facts and ground their answers in your data.

Updated 2026


AI agent memory is everything an agent can remember beyond a single model call. A bare large language model (LLM) is stateless — it sees only what you put in the prompt and forgets it the instant the response is returned. Memory is the scaffolding that turns that stateless predictor into something that can recall a user's preferences, reuse the result of a previous step, and ground its answers in private, up-to-date knowledge.

The reason memory is unavoidable comes down to one hard limit: the context window. Every model can only read a fixed number of tokens at once. Even a generous 200K-token window cannot hold an entire product catalog, months of conversation history, and a long multi-step task trace simultaneously. And the more you cram in, the slower and pricier each call becomes — while the model pays measurably less attention to information stranded in the middle of a huge prompt.

The fix is to keep the prompt small and store everything else externally, then retrieve only the few most relevant pieces for each step. That retrieval pattern — embed a query, search a vector database, and inject the best matches back into the prompt — is called RAG (retrieval-augmented generation), and it is the engine behind nearly every agent's long-term memory. It pairs naturally with tools and the reasoning loop you see in LLM agents.

The taxonomy

The types of AI agent memory

Borrowing from cognitive science, agent memory is usually split into six kinds, each answering a different question about what the agent should retain.

Short-term / working

The live context window for the current task — the system prompt, recent messages, and intermediate results. Fast and exact, but volatile: it vanishes when the window fills or the run ends.

Long-term memory

Durable knowledge persisted outside the model, typically as embeddings in a vector store. Retrieved on demand so the agent can recall facts and history from days or weeks ago.

Episodic memory

A record of specific past events: previous conversations, tool calls, and the outcomes of earlier task runs. Lets an agent say 'last time this approach failed' and adapt.

Semantic memory

General facts and domain knowledge — documentation, policies, product specs — independent of any single event. This is the classic target of a RAG knowledge base.

Procedural memory

Learned how-to knowledge: reusable skills, prompt recipes, and successful workflows the agent can apply again without re-deriving them from scratch.

User / entity memory

A structured profile of who the agent is helping — names, preferences, account state, and prior decisions — so interactions stay personalized and consistent over time.

How they work together

A well-built agent uses several memory types at once: short-term memory drives the current step, semantic memory grounds its facts, episodic memory helps it learn from past runs, and user memory keeps the experience personal. RAG is the retrieval mechanism that makes the long-term kinds usable inside a finite prompt.
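To make that division of labor concrete, here is a minimal TypeScript sketch of how retrieved long-term memories and short-term messages might share one finite prompt. The `MemoryItem` shape, the `buildPrompt` function, and the rough four-characters-per-token estimate are all assumptions made for illustration; they are not taken from any particular framework.

```typescript
// Hypothetical shapes for illustration; not taken from any specific library.
type MemoryKind = "semantic" | "episodic" | "procedural" | "user";

interface MemoryItem {
  kind: MemoryKind;
  text: string;
}

// Crude token estimate (assumption: roughly 4 characters per token).
// A real system would use the model's own tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Combine retrieved long-term memories with the most recent messages,
// dropping the oldest messages first when the budget would be exceeded.
function buildPrompt(
  systemPrompt: string,
  retrieved: MemoryItem[],
  recentMessages: string[],
  maxTokens: number,
): string {
  const memoryBlock = retrieved
    .map((m) => `[${m.kind}] ${m.text}`)
    .join("\n");

  // The system prompt and retrieved memories are always included.
  const fixed = [systemPrompt, memoryBlock].filter((s) => s.length > 0);
  let used = fixed.reduce((sum, s) => sum + estimateTokens(s), 0);

  // Walk backwards so the newest messages are kept.
  const kept: string[] = [];
  for (let i = recentMessages.length - 1; i >= 0; i--) {
    const msg = recentMessages[i] ?? "";
    const cost = estimateTokens(msg);
    if (used + cost > maxTokens) break;
    kept.unshift(msg);
    used += cost;
  }

  return [...fixed, ...kept].join("\n");
}
```

In this sketch the long-term memories are treated as fixed cost and the conversation history absorbs the trimming. Other designs might summarize old messages instead of dropping them.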

Retrieval-augmented generation

The RAG retrieval pipeline, step by step

Long-term memory comes alive through retrieval. Here is the path a query takes from raw text to a grounded, augmented prompt.

Query: User goal or sub-task
Embed: Text → vector
Vector search: Top-k nearest neighbors
Rerank: Score & trim to the best
Augment prompt: Inject retrieved context
Generate: Grounded answer

The RAG pipeline: each query is embedded, matched against the vector store, reranked, and folded into the prompt before generation.

Each stage earns its place. Embedding converts the query into a vector so it can be compared by meaning. Vector search returns the top-k most similar chunks, usually 5 to 20 candidates. A reranker then re-scores those candidates with a more precise model and keeps only the few that truly matter, which cuts noise and token cost. Finally, those chunks are injected into the prompt as grounded context, and the model generates an answer it can actually cite.

Skipping the rerank step is a common cause of "RAG that almost works": raw vector similarity is good at recall but weak at precision, so a second-pass reranker is often what makes retrieval trustworthy in production.
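The stages above can be sketched in a few lines of TypeScript. This is a minimal outline under stated assumptions: the `RagDeps` interface (with its `embed`, `search`, and `rerank` methods), the `Chunk` shape, and the prompt wording are hypothetical stand-ins for whatever embedding model, vector store, and reranker you actually use. The final generation call is left out because it depends on your model provider.

```typescript
// Hypothetical interfaces: stand-ins for a real embedding model,
// vector store, and reranker.
interface Chunk {
  id: string;
  text: string;
  source: string;
}

interface RagDeps {
  embed(text: string): Promise<number[]>;
  search(vector: number[], topK: number): Promise<Chunk[]>;
  // Expected to return the candidates ordered best first.
  rerank(query: string, candidates: Chunk[]): Promise<Chunk[]>;
}

// Query -> embed -> vector search -> rerank -> augmented prompt.
async function buildAugmentedPrompt(
  query: string,
  deps: RagDeps,
  topK = 20,
  keep = 3,
): Promise<string> {
  const queryVector = await deps.embed(query);
  const candidates = await deps.search(queryVector, topK); // wide recall
  const ranked = await deps.rerank(query, candidates); // precise ordering
  const best = ranked.slice(0, keep); // trim to the best few

  // Number each chunk so the model can cite it.
  const context = best
    .map((c, i) => `[${i + 1}] (${c.source}) ${c.text}`)
    .join("\n");

  return [
    "Answer using only the context below and cite sources by number.",
    "Context:",
    context,
    `Question: ${query}`,
  ].join("\n\n");
}
```

Passing the dependencies in as an object keeps the pipeline independent of any one vendor and makes it easy to test with fakes.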

Embeddings & vector stores

How agents store and retrieve a memory

Under the hood, every long-term memory is a vector. Writing one is an upsert; recalling one is a similarity search.

The mechanics

From text to vector and back again

An embedding model maps a piece of text to a list of numbers — a 768- or 1536-dimensional vector — positioned so that semantically similar text lands nearby. Storing a memory means embedding it once and writing the vector (plus metadata) to the database.

Recall is the reverse: embed the incoming query, ask the vector store for its nearest neighbors by cosine similarity, and you get back the most relevant memories — even when they share no literal keywords with the query.

Chunk documents into focused passages before embedding
Attach metadata (source, date, user) for filtered retrieval
Use cosine similarity as the metric for nearest-neighbor search
Cite sources back to the user for trust and auditability
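For intuition, cosine similarity and a filtered top-k search can be written out by hand. The sketch below is a brute-force, in-memory illustration only: the `MemoryRecord` shape and the `searchTopK` function are hypothetical, and a real vector database would typically use an approximate nearest-neighbor index rather than scanning every record.

```typescript
// Illustrative in-memory record; field names are assumptions.
interface MemoryRecord {
  id: string;
  vector: number[];
  text: string;
  metadata: Record<string, string>;
}

// cos(a, b) = (a . b) / (|a| * |b|); returns 0 for a zero-magnitude vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-k search with an optional exact-match metadata filter.
function searchTopK(
  records: MemoryRecord[],
  query: number[],
  topK: number,
  filter: Record<string, string> = {},
): { record: MemoryRecord; score: number }[] {
  return records
    .filter((r) =>
      Object.entries(filter).every(([key, value]) => r.metadata[key] === value),
    )
    .map((record) => ({ record, score: cosineSimilarity(query, record.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}
```

Filtering on metadata before scoring is what lets one store serve many users: a query for one `userId` never sees another user's memories, however similar the vectors are.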


memory.ts (TypeScript)

    // `embed`, `store`, and `userId` are assumed to be defined elsewhere.

    // 1. Store a memory (keep the raw text alongside the vector so it can be returned on recall)
    const text = "User prefers email over phone";
    const vector = await embed(text);
    await store.upsert({
      id: "mem_8821", vector, text,
      metadata: { userId, kind: "preference" },
    });

    // 2. Recall relevant memories
    const q = await embed("How should I contact them?");
    const hits = await store.search({ vector: q, topK: 5 });
    const context = hits.map(h => h.text).join("\n");
    // → inject `context` into the prompt

Storing a memory is an upsert; recalling one is a top-k similarity search.

Choosing an approach

RAG vs fine-tuning vs long context

There are three ways to get knowledge into a model. They solve different problems — and the best agents use them together.

Dimension	RAG	Fine-tuning	Long context
Best for	Facts & private data	Style & behavior	One-off, self-contained tasks
Knowledge stays current
Update cost	Low (re-index)	High (retrain)	None
Can cite sources
Scales to large corpora
Cost per query	Low	Low	High
Setup effort	Medium	High	Low

The rule of thumb: RAG for facts, fine-tuning for behavior, long context for convenience. If your knowledge changes weekly or must be cited, reach for RAG. If you need the model to speak in a specific format or follow a niche skill, fine-tune. If everything the task needs already fits in the prompt, a long context window is the simplest path. Production agents routinely blend all three.
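The rule of thumb can be written down as a small decision helper. This is only an illustrative encoding of the guidance above: the `TaskProfile` fields and the `suggestApproaches` function are hypothetical names, and real decisions will weigh cost and setup effort as well.

```typescript
type Approach = "RAG" | "fine-tuning" | "long context";

// Hypothetical description of a task; the field names are illustrative.
interface TaskProfile {
  knowledgeChangesOften: boolean;
  mustCiteSources: boolean;
  needsSpecificFormatOrSkill: boolean;
  fitsInPrompt: boolean;
}

// Returns every approach the rule of thumb suggests; they can be combined.
function suggestApproaches(task: TaskProfile): Approach[] {
  const result: Approach[] = [];
  // Facts that change, need citations, or do not fit in a prompt -> RAG.
  if (task.knowledgeChangesOften || task.mustCiteSources || !task.fitsInPrompt) {
    result.push("RAG");
  }
  // Style, format, or a niche skill -> fine-tuning.
  if (task.needsSpecificFormatOrSkill) {
    result.push("fine-tuning");
  }
  // Everything already fits in the prompt -> long context.
  if (task.fitsInPrompt) {
    result.push("long context");
  }
  return result;
}
```

Returning a list rather than a single answer reflects the point that production agents often blend the approaches.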

Key terms

AI agent memory glossary

The vocabulary you'll meet across every memory and retrieval system.

Embedding: A numeric vector that represents the meaning of a piece of text, image, or other data. Similar meanings produce vectors that sit close together, enabling search by semantics rather than keywords.
Vector store / database: A database optimized for storing embeddings and running fast nearest-neighbor search. Examples include pgvector, Pinecone, Weaviate, Qdrant, and Chroma.
Chunking: Splitting source documents into smaller, focused passages before embedding. Good chunking (with sensible size and overlap) is one of the biggest levers on retrieval quality.
Retrieval: The act of fetching the most relevant stored chunks for a given query — typically a top-k similarity search, optionally followed by a reranking pass for precision.
Context window: The maximum number of tokens a model can read in a single call. Its finite size is the core reason agents externalize memory instead of keeping everything in the prompt.
Top-k & reranking: Top-k returns the k nearest vectors to a query; reranking re-scores those candidates with a stronger model to keep only the few that are genuinely relevant.
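The chunking entry above mentions size and overlap. A minimal sketch of fixed-size chunking with overlap might look like the following, assuming words are an acceptable stand-in for tokens. The `chunkText` function and its parameters are hypothetical, and production systems often split on sentence or heading boundaries instead.

```typescript
// Split text into word-based chunks of `size` words, where each chunk
// repeats the last `overlap` words of the previous one.
// Assumption: words are used as a rough stand-in for tokens.
function chunkText(text: string, size: number, overlap: number): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Require size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this chunk reached the end
  }
  return chunks;
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.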

FAQ

AI agent memory, answered

What is AI agent memory?

AI agent memory is the set of mechanisms an agent uses to retain and recall information beyond a single model call. It spans short-term working memory (the live context window for the current task) and long-term memory, which is usually facts, past interactions, and documents stored in a vector database and retrieved on demand. Memory is what lets an agent remember a user's preferences from last week, recall the outcome of a previous step, and ground its answers in your private data instead of guessing.
