AI agent memory, RAG, and vector databases

Without memory, an agent forgets everything the moment its context window fills up. This guide explains how AI agent memory works — short-term vs long-term, the RAG retrieval pipeline, embeddings, and the vector databases that let agents recall facts and ground their answers in your data.

  • 11 min read
  • Intermediate
  • Updated 2026

AI agent memory is everything an agent can remember beyond a single model call. A bare large language model (LLM) is stateless — it sees only what you put in the prompt and forgets it the instant the response is returned. Memory is the scaffolding that turns that stateless predictor into something that can recall a user's preferences, reuse the result of a previous step, and ground its answers in private, up-to-date knowledge.

The reason memory is unavoidable comes down to one hard limit: the context window. Every model can only read a fixed number of tokens at once. Even a generous 200K-token window cannot hold an entire product catalog, months of conversation history, and a long multi-step task trace simultaneously. And the more you cram in, the slower and pricier each call becomes — while the model pays measurably less attention to information stranded in the middle of a huge prompt.

The fix is to keep the prompt small and store everything else externally, then retrieve only the few most relevant pieces for each step. That retrieval pattern — embed a query, search a vector database, and inject the best matches back into the prompt — is called RAG (retrieval-augmented generation), and it is the engine behind nearly every agent's long-term memory. It pairs naturally with tools and the reasoning loop you see in LLM agents.

The taxonomy

The types of AI agent memory

Borrowing from cognitive science, agent memory is usually split into the six kinds below, each answering a different question about what the agent should retain.

Short-term / working

The live context window for the current task — the system prompt, recent messages, and intermediate results. Fast and exact, but volatile: it vanishes when the window fills or the run ends.

Long-term memory

Durable knowledge persisted outside the model, typically as embeddings in a vector store. Retrieved on demand so the agent can recall facts and history from days or weeks ago.

Episodic memory

A record of specific past events: previous conversations, tool calls, and the outcomes of earlier task runs. Lets an agent say 'last time this approach failed' and adapt.

Semantic memory

General facts and domain knowledge — documentation, policies, product specs — independent of any single event. This is the classic target of a RAG knowledge base.

Procedural memory

Learned how-to knowledge: reusable skills, prompt recipes, and successful workflows the agent can apply again without re-deriving them from scratch.

User / entity memory

A structured profile of who the agent is helping — names, preferences, account state, and prior decisions — so interactions stay personalized and consistent over time.

How they work together

A well-built agent uses several memory types at once: short-term memory drives the current step, semantic memory grounds its facts, episodic memory helps it learn from past runs, and user memory keeps the experience personal. RAG is the retrieval mechanism that makes the long-term kinds usable inside a finite prompt.
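As a rough illustration of how these memory types share one finite prompt, here is a minimal sketch that places the durable pieces first and then fills the remaining budget with the newest messages. The `MemorySources` shape, the function names, and the four-characters-per-token estimate are all assumptions made for this example, not part of any particular framework.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical container for the memory types described above.
interface MemorySources {
  systemPrompt: string;
  userProfile: Record<string, string>; // user / entity memory
  retrievedFacts: string[]; // semantic memory, e.g. returned by RAG
  pastEpisodes: string[]; // episodic memory
  recentMessages: ChatMessage[]; // short-term / working memory, oldest first
}

// Rough estimate (assumption): about 4 characters per token for English text.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Walk backwards from the newest message and keep whatever still fits.
function trimToBudget(messages: ChatMessage[], budget: number): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message === undefined) continue;
    const cost = estimateTokens(message.content);
    if (used + cost > budget) break;
    kept.unshift(message); // unshift preserves chronological order
    used += cost;
  }
  return kept;
}

function assemblePrompt(memory: MemorySources, maxTokens: number): string {
  const profile = Object.keys(memory.userProfile)
    .map((key) => `${key}: ${memory.userProfile[key]}`)
    .join("\n");
  const fixed = [
    memory.systemPrompt,
    `User profile:\n${profile}`,
    `Relevant facts:\n${memory.retrievedFacts.join("\n")}`,
    `Past runs:\n${memory.pastEpisodes.join("\n")}`,
  ].join("\n\n");
  // Whatever budget is left after the durable sections goes to recent messages.
  const remaining = Math.max(0, maxTokens - estimateTokens(fixed));
  const recent = trimToBudget(memory.recentMessages, remaining)
    .map((m) => `${m.role}: ${m.content}`)
    .join("\n");
  return `${fixed}\n\nConversation:\n${recent}`;
}

// Example data is invented for illustration.
const assembled = assemblePrompt(
  {
    systemPrompt: "You are a support agent.",
    userProfile: { name: "Dana", contact: "email" },
    retrievedFacts: ["Refunds are issued within 14 days."],
    pastEpisodes: ["Last run: a phone call attempt failed."],
    recentMessages: [
      { role: "user", content: "x".repeat(400) }, // old, oversized message
      { role: "user", content: "Where is my refund?" },
    ],
  },
  80,
);
console.log(assembled);
```

With an 80-token budget the oversized older message is dropped while the newest one survives, which is the volatility of working memory in miniature. A production agent would count tokens with the model's real tokenizer rather than a character estimate.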

Retrieval-augmented generation

The RAG retrieval pipeline, step by step

Long-term memory comes alive through retrieval. Here is the path a query takes from raw text to a grounded, augmented prompt.

  • Query: user goal or sub-task
  • Embed: text → vector
  • Vector search: top-k nearest neighbors
  • Rerank: score and trim to the best
  • Augment prompt: inject retrieved context
  • Generate: grounded answer

The RAG pipeline: each query is embedded, matched against the vector store, reranked, and folded into the prompt before generation.

Each stage earns its place. Embedding converts the query into a vector so it can be compared by meaning. Vector search returns the top-k most similar chunks, usually 5 to 20 candidates. A reranker then re-scores those candidates with a more precise model and keeps only the few that truly matter, which cuts noise and token cost. Finally, those chunks are injected into the prompt as grounded context, and the model generates an answer that can cite its sources.
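The stages above can be sketched end to end in a few functions. This is a toy under stated assumptions: `embed` is a stand-in that counts a tiny fixed vocabulary instead of calling an embedding model, `rerank` is a stand-in word-overlap scorer instead of a real reranking model, and the store is a plain array. In a real system the embed and search calls would typically be asynchronous network requests.

```typescript
type Vector = number[];
interface StoredChunk { id: string; text: string; vector: Vector }
interface Candidate { id: string; text: string; score: number }

const words = (text: string): string[] => text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

// Stand-in embedder (assumption): one dimension per vocabulary word.
const VOCAB = ["refunds", "shipping", "password", "reset", "days"];
function embed(text: string): Vector {
  const tokens = words(text);
  return VOCAB.map((term) => tokens.filter((t) => t === term).length);
}

function cosine(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Vector search: score every chunk and keep the topK most similar.
function vectorSearch(query: Vector, chunks: StoredChunk[], topK: number): Candidate[] {
  return chunks
    .map((c) => ({ id: c.id, text: c.text, score: cosine(query, c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Stand-in reranker (assumption): fraction of question words found in the chunk.
function rerank(question: string, candidates: Candidate[], keep: number): Candidate[] {
  const questionWords = words(question);
  return candidates
    .map((c) => {
      const chunkWords = words(c.text);
      const matched = questionWords.filter((w) => chunkWords.includes(w)).length;
      return { ...c, score: questionWords.length === 0 ? 0 : matched / questionWords.length };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, keep);
}

// Augment: fold the surviving chunks into the prompt, tagged with ids for citation.
function augmentPrompt(question: string, context: Candidate[]): string {
  const sources = context.map((c) => `[${c.id}] ${c.text}`).join("\n");
  return `Answer using only the sources below and cite their ids.\n\nSources:\n${sources}\n\nQuestion: ${question}`;
}

function buildRagPrompt(question: string, chunks: StoredChunk[], topK = 5, keep = 2): string {
  const candidates = vectorSearch(embed(question), chunks, topK);
  return augmentPrompt(question, rerank(question, candidates, keep));
}

// Example corpus, invented for illustration.
const texts = [
  "Refunds are issued within 14 days of purchase.",
  "Shipping takes 3 to 5 business days.",
  "To reset your password, open account settings.",
];
const store: StoredChunk[] = texts.map((text, i) => ({ id: `doc_${i}`, text, vector: embed(text) }));
const ragPrompt = buildRagPrompt("How many days do refunds take?", store, 3, 1);
console.log(ragPrompt);
```

Here the vector search returns all three chunks as candidates and the rerank pass trims them to the single refund passage, so only that source reaches the prompt. The final string would then be sent to the model for the generate step.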

Skipping the rerank step is a common cause of "RAG that almost works": raw vector similarity is good at recall but weaker at precision, so a second-pass reranker is often what makes retrieval trustworthy in production.

Embeddings & vector stores

How agents store and retrieve a memory

Under the hood, every long-term memory is a vector. Writing one is an upsert; recalling one is a similarity search.

The mechanics

From text to vector and back again

An embedding model maps a piece of text to a list of numbers (commonly a vector of 768 or 1,536 dimensions, depending on the model) positioned so that semantically similar text lands nearby. Storing a memory means embedding it once and writing the vector (plus metadata) to the database.

Recall is the reverse: embed the incoming query, ask the vector store for its nearest neighbors by cosine similarity, and you get back the most relevant memories — even when they share no literal keywords with the query.
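Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares direction rather than magnitude. The sketch below implements it along with a brute-force top-k search over an in-memory array. The `MemoryRecord` shape and the hand-made three-dimensional vectors are assumptions for illustration; real embeddings have hundreds of dimensions, and a vector database would normally use an approximate index rather than scoring every record.

```typescript
// cosine(a, b) = dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface MemoryRecord {
  id: string;
  text: string;
  vector: number[];
}

// Brute-force top-k: score every record, sort descending, keep the first k.
function topK(query: number[], records: MemoryRecord[], k: number) {
  return records
    .map((record) => ({ record, score: cosineSimilarity(query, record.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hand-made vectors stand in for real embeddings.
const records: MemoryRecord[] = [
  { id: "m1", text: "User prefers email over phone", vector: [0.9, 0.1, 0.0] },
  { id: "m2", text: "Invoice is due on the 1st", vector: [0.0, 0.2, 0.9] },
  { id: "m3", text: "Do not call before noon", vector: [0.7, 0.6, 0.1] },
];
const hits = topK([1, 0.2, 0], records, 2);
console.log(hits.map((h) => h.record.id)); // logs the ids m1 and m3, in that order
```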

  • Chunk documents into focused passages before embedding
  • Attach metadata (source, date, user) for filtered retrieval
  • Use cosine similarity as the metric for nearest-neighbor search
  • Cite sources back to the user for trust and auditability
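The first two practices in the list above, chunking and attaching metadata, can be sketched as a single helper. This version splits on whitespace and measures size in words, which is a simplifying assumption; production pipelines often count tokens or split on sentence and heading boundaries instead. The `Chunk` shape and the `handbook.md` source name are hypothetical.

```typescript
interface Chunk {
  text: string;
  metadata: { source: string; index: number };
}

// Split text into windows of `size` words, each sharing `overlap` words with the previous one.
function chunkText(text: string, source: string, size = 200, overlap = 40): Chunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("need size > 0 and 0 <= overlap < size");
  }
  const tokens = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: Chunk[] = [];
  const step = size - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    const piece = tokens.slice(start, start + size).join(" ");
    chunks.push({ text: piece, metadata: { source, index: chunks.length } });
    if (start + size >= tokens.length) break; // this window already reached the end
  }
  return chunks;
}

// Ten words, windows of four, overlapping by one.
const sample = Array.from({ length: 10 }, (_, i) => `w${i}`).join(" ");
const chunks = chunkText(sample, "handbook.md", 4, 1);
console.log(chunks.map((c) => c.text));
// logs: "w0 w1 w2 w3", "w3 w4 w5 w6", "w6 w7 w8 w9"
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, and the `source` and `index` fields give you something to filter on and cite later.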

memory.ts (TypeScript)

// `embed`, `store`, and `userId` are assumed to come from your embedding
// client, vector store client, and session.

// 1. Store a memory
const text = "User prefers email over phone";
const vector = await embed(text);
await store.upsert({
  id: "mem_8821",
  text, // keep the original text so recall can return it
  vector,
  metadata: { userId, kind: "preference" },
});

// 2. Recall relevant memories
const q = await embed("How should I contact them?");
const hits = await store.search({ vector: q, topK: 5 });
const context = hits.map(h => h.text).join("\n");
// → inject `context` into the prompt

Storing a memory is an upsert; recalling one is a top-k similarity search.

Choosing an approach

RAG vs fine-tuning vs long context

There are three ways to get knowledge into a model. They solve different problems — and the best agents use them together.

| Dimension | RAG | Fine-tuning | Long context |
|---|---|---|---|
| Best for | Facts & private data | Style & behavior | One-off, self-contained tasks |
| Knowledge stays current | | | |
| Update cost | Low (re-index) | High (retrain) | None |
| Can cite sources | | | |
| Scales to large corpora | | | |
| Cost per query | Low | Low | High |
| Setup effort | Medium | High | Low |

The rule of thumb: RAG for facts, fine-tuning for behavior, long context for convenience. If your knowledge changes weekly or must be cited, reach for RAG. If you need the model to speak in a specific format or follow a niche skill, fine-tune. If everything the task needs already fits in the prompt, a long context window is the simplest path. Production agents routinely blend all three.
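The rule of thumb can be written down as a small decision helper. This is only a sketch of the heuristic in the paragraph above: the `TaskProfile` fields and the function name are invented for illustration, and real decisions also weigh cost, latency, and team effort.

```typescript
type Approach = "RAG" | "fine-tuning" | "long context";

// Hypothetical description of a task, one flag per rule in the text.
interface TaskProfile {
  knowledgeChangesOften: boolean;
  mustCiteSources: boolean;
  needsSpecificFormatOrSkill: boolean;
  fitsInPrompt: boolean;
}

// Returns every approach the rule of thumb points to; the list may hold several.
function suggestApproaches(task: TaskProfile): Approach[] {
  const picks: Approach[] = [];
  if (task.knowledgeChangesOften || task.mustCiteSources) picks.push("RAG");
  if (task.needsSpecificFormatOrSkill) picks.push("fine-tuning");
  if (task.fitsInPrompt) picks.push("long context");
  return picks;
}

const supportBot = suggestApproaches({
  knowledgeChangesOften: true,
  mustCiteSources: true,
  needsSpecificFormatOrSkill: true,
  fitsInPrompt: false,
});
console.log(supportBot); // logs RAG and fine-tuning
```

Returning a list rather than a single winner reflects the point that production agents tend to combine approaches.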

Key terms

AI agent memory glossary

The vocabulary you'll meet across every memory and retrieval system.

Embedding
A numeric vector that represents the meaning of a piece of text, image, or other data. Similar meanings produce vectors that sit close together, enabling search by semantics rather than keywords.
Vector store / database
A database optimized for storing embeddings and running fast nearest-neighbor search. Examples include pgvector, Pinecone, Weaviate, Qdrant, and Chroma.
Chunking
Splitting source documents into smaller, focused passages before embedding. Good chunking (with sensible size and overlap) is one of the biggest levers on retrieval quality.
Retrieval
The act of fetching the most relevant stored chunks for a given query — typically a top-k similarity search, optionally followed by a reranking pass for precision.
Context window
The maximum number of tokens a model can read in a single call. Its finite size is the core reason agents externalize memory instead of keeping everything in the prompt.
Top-k & reranking
Top-k returns the k nearest vectors to a query; reranking re-scores those candidates with a stronger model to keep only the few that are genuinely relevant.
FAQ

AI agent memory, answered

What is AI agent memory?

AI agent memory is the set of mechanisms an agent uses to retain and recall information beyond a single model call. It spans short-term working memory (the live context window for the current task) and long-term memory, which usually means facts, past interactions, and documents stored in a vector database and retrieved on demand. Memory is what lets an agent remember a user's preferences from last week, recall the outcome of a previous step, and ground its answers in your private data instead of guessing.
