AI agent memory, RAG, and vector databases
Without memory, an agent forgets everything the moment its context window fills up. This guide explains how AI agent memory works — short-term vs long-term, the RAG retrieval pipeline, embeddings, and the vector databases that let agents recall facts and ground their answers in your data.
- 11 min read
- Intermediate
- Updated 2026
AI agent memory is everything an agent can remember beyond a single model call. A bare large language model (LLM) is stateless — it sees only what you put in the prompt and forgets it the instant the response is returned. Memory is the scaffolding that turns that stateless predictor into something that can recall a user's preferences, reuse the result of a previous step, and ground its answers in private, up-to-date knowledge.
The reason memory is unavoidable comes down to one hard limit: the context window. Every model can only read a fixed number of tokens at once. Even a generous 200K-token window cannot hold an entire product catalog, months of conversation history, and a long multi-step task trace simultaneously. And the more you cram in, the slower and pricier each call becomes — while the model pays measurably less attention to information stranded in the middle of a huge prompt.
The fix is to keep the prompt small and store everything else externally, then retrieve only the few most relevant pieces for each step. That retrieval pattern — embed a query, search a vector database, and inject the best matches back into the prompt — is called RAG (retrieval-augmented generation), and it is the engine behind nearly every agent's long-term memory. It pairs naturally with tools and the reasoning loop you see in LLM agents.
The types of AI agent memory
Borrowing from cognitive science, agent memory is usually split into six kinds, each answering a different question about what the agent should retain.
Short-term / working
The live context window for the current task — the system prompt, recent messages, and intermediate results. Fast and exact, but volatile: it vanishes when the window fills or the run ends.
Long-term memory
Durable knowledge persisted outside the model, typically as embeddings in a vector store. Retrieved on demand so the agent can recall facts and history from days or weeks ago.
Episodic memory
A record of specific past events: previous conversations, tool calls, and the outcomes of earlier task runs. Lets an agent say 'last time this approach failed' and adapt.
Semantic memory
General facts and domain knowledge — documentation, policies, product specs — independent of any single event. This is the classic target of a RAG knowledge base.
Procedural memory
Learned how-to knowledge: reusable skills, prompt recipes, and successful workflows the agent can apply again without re-deriving them from scratch.
User / entity memory
A structured profile of who the agent is helping — names, preferences, account state, and prior decisions — so interactions stay personalized and consistent over time.
How they work together
A well-built agent uses several memory types at once: short-term memory drives the current step, semantic memory grounds its facts, episodic memory helps it learn from past runs, and user memory keeps the experience personal. RAG is the retrieval mechanism that makes the long-term kinds usable inside a finite prompt.
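As a rough illustration of how these memory types might be combined, the sketch below assembles one prompt from a user profile, retrieved facts, past outcomes, and the most recent messages. The function name `buildPrompt`, the field names on `memory`, and the `maxRecent` parameter are all hypothetical; real agent frameworks structure this differently.

```javascript
// Hypothetical sketch: combine several memory types into a single prompt.
// buildPrompt, the memory field names, and maxRecent are illustrative assumptions.
function buildPrompt(memory, question, maxRecent = 3) {
  const sections = [];

  // User / entity memory: a small structured profile.
  const profile = Object.entries(memory.user ?? {})
    .map(([key, value]) => `${key}: ${value}`)
    .join(", ");
  if (profile) sections.push(`User profile: ${profile}`);

  // Semantic memory: facts retrieved from a knowledge base.
  const facts = memory.semantic ?? [];
  if (facts.length > 0) {
    sections.push("Relevant facts:\n" + facts.map(fact => `- ${fact}`).join("\n"));
  }

  // Episodic memory: outcomes of earlier runs.
  const episodes = memory.episodic ?? [];
  if (episodes.length > 0) {
    sections.push("Past outcomes:\n" + episodes.map(episode => `- ${episode}`).join("\n"));
  }

  // Short-term memory: only the newest messages stay in the window.
  const messages = memory.shortTerm ?? [];
  const recent = maxRecent > 0 ? messages.slice(-maxRecent) : [];
  if (recent.length > 0) {
    sections.push(
      "Recent conversation:\n" + recent.map(m => `${m.role}: ${m.content}`).join("\n")
    );
  }

  sections.push(`Question: ${question}`);
  return sections.join("\n\n");
}
```

Keeping each memory type in its own labeled section is one possible design choice: it makes it easy to drop or shrink a section when the prompt approaches the context window limit.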
The RAG retrieval pipeline, step by step
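Long-term memory comes alive through retrieval. Here is the path a query takes from raw text to a grounded, augmented prompt: embed the query, run a vector search, rerank the candidates, augment the prompt, and generate the answer.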
Each stage earns its place. Embedding converts the query into a vector so it can be compared by meaning. Vector search returns the top-k most similar chunks, usually 5 to 20 candidates. A reranker then re-scores those candidates with a more precise model and keeps only the few that truly matter, which cuts noise and token cost. Finally, those chunks are added to the prompt as grounded context, and the model generates an answer that can cite its sources.
Skipping the rerank step is a common cause of "RAG that almost works": raw vector similarity is good at recall but weaker at precision, so a second-pass reranker is often what makes retrieval trustworthy in production.
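The stages described above can be sketched end to end with toy stand-ins. Everything in this block is an assumption made for illustration: `embed` is a simple word-hashing function rather than a real embedding model, `rerank` scores by word overlap rather than a learned reranking model, and the function names and default values are hypothetical.

```javascript
// Toy stand-ins so the sketch runs without any external service.
// A real system would call an embedding model and a reranking model instead.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Assumed toy embedding: hash each word into one of `dims` buckets and count.
function embed(text, dims = 32) {
  const vector = new Array(dims).fill(0);
  for (const word of tokenize(text)) {
    let hash = 0;
    for (const ch of word) hash = (hash * 31 + ch.charCodeAt(0)) % dims;
    vector[hash] += 1;
  }
  return vector;
}

function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Stage 1: vector search returns the top-k chunks by cosine similarity.
function vectorSearch(queryVector, chunks, topK) {
  return chunks
    .map(chunk => ({ chunk, score: cosine(queryVector, embed(chunk)) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Stage 2: assumed toy reranker, scoring by the share of query words in the chunk.
function rerank(query, candidates, keep) {
  const queryWords = new Set(tokenize(query));
  return candidates
    .map(({ chunk }) => {
      const chunkWords = new Set(tokenize(chunk));
      let overlap = 0;
      for (const word of queryWords) if (chunkWords.has(word)) overlap += 1;
      return { chunk, score: queryWords.size === 0 ? 0 : overlap / queryWords.size };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, keep);
}

// Stage 3: augment the prompt with numbered chunks so the answer can cite them.
function retrieveAndAugment(query, chunks, { topK = 5, keep = 2 } = {}) {
  const candidates = vectorSearch(embed(query), chunks, topK);
  const best = rerank(query, candidates, keep);
  const context = best.map((hit, i) => `[${i + 1}] ${hit.chunk}`).join("\n");
  return `Answer using only the context below and cite sources by number.\n\nContext:\n${context}\n\nQuestion: ${query}`;
}
```

The split into a cheap first pass (`vectorSearch`) and a narrower second pass (`rerank`) mirrors the recall-then-precision idea in the text; the returned string is what would be sent to the model for generation.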
How agents store and retrieve a memory
Under the hood, every long-term memory is a vector. Writing one is an upsert; recalling one is a similarity search.
From text to vector and back again
An embedding model maps a piece of text to a list of numbers, commonly a vector of 768 or 1536 dimensions, positioned so that semantically similar text lands nearby. Storing a memory means embedding it once and writing the vector (plus metadata) to the database.
Recall is the reverse: embed the incoming query, ask the vector store for its nearest neighbors by cosine similarity, and you get back the most relevant memories — even when they share no literal keywords with the query.
- Chunk documents into focused passages before embedding
- Attach metadata (source, date, user) for filtered retrieval
- Use cosine similarity as the metric for nearest-neighbor search
- Cite sources back to the user for trust and auditability
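To make the upsert and similarity-search mechanics concrete, here is a minimal in-memory sketch with cosine similarity and metadata filtering. `InMemoryVectorStore` and `cosineSimilarity` are hypothetical names, and the three-dimensional vectors are made up for readability; a real deployment would use a vector database and vectors produced by an embedding model.

```javascript
// Hypothetical in-memory vector store, for illustration only.
class InMemoryVectorStore {
  constructor() {
    this.records = new Map(); // id -> { id, vector, metadata }
  }

  // Insert a new record, or overwrite the record that has the same id.
  upsert({ id, vector, metadata = {} }) {
    this.records.set(id, { id, vector, metadata });
  }

  // Return the topK records most similar to `vector`.
  // `filter` keeps only records whose metadata matches every given key.
  search({ vector, topK = 5, filter = {} }) {
    return [...this.records.values()]
      .filter(record =>
        Object.entries(filter).every(([key, value]) => record.metadata[key] === value))
      .map(record => ({ ...record, score: cosineSimilarity(vector, record.vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Made-up vectors standing in for real embeddings.
const demoStore = new InMemoryVectorStore();
demoStore.upsert({ id: "mem_1", vector: [0.9, 0.1, 0.0], metadata: { userId: "u1", kind: "preference" } });
demoStore.upsert({ id: "mem_2", vector: [0.0, 0.2, 0.9], metadata: { userId: "u1", kind: "fact" } });
demoStore.upsert({ id: "mem_3", vector: [0.8, 0.2, 0.1], metadata: { userId: "u2", kind: "preference" } });

// Search only within user u1's memories; mem_3 is excluded by the filter.
const demoHits = demoStore.search({ vector: [1, 0, 0], topK: 2, filter: { userId: "u1" } });
console.log(demoHits.map(hit => hit.id)); // mem_1 ranks ahead of mem_2
```

This sketch scans every record on each search, which is fine for a handful of memories; production vector databases typically rely on specialized indexes to keep search fast at scale.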
```javascript
// Assumes `embed`, `store`, and `userId` are defined elsewhere in your application.

// 1. Store a memory (keep the original text so it can be returned on recall)
const text = "User prefers email over phone";
const vector = await embed(text);
await store.upsert({
  id: "mem_8821", vector,
  metadata: { userId, kind: "preference", text },
});

// 2. Recall relevant memories
const q = await embed("How should I contact them?");
const hits = await store.search({ vector: q, topK: 5 });
const context = hits.map(h => h.metadata.text).join("\n");
// → inject `context` into the prompt
```

RAG vs fine-tuning vs long context
There are three ways to get knowledge into a model. They solve different problems — and the best agents use them together.
| Dimension | RAG | Fine-tuning | Long context |
|---|---|---|---|
| Best for | Facts & private data | Style & behavior | One-off, self-contained tasks |
| Knowledge stays current | | | |
| Update cost | Low (re-index) | High (retrain) | None |
| Can cite sources | | | |
| Scales to large corpora | | | |
| Cost per query | Low | Low | High |
| Setup effort | Medium | High | Low |
The rule of thumb: RAG for facts, fine-tuning for behavior, long context for convenience. If your knowledge changes weekly or must be cited, reach for RAG. If you need the model to speak in a specific format or follow a niche skill, fine-tune. If everything the task needs already fits in the prompt, a long context window is the simplest path. Production agents routinely blend all three.
AI agent memory glossary
The vocabulary you'll meet across every memory and retrieval system.
- Embedding
- A numeric vector that represents the meaning of a piece of text, image, or other data. Similar meanings produce vectors that sit close together, enabling search by semantics rather than keywords.
- Vector store / database
- A database optimized for storing embeddings and running fast nearest-neighbor search. Examples include pgvector, Pinecone, Weaviate, Qdrant, and Chroma.
- Chunking
- Splitting source documents into smaller, focused passages before embedding. Good chunking (with sensible size and overlap) is one of the biggest levers on retrieval quality.
- Retrieval
- The act of fetching the most relevant stored chunks for a given query — typically a top-k similarity search, optionally followed by a reranking pass for precision.
- Context window
- The maximum number of tokens a model can read in a single call. Its finite size is the core reason agents externalize memory instead of keeping everything in the prompt.
- Top-k & reranking
- Top-k returns the k nearest vectors to a query; reranking re-scores those candidates with a stronger model to keep only the few that are genuinely relevant.
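The chunking entry above mentions size and overlap; the sketch below shows one simple way those two settings could interact. The function name `chunkText` and its default values are assumptions, and splitting on whitespace is a simplification: real pipelines often split by tokens, sentences, or document structure instead.

```javascript
// Split text into word-based chunks of up to `size` words, where neighbouring
// chunks share `overlap` words. Defaults are illustrative, not recommendations.
function chunkText(text, size = 200, overlap = 40) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("expected size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this chunk reached the end
  }
  return chunks;
}

console.log(chunkText("a b c d e f g h i j", 4, 1));
// → [ 'a b c d', 'd e f g', 'g h i j' ]
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some words twice.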
AI agent memory, answered
AI agent memory is the set of mechanisms an agent uses to retain and recall information beyond a single model call. It spans short-term working memory (the live context window for the current task) and long-term memory — usually facts, past interactions, and documents stored in a vector database and retrieved on demand. Memory is what lets an agent remember a user's preferences from last week, recall the outcome of a previous step, and ground its answers in your private data instead of guessing.