
Vector databases for AI agents

When an agent needs to find knowledge by meaning rather than by keyword, a vector database is the engine that makes it fast. This guide covers embeddings, similarity metrics, ANN indexes, hybrid search, and how to pick a store you won't regret.

  • 13 min read
  • Intermediate
  • Updated 2026

A vector database is the part of an AI stack that lets an agent search by meaning — finding the right passage, memory, or product from millions of candidates, even when none of the words match.

Traditional databases were built for exact answers: this customer ID, this date range, this status. AI agents need a different question answered — what is most similar to this? — over fuzzy, human content where the wording is never identical. A vector database answers that question by storing embeddings (numeric representations of meaning) and finding the ones nearest a query in a high-dimensional space.

This is the workhorse behind retrieval-augmented generation and agent memory. An agent embeds a question, the database returns the closest chunks in a few milliseconds, and those passages ground the model's answer. Without it, retrieval would mean scanning every record on every query, which is impractical at scale.

Below we build the picture from the ground up: what a vector and a vector space actually are, the similarity metrics that define "close," the approximate-nearest-neighbour indexes (HNSW, IVF) that keep search fast, metadata filtering and hybrid search, how all of this powers RAG and memory, and a neutral comparison to help you choose a vector database.

The foundation

Embeddings and vector space

Before you can search by meaning, you have to turn meaning into numbers. That is what an embedding model does — and what a vector database is built to store.

An embedding is a fixed-length list of numbers — a vector — produced by a model trained so that text with similar meaning lands at nearby coordinates. A typical text embedding has several hundred to a couple thousand dimensions; each dimension is an axis the model learned, capturing some abstract facet of meaning you can't name but that collectively encodes the gist.

Picture a map where every sentence is a pin. "Reset my password" and "I can't log in" sit almost on top of each other; "best pizza in town" is far away on the other side of the city. The database never reads the words — it only sees coordinates — yet because the embedding model placed the pins meaningfully, geometric closeness is semantic closeness.

Crucially, the same model must embed both your stored documents and the incoming query, so they share one coordinate system. Mix embedding models and the geometry stops lining up — a classic, silent source of bad retrieval.

Vector

A list of floats, e.g. 768 or 1,536 numbers, that encodes one chunk of text, image, or audio as a point in space.

Vector space

The high-dimensional coordinate system all your vectors live in. Distance in this space approximates difference in meaning.

Embedding model

The neural network that maps content to vectors. The same model must embed both documents and queries.

Dimensionality

More dimensions can capture finer meaning but cost more storage and compute. The embedding model sets this number, though some models let you shorten their vectors.

The curse (and gift) of high dimensions

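In hundreds of dimensions, intuition from 2-D maps breaks down: distances concentrate, so almost everything ends up roughly equidistant from everything else. That is why the space-partitioning tricks that work in low dimensions stop pruning anything useful, and why vector databases lean on specialized approximate indexes. Brute-force comparison is slow for a simpler reason: it touches every stored vector on every query. The gift is capacity: those dimensions give the model room to separate subtle shades of meaning that a keyword index never could. Read more on the building block in our embeddings glossary.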
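The "roughly equidistant" effect can be seen with a small experiment. This is a sketch on random Gaussian points, which is an assumption made for illustration; real embeddings are more structured, so the effect is usually milder. The helper name `relative_contrast` is hypothetical.

```python
import math
import random

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one random query to n random points."""
    rng = random.Random(seed)
    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

# The gap between the nearest and farthest point shrinks as dimensions grow.
for dim in (2, 32, 512):
    print(dim, round(relative_contrast(dim), 2))
```

A large value means the nearest point stands out clearly; a value near zero means every point looks about equally far away.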

Defining 'close'

Similarity metrics: cosine, dot, Euclidean

A vector database is only as meaningful as its definition of distance. Three metrics dominate, and choosing the right one is mostly about matching how your embedding model was trained.

Cosine similarity

Measures the angle between two vectors, ignoring their magnitude. Two passages about the same topic point in the same direction even if their vectors differ in length. The safe default for most text embeddings, especially when vectors are normalized.

Dot product

Like cosine but also rewards magnitude, so vectors with larger norms score higher. Correct when your model was trained with it; on normalized vectors it ranks results exactly as cosine does. Often the fastest to compute at scale.

Euclidean (L2)

Straight-line distance between two points in the space. Smaller is closer. A natural fit when magnitude genuinely carries meaning, such as some image or clustering embeddings.

The practical advice is simple: use the metric the embedding model's authors recommend. Most modern text models are tuned for cosine similarity, and many libraries normalize vectors so that cosine and dot product become equivalent. The mistake to avoid is picking a metric arbitrarily: a model trained for cosine but queried with Euclidean distance on unnormalized vectors will quietly return worse neighbours, and you'll blame your chunking or prompts instead of the geometry.
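The three metrics fit in a few lines of plain Python. The sketch below uses made-up three-dimensional vectors (real embeddings have hundreds of dimensions) to show cosine and dot product disagreeing on raw vectors and agreeing once the vectors are normalized.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(a):
    length = norm(a)
    return [x / length for x in a]

query = [1.0, 2.0, 0.0]
short = [2.0, 4.0, 0.0]   # same direction as the query
long_ = [9.0, 9.0, 9.0]   # different direction, larger magnitude

# Cosine prefers `short` (same direction); raw dot product prefers `long_` (bigger norm).
print(cosine_similarity(query, short), cosine_similarity(query, long_))
print(dot(query, short), dot(query, long_))

# After normalization, dot product equals cosine similarity.
q, s = normalize(query), normalize(short)
print(dot(q, s), cosine_similarity(query, short))
```

On unit-length vectors the squared Euclidean distance equals 2 minus twice the cosine similarity, so all three metrics produce the same ranking there.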

Most vector databases let you declare the metric per index or collection. Set it once, at creation time, to match your model; changing it later usually means rebuilding the index.

Making it fast

Approximate nearest neighbour: HNSW and IVF

Comparing a query against every vector is exact but hopeless at scale. ANN indexes trade a sliver of recall for orders-of-magnitude speed — and they're the real engineering inside a vector database.

  • Exact (brute force): compare against every vector; slow
  • HNSW graph: hop greedily from coarse to fine
  • IVF clusters: scan only the nearest partitions
  • Top-k in ms: high recall at low latency

Why ANN exists: exact search scales linearly with your corpus, so an index that skips most comparisons is the only way to stay fast as data grows.
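For reference, exact search is short to write, which makes the linear cost easy to see. This is a minimal sketch with toy three-dimensional vectors; the ids and the helper name `exact_top_k` are illustrative.

```python
import heapq
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def exact_top_k(query, vectors, k):
    """Return the ids of the k vectors closest to `query`, nearest first.

    Cost is one distance computation per stored vector, on every query.
    """
    scored = ((cosine_distance(query, vec), vec_id) for vec_id, vec in vectors.items())
    return [vec_id for _, vec_id in heapq.nsmallest(k, scored)]

vectors = {
    "reset-password": [0.9, 0.1, 0.0],
    "cannot-log-in": [0.8, 0.2, 0.1],
    "best-pizza": [0.0, 0.1, 0.9],
}
print(exact_top_k([1.0, 0.1, 0.0], vectors, k=2))
```

Brute force like this is still useful as the ground truth when you measure an ANN index's recall.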

HNSW (Hierarchical Navigable Small World) builds a layered graph of vectors. Search starts in a sparse top layer, greedily hopping toward the query, then drops into denser layers to refine — like zooming in on a map from country to street. It delivers excellent recall and latency and is the default in most modern stores; the cost is higher memory and slower inserts, tuned by parameters like M (connections per node) and efSearch (how hard to look).
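The core move in HNSW, a best-first walk over a neighbour graph, can be sketched on a single layer. This is a deliberate simplification and not HNSW itself: there is no layer hierarchy, and the graph is built by brute force. The parameters `m` and `ef` play roles similar to M and efSearch; all function names are hypothetical.

```python
import heapq
import math
import random

def build_graph(vectors, m):
    """Link every vector to its m nearest neighbours (brute force, for illustration)."""
    graph = {i: set() for i in range(len(vectors))}
    for i, vec in enumerate(vectors):
        nearest = sorted((j for j in range(len(vectors)) if j != i),
                         key=lambda j: math.dist(vec, vectors[j]))[:m]
        for j in nearest:
            graph[i].add(j)
            graph[j].add(i)  # keep edges bidirectional so the walk can move both ways
    return graph

def graph_search(query, vectors, graph, entry, k, ef):
    """Best-first search that keeps the ef closest nodes seen so far."""
    start = math.dist(query, vectors[entry])
    visited = {entry}
    candidates = [(start, entry)]  # min-heap: closest unexplored node first
    best = [(-start, entry)]       # max-heap via negation: worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(best) >= ef and dist > -best[0][0]:
            break  # the closest remaining candidate cannot improve the result set
        for neighbour in graph[node]:
            if neighbour in visited:
                continue
            visited.add(neighbour)
            d = math.dist(query, vectors[neighbour])
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, neighbour))
                heapq.heappush(best, (-d, neighbour))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst result
    ranked = sorted((-neg_dist, node) for neg_dist, node in best)
    return [node for _, node in ranked[:k]]

rng = random.Random(0)
points = [[rng.random(), rng.random()] for _ in range(200)]
graph = build_graph(points, m=8)
print(graph_search([0.5, 0.5], points, graph, entry=0, k=3, ef=20))
```

Raising `ef` widens the search and tends to improve recall at the cost of more distance computations, which is the trade-off efSearch exposes in real engines.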

IVF (Inverted File index) takes a different tack: it clusters all vectors into partitions, and at query time only searches the few partitions nearest the query. It's lighter on memory and fast to build, which is why it scales well to very large corpora; the trade-off is that a result on a partition boundary can be missed, controlled by nprobe (how many partitions to scan). IVF is often paired with product quantization (IVF-PQ) to compress vectors and shrink memory further.
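A toy version of the IVF idea follows: partition with a few rounds of k-means, then scan only the `nprobe` closest partitions. It is a sketch under simplifying assumptions (Euclidean distance, random 2-D data, no quantization), and the function names are illustrative rather than any library's API.

```python
import math
import random

def nearest_centroid(vec, centroids):
    return min(range(len(centroids)), key=lambda c: math.dist(vec, centroids[c]))

def assign(vectors, centroids):
    """Bucket every vector id under its closest centroid."""
    buckets = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        buckets[nearest_centroid(vec, centroids)].append(idx)
    return buckets

def train_ivf(vectors, n_lists, iterations=10, seed=0):
    """Cluster vectors with a few rounds of k-means; return centroids and partitions."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    dim = len(vectors[0])
    for _ in range(iterations):
        buckets = assign(vectors, centroids)
        for c, members in enumerate(buckets):
            if members:  # an empty partition keeps its previous centroid
                centroids[c] = [sum(vectors[i][d] for i in members) / len(members)
                                for d in range(dim)]
    return centroids, assign(vectors, centroids)

def ivf_search(query, vectors, centroids, buckets, k, nprobe):
    """Scan only the nprobe partitions whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    candidates.sort(key=lambda i: math.dist(query, vectors[i]))
    return candidates[:k]

rng = random.Random(1)
data = [[rng.random(), rng.random()] for _ in range(300)]
centroids, buckets = train_ivf(data, n_lists=8)
print(ivf_search([0.3, 0.7], data, centroids, buckets, k=5, nprobe=2))
```

With `nprobe` equal to the number of partitions the search degenerates to exact search; lowering it skips work and risks missing a neighbour that sits just across a partition boundary.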

Both expose the same trade-off: recall versus speed versus memory. You rarely build these by hand; you choose the index type and set a couple of parameters, and the database handles the rest.

  • Results per query (top-k): usually 3–20, not the whole set
  • Typical query latency: single-digit milliseconds at million scale
  • Recall target: >95%; tune ef / nprobe to hit it
  • Core trade-offs: 3 (recall · speed · memory)

Recall is a dial, not a guarantee

ANN means approximate: a small fraction of true nearest neighbours may be missed. For most agent retrieval this is invisible, but if you're doing deduplication or compliance lookups where every match matters, raise the search effort or fall back to exact search over the relevant subset. Always measure recall against a labeled set before trusting defaults.
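Measuring recall needs only two lists per query: what the index returned and what it should have returned. In this sketch the ground truth is assumed to come from exact search or a hand-labeled set, and the function names are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the true top-k neighbours that the index actually returned."""
    truth = set(relevant_ids[:k])
    if not truth:
        return 0.0
    hits = sum(1 for item in retrieved_ids[:k] if item in truth)
    return hits / len(truth)

def mean_recall(ann_results, exact_results, k):
    """Average recall@k over a set of test queries."""
    scores = [recall_at_k(ann, exact, k) for ann, exact in zip(ann_results, exact_results)]
    return sum(scores) / len(scores)

# Two test queries: the first is fully recalled, the second misses one neighbour.
ann = [[1, 2], [3, 4]]
exact = [[1, 2], [3, 5]]
print(mean_recall(ann, exact, k=2))
```

Run it over a few hundred representative queries each time you change an index parameter, so a recall drop shows up before users notice it.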

Beyond pure similarity

Metadata filtering and hybrid search

Real retrieval is rarely 'just find the nearest vectors.' You also need to scope results by attributes and to catch exact terms that pure semantics miss.

Two ideas that make retrieval production-grade

Filter to scope, blend to recall

Every vector you store carries metadata: tenant, source, document type, date, access level. Metadata filtering restricts the similarity search to records that match those attributes, so an agent never retrieves another customer's data or a draft that should stay hidden. Done well, filtering happens inside the index rather than after the search, which keeps it fast and avoids coming back with fewer than k results.
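The difference between filtering after the search and filtering before it shows up even on four records. This brute-force sketch only illustrates why the order matters; real engines apply the filter while traversing the index. The tenant names and helper functions are made up.

```python
import math

records = [
    {"id": 1, "tenant": "acme", "vec": [0.9, 0.1]},
    {"id": 2, "tenant": "globex", "vec": [0.95, 0.05]},
    {"id": 3, "tenant": "globex", "vec": [0.9, 0.2]},
    {"id": 4, "tenant": "acme", "vec": [0.1, 0.9]},
]

def nearest(query, rows, k):
    return sorted(rows, key=lambda r: math.dist(query, r["vec"]))[:k]

def post_filter(query, rows, k, tenant):
    """Search first, filter afterwards: may return fewer than k rows."""
    return [r["id"] for r in nearest(query, rows, k) if r["tenant"] == tenant]

def pre_filter(query, rows, k, tenant):
    """Filter first, then search only the rows the caller may see."""
    allowed = [r for r in rows if r["tenant"] == tenant]
    return [r["id"] for r in nearest(query, allowed, k)]

query = [1.0, 0.0]
print(post_filter(query, records, k=2, tenant="acme"))  # one row survives the filter
print(pre_filter(query, records, k=2, tenant="acme"))   # still returns two rows
```

Post-filtering loses a result here because another tenant's record took one of the two top slots before the filter ran.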

Hybrid search blends two retrieval styles: dense vector search for meaning and sparse keyword search (BM25) for exact, rare, or symbolic terms — a part number, an error code, a person's name. The two result sets are merged, often with a re-ranker on top, so you catch both the paraphrase and the literal match. This combination is what separates a demo from a system people trust.

  • Filter by tenant, date, type, or access level.
  • Dense vectors capture meaning and paraphrase.
  • Keyword search nails exact, rare, symbolic terms.
  • Re-ranking lifts the truly relevant chunk to the top.

  • Query: embed + tokenize
  • Vector search: nearest by meaning
  • Keyword search: BM25 exact terms
  • Merge + filter: dedupe, apply metadata
  • Re-rank: cross-encoder picks top-k

A hybrid retrieval path: recall broadly from both signals, filter by metadata, then re-rank for precision.

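The article does not say how the two result sets are merged. One common choice is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method any particular store uses. The constant `k=60` is a conventional default, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_hits = ["doc-paraphrase", "doc-both", "doc-related"]  # from vector search
keyword_hits = ["doc-error-code", "doc-both"]               # from BM25
print(reciprocal_rank_fusion([dense_hits, keyword_hits]))
```

RRF works on ranks rather than raw scores, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. A document found by both retrievers rises to the top.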
What it powers

How vector DBs drive RAG and agent memory

The same store serves two of the most important agent capabilities: grounding answers in your knowledge, and remembering across turns and sessions.

  • Embed: chunks → vectors
  • Index: store with metadata
  • Query: embed the question
  • Retrieve: top-k nearest records

The lifecycle a vector database supports: embed content once, index it, then embed each query to retrieve the closest records for the agent to use.

  1. Powering RAG

    Documents are chunked, embedded, and indexed. When the agent gets a question, it embeds the query, retrieves the nearest chunks, and feeds them to the model as grounding context — so answers cite real sources instead of guessing.

  2. Powering memory

    Past messages, facts, and observations are embedded and stored. Instead of replaying an entire history, the agent recalls only the memories most similar to the current situation — long-term, searchable recall that outlasts the context window.

  3. As a tool

    In an agent loop the vector store is exposed as a retrieval tool the model can call when it judges it needs evidence — routing, re-querying, and reading results as part of its reasoning.

The mechanics are identical for both jobs: text in, vector out, nearest vectors back. What differs is what you store. For RAG it's your knowledge base — docs, tickets, policies. For agent memory it's the agent's own experience — what the user said last week, decisions it made, facts it learned.
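A tiny in-memory stand-in makes the "same mechanics, different contents" point concrete. The vectors below are hand-written placeholders; in practice they would come from your embedding model. The class name `TinyVectorStore` and the `kind` metadata field are hypothetical.

```python
import math
from dataclasses import dataclass, field

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

@dataclass
class Record:
    text: str
    vector: list
    metadata: dict = field(default_factory=dict)

class TinyVectorStore:
    """In-memory stand-in for a vector database: index records, retrieve by similarity."""

    def __init__(self):
        self.records = []

    def add(self, text, vector, **metadata):
        self.records.append(Record(text, vector, metadata))

    def query(self, vector, k=3, **where):
        # Keep only records whose metadata matches every filter, then rank by distance.
        matches = [r for r in self.records
                   if all(r.metadata.get(key) == value for key, value in where.items())]
        matches.sort(key=lambda r: cosine_distance(vector, r.vector))
        return [r.text for r in matches[:k]]

store = TinyVectorStore()
store.add("Refunds are issued within 14 days.", [0.9, 0.1, 0.0], kind="knowledge")
store.add("User prefers email over phone.", [0.1, 0.9, 0.0], kind="memory")
store.add("User's plan renews in March.", [0.2, 0.8, 0.1], kind="memory")

# The same query call serves RAG (kind="knowledge") and memory (kind="memory").
print(store.query([0.0, 1.0, 0.0], k=1, kind="memory"))
print(store.query([1.0, 0.0, 0.0], k=1, kind="knowledge"))
```

Keeping both kinds in one store with a metadata field is one design option; separate collections per purpose work equally well and make retention policies easier to apply.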

This is why the vector database is often described as an agent's long-term memory. The context window is working memory: small, fast, forgotten after the turn. The vector store is durable recall: vast, persistent, searchable by meaning. Wire it up as one of your agent tools and the model can reach far more knowledge than any single prompt could hold.

Quality still depends on hygiene (sensible chunking, deduplicated sources, expiry of stale records, and provenance on every vector) so the agent recalls the right thing rather than the most recent or the most repeated.

See it in code

Storing and querying vectors

Strip away the marketing and the API is small: insert vectors with metadata, then query by a vector plus a filter. Here it is with pgvector inside plain SQL.

vectors.sql

```sql
CREATE EXTENSION IF NOT EXISTS vector;  -- enable pgvector

CREATE TABLE chunks (
  id        bigserial PRIMARY KEY,
  source    text,
  embedding vector(1536),  -- match your model's dims
  content   text
);

CREATE INDEX ON chunks  -- build an ANN index
  USING hnsw (embedding vector_cosine_ops);  -- cosine metric

SELECT content, source  -- the retrieve step
FROM chunks
WHERE source = 'handbook'  -- metadata filter
ORDER BY embedding <=> :query_vec  -- <=> is cosine distance
LIMIT 5;  -- top-k nearest chunks
```
pgvector turns Postgres into a vector database — an HNSW index, a metadata column, and a similarity query with a filter, all in SQL you already know.

That's the entire contract of a vector database: declare the dimensionality and metric, build an index, and order by distance with an optional filter. Managed services wrap this in a REST or SDK call instead of SQL, but the shape is the same. If you already run Postgres, pgvector lets you keep relational data and vectors in one place — no second system to operate. Larger or higher-throughput workloads often graduate to a dedicated store; see the comparison below and our vector database glossary.
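From application code, the query vector is usually passed as a bound parameter. The sketch below formats a Python list in pgvector's bracketed text form and prepares the query for a DB-API style driver. The `%s` placeholders assume a psycopg-style paramstyle, and the helper names are hypothetical; nothing here opens a database connection.

```python
def to_pgvector_literal(vector, expected_dims):
    """Format a Python sequence as pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    if len(vector) != expected_dims:
        raise ValueError(f"expected {expected_dims} dimensions, got {len(vector)}")
    return "[" + ",".join(repr(float(x)) for x in vector) + "]"

TOP_K_SQL = """
SELECT content, source
FROM chunks
WHERE source = %s
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

def build_query(query_vec, source, k, expected_dims=1536):
    """Return (sql, params) ready to hand to a cursor.execute call."""
    return TOP_K_SQL, (source, to_pgvector_literal(query_vec, expected_dims), k)

sql, params = build_query([0.1, 0.2, 0.3], "handbook", 5, expected_dims=3)
print(params)
```

Checking the dimension count before the query runs turns a mismatched embedding model into an immediate error instead of a failed or misleading search.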

Choosing one

pgvector, Pinecone, Weaviate, Qdrant, Milvus, Chroma

There is no single best vector database — only the best fit for your scale, team, and operational appetite. Here is a neutral look at six popular options.

Vector store | Open source | Fully managed | Hybrid search | Metadata filtering
pgvector
Pinecone
Weaviate
Qdrant
Milvus
Chroma

Read the table as a starting point, not a verdict — capabilities move fast and "~" means it depends on the tier or version. The honest summary of where each tends to shine:

  • pgvector — an extension that adds vectors to Postgres. Ideal when you already run Postgres and want one system for relational data and embeddings; managed via your existing Postgres host.
  • Pinecone — a fully managed, serverless vector service. You trade openness for zero operations and predictable scaling; popular for teams that want to ship without running infrastructure.
  • Weaviate — open source with a managed cloud, built-in hybrid search, and optional modules that embed content for you.
  • Qdrant — open source and managed, known for a clean API, rich metadata filtering, and strong performance per dollar.
  • Milvus — open source and built for very large, distributed deployments with multiple index types; more moving parts to operate.
  • Chroma — lightweight and developer-friendly, excellent for prototyping and local RAG before you scale out.

  • Start where your data already lives: if you run Postgres, try pgvector before adding a new system.
  • Match scale to architecture: millions of vectors suit most stores; billions point to distributed engines like Milvus.
  • Decide managed vs self-hosted: weigh operational cost and control against the convenience of a managed service.
  • Confirm hybrid + filtering support: if you need keyword blending and tenant isolation, verify them on your chosen tier.
  • Test recall and latency on your data: benchmark with your own embeddings and queries, not vendor numbers.
  • Plan for re-indexing: changing model, dimensions, or metric means rebuilding, so budget for it.

Operate it well

Scaling and cost considerations

Vector search is cheap per query and surprisingly expensive in aggregate. A few realities decide whether your bill and your latency stay sane.

Memory is the cost driver

HNSW keeps vectors in RAM for speed, so a billion 1,536-dimensional float32 vectors need roughly 6 TB of memory before any index overhead. Quantization (PQ, scalar, binary) and IVF-on-disk trade a little recall for big savings.

Tune the recall dial

Higher ef / nprobe means better recall but more compute per query. Find the lowest setting that meets your quality bar — defaults are often more aggressive than you need.

Smaller vectors, lower bills

Fewer dimensions, quantized storage, and reduced top-k all cut memory and latency. Some models support shortening embeddings with minimal quality loss.

Writes and re-indexing

High insert rates and frequent re-indexing add load. Batch upserts, and remember that changing your embedding model forces a full rebuild.

The first surprise teams hit is that storage, not query volume, dominates the bill. ANN indexes are memory-hungry by design, and embeddings are large. Before you scale up the hardware, scale down the data: prune stale chunks, deduplicate near-identical vectors, and consider quantization to shrink each vector's footprint.
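The arithmetic behind the storage bill is worth doing before you provision anything. This sketch counts raw vector bytes only; graph links, metadata, and replicas add to the total and are not modeled. The per-value sizes assume float32, int8 scalar quantization, and 1-bit binary quantization.

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_value=4.0):
    """Storage for the vectors alone; index structures add overhead on top."""
    return n_vectors * dims * bytes_per_value

def human(n_bytes):
    """Format a byte count using decimal units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n_bytes < 1000 or unit == "TB":
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000

n, dims = 1_000_000_000, 1536
print("float32:", human(raw_vector_bytes(n, dims, 4)))
print("int8   :", human(raw_vector_bytes(n, dims, 1)))
print("binary :", human(raw_vector_bytes(n, dims, 1 / 8)))
```

Halving the dimension count or moving from float32 to int8 scales the result linearly, which is why those two choices dominate the long-run cost.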

The second surprise is that recall is something you buy. Pushing recall from 95% to 99.9% can multiply query cost for a gain your users never notice. Treat recall as a tunable budget, measured against a labeled question set, rather than a number to maximize.

Finally, plan around the embedding model. Its dimensionality fixes your storage cost, its metric fixes your index, and switching it means re-embedding and re-indexing everything. Choosing the model is, in practice, choosing the long-run economics of your vector database. Wire retrieval in cleanly as one of your agent tools and these costs stay isolated and easy to reason about.

ANN
Approximate nearest neighbour — fast similarity search that trades a little recall for large speed gains.
HNSW
A graph-based ANN index offering high recall and low latency at the cost of memory.
IVF
A clustering-based ANN index that scans only the nearest partitions; lighter on memory.
Quantization
Compressing vectors (e.g. product quantization) to cut memory, with a small recall trade-off.
Top-k
The number of most-similar records a query returns, typically a handful for agent retrieval.
Hybrid search
Combining dense vector search with sparse keyword search for both meaning and exact terms.
FAQ

Vector databases, answered

What is a vector database?

A vector database is a store built to index and search high-dimensional vectors: the numeric embeddings that represent the meaning of text, images, or audio. Instead of matching exact keywords, it finds the items whose vectors are closest to a query vector, so 'cancel my plan' can retrieve a passage about 'ending a subscription' even with no shared words. It combines an approximate-nearest-neighbour index for speed with metadata filtering, so an AI agent can fetch the most semantically relevant records out of millions in a few milliseconds.
