What Is the Difference Between Relevance and Recall in Vector Search?

Relevance and recall are related, but they are not the same thing in vector search.

Recall measures whether the search system found the expected items. Relevance measures whether the returned items are useful for the user, query, or downstream application.

Short Answer

Recall is a coverage metric. It asks how many desired results were retrieved.

Relevance is a quality judgment. It asks whether the retrieved results actually match the user’s intent.

A vector search system can have high recall and still return results that feel irrelevant. It can also return highly relevant top results while missing other relevant items deeper in the collection.

What Recall Means in Vector Search

Recall is the fraction of target items that appear in the returned results.

In approximate nearest neighbor search, recall often means the fraction of exact nearest neighbors found by the approximate index.

For example, if exact search says the true top 10 nearest vectors are A through J, and the ANN index returns 8 of those 10, the index has 80% recall at 10 for that query.
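The arithmetic in that example can be written as a small function. This is a minimal sketch, not a benchmark harness: the function name `ann_recall_at_k` and the placeholder IDs "X" and "Y" (standing in for the two wrong results) are assumptions made for illustration.

```python
def ann_recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k neighbors that also appear in the ANN top-k."""
    exact_top = set(exact_ids[:k])
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


# Exact search says the true top 10 are A through J.
exact = list("ABCDEFGHIJ")
# The ANN index finds 8 of them and returns two other vectors instead of I and J.
approx = ["A", "B", "C", "D", "E", "F", "G", "H", "X", "Y"]

recall = ann_recall_at_k(exact, approx, k=10)
print(f"Recall@10 = {recall:.0%}")  # Recall@10 = 80%
```

In practice this value is averaged over many queries, because a single query says little about the index as a whole.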

What Relevance Means

Relevance measures whether a result satisfies the information need behind the query.

In user-facing search, a relevant result is one that helps the user complete the task. In RAG, a relevant result contains evidence that helps the model answer correctly. In recommendations, a relevant item is something the user is likely to value.

Relevance is usually judged with labels, clicks, expert review, task success, or model-assisted evaluation.

The Core Difference

Recall is about retrieving items from a defined target set.

Relevance is about whether those items are good results.

The target set for recall may be exact nearest neighbors, labeled relevant documents, gold documents, or expected evidence chunks. The judgment for relevance depends on the application.

Two Types of Recall

The word recall is used in two common ways.

In ANN benchmarking, recall usually compares approximate vector results with exact nearest-neighbor results.

In information retrieval, recall usually measures how many known relevant documents were returned out of all relevant documents.

These are related but not identical. ANN recall asks whether the index approximated exact vector search well. IR recall asks whether the search system found the documents humans consider relevant.
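One way to see the gap is to score the same result list against both target sets. The sketch below uses made-up document IDs and a hypothetical helper called `recall_against`; the only thing that changes between the two measurements is what counts as the target.

```python
def recall_against(retrieved, targets):
    """Fraction of the target set that appears anywhere in the retrieved list."""
    targets = set(targets)
    if not targets:
        raise ValueError("recall is undefined for an empty target set")
    return len(targets & set(retrieved)) / len(targets)


# Hypothetical data for one query.
exact_top4 = ["d1", "d2", "d3", "d4"]      # what exact vector search returns
ann_results = ["d2", "d1", "d4", "d3"]     # what the ANN index returns
labeled_relevant = {"d2", "d7", "d9"}      # what human judges marked relevant

ann_recall = recall_against(ann_results, exact_top4)       # target: exact neighbors
ir_recall = recall_against(ann_results, labeled_relevant)  # target: human labels

print(f"ANN recall: {ann_recall:.2f}")  # ANN recall: 1.00
print(f"IR recall:  {ir_recall:.2f}")   # IR recall:  0.33
```

Here the index reproduces exact search perfectly, yet two of the three documents that judges consider relevant never appear.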

Why ANN Recall Is Not the Same as Relevance

Exact nearest vectors are not always the most useful documents.

If the embedding model represents the query poorly, exact nearest-neighbor search can confidently retrieve the wrong things. If chunks are too large or too small, the nearest vector may not contain useful evidence. If metadata filters remove important documents, the remaining nearest vectors may be technically close but contextually weak.

High ANN recall means the index found what exact vector search would have found. It does not prove that exact vector search was the right retrieval strategy.

Why Relevance Can Be High When Recall Is Imperfect

Search systems often need only a few strong results.

If a user sees five results and the first three are excellent, the experience may be good even if the index missed several exact nearest neighbors.

This is especially true when many documents are near-duplicates or when multiple chunks contain the same answer.

Why Recall Still Matters

Recall matters because low recall can silently cap retrieval quality.

If the correct document never enters the candidate set, reranking, answer generation, and UI design cannot recover it.

For RAG systems, poor recall means the language model may never receive the evidence needed to answer correctly.

Recall at K

Recall is usually measured at a result cutoff, written as Recall@K.

Recall@10 evaluates the top 10 results. Recall@100 evaluates the top 100 results.

The cutoff matters. A system may have poor Recall@10 but strong Recall@100, which means the right material exists in the candidate pool but ranking needs improvement.
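The effect of the cutoff is easy to see on a toy ranking. In this sketch the document IDs and the positions of the relevant items are invented; `recall_at_k` is an illustrative helper, not a library function.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant set found in the first k ranked results."""
    found = set(ranked[:k]) & set(relevant)
    return len(found) / len(relevant)


# A hypothetical ranked list of 100 results: doc1, doc2, ..., doc100.
ranked = [f"doc{i}" for i in range(1, 101)]
# Assume four labeled relevant documents, most of them ranked far down the list.
relevant = {"doc3", "doc40", "doc55", "doc90"}

for k in (10, 50, 100):
    print(f"Recall@{k} = {recall_at_k(ranked, relevant, k):.2f}")
# Recall@10 = 0.25
# Recall@50 = 0.50
# Recall@100 = 1.00
```

A profile like this suggests that a reranker over the top 100 could help, since the relevant documents are already in the candidate pool.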

Precision and Relevance

Precision is the fraction of returned results that are relevant.

If a search returns 10 results and 7 are useful, precision at 10 is 70%.
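Precision and recall share a numerator but divide by different things, which the sketch below makes explicit. The 7-of-10 figure comes from the example above; the assumption that the collection holds 20 relevant documents in total is added purely to give recall a denominator.

```python
def precision_at_k(ranked, relevant, k):
    """Relevant results in the top k, divided by the number of results shown."""
    top = ranked[:k]
    return sum(1 for doc in top if doc in relevant) / len(top)


def recall_at_k(ranked, relevant, k):
    """Relevant results in the top k, divided by all relevant documents."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


# Ten returned results, of which the first seven are useful.
returned = [f"r{i}" for i in range(1, 11)]
# Assumption: 13 more relevant documents exist but were not returned (20 total).
missed = {f"m{i}" for i in range(1, 14)}
relevant = set(returned[:7]) | missed

precision = precision_at_k(returned, relevant, 10)
recall = recall_at_k(returned, relevant, 10)
print(f"Precision@10 = {precision:.2f}")  # Precision@10 = 0.70
print(f"Recall@10    = {recall:.2f}")     # Recall@10    = 0.35
```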

Precision is often closer to the visible user experience than recall, because users mainly judge what appears in front of them.

Ranking Metrics

Perceived relevance also depends on result order.

A system that puts the best result first is more useful than one that puts the same result at position 20.

Metrics such as nDCG, MRR, and success at K capture ranking quality better than raw recall alone.
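To show how order changes the score, here is a minimal sketch of reciprocal rank (the per-query term that MRR averages) and nDCG. It assumes graded labels on a 0 to 3 scale and uses the linear-gain form of DCG; some tools use an exponential gain instead, so absolute values may differ from other implementations. The document IDs and grades are invented.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for pos, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / pos
    return 0.0


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 position discount."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains, start=1))


def ndcg_at_k(ranked, grades, k):
    """DCG of the top k, normalized by the DCG of the ideal ordering."""
    gains = [grades.get(doc, 0) for doc in ranked[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


grades = {"a": 3, "b": 2, "c": 1}           # hypothetical graded judgments
good_order = ["a", "b", "c", "x", "y"]      # best documents first
bad_order = ["x", "y", "c", "b", "a"]       # same documents, best ones last

ndcg_good = ndcg_at_k(good_order, grades, 5)
ndcg_bad = ndcg_at_k(bad_order, grades, 5)
rr_good = reciprocal_rank(good_order, {"a"})
rr_bad = reciprocal_rank(bad_order, {"a"})

print(f"nDCG@5: good={ndcg_good:.2f} bad={ndcg_bad:.2f}")
print(f"RR for 'a': good={rr_good:.2f} bad={rr_bad:.2f}")
```

Both orderings contain the same three relevant documents, so Recall@5 is identical for them. Only the rank-aware metrics separate the two.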

Vector Similarity Is Not Relevance

Vector similarity measures closeness in embedding space.

Relevance measures usefulness for the query.

These often correlate, but they can diverge when the query is ambiguous, the corpus has domain-specific terminology, the embedding model misses rare terms, or the user needs exact facts rather than semantic similarity.

Hybrid Search and Relevance

Hybrid search can improve relevance by combining dense vector similarity with keyword matching.

Vector search captures meaning. Keyword search captures exact terms, names, codes, and rare phrases.

A hybrid system may improve relevance even if pure vector ANN recall stays the same, because the retrieval strategy is using additional signals.
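One common way to combine the two result lists is reciprocal rank fusion (RRF), sketched below. The text above does not prescribe a fusion method, so treat RRF as one illustrative option. The document IDs are invented, and `k=60` is a commonly used default constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for pos, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + pos)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical results for one query from two retrievers.
vector_hits = ["consumer_refunds", "enterprise_contracts", "refund_faq"]
keyword_hits = ["enterprise_contracts", "billing_terms", "consumer_refunds"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
# ['enterprise_contracts', 'consumer_refunds', 'billing_terms', 'refund_faq']
```

The vector ranking is unchanged, so its ANN recall is unchanged too. The final order improves only because the keyword signal promotes the document that matches the exact terms.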

Reranking and Relevance

Reranking improves relevance by scoring candidates more carefully after first-stage retrieval.

The first stage should retrieve a broad candidate set. The reranker then reorders those candidates based on a stronger query-document comparison.

This means recall and relevance work together: candidate recall gives the reranker good material, and reranking improves the final visible order.
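The dependency between the two stages can be sketched with a toy reranker. Everything here is an assumption for illustration: the candidate texts are invented, and `term_overlap` is a deliberately crude stand-in for a real reranking model such as a cross-encoder.

```python
def term_overlap(query, doc):
    """Toy relevance score: share of query terms that also occur in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)


def rerank(query, candidates, score_fn, top_n):
    """Reorder first-stage candidates by a stronger score and keep the best top_n."""
    ordered = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ordered[:top_n]


query = "refund policy for enterprise annual contracts"
# Hypothetical first-stage output, in vector-similarity order.
candidates = [
    "consumer refund policy for monthly plans",
    "how to request a refund",
    "enterprise annual contracts refund policy and terms",
]

# With a broad candidate set, the reranker can promote the best document.
reranked = rerank(query, candidates, term_overlap, top_n=2)
print(reranked[0])  # enterprise annual contracts refund policy and terms

# With a candidate set that is too narrow, the best document is never scored.
reranked_narrow = rerank(query, candidates[:2], term_overlap, top_n=2)
print(candidates[2] in reranked_narrow)  # False
```

The second call shows why candidate recall caps final quality: no reranker can surface a document that the first stage did not retrieve.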

Filters and Relevance

Metadata filters can improve relevance by enforcing business rules, permissions, tenant boundaries, freshness, region, product line, or document type.

Filters can also reduce recall if the system searches too narrowly or applies the filter only after candidate generation, because matching documents that missed the candidate cut are never considered.

Evaluate filtered search separately, because recall and relevance measured without filters can hide problems that appear only in production.
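The difference between filtering before and after candidate generation shows up even on a six-document toy collection. In this sketch the tenant names, IDs, and distances are invented, and a plain sort stands in for the vector index.

```python
# Hypothetical collection: each record has an ID, a tenant, and a query distance.
docs = [
    {"id": "b1", "tenant": "globex", "distance": 0.05},
    {"id": "b2", "tenant": "globex", "distance": 0.08},
    {"id": "a1", "tenant": "acme", "distance": 0.10},
    {"id": "b3", "tenant": "globex", "distance": 0.12},
    {"id": "a2", "tenant": "acme", "distance": 0.30},
    {"id": "a3", "tenant": "acme", "distance": 0.45},
]


def post_filter(docs, tenant, k):
    """Take the k nearest documents first, then drop those from other tenants."""
    nearest = sorted(docs, key=lambda d: d["distance"])[:k]
    return [d["id"] for d in nearest if d["tenant"] == tenant]


def pre_filter(docs, tenant, k):
    """Restrict to the tenant first, then take the k nearest of what remains."""
    allowed = [d for d in docs if d["tenant"] == tenant]
    return [d["id"] for d in sorted(allowed, key=lambda d: d["distance"])[:k]]


post = post_filter(docs, "acme", k=3)
pre = pre_filter(docs, "acme", k=3)
print(post)  # ['a1']
print(pre)   # ['a1', 'a2', 'a3']
```

Post-filtering returns one result where three were requested, because other tenants' documents used up the candidate slots. An unfiltered benchmark on the same index would not reveal this.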

Example

Suppose a user searches for “refund policy for enterprise annual contracts.”

A high-recall vector index may retrieve the exact nearest chunks, but those chunks might discuss consumer refunds because the embedding model treats the wording as similar.

A more relevant result may include the exact phrase “enterprise annual contract” even if it is not the closest vector overall.

This is why relevance often needs metadata, keyword signals, reranking, or domain-specific embeddings.

How to Evaluate Both

Evaluate recall and relevance together.

  • Use exact nearest-neighbor comparison to measure ANN recall.
  • Use labeled judgments to measure retrieval relevance.
  • Use Recall@K when coverage matters.
  • Use Precision@K when result cleanliness matters.
  • Use nDCG or MRR when ranking order matters.
  • Use RAG answer evaluation when retrieved context feeds generation.

Common Mistakes

Common mistakes include:

  • treating vector distance as a relevance score
  • assuming high ANN recall guarantees good search quality
  • measuring only top-k relevance without checking candidate recall
  • benchmarking without filters used in production
  • judging RAG quality only from final answers
  • changing embeddings without rebuilding relevance tests

Practical Rule

Use recall to check whether the retrieval system is finding enough candidates.

Use relevance metrics to check whether those candidates satisfy the user’s intent.

For production search and RAG, you usually need both: enough recall to avoid missing important evidence, and enough relevance to avoid filling the top results with plausible but unhelpful matches.

Summary

Recall is about coverage. Relevance is about usefulness.

In vector search, ANN recall tells you how closely approximate search matches exact nearest-neighbor search. Search relevance tells you whether the returned documents are actually good answers for the query.

The strongest retrieval systems measure both, then tune embeddings, indexing, filters, hybrid search, and reranking based on where the failure occurs.