Precision, Recall, and MRR for Retrieval Evaluation

Precision, recall, and mean reciprocal rank are common metrics for evaluating retrieval systems. They help teams understand whether search results contain relevant documents, whether important documents are missing, and whether the best result appears near the top.

These metrics are useful for search engines, recommendation systems, vector databases, hybrid search, and RAG applications where retrieval quality affects answer quality.

Short Answer

Use precision, recall, and MRR to measure different aspects of retrieval quality.

  • Precision at k measures how many of the top k retrieved items are relevant.
  • Recall at k measures how many known relevant items were retrieved in the top k.
  • MRR measures how high the first relevant result appears in the ranking.

Precision rewards clean result sets. Recall rewards finding enough of the right evidence. MRR rewards putting a useful result near the top.

Why Retrieval Metrics Matter

In RAG systems, the language model can only ground its answer in the context it receives. If retrieval misses the right document, returns noisy chunks, or buries the best evidence too low, the final answer may become incomplete, irrelevant, or unsupported.

Retrieval metrics make these failures visible before they become generation failures.

What At K Means

Most retrieval metrics are measured at k.

The value k is the number of top results being evaluated. For example, precision at 5 evaluates only the first five retrieved results.

This matters because many systems pass only a limited number of chunks to the model. A relevant document at rank 50 may not help a RAG answer if only the top 5 chunks are used.

Relevance Labels

Precision, recall, and MRR require relevance labels.

A relevance label says whether a document, chunk, passage, product, or record is relevant to a query.

Labels may be:

  • binary, such as relevant or not relevant
  • graded, such as highly relevant, somewhat relevant, or irrelevant
  • human-labeled by reviewers
  • derived from logs or clicks
  • generated by an LLM judge and calibrated with human review

The quality of the labels determines the quality of the evaluation.
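One way to store such labels is a mapping from query to graded document labels, collapsed to binary sets when a metric needs them. The sketch below is illustrative only: the `GradedLabels` type, the 0 to 2 grade scale, the `to_binary` helper, and the IDs are all assumptions, not part of any particular tool.

```python
from typing import Dict, Set

# Hypothetical label store: query ID -> {document ID -> relevance grade}.
# Assumed grade scale: 0 = irrelevant, 1 = somewhat relevant, 2 = highly relevant.
GradedLabels = Dict[str, Dict[str, int]]


def to_binary(labels: GradedLabels, min_grade: int = 1) -> Dict[str, Set[str]]:
    """Collapse graded labels into a set of relevant document IDs per query."""
    return {
        query_id: {doc_id for doc_id, grade in docs.items() if grade >= min_grade}
        for query_id, docs in labels.items()
    }


labels: GradedLabels = {
    "q1": {"doc_a": 2, "doc_b": 1, "doc_x": 0},
    "q2": {"doc_c": 2, "doc_y": 0},
}

binary = to_binary(labels)                # lenient: grade 1 or higher counts
strict = to_binary(labels, min_grade=2)   # strict: only grade 2 counts
```

The choice of `min_grade` changes every binary metric computed downstream, so it is worth recording alongside the results.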

Precision at K

Precision at k measures the fraction of top k results that are relevant.

precision@k = relevant results in top k / k

If a system returns 5 results and 4 are relevant, precision at 5 is 0.8.

Precision answers: how much noise is in the retrieved context?
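A minimal sketch of the formula above, assuming binary labels stored as a set of relevant IDs. The function name and the document IDs are illustrative.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    # Divide by k, as in the formula above. If the system returns fewer than
    # k results, the empty slots effectively count as non-relevant.
    return hits / k


retrieved = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d2", "d3", "d5"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.8
```

Dividing by k rather than by the number of results actually returned is a design choice. It penalizes short result lists, which may or may not be what a given application wants.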

When Precision Is Useful

Precision is useful when irrelevant results are harmful.

Examples include:

  • RAG systems with small context windows
  • legal, medical, or financial retrieval
  • citation-heavy answers
  • support bots where wrong documents create wrong answers
  • search systems where users inspect only a few results

High precision means the retrieved set is clean.

Precision Limitations

Precision does not tell you whether all necessary evidence was found.

A result set can have high precision but low recall. For example, if a question needs three documents and the system returns a single result that is relevant, precision at 1 is 1.0 but recall is only 0.33, and the answer is likely to be incomplete.

Precision at k is also insensitive to order inside the top k: a relevant result at rank 1 counts the same as one at rank k.

Recall at K

Recall at k measures the fraction of all known relevant items that were retrieved in the top k.

recall@k = relevant results in top k / total known relevant results

If there are 4 known relevant documents and the system retrieves 3 of them in the top 10, recall at 10 is 0.75.

Recall answers: did the retriever find enough of the needed evidence?
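The recall formula can be sketched the same way. This assumes binary labels and treats a query with no known relevant items as an error, since the denominator would be zero. That handling is a choice made for this sketch, and the names are illustrative.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of known relevant IDs that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        raise ValueError("recall is undefined without known relevant items")
    # Use a set so a duplicated result is not counted twice.
    found = set(retrieved[:k]) & relevant
    return len(found) / len(relevant)


retrieved = ["r1", "n1", "r2", "n2", "n3", "r3", "n4", "n5", "n6", "n7"]
relevant = {"r1", "r2", "r3", "r4"}
print(recall_at_k(retrieved, relevant, k=10))  # 0.75
print(recall_at_k(retrieved, relevant, k=3))   # 0.5
```

Recall can only be measured against relevant items that someone has labeled, so incomplete labels tend to make it look better than it is.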

When Recall Is Useful

Recall is useful when missing evidence is costly.

Examples include:

  • multi-hop questions
  • research assistants
  • compliance review
  • knowledge-base answers that need multiple policy details
  • systems that rerank or filter an initial candidate set

High recall means the system is less likely to miss important sources.

Recall Limitations

Recall can reward broad retrieval even when many results are noisy.

A system can improve recall by increasing k, but that may add irrelevant context and hurt generation quality.

In RAG, higher recall is not automatically better if the extra context distracts the model.

Precision and Recall Trade-Off

Precision and recall often move in opposite directions.

Returning fewer results can improve precision but reduce recall. Returning more results can improve recall but reduce precision.

The right balance depends on the application. A fact lookup may need high precision in the top 3. A research workflow may need high recall across the top 20 before reranking.
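The trade-off is easy to see by sweeping k over a single ranking. The sketch below uses a made-up ranking and label set, assumes the ranking contains no duplicates, and uses the hypothetical helper name `sweep_k`.

```python
from typing import List, Sequence, Set, Tuple


def sweep_k(
    retrieved: Sequence[str], relevant: Set[str], max_k: int
) -> List[Tuple[int, float, float]]:
    """Return (k, precision@k, recall@k) for k = 1..max_k."""
    rows = []
    hits = 0
    for k, doc_id in enumerate(retrieved[:max_k], start=1):
        if doc_id in relevant:
            hits += 1  # assumes each document appears at most once
        rows.append((k, hits / k, hits / len(relevant)))
    return rows


retrieved = ["A", "X", "B", "Y", "Z", "C"]
relevant = {"A", "B", "C"}
for k, precision, recall in sweep_k(retrieved, relevant, max_k=6):
    print(f"k={k}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy ranking, precision falls from 1.00 at k=1 to 0.50 at k=6 while recall rises from 0.33 to 1.00. Real systems rarely trade off this cleanly, but the direction is typical.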

Mean Reciprocal Rank

Mean reciprocal rank, or MRR, measures how early the first relevant result appears.

For one query, reciprocal rank is:

reciprocal rank = 1 / rank of first relevant result

If the first result is relevant, reciprocal rank is 1. If the first relevant result appears at rank 4, reciprocal rank is 0.25. If no relevant result is retrieved, reciprocal rank is conventionally treated as 0.

MRR averages this value across many queries.
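A minimal sketch of reciprocal rank and MRR, assuming binary labels and scoring a query with no relevant result as 0, which is a common convention. The dictionary layout (`runs` and `labels` keyed by query ID) is an assumption made for the example.

```python
from typing import Dict, Sequence, Set


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    runs: Dict[str, Sequence[str]], labels: Dict[str, Set[str]]
) -> float:
    """Average reciprocal rank over all queries in `runs`."""
    if not runs:
        raise ValueError("no queries to evaluate")
    scores = [
        reciprocal_rank(ranking, labels.get(query_id, set()))
        for query_id, ranking in runs.items()
    ]
    return sum(scores) / len(scores)


runs = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z", "d"]}
labels = {"q1": {"a"}, "q2": {"d"}}
print(mean_reciprocal_rank(runs, labels))  # 0.625
```

Here the first query scores 1 and the second scores 0.25, so the mean is 0.625.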

When MRR Is Useful

MRR is useful when the first good result matters most.

Examples include:

  • question answering where one passage is enough
  • documentation search
  • support article lookup
  • entity lookup
  • factoid retrieval
  • top-result user interfaces

MRR answers: how quickly does the system put a useful result in front of the user or model?

MRR Limitations

MRR ignores relevant results after the first one.

This makes it less useful when a task needs multiple supporting documents. A query whose only relevant result is at rank 1 and a query with a relevant result at rank 1 plus five more below it both receive a reciprocal rank of 1.

Use MRR with recall when multi-document evidence matters.

Simple Example

Suppose a query has three relevant documents: A, B, and C.

The retriever returns:

Rank 1: X not relevant
Rank 2: A relevant
Rank 3: Y not relevant
Rank 4: B relevant
Rank 5: Z not relevant

For k = 5:

  • precision@5 = 2 / 5 = 0.40
  • recall@5 = 2 / 3 = 0.67
  • reciprocal rank = 1 / 2 = 0.50

Averaging the reciprocal rank across many queries gives MRR.
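The arithmetic in this example can be checked with a few lines of code. The script below hard-codes the ranking and labels from the example and is only meant as a sanity check.

```python
# Ranking and labels from the example above.
retrieved = ["X", "A", "Y", "B", "Z"]
relevant = {"A", "B", "C"}
k = 5

hits = [doc for doc in retrieved[:k] if doc in relevant]  # ["A", "B"]
precision = len(hits) / k
recall = len(set(hits)) / len(relevant)

# Rank of the first relevant result, or None if nothing relevant was retrieved.
first_rank = next(
    (rank for rank, doc in enumerate(retrieved, start=1) if doc in relevant),
    None,
)
rr = 1 / first_rank if first_rank else 0.0

print(f"precision@5 = {precision:.2f}")   # 0.40
print(f"recall@5 = {recall:.2f}")         # 0.67
print(f"reciprocal rank = {rr:.2f}")      # 0.50
```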

Context Precision and Context Recall

RAG evaluation often uses context precision and context recall.

Context precision measures how much of the retrieved context is relevant to the question.

Context recall measures whether the retrieved context contains the information needed to answer the question.

These are RAG-oriented versions of the same basic ideas: avoid noisy context, but do not miss necessary evidence.
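Evaluation frameworks define these two metrics in different ways, often with an LLM judge, so the following is only a rough sketch of the idea and not any framework's formula. It assumes each retrieved chunk has already been annotated with the set of needed facts it supports. The annotation scheme and the fact IDs are hypothetical.

```python
from typing import List, Set


def context_precision(chunk_support: List[Set[str]]) -> float:
    """Share of retrieved chunks that support at least one needed fact."""
    if not chunk_support:
        return 0.0
    useful = sum(1 for facts in chunk_support if facts)
    return useful / len(chunk_support)


def context_recall(chunk_support: List[Set[str]], needed_facts: Set[str]) -> float:
    """Share of needed facts supported by at least one retrieved chunk."""
    if not needed_facts:
        raise ValueError("no needed facts defined for this question")
    covered = set().union(*chunk_support) & needed_facts
    return len(covered) / len(needed_facts)


# Hypothetical annotations: which needed facts each retrieved chunk supports.
needed = {"refund_window", "refund_method", "exceptions"}
chunks = [{"refund_window"}, set(), {"refund_method"}, set()]

print(context_precision(chunks))        # 0.5
print(context_recall(chunks, needed))
```

Two of the four chunks are useful, and two of the three needed facts are covered, so this context is both noisy and incomplete.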

NDCG and MAP

Precision, recall, and MRR are not the only retrieval metrics.

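NDCG (normalized discounted cumulative gain) evaluates ranking quality with graded relevance. It rewards systems for putting highly relevant results near the top and discounts relevant results that appear lower in the ranking.

MAP (mean average precision) computes, for each query, the average of the precision values at each rank where a relevant result appears, then averages that value across queries.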

Use NDCG when some results are more relevant than others. Use MAP when you care about ranking many relevant items across the top k.
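Both metrics can be sketched in a few lines. The version below assumes linear gain for NDCG (some formulations use 2^grade - 1 instead), normalizes average precision by the total number of known relevant items, and assumes rankings contain no duplicates. The function names and sample grades are illustrative. MAP is the mean of `average_precision` over all queries.

```python
import math
from typing import Dict, Sequence, Set


def dcg_at_k(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(retrieved: Sequence[str], grades: Dict[str, int], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [grades.get(doc_id, 0) for doc_id in retrieved]  # unlabeled = 0
    ideal = sorted(grades.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@rank over the ranks where a relevant item appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)  # relevant items never retrieved contribute 0


retrieved = ["X", "A", "Y", "B", "Z"]
grades = {"A": 3, "B": 1, "C": 2}  # made-up graded labels
print(ndcg_at_k(retrieved, grades, k=5))
print(average_precision(retrieved, set(grades)))  # (1/2 + 2/4) / 3
```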

Choosing the Right Metric

Choose metrics based on the user experience and retrieval task.

  • Use precision when noisy context is the main risk.
  • Use recall when missing evidence is the main risk.
  • Use MRR when the first relevant result matters most.
  • Use NDCG when graded relevance and rank order matter.
  • Use context precision and context recall for RAG-specific evaluation.

Most serious retrieval evaluations use more than one metric.

RAG-Specific Interpretation

In RAG, retrieval metrics should be interpreted through answer quality.

High precision can improve faithfulness by reducing noisy context. High recall can improve completeness by including required evidence. High MRR can improve answer quality when one strong passage is enough, and it may allow a smaller k, which shortens the prompt.

But retrieval metrics do not prove the generated answer is good. They should be paired with answer relevance, faithfulness, and citation quality checks.

Building a Test Set

A retrieval evaluation test set should include:

  • realistic user queries
  • known relevant documents or chunks
  • hard negatives
  • ambiguous queries
  • exact keyword queries
  • semantic queries
  • multi-hop questions
  • queries with no valid answer

Include examples that reflect the real distribution of production usage.

Hard Negatives

Hard negatives are documents that look similar to relevant results but do not answer the query.

They are important because retrieval systems often fail by returning plausible but insufficient context.

Good hard negatives make precision and ranking metrics more meaningful.

Evaluation by Slice

Average metrics can hide failures.

Track precision, recall, and MRR by:

  • topic
  • language
  • tenant
  • document type
  • query type
  • fresh vs stale content
  • short vs long queries
  • keyword-heavy vs semantic queries

Slice-level evaluation shows where retrieval actually needs work.
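Slicing can be sketched as a group-by over per-query scores. The row layout, the field names (`query_type`, `recall@10`), and the numbers below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_by_slice(rows: List[dict], slice_key: str, metric: str) -> Dict[str, float]:
    """Average one per-query metric within each value of a slice field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[slice_key]].append(row[metric])
    return {name: mean(values) for name, values in groups.items()}


# Hypothetical per-query results, one row per evaluated query.
rows = [
    {"query_type": "keyword", "recall@10": 1.0},
    {"query_type": "keyword", "recall@10": 1.0},
    {"query_type": "semantic", "recall@10": 0.5},
    {"query_type": "semantic", "recall@10": 0.0},
]

print(mean(row["recall@10"] for row in rows))          # 0.625
print(mean_by_slice(rows, "query_type", "recall@10"))  # keyword 1.0, semantic 0.25
```

The overall mean of 0.625 hides the fact that keyword queries score 1.0 while semantic queries score 0.25.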

Regression Testing

Precision, recall, and MRR are useful in regression tests.

Run retrieval evaluation before releasing changes to:

  • embedding models
  • chunking strategy
  • hybrid search weights
  • metadata filters
  • rerankers
  • index settings
  • document ingestion pipelines
  • query rewriting

Compare metrics against the previous baseline and inspect failures manually.

Thresholds

Set thresholds that match the product risk.

Examples:

  • precision@5 must not drop below 0.80
  • recall@10 must not regress by more than 2 percentage points
  • MRR must improve for support article lookup
  • critical policy questions must retrieve the approved source at rank 1
  • no-answer queries must not retrieve low-quality filler context

Thresholds should be stricter for high-impact domains.
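Thresholds like the first three examples can be encoded as a release gate that compares a candidate run against the stored baseline. This is a sketch: the metric keys, the `check_release` helper, and the sample numbers are assumptions, and the limits are the example values listed above.

```python
from typing import Dict, List

EPS = 1e-9  # tolerance so float rounding does not trip a boundary case


def check_release(baseline: Dict[str, float], candidate: Dict[str, float]) -> List[str]:
    """Return a list of threshold violations (empty means the gate passes)."""
    failures = []
    if candidate["precision@5"] < 0.80 - EPS:
        failures.append("precision@5 dropped below 0.80")
    if baseline["recall@10"] - candidate["recall@10"] > 0.02 + EPS:
        failures.append("recall@10 regressed by more than 2 percentage points")
    if candidate["mrr"] <= baseline["mrr"]:
        failures.append("MRR did not improve")
    return failures


# Hypothetical aggregate scores from two evaluation runs.
baseline = {"precision@5": 0.84, "recall@10": 0.90, "mrr": 0.71}
candidate = {"precision@5": 0.82, "recall@10": 0.87, "mrr": 0.74}

for failure in check_release(baseline, candidate):
    print("FAIL:", failure)
```

With these numbers the only failure is the recall regression, which is 3 points against an allowed 2.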

Production Monitoring

Offline retrieval metrics should be paired with production signals.

Useful signals include:

  • click-through rate
  • result abandonment
  • follow-up searches
  • user feedback
  • support escalations
  • citation failures
  • empty retrieval rate
  • low relevance score rate

Production signals are noisy, but they help identify new test cases and drift.

Common Mistakes

  • Using precision alone and missing recall failures.
  • Using recall alone and flooding the model with noisy context.
  • Using MRR for tasks that require multiple documents.
  • Evaluating only easy queries.
  • Ignoring hard negatives.
  • Changing k without updating the interpretation of results.
  • Trusting average scores without slice analysis.
  • Assuming retrieval metrics guarantee answer quality.

Evaluation Checklist

  • Define the retrieval unit: document, chunk, passage, or record.
  • Create relevance labels for representative queries.
  • Measure precision at the k used by the application.
  • Measure recall at the k needed for enough evidence.
  • Measure MRR when the first relevant result matters.
  • Add NDCG when graded relevance matters.
  • Track results by topic, query type, and document type.
  • Use hard negatives in the benchmark.
  • Run metrics in retrieval regression tests.
  • Pair retrieval metrics with answer relevance and faithfulness checks.

Summary

Precision, recall, and MRR measure different parts of retrieval quality. Precision measures how clean the top results are. Recall measures how much relevant evidence was found. MRR measures how quickly the first relevant result appears.

For RAG applications, these metrics are most useful when measured at the same top k used by the application, tracked by dataset slice, and paired with generation-side checks such as answer relevance, faithfulness, and citation quality.