RAG Evaluation Metrics Explained

RAG evaluation metrics measure how well a retrieval-augmented generation system finds useful context and uses that context to generate a grounded answer. A RAG system can fail in multiple places, so a single score is rarely enough.

The most useful approach is to evaluate the RAG pipeline in layers: retrieval, context quality, answer quality, grounding, citation quality, and operational performance.

Short Answer

RAG evaluation metrics help teams answer three questions:

Did retrieval find the right information?
Did the model use that information correctly?
Did the full system produce a useful answer at acceptable latency and cost?

Common RAG metrics include precision, recall, MRR, nDCG, MAP, context precision, context recall, answer relevance, faithfulness, groundedness, citation quality, completeness, correctness, hallucination rate, source freshness, latency, and cost.

Why RAG Evaluation Is Hard

RAG systems combine multiple components.

A typical RAG pipeline includes:

document ingestion
chunking
embedding
indexing
retrieval
reranking
prompt assembly
generation
citation or source display

If the final answer is bad, the root cause may lie in any of these steps, or in more than one. Metrics help isolate the failure.

Retrieval Metrics vs Answer Metrics

Retrieval metrics measure what enters the model context.

Answer metrics measure what the model produces from that context.

This distinction matters because a bad answer can come from good retrieval and poor generation, or from poor retrieval and reasonable generation. Without separate metrics, teams may tune the wrong component.

Precision@K

Precision@K measures the fraction of the top K retrieved results that are relevant.

For example, if a search returns 5 chunks and 3 are relevant, precision@5 is 3/5.

High precision means the context window contains less noise. Low precision means the model may see irrelevant, distracting, or misleading information.
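As a minimal sketch of the calculation above, assuming each chunk has an ID and a labeled set of relevant IDs exists for the query (the function name and the chunk IDs are illustrative, not from any particular library):

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    # Dividing by k (not by the number returned) penalizes short result lists.
    return hits / k


# Hypothetical chunk IDs: 3 of the 5 retrieved chunks are relevant.
retrieved = ["c1", "c7", "c3", "c9", "c4"]
relevant = {"c1", "c3", "c4", "c8"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.6
```

Dividing by K rather than by the number of results returned is one common convention; some tools divide by the number actually retrieved.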

Recall@K

Recall@K measures the fraction of all relevant items in the corpus that appear in the top K retrieved results.

If there are 10 relevant chunks in the corpus and the system retrieves 6 in the top 20, recall@20 is 6/10.

High recall is important when the answer requires multiple facts, broad coverage, or complete evidence.
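The example above can be reproduced with a short sketch. It assumes the full set of relevant chunk IDs is known for the query, which in practice usually requires labeled data; the function name and IDs are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # recall is undefined when no item is labeled relevant
    found = relevant.intersection(retrieved_ids[:k])
    return len(found) / len(relevant)


# 10 relevant chunks exist; the top 20 results contain 6 of them.
relevant = {f"r{i}" for i in range(10)}
retrieved = [f"r{i}" for i in range(6)] + [f"x{i}" for i in range(14)]
print(recall_at_k(retrieved, relevant, k=20))  # 0.6
```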

MRR

Mean Reciprocal Rank (MRR) measures how high the first relevant result appears, averaged across a set of queries.

If the first relevant result is ranked first, the reciprocal rank is 1. If it is ranked second, it is 1/2. If it is ranked fifth, it is 1/5.

MRR is useful for factoid questions where the system needs one correct source near the top.
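A small sketch of the computation, using the three ranks mentioned above as three separate queries. Scoring a query with no relevant result as 0 is a common convention and is an assumption here; the function names are illustrative.

```python
def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["a", "b", "c"], {"a"}),            # first relevant at rank 1 -> 1
    (["d", "e", "f"], {"e"}),            # first relevant at rank 2 -> 1/2
    (["g", "h", "i", "j", "k"], {"k"}),  # first relevant at rank 5 -> 1/5
]
print(round(mean_reciprocal_rank(runs), 3))  # 0.567
```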

nDCG

Normalized Discounted Cumulative Gain measures ranking quality when relevance can have degrees.

A highly relevant result at rank 1 is better than the same result at rank 8. nDCG captures this by discounting each result's relevance grade according to its rank, then dividing by the score of an ideal ordering so that the result falls between 0 and 1.

nDCG is useful when some chunks are partially relevant and others are highly relevant.
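The sketch below shows one way to compute nDCG from graded relevance labels. It assumes non-negative integer grades (for example 0 to 3), uses the raw grade as the gain, and applies a log2 rank discount; another widely used variant uses 2^grade - 1 as the gain, so scores from different tools may not match. The function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_gains, k, all_gains=None):
    """ranked_gains: relevance grades in retrieved order.

    all_gains: grades of every judged item, used to build the ideal ordering.
    Defaults to ranked_gains, which ignores relevant items never retrieved.
    """
    pool = all_gains if all_gains is not None else ranked_gains
    ideal_dcg = dcg(sorted(pool, reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant items at all
    return dcg(ranked_gains[:k]) / ideal_dcg


print(ndcg_at_k([3, 2, 1], k=3))  # 1.0 (already in ideal order)
print(ndcg_at_k([0, 0, 3], k=3))  # 0.5 (best chunk pushed to rank 3)
```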

MAP

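Mean Average Precision (MAP) measures ranking quality across multiple relevant results and multiple queries. For each query, average precision is the mean of the precision values at each rank where a relevant result appears; MAP is the mean of those per-query scores.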

It is useful when a query has several correct sources and the system should retrieve many of them early in the ranking.

MAP is less intuitive than precision or recall, but useful for comparing retriever configurations.
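A worked sketch may make MAP more concrete. It assumes unique chunk IDs per result list and divides by the total number of relevant items, so relevant items that are never retrieved lower the score. The function names and data are illustrative.

```python
def average_precision(retrieved_ids, relevant_ids):
    """Mean of precision values at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return None  # undefined without any relevant items
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    scores = [average_precision(r, rel) for r, rel in runs]
    scores = [s for s in scores if s is not None]
    if not scores:
        raise ValueError("no query has labeled relevant items")
    return sum(scores) / len(scores)


runs = [
    (["a", "x", "b", "y"], {"a", "b"}),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a"}),                 # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 3))  # 0.667
```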

Context Precision

Context precision measures whether the context given to the model is relevant to the question.

It is similar to retrieval precision, but focuses on the final context passed into the prompt after retrieval, filtering, reranking, and truncation.

Low context precision means the model is being asked to answer with noisy evidence.

Context Recall

Context recall measures whether the prompt context contains the information needed to answer the question.

Low context recall means the answer may be incomplete or the model may be forced to guess.

This metric is especially important for multi-hop questions, policy questions, technical troubleshooting, and comparisons.
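Context precision and context recall can be defined in several ways, and evaluation tools differ. The sketch below is one simple labeled-data version, assuming a reviewer (or a calibrated judge) has listed the facts required to answer the question and tagged each context chunk with the required facts it supports. The fact names are hypothetical.

```python
def context_precision(chunk_facts, required_facts):
    """Share of context chunks that support at least one required fact."""
    if not chunk_facts:
        return 0.0
    required = set(required_facts)
    useful = sum(1 for facts in chunk_facts if required & set(facts))
    return useful / len(chunk_facts)


def context_recall(chunk_facts, required_facts):
    """Share of required facts supported by at least one context chunk."""
    required = set(required_facts)
    if not required:
        return None  # undefined when no facts are required
    covered = set()
    for facts in chunk_facts:
        covered |= required & set(facts)
    return len(covered) / len(required)


# Hypothetical labels for a refund-policy question with four context chunks.
required = {"refund_window", "refund_method", "exceptions"}
chunks = [{"refund_window"}, set(), {"refund_method", "refund_window"}, set()]
print(context_precision(chunks, required))         # 0.5
print(round(context_recall(chunks, required), 3))  # 0.667
```

In this example half of the context is noise, and the "exceptions" fact is missing, so the model would have to guess about exceptions.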

Answer Relevance

Answer relevance measures whether the generated answer addresses the user's question.

An answer can be factual but not relevant. For example, it may summarize the retrieved document without answering the specific question.

Answer relevance is useful for detecting evasive, vague, incomplete, or off-topic responses.

Faithfulness

Faithfulness measures whether the answer is supported by the retrieved context.

A faithful answer does not introduce claims that are absent from the provided evidence.

Faithfulness is one of the most important RAG generation metrics because it directly targets unsupported claims.
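One common way to score faithfulness is at the claim level: split the answer into claims, label each as supported or unsupported by the retrieved context, and report the supported fraction. The sketch below assumes those labels already exist (from a reviewer or a calibrated LLM judge); the Claim class and example claims are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supported: bool  # label: is this claim backed by the retrieved context?


def faithfulness_score(claims):
    """Fraction of answer claims supported by the retrieved context."""
    if not claims:
        return None  # nothing to score
    return sum(1 for c in claims if c.supported) / len(claims)


claims = [
    Claim("Refunds are available within 30 days.", supported=True),
    Claim("Refunds go to the original payment method.", supported=True),
    Claim("Gift cards are refundable.", supported=False),
]
print(round(faithfulness_score(claims), 3))  # 0.667
```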

Groundedness

Groundedness measures whether answer claims can be traced to reliable source material.

Groundedness is closely related to faithfulness. In practice, teams often use groundedness to ask: can this claim be verified from the documents the system retrieved?

Low groundedness is a warning sign for hallucination risk.

Hallucination Rate

Hallucination rate measures how often the system generates unsupported or false claims.

In RAG systems, hallucination can happen because:

retrieval missed the right source
retrieval returned misleading context
the model ignored the context
the model overgeneralized from partial evidence
the prompt encouraged unsupported synthesis

Hallucination detection should be connected to traces so teams can find the cause.
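As a rough sketch of connecting hallucination rate to traces, the function below counts answers with at least one unsupported claim and keeps their trace IDs and labeled causes. The record fields (trace_id, unsupported_claims, cause) are assumptions for illustration, not a schema from any tracing tool.

```python
from collections import Counter


def hallucination_report(records):
    """Return (rate, cause counts, flagged trace IDs) for evaluated answers."""
    if not records:
        raise ValueError("at least one evaluated answer is required")
    flagged = [r for r in records if r["unsupported_claims"] > 0]
    rate = len(flagged) / len(records)
    causes = Counter(r["cause"] or "unlabeled" for r in flagged)
    trace_ids = [r["trace_id"] for r in flagged]
    return rate, causes, trace_ids


# Hypothetical evaluation records, one per answered question.
records = [
    {"trace_id": "t1", "unsupported_claims": 0, "cause": None},
    {"trace_id": "t2", "unsupported_claims": 2, "cause": "retrieval_miss"},
    {"trace_id": "t3", "unsupported_claims": 0, "cause": None},
    {"trace_id": "t4", "unsupported_claims": 1, "cause": None},
]
rate, causes, trace_ids = hallucination_report(records)
print(rate)       # 0.5
print(trace_ids)  # ['t2', 't4']
```

This counts answers, not claims. A claim-level rate is also reasonable, as long as the team states which one it reports.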

Citation Quality

Citation quality measures whether the answer cites the right sources for the claims it makes.

Useful citation checks include:

Does every important claim have a citation?
Does the cited source support the claim?
Is the cited source the most relevant source?
Are citations specific enough to inspect?
Are stale or low-authority sources avoided?

High citation count is not the same as high citation quality.
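The difference between citation count and citation quality can be shown with two separate numbers: coverage (how many claims carry any citation) and support rate (how many citations actually back their claim). The sketch assumes per-claim labels of which cited sources support the claim; the field names and source IDs are hypothetical.

```python
def citation_metrics(claims):
    """claims: list of dicts with 'cited' and 'supporting' source-ID lists.

    'supporting' holds the cited sources judged to support the claim.
    """
    total_claims = len(claims)
    total_citations = sum(len(set(c["cited"])) for c in claims)
    cited_claims = sum(1 for c in claims if c["cited"])
    supported = sum(len(set(c["cited"]) & set(c["supporting"])) for c in claims)
    return {
        "coverage": cited_claims / total_claims if total_claims else None,
        "support_rate": supported / total_citations if total_citations else None,
    }


claims = [
    {"cited": ["s1", "s2"], "supporting": ["s1"]},  # one citation is padding
    {"cited": [], "supporting": []},                # uncited claim
    {"cited": ["s3"], "supporting": ["s3"]},
]
print(citation_metrics(claims))
```

Here two of three claims are cited and two of three citations are supported, so adding more citations to the first claim would raise the count without raising quality.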

Completeness

Completeness measures whether the answer includes the necessary parts of the response.

A RAG answer may be accurate and grounded but still incomplete if it misses an important condition, exception, step, or comparison.

Completeness is often domain-specific and may require a rubric or human review.

Correctness

Correctness measures whether the answer is factually right according to trusted ground truth.

Correctness differs from faithfulness. A response can be faithful to retrieved context but still wrong if the retrieved context is outdated or incorrect.

For high-risk domains, correctness should use trusted references, labels, or expert review.

Source Freshness

Source freshness measures whether the answer uses current information.

This matters for fast-changing content such as pricing, policies, APIs, regulations, product behavior, incidents, and support status.

Freshness can be tracked with document timestamps, index update times, and source version metadata.
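A simple freshness check compares each cited source's last-updated timestamp against a maximum allowed age. The sketch below assumes every source record carries a timezone-aware updated_at value; the source IDs and the 90-day threshold are illustrative, and the right threshold depends on the content type.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(sources, max_age, now=None):
    """Return IDs of sources whose last update is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return [s["id"] for s in sources if now - s["updated_at"] > max_age]


# Hypothetical sources used in one answer; 'now' is fixed for reproducibility.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sources = [
    {"id": "policy-v3", "updated_at": datetime(2024, 5, 20, tzinfo=timezone.utc)},
    {"id": "pricing-2022", "updated_at": datetime(2022, 1, 15, tzinfo=timezone.utc)},
]
print(stale_sources(sources, max_age=timedelta(days=90), now=now))  # ['pricing-2022']
```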

Latency

Latency measures how long the RAG system takes to respond.

Break latency into components:

query rewriting time
embedding time
retrieval time
reranking time
generation time
evaluation or guardrail time

Quality improvements are useful only if latency stays acceptable for the product.
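Per-stage latency can be captured with a small timing helper. This is a minimal sketch using only the standard library; the StageTimer class is hypothetical, and the sleep calls stand in for real retrieval and generation work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.01)  # placeholder for the vector search call
with timer.stage("generation"):
    time.sleep(0.02)  # placeholder for the LLM call

total = sum(timer.durations.values())
for name, seconds in timer.durations.items():
    print(f"{name}: {seconds * 1000:.1f} ms ({seconds / total:.0%} of total)")
```

In production, averages hide slow requests, so teams typically aggregate these per-stage values into percentiles such as p50 and p95.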

Cost

Cost measures the resources required to produce an answer.

Track:

embedding cost
retrieval infrastructure cost
reranker cost
input tokens
output tokens
LLM judge cost
retry cost

RAG evaluation should include cost because some improvements are too expensive for production traffic.
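A back-of-the-envelope sketch of per-request LLM cost from token counts. The prices below are made-up placeholders, not any provider's actual rates, and the attempts parameter assumes every retry uses the same token counts, which is a simplification.

```python
def request_cost(input_tokens, output_tokens, price_in_per_1k, price_out_per_1k,
                 attempts=1):
    """Estimated LLM cost for one request, including identical retries."""
    per_attempt = (
        (input_tokens / 1000) * price_in_per_1k
        + (output_tokens / 1000) * price_out_per_1k
    )
    return per_attempt * attempts


# Hypothetical prices: 0.001 per 1k input tokens, 0.002 per 1k output tokens.
cost = request_cost(3000, 500, price_in_per_1k=0.001, price_out_per_1k=0.002,
                    attempts=2)
print(f"{cost:.4f}")  # 0.0080
```

Embedding, reranker, judge, and infrastructure costs would be added on top of this figure to get the full cost per answer.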

User Outcome Metrics

Technical metrics do not always match user outcomes.

Product-level signals may include:

thumbs up or down
resolution rate
follow-up question rate
support escalation rate
task completion
click-through on cited sources
time to answer
human reviewer acceptance rate

Use product metrics alongside technical metrics, not instead of them.

Offline vs Online Metrics

Offline metrics run on test datasets before release.

Online metrics run on production or sampled production traffic.

Offline metrics are good for comparing configurations and preventing regressions. Online metrics are good for detecting real-world failures, drift, and unexpected user behavior.

LLM-Based Evaluation

LLM-based evaluation uses a model to score relevance, faithfulness, groundedness, completeness, or citation support.

This can reduce manual labeling effort, but it should be calibrated. Judge prompts, model choice, scoring rubrics, and examples affect results.

Use LLM judges as measurement tools, not unquestioned truth.
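One way to calibrate a judge is to compare its labels with human labels on the same examples. The sketch below computes Cohen's kappa, which corrects raw agreement for agreement expected by chance. The labels are invented for illustration, and returning 1.0 in the degenerate single-label case is a choice made for this sketch.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical faithfulness verdicts on eight answers.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "fail"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "fail"]
print(cohens_kappa(human, judge))  # 0.5
```

Raw agreement here is 75%, but because each rater passes half the items, chance alone would give 50%, so kappa is 0.5.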

Human Evaluation

Human evaluation is useful for subjective, high-risk, or ambiguous tasks.

Human reviewers can judge nuance, usefulness, tone, missing context, and domain correctness better than many automated metrics.

The downsides are cost, slow turnaround, and disagreement between reviewers. Use clear rubrics to improve consistency.

How to Interpret Metrics Together

RAG metrics are most useful when interpreted together.

High retrieval precision with low answer relevance suggests generation or prompting problems.
Low context recall with a high hallucination rate suggests missing evidence.
High faithfulness with low completeness suggests the answer is grounded but incomplete.
High answer quality with high latency suggests a performance trade-off.
High citation count with low citation quality suggests citation stuffing.

Do not optimize one metric blindly.
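Several of these combinations can be encoded as simple rules over a metrics dictionary. The sketch below covers four of them; the metric keys and every threshold are arbitrary assumptions for illustration and would need tuning for a real system.

```python
def diagnose(m, low=0.5, high=0.8, max_hallucination=0.1, max_latency_s=3.0):
    """Map metric combinations to likely problem areas (illustrative rules)."""
    findings = []
    if m["retrieval_precision"] >= high and m["answer_relevance"] <= low:
        findings.append("check generation or prompting")
    if m["context_recall"] <= low and m["hallucination_rate"] > max_hallucination:
        findings.append("check for missing evidence")
    if m["faithfulness"] >= high and m["completeness"] <= low:
        findings.append("answers are grounded but incomplete")
    if m["answer_quality"] >= high and m["latency_s"] > max_latency_s:
        findings.append("quality comes with a latency trade-off")
    return findings


# Hypothetical aggregate scores from one evaluation run.
metrics = {
    "retrieval_precision": 0.9, "answer_relevance": 0.85,
    "context_recall": 0.4, "hallucination_rate": 0.2,
    "faithfulness": 0.7, "completeness": 0.6,
    "answer_quality": 0.75, "latency_s": 1.2,
}
print(diagnose(metrics))  # ['check for missing evidence']
```

Rules like these are a starting point for triage. They point to where traces should be inspected, not to a confirmed root cause.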

Common Mistakes

Using only final answer ratings.
Ignoring retrieval metrics.
Confusing faithfulness with correctness.
Measuring citation count instead of citation support.
Using an unrepresentative test set.
Optimizing for precision while destroying recall.
Ignoring latency and cost.
Trusting LLM judges without calibration.

Evaluation Checklist

Separate retrieval metrics from answer metrics.
Measure precision, recall, MRR, or nDCG for retrieval when labels exist.
Measure context precision and context recall for prompt context quality.
Measure answer relevance and completeness.
Measure faithfulness and groundedness.
Evaluate citation support, not only citation presence.
Track latency and cost by pipeline step.
Use human review for high-risk or subjective cases.
Calibrate LLM judges against trusted examples.
Connect metrics to traces for debugging.

Summary

RAG evaluation metrics explain whether a system retrieved the right evidence, gave the model useful context, generated a relevant and grounded answer, cited sources correctly, and did so within acceptable latency and cost.

The best RAG evaluation strategy is layered. Measure retrieval quality, context quality, answer quality, grounding, citations, operational performance, and user outcomes together. That is how teams find the right part of the system to improve.