LLM evaluation is the process of measuring whether a large language model application produces useful, accurate, safe, and reliable outputs. It helps teams understand how well prompts, retrieval systems, tools, agents, guardrails, and workflows behave before and after launch.
Evaluation is not only about scoring a model in isolation. In production AI applications, the important question is whether the whole system does the right thing for real tasks.
Short Answer
LLM evaluation measures the quality, safety, reliability, and behavior of outputs from LLM-powered systems.
It can evaluate:
- answer accuracy
- answer relevance
- groundedness
- faithfulness to sources
- retrieval quality
- citation quality
- tool use
- policy compliance
- workflow completion
- latency and cost
Good evaluation turns subjective impressions into repeatable evidence.
Why LLM Evaluation Matters
LLM applications can look good in demos while failing under realistic conditions.
They may answer confidently with unsupported claims, miss important context, cite irrelevant sources, choose the wrong tool, violate policy, or behave differently after a prompt, model, or data change.
Evaluation helps teams catch these problems before users do.
Evaluation vs Guardrails
Evaluation and guardrails are related, but they are not the same.
Evaluation measures behavior. It answers: how good, safe, or reliable was the output?
Guardrails enforce boundaries. They answer: should this input, output, or action be allowed?
An evaluation may produce a score that later triggers a guardrail, retry, human review, or fallback path.
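As a minimal sketch of that hand-off, assume an evaluator has already produced a faithfulness score between 0 and 1. The function name, the action labels, and both thresholds below are illustrative assumptions, not values from any particular framework.

```python
def route_on_score(faithfulness_score: float,
                   block_below: float = 0.3,
                   review_below: float = 0.7) -> str:
    """Map an evaluation score in [0, 1] to a follow-up action.

    The thresholds are hypothetical; a real system would tune them
    against labeled examples.
    """
    if not 0.0 <= faithfulness_score <= 1.0:
        raise ValueError("score must be in [0, 1]")
    if faithfulness_score < block_below:
        return "fallback"       # guardrail blocks the output
    if faithfulness_score < review_below:
        return "human_review"   # uncertain: send to a reviewer
    return "allow"              # score is high enough to ship
```

The evaluation step only measures; the routing step is where enforcement happens.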
What Gets Evaluated?
LLM evaluation can happen at several levels.
- Model output: Did the answer satisfy the user request?
- Prompt: Did the instruction produce consistent behavior?
- Retrieval: Did the system fetch the right context?
- RAG answer: Did the response use retrieved context faithfully?
- Tool use: Did the agent choose and call the right tool?
- Workflow: Did the whole process complete correctly?
- Safety: Did the system follow policy and avoid harmful outputs?
For real applications, evaluate the system path, not just the final text.
Common Evaluation Metrics
Different applications need different metrics.
Common LLM evaluation metrics include:
- Accuracy: Is the answer correct?
- Relevance: Does the answer address the question?
- Completeness: Does the answer include the important parts?
- Faithfulness: Is the answer supported by provided context?
- Groundedness: Are claims traceable to reliable evidence?
- Helpfulness: Does the response solve the user's problem?
- Safety: Does the output avoid prohibited or risky content?
- Format validity: Does the output match the required schema?
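When a reference answer exists, accuracy can sometimes be approximated with simple string metrics. The sketch below shows exact match and token-level F1 under a deliberately naive normalization (lowercase, whitespace split). Both function names are illustrative, and these metrics are only one possible proxy for accuracy.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Naive normalization: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts tokens shared by both strings.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

String overlap rewards surface similarity, so it suits short factual answers better than open-ended responses.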
RAG Evaluation
Retrieval-augmented generation needs two layers of evaluation.
Retrieval evaluation measures whether the system found useful context.
Answer evaluation measures whether the final answer used that context correctly.
A RAG answer can fail because retrieval missed the right document, because the model ignored the document, or because the model added unsupported information.
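The two layers can be sketched separately. The example below assumes each test case lists the IDs of the documents that should have been retrieved, and that an upstream check (human or automated) has already labeled whether the answer used the context and whether it contains unsupported claims. All names are hypothetical.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def diagnose_rag_failure(retrieved_ids: list[str],
                         relevant_ids: set[str],
                         answer_used_context: bool,
                         has_unsupported_claims: bool) -> str:
    """Attribute a RAG result to one of the three failure modes, or 'ok'."""
    if not set(retrieved_ids) & relevant_ids:
        return "retrieval_miss"        # the right document was never fetched
    if not answer_used_context:
        return "context_ignored"       # fetched, but the model did not use it
    if has_unsupported_claims:
        return "unsupported_addition"  # used, but the model added extra claims
    return "ok"
```

Keeping the diagnosis separate from the score makes it clearer whether to fix the retriever or the generation prompt.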
Agent Evaluation
Agent evaluation measures more than final text.
It may check:
- task decomposition
- tool selection
- tool arguments
- use of memory
- state transitions
- retry behavior
- approval handling
- workflow completion
- safe stopping behavior
Agent systems need evaluation because a correct final answer can hide inefficient or unsafe intermediate behavior.
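A minimal sketch of checking intermediate behavior, assuming the agent's tool calls are logged in order and the test case specifies the expected sequence. The `ToolCall` type, the check names, and the call budget are illustrative assumptions; strict sequence matching is only one possible policy.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


def check_tool_calls(actual: list[ToolCall],
                     expected: list[ToolCall],
                     max_calls: int) -> dict[str, bool]:
    """Compare a logged tool-call sequence against the expected one."""
    # Tool selection: same tools, in the same order.
    names_match = [c.name for c in actual] == [c.name for c in expected]
    # Tool arguments: only meaningful when the selection already matches.
    args_match = names_match and all(
        a.args == e.args for a, e in zip(actual, expected)
    )
    return {
        "tool_selection": names_match,
        "tool_arguments": args_match,
        # A correct answer reached with too many calls is still inefficient.
        "within_call_budget": len(actual) <= max_calls,
    }
```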
Offline Evaluation
Offline evaluation runs tests before deployment or before a change is released.
It usually uses a fixed dataset of inputs paired with expected answers, rubrics, reference documents, or human labels.
Offline evals are useful for:
- prompt changes
- model upgrades
- retrieval tuning
- regression testing
- comparing architectures
- catching known failure modes
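An offline eval can be as small as a loop over a fixed dataset. The sketch below assumes each case is a dict with `id`, `input`, and `expected` keys, and that `system` and `scorer` are callables supplied by the team. These names and the dict layout are hypothetical.

```python
from typing import Callable


def run_offline_eval(cases: list[dict],
                     system: Callable[[str], str],
                     scorer: Callable[[str, str], bool]) -> dict:
    """Run every case through the system and summarize pass/fail results."""
    failed_ids = []
    for case in cases:
        output = system(case["input"])
        if not scorer(output, case["expected"]):
            failed_ids.append(case["id"])
    total = len(cases)
    passed = total - len(failed_ids)
    return {
        "total": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "failed_ids": failed_ids,  # keep IDs so failures can be inspected
    }
```

Returning the failed IDs, not just the pass rate, is what makes the run useful for debugging known failure modes.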
Online Evaluation
Online evaluation measures behavior on production or near-production traffic.
It can use user feedback, sampled traces, automated judges, human review, business outcomes, and monitoring metrics.
Online evals are useful because production traffic often includes edge cases that test datasets miss.
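Scoring every production request is often too expensive, so teams commonly evaluate a sample of traces. One way to sketch this is deterministic sampling on a hash of the trace ID, so the same trace always gets the same decision. The function name and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces for evaluation."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < rate * 2**64
```

Because the decision depends only on the ID, a trace that was sampled once stays sampled across reruns and services.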
Golden Datasets
A golden dataset is a trusted set of examples used to evaluate an AI system.
It may contain:
- input questions
- expected answers
- reference documents
- acceptable citations
- tool use expectations
- rubric labels
- known bad examples
Golden datasets should evolve as the application and user behavior evolve.
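One possible shape for a golden example is sketched below as a dataclass stored one JSON object per line, which keeps the dataset easy to diff and extend. The field names mirror the list above but are otherwise hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class GoldenExample:
    example_id: str
    question: str
    expected_answer: str
    reference_doc_ids: list[str] = field(default_factory=list)
    acceptable_citations: list[str] = field(default_factory=list)
    expected_tools: list[str] = field(default_factory=list)
    rubric_label: str = ""
    is_known_bad: bool = False  # marks an example the system must NOT reproduce


def to_jsonl(examples: list[GoldenExample]) -> str:
    """Serialize examples as JSON Lines, one example per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


def from_jsonl(text: str) -> list[GoldenExample]:
    """Parse JSON Lines back into examples, skipping blank lines."""
    return [GoldenExample(**json.loads(line))
            for line in text.splitlines() if line.strip()]
```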
Human Evaluation
Human evaluation uses people to judge model outputs.
It is useful when quality is subjective, when the stakes are high, or when outputs are domain-specific or difficult to score automatically.
Human reviewers can assess nuance, tone, usefulness, policy fit, citation quality, and task success. The downsides are cost, slow turnaround, and inconsistency between reviewers.
LLM-as-a-Judge
LLM-as-a-judge uses a separate model to score or critique outputs.
The judge may evaluate relevance, faithfulness, policy compliance, tone, completeness, or other rubric criteria.
This approach scales more easily than human review, but it still needs calibration against human labels and careful prompt design.
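Calibration can be sketched as measuring agreement between judge labels and human labels on the same examples. The example below computes Cohen's kappa, which corrects raw agreement for agreement expected by chance. It is one common choice, not the only one, and the function name is illustrative.

```python
from collections import Counter


def cohen_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    # Observed agreement: share of items where both raters agree.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    # Expected agreement if both raters labeled independently at their base rates.
    expected = sum(
        judge_counts[label] * human_counts[label]
        for label in set(judge_counts) | set(human_counts)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 suggests the judge tracks human labels closely; a value near 0 suggests agreement no better than chance, which would argue for revising the judge prompt or rubric.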
Rule-Based Evaluation
Some checks should be deterministic.
Examples:
- JSON schema validity
- required fields present
- citation count
- blocked phrase detection
- tool name validity
- latency threshold
- maximum token budget
Do not use an LLM judge when a simple deterministic check is more reliable.
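Several of these checks fit in a few lines of standard-library code. The sketch below assumes the model is asked to return a JSON object with `answer` and `citations` keys. That output contract and the check names are assumptions made for illustration.

```python
import json


def run_rule_checks(raw_output: str,
                    required_fields: list[str],
                    blocked_phrases: list[str],
                    min_citations: int) -> dict[str, bool]:
    """Run deterministic checks on a raw model output string."""
    results = {
        "valid_json_object": False,
        "required_fields": False,
        "no_blocked_phrases": False,
        "enough_citations": False,
    }
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return results  # nothing else can be checked without valid JSON
    if not isinstance(data, dict):
        return results
    results["valid_json_object"] = True
    results["required_fields"] = all(f in data for f in required_fields)
    answer_text = str(data.get("answer", "")).lower()
    results["no_blocked_phrases"] = not any(
        phrase.lower() in answer_text for phrase in blocked_phrases
    )
    citations = data.get("citations", [])
    results["enough_citations"] = (
        isinstance(citations, list) and len(citations) >= min_citations
    )
    return results
```

Checks like these are cheap enough to run on every output, which leaves the LLM judge for criteria that genuinely need judgment.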
Evaluation Rubrics
A rubric defines what good means.
A useful rubric should include:
- evaluation criterion
- score scale
- pass threshold
- examples of good and bad outputs
- handling of partial credit
- domain-specific rules
Without a rubric, evaluation becomes inconsistent and hard to reproduce.
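Writing the rubric down as data, rather than leaving it in a reviewer's head, is one way to make it reproducible. The sketch below is a hypothetical structure covering the criterion, score scale, pass threshold, and per-level descriptions. None of the names come from a specific tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rubric:
    criterion: str
    min_score: int
    max_score: int
    pass_threshold: int
    # Optional description of what each score level means, e.g. {1: "...", 5: "..."}.
    level_descriptions: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.min_score <= self.pass_threshold <= self.max_score:
            raise ValueError("pass_threshold must lie within the score scale")

    def passes(self, score: int) -> bool:
        """Return True when a score on this rubric's scale meets the threshold."""
        if not self.min_score <= score <= self.max_score:
            raise ValueError("score is outside the rubric's scale")
        return score >= self.pass_threshold
```

The same object can be rendered into a judge prompt and into reviewer instructions, so humans and automated judges score against identical definitions.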
Traces and Evaluation
Traces connect evaluation scores to the actual workflow.
A trace can show the prompt, retrieved context, tool calls, model output, guardrail result, retry behavior, and final answer.
This helps teams understand why a score was low and which component needs improvement.
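As a rough sketch, suppose each trace records a score per component. Averaging those scores across traces points to the component most in need of attention. The trace layout (`trace_id` plus a `scores` dict) is an assumption, not a standard format.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_component(traces: list[dict]) -> dict[str, float]:
    """Average each component's score across all traces."""
    scores = defaultdict(list)
    for trace in traces:
        for component, score in trace["scores"].items():
            scores[component].append(score)
    return {component: mean(values) for component, values in scores.items()}


def weakest_component(traces: list[dict]) -> str:
    """Return the component with the lowest average score."""
    means = mean_score_by_component(traces)
    if not means:
        raise ValueError("no component scores found")
    return min(means, key=means.get)
```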
Regression Testing
Regression testing checks whether a change made the system worse.
Run regression evals before releasing a change to:
- prompts
- models
- retrievers
- chunking strategies
- ranking logic
- tool descriptions
- guardrails
- workflow policies
A small change to a prompt, model, or retriever can alter behavior on cases that seemed unrelated, so regression tests are essential.
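A regression check can be sketched as a comparison of per-case results from the current system (baseline) and the changed system (candidate). The sketch assumes both runs produce a mapping from case ID to pass/fail. The function names and the zero-regression default are illustrative.

```python
def find_regressions(baseline: dict[str, bool],
                     candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed before the change and fail after it."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        # A case missing from the candidate run is treated as a failure.
        if passed and not candidate.get(case_id, False)
    )


def release_allowed(baseline: dict[str, bool],
                    candidate: dict[str, bool],
                    max_regressions: int = 0) -> bool:
    """Gate a release on the number of newly failing cases."""
    return len(find_regressions(baseline, candidate)) <= max_regressions
```

Comparing case by case matters because an unchanged overall pass rate can hide cases that swapped from passing to failing.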
Feedback Loops
Evaluation becomes more useful when it feeds improvement.
Evaluation signals can support:
- prompt revision
- retrieval tuning
- dataset expansion
- guardrail updates
- human review routing
- model comparison
- workflow redesign
- automatic retry with corrective feedback
The goal is not only to score the system, but to improve it.
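The last item in the list, automatic retry with corrective feedback, can be sketched as a small loop. Here `generate` and `evaluate` are hypothetical callables: `generate` takes the prompt plus optional feedback, and `evaluate` returns a pass flag together with a feedback message for the next attempt.

```python
from typing import Callable, Optional


def generate_with_retry(
    generate: Callable[[str, Optional[str]], str],
    evaluate: Callable[[str], tuple[bool, Optional[str]]],
    prompt: str,
    max_attempts: int = 3,
) -> tuple[Optional[str], int]:
    """Retry generation, passing evaluator feedback into each new attempt.

    Returns (output, attempts_used); output is None if every attempt failed.
    """
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt, feedback)
        passed, feedback = evaluate(output)
        if passed:
            return output, attempt
    return None, max_attempts
```

Capping the attempts keeps latency and cost bounded, and a `None` result gives the caller a clear signal to fall back or escalate to human review.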
Common Mistakes
- Evaluating only a few handpicked examples.
- Using vague metrics such as “good answer” without a rubric.
- Confusing retrieval quality with answer quality.
- Trusting LLM judges without calibration.
- Ignoring failed tool calls and workflow behavior.
- Not tracking evaluation results over time.
- Skipping regression tests after prompt or model changes.
- Optimizing for one score while hurting user outcomes.
Evaluation Checklist
- Define the task and success criteria.
- Choose metrics that match the product goal.
- Create or collect representative test cases.
- Separate retrieval, answer, tool, and workflow evaluation.
- Use deterministic checks where possible.
- Use human review for subjective or high-risk cases.
- Calibrate LLM judges against trusted labels.
- Connect scores to traces for debugging.
- Run regression tests before releasing changes.
- Monitor production behavior after launch.
Summary
LLM evaluation is how teams measure whether an AI application works well enough for its intended task. It covers answer quality, retrieval quality, groundedness, safety, tool use, workflow reliability, latency, and cost.
Strong evaluation combines offline tests, online monitoring, human review, deterministic checks, LLM judges, traces, and feedback loops. It turns AI quality from a feeling into an engineering practice.