Error Taxonomy for RAG and AI Agents

An error taxonomy for RAG and AI agents is a shared vocabulary for classifying failures by where they occur and how they appear. Without a taxonomy, teams often treat every bad answer as a model problem. In practice, failures can start in indexing, retrieval, generation, tool use, state management, permissions, or multi-agent handoffs.

A good taxonomy helps teams debug faster, design better evaluations, prioritize fixes, and avoid applying a remedy to the wrong layer.

Short Answer

Classify RAG and agent errors by pipeline stage and failure type.

Major categories include:

indexing and data errors
retrieval errors
generation and grounding errors
citation errors
tool-use errors
workflow and state errors
permission and policy errors
context and memory errors
multi-agent propagation errors
operational and infrastructure errors

The goal is not to invent endless labels. The goal is to map symptoms to the layer that needs repair.

Why Error Taxonomy Matters

RAG and agent systems fail in ways that look similar at the surface.

A confident wrong answer may come from missing documents, low-relevance retrieval, stale sources, unsupported generation, bad tool results, or an upstream agent that passed corrupt context.

If teams label every such symptom as hallucination, they may change the prompt when the real problem is chunking, freshness, permissions, or tool selection.

How to Use a Taxonomy

Use the taxonomy during incident review, human evaluation, regression analysis, and production monitoring.

For each failure, record:

user-visible symptom
pipeline stage where the error started
error category
severity
whether the failure was recoverable
whether the system should have refused or escalated
trace evidence

This turns scattered complaints into actionable failure patterns.
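
The fields above can be captured in a small record type. The following is a minimal sketch, not a prescribed schema: the names FailureRecord and Stage, the string values for category and severity, and the example trace note are all illustrative assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Stage(Enum):
    """Pipeline stages, mirroring the major categories listed earlier."""
    INDEXING = "indexing"
    RETRIEVAL = "retrieval"
    GENERATION = "generation"
    CITATION = "citation"
    TOOL_USE = "tool_use"
    WORKFLOW = "workflow"
    PERMISSION = "permission"
    CONTEXT = "context"
    MULTI_AGENT = "multi_agent"
    OPERATIONAL = "operational"


@dataclass
class FailureRecord:
    """One labeled failure (hypothetical schema for illustration)."""
    symptom: str                       # user-visible symptom
    origin_stage: Stage                # stage where the error started
    category: str                      # taxonomy label, e.g. "stale_source"
    severity: str                      # e.g. "critical", "high", "medium", "low"
    recoverable: bool
    should_have_refused_or_escalated: bool
    trace_evidence: List[str] = field(default_factory=list)


# Example values are invented for illustration only.
record = FailureRecord(
    symptom="confident wrong answer",
    origin_stage=Stage.INDEXING,
    category="stale_source",
    severity="high",
    recoverable=True,
    should_have_refused_or_escalated=False,
    trace_evidence=["retrieved chunk came from a superseded document"],
)
```

Keeping origin_stage separate from symptom lets the same symptom be counted under different stages once traces show where it actually started.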

Indexing and Data Errors

These failures happen before query time.

Missing source: the needed document was never ingested.
Stale source: outdated content remains in the index.
Chunking failure: important facts are split, truncated, or mixed with unrelated text.
Metadata failure: filters, dates, tenants, or permissions are missing or wrong.
Embedding mismatch: documents and queries are encoded inconsistently.
Duplicate or near-duplicate crowding: redundant chunks fill the top results and crowd out better evidence.
Corrupt or incomplete ingestion: partial documents enter the index.

Indexing errors often look like retrieval or generation failures later.
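
Some of these labels can be checked at ingestion time, before they surface as bad answers. The sketch below assumes each chunk is a dict with a "text" field and a "metadata" dict holding "tenant" and "updated_on"; those field names, the audit_chunk helper, and the 365-day default are assumptions for illustration, not a standard.

```python
from datetime import date, timedelta
from typing import List

# Hypothetical metadata fields that every chunk is expected to carry.
REQUIRED_METADATA = ("tenant", "updated_on")


def audit_chunk(chunk: dict, today: date, max_age_days: int = 365) -> List[str]:
    """Return indexing-stage taxonomy labels that apply to one chunk."""
    labels = []
    meta = chunk.get("metadata", {})
    if any(key not in meta for key in REQUIRED_METADATA):
        labels.append("metadata_failure")
    updated_on = meta.get("updated_on")
    if updated_on is not None and today - updated_on > timedelta(days=max_age_days):
        labels.append("stale_source")
    if not chunk.get("text", "").strip():
        # Empty text is only the crudest signal of a partial ingestion.
        labels.append("corrupt_or_incomplete_ingestion")
    return labels
```

A check like this only covers what is visible in the chunk itself. Missing sources and chunking failures usually need a comparison against the source corpus.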

Retrieval Errors

Retrieval errors occur when the system fails to find or rank the right evidence.

Low-relevance top-k: the system returns weak matches because no candidate clears a meaningful relevance threshold.
Retrieval drift: results are semantically similar but not sufficient to answer the question.
Recall failure: relevant evidence exists but is not retrieved.
Precision failure: too much noise enters the context window.
Ranking failure: the best evidence is retrieved but ranked too low.
Filter failure: metadata filters exclude valid results or include invalid ones.
Freshness failure: stale documents outrank current ones.
Empty retrieval mishandling: retrieval returns nothing, and the system invents an answer instead of refusing.

Retrieval quality is often the highest-leverage place to look when answers become unsupported.
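
When a labeled set of relevant document IDs exists for a query, several of these labels can be told apart mechanically. The sketch below is one possible decision order, assuming ranked_ids is the full ranked candidate list and relevant_ids comes from a golden set; the function name, the k and min_precision defaults, and the label strings are illustrative.

```python
from typing import List, Set


def classify_retrieval(
    ranked_ids: List[str],
    relevant_ids: Set[str],
    k: int = 5,
    min_precision: float = 0.2,
) -> str:
    """Assign one coarse retrieval label to a single query's results."""
    if not ranked_ids:
        # Downstream code should refuse here rather than answer.
        return "empty_retrieval"
    hit_positions = [i for i, doc_id in enumerate(ranked_ids) if doc_id in relevant_ids]
    if not hit_positions:
        return "recall_failure"        # evidence exists but was not retrieved
    if hit_positions[0] >= k:
        return "ranking_failure"       # retrieved, but below the top-k cutoff
    top_k = ranked_ids[:k]
    precision = sum(doc_id in relevant_ids for doc_id in top_k) / len(top_k)
    return "precision_failure" if precision < min_precision else "ok"
```

Filter and freshness failures are not covered by this sketch. They need the metadata of the returned documents, not only their IDs.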

Generation and Grounding Errors

These failures happen after context is available.

Hallucination: the answer invents facts not present in sources or context.
Faithfulness failure: the answer contradicts or overstates retrieved evidence.
Answer irrelevance: the response is fluent but does not address the user request.
Incomplete answer: key required facts are missing.
Unwarranted refusal: the system says it cannot answer even though sufficient evidence is present.
Context neglect: the model fails to use relevant retrieved evidence.
Context overuse: the model summarizes all retrieved text instead of answering the question.
Format failure: the answer is correct in substance but fails required structure.

Separate generation failures from retrieval failures. A model can fail even with good context, and good generation cannot fully compensate for bad context.
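
One way to make this separation concrete is a coarse triage over three reviewer judgments. The sketch below assumes a reviewer or automated judge has already decided whether the needed evidence was in the context, whether the answer is supported by the context, and whether the answer is correct; the function name and return labels are hypothetical.

```python
def triage_stage(
    evidence_in_context: bool,
    answer_supported_by_context: bool,
    answer_correct: bool,
) -> str:
    """Point a failed answer at the layer to inspect first (coarse heuristic)."""
    if answer_correct and answer_supported_by_context:
        return "ok"
    if not evidence_in_context:
        # The model never saw the needed evidence: look upstream.
        return "retrieval_or_indexing"
    if not answer_supported_by_context:
        # Evidence was available but the answer does not follow from it.
        return "generation"
    # Supported by the context yet wrong: the context itself carried
    # stale or conflicting content.
    return "context_quality"
```

The output is a starting point for trace review, not a verdict. A single failure can involve more than one layer.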

Citation Errors

Citation errors are related to grounding but deserve their own labels.

Missing citation: an important claim has no source.
Decorative citation: a citation is present but does not support the claim.
Wrong citation: the cited source is real but attached to the wrong claim.
Fabricated citation: the system invents a source, ID, or URL.
Stale citation: the cited source is outdated for the claim.
Low-granularity citation: the citation points to a broad page instead of the supporting passage.

Citation presence is not the same as citation quality.
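
The difference between presence and quality can be illustrated with a small labeling function. This is a sketch under strong assumptions: token overlap is used only as a crude stand-in for a real support check (such as an entailment model or human review), and the function name, the sources mapping, and the min_overlap threshold are invented for the example.

```python
from typing import Dict, Optional


def label_citation(
    claim: str,
    cited_id: Optional[str],
    sources: Dict[str, str],
    min_overlap: float = 0.5,
) -> str:
    """Label one claim's citation using a naive token-overlap support proxy."""
    if cited_id is None:
        return "missing_citation"
    if cited_id not in sources:
        # The cited ID does not exist in the known source set.
        return "fabricated_citation"
    claim_tokens = set(claim.lower().split())
    source_tokens = set(sources[cited_id].lower().split())
    overlap = len(claim_tokens & source_tokens) / max(len(claim_tokens), 1)
    return "supported" if overlap >= min_overlap else "decorative_citation"
```

Wrong, stale, and low-granularity citations need more context than this check has, such as the other retrieved sources, source dates, and passage offsets.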

Tool-Use Errors

Agent systems fail when tools are chosen or used incorrectly.

Missing tool call: the agent answers without using a required tool.
Unnecessary tool call: the agent acts when it should answer or clarify.
Wrong tool selection: the agent chooses an inappropriate capability.
Argument error: required fields, IDs, filters, or scopes are wrong.
Result misinterpretation: the agent ignores errors, empty results, or partial data.
Unsafe retry: a write action is repeated and creates duplicate side effects.
Permission violation: the agent attempts a restricted action or data access.
Approval skip: a high-risk action runs without required human approval.

Tool-use errors can change real systems, so they often have higher severity than answer-quality errors.
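
Unsafe retries are commonly guarded with an idempotency key, so that a repeated write returns the earlier result instead of repeating the side effect. The following in-memory sketch shows the idea; IdempotentExecutor, the key format, and issue_refund are hypothetical, and a production system would persist keys in durable storage.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs each write action at most once per idempotency key (in-memory sketch)."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        if key in self._results:
            # A retry with a known key returns the stored result
            # instead of repeating the side effect.
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


refunds_issued = []  # stands in for a real side effect


def issue_refund() -> str:
    refunds_issued.append("order-42")
    return "refunded"


executor = IdempotentExecutor()
# Three attempts with the same key trigger the side effect only once.
outcomes = [executor.run("refund:order-42", issue_refund) for _ in range(3)]
```

In this sketch a failed action stores nothing, so a later retry runs it again. Whether that is safe depends on whether the failed call had partial side effects.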

Workflow and State Errors

These failures affect multi-step agent behavior.

Planning failure: the agent chooses the wrong sequence of steps.
State loss: important intermediate results disappear between steps.
Invalid state transition: the workflow skips required checks or approvals.
Checkpoint failure: the system cannot resume safely after interruption.
Loop failure: the agent repeats steps without progress.
Partial completion: some side effects succeed while the overall task fails.
Rollback failure: the system cannot restore a safe state after a bad action.
Handoff failure: control is transferred to a human or another agent incorrectly.

Workflow reliability depends on state, retries, approvals, and recovery, not only final output text.
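
Invalid state transitions in particular can be detected by replaying a recorded trajectory against an explicit transition table. The sketch below assumes a simple approval workflow; the state names and the ALLOWED_TRANSITIONS table are invented for illustration and would differ per system.

```python
from typing import Dict, List, Set

# Hypothetical workflow: every execution must pass through approval.
ALLOWED_TRANSITIONS: Dict[str, Set[str]] = {
    "planned": {"awaiting_approval"},
    "awaiting_approval": {"approved", "rejected"},
    "approved": {"executed"},
    "executed": {"verified", "rolled_back"},
}


def validate_trajectory(states: List[str]) -> List[str]:
    """Return one error label per transition that the table does not allow."""
    errors = []
    for previous, current in zip(states, states[1:]):
        if current not in ALLOWED_TRANSITIONS.get(previous, set()):
            errors.append(f"invalid_state_transition:{previous}->{current}")
    return errors
```

For example, a trajectory that goes from "planned" straight to "approved" is flagged, which is the approval-skip pattern described above seen from the workflow side.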

Context and Memory Errors

Agent context can degrade even when individual tools work.

Context poisoning: incorrect information enters memory and is reused.
Context distraction: too much history or tool output overwhelms reasoning.
Context confusion: irrelevant tools or documents crowd the prompt.
Context clash: contradictory information leaves the agent stuck or inconsistent.
Memory staleness: old summaries or facts override newer evidence.
Over-compression: important details are lost when history is summarized.

These failures are especially common in long-running agents.
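
Memory staleness and context clash can both be surfaced if memory entries carry a key and an observation date. The sketch below is one possible approach, assuming such a structure exists; MemoryEntry, resolve_memory, and find_clashes are hypothetical names, and "newest wins" is a policy choice rather than a general rule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class MemoryEntry:
    key: str            # what the fact is about, e.g. "refund_window"
    value: str
    observed_on: date   # when the fact was observed or summarized


def resolve_memory(entries: Iterable[MemoryEntry]) -> Dict[str, str]:
    """Keep only the most recently observed value for each key."""
    latest: Dict[str, MemoryEntry] = {}
    for entry in entries:
        current = latest.get(entry.key)
        if current is None or entry.observed_on > current.observed_on:
            latest[entry.key] = entry
    return {key: entry.value for key, entry in latest.items()}


def find_clashes(entries: Iterable[MemoryEntry]) -> Set[str]:
    """Return keys that have more than one distinct value in memory."""
    values: Dict[str, Set[str]] = {}
    for entry in entries:
        values.setdefault(entry.key, set()).add(entry.value)
    return {key for key, seen in values.items() if len(seen) > 1}
```

Logging the clashing keys, even when the newest value is used, gives reviewers evidence for labeling context clash instead of guessing from the final answer.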

Permission and Policy Errors

These failures involve safety boundaries.

Unauthorized data access: the agent reads the wrong tenant, user, or document set.
Policy violation: the output or action breaks business, legal, or safety rules.
Overblocking: guardrails block valid requests.
Underblocking: guardrails miss unsafe requests.
PII leakage: sensitive data is exposed in outputs, logs, or tool calls.
Scope escalation: the agent uses broader permissions than needed.

Policy errors should be tracked separately from ordinary quality failures.
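
Unauthorized data access is often easiest to catch at the boundary between retrieval and generation. The sketch below assumes each retrieved chunk is a dict with a "tenant" field; the field name and the split_by_tenant helper are assumptions, and the check fails closed by treating chunks without a tenant as blocked.

```python
from typing import List, Tuple


def split_by_tenant(chunks: List[dict], user_tenant: str) -> Tuple[List[dict], List[dict]]:
    """Separate chunks the user may see from chunks that must not reach the prompt."""
    allowed, blocked = [], []
    for chunk in chunks:
        # A missing tenant never equals user_tenant, so unlabeled chunks are blocked.
        if chunk.get("tenant") == user_tenant:
            allowed.append(chunk)
        else:
            blocked.append(chunk)
    return allowed, blocked
```

A non-empty blocked list is worth recording as a policy event in its own right, since it means an upstream filter let cross-tenant data through.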

Multi-Agent Propagation Errors

In multi-agent systems, one failure can cascade.

Upstream retrieval contamination: a research agent passes weak or stale evidence.
Summary distortion: a synthesis agent compresses flawed context into confident text.
Assumption lock-in: a later agent treats an earlier error as established fact.
Invisible origin: the final response looks wrong, but the root cause is several steps earlier.
Missing validation gates: agents pass context forward without relevance or freshness checks.

Diagnosing these failures requires full trajectory traces, not only final-answer review.
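
With a full trajectory trace, finding the origin can start as a walk from the first step forward. The sketch below assumes each handoff records a relevance score and the age of its evidence; HandoffStep, the thresholds, and the reason strings are illustrative assumptions, not fields any particular tracing tool provides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class HandoffStep:
    agent: str               # agent that produced this handoff
    relevance: float         # relevance score of the evidence passed on
    evidence_age_days: int   # age of the newest supporting source


def first_failing_step(
    trace: List[HandoffStep],
    min_relevance: float = 0.5,
    max_age_days: int = 180,
) -> Optional[Tuple[str, str]]:
    """Return (agent, reason) for the earliest step that fails a gate, else None."""
    for step in trace:  # trace is assumed to be in execution order
        if step.relevance < min_relevance:
            return step.agent, "low_relevance"
        if step.evidence_age_days > max_age_days:
            return step.agent, "stale_evidence"
    return None
```

The same two checks could run as validation gates at handoff time, so that weak or stale evidence is stopped before a later agent treats it as established fact.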

Operational and Infrastructure Errors

Some failures are operational rather than semantic.

timeouts
rate limits
authentication failures
dependency outages
queue backlog
index rebuild lag
cost or latency spikes
trace or logging gaps

Operational errors still need taxonomy labels because they affect reliability and user trust.

Symptom-to-Cause Mapping

Use symptoms as entry points, then map to likely causes.

Confident wrong answer: check retrieval relevance, stale sources, faithfulness, and tool results.
Vague answer: check recall, chunking, and answer completeness.
Wrong citation: check citation placement, source support, and retrieval set.
Repeated user rephrasing: check answer relevance and intent matching.
Duplicate side effects: check retries, idempotency, and workflow state.
Unexpected refusal: check filters, guardrails, and overblocking.
Slow or incomplete workflow: check tool failures, loops, approvals, and checkpoints.

Start with the earliest stage that can explain the symptom.
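
The mapping and the earliest-stage rule can be encoded as a lookup that returns checks in pipeline order. The sketch below covers three of the symptoms above; the assignment of each check to a stage (for example, stale sources to indexing) and the names STAGE_ORDER, SYMPTOM_CHECKS, and ordered_checks are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Pipeline order used to sort checks; earlier stages come first.
STAGE_ORDER = ["indexing", "retrieval", "generation", "citation", "tool_use", "workflow"]

# (stage, check) pairs per symptom; the stage assignments are assumptions.
SYMPTOM_CHECKS: Dict[str, List[Tuple[str, str]]] = {
    "confident_wrong_answer": [
        ("retrieval", "retrieval relevance"),
        ("indexing", "stale sources"),
        ("generation", "faithfulness"),
        ("tool_use", "tool results"),
    ],
    "vague_answer": [
        ("retrieval", "recall"),
        ("indexing", "chunking"),
        ("generation", "answer completeness"),
    ],
    "duplicate_side_effects": [
        ("tool_use", "retries"),
        ("tool_use", "idempotency"),
        ("workflow", "workflow state"),
    ],
}


def ordered_checks(symptom: str) -> List[Tuple[str, str]]:
    """Return the checks for a symptom, earliest pipeline stage first."""
    checks = SYMPTOM_CHECKS.get(symptom, [])
    return sorted(checks, key=lambda check: STAGE_ORDER.index(check[0]))
```

Because Python's sort is stable, checks within the same stage keep the order in which they were listed.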

Severity Levels

Not all errors have equal impact.

Critical: unsafe action, data leak, unauthorized access, irreversible side effect.
High: unsupported factual claim in a high-stakes domain, wrong policy answer, failed rollback.
Medium: incomplete answer, weak citation, recoverable tool failure.
Low: style issues, minor formatting problems, non-blocking inefficiency.

Severity should drive response time, release gates, and monitoring alerts.
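
An ordered severity type makes release gates a one-line comparison. The sketch below is illustrative: the Severity enum, the example category-to-severity mapping, and the default gate threshold are assumptions, and real thresholds are a team decision.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Example mapping drawn from the levels above; extend per taxonomy.
SEVERITY_BY_CATEGORY = {
    "data_leak": Severity.CRITICAL,
    "failed_rollback": Severity.HIGH,
    "weak_citation": Severity.MEDIUM,
    "format_failure": Severity.LOW,
}


def release_blocked(open_failures: Iterable[Severity], threshold: Severity = Severity.HIGH) -> bool:
    """Block a release if any open failure is at or above the threshold."""
    return any(severity >= threshold for severity in open_failures)
```

The same ordering can drive alert routing, for example paging on CRITICAL and batching LOW into a weekly review.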

Using Taxonomy in Evaluation

Error taxonomy improves evaluation design.

Use it to:

label golden dataset failures
build regression cases for each major category
separate retrieval metrics from generation metrics
score tool traces and workflow traces
track production incidents by category
decide whether a fix belongs in data, retrieval, prompts, tools, or orchestration

Category counts over time show whether quality work is reducing the right failures.
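
Counting labels per period needs very little machinery. The sketch below assumes each labeled failure is a (period, category) pair, with the period expressed as a string such as an ISO week; the function name and the sample data are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def counts_by_period(labeled: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Group taxonomy labels by period and count each category."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for period, category in labeled:
        counts[period][category] += 1
    return dict(counts)


# Invented sample labels: (period, category).
labeled_failures = [
    ("2024-W01", "recall_failure"),
    ("2024-W01", "recall_failure"),
    ("2024-W01", "faithfulness_failure"),
    ("2024-W02", "recall_failure"),
]
weekly = counts_by_period(labeled_failures)
```

Raw counts are sensitive to traffic and review volume, so dividing by the number of reviewed samples per period gives a fairer comparison.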

Using Taxonomy in Production

In production, attach taxonomy labels to sampled reviews, automated judges, support tickets, and incident reports.

Track rates by category, workflow, topic, model version, prompt version, and data version.

When one category rises, investigate that layer first instead of restarting from a generic quality complaint.
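
A simple way to notice a rising category is to compare current rates against a baseline window. The sketch below assumes rates are already computed per category as fractions of reviewed traffic; the function name, the 1.5x ratio, and the rule that any newly appearing category counts as rising are assumptions to tune.

```python
from typing import Dict, List


def rising_categories(
    baseline: Dict[str, float],
    current: Dict[str, float],
    min_ratio: float = 1.5,
) -> List[str]:
    """Return categories whose rate grew by at least min_ratio over baseline."""
    rising = []
    for category, rate in current.items():
        base = baseline.get(category, 0.0)
        if base == 0.0:
            # A category absent from the baseline is flagged as soon as it appears.
            if rate > 0.0:
                rising.append(category)
        elif rate / base >= min_ratio:
            rising.append(category)
    return sorted(rising)
```

Running the comparison separately per model version, prompt version, and data version helps tie a rise to the change that introduced it.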

Common Mistakes

Labeling every bad answer as hallucination.
Changing the model when the index or retrieval layer is broken.
Ignoring stale data as a distinct failure mode.
Reviewing only final answers and not traces.
Mixing safety violations with ordinary quality issues.
Failing to record the originating stage of multi-agent failures.
Creating too many labels that reviewers cannot apply consistently.

Implementation Checklist

Define a small set of stage-based error categories.
Add severity and recoverability fields.
Require traces for agent and RAG incident review.
Map user-visible symptoms to likely root stages.
Label golden set failures with taxonomy codes.
Create regression tests for each major category.
Track category rates in production monitoring.
Review category trends after prompt, model, or index changes.
Keep labels stable enough for historical comparison.
Train reviewers with examples for each category.

Summary

An error taxonomy for RAG and AI agents turns vague quality problems into stage-specific failure classes. Indexing, retrieval, generation, citations, tools, state, permissions, context, multi-agent handoffs, and infrastructure each produce different errors and need different fixes.

Teams that classify failures consistently debug faster, design better evaluations, and improve the right part of the system instead of treating every bad output as a model problem.