An error taxonomy for RAG and AI agents is a shared vocabulary for classifying failures by where they occur and how they appear. Without a taxonomy, teams often treat every bad answer as a model problem. In practice, failures can start in indexing, retrieval, generation, tool use, state management, permissions, or multi-agent handoffs.
A good taxonomy helps teams debug faster, design better evaluations, prioritize fixes, and avoid applying the wrong remedy to the wrong failure.
Short Answer
Classify RAG and agent errors by pipeline stage and failure type.
Major categories include:
- indexing and data errors
- retrieval errors
- generation and grounding errors
- citation errors
- tool-use errors
- workflow and state errors
- permission and policy errors
- context and memory errors
- multi-agent propagation errors
- operational and infrastructure errors
The goal is not to invent endless labels. The goal is to map symptoms to the layer that needs repair.
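One way to keep the label set small and fixed is to encode it as an enumeration. The sketch below is illustrative only: the string codes and the `parse_category` helper are assumptions, not an established standard.

```python
from enum import Enum


class ErrorCategory(Enum):
    """Stage-based error categories. The string codes are illustrative."""
    INDEXING = "indexing_data"
    RETRIEVAL = "retrieval"
    GENERATION = "generation_grounding"
    CITATION = "citation"
    TOOL_USE = "tool_use"
    WORKFLOW = "workflow_state"
    PERMISSION = "permission_policy"
    CONTEXT = "context_memory"
    MULTI_AGENT = "multi_agent_propagation"
    OPERATIONAL = "operational_infrastructure"


def parse_category(code: str) -> ErrorCategory:
    """Reject unknown codes so reviewers cannot add ad hoc labels."""
    try:
        return ErrorCategory(code)
    except ValueError:
        raise ValueError(f"unknown error category: {code!r}") from None
```

A closed set like this makes it harder for the label list to grow without a deliberate decision.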
Why Error Taxonomy Matters
RAG and agent systems fail in ways that look similar on the surface.
A confident wrong answer may come from missing documents, low-relevance retrieval, stale sources, unsupported generation, bad tool results, or an upstream agent that passed corrupt context.
If teams only label the symptom as hallucination, they may change the prompt when the real problem is chunking, freshness, permissions, or tool selection.
How to Use a Taxonomy
Use the taxonomy during incident review, human evaluation, regression analysis, and production monitoring.
For each failure, record:
- user-visible symptom
- pipeline stage where the error started
- error category
- severity
- whether the failure was recoverable
- whether the system should have refused or escalated
- trace evidence
This turns scattered complaints into actionable failure patterns.
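The fields above could be captured in a small record type. This is a minimal sketch: the field names, the example values, and the `is_actionable` rule are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FailureRecord:
    """One reviewed failure, using the fields listed above (names are illustrative)."""
    symptom: str                      # user-visible symptom
    origin_stage: str                 # pipeline stage where the error started
    category: str                     # taxonomy label
    severity: str                     # e.g. "critical", "high", "medium", "low"
    recoverable: bool                 # could the system have recovered?
    should_have_refused_or_escalated: bool
    trace_evidence: list[str] = field(default_factory=list)  # trace or log references

    def is_actionable(self) -> bool:
        # Assumed rule: a record needs a stage, a category, and some evidence.
        return bool(self.origin_stage and self.category and self.trace_evidence)


record = FailureRecord(
    symptom="confident wrong answer",
    origin_stage="indexing",
    category="stale_source",
    severity="high",
    recoverable=True,
    should_have_refused_or_escalated=False,
    trace_evidence=["trace-123"],  # hypothetical trace ID
)
```

Records without trace evidence can still be stored, but flagging them helps reviewers see which complaints cannot yet be tied to a stage.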
Indexing and Data Errors
These failures happen before query time.
- Missing source: the needed document was never ingested.
- Stale source: outdated content remains in the index.
- Chunking failure: important facts are split, truncated, or mixed with unrelated text.
- Metadata failure: filters, dates, tenants, or permissions are missing or wrong.
- Embedding mismatch: documents and queries are encoded inconsistently.
- Duplicate or near-duplicate crowding: redundant chunks fill the top results and crowd out better evidence.
- Corrupt or incomplete ingestion: partial documents enter the index.
Indexing errors often look like retrieval or generation failures later.
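Several of these problems can be caught with an audit at ingestion time, before they show up as retrieval or generation failures. The sketch below assumes each chunk is a dictionary with hypothetical keys (`text`, `tenant_id`, `updated_at`, `embedding_model`). The staleness window is also an assumption.

```python
from datetime import datetime, timedelta, timezone


def audit_chunk(chunk: dict, query_embedding_model: str,
                now: datetime, max_age: timedelta) -> list[str]:
    """Return indexing-error labels for one chunk (empty list means no issue found)."""
    labels = []
    if not chunk.get("text", "").strip():
        labels.append("corrupt_or_incomplete_ingestion")
    updated_at = chunk.get("updated_at")
    if chunk.get("tenant_id") is None or updated_at is None:
        labels.append("metadata_failure")
    if updated_at is not None and now - updated_at > max_age:
        labels.append("stale_source")
    # Documents and queries must be embedded with the same model.
    if chunk.get("embedding_model") != query_embedding_model:
        labels.append("embedding_mismatch")
    return labels


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
old_chunk = {
    "text": "Refund policy v1",
    "tenant_id": "acme",
    "updated_at": datetime(2023, 1, 1, tzinfo=timezone.utc),
    "embedding_model": "embed-v1",  # hypothetical model name
}
labels = audit_chunk(old_chunk, "embed-v2", now, max_age=timedelta(days=365))
```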
Retrieval Errors
Retrieval errors occur when the system fails to find or rank the right evidence.
- Low-relevance top-k: no candidate is a strong match, but the system still returns the top-k and treats them as evidence.
- Retrieval drift: results are semantically similar but not sufficient to answer the question.
- Recall failure: relevant evidence exists but is not retrieved.
- Precision failure: too much noise enters the context window.
- Ranking failure: the best evidence is retrieved but ranked too low.
- Filter failure: metadata filters exclude valid results or include invalid ones.
- Freshness failure: stale documents outrank current ones.
- Empty retrieval mishandling: the system invents an answer instead of refusing.
Retrieval quality is often the highest-leverage place to look when answers become unsupported.
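On a labeled evaluation set, some of these labels can be assigned mechanically. The following sketch assumes that known-relevant document IDs are available per query and that scores are similarities in the range 0 to 1. The function name and the `top_k` and `min_score` defaults are illustrative.

```python
def classify_retrieval(ranked_ids, scores, relevant_ids, top_k=5, min_score=0.5):
    """Return a retrieval-error label for one query, or None if the top-k looks fine.

    ranked_ids: all candidate IDs in ranked order; scores: parallel similarity scores.
    relevant_ids: IDs known to answer the query (empty if the corpus has no answer).
    """
    if not ranked_ids:
        return "empty_retrieval"
    relevant = set(relevant_ids)
    if relevant & set(ranked_ids[:top_k]):
        return None  # relevant evidence reached the context window
    if relevant & set(ranked_ids):
        return "ranking_failure"  # retrieved, but ranked below the cutoff
    if relevant:
        return "recall_failure"  # evidence exists but was never retrieved
    # No relevant evidence exists: anything passed forward is weak or off-target.
    if max(scores[:top_k]) < min_score:
        return "low_relevance_top_k"
    return "retrieval_drift"
```

Precision, filter, and freshness failures need extra signals (noise ratio, filter logs, document dates) and are left out of this sketch.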
Generation and Grounding Errors
These failures happen after context is available.
- Hallucination: the answer invents facts not present in sources or context.
- Faithfulness failure: the answer contradicts or overstates retrieved evidence.
- Answer irrelevance: the response is fluent but does not address the user request.
- Incomplete answer: key required facts are missing.
- Unwarranted refusal: the system says it cannot answer even though sufficient evidence is present in the context.
- Context neglect: the model fails to use relevant retrieved evidence.
- Context overuse: the model summarizes all retrieved text instead of answering the question.
- Format failure: the answer is correct in substance but fails required structure.
Separate generation failures from retrieval failures. A model can fail even with good context, and good generation cannot fully compensate for bad context.
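The separation can be made explicit in review tooling. The sketch below assumes two judgments are available for each example, from human labels or an automated judge: whether the needed evidence was in the context, and whether the answer is supported by that context. The function and label names are illustrative.

```python
def triage_stage(gold_evidence_in_context: bool, answer_supported_by_context: bool) -> str:
    """Point a failed example at the layer to inspect first."""
    if not gold_evidence_in_context:
        # The model never saw the needed evidence: look upstream first.
        return "retrieval_or_indexing"
    if not answer_supported_by_context:
        # Evidence was available but the answer departs from it.
        return "generation_or_grounding"
    return "no_failure_detected"
```

An answer that happens to be right without evidence in context is still routed upstream here, because it was not grounded in retrieved sources.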
Citation Errors
Citation errors are related to grounding but deserve their own labels.
- Missing citation: an important claim has no source.
- Decorative citation: a citation is present but does not support the claim.
- Wrong citation: the cited source is real but attached to the wrong claim.
- Fabricated citation: the system invents a source, ID, or URL.
- Stale citation: the cited source is outdated for the claim.
- Low-granularity citation: the citation points to a broad page instead of the supporting passage.
Citation presence is not the same as citation quality.
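A rough automated check can separate a few of these labels. The sketch below uses lexical overlap as a crude stand-in for "the source supports the claim". That proxy is an assumption, and a real pipeline would more likely rely on an entailment model or human review. The function name and threshold are illustrative.

```python
def classify_citation(claim: str, cited_id, sources: dict, min_overlap: float = 0.5) -> str:
    """Label one (claim, citation) pair. sources maps source ID to source text."""
    if cited_id is None:
        return "missing_citation"
    if cited_id not in sources:
        return "fabricated_citation"  # the cited ID does not exist in the retrieved set
    # Crude proxy: share of the claim's longer words that appear in the source.
    claim_terms = {w.lower().strip(".,") for w in claim.split() if len(w) > 3}
    source_terms = {w.lower().strip(".,") for w in sources[cited_id].split()}
    if not claim_terms:
        return "supported"
    overlap = len(claim_terms & source_terms) / len(claim_terms)
    return "supported" if overlap >= min_overlap else "decorative_citation"


sources = {
    "doc1": "Refunds are processed within 14 days of the request.",
    "doc2": "Shipping takes three business days.",
}
claim = "Refunds are processed within 14 days."
```

Wrong, stale, and low-granularity citations need claim-level alignment or document dates, which this sketch does not attempt.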
Tool-Use Errors
Agent systems fail when tools are chosen or used incorrectly.
- Missing tool call: the agent answers without using a required tool.
- Unnecessary tool call: the agent acts when it should answer or clarify.
- Wrong tool selection: the agent chooses an inappropriate capability.
- Argument error: required fields, IDs, filters, or scopes are wrong.
- Result misinterpretation: the agent ignores errors, empty results, or partial data.
- Unsafe retry: a write action is repeated and creates duplicate side effects.
- Permission violation: the agent attempts a restricted action or data access.
- Approval skip: a high-risk action runs without required human approval.
Tool-use errors can change real systems, so they often have higher severity than answer-quality errors.
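Unsafe retries are commonly mitigated with idempotency keys, so that repeating a write returns the earlier result instead of running the action again. The sketch below is a minimal in-memory version. The class name is hypothetical, and a production system would need a durable, shared store rather than a dictionary.

```python
class IdempotentExecutor:
    """Run each write action at most once per idempotency key (in-memory sketch)."""

    def __init__(self):
        self._results = {}

    def run(self, idempotency_key: str, action, *args):
        if idempotency_key in self._results:
            # Retry of a completed action: return the stored result, no new side effect.
            return self._results[idempotency_key]
        result = action(*args)
        self._results[idempotency_key] = result
        return result


sent = []


def send_refund(order_id):  # hypothetical write tool
    sent.append(order_id)
    return f"refund:{order_id}"


executor = IdempotentExecutor()
first = executor.run("refund-42", send_refund, "42")
second = executor.run("refund-42", send_refund, "42")  # agent retries the same call
```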
Workflow and State Errors
These failures affect multi-step agent behavior.
- Planning failure: the agent chooses the wrong sequence of steps.
- State loss: important intermediate results disappear between steps.
- Invalid state transition: the workflow skips required checks or approvals.
- Checkpoint failure: the system cannot resume safely after interruption.
- Loop failure: the agent repeats steps without progress.
- Partial completion: some side effects succeed while the overall task fails.
- Rollback failure: the system cannot restore a safe state after a bad action.
- Handoff failure: control is transferred to a human or another agent incorrectly.
Workflow reliability depends on state, retries, approvals, and recovery, not only final output text.
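Two of these labels, invalid state transition and loop failure, can be detected from a recorded trace. The sketch below assumes a hypothetical workflow with the states shown. The transition table and the repeat threshold are illustrative, not a recommended design.

```python
from collections import Counter

# Hypothetical workflow: each state maps to the states allowed to follow it.
ALLOWED = {
    "planned": {"checked"},
    "checked": {"approved", "rejected"},
    "approved": {"executed"},
    "executed": {"completed", "rolled_back"},
}


def validate_transitions(states):
    """Return one error string per transition that skips a required step."""
    errors = []
    for prev, nxt in zip(states, states[1:]):
        if nxt not in ALLOWED.get(prev, set()):
            errors.append(f"invalid_state_transition: {prev} -> {nxt}")
    return errors


def detect_loop(step_signatures, max_repeats=3):
    """Flag (tool, arguments) signatures repeated more than max_repeats times."""
    counts = Counter(step_signatures)
    return [sig for sig, n in counts.items() if n > max_repeats]
```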
Context and Memory Errors
Agent context can degrade even when individual tools work.
- Context poisoning: incorrect information enters memory and is reused.
- Context distraction: too much history or tool output overwhelms reasoning.
- Context confusion: irrelevant tools or documents crowd the prompt.
- Context clash: contradictory information leaves the agent stuck or inconsistent.
- Memory staleness: old summaries or facts override newer evidence.
- Over-compression: important details are lost when history is summarized.
These failures are especially common in long-running agents.
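Memory staleness and context clash can be partly handled by timestamping facts and letting newer observations win while recording the conflict. The sketch below assumes each fact is a dictionary with hypothetical `key`, `value`, and `observed_at` fields. "Newest wins" is one possible policy, not the only reasonable one.

```python
def merge_facts(memory, evidence):
    """Merge remembered facts with fresh evidence; newer observations win.

    Returns (merged, clashes): merged maps key to the newest fact, and clashes
    lists keys where an older value was overridden by a different newer value.
    """
    merged, clashes = {}, []
    for fact in sorted(memory + evidence, key=lambda f: f["observed_at"]):
        key = fact["key"]
        if key in merged and merged[key]["value"] != fact["value"]:
            clashes.append(key)
        merged[key] = fact
    return merged, clashes


memory = [{"key": "plan", "value": "basic", "observed_at": 1}]
evidence = [
    {"key": "plan", "value": "pro", "observed_at": 5},
    {"key": "region", "value": "eu", "observed_at": 4},
]
merged, clashes = merge_facts(memory, evidence)
```

Recording clashes instead of silently overwriting gives reviewers a trace of where memory and evidence disagreed.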
Permission and Policy Errors
These failures involve safety boundaries.
- Unauthorized data access: the agent reads the wrong tenant, user, or document set.
- Policy violation: the output or action breaks business, legal, or safety rules.
- Overblocking: guardrails block valid requests.
- Underblocking: guardrails miss unsafe requests.
- PII leakage: sensitive data is exposed in outputs, logs, or tool calls.
- Scope escalation: the agent uses broader permissions than needed.
Policy errors should be tracked separately from ordinary quality failures.
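Unauthorized data access can be checked after retrieval and before generation by comparing chunk metadata with the requesting user. The sketch below assumes hypothetical `tenant_id` and `allowed_groups` fields on each chunk. It is an illustration of the check, not a complete access-control design.

```python
def find_unauthorized_chunks(chunks, user_tenant, user_groups):
    """Return (chunk_id, reason) pairs for chunks the user should not see."""
    violations = []
    for chunk in chunks:
        if chunk["tenant_id"] != user_tenant:
            violations.append((chunk["id"], "wrong_tenant"))
        elif chunk.get("allowed_groups") and not set(chunk["allowed_groups"]) & set(user_groups):
            violations.append((chunk["id"], "group_not_permitted"))
    return violations


chunks = [
    {"id": "c1", "tenant_id": "acme", "allowed_groups": ["support"]},
    {"id": "c2", "tenant_id": "globex"},
    {"id": "c3", "tenant_id": "acme", "allowed_groups": ["finance"]},
]
violations = find_unauthorized_chunks(chunks, "acme", ["support"])
```

Any violation found at this point is a policy incident in its own right, even if the final answer never quotes the chunk.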
Multi-Agent Propagation Errors
In multi-agent systems, one failure can cascade.
- Upstream retrieval contamination: a research agent passes weak or stale evidence.
- Summary distortion: a synthesis agent compresses flawed context into confident text.
- Assumption lock-in: a later agent treats an earlier error as established fact.
- Invisible origin: the final response looks wrong, but the root cause is several steps earlier.
- Missing validation gates: agents pass context forward without relevance or freshness checks.
Diagnosing these failures requires full trajectory traces, not only final-answer review.
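With a full trajectory trace, the originating step can be found by walking the steps in order and stopping at the first failed validation gate. The sketch below assumes each step records named boolean checks. The trace shape, agent names, and check names are hypothetical.

```python
def find_origin(trace):
    """Return the earliest step with a failed check, or None if every check passed."""
    for step in trace:
        failed = [name for name, passed in step["checks"].items() if not passed]
        if failed:
            return {"agent": step["agent"], "step": step["step"], "failed_checks": failed}
    return None


trace = [
    {"agent": "research", "step": 1, "checks": {"relevance": True, "freshness": False}},
    {"agent": "synthesis", "step": 2, "checks": {"faithfulness": True}},
    {"agent": "writer", "step": 3, "checks": {"format": False}},
]
origin = find_origin(trace)
```

Here the final output has a format problem, but the earlier freshness failure in the research step is reported as the origin.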
Operational and Infrastructure Errors
Some failures are operational rather than semantic.
- timeouts
- rate limits
- authentication failures
- dependency outages
- queue backlog
- index rebuild lag
- cost or latency spikes
- trace or logging gaps
Operational errors still need taxonomy labels because they affect reliability and user trust.
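Operational labels can often be assigned automatically from status codes and exception types. The mapping below is a sketch: it relies on standard HTTP status meanings and Python built-in exceptions, and the label names and function are illustrative.

```python
# Standard HTTP status codes mapped to illustrative operational labels.
STATUS_LABELS = {
    401: "authentication_failure",
    408: "timeout",
    429: "rate_limit",
    502: "dependency_outage",
    503: "dependency_outage",
    504: "timeout",
}


def label_operational_error(status_code=None, exc=None) -> str:
    """Map an HTTP status or a raised exception to an operational label."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "dependency_outage"
    return STATUS_LABELS.get(status_code, "unclassified_operational")
```

Queue backlog, index rebuild lag, and cost or latency spikes usually come from metrics, not from individual errors, so they are not covered here.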
Symptom-to-Cause Mapping
Use symptoms as entry points, then map to likely causes.
- Confident wrong answer: check retrieval relevance, stale sources, faithfulness, and tool results.
- Vague answer: check recall, chunking, and answer completeness.
- Wrong citation: check citation placement, source support, and retrieval set.
- Repeated user rephrasing: check answer relevance and intent matching.
- Duplicate side effects: check retries, idempotency, and workflow state.
- Unexpected refusal: check filters, guardrails, and overblocking.
- Slow or incomplete workflow: check tool failures, loops, approvals, and checkpoints.
Start with the earliest stage that can explain the symptom.
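The mapping above can be stored as a lookup table so that reviewers and triage tools start from the same checklist. The dictionary keys and the fallback message in this sketch are illustrative. The checks themselves are taken from the list above.

```python
SYMPTOM_CHECKS = {
    "confident_wrong_answer": ["retrieval relevance", "stale sources", "faithfulness", "tool results"],
    "vague_answer": ["recall", "chunking", "answer completeness"],
    "wrong_citation": ["citation placement", "source support", "retrieval set"],
    "repeated_user_rephrasing": ["answer relevance", "intent matching"],
    "duplicate_side_effects": ["retries", "idempotency", "workflow state"],
    "unexpected_refusal": ["filters", "guardrails", "overblocking"],
    "slow_or_incomplete_workflow": ["tool failures", "loops", "approvals", "checkpoints"],
}


def checks_for(symptom: str) -> list[str]:
    """Return the checks to run for a symptom, with a fallback for unmapped symptoms."""
    return SYMPTOM_CHECKS.get(symptom, ["collect the trace and label manually"])
```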
Severity Levels
Not all errors have equal impact.
- Critical: unsafe action, data leak, unauthorized access, irreversible side effect.
- High: unsupported factual claim in a high-stakes domain, wrong policy answer, failed rollback.
- Medium: incomplete answer, weak citation, recoverable tool failure.
- Low: style issues, minor formatting problems, non-blocking inefficiency.
Severity should drive response time, release gates, and monitoring alerts.
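Severity levels can be encoded as an ordered enumeration so that release gates and alerts compare them directly. In the sketch below, the response times and paging flags are hypothetical placeholders, not recommendations.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical policy values; each team would set its own.
RESPONSE_POLICY = {
    Severity.CRITICAL: {"respond_within_hours": 1, "page_on_call": True},
    Severity.HIGH: {"respond_within_hours": 24, "page_on_call": False},
    Severity.MEDIUM: {"respond_within_hours": 72, "page_on_call": False},
    Severity.LOW: {"respond_within_hours": None, "page_on_call": False},
}


def blocks_release(open_severities, gate=Severity.HIGH) -> bool:
    """A release is blocked if any open failure is at or above the gate level."""
    return any(s >= gate for s in open_severities)
```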
Using Taxonomy in Evaluation
Error taxonomy improves evaluation design.
Use it to:
- label golden dataset failures
- build regression cases for each major category
- separate retrieval metrics from generation metrics
- score tool traces and workflow traces
- track production incidents by category
- decide whether a fix belongs in data, retrieval, prompts, tools, or orchestration
Category counts over time show whether quality work is reducing the right failures.
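Counting labeled failures per evaluation run is enough to see these trends. The sketch below assumes failures arrive as `(period, category)` pairs. The period names and labels are made up for illustration.

```python
from collections import Counter


def category_counts(labeled_failures):
    """Group failure labels by period. Returns {period: Counter({category: count})}."""
    counts = {}
    for period, category in labeled_failures:
        counts.setdefault(period, Counter())[category] += 1
    return counts


failures = [
    ("run-1", "recall_failure"),
    ("run-1", "recall_failure"),
    ("run-1", "hallucination"),
    ("run-2", "recall_failure"),
    ("run-2", "hallucination"),
]
counts = category_counts(failures)
```

In this made-up data, recall failures drop from two to one between runs while hallucinations stay flat, which suggests the retrieval work helped but generation work did not.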
Using Taxonomy in Production
In production, attach taxonomy labels to sampled reviews, automated judges, support tickets, and incident reports.
Track rates by category, workflow, topic, model version, prompt version, and data version.
When one category rises, investigate that layer first instead of restarting from a generic quality complaint.
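A simple comparison of category rates between two versions can point to the layer to investigate first. In this sketch, rates are failures divided by reviewed samples. The function name and the `min_increase` threshold are assumptions.

```python
def rising_categories(baseline: dict, current: dict, min_increase: float = 0.02) -> list[str]:
    """Return categories whose failure rate rose by at least min_increase, largest rise first."""
    def increase(category):
        return current[category] - baseline.get(category, 0.0)

    risen = [cat for cat in current if increase(cat) >= min_increase]
    return sorted(risen, key=increase, reverse=True)


baseline = {"retrieval": 0.03, "citation": 0.02, "tool_use": 0.01}
current = {"retrieval": 0.09, "citation": 0.025, "tool_use": 0.05}
to_investigate = rising_categories(baseline, current)
```

The same comparison can be repeated per workflow, topic, prompt version, or data version by computing the rate dictionaries for each slice.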
Common Mistakes
- Labeling every bad answer as hallucination.
- Changing the model when the index or retrieval layer is broken.
- Ignoring stale data as a distinct failure mode.
- Reviewing only final answers and not traces.
- Mixing safety violations with ordinary quality issues.
- Failing to record the originating stage of multi-agent failures.
- Creating too many labels that reviewers cannot apply consistently.
Implementation Checklist
- Define a small set of stage-based error categories.
- Add severity and recoverability fields.
- Require traces for agent and RAG incident review.
- Map user-visible symptoms to likely root stages.
- Label golden set failures with taxonomy codes.
- Create regression tests for each major category.
- Track category rates in production monitoring.
- Review category trends after prompt, model, or index changes.
- Keep labels stable enough for historical comparison.
- Train reviewers with examples for each category.
Summary
An error taxonomy for RAG and AI agents turns vague quality problems into stage-specific failure classes. Indexing, retrieval, generation, citations, tools, state, permissions, context, multi-agent handoffs, and infrastructure each produce different errors and need different fixes.
Teams that classify failures consistently debug faster, design better evaluations, and improve the right part of the system instead of treating every bad output as a model problem.