A knowledge graph is only useful if it represents the right facts, represents them correctly, and stays current as source data changes. For AI applications, this matters even more because a retrieval error can surface as a confident but wrong generated answer.
Knowledge graph evaluation should measure the graph itself and the AI system that uses it. A graph can be structurally valid but still fail if it misses important entities, stores incorrect relationships, loses source evidence, or becomes stale.
Short Answer
Evaluate a knowledge graph across three core dimensions: coverage, accuracy, and freshness.
- Coverage: Does the graph include the entities, relationships, properties, and sources needed for real questions?
- Accuracy: Are the entities, relationships, and attributes correct and supported by evidence?
- Freshness: Does the graph reflect the current state of the source data?
For GraphRAG and AI agents, also measure whether the graph improves retrieval quality, citation quality, and answer faithfulness.
Why Knowledge Graph Evaluation Is Different
Traditional search evaluation focuses on whether retrieved documents are relevant. Knowledge graph evaluation must also check whether structured facts are complete, correct, connected, and traceable.
A graph can fail in several ways:
- missing key entities
- duplicating the same entity under different names
- storing weak or wrong relationships
- using stale source data
- losing provenance links
- expanding through noisy graph paths
- retrieving facts the user is not permitted to access
These failures affect both retrieval and generation.
Dimension 1: Coverage
Coverage measures whether the graph contains enough of the domain to support the target use cases.
Good coverage does not mean every possible fact is in the graph. It means the graph includes the facts needed to answer real user questions.
Coverage evaluation should ask:
- Are the important entity types represented?
- Are important relationships present?
- Are source documents and chunks linked?
- Are aliases and alternate names handled?
- Are common user questions answerable from the graph?
- Are important domains, teams, products, or policies missing?
Entity Coverage
Entity coverage measures whether the graph includes the entities users need.
For example, an incident graph may need services, teams, alerts, incidents, runbooks, customers, and dependencies. A policy graph may need policies, regions, data categories, exceptions, owners, and effective dates.
Compare extracted entities against source records, human-labeled samples, or a trusted system of record.
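As a rough illustration of that comparison, the sketch below computes entity coverage against a reference list. The function name `entity_coverage`, the service names, and the simple lowercase normalization are assumptions made for this example; a real pipeline would plug in its own alias handling.

```python
def entity_coverage(graph_entities, reference_entities):
    """Return (coverage ratio, missing names) for the graph against a trusted reference list."""

    def normalize(name):
        # Case and surrounding whitespace are ignored; real alias handling would go here.
        return name.strip().lower()

    graph = {normalize(name) for name in graph_entities}
    reference = {normalize(name) for name in reference_entities}
    if not reference:
        return 1.0, []
    missing = sorted(reference - graph)
    return 1 - len(missing) / len(reference), missing


# Hypothetical service names, for illustration only.
coverage, missing = entity_coverage(
    graph_entities=["Checkout-API", "payments-db"],
    reference_entities=["checkout-api", "payments-db", "auth-service", "billing-worker"],
)
print(f"entity coverage: {coverage:.0%}, missing: {missing}")
```

The list of missing names is usually more actionable than the ratio, because it points directly at the extraction or ingestion gap.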
Relationship Coverage
Relationship coverage measures whether important connections are present.
For example, a service dependency graph must capture depends_on relationships. A compliance graph must capture applies_to, requires, and has_exception relationships.
Relationship coverage is often more important than entity coverage because graph retrieval depends on traversable paths.
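One way to make this measurable is to report coverage per relationship type, so a gap in depends_on edges is not hidden by good coverage elsewhere. The sketch below assumes edges are stored as (source, type, target) triples and that a hand-labeled list of expected edges exists; both the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def relationship_coverage_by_type(graph_edges, expected_edges):
    """For each relationship type, the fraction of expected triples present in the graph."""
    present = set(graph_edges)
    found = defaultdict(int)
    total = defaultdict(int)
    for edge in set(expected_edges):
        _, rel_type, _ = edge
        total[rel_type] += 1
        if edge in present:
            found[rel_type] += 1
    return {rel_type: found[rel_type] / total[rel_type] for rel_type in total}


# Hypothetical labeled sample: (source, relationship type, target).
expected = [
    ("checkout", "depends_on", "payments"),
    ("checkout", "depends_on", "auth"),
    ("gdpr-policy", "applies_to", "eu-region"),
]
in_graph = [
    ("checkout", "depends_on", "payments"),
    ("gdpr-policy", "applies_to", "eu-region"),
]
by_type = relationship_coverage_by_type(in_graph, expected)
print(by_type)
```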
Source Coverage
Source coverage checks whether the graph is grounded in the right source material.
A graph may have many nodes and edges but still miss key documents, source systems, logs, tickets, policies, or records.
Track which sources are ingested, which are excluded, which failed ingestion, and which are waiting for review.
Dimension 2: Accuracy
Accuracy measures whether graph facts are correct.
For AI applications, accuracy should be evaluated at several levels:
- entity accuracy
- relationship accuracy
- property accuracy
- provenance accuracy
- retrieval accuracy
- generated-answer accuracy
Do not rely only on graph size or extraction volume. More nodes and edges do not automatically mean better quality.
Entity Accuracy
Entity accuracy checks whether each node represents the right real-world or domain object.
Common entity errors include:
- duplicate entities
- merged entities that should be separate
- wrong entity type
- missing aliases
- incorrect canonical name
- ambiguous references
Evaluate entity accuracy using sampled review, entity resolution benchmarks, and known examples from production queries.
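A cheap first pass before sampled review is to flag nodes whose names or aliases collapse to the same normalized key. The sketch below is only a candidate generator under that assumption: the entity dictionaries, the `find_duplicate_candidates` name, and the normalization rule are illustrative, and every flagged group still needs a human or an entity resolution model to confirm it.

```python
import re
from collections import defaultdict


def find_duplicate_candidates(entities):
    """Group entity IDs whose name or alias normalizes to the same key."""

    def key(name):
        # Lowercase and drop everything except letters and digits.
        return re.sub(r"[^a-z0-9]", "", name.lower())

    groups = defaultdict(set)
    for entity in entities:
        for name in [entity["name"], *entity.get("aliases", [])]:
            groups[key(name)].add(entity["id"])
    # Keep only non-empty keys shared by more than one entity.
    return {k: sorted(ids) for k, ids in groups.items() if k and len(ids) > 1}


# Hypothetical nodes, for illustration only.
entities = [
    {"id": "e1", "name": "Checkout API"},
    {"id": "e2", "name": "checkout-api", "aliases": ["Checkout Svc"]},
    {"id": "e3", "name": "Payments DB"},
]
candidates = find_duplicate_candidates(entities)
print(candidates)
```

This catches duplicates but not wrongly merged entities; those usually require checking that all evidence attached to one node refers to the same object.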
Relationship Accuracy
Relationship accuracy checks whether each edge has the right endpoints, the right direction, the right type, and supporting evidence.
For example, Service A depends_on Service B is not the same as Service B depends_on Service A. Direction errors can break impact analysis and agent planning.
Review relationships against source evidence and expected query behavior.
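Direction errors are easy to separate from other errors when a reviewed sample of edges is available. The following sketch assumes edges are (source, type, target) triples and that `gold_edges` comes from human review; the function name and sample services are made up for the example.

```python
def audit_edges(graph_edges, gold_edges):
    """Classify graph edges against a reviewed sample of (source, type, target) triples."""
    graph, gold = set(graph_edges), set(gold_edges)
    # An edge is "reversed" if the gold set contains the same edge with endpoints swapped.
    reversed_edges = {(s, r, t) for (s, r, t) in graph - gold if (t, r, s) in gold}
    return {
        "correct": graph & gold,
        "reversed": reversed_edges,
        "unsupported": graph - gold - reversed_edges,
        # Note: the gold counterpart of a reversed edge also appears here.
        "missing": gold - graph,
    }


# Hypothetical sample, for illustration only.
gold = [("A", "depends_on", "B"), ("A", "depends_on", "C")]
graph = [("B", "depends_on", "A"), ("A", "depends_on", "C"), ("A", "depends_on", "D")]
report = audit_edges(graph, gold)
precision = len(report["correct"]) / len(set(graph))
print(report["reversed"], f"precision={precision:.2f}")
```

Reporting reversed edges separately is a design choice: they usually indicate an extraction prompt or rule problem rather than a missing source.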
Property Accuracy
Properties such as owner, status, region, date, severity, score, or category often drive filters and decisions.
Incorrect properties can cause the graph to retrieve the wrong context even when entities and relationships are correct.
Test important properties against source systems and validate type consistency, allowed values, null handling, and timestamp logic.
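A minimal property validator might look like the sketch below. The property names (`owner`, `status`, `created_at`, `updated_at`) and the allowed status values are assumptions chosen for illustration; a real graph would take them from its schema.

```python
from datetime import datetime

# Hypothetical allowed values; replace with the graph's actual schema.
ALLOWED_STATUS = {"active", "deprecated", "retired"}


def validate_node(node):
    """Return a list of human-readable problems found in a node's properties."""
    problems = []
    # Null handling: a missing or empty owner is treated as an error.
    if not node.get("owner"):
        problems.append("owner is missing or empty")
    # Allowed values.
    if node.get("status") not in ALLOWED_STATUS:
        problems.append(f"status {node.get('status')!r} is not an allowed value")
    # Type consistency, then timestamp logic.
    created, updated = node.get("created_at"), node.get("updated_at")
    if not isinstance(created, datetime) or not isinstance(updated, datetime):
        problems.append("created_at and updated_at must be datetime values")
    elif updated < created:
        problems.append("updated_at is earlier than created_at")
    return problems


good = {"owner": "team-payments", "status": "active",
        "created_at": datetime(2024, 4, 1), "updated_at": datetime(2024, 5, 1)}
bad = {"owner": None, "status": "live",
       "created_at": datetime(2024, 5, 1), "updated_at": datetime(2024, 4, 1)}
print(validate_node(good), validate_node(bad))
```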
Provenance Accuracy
Provenance accuracy checks whether graph facts link to the right evidence.
A relationship should point to the source chunk, record, or assertion that supports it. Citations should not merely point to a related document. They should point to the evidence that actually supports the claim.
This is essential for GraphRAG, compliance, legal review, enterprise search, and agentic systems.
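A simple automated screen for provenance is to confirm that each edge's cited chunk exists and at least mentions both endpoints. This is a lexical heuristic, not proof that the chunk supports the claim, and the field names (`source`, `target`, `chunk_id`) are assumptions for this sketch. Edges that pass can still be wrong, so it works best as a filter ahead of human or LLM review.

```python
def check_provenance(edges, chunks):
    """Flag edges whose cited chunk is missing or does not mention both endpoints.

    edges: list of dicts with "source", "target", "chunk_id"
    chunks: dict mapping chunk_id -> chunk text
    """
    findings = []
    for edge in edges:
        text = chunks.get(edge.get("chunk_id"))
        if text is None:
            findings.append((edge["source"], edge["target"], "missing_chunk"))
            continue
        lowered = text.lower()
        if edge["source"].lower() not in lowered or edge["target"].lower() not in lowered:
            findings.append((edge["source"], edge["target"], "endpoints_not_in_chunk"))
    return findings


# Hypothetical chunks and edges, for illustration only.
chunks = {
    "c1": "Checkout-API calls Payments-DB for every order.",
    "c2": "The on-call rotation changed in May.",
}
edges = [
    {"source": "checkout-api", "target": "payments-db", "chunk_id": "c1"},
    {"source": "checkout-api", "target": "auth-service", "chunk_id": "c2"},
    {"source": "auth-service", "target": "ledger-db", "chunk_id": "c9"},
]
findings = check_provenance(edges, chunks)
print(findings)
```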
Dimension 3: Freshness
Freshness measures whether the graph reflects the current state of source data.
A stale graph can be dangerous. It may show old policies, old owners, old dependencies, resolved incidents, retired products, or outdated customer records.
Freshness evaluation should ask:
- How long after a source update does the graph update?
- Are deleted or deprecated facts removed or marked inactive?
- Are summaries refreshed when source evidence changes?
- Can historical facts be distinguished from current facts?
- Do stale facts surface in generated answers?
Freshness Metrics
Useful freshness metrics include:
- source-to-graph lag
- percentage of stale nodes
- percentage of stale relationships
- failed ingestion count
- unprocessed source update count
- summary refresh lag
- deleted-source retention errors (facts that remain in the graph after their source was deleted)
Freshness should be monitored continuously, not only during manual audits.
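Two of these metrics, source-to-graph lag and the share of stale records, can be computed from a pair of timestamps per record. The sketch below assumes each record carries `source_updated_at` and `graph_updated_at` fields (hypothetical names) and treats a record as stale when the graph has not processed the latest source update.

```python
from datetime import datetime, timedelta


def freshness_report(records):
    """Compute the stale ratio and the worst source-to-graph lag among synced records."""
    lags = []
    stale = 0
    for record in records:
        graph_time = record["graph_updated_at"]
        source_time = record["source_updated_at"]
        if graph_time is None or graph_time < source_time:
            # Never ingested, or the source changed after the last graph update.
            stale += 1
        else:
            lags.append(graph_time - source_time)
    return {
        "stale_ratio": stale / len(records) if records else 0.0,
        "max_lag": max(lags, default=timedelta(0)),
    }


# Hypothetical records, for illustration only.
records = [
    {"source_updated_at": datetime(2024, 6, 1, 10, 0), "graph_updated_at": datetime(2024, 6, 1, 10, 30)},
    {"source_updated_at": datetime(2024, 6, 2, 9, 0), "graph_updated_at": datetime(2024, 6, 1, 10, 30)},
    {"source_updated_at": datetime(2024, 6, 1, 8, 0), "graph_updated_at": datetime(2024, 6, 1, 10, 0)},
    {"source_updated_at": datetime(2024, 6, 3, 8, 0), "graph_updated_at": None},
]
report = freshness_report(records)
print(report)
```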
Evaluating GraphRAG Retrieval
Graph quality matters because it affects retrieval.
For GraphRAG, evaluate whether graph retrieval improves the context sent to the LLM.
Useful retrieval checks include:
- Are the correct entities retrieved?
- Are the correct relationship paths retrieved?
- Are source chunks relevant to the question?
- Is graph expansion adding useful context or noise?
- Are high-degree generic nodes dominating retrieval?
- Does reranking select the best evidence?
Compare graph retrieval against vector-only retrieval and hybrid keyword/vector retrieval.
Entity and Relationship Recall
Entity recall measures whether retrieval found the entities needed to answer the question.
Relationship recall measures whether retrieval found the relationships needed to explain the answer.
For example, a dependency question may require retrieving both the affected service and the dependency path connecting it to a user-facing application.
If the answer needs the path and the graph retrieves only one endpoint, recall is incomplete.
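Both recall measures reduce to the same set calculation, applied once to entities and once to edges. The sketch below uses made-up service names and assumes the expected entities and edges come from a labeled evaluation question.

```python
def recall(retrieved, expected):
    """Fraction of expected items that appear in the retrieved set."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(retrieved)) / len(expected)


# Hypothetical dependency question: which user-facing app is affected by payments-db?
expected_entities = {"payments-db", "checkout-api", "web-app"}
expected_edges = {
    ("web-app", "depends_on", "checkout-api"),
    ("checkout-api", "depends_on", "payments-db"),
}
retrieved_entities = {"payments-db", "web-app", "search-service"}
retrieved_edges = {("checkout-api", "depends_on", "payments-db")}

entity_recall = recall(retrieved_entities, expected_entities)
relationship_recall = recall(retrieved_edges, expected_edges)
print(f"entity recall={entity_recall:.2f}, relationship recall={relationship_recall:.2f}")
```

Here both endpoints were retrieved but only half of the path was, which is exactly the case where entity recall alone would overstate retrieval quality.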
Precision and Noise
Graph retrieval can over-expand.
Precision measures whether the retrieved graph context is actually relevant. Low precision causes the LLM to receive noisy facts, unrelated neighbors, broad community summaries, or stale context.
Control precision with traversal depth limits, edge-type filters, node-type filters, relationship confidence, access control, and reranking.
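Two of these controls, traversal depth limits and edge-type filters, can be sketched as a bounded breadth-first expansion. The adjacency format, the `expand` function name, and the sample nodes are assumptions for illustration; a graph database would express the same idea in its own query language.

```python
from collections import deque


def expand(adjacency, start, max_depth, allowed_edge_types):
    """Breadth-first expansion limited by depth and edge type.

    adjacency: dict mapping node -> list of (edge_type, neighbor) pairs
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # Do not expand past the depth limit.
        for edge_type, neighbor in adjacency.get(node, []):
            if edge_type in allowed_edge_types and neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return seen


# Hypothetical graph, for illustration only.
adjacency = {
    "checkout": [("depends_on", "payments"), ("mentioned_in", "wiki-page")],
    "payments": [("depends_on", "ledger-db")],
    "ledger-db": [("depends_on", "storage")],
    "wiki-page": [("mentions", "everything-else")],
}
context = expand(adjacency, "checkout", max_depth=2, allowed_edge_types={"depends_on"})
print(sorted(context))
```

Filtering out the `mentioned_in` edge keeps a high-degree documentation node from pulling unrelated neighbors into the context.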
Evaluating Generated Answers
End-to-end evaluation checks whether the final AI answer is correct, relevant, and grounded.
Useful answer-level checks include:
- Does the answer use retrieved graph evidence correctly?
- Are cited sources relevant?
- Are relationship paths explained accurately?
- Is the answer faithful to source chunks?
- Does the answer avoid unsupported claims?
- Does it mention uncertainty when evidence is weak?
Answer faithfulness is especially important because an LLM can generate a fluent answer even when graph retrieval is incomplete or wrong.
Evaluation Dataset Design
Use real questions to evaluate the graph.
A good evaluation set includes:
- single-hop questions
- multi-hop questions
- entity lookup questions
- relationship path questions
- provenance questions
- freshness-sensitive questions
- permission-sensitive questions
- questions with known missing data
Each question should define expected entities, expected relationships, expected source evidence, and expected answer behavior.
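One possible shape for such a test case is a small record type, plus a check that every question category is represented. The `EvalCase` fields and category labels below are hypothetical names chosen to mirror the text, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    question: str
    category: str  # e.g. "multi-hop", "freshness-sensitive"
    expected_entities: set = field(default_factory=set)
    expected_relationships: set = field(default_factory=set)
    expected_sources: set = field(default_factory=set)
    expected_behavior: str = "answer"  # "abstain" for questions with known missing data


def category_gaps(cases, required_categories):
    """Return the required categories that have no evaluation case yet."""
    return sorted(set(required_categories) - {case.category for case in cases})


# Hypothetical cases, for illustration only.
cases = [
    EvalCase(
        question="Which user-facing apps are affected if payments-db is down?",
        category="multi-hop",
        expected_entities={"payments-db", "checkout-api", "web-app"},
        expected_relationships={("checkout-api", "depends_on", "payments-db")},
        expected_sources={"runbook-payments"},
    ),
    EvalCase(
        question="Who owns the retired billing-v1 service?",
        category="known-missing-data",
        expected_behavior="abstain",
    ),
]
gaps = category_gaps(cases, ["single-hop", "multi-hop", "known-missing-data"])
print(gaps)
```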
Human Review vs Automated Evaluation
Human review is useful for high-risk facts, complex relationships, and sampled audits.
Automated evaluation is useful for regression testing, freshness monitoring, citation checks, schema validation, and retrieval benchmarks.
LLM-as-a-judge evaluation can help score relevance, faithfulness, and citation quality, but it should be calibrated against human-reviewed examples for important workflows.
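One common way to calibrate a judge is to compare its labels with human labels on the same examples and report chance-corrected agreement. The sketch below implements Cohen's kappa directly; the pass/fail labels are invented for illustration, and the threshold for "good enough" agreement is left to the team.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts, judge_counts = Counter(human_labels), Counter(judge_labels)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical faithfulness labels, for illustration only.
human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "pass", "fail", "pass"]
kappa = cohens_kappa(human, judge)
print(f"kappa={kappa:.2f}")
```

Raw agreement here is 75%, but kappa is lower because a judge that passes most answers would agree with humans fairly often by chance.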
Regression Testing
Every graph update can affect retrieval.
Regression tests should run when teams change:
- entity extraction prompts
- relationship extraction rules
- chunking strategy
- embedding model
- graph schema
- traversal rules
- ranking logic
- permission filters
- summary generation prompts
The goal is to catch quality regressions before they reach users.
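A regression gate can be as simple as comparing a candidate run's metrics with a stored baseline. The sketch below assumes every metric is "higher is better" and uses a hypothetical tolerance of 0.02; the metric names are illustrative.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return metrics that dropped by more than `tolerance` or disappeared.

    Assumes higher values are better for every metric.
    """
    regressions = {}
    for metric, old_value in baseline.items():
        new_value = candidate.get(metric)
        if new_value is None or old_value - new_value > tolerance:
            regressions[metric] = (old_value, new_value)
    return regressions


# Hypothetical scores before and after changing an extraction prompt.
baseline = {"entity_recall": 0.90, "faithfulness": 0.95, "citation_precision": 0.88}
candidate = {"entity_recall": 0.91, "faithfulness": 0.90, "citation_precision": 0.87}
regressions = find_regressions(baseline, candidate)
print(regressions)
```

In a CI pipeline, a non-empty result would typically fail the build or require an explicit sign-off.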
Operational Monitoring
Production graph evaluation should include live monitoring.
Track signals such as:
- ingestion failures
- orphaned chunks
- entities with no evidence
- relationships with no source links
- stale summaries
- unusually large graph expansions
- retrieval latency
- citation failure rate
- user feedback on answer quality
These signals help detect graph quality problems before they become answer quality problems.
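Several of these signals are structural and can be computed directly from the graph's evidence links. The sketch below assumes entities and relationships each store a list of supporting chunk IDs; the data layout and the `integrity_signals` name are hypothetical.

```python
def integrity_signals(chunk_ids, entities, relationships):
    """Compute structural health signals from evidence links.

    chunk_ids: iterable of all stored chunk IDs
    entities: dict mapping entity ID -> list of supporting chunk IDs
    relationships: list of dicts with a "chunk_ids" list
    """
    referenced = set()
    for ids in entities.values():
        referenced.update(ids)
    for relationship in relationships:
        referenced.update(relationship.get("chunk_ids", []))
    return {
        "orphaned_chunks": sorted(set(chunk_ids) - referenced),
        "entities_without_evidence": sorted(e for e, ids in entities.items() if not ids),
        "relationships_without_sources": sum(
            1 for relationship in relationships if not relationship.get("chunk_ids")
        ),
    }


# Hypothetical snapshot, for illustration only.
signals = integrity_signals(
    chunk_ids=["c1", "c2", "c3"],
    entities={"checkout-api": ["c1"], "auth-service": []},
    relationships=[
        {"source": "checkout-api", "target": "payments-db", "chunk_ids": ["c1"]},
        {"source": "checkout-api", "target": "auth-service", "chunk_ids": []},
    ],
)
print(signals)
```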
A Practical Evaluation Checklist
- Can the graph answer representative user questions?
- Are important entity types covered?
- Are important relationship types covered?
- Are entities deduplicated and correctly typed?
- Are relationships correctly directed and supported by evidence?
- Do graph facts link to exact source chunks?
- Are stale facts detected and handled?
- Does graph retrieval improve context recall over vector-only or hybrid retrieval?
- Does graph expansion add useful context without too much noise?
- Are generated answers faithful to retrieved evidence?
- Are permissions respected during retrieval and citation display?
- Are graph updates covered by regression tests?
Summary
Knowledge graph evaluation should measure coverage, accuracy, and freshness.
Coverage asks whether the graph contains the entities, relationships, and sources needed for real questions. Accuracy asks whether those facts are correct and supported. Freshness asks whether the graph still reflects current source data.
For GraphRAG and AI agents, the final test is whether the graph improves grounded answers: better retrieval, better citations, better relationship reasoning, and fewer unsupported claims.