Knowledge Graph Evaluation: Coverage, Accuracy, and Freshness

A knowledge graph is only useful if it represents the right facts, represents them correctly, and stays current as source data changes. For AI applications, this matters even more because a retrieval error can be turned into a fluent, confident-sounding answer that is wrong.

Knowledge graph evaluation should measure the graph itself and the AI system that uses it. A graph can be structurally valid but still fail if it misses important entities, stores incorrect relationships, loses source evidence, or becomes stale.

Short Answer

Evaluate a knowledge graph across three core dimensions: coverage, accuracy, and freshness.

Coverage: Does the graph include the entities, relationships, properties, and sources needed for real questions?
Accuracy: Are the entities, relationships, and attributes correct and supported by evidence?
Freshness: Does the graph reflect the current state of the source data?

For GraphRAG and AI agents, also measure whether the graph improves retrieval quality, citation quality, and answer faithfulness.

Why Knowledge Graph Evaluation Is Different

Traditional search evaluation focuses on whether retrieved documents are relevant. Knowledge graph evaluation must also check whether structured facts are complete, correct, connected, and traceable.

A graph can fail in several ways:

missing key entities
duplicating the same entity under different names
storing weak or wrong relationships
using stale source data
losing provenance links
expanding through noisy graph paths
retrieving facts the user is not permitted to access

These failures affect both retrieval and generation.

Dimension 1: Coverage

Coverage measures whether the graph contains enough of the domain to support the target use cases.

Good coverage does not mean every possible fact is in the graph. It means the graph includes the facts needed to answer real user questions.

Coverage evaluation should ask:

Are the important entity types represented?
Are important relationships present?
Are source documents and chunks linked?
Are aliases and alternate names handled?
Are common user questions answerable from the graph?
Are important domains, teams, products, or policies missing?

Entity Coverage

Entity coverage measures whether the graph includes the entities users need.

For example, an incident graph may need services, teams, alerts, incidents, runbooks, customers, and dependencies. A policy graph may need policies, regions, data categories, exceptions, owners, and effective dates.

Compare extracted entities against source records, human-labeled samples, or a trusted system of record.
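
The comparison described above can be sketched in a few lines. This is a minimal illustration, assuming entities can be matched by a case-insensitive name; the function name entity_coverage and the sample service names are hypothetical, and a real pipeline would usually match on stable identifiers or resolved aliases instead.

```python
def entity_coverage(graph_entities, reference_entities):
    """Return (coverage ratio, missing names) against a trusted reference set.

    Assumption: names match after trimming and lowercasing. Real systems
    would normally compare canonical IDs or resolved aliases.
    """
    graph_ids = {name.strip().lower() for name in graph_entities}
    reference_ids = {name.strip().lower() for name in reference_entities}
    if not reference_ids:
        raise ValueError("reference set is empty")
    missing = reference_ids - graph_ids
    covered = len(reference_ids) - len(missing)
    return covered / len(reference_ids), sorted(missing)


# Hypothetical sample: two of four reference services are in the graph.
ratio, missing = entity_coverage(
    ["Checkout", "payments"],
    ["checkout", "payments", "billing", "auth"],
)
print(ratio, missing)  # 0.5 ['auth', 'billing']
```

The list of missing names is usually more actionable than the ratio, because it points at the exact gaps to investigate.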

Relationship Coverage

Relationship coverage measures whether important connections are present.

For example, a service dependency graph must capture depends_on relationships. A compliance graph must capture applies_to, requires, and has_exception relationships.

Relationship coverage is often more important than entity coverage because graph retrieval depends on traversable paths.
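
One way to make relationship coverage concrete is to report it per relationship type, so that a gap in depends_on is not hidden by good coverage elsewhere. The sketch below assumes edges are stored as (source, type, target) triples and that a labeled set of expected edges exists; the function name and sample data are illustrative only.

```python
from collections import defaultdict


def relationship_coverage(graph_edges, expected_edges):
    """Return {relationship type: share of expected edges present in the graph}.

    Assumption: edges are (source, rel_type, target) tuples and direction matters.
    """
    graph_set = set(graph_edges)
    found = defaultdict(int)
    total = defaultdict(int)
    for edge in set(expected_edges):
        rel_type = edge[1]
        total[rel_type] += 1
        if edge in graph_set:
            found[rel_type] += 1
    return {rel_type: found[rel_type] / total[rel_type] for rel_type in total}


# Hypothetical labeled sample.
expected = [
    ("checkout", "depends_on", "payments"),
    ("payments", "depends_on", "ledger"),
    ("gdpr", "applies_to", "eu"),
]
graph = [
    ("checkout", "depends_on", "payments"),
    ("gdpr", "applies_to", "eu"),
]
print(relationship_coverage(graph, expected))
```

In this sample, half of the expected depends_on edges are missing while applies_to is fully covered, which is the kind of gap that breaks multi-hop traversal.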

Source Coverage

Source coverage checks whether the graph is grounded in the right source material.

A graph may have many nodes and edges but still miss key documents, source systems, logs, tickets, policies, or records.

Track which sources are ingested, which are excluded, which failed ingestion, and which are waiting for review.
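
A simple status tally is often enough to start tracking this. The following sketch assumes each source carries one of four status labels taken from the sentence above (ingested, excluded, failed, pending_review); the label strings and the function name are assumptions, not a standard.

```python
from collections import Counter


def source_status_report(sources):
    """Summarize ingestion status for a list of (source_name, status) pairs.

    Returns (counts per status, share of sources that are ingested).
    """
    counts = Counter(status for _, status in sources)
    total = sum(counts.values())
    ingested_share = counts["ingested"] / total if total else 0.0
    return dict(counts), ingested_share


# Hypothetical inventory of source systems.
sources = [
    ("wiki", "ingested"),
    ("tickets", "failed"),
    ("logs", "excluded"),
    ("policies", "ingested"),
]
counts, share = source_status_report(sources)
print(counts, share)
```

Keeping excluded and failed as separate statuses matters: an excluded source is a decision, while a failed source is a defect.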

Dimension 2: Accuracy

Accuracy measures whether graph facts are correct.

For AI applications, accuracy should be evaluated at several levels:

entity accuracy
relationship accuracy
property accuracy
provenance accuracy
retrieval accuracy
generated-answer accuracy

Do not rely only on graph size or extraction volume. More nodes and edges do not automatically mean better quality.

Entity Accuracy

Entity accuracy checks whether each node represents the right real-world or domain object.

Common entity errors include:

duplicate entities
merged entities that should be separate
wrong entity type
missing aliases
incorrect canonical name
ambiguous references

Evaluate entity accuracy using sampled review, entity resolution benchmarks, and known examples from production queries.
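
Sampled review can be focused by first surfacing likely duplicates. The sketch below groups nodes whose names collapse to the same normalized form; it is a crude lexical heuristic under the assumption that nodes are (node_id, name) pairs, and it only produces candidates for a reviewer. It does not decide whether two nodes should actually be merged.

```python
import re
from collections import defaultdict


def normalize_name(name):
    """Lowercase and collapse punctuation and whitespace into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def duplicate_candidates(nodes):
    """Group node IDs whose names normalize to the same string.

    Only groups with more than one node are returned, as review candidates.
    """
    groups = defaultdict(list)
    for node_id, name in nodes:
        groups[normalize_name(name)].append(node_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Hypothetical nodes: n1 and n2 probably describe the same service.
nodes = [("n1", "Payments-Service"), ("n2", "payments service"), ("n3", "Ledger")]
print(duplicate_candidates(nodes))  # {'payments service': ['n1', 'n2']}
```

A lexical check like this misses true aliases with different spellings and can flag distinct entities that share a name, which is why the output feeds human review rather than an automatic merge.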

Relationship Accuracy

Relationship accuracy checks whether each edge connects the right nodes, points in the right direction, carries the right type, and is supported by evidence.

For example, Service A depends_on Service B is not the same as Service B depends_on Service A. Direction errors can break impact analysis and agent planning.

Review relationships against source evidence and expected query behavior.
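
Direction errors are easy to detect automatically when a small gold set of reviewed edges is available. This sketch assumes edges are (source, type, target) triples; the function name is hypothetical.

```python
def find_direction_errors(graph_edges, gold_edges):
    """Return graph edges that appear in the gold set only in reversed form.

    Assumption: edges are (source, rel_type, target) tuples. An edge whose
    reverse is also a valid gold edge (a symmetric relationship) is not flagged.
    """
    gold = set(gold_edges)
    reversed_edges = []
    for source, rel_type, target in graph_edges:
        if (source, rel_type, target) in gold:
            continue  # correct as stored
        if (target, rel_type, source) in gold:
            reversed_edges.append((source, rel_type, target))
    return reversed_edges


# Hypothetical case: the graph stores the dependency backwards.
gold = [("Service A", "depends_on", "Service B")]
graph = [("Service B", "depends_on", "Service A"), ("Service C", "depends_on", "Service D")]
print(find_direction_errors(graph, gold))
# [('Service B', 'depends_on', 'Service A')]
```

The edge between Service C and Service D is not reported because the gold set says nothing about it; this check finds reversals, not unsupported edges.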

Property Accuracy

Properties such as owner, status, region, date, severity, score, or category often drive filters and decisions.

Incorrect properties can cause the graph to retrieve the wrong context even when entities and relationships are correct.

Test important properties against source systems and validate type consistency, allowed values, null handling, and timestamp logic.
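
The validation steps named above (type consistency, allowed values, null handling, timestamp logic) can be expressed as a small rule function. The property names, the allowed severity values, and the function name below are all assumptions made for illustration; a real schema would define its own.

```python
from datetime import datetime

# Hypothetical allowed values; a real graph schema would define these.
ALLOWED_SEVERITIES = {"low", "medium", "high", "critical"}


def validate_incident(props):
    """Return a list of problems found in one incident node's properties."""
    problems = []

    # Null handling and type consistency for a required string property.
    owner = props.get("owner")
    if not isinstance(owner, str) or not owner.strip():
        problems.append("owner is missing or empty")

    # Allowed values.
    if props.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("severity is not an allowed value")

    # Timestamp logic: resolved_at is optional but cannot precede opened_at.
    opened, resolved = props.get("opened_at"), props.get("resolved_at")
    if not isinstance(opened, datetime):
        problems.append("opened_at is not a datetime")
    elif resolved is not None:
        if not isinstance(resolved, datetime):
            problems.append("resolved_at is not a datetime")
        elif resolved < opened:
            problems.append("resolved_at is earlier than opened_at")
    return problems


bad = {
    "owner": "",
    "severity": "urgent",
    "opened_at": datetime(2024, 1, 2),
    "resolved_at": datetime(2024, 1, 1),
}
for problem in validate_incident(bad):
    print(problem)
```

Returning every problem, rather than stopping at the first, makes it easier to see whether a bad record is a one-off or a sign of a broken extraction rule.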

Provenance Accuracy

Provenance accuracy checks whether graph facts link to the right evidence.

A relationship should point to the source chunk, record, or assertion that supports it. Citations should not merely point to a related document. They should point to the evidence that actually supports the claim.

This is essential for GraphRAG, compliance, legal review, enterprise search, and agentic systems.
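
A cheap first-pass provenance check is to confirm that every edge cites a chunk that exists and that the chunk at least mentions both endpoints. The sketch below assumes edges carry a chunk ID as a fourth field and that chunk text is available in a dictionary; both the data layout and the function name are hypothetical. A lexical mention is only a proxy for support, so edges that pass still need sampled human or model-based review.

```python
def check_provenance(edges, chunks):
    """Flag edges whose cited chunk is missing or does not mention both endpoints.

    edges:  iterable of (source, rel_type, target, chunk_id)
    chunks: dict mapping chunk_id to chunk text
    Assumption: entity names appear literally in the text (no alias handling).
    """
    issues = []
    for source, rel_type, target, chunk_id in edges:
        text = chunks.get(chunk_id)
        if text is None:
            issues.append((source, rel_type, target, "missing chunk"))
            continue
        lowered = text.lower()
        if source.lower() not in lowered or target.lower() not in lowered:
            issues.append((source, rel_type, target, "endpoint not mentioned"))
    return issues


# Hypothetical data: c2 is a related chunk, but it does not support the edge.
chunks = {
    "c1": "Checkout calls Payments for every order.",
    "c2": "Ledger runs a nightly batch.",
}
edges = [
    ("checkout", "depends_on", "payments", "c1"),
    ("payments", "depends_on", "ledger", "c2"),
    ("checkout", "depends_on", "auth", "c9"),
]
for issue in check_provenance(edges, chunks):
    print(issue)
```

The second edge illustrates the failure described above: the citation points to a related chunk rather than to evidence for the claim.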

Dimension 3: Freshness

Freshness measures whether the graph reflects the current state of source data.

A stale graph can be dangerous. It may show old policies, old owners, old dependencies, resolved incidents, retired products, or outdated customer records.

Freshness evaluation should ask:

How long after a source update does the graph update?
Are deleted or deprecated facts removed or marked inactive?
Are summaries refreshed when source evidence changes?
Can historical facts be distinguished from current facts?
Do stale facts surface in generated answers?

Freshness Metrics

Useful freshness metrics include:

source-to-graph lag
percentage of stale nodes
percentage of stale relationships
failed ingestion count
unprocessed source update count
summary refresh lag
deleted-source retention errors (facts still served after their source was deleted)

Freshness should be monitored continuously, not only during manual audits.
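
Two of these metrics, the percentage of stale nodes and how long source updates have been waiting, can be derived from a pair of timestamps per node. The sketch below assumes each record holds the time the source last changed and the time the graph last synced that node; the record layout and function name are illustrative.

```python
from datetime import datetime, timedelta


def freshness_report(records, now):
    """Return (share of stale nodes, {node_id: how long its update has waited}).

    records: iterable of (node_id, source_updated_at, graph_synced_at)
    A node counts as stale when the source changed after the last graph sync.
    """
    records = list(records)
    pending = {}
    for node_id, source_updated_at, graph_synced_at in records:
        if graph_synced_at < source_updated_at:
            pending[node_id] = now - source_updated_at
    stale_share = len(pending) / len(records) if records else 0.0
    return stale_share, pending


# Hypothetical records: policy-2 changed one hour ago and has not been re-synced.
now = datetime(2024, 5, 1, 12, 0)
records = [
    ("policy-1", datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 10, 0)),
    ("policy-2", datetime(2024, 5, 1, 11, 0), datetime(2024, 5, 1, 10, 0)),
]
share, pending = freshness_report(records, now)
print(share, pending["policy-2"] == timedelta(hours=1))  # 0.5 True
```

Running a report like this on a schedule and alerting on the longest pending wait is one way to move freshness from audits to continuous monitoring.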

Evaluating GraphRAG Retrieval

Graph quality matters because it affects retrieval.

For GraphRAG, evaluate whether graph retrieval improves the context sent to the LLM.

Useful retrieval checks include:

Are the correct entities retrieved?
Are the correct relationship paths retrieved?
Are source chunks relevant to the question?
Is graph expansion adding useful context or noise?
Are high-degree generic nodes dominating retrieval?
Does reranking select the best evidence?

Compare graph retrieval against vector-only retrieval and hybrid keyword/vector retrieval on the same question set, so that any difference can be attributed to the graph.

Entity and Relationship Recall

Entity recall measures whether retrieval found the entities needed to answer the question.

Relationship recall measures whether retrieval found the relationships needed to explain the answer.

For example, a dependency question may require retrieving both the affected service and the dependency path connecting it to a user-facing application.

If the answer needs the path and the graph retrieves only one endpoint, recall is incomplete.
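
Both kinds of recall reduce to the same set calculation. The sketch below is a minimal version, assuming entities are compared by ID and relationships as (source, type, target) triples; the sample dependency path is hypothetical.

```python
def recall(retrieved, expected):
    """Share of expected items that were retrieved (1.0 if nothing is expected)."""
    expected_set = set(expected)
    if not expected_set:
        return 1.0
    return len(expected_set & set(retrieved)) / len(expected_set)


# Hypothetical question: "Which user-facing app is affected if payments fails?"
expected_entities = ["payments", "checkout", "storefront"]
expected_path = [
    ("storefront", "depends_on", "checkout"),
    ("checkout", "depends_on", "payments"),
]

# Retrieval found one endpoint but none of the connecting path.
retrieved_entities = ["payments"]
retrieved_edges = []

print(recall(retrieved_entities, expected_entities))  # one of three entities
print(recall(retrieved_edges, expected_path))         # 0.0
```

Reporting the two numbers separately shows the case described above: some of the entities were found, but the path that explains the answer was not.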

Precision and Noise

Graph retrieval can over-expand.

Precision measures whether the retrieved graph context is actually relevant. Low precision causes the LLM to receive noisy facts, unrelated neighbors, broad community summaries, or stale context.

Control precision with traversal depth limits, edge-type filters, node-type filters, relationship confidence, access control, and reranking.
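
Two of these controls, traversal depth limits and edge-type filters, can be shown with a bounded breadth-first expansion. This is a sketch under the assumption that the graph is available as an in-memory adjacency dictionary; the function name and sample graph are hypothetical, and a production system would typically push these limits into the graph query itself.

```python
from collections import deque


def expand(adjacency, seeds, max_depth, allowed_types):
    """Breadth-first expansion from seed nodes with a depth and edge-type limit.

    adjacency: dict mapping node to a list of (rel_type, neighbor) pairs
    Returns the set of nodes reached, including the seeds.
    """
    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for rel_type, neighbor in adjacency.get(node, []):
            if rel_type in allowed_types and neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return visited


# Hypothetical graph: "wiki" is a generic neighbor reached by a noisy edge type.
adjacency = {
    "checkout": [("depends_on", "payments"), ("mentioned_in", "wiki")],
    "payments": [("depends_on", "ledger")],
    "ledger": [("depends_on", "db")],
}
print(sorted(expand(adjacency, ["checkout"], 2, {"depends_on"})))
# ['checkout', 'ledger', 'payments']
```

The edge-type filter keeps the generic wiki node out of the context, and the depth limit stops the expansion before it reaches db.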

Evaluating Generated Answers

End-to-end evaluation checks whether the final AI answer is correct, relevant, and grounded.

Useful answer-level checks include:

Does the answer use retrieved graph evidence correctly?
Are cited sources relevant?
Are relationship paths explained accurately?
Is the answer faithful to source chunks?
Does the answer avoid unsupported claims?
Does it mention uncertainty when evidence is weak?

Answer faithfulness is especially important because an LLM can generate a fluent answer even when graph retrieval is incomplete or wrong.

Evaluation Dataset Design

Use real questions to evaluate the graph.

A good evaluation set includes:

single-hop questions
multi-hop questions
entity lookup questions
relationship path questions
provenance questions
freshness-sensitive questions
permission-sensitive questions
questions with known missing data

Each question should define expected entities, expected relationships, expected source evidence, and expected answer behavior.
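
One possible shape for such a test case is a small record with one field per expectation. The class name, field names, and scoring function below are assumptions for illustration rather than an established format.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One evaluation question with its expected graph evidence (hypothetical schema)."""
    question: str
    expected_entities: set = field(default_factory=set)
    expected_relationships: set = field(default_factory=set)
    expected_sources: set = field(default_factory=set)
    expected_behavior: str = "answer"  # for example "answer" or "abstain"


def score_case(case, retrieved_entities, retrieved_relationships, retrieved_sources):
    """Compute recall of each expected evidence type for one case."""
    def share(expected, retrieved):
        return len(expected & set(retrieved)) / len(expected) if expected else 1.0

    return {
        "entity_recall": share(case.expected_entities, retrieved_entities),
        "relationship_recall": share(case.expected_relationships, retrieved_relationships),
        "source_recall": share(case.expected_sources, retrieved_sources),
    }


case = EvalCase(
    question="What does checkout depend on?",
    expected_entities={"checkout", "payments"},
    expected_relationships={("checkout", "depends_on", "payments")},
    expected_sources={"runbook-12"},
)
print(score_case(case, ["checkout"], [("checkout", "depends_on", "payments")], []))
```

The expected_behavior field is where questions with known missing data can record that the correct outcome is to abstain rather than to answer.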

Human Review vs Automated Evaluation

Human review is useful for high-risk facts, complex relationships, and sampled audits.

Automated evaluation is useful for regression testing, freshness monitoring, citation checks, schema validation, and retrieval benchmarks.

LLM-as-a-judge evaluation can help score relevance, faithfulness, and citation quality, but it should be calibrated against human-reviewed examples for important workflows.
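
Calibration can be quantified by comparing judge labels with human labels on the same examples. The sketch below computes Cohen's kappa, a standard agreement statistic that corrects raw agreement for the agreement expected by chance; the label values and sample data are hypothetical, and what counts as an acceptable kappa is left to the team.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two label sequences over the same examples."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)

    # Observed agreement: share of examples where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Chance agreement: from each rater's marginal label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[label] * human_counts[label] for label in judge_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical faithfulness labels for four answers.
judge = ["pass", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge, human))  # 0.5
```

Here the judge agrees with the reviewer on three of four answers, but kappa is only 0.5 because half of that agreement would be expected by chance given how often each rater uses each label.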

Regression Testing

Any change to the graph, or to the pipeline that builds and queries it, can affect retrieval.

Regression tests should run when teams change:

entity extraction prompts
relationship extraction rules
chunking strategy
embedding model
graph schema
traversal rules
ranking logic
permission filters
summary generation prompts

The goal is to catch quality regressions before they reach users.
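
A regression gate can be as simple as comparing the current metrics with a stored baseline and flagging any drop beyond a tolerance. The metric names, the tolerance value, and the function name below are assumptions, and the sketch treats every metric as higher-is-better.

```python
def find_regressions(baseline, current, tolerance=0.02):
    """Return {metric: (baseline value, current value)} for metrics that regressed.

    Assumption: all metrics are higher-is-better. A metric missing from the
    current run is also reported, with None as its current value.
    """
    regressions = {}
    for metric, old_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None or new_value < old_value - tolerance:
            regressions[metric] = (old_value, new_value)
    return regressions


# Hypothetical run after changing a relationship extraction rule.
baseline = {"entity_recall": 0.90, "faithfulness": 0.95}
current = {"entity_recall": 0.91, "faithfulness": 0.90}
print(find_regressions(baseline, current))  # {'faithfulness': (0.95, 0.9)}
```

A check like this can run in continuous integration and block the change when the returned dictionary is not empty.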

Operational Monitoring

Production graph evaluation should include live monitoring.

Track signals such as:

ingestion failures
orphaned chunks
entities with no evidence
relationships with no source links
stale summaries
unusually large graph expansions
retrieval latency
citation failure rate
user feedback on answer quality

These signals help detect graph quality problems before they become answer quality problems.
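
Some of these signals, such as relationships with no source links and orphaned chunks, can be computed directly from the stored graph. The sketch below assumes each edge carries an optional chunk ID and treats a chunk as orphaned when no edge cites it; the data layout, the definition of orphaned, and the function name are all illustrative.

```python
def integrity_signals(edges, chunk_ids):
    """Compute two structural monitoring signals.

    edges:     list of (source, rel_type, target, chunk_id or None)
    chunk_ids: iterable of all chunk IDs stored alongside the graph
    Assumption: a chunk is "orphaned" when no edge cites it.
    """
    cited = {edge[3] for edge in edges if edge[3] is not None}
    unsourced = [edge[:3] for edge in edges if edge[3] is None]
    orphaned = sorted(set(chunk_ids) - cited)
    return {
        "relationships_without_source": unsourced,
        "orphaned_chunks": orphaned,
    }


# Hypothetical snapshot of a small graph.
edges = [
    ("checkout", "depends_on", "payments", "c1"),
    ("payments", "depends_on", "ledger", None),
]
print(integrity_signals(edges, ["c1", "c2"]))
```

Tracking the size of each list over time is usually more informative than a single snapshot, since a sudden rise tends to follow an ingestion or extraction change.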

A Practical Evaluation Checklist

Can the graph answer representative user questions?
Are important entity types covered?
Are important relationship types covered?
Are entities deduplicated and correctly typed?
Are relationships directional and supported by evidence?
Do graph facts link to exact source chunks?
Are stale facts detected and handled?
Does graph retrieval improve context recall?
Does graph expansion add useful context without too much noise?
Are generated answers faithful to retrieved evidence?
Are permissions respected during retrieval and citation display?
Are graph updates covered by regression tests?

Summary

Knowledge graph evaluation should measure coverage, accuracy, and freshness.

Coverage asks whether the graph contains the entities, relationships, and sources needed for real questions. Accuracy asks whether those facts are correct and supported. Freshness asks whether the graph still reflects current source data.

For GraphRAG and AI agents, the final test is whether the graph improves grounded answers: better retrieval, better citations, better relationship reasoning, and fewer unsupported claims.