How to Shadow Test a New Embedding Model

Shadow testing a new embedding model means sending real production queries to a candidate embedding index while users still receive results from the current production index. The new model is tested with live traffic patterns, but it does not affect the user experience until it has passed evaluation.

This is one of the safest ways to validate an embedding model change. Offline benchmarks are useful, but they rarely capture every query shape, filter combination, metadata issue, latency spike, or RAG grounding problem that appears in production.

What Shadow Testing Proves

Shadow testing helps answer a practical question: if we switched to the new embedding model today, what would users have seen?

It can reveal:

Ranking differences between the old and new model.
Queries where the new model finds better semantic matches.
Queries where exact terms, IDs, or product names get worse.
Filter or permission mismatches in the candidate index.
Latency and timeout behavior under real query load.
RAG context changes that affect answer grounding.
Gaps caused by incomplete backfill or stale metadata.

The key property is that the candidate system is exercised with production traffic but never makes a user-facing decision.

Prerequisite: Build the Candidate Index

Before shadow testing, the new embedding model needs a separate candidate target. That might be a new index, collection, namespace, or named vector. The current production index should remain unchanged.

The candidate target should be complete enough for fair comparison. It should include the same source records, chunk IDs, metadata, permissions, filters, and source references as production unless the test is deliberately evaluating a broader change.

If the candidate index is only partially backfilled, label the test clearly and limit comparison to the completed subset. Otherwise, missing results may be mistaken for model quality problems.
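One way to limit the comparison to the completed subset is to drop any result ID that has not been backfilled yet before computing metrics. The sketch below assumes result IDs are strings and that a set of backfilled IDs is available; the function name and its inputs are hypothetical.

```python
def restrict_to_backfilled(production_ids, candidate_ids, backfilled_ids):
    """Keep only results that exist in the candidate index so far.

    Returns the filtered production list, the filtered candidate list, and
    the number of production results skipped because they are not yet
    backfilled. Rank order is preserved.
    """
    backfilled = set(backfilled_ids)
    production_kept = [doc_id for doc_id in production_ids if doc_id in backfilled]
    candidate_kept = [doc_id for doc_id in candidate_ids if doc_id in backfilled]
    skipped = len(production_ids) - len(production_kept)
    return production_kept, candidate_kept, skipped


production, candidate, skipped = restrict_to_backfilled(
    ["a", "b", "c"], ["c", "d"], backfilled_ids={"a", "c", "d"}
)
```

Logging the skipped count alongside each comparison makes it easier to tell a backfill gap from a genuine ranking change.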

Shadow Test Flow

A basic shadow test follows this flow:

User sends a normal search or RAG query.
The application queries the production index and returns those results to the user.
In parallel or asynchronously, the same query is sent to the candidate index.
The system logs production results and candidate results side by side.
Metrics, diffs, and sampled reviews compare the two result sets.
No candidate result is shown to users until promotion.

This design keeps the experiment low risk. If the candidate index fails, times out, or returns poor results, users still receive the current production behavior.
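The flow above can be sketched with asyncio. This is a minimal illustration, not a reference implementation: search_production, search_candidate, and log_comparison are hypothetical callables supplied by the application, and the 0.5 second candidate timeout is an arbitrary assumption.

```python
import asyncio
import logging

logger = logging.getLogger("shadow")


async def search_with_shadow(query, search_production, search_candidate,
                             log_comparison, candidate_timeout_s=0.5):
    """Return production results; mirror the query to the candidate index.

    The candidate call runs in a background task, so a slow or failing
    candidate cannot change or delay what the user receives.
    """
    production_results = await search_production(query)

    async def run_shadow():
        try:
            candidate_results = await asyncio.wait_for(
                search_candidate(query), timeout=candidate_timeout_s
            )
            status = "ok"
        except asyncio.TimeoutError:
            candidate_results, status = [], "timeout"
        except Exception:
            # Any candidate failure is logged and otherwise ignored.
            logger.exception("candidate query failed")
            candidate_results, status = [], "error"
        log_comparison(query, production_results, candidate_results, status)

    # The caller should keep a reference to the task so it is not
    # garbage-collected before it finishes.
    shadow_task = asyncio.create_task(run_shadow())
    return production_results, shadow_task
```

Returning the task lets a caller either await it in tests or hand it to a background task set in a server. The production results are ready before the shadow query starts, which keeps the candidate entirely off the user's request path.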

What to Log

Good shadow testing depends on good logs. For each shadowed query, record enough information to reproduce and explain the comparison later.

Useful fields include:

Query text or a privacy-safe query hash.
Query timestamp.
User segment, tenant, product, language, or region where allowed.
Filters and permission scope applied.
Production index generation and candidate index generation.
Top result IDs from both systems.
Scores, ranks, and source metadata where available.
Latency and timeout status for both systems.
RAG context chunks selected by each retriever.
Any downstream answer, citation, or user feedback signal.

Be careful with privacy. Search logs can contain sensitive user text. Redact, hash, sample, or restrict access where needed.
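As an illustration, a shadow log record might look like the sketch below. The field names are assumptions chosen to mirror the list above, not a required schema. The query is stored as a keyed hash (HMAC-SHA256) so that identical queries can be grouped without keeping the raw text; the secret key is assumed to be managed outside the logging pipeline.

```python
import hashlib
import hmac
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class ShadowLogRecord:
    query_hash: str
    filters: dict
    production_generation: str
    candidate_generation: str
    production_ids: list
    candidate_ids: list
    production_latency_ms: float
    candidate_latency_ms: float
    candidate_status: str = "ok"      # "ok", "timeout", or "error"
    segment: str = "unknown"          # query type, tenant, region, ...
    timestamp: float = field(default_factory=time.time)


def hash_query(query_text: str, secret_key: bytes) -> str:
    """Keyed hash of the normalized query text (lowercased, whitespace collapsed)."""
    normalized = " ".join(query_text.lower().split())
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def to_json_line(record: ShadowLogRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

A keyed hash is harder to reverse by guessing common queries than a plain hash, but it still allows equality joins. Whether hashing alone is sufficient depends on the privacy requirements of the deployment.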

Compare Result Overlap

The first useful comparison is result overlap. For each query, compare the top results from the current model and the candidate model.

High overlap means the candidate returns largely the same documents as production for that query. Low overlap is not automatically bad; it may mean the new model finds better results. But low overlap deserves review, especially for exact lookup queries and high-value workflows.

Track overlap at different depths, such as top 1, top 3, top 5, and top 10. A candidate model that changes rank order slightly is different from one that returns an entirely different set of documents.
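A simple set-based overlap at several depths could be computed as follows. This is one possible definition (shared IDs divided by the larger of the two top-k sets); rank-aware measures are also reasonable choices. The function names are illustrative.

```python
def overlap_at_k(production_ids, candidate_ids, k):
    """Fraction of the top-k results shared by both systems, ignoring order."""
    production_top = set(production_ids[:k])
    candidate_top = set(candidate_ids[:k])
    if not production_top and not candidate_top:
        return 1.0  # both systems returned nothing: treat as identical
    shared = production_top & candidate_top
    return len(shared) / max(len(production_top), len(candidate_top))


def overlap_profile(production_ids, candidate_ids, depths=(1, 3, 5, 10)):
    """Overlap at each requested depth, keyed by depth."""
    return {k: overlap_at_k(production_ids, candidate_ids, k) for k in depths}


profile = overlap_profile(
    ["d1", "d2", "d3", "d4", "d5"],
    ["d2", "d1", "d3", "d9", "d8"],
)
# profile == {1: 0.0, 3: 1.0, 5: 0.6, 10: 0.6}
```

In this example the top result differs, the top three are the same documents in a different order, and two of the top five are new. Reading the profile across depths separates a reordering from a genuinely different result set.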

Segment by Query Type

Average metrics can hide important failures. Segment shadow results by query type.

Useful segments include:

Exact ID, SKU, citation, or error-code queries.
Short keyword phrases.
Long natural-language questions.
Broad conceptual searches.
Hybrid search queries with keyword and vector signals.
Filtered or permission-scoped queries.
RAG questions that need directly quotable evidence.

A new model may improve semantic questions while hurting exact-match behavior. Segmentation makes that trade-off visible before cutover.
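A rough way to segment is to tag each query with a heuristic label and then aggregate a metric per label. The rules below are illustrative assumptions only (the ID pattern, token-count cutoffs, and segment names are invented for this sketch); a real system would more likely rely on logged query intent or on labeled examples.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical pattern: optional short letter prefix, optional separator, 3+ digits.
ID_PATTERN = re.compile(r"^[A-Za-z]{0,5}[-_]?\d{3,}$")


def classify_query(query_text, has_filters=False):
    """Assign a coarse segment label to a query using simple heuristics."""
    text = query_text.strip()
    tokens = text.split()
    if has_filters:
        return "filtered"
    if len(tokens) == 1 and ID_PATTERN.match(text):
        return "exact_id"
    if len(tokens) <= 3:
        return "short_keyword"
    if text.endswith("?") or len(tokens) >= 8:
        return "long_question"
    return "conceptual"


def mean_by_segment(rows):
    """rows: iterable of (segment, metric_value) pairs. Returns mean per segment."""
    buckets = defaultdict(list)
    for segment, value in rows:
        buckets[segment].append(value)
    return {segment: mean(values) for segment, values in buckets.items()}


per_segment = mean_by_segment([
    ("exact_id", 0.25),
    ("exact_id", 0.75),
    ("long_question", 1.0),
])
# per_segment == {"exact_id": 0.5, "long_question": 1.0}
```

The overall mean of these three values is about 0.67, which would hide the fact that the exact-ID segment sits at 0.5.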

Measure Latency Separately

Do not evaluate only relevance. A candidate model may require larger vectors, slower query paths, heavier reranking, or more expensive embedding calls.

Track p50, p95, and p99 latency for production and candidate retrieval. Also track timeout rate, embedding generation time, candidate index query time, and reranking time if reranking is part of the test.

Shadow queries should not overload production. If candidate testing adds noticeable load, sample the traffic to reduce it, and run the shadow query asynchronously so that it stays off the user's request path.
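The sketch below shows one way to summarize latency and to sample shadow traffic deterministically. It uses the nearest-rank percentile definition, which is only one of several conventions, and the function names are illustrative. Sampling hashes a query or request ID so that the same ID always gets the same decision.

```python
import hashlib
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no latency samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms, timeout_count):
    """p50/p95/p99 over completed queries, plus timeout rate over all queries."""
    total = len(latencies_ms) + timeout_count
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "timeout_rate": timeout_count / total,
    }


def should_shadow(request_id, sample_rate):
    """Deterministically select roughly sample_rate of requests for shadowing."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < sample_rate
```

Timed-out queries are excluded from the percentiles here and reported separately as a rate; including them at the timeout value is an equally defensible choice as long as it is applied to both systems.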

Evaluate RAG Context Changes

For RAG systems, shadow testing should compare retrieved context, not just document IDs. The candidate model may retrieve different chunks from the same source document or different evidence altogether.

For sampled RAG queries, review:

Whether the candidate context contains enough evidence to answer.
Whether citations point to useful source passages.
Whether the candidate retrieves broader but less grounded context.
Whether irrelevant chunks crowd out the answer-bearing chunk.
Whether the answer would change if the candidate context were used.

A candidate embedding model should not be promoted only because it retrieves semantically interesting context. It must retrieve context that supports correct answers.
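For sampled queries where a reviewer has marked the answer-bearing chunk, part of this comparison can be automated. The sketch below assumes each retrieved chunk is a (doc_id, chunk_id) pair in rank order and that answer-bearing chunk IDs are available from labeling; both are assumptions about the logging format, and the function name is hypothetical.

```python
def compare_contexts(production_chunks, candidate_chunks, answer_chunk_ids):
    """Compare two retrieved contexts at chunk level.

    Reports the rank of the first answer-bearing chunk in each context
    (None if absent), candidate chunks drawn from documents production
    also used but at different passages, and candidate chunks from
    documents production did not retrieve at all.
    """
    answer_ids = set(answer_chunk_ids)

    def first_answer_rank(chunks):
        for rank, (_doc_id, chunk_id) in enumerate(chunks, start=1):
            if chunk_id in answer_ids:
                return rank
        return None

    production_docs = {doc_id for doc_id, _ in production_chunks}
    production_chunk_ids = {chunk_id for _, chunk_id in production_chunks}
    return {
        "production_answer_rank": first_answer_rank(production_chunks),
        "candidate_answer_rank": first_answer_rank(candidate_chunks),
        "same_doc_new_chunk": [
            chunk_id for doc_id, chunk_id in candidate_chunks
            if doc_id in production_docs and chunk_id not in production_chunk_ids
        ],
        "new_doc_chunks": [
            chunk_id for doc_id, chunk_id in candidate_chunks
            if doc_id not in production_docs
        ],
    }


diff = compare_contexts(
    production_chunks=[("doc1", "doc1#0"), ("doc2", "doc2#3")],
    candidate_chunks=[("doc3", "doc3#1"), ("doc1", "doc1#2"), ("doc2", "doc2#3")],
    answer_chunk_ids={"doc2#3"},
)
```

In this example the candidate still retrieves the answer-bearing chunk, but at rank 3 instead of rank 2, behind a chunk from a new document. That is the crowding-out pattern a document-level comparison would miss.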

Use Human Review on Important Diffs

Automated metrics help, but human review is still useful for high-impact queries. Sample cases where the candidate and production results differ sharply.

Reviewers should mark whether the candidate is better, worse, equivalent, or unsafe. They should also note the failure cause: missing exact term, wrong tenant, stale document, too-broad semantic match, poor chunk, weak metadata, or filter issue.

This failure analysis tells you what to fix before promotion.
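A small amount of tooling can support this review loop: pick sharply differing queries for reviewers, then tally their verdicts and failure causes. In the sketch below, the record key overlap_at_5, the 0.4 cutoff, and the review dictionary shape are all assumptions made for illustration.

```python
import random
from collections import Counter

VERDICTS = {"better", "worse", "equivalent", "unsafe"}


def sample_for_review(records, max_overlap=0.4, sample_size=50, seed=0):
    """Randomly sample queries whose top-5 overlap is at or below max_overlap."""
    sharp_diffs = [r for r in records if r["overlap_at_5"] <= max_overlap]
    rng = random.Random(seed)  # fixed seed keeps the review set reproducible
    return rng.sample(sharp_diffs, min(sample_size, len(sharp_diffs)))


def tally_reviews(reviews):
    """Count verdicts and failure causes from reviewer annotations."""
    verdicts, causes = Counter(), Counter()
    for review in reviews:
        if review["verdict"] not in VERDICTS:
            raise ValueError(f"unknown verdict: {review['verdict']!r}")
        verdicts[review["verdict"]] += 1
        if review.get("cause"):
            causes[review["cause"]] += 1
    return verdicts, causes


verdicts, causes = tally_reviews([
    {"verdict": "worse", "cause": "missing exact term"},
    {"verdict": "worse", "cause": "missing exact term"},
    {"verdict": "better"},
    {"verdict": "unsafe", "cause": "wrong tenant"},
])
# causes.most_common(1) == [("missing exact term", 2)]
```

Sorting causes by frequency gives a rough priority list for fixes, though a single unsafe verdict may outweigh many ordinary regressions.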

Promotion Gates

Define promotion criteria before the shadow test begins. Otherwise, teams may cherry-pick wins and ignore regressions.

Promotion gates can include:

Candidate relevance improves or holds steady on labeled evaluation queries.
No critical query segment regresses beyond an agreed threshold.
Exact-match and permission-scoped queries remain safe.
RAG grounding is equal or better on sampled reviews.
Latency and timeout rates remain within limits.
Backfill completeness and metadata consistency are verified.
Rollback path has been tested.

The candidate model should pass both offline evaluation and shadow testing before it is promoted to production.
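Writing the gates down as code before the test starts makes them harder to renegotiate afterward. The sketch below is one possible encoding; every threshold value and metric key is a placeholder assumption to be replaced with numbers the team has agreed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Placeholder values for illustration only.
    max_segment_regression: float = 0.02
    max_p95_latency_ms: float = 250.0
    max_timeout_rate: float = 0.01
    min_backfill_completeness: float = 0.999


def evaluate_gates(metrics, thresholds=GateThresholds()):
    """Return a list of failed gates; an empty list means all gates passed."""
    failures = []
    # Delta is candidate relevance minus production relevance, per segment.
    for segment, delta in sorted(metrics["segment_relevance_delta"].items()):
        if delta < -thresholds.max_segment_regression:
            failures.append(f"segment '{segment}' regressed by {-delta:.3f}")
    if metrics["candidate_p95_ms"] > thresholds.max_p95_latency_ms:
        failures.append("p95 latency above limit")
    if metrics["candidate_timeout_rate"] > thresholds.max_timeout_rate:
        failures.append("timeout rate above limit")
    if metrics["backfill_completeness"] < thresholds.min_backfill_completeness:
        failures.append("backfill incomplete")
    if metrics["unsafe_review_count"] > 0:
        failures.append("reviewers flagged unsafe results")
    if not metrics["rollback_tested"]:
        failures.append("rollback path not tested")
    return failures


failures = evaluate_gates({
    "segment_relevance_delta": {"exact_id": -0.10, "long_question": 0.04},
    "candidate_p95_ms": 180.0,
    "candidate_timeout_rate": 0.002,
    "backfill_completeness": 1.0,
    "unsafe_review_count": 0,
    "rollback_tested": False,
})
# failures == ["segment 'exact_id' regressed by 0.100", "rollback path not tested"]
```

Here the candidate improves long questions but fails on exact-ID queries and on the untested rollback path, so it would not be promoted even though its latency is within limits.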

Common Mistakes

The first mistake is letting shadow results affect users. Shadow testing should observe, not decide.

The second mistake is shadowing an incomplete candidate index and interpreting missing results as model behavior.

The third mistake is comparing only average relevance. Query-type regressions matter.

The fourth mistake is ignoring latency. A better model that cannot meet production response times may not be deployable.

The fifth mistake is skipping RAG-specific review. Retrieval changes can alter answer faithfulness even when search metrics look acceptable.

Practical Summary

To shadow test a new embedding model, keep the current index serving users, mirror real queries to the candidate index, log both result sets, compare relevance and latency, review important differences, and define promotion gates before cutover.

Shadow testing is not a replacement for offline evaluation. It is the bridge between controlled benchmarks and production rollout. It shows how the new embedding model behaves under real traffic while rollback is still simple.