How to Interpret Vector Search Scores

Interpret vector search scores by first checking what kind of score the system returned. A distance score usually means lower is better. A similarity score usually means higher is better. The numeric meaning depends on the distance metric, embedding model, normalization, and search pipeline.

A vector score is not a universal confidence percentage. It is a measurement produced by a specific metric in a specific vector space.

Short Answer

To interpret a vector search score, ask four questions:

  1. Is the score a distance or a similarity?
  2. Which metric produced it?
  3. What range or scale does that metric use?
  4. How does the score behave on real examples from your data?

Do not compare scores across different models, metrics, indexes, or datasets unless you have calibrated them.

Distance vs Similarity

The first distinction is distance versus similarity.

  • Distance: lower means closer.
  • Similarity: higher means closer.

If a result has a smaller distance than another result, it is closer under that metric. If a result has a larger similarity score than another result, it is closer under that score.

Many mistakes happen because teams sort or threshold scores in the wrong direction.
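
One way to avoid sorting in the wrong direction is to carry the score type alongside the score and let a single helper choose the sort order. The sketch below is a minimal illustration; the function name `rank_results` and the labels "distance" and "similarity" are assumptions made for this example, not names from any particular vector database.

```python
from typing import List, Tuple

# Hypothetical labels for this sketch; real systems name score types differently.
LOWER_IS_BETTER = {"distance"}
HIGHER_IS_BETTER = {"similarity"}


def rank_results(results: List[Tuple[str, float]], score_type: str) -> List[Tuple[str, float]]:
    """Return (id, score) pairs ordered best-first for the given score type."""
    if score_type in LOWER_IS_BETTER:
        # Distances: smallest value is the closest match.
        return sorted(results, key=lambda r: r[1])
    if score_type in HIGHER_IS_BETTER:
        # Similarities: largest value is the closest match.
        return sorted(results, key=lambda r: r[1], reverse=True)
    # Refuse to guess a direction for an unlabeled score.
    raise ValueError(f"Unknown score type: {score_type!r}")


hits = [("doc_a", 0.42), ("doc_b", 0.17), ("doc_c", 0.80)]
by_distance = rank_results(hits, "distance")      # doc_b comes first
by_similarity = rank_results(hits, "similarity")  # doc_c comes first
```

Raising an error for an unknown score type is a deliberate choice here: an unlabeled score is exactly the case where a silent default direction causes trouble.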

Distance Is Not Confidence

A distance score is not the same as confidence.

For example, a low distance means two vectors are close. It does not automatically mean the answer is correct, the source is trustworthy, or the retrieved chunk contains enough information for a RAG answer.

Vector scores measure closeness in embedding space, not truth.

Similarity Is Not Always Probability

A similarity score is also not necessarily a probability.

A score of 0.82 does not always mean “82% likely to be relevant.” It may only mean that, under one metric and one model, this result is closer than many other results.

Only treat scores as probabilities if the system explicitly calibrates them that way.

Cosine Similarity and Cosine Distance

Cosine similarity compares vector direction. Higher cosine similarity means the vectors point in a more similar direction.

Some systems expose cosine distance instead:

cosine_distance = 1 - cosine_similarity

With cosine distance, lower is better.

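A cosine distance of 0 means identical direction, and a larger distance means a less similar direction. Because cosine similarity ranges from -1 to 1, cosine distance defined this way ranges from 0 to 2: a value of 1 means the vectors are orthogonal, and a value of 2 means they point in opposite directions. Implementations can differ, so confirm the definition and range your system uses.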
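
The relationship between cosine similarity and cosine distance can be checked with a few small vectors. This is a plain-Python sketch of the formula above, not the implementation used by any specific system; the helper names are chosen for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b; magnitude is divided out."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance form used in this document: 1 - cosine_similarity."""
    return 1.0 - cosine_similarity(a, b)


# Same direction, different length: distance is approximately 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
# Opposite direction: distance is 2.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])
```

Note that the first pair differs in length but not in direction, which is why cosine distance treats the two vectors as essentially identical.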

L2 and Squared L2 Scores

L2 distance measures Euclidean coordinate distance. Squared L2 omits the final square root:

squared_L2(a, b) = sum((a_i - b_i)^2)

For both L2 and squared L2, lower is closer.

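The scale is different, though. A squared L2 score of 25 corresponds to an L2 distance of 5. The two metrics rank results in the same order, because the square root preserves ordering, but their thresholds are not interchangeable. Do not reuse ordinary L2 thresholds for squared L2 scores.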
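
The scale difference is easy to see in code. The sketch below implements both formulas in plain Python and shows what happens when a threshold tuned for L2 is applied to squared L2 without conversion. The threshold value 6.0 is an arbitrary number picked for illustration.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: square root of squared L2."""
    return math.sqrt(squared_l2(a, b))


a = [0.0, 0.0]
b = [3.0, 4.0]
sq = squared_l2(a, b)  # 25.0
dist = l2(a, b)        # 5.0

# Illustrative threshold, assumed to have been tuned on L2 distances.
l2_threshold = 6.0
passes_l2 = dist <= l2_threshold                    # True
passes_squared_unconverted = sq <= l2_threshold     # False: wrong scale
passes_squared_converted = sq <= l2_threshold ** 2  # True: threshold squared first
```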

Dot Product Scores

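Dot product measures both alignment and magnitude, so a vector can score higher either because it points in a more similar direction or because it is longer. For unit-normalized vectors, dot product equals cosine similarity.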

Raw dot product is usually a similarity-style score: higher dot product means stronger alignment. Some vector systems convert dot product into a distance-style value by returning the negative dot product. In that case, lower values are closer.

Always check whether your system exposes raw dot product, negative dot product, or another transformed value.
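
The effect of magnitude, and of the negative dot product convention, can be seen with two toy documents. This is a plain-Python sketch; the vectors and variable names are invented for the example.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


query = [1.0, 0.0]
aligned_doc = [1.0, 0.0]  # same direction as the query, small magnitude
long_doc = [3.0, 3.0]     # 45 degrees off the query, larger magnitude

# Raw dot product (similarity style, higher is closer): magnitude lets long_doc win.
raw_aligned = dot(query, aligned_doc)  # 1.0
raw_long = dot(query, long_doc)        # 3.0

# Negative dot product (distance style, lower is closer): same ranking, flipped sign.
neg_aligned = -raw_aligned
neg_long = -raw_long

# After unit normalization the dot product equals cosine similarity,
# and the aligned document ranks first.
unit_aligned = dot(normalize(query), normalize(aligned_doc))
unit_long = dot(normalize(query), normalize(long_doc))
```

Whether magnitude should influence ranking depends on how the embedding model was trained, so this sketch only shows the behavior; it does not recommend one form over the other.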

Hybrid Search Scores

Hybrid search scores may combine vector similarity with keyword search scores, such as BM25.

These scores are often relative to the query, dataset, and fusion method. A hybrid score from one query may not mean the same thing as the same number from another query.

If the system exposes explanations, use them to see how much came from vector search and how much came from keyword scoring.
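
To see why hybrid scores are relative, it helps to look at one concrete fusion method. The sketch below uses reciprocal rank fusion (RRF), which is only one of several common approaches; your system may use a different method, such as a weighted sum of normalized scores. The constant k=60 is a commonly used value and is treated here as an assumption, and the document ids are invented.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse best-first rankings: each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Fused scores are similarity style: higher is better.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ranking = ["doc_a", "doc_b", "doc_c"]   # best-first from vector search
keyword_ranking = ["doc_b", "doc_d", "doc_a"]  # best-first from keyword search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

With this method the fused score depends only on rank positions, the number of input lists, and k. The original cosine or BM25 values do not appear in it at all, which is one reason a fused score should not be read on the scale of either input.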

Reranker Scores

Reranker scores are different from raw vector scores.

A reranker typically reads the query and each candidate's text together, rather than comparing precomputed vectors, and then assigns its own score. That score should be interpreted according to the reranker model, not the vector distance metric.

Do not mix vector distances and reranker scores as if they are on the same scale.

Thresholds Must Be Calibrated

Thresholds should be chosen from real score distributions.

For example:

return result only if distance <= 0.35

This threshold may work for one embedding model and metric, but fail for another.

Build a small validation set with known good and bad matches. Inspect the score ranges. Choose thresholds that separate useful results from weak results for your application.
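
The validation step can be sketched as a small search over candidate thresholds. The example below assumes distance scores (lower is closer) and picks the threshold with the best F1 score. F1 is only one possible objective, chosen here for illustration; an application that cares more about precision or recall should optimize for that instead. The labeled data is invented.

```python
from typing import List, Tuple


def calibrate_distance_threshold(labeled: List[Tuple[float, bool]]) -> Tuple[float, float]:
    """Given (distance, is_good_match) pairs, return (threshold, f1) with the best F1.

    A result is accepted when distance <= threshold.
    """
    if not labeled:
        raise ValueError("need at least one labeled example")
    total_good = sum(1 for _, good in labeled if good)
    best_threshold, best_f1 = labeled[0][0], -1.0
    # Only observed distances need to be tried as candidate thresholds.
    for t in sorted({d for d, _ in labeled}):
        tp = sum(1 for d, good in labeled if d <= t and good)
        fp = sum(1 for d, good in labeled if d <= t and not good)
        fn = total_good - tp
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Hypothetical validation set: (distance, was this a useful match?)
validation = [
    (0.12, True), (0.20, True), (0.31, True), (0.38, False),
    (0.45, True), (0.52, False), (0.70, False),
]
threshold, f1 = calibrate_distance_threshold(validation)
```

On this toy data the good and bad matches overlap between 0.38 and 0.45, so no threshold separates them perfectly. Real score distributions usually overlap in the same way, which is why the threshold has to be chosen from data.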

Scores Depend on the Embedding Model

The same text can produce different score ranges under different embedding models.

Changing the model can change:

  • vector dimensions
  • vector normalization
  • score distributions
  • nearest-neighbor ordering
  • threshold behavior

After changing models, recalibrate score thresholds and rerun retrieval tests.

Scores Depend on the Metric

A cosine distance score cannot be compared directly with an L2 score. A dot product score cannot be compared directly with a hybrid score. A reranker score cannot be compared directly with a raw vector distance.

Each score has meaning only in the context of the metric or model that produced it.

Scores Are Often Relative

Sometimes the absolute score matters less than the gap between results.

For example, if the first result has distance 0.21 and the second has 0.22, the top result may not be clearly better. If the first result has distance 0.21 and the second has 0.70, the top result is much more clearly separated from the alternatives.

Score gaps can help decide whether to trust the top result, return multiple results, or ask for clarification.
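
A gap check can be written in a few lines. The sketch below assumes distance scores and a hypothetical minimum gap of 0.10; that value is a placeholder and would need the same calibration as any other threshold.

```python
from typing import Optional, Sequence


def top_gap(distances: Sequence[float]) -> Optional[float]:
    """Gap between the best and second-best distance, or None with fewer than two results."""
    if len(distances) < 2:
        return None
    ordered = sorted(distances)  # ascending: lower distance is closer
    return ordered[1] - ordered[0]


def top_is_distinct(distances: Sequence[float], min_gap: float = 0.10) -> bool:
    """True when the best result beats the runner-up by at least min_gap."""
    gap = top_gap(distances)
    return gap is not None and gap >= min_gap


close_call = top_is_distinct([0.21, 0.22])    # False: the top two are nearly tied
clear_winner = top_is_distinct([0.21, 0.70])  # True: the top result stands apart
```

A caller might return only the top result when the check passes, and return several results or ask a clarifying question when it does not.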

Low Distance Does Not Guarantee Good RAG Context

In RAG systems, a close vector match can still be a poor context chunk.

Possible reasons include:

  • the chunk is too short
  • the chunk lacks the answer
  • the chunk is semantically related but not specific enough
  • the source is outdated
  • metadata filters allowed the wrong document type
  • the query is ambiguous

Score interpretation should be paired with content inspection, metadata checks, and answer quality evaluation.

How to Debug Scores

When scores look confusing, check:

  1. Whether the value is distance, similarity, hybrid score, or reranker score.
  2. Which distance metric is configured.
  3. Whether vectors are normalized.
  4. Whether thresholds were copied from a different model or metric.
  5. Whether filters changed the candidate pool.
  6. Whether approximate search settings affected recall.
  7. Whether a reranker changed final order after vector retrieval.
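
Some of these checks can be automated. As one example, the normalization check in item 3 can be run directly on a sample of stored vectors. This is a minimal sketch; the tolerance of 1e-3 is an assumption, and how you fetch vectors from your index is left out.

```python
import math
from typing import Iterable, Sequence


def vector_norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def is_unit_normalized(vectors: Iterable[Sequence[float]], tol: float = 1e-3) -> bool:
    """True when every vector's length is within tol of 1.0."""
    return all(abs(vector_norm(v) - 1.0) <= tol for v in vectors)


normalized_sample = [[0.6, 0.8], [1.0, 0.0]]
raw_sample = [[3.0, 4.0], [0.6, 0.8]]

sample_ok = is_unit_normalized(normalized_sample)  # True
sample_mixed = is_unit_normalized(raw_sample)      # False: [3.0, 4.0] has length 5
```

If the check fails, dot product and cosine scores will rank results differently, and thresholds tuned on normalized vectors no longer apply.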

Practical Rules

Use these rules in production:

  • Label scores by type, such as cosine distance or reranker score.
  • Sort distance scores ascending.
  • Sort similarity scores descending.
  • Calibrate thresholds per model and metric.
  • Do not treat vector scores as universal confidence.
  • Use evaluation data instead of guessing thresholds.
  • Recalibrate after changing embeddings, chunking, filters, or index settings.

Common Mistakes

Common mistakes include:

  • treating distance as similarity
  • treating similarity as probability
  • comparing cosine scores with L2 scores
  • using one threshold across all embedding models
  • assuming a low distance means the answer is correct
  • ignoring score gaps between results
  • mixing vector scores, hybrid scores, and reranker scores

Summary

Vector search scores are metric-specific measurements of closeness in embedding space. Distance scores usually mean lower is closer. Similarity scores usually mean higher is closer.

Interpret scores only after you know the metric, model, direction, and scale. For production search and RAG, calibrate thresholds on real examples and avoid treating raw vector scores as universal confidence values.