How Vector Compression Affects Recall and Latency

Vector compression changes both search quality and search speed. It can reduce memory, improve cache behavior, and increase throughput, but it can also distort distances and reduce recall.

The important trade-off is not simply compressed versus uncompressed. The real question is how much compression changes candidate selection, whether the system rescores results, and what latency target the application must meet.

Short Answer

Vector compression can improve latency by making candidate scoring cheaper, but it can reduce recall because compressed vectors contain less information than full vectors.

Over-fetching and rescoring can recover much of the lost recall, but they add extra work and can raise p95 or p99 latency.

What Recall Means

Recall measures the fraction of the true nearest neighbors that the search system returns.

If the exact top 10 neighbors contain documents A through J, and the compressed search returns 8 of those 10, recall@10 is 80%.
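
The same calculation can be written as a short function. This is a minimal sketch: the name recall_at_k and the placeholder IDs X and Y are illustrative, not taken from any particular library.

```python
def recall_at_k(exact_ids, returned_ids, k):
    """Fraction of the exact top-k neighbors that appear in the returned top-k."""
    exact_top = set(exact_ids[:k])
    returned_top = set(returned_ids[:k])
    return len(exact_top & returned_top) / k

exact = list("ABCDEFGHIJ")                 # exact top 10: documents A through J
returned = list("ABCDEFGH") + ["X", "Y"]   # compressed search found 8 of them

print(recall_at_k(exact, returned, 10))    # 0.8
```

The order of results inside the top k does not matter for this metric, only membership.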

Compression affects recall when the compressed representation changes which vectors appear closest to the query.

What Latency Means

Latency is how long a query takes to return results.

For production systems, average latency is not enough. You usually need p95 and p99 latency because users notice slow tail queries.

Compression can improve average latency while still creating tail-latency problems if rescoring or disk reads are uneven.
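
A small sketch shows how an average can hide a slow tail. The latency values are made up for illustration, and the nearest-rank percentile used here is only one of several common percentile definitions.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling of p * n / 100
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 97 fast queries and 3 slow ones.
latencies_ms = [5.0] * 97 + [80.0] * 3

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

print(mean_ms, p50, p95, p99)  # 7.25 5.0 5.0 80.0
```

In this made-up sample the mean, p50, and p95 all look healthy, and only p99 exposes the slow queries.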

Why Compression Can Lower Recall

Compressed vectors are approximations.

Quantization methods replace full-precision values with lower-precision values, bits, centroids, or compact codes. This removes some numeric detail from the original vector.

When distances are computed from compressed data, some true neighbors may look farther away than they really are, and some more distant vectors may look closer.

Quantization Error

Quantization error is the difference between the original vector and its compressed approximation.

Small error may barely affect nearest-neighbor ranking. Large error can change candidate order or remove relevant vectors from the candidate set.

The more aggressive the compression, the more carefully recall must be measured.
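
The effect of error size on ranking can be seen in a toy example. This sketch assumes a deliberately simple quantizer that snaps each value to a grid; the vectors, step sizes, and function names are hypothetical and chosen only to make the flip visible.

```python
import math

def quantize(vector, step):
    """Snap each value to the nearest multiple of `step`."""
    return [round(x / step) * step for x in vector]

def ranking(query, vectors, step):
    """Rank ids by distance from the raw query to the quantized vectors."""
    return sorted(vectors, key=lambda key: math.dist(query, quantize(vectors[key], step)))

query = [1.4, 0.0]
vectors = {"a": [1.55, 0.0], "b": [1.10, 0.0]}  # a is the true nearest neighbor

error_a = math.dist(vectors["a"], quantize(vectors["a"], 1.0))  # about 0.45
error_b = math.dist(vectors["b"], quantize(vectors["b"], 1.0))  # about 0.10

print(ranking(query, vectors, 0.1))  # ['a', 'b']: small error, order preserved
print(ranking(query, vectors, 1.0))  # ['b', 'a']: large error, order flipped
```

With the coarse step, vector a moves much farther from its original position than vector b does, which is enough to swap their order for this query.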

Why Compression Can Improve Latency

Compression can improve latency because smaller vector representations are cheaper to move and compare.

Compressed vectors can fit better in CPU cache, reduce memory bandwidth pressure, and reduce disk reads in disk-backed or flat search designs.

Some compressed representations also support faster distance approximations than full 32-bit float vectors.

Why Compression Can Increase Latency

Compression can also increase latency when the system does extra work to protect recall.

Common extra work includes:

over-fetching more candidates than the final result count
fetching uncompressed vectors for candidate rescoring
increasing graph search breadth
probing more clusters or posting lists
scanning larger candidate buckets

These steps may be necessary to maintain quality, but they are not free.

Over-Fetching

Over-fetching means retrieving more candidates than the query will return.

For example, if the user asks for 10 results, the system might retrieve 100 or 200 compressed candidates first.

The larger candidate list gives true neighbors a better chance to survive the approximate compressed search stage.

Rescoring

Rescoring recomputes distances for the candidate list using full-precision vectors or a more accurate representation.

The compressed search finds likely candidates. Rescoring improves final ranking.

Rescoring can greatly improve recall, but only for candidates that were included in the over-fetched list.

The Candidate Loss Problem

Rescoring cannot fix candidates that were discarded too early.

If compression distortion causes a true neighbor to never enter the candidate pool, final rescoring never sees it.

This is why recall depends on both compression quality and candidate expansion strategy.
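
Over-fetching, rescoring, and candidate loss can be sketched together as a two-stage search. The data, the rounding-based compression, and the names search and pool_size are assumptions made for illustration; a real system would use an index rather than a full scan.

```python
import math

# Hypothetical full-precision vectors and a crude compressed copy (rounded values).
full = {
    "a": [1.55, 0.0],
    "b": [1.10, 0.0],
    "c": [0.80, 0.0],
    "d": [3.00, 0.0],
}
compressed = {key: [float(round(x)) for x in vec] for key, vec in full.items()}

def search(query, k, pool_size):
    # Stage 1: rank by distance to the compressed vectors and keep pool_size candidates.
    pool = sorted(compressed, key=lambda key: math.dist(query, compressed[key]))[:pool_size]
    # Stage 2: rescore only those candidates with the full-precision vectors.
    return sorted(pool, key=lambda key: math.dist(query, full[key]))[:k]

query = [1.4, 0.0]  # the true nearest neighbor is "a"

print(search(query, k=1, pool_size=2))  # ['b']: "a" never entered the pool
print(search(query, k=1, pool_size=3))  # ['a']: a larger pool lets rescoring recover it
```

Rescoring is exact in both calls. The only difference is whether the compressed stage kept the true neighbor in the pool.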

Compression and p95 Latency

p95 latency can improve when compressed vectors reduce the cost of most queries.

It can worsen when a minority of queries need much larger candidate pools, more disk reads, or more rescoring work.

Measure p95 with production-like query filters and realistic concurrency.

Compression and p99 Latency

p99 latency is especially sensitive to storage and rescoring behavior.

If full vectors are fetched from disk during rescoring, slow disk reads can appear in the tail.

Compression benchmarks should report p99 latency, not only QPS or mean latency.

How Product Quantization Affects Recall and Latency

Product quantization can reduce memory significantly by storing segment codes instead of full float vectors.

Recall depends on segment count, codebook quality, candidate expansion, and rescoring. Latency depends on compressed scoring cost, candidate bucket size, and full-vector fetch cost.

PQ is powerful but configuration-sensitive.
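
To make the idea of segment codes concrete, here is a toy encoder and decoder. The codebooks are hand-picked for the example, which is an assumption: real systems learn centroids from data, commonly with k-means, and use far more dimensions and centroids.

```python
import math

SEGMENT_DIMS = 2

# Hypothetical codebooks: 2 segments, 4 centroids per segment, 2 dims per centroid.
codebooks = [
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
]

def pq_encode(vector):
    """Replace each segment with the index of its nearest centroid."""
    codes = []
    for s, centroids in enumerate(codebooks):
        segment = vector[s * SEGMENT_DIMS:(s + 1) * SEGMENT_DIMS]
        codes.append(min(range(len(centroids)),
                         key=lambda c: math.dist(segment, centroids[c])))
    return codes

def pq_decode(codes):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    return [x for s, c in enumerate(codes) for x in codebooks[s][c]]

vector = [0.9, 0.1, 0.2, 0.8]
codes = pq_encode(vector)
approx = pq_decode(codes)

print(codes)   # [2, 1]
print(approx)  # [1.0, 0.0, 0.0, 1.0]
```

Only the two small codes are stored per vector, and the gap between vector and approx is the quantization error that candidate expansion and rescoring have to absorb.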

How Scalar Quantization Affects Recall and Latency

Scalar quantization reduces the precision of each dimension, often from 32-bit floats to smaller integer values.

It can offer a strong balance of speed and memory because every dimension is kept and only its precision is reduced.

Recall depends on how well the lower-precision values preserve useful distance relationships.
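
A minimal sketch of 8-bit scalar quantization follows. It assumes a single fixed value range for all dimensions, passed in as lo and hi; real implementations may choose ranges per dimension or from data statistics, and the function names here are illustrative.

```python
def sq_encode(vector, lo, hi):
    """Map each float in [lo, hi] to an integer code in 0..255."""
    scale = 255 / (hi - lo)
    return [min(255, max(0, round((x - lo) * scale))) for x in vector]

def sq_decode(codes, lo, hi):
    """Map integer codes back to approximate float values."""
    step = (hi - lo) / 255
    return [lo + c * step for c in codes]

vector = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes = sq_encode(vector, lo=-1.0, hi=1.0)
restored = sq_decode(codes, lo=-1.0, hi=1.0)

# The worst-case error per dimension is half a step, here (2 / 255) / 2.
max_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, max_error)
```

Each dimension shrinks from 4 bytes to 1 byte, and values outside the assumed range are clipped to the nearest end of it.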

How Rotational Quantization Affects Recall and Latency

Rotational quantization rotates vectors before quantizing them.

The rotation can spread information more evenly across dimensions, reducing the quality loss from lower precision.

This can produce strong recall with faster compressed distance calculations, especially when the compression ratio is moderate.
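
The spreading effect can be illustrated with one specific orthogonal transform. This sketch uses a normalized Walsh-Hadamard transform as a stand-in, which is an assumption: the text does not say which rotation a given system uses, and the function name is hypothetical.

```python
import math

def hadamard_rotate(vector):
    """Normalized Walsh-Hadamard transform; the length must be a power of two."""
    out = list(vector)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                x, y = out[i], out[i + h]
                out[i], out[i + h] = x + y, x - y
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

spiky = [8.0, 0.0, 0.0, 0.0]       # all of the energy sits in one dimension
rotated = hadamard_rotate(spiky)

print(rotated)                                 # [4.0, 4.0, 4.0, 4.0]
print(math.sqrt(sum(v * v for v in rotated)))  # 8.0, the same length as before
```

The vector length is unchanged, but after the transform every dimension carries a similar magnitude, so a low-precision code per dimension wastes less of its range.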

How Binary Quantization Affects Recall and Latency

Binary quantization can make distance calculations very fast and reduce memory aggressively.

The trade-off is that one-bit representations may lose more detail than 8-bit or segment-based methods.

BQ can be excellent for some data and weaker for others, so recall must be validated against the actual embedding model.
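
A minimal sketch of the idea, assuming the common sign-based scheme (one bit per dimension, set when the value is positive) with Hamming distance as the comparison. The vectors and function names are illustrative.

```python
def to_bits(vector):
    """Pack one bit per dimension into an int: 1 if the value is positive, else 0."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits

def hamming(a, b):
    """Number of bit positions where the two codes differ."""
    return bin(a ^ b).count("1")

query = [0.9, -0.2, 0.4, -0.7]
near = [0.8, -0.1, 0.5, -0.6]    # same sign pattern as the query
far = [-0.9, 0.3, 0.4, -0.7]     # two signs flipped

print(hamming(to_bits(query), to_bits(near)))  # 0
print(hamming(to_bits(query), to_bits(far)))   # 2
```

The comparison reduces to an XOR and a bit count, which is why it is fast. The cost is visible too: query and near get identical codes even though the vectors differ, so all magnitude information is gone.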

Why Compression Sometimes Improves Both Recall and Speed

Compression can sometimes improve the effective recall-latency trade-off.

If compressed distance estimates are fast enough, the system may search a wider candidate set for the same latency budget. That wider search can offset the small quality loss from compression.

This is why compression should be evaluated on the recall-throughput curve, not as a single setting.
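
The budget argument is simple arithmetic. The sketch below uses made-up per-candidate costs purely to show the shape of the calculation; real costs depend on hardware, dimensionality, and the compression method.

```python
def pool_size_within_budget(budget_us, cost_per_candidate_us,
                            rescore_count=0, rescore_cost_us=0.0):
    """How many candidates fit in the budget after reserving time for rescoring."""
    remaining = budget_us - rescore_count * rescore_cost_us
    return max(0, int(remaining // cost_per_candidate_us))

BUDGET_US = 1000.0          # hypothetical scoring budget per query, in microseconds
FULL_COST_US = 0.5          # hypothetical cost of one full-precision distance
COMPRESSED_COST_US = 0.125  # hypothetical cost of one compressed distance

full_pool = pool_size_within_budget(BUDGET_US, FULL_COST_US)
compressed_pool = pool_size_within_budget(
    BUDGET_US, COMPRESSED_COST_US, rescore_count=200, rescore_cost_us=FULL_COST_US
)

print(full_pool, compressed_pool)  # 2000 7200
```

Under these assumed costs, the compressed search can examine more than three times as many candidates in the same budget, even after paying to rescore 200 of them at full precision.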

Recall vs Throughput Curves

A good benchmark plots recall against throughput or latency.

Each compression method should be tested at multiple search settings, such as graph breadth, probe count, candidate count, or rescoring limit.

The best method is the one that meets the recall target at the lowest latency or cost.
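
That selection rule can be sketched as a filter followed by a minimum. The sweep results below are placeholders invented for the example, not measurements, and the method names are deliberately generic.

```python
# Hypothetical sweep results. The numbers are placeholders, not measurements.
runs = [
    {"method": "method_a", "candidates": 50, "recall": 0.91, "p95_ms": 4.0},
    {"method": "method_a", "candidates": 200, "recall": 0.96, "p95_ms": 7.0},
    {"method": "method_b", "candidates": 50, "recall": 0.95, "p95_ms": 9.0},
    {"method": "method_b", "candidates": 200, "recall": 0.98, "p95_ms": 15.0},
]

def cheapest_run(runs, recall_target):
    """Lowest-p95 run among those that reach the recall target, or None."""
    eligible = [run for run in runs if run["recall"] >= recall_target]
    return min(eligible, key=lambda run: run["p95_ms"]) if eligible else None

best = cheapest_run(runs, recall_target=0.95)
print(best["method"], best["candidates"])  # method_a 200
```

Comparing at a fixed recall target keeps the comparison fair: a method that looks fastest at one setting may simply be operating at lower recall.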

Filters Change the Trade-Off

Metadata filters can make compression effects more visible.

If a filter leaves only a small eligible candidate set, compressed search may need to expand more aggressively to find enough valid results.

Always benchmark compressed search with real filters, not only unfiltered nearest-neighbor queries.
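
One way to picture the extra expansion is a loop that widens the candidate pool until enough results pass the filter. This is a sketch of one possible strategy, not a description of how any particular engine applies filters; the ranking and filter below are stand-ins.

```python
def filtered_search(ranked_ids, allowed, k, pool_size):
    """Double the candidate pool until k results pass the filter or candidates run out."""
    while True:
        pool = ranked_ids[:pool_size]
        hits = [item for item in pool if item in allowed]
        if len(hits) >= k or pool_size >= len(ranked_ids):
            return hits[:k], pool_size
        pool_size *= 2

ranked_ids = list(range(100))       # stand-in for the compressed-search ranking
allowed = set(range(0, 100, 10))    # hypothetical filter that 10% of items pass

results, final_pool = filtered_search(ranked_ids, allowed, k=3, pool_size=4)
print(results, final_pool)  # [0, 10, 20] 32
```

Here 32 candidates are examined to return 3 results. Benchmarks that skip filters never exercise this expansion cost.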

Memory Savings Still Matter

Even when compression slightly increases query work, it may still be worth using.

RAM is often more expensive than disk, and lower memory usage can allow smaller machines, more tenants, or larger indexes.

The production decision should consider recall, latency, throughput, and cost together.
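
A rough footprint estimate helps frame the cost side of that decision. The collection size and dimensionality below are hypothetical, and the calculation covers vector storage only.

```python
def vector_bytes(num_vectors, dims, bits_per_dim):
    """Raw storage for the vectors alone, in bytes."""
    return num_vectors * dims * bits_per_dim // 8

NUM_VECTORS = 10_000_000  # hypothetical collection size
DIMS = 768                # hypothetical embedding dimensionality

float32_bytes = vector_bytes(NUM_VECTORS, DIMS, 32)
int8_bytes = vector_bytes(NUM_VECTORS, DIMS, 8)
one_bit_bytes = vector_bytes(NUM_VECTORS, DIMS, 1)

print(float32_bytes / 1e9, int8_bytes / 1e9, one_bit_bytes / 1e9)  # 30.72 7.68 0.96
```

These figures exclude index structures, codebooks, and any full-precision copies kept for rescoring. If rescoring vectors stay in RAM, the saving is smaller than the raw ratio suggests.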

What to Benchmark

Benchmark compressed and uncompressed search using:

recall at the target k
p50, p95, and p99 latency
queries per second under concurrency
memory usage before and after compression
candidate pool size
rescoring limit
filtered-query behavior
disk reads during rescoring
quality after embedding or data drift

Common Mistakes

Common mistakes include:

choosing the highest compression ratio without recall tests
measuring only average latency
ignoring p99 latency from rescoring reads
benchmarking without production filters
comparing methods at different recall targets
assuming rescoring fixes every recall loss
ignoring memory cost when latency is similar

Summary

Vector compression affects recall by changing distance estimates and candidate selection. It affects latency by reducing vector size while sometimes adding over-fetching and rescoring work.

The best compression setting is not the smallest representation. It is the setting that meets recall targets at acceptable p95 and p99 latency, with a memory and cost profile that fits production needs.