Product Quantization Explained for Vector Search

Product quantization changes how a vector search system stores and compares embeddings. Instead of comparing every candidate with full-precision vectors, the system can use compact codes to perform a faster, lower-memory first pass.

For vector search, the important point is not only that product quantization compresses vectors. The important point is how that compressed representation affects candidate generation, distance estimates, rescoring, recall, and latency.

The Role of PQ in Vector Search

Product quantization is usually used to reduce the cost of searching large vector collections.

A vector database may still use an ANN index such as a graph or cluster-based structure, but the vectors inside that search process can be represented with compact product-quantized codes.

This lets the system keep more searchable data in memory and reduce the amount of full-precision vector comparison required per query.

Step 1: Split Vectors Into Segments

PQ starts by dividing each embedding into smaller segments.

For example, a 128-dimensional vector might be split into 32 segments, with 4 dimensions per segment. Each segment is compressed independently.

This matters for search because distances can later be estimated segment by segment instead of reconstructing and comparing every original vector value.

Step 2: Learn Codebooks

During training, the system learns representative centroids for each segment position.

Those centroids form codebooks. A codebook is a table of learned representative segment values.

The training data should be representative of the vectors that will be searched. If the training sample does not match the real distribution, compressed distances may become less reliable.

Step 3: Encode Stored Vectors

After the codebooks are ready, each vector segment is replaced by the ID of its nearest centroid.

The stored vector becomes a sequence of compact codes rather than a long array of 32-bit floats.

This is the compressed representation used during vector search.

Step 4: Encode or Compare the Query

At query time, the query vector is compared against the codebooks.

Many implementations build lookup tables that contain the distances from each query segment to each centroid in the corresponding codebook.

Then, scoring a compressed candidate can be reduced to looking up and adding segment-level distances.

Step 5: Search With Compressed Candidates

The ANN search process can use compressed vectors to score many candidates cheaply.

Instead of reading full vectors for every candidate, the system reads small codes, estimates distances, and keeps the best candidates seen so far.

This is where PQ reduces memory pressure and can improve throughput.

Step 6: Rescore the Best Candidates

Compressed distances are approximate. They can identify good candidates, but they may not preserve the exact order of nearest neighbors.

To improve final ranking, many systems fetch the original full-precision vectors for a smaller candidate set and rescore them against the query.

Rescoring helps recover recall and ranking quality while still avoiding full-precision comparison against the entire collection.

Why PQ Can Change Search Results

Product quantization maps continuous vector values into discrete regions.

Two different vector segments can map to the same centroid ID. Once that happens, the compressed representation cannot fully distinguish them.

This distortion can change distance estimates, candidate ordering, and final recall.

The Candidate Bucket Problem

Rescoring only helps candidates that make it into the candidate bucket.

If a relevant vector is filtered out during the compressed search stage, final rescoring never sees it.

This is why PQ search often needs a larger candidate pool, a higher graph search breadth, more cluster probes, or a larger rescoring limit than an uncompressed configuration.

How PQ Affects Memory

PQ can reduce memory by storing compact codes instead of full vectors for search-time operations.

The savings are especially valuable with high-dimensional embeddings and large collections.

The system still needs to store codebooks, index structures, metadata, and often original vectors for retrieval or rescoring. PQ reduces a major cost, but it does not remove every cost.

How PQ Affects Latency

PQ can reduce latency when compressed candidate scoring is cheaper than full vector comparison.

It can also improve cache behavior because smaller representations allow more candidate data to stay hot.

Latency can increase if the system must fetch many original vectors from disk for rescoring, or if compression settings require too many candidate expansions to maintain recall.

How PQ Affects Recall

Recall depends on how much useful distance information survives compression.

More segments usually preserve more detail but use more memory. Fewer segments save more memory but may introduce more distortion.

Recall also depends on training quality, candidate pool size, rescoring strategy, and the underlying ANN index settings.

PQ With Graph Search

In graph-based search, PQ can compress the vectors used while traversing the graph.

The graph still guides the search through nearby nodes, but distance estimates may be based on compressed representations.

Because graph traversal decisions affect which candidates are explored, PQ distortion can influence not only final ranking but also the path through the graph.

PQ With Cluster-Based Search

In cluster-based search, PQ is often paired with IVF-style routing.

The index first chooses likely clusters or posting lists. Then compressed codes are used to scan many vectors inside those candidate regions.

This combination is useful when memory efficiency matters more than keeping every full vector hot in RAM.

What Engineers Tune

Important PQ-related tuning knobs include:

number of segments
centroids per segment
training sample size
candidate pool size
rescoring limit
graph search breadth or cluster probe count
whether original vectors are available for final ranking

These settings decide where the system lands on the memory, recall, and latency trade-off curve.

When PQ Works Well

Product quantization tends to work well when:

vectors are high dimensional
the collection is large enough for memory to matter
training data represents the searched data
the application can tolerate approximate candidate selection
rescoring is available for final ranking
benchmarks show acceptable recall at target latency

When PQ Needs Care

PQ needs careful testing when:

recall requirements are strict
the vector distribution changes over time
queries depend on subtle semantic differences
filters reduce the available candidate set
the system cannot rescore with original vectors
compressed search must handle very crowded vector regions

Benchmarking PQ for Vector Search

Benchmark PQ with realistic production queries. Measure:

memory before and after compression
recall at the target k
ranking quality after rescoring
p50, p95, and p99 latency
queries per second under concurrency
candidate expansion requirements
rescoring read cost
behavior under metadata filters

The goal is not maximum compression by itself. The goal is enough compression while preserving useful search quality.

Summary

Product quantization helps vector search by replacing full vectors with compact segment codes during candidate search. The system estimates distances using codebooks, keeps likely candidates, and may rescore the final set using original vectors.

This can greatly reduce memory usage and improve search efficiency, but it introduces approximation. The practical question is whether the memory savings justify the recall, latency, training, and rescoring trade-offs for your workload.