Product quantization changes how a vector search system stores and compares embeddings. Instead of comparing every candidate with full-precision vectors, the system can use compact codes to perform a faster, lower-memory first pass.
For vector search, the important point is not only that product quantization compresses vectors. The important point is how that compressed representation affects candidate generation, distance estimates, rescoring, recall, and latency.
The Role of PQ in Vector Search
Product quantization is usually used to reduce the cost of searching large vector collections.
A vector database may still use an ANN index such as a graph or cluster-based structure, but the vectors inside that search process can be represented with compact product-quantized codes.
This lets the system keep more searchable data in memory and reduce the amount of full-precision vector comparison required per query.
Step 1: Split Vectors Into Segments
PQ starts by dividing each embedding into smaller segments.
For example, a 128-dimensional vector might be split into 32 segments, with 4 dimensions per segment. Each segment is compressed independently.
This matters for search because distances can later be estimated segment by segment instead of reconstructing and comparing every original vector value.
Step 2: Learn Codebooks
During training, the system learns representative centroids for each segment position.
Those centroids form codebooks. A codebook is a table of learned representative segment values.
The training data should be representative of the vectors that will be searched. If the training sample does not match the real distribution, compressed distances may become less reliable.
Step 3: Encode Stored Vectors
After the codebooks are ready, each vector segment is replaced by the ID of its nearest centroid.
The stored vector becomes a sequence of compact codes rather than a long array of 32-bit floats.
This is the compressed representation used during vector search.
Step 4: Encode or Compare the Query
At query time, the query vector is compared against the codebooks.
Many implementations build lookup tables that contain the distances from each query segment to each centroid in the corresponding codebook.
Then, scoring a compressed candidate can be reduced to looking up and adding segment-level distances.
Step 5: Search With Compressed Candidates
The ANN search process can use compressed vectors to score many candidates cheaply.
Instead of reading full vectors for every candidate, the system reads small codes, estimates distances, and keeps the best candidates seen so far.
This is where PQ reduces memory pressure and can improve throughput.
Step 6: Rescore the Best Candidates
Compressed distances are approximate. They can identify good candidates, but they may not preserve the exact order of nearest neighbors.
To improve final ranking, many systems fetch the original full-precision vectors for a smaller candidate set and rescore them against the query.
Rescoring helps recover recall and ranking quality while still avoiding full-precision comparison against the entire collection.
Why PQ Can Change Search Results
Product quantization maps continuous vector values into discrete regions.
Two different vector segments can map to the same centroid ID. Once that happens, the compressed representation cannot fully distinguish them.
This distortion can change distance estimates, candidate ordering, and final recall.
The Candidate Bucket Problem
Rescoring only helps candidates that make it into the candidate bucket.
If a relevant vector is filtered out during the compressed search stage, final rescoring never sees it.
This is why PQ search often needs a larger candidate pool, a higher graph search breadth, more cluster probes, or a larger rescoring limit than an uncompressed configuration.
How PQ Affects Memory
PQ can reduce memory by storing compact codes instead of full vectors for search-time operations.
The savings are especially valuable with high-dimensional embeddings and large collections.
The system still needs to store codebooks, index structures, metadata, and often original vectors for retrieval or rescoring. PQ reduces a major cost, but it does not remove every cost.
How PQ Affects Latency
PQ can reduce latency when compressed candidate scoring is cheaper than full vector comparison.
It can also improve cache behavior because smaller representations allow more candidate data to stay hot.
Latency can increase if the system must fetch many original vectors from disk for rescoring, or if compression settings require too many candidate expansions to maintain recall.
How PQ Affects Recall
Recall depends on how much useful distance information survives compression.
More segments usually preserve more detail but use more memory. Fewer segments save more memory but may introduce more distortion.
Recall also depends on training quality, candidate pool size, rescoring strategy, and the underlying ANN index settings.
PQ With Graph Search
In graph-based search, PQ can compress the vectors used while traversing the graph.
The graph still guides the search through nearby nodes, but distance estimates may be based on compressed representations.
Because graph traversal decisions affect which candidates are explored, PQ distortion can influence not only final ranking but also the path through the graph.
PQ With Cluster-Based Search
In cluster-based search, PQ is often paired with IVF-style routing.
The index first chooses likely clusters or posting lists. Then compressed codes are used to scan many vectors inside those candidate regions.
This combination is useful when memory efficiency matters more than keeping every full vector hot in RAM.
What Engineers Tune
Important PQ-related tuning knobs include:
- number of segments
- centroids per segment
- training sample size
- candidate pool size
- rescoring limit
- graph search breadth or cluster probe count
- whether original vectors are available for final ranking
These settings decide where the system lands on the memory, recall, and latency trade-off curve.
When PQ Works Well
Product quantization tends to work well when:
- vectors are high dimensional
- the collection is large enough for memory to matter
- training data represents the searched data
- the application can tolerate approximate candidate selection
- rescoring is available for final ranking
- benchmarks show acceptable recall at target latency
When PQ Needs Care
PQ needs careful testing when:
- recall requirements are strict
- the vector distribution changes over time
- queries depend on subtle semantic differences
- filters reduce the available candidate set
- the system cannot rescore with original vectors
- compressed search must handle very crowded vector regions
Benchmarking PQ for Vector Search
Benchmark PQ with realistic production queries. Measure:
- memory before and after compression
- recall at the target
k - ranking quality after rescoring
- p50, p95, and p99 latency
- queries per second under concurrency
- candidate expansion requirements
- rescoring read cost
- behavior under metadata filters
The goal is not maximum compression by itself. The goal is enough compression while preserving useful search quality.
Summary
Product quantization helps vector search by replacing full vectors with compact segment codes during candidate search. The system estimates distances using codebooks, keeps likely candidates, and may rescore the final set using original vectors.
This can greatly reduce memory usage and improve search efficiency, but it introduces approximation. The practical question is whether the memory savings justify the recall, latency, training, and rescoring trade-offs for your workload.