PQ compression is the practical use of product quantization to reduce the memory footprint of vector search. It compresses vector embeddings into compact codes so a database can search large collections with less RAM.
The important operational question is not only how PQ works. It is when to enable it, how to train it, what it changes during indexing, and how to measure the recall, latency, and memory trade-off.
Short Definition
PQ compression is lossy vector compression based on product quantization.
It splits embeddings into segments, learns centroids for each segment, and replaces full floating-point segment values with compact centroid IDs.
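The three steps in that definition can be sketched in plain Python. This is a minimal illustration, not any particular database's implementation: the helper names (`split`, `kmeans`, `train_codebooks`, `encode`), the toy sizes, and the use of basic Lloyd's k-means with squared Euclidean distance are all assumptions made for the example.

```python
import random

def split(vec, m):
    """Split a vector into m equal-length segments."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Tiny Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if its bucket is empty
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def train_codebooks(vectors, m, k):
    """Learn one codebook (k centroids) per segment position."""
    segmented = [split(v, m) for v in vectors]
    return [kmeans([s[j] for s in segmented], k) for j in range(m)]

def encode(vec, codebooks):
    """Replace each segment with the ID of its nearest centroid."""
    segments = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda c: sq_dist(seg, cb[c]))
            for seg, cb in zip(segments, codebooks)]

# Toy data: 200 random 8-dimensional vectors, 4 segments, 16 centroids each.
rng = random.Random(42)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(vectors, m=4, k=16)
code = encode(vectors[0], codebooks)  # a list of 4 centroid IDs, each in 0..15
```

Each stored vector is then represented by its list of centroid IDs instead of its eight floats. Production systems use optimized k-means and packed byte arrays, but the data flow is the same.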
Why Use PQ Compression?
Vector embeddings can be expensive to keep in memory.
A collection with millions of high-dimensional embeddings may require many gigabytes of RAM before considering index structures, metadata, caches, replicas, or query overhead.
PQ compression lowers the size of the searchable vector representation, which can reduce cost and make larger collections practical.
What PQ Compression Changes
Without compression, a vector search index may compare full-precision vectors during candidate search.
With PQ compression, the index can compare compact codes or approximate vectors derived from codebooks. This reduces memory bandwidth and storage pressure.
The trade-off is that distances become approximate unless the system rescores candidates with full vectors.
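One common way to compare a full-precision query against compressed codes is a per-segment lookup table, often called asymmetric distance computation. The sketch below assumes squared Euclidean distance and uses hand-written toy codebooks. The function names are hypothetical, and a real index may use a different scheme.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_tables(query, codebooks):
    """For each segment, precompute the distance from the query segment to every centroid."""
    m = len(codebooks)
    step = len(query) // m
    return [[sq_dist(query[j * step:(j + 1) * step], centroid)
             for centroid in codebooks[j]]
            for j in range(m)]

def approx_dist(code, tables):
    """Approximate squared distance: one table lookup per segment, then a sum."""
    return sum(tables[j][cid] for j, cid in enumerate(code))

# Toy setup: 2 segments of 2 dimensions each, 2 centroids per segment.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
query = [1.0, 1.0, 0.0, 0.0]
tables = build_tables(query, codebooks)

print(approx_dist([1, 0], tables))  # 0.0
print(approx_dist([0, 1], tables))  # 10.0
```

The tables are built once per query. Scoring each stored code then costs one lookup per segment, with no floating-point vector read, which is where the bandwidth saving comes from.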
When to Consider PQ Compression
Consider PQ compression when:
- vector memory is a major cost
- the collection is large
- embeddings have many dimensions
- the workload can tolerate approximate candidate generation
- you can benchmark recall before and after compression
- full-vector rescoring is available if final ranking quality matters
When to Avoid PQ Compression
Avoid or delay PQ compression when:
- the dataset is small enough for exact or uncompressed search
- recall requirements are strict and no rescoring is available
- you do not have enough representative vectors for training
- the embedding model or data distribution is still changing rapidly
- memory is not a bottleneck
The Training Requirement
PQ compression needs training data.
The system must learn codebooks that describe the vector distribution. These codebooks contain centroids for each segment position.
If training data is too small or unrepresentative, the compressed codes may distort distances too much and reduce search quality.
Training Limits
Large collections do not always need every vector for PQ training.
Many systems train codebooks from a representative sample. A training limit controls the maximum number of vectors used to learn the codebooks.
This keeps training time bounded on very large datasets, but the sample must still reflect the distribution of vectors that will actually be stored and searched.
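A training limit can be pictured as a cap on a uniform random sample. The sketch below is an assumption about how such a cap might be applied. The name `training_sample` and the uniform sampling strategy are illustrative, and a real system may sample differently, for example by taking the first vectors imported.

```python
import random

def training_sample(vectors, training_limit, seed=0):
    """Return at most training_limit vectors, drawn uniformly at random."""
    if len(vectors) <= training_limit:
        return list(vectors)  # small collection: train on everything
    return random.Random(seed).sample(vectors, training_limit)

# Stand-in for a large collection: 1,000 one-dimensional "vectors".
collection = [[float(i)] for i in range(1000)]
sample = training_sample(collection, training_limit=100)
print(len(sample))  # 100
```

Uniform sampling only helps if the collection itself is representative. If the first imported batch comes from a single tenant or topic, a sample drawn at that point inherits the same bias.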
The Conversion Phase
After codebooks are trained, stored vectors must be converted into compressed PQ codes.
This conversion may run as a background job, or it may happen during an index rebuild or collection reconfiguration.
During conversion, some systems restrict writes or place the affected shard or index into a temporary read-only state. This operational detail matters for production rollout planning.
Segment Count
The segment count controls how many pieces each vector is split into.
More segments usually preserve more detail and improve recall, but they produce longer codes and use more memory.
Fewer segments save more memory but increase distance distortion.
Centroids Per Segment
The number of centroids per segment controls how many representative values each codebook can choose from.
A common design uses 256 centroids per segment, allowing one-byte codes. More centroid choices can preserve more detail, but may increase storage or computation cost.
This setting should be evaluated with the target embedding model and query workload.
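The effect of segment count and centroid count on code size is simple arithmetic. The sketch below assumes centroid IDs are bit-packed and that originals are 32-bit floats. Many implementations instead round each ID up to a whole byte, which gives the same result for 256 centroids. The function names and the 768-dimension example are illustrative.

```python
import math

def code_bytes(segments, centroids):
    """Bytes per compressed vector, assuming bit-packed centroid IDs."""
    bits_per_id = math.ceil(math.log2(centroids))
    return math.ceil(segments * bits_per_id / 8)

def compression_ratio(dimensions, segments, centroids, float_bytes=4):
    """Raw vector bytes divided by compressed code bytes (codebooks not counted)."""
    return dimensions * float_bytes / code_bytes(segments, centroids)

print(code_bytes(96, 256))              # 96
print(compression_ratio(768, 96, 256))  # 32.0
print(compression_ratio(768, 192, 256)) # 16.0
```

Doubling the segment count from 96 to 192 halves the compression ratio. That is the memory price paid for the lower distortion of shorter segments.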
Codebook Overhead
PQ compression does not reduce memory to zero.
The system still stores codebooks, object IDs, metadata, index routing structures, and possibly original vectors for rescoring.
The headline compression ratio should be checked against total system memory, not only the compressed code size.
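A rough estimate makes the gap between the headline ratio and total memory visible. The sketch below assumes one-byte codes (at most 256 centroids) and 32-bit floats, and it deliberately ignores object IDs, metadata, and index structures. The function name, the `keep_originals_in_ram` flag, and the collection size are hypothetical.

```python
def memory_estimate(num_vectors, dimensions, segments, centroids=256,
                    keep_originals_in_ram=False, float_bytes=4):
    """Rough byte counts for codes, codebooks, and optional original vectors."""
    codes = num_vectors * segments  # one byte per segment ID (assumes <= 256 centroids)
    # m codebooks x k centroids x (d/m) floats each = k * d floats in total
    codebooks = centroids * dimensions * float_bytes
    originals = num_vectors * dimensions * float_bytes if keep_originals_in_ram else 0
    return {"codes": codes, "codebooks": codebooks, "originals": originals}

est = memory_estimate(1_000_000, 768, 96)
print(est["codes"])      # 96000000
print(est["codebooks"])  # 786432

with_originals = memory_estimate(1_000_000, 768, 96, keep_originals_in_ram=True)
print(with_originals["originals"])  # 3072000000
```

In this example the codebooks are small next to the codes, but keeping original vectors in RAM for rescoring would erase the saving. Where the originals live (memory, disk, or not at all) is therefore part of the compression decision.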
Interaction With ANN Tuning
PQ compression should not be tuned in isolation.
If the underlying ANN index uses graph traversal, parameters such as search breadth can affect whether compressed search finds enough good candidates.
If the index uses cluster probing, probe count and candidate list size may need adjustment after compression.
Rescoring
Rescoring is a common way to recover quality after PQ compression.
The index first searches using compressed representations. It then fetches a smaller set of original vectors and recomputes distances more accurately.
Rescoring can improve final ranking, but it adds read and compute cost. It also cannot rescue candidates that were discarded before the rescoring stage.
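The rescoring stage can be sketched as an exact re-rank of whatever the compressed search returned. The example below assumes squared Euclidean distance and an in-memory dictionary of original vectors. The names and the toy data are illustrative only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def rescore(query, candidate_ids, original_vectors, k):
    """Re-rank candidates from the compressed search by exact distance."""
    exact = [(sq_dist(query, original_vectors[i]), i) for i in candidate_ids]
    return [i for _, i in sorted(exact)[:k]]

original_vectors = {
    "a": [0.2, 0.0],
    "b": [1.0, 0.0],
    "c": [0.1, 0.1],
    "d": [0.0, 0.05],  # the true nearest neighbour of the query below
}
query = [0.0, 0.0]

# Suppose the compressed search returned these candidates and missed "d".
candidates = ["b", "c", "a"]
print(rescore(query, candidates, original_vectors, k=2))  # ['c', 'a']
```

Rescoring fixes the order of the candidates it receives, but `"d"` never reaches the result because it was not in the candidate list. Widening the candidate list is the only way to recover it.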
Recall Impact
PQ compression can reduce recall because many original vector segments may map to the same centroid ID.
This makes some distinct vectors look more similar and can make some relevant vectors look less relevant.
Recall should be measured at the actual target k, with real queries and real filters.
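Recall at k compares the compressed index's top-k results with the exact top-k for the same query. A minimal sketch follows. It assumes ground truth comes from an exact (brute-force) search over the same filtered data, and the function names are illustrative.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    truth = set(exact_ids[:k])
    return len(truth & set(approx_ids[:k])) / len(truth)

def mean_recall_at_k(approx_results, exact_results, k):
    """Average recall at k over a set of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)

exact = [1, 2, 3, 4, 5]
approx = [1, 3, 9, 2, 7]
print(recall_at_k(approx, exact, k=5))  # 0.6
```

Running the same function before and after enabling compression, on the same queries and filters, gives a directly comparable pair of numbers.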
Latency Impact
PQ compression can improve latency by reducing memory bandwidth and improving cache behavior.
It can also increase latency if the system needs larger candidate lists, more probes, or expensive rescoring.
The final latency effect depends on the index type, storage layout, and recall target.
Memory Impact
The primary benefit of PQ compression is lower memory usage.
For example, a 768-dimensional vector stored as 32-bit floats occupies 3,072 bytes. Encoded as 96 one-byte segment codes, it occupies 96 bytes, a 32x reduction before codebooks and other overhead are counted.
Measure memory before and after compression under realistic load, including index overhead and query-time working memory.
Rollout Checklist
A practical PQ compression rollout should include:
- baseline recall, latency, throughput, and memory measurements
- a representative training sample
- segment and centroid configuration choices
- conversion-time planning
- candidate expansion and rescoring settings
- filtered-query benchmarks
- rollback or rebuild plan if recall falls too far
Benchmark Metrics
Measure these metrics before deciding that PQ compression is production-ready:
- compressed bytes per vector
- total memory reduction
- codebook overhead
- training time
- conversion time
- recall at k
- p50, p95, and p99 latency
- queries per second
- rescoring cost
- quality after new data is added
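The latency percentiles in this list can be computed from recorded per-query timings. The sketch below uses the nearest-rank definition, which is one of several percentile conventions. The function name and the sample data are illustrative.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Stand-in for measured per-query latencies in milliseconds.
latencies_ms = list(range(1, 101))
print(percentile(latencies_ms, 50))  # 50
print(percentile(latencies_ms, 95))  # 95
print(percentile(latencies_ms, 99))  # 99
```

Tail percentiles matter here because rescoring and candidate expansion tend to add cost to the slowest queries first, which an average can hide.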
Common Mistakes
Common mistakes include:
- enabling PQ before enough representative data exists
- choosing the highest compression ratio without recall tests
- forgetting that conversion can affect writes
- not retuning ANN search parameters after compression
- benchmarking without production filters
- ignoring drift after embedding model changes
Summary
PQ compression is a practical way to reduce vector search memory by replacing full vector segments with compact centroid IDs. It requires codebook training, vector conversion, and careful tuning.
The benefit is lower memory and potentially better search efficiency. The cost is approximate distance estimation, so PQ compression should always be validated with recall, latency, throughput, and production-query benchmarks.