Product quantization, often shortened to PQ, is a vector compression technique. It reduces the size of vector embeddings by replacing groups of floating-point dimensions with compact codes.
In vector databases, PQ is useful because embeddings can consume large amounts of memory. Compressing them can lower infrastructure cost and make large-scale approximate nearest neighbor search more practical.
What PQ Compresses
A vector embedding is usually stored as a list of numbers. In many systems, each number is a 32-bit floating-point value.
For example, a 768-dimensional vector stored as 32-bit floats uses:
768 dimensions x 4 bytes = 3072 bytes
That is only one vector. At millions of vectors, the raw vector memory alone becomes large before accounting for graph edges, posting lists, metadata, caches, or replicas.
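The arithmetic above can be scaled to a whole collection. The following is a minimal sketch; the function name and the 10-million-vector collection size are illustrative assumptions, not figures from any particular system.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Bytes used by uncompressed vector values only (no index structures)."""
    return num_vectors * dims * bytes_per_value

one_vector = raw_vector_bytes(1, 768)            # 3072 bytes
ten_million = raw_vector_bytes(10_000_000, 768)  # hypothetical collection size

print(one_vector)           # 3072
print(ten_million / 1e9)    # 30.72 (decimal gigabytes)
```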
The Core Compression Idea
PQ does not store every original vector coordinate.
Instead, it splits the vector into smaller parts and replaces each part with the ID of a learned representative value.
The compressed vector becomes a sequence of IDs. These IDs are much smaller than the original floating-point values.
Segments and Sub-Vectors
The first step in PQ compression is segmentation.
A vector is divided into equal-sized segments, also called sub-vectors or subspaces. Each segment contains a contiguous group of dimensions.
For example:
128-dimensional vector
32 segments
4 dimensions per segment
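Each segment is compressed separately. This is why the technique is called product quantization: the set of vectors PQ can represent is the Cartesian product of the per-segment centroid sets, so a small number of centroids per segment yields a very large number of possible combinations.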
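Segmentation itself is simple slicing. A minimal sketch, assuming the vector is a plain Python list and that its length divides evenly by the segment count (the function name is illustrative):

```python
def split_into_segments(vector, num_segments):
    """Split a vector into equal-sized, contiguous sub-vectors."""
    if len(vector) % num_segments != 0:
        raise ValueError("vector length must be divisible by num_segments")
    seg_dims = len(vector) // num_segments
    return [vector[m * seg_dims:(m + 1) * seg_dims] for m in range(num_segments)]

# The example from the text: 128 dimensions, 32 segments, 4 dimensions each.
vector = [float(i) for i in range(128)]
segments = split_into_segments(vector, 32)

print(len(segments))      # 32
print(segments[1])        # [4.0, 5.0, 6.0, 7.0]
```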
Codebooks and Centroids
For each segment position, PQ learns a set of representative centroids from training data.
These centroids form a codebook. The codebook is a lookup table that says, for this segment position, these are the representative segment values the system can use.
If a segment position uses 256 centroids, each centroid ID fits in one byte, because one byte can encode 256 distinct values.
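One way to picture a codebook is as a nested list indexed by segment position and then by centroid ID. The toy values below are made up for illustration; real codebooks are learned from data. The helper shows how the centroid count determines the code size.

```python
def bits_per_code(num_centroids: int) -> int:
    """Smallest number of bits that can address every centroid ID."""
    return max(1, (num_centroids - 1).bit_length())

# Hypothetical toy codebooks: 2 segment positions, 4 centroids each,
# 2 dimensions per segment. codebooks[m][centroid_id] is one centroid.
codebooks = [
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]],
]

print(codebooks[1][2])     # [1.0, -1.0]
print(bits_per_code(4))    # 2
print(bits_per_code(256))  # 8
```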
How a Segment Becomes a Code
After training, each vector segment is compared to the centroids in the matching codebook.
The nearest centroid is selected, and the system stores that centroid ID instead of the original segment values.
For a segment of four 32-bit floats, the original storage is:
4 dimensions x 4 bytes = 16 bytes
If that segment is encoded as a one-byte centroid ID, the compressed storage is:
1 byte
That segment-level example is a 16:1 reduction before considering codebook overhead.
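The encoding step can be sketched as a nearest-centroid search per segment. This sketch assumes squared Euclidean distance and at most 256 centroids per segment, so each ID fits in one byte; the codebook values and function names are illustrative.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode_segment(segment, codebook):
    """Return the ID of the centroid closest to this segment."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(segment, codebook[i]))

def encode_vector(vector, codebooks):
    """Encode each segment with the codebook for its position."""
    seg_dims = len(vector) // len(codebooks)
    return bytes(
        encode_segment(vector[m * seg_dims:(m + 1) * seg_dims], codebook)
        for m, codebook in enumerate(codebooks)
    )

# Hypothetical codebooks: 2 segment positions, 3 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]],
    [[0.0, 1.0], [2.0, 2.0], [9.0, 9.0]],
]
codes = encode_vector([0.9, 1.2, 8.0, 8.5], codebooks)

print(list(codes))  # [1, 2]
print(len(codes))   # 2 bytes instead of 16
```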
Vector-Level Memory Savings
The savings become clearer at the full-vector level.
A 768-dimensional vector stored as floats uses 3072 bytes. If PQ stores it as 128 one-byte segment codes, the compressed representation is roughly:
128 bytes
That is a 24:1 reduction in per-vector storage for the codes alone. The effective ratio depends on segment count, code size, codebook overhead, and whether original vectors are also stored for rescoring or reconstruction.
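The per-vector ratio can be computed directly from the dimensionality and the segment count. A minimal sketch that deliberately ignores codebook overhead and any retained original vectors (function names are illustrative):

```python
import math

def pq_code_bytes(num_segments, bits_per_code=8):
    """Bytes needed for one vector's PQ codes."""
    return math.ceil(num_segments * bits_per_code / 8)

def compression_ratio(dims, num_segments, bits_per_code=8, bytes_per_value=4):
    """Raw float storage divided by PQ code storage, per vector."""
    return (dims * bytes_per_value) / pq_code_bytes(num_segments, bits_per_code)

print(compression_ratio(768, 128))  # 24.0
print(compression_ratio(128, 32))   # 16.0
```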
Why PQ Needs Training
PQ needs training because the centroids should reflect the shape of the vector data.
The system samples vectors, splits them into segments, and clusters each segment position to learn representative centroids.
If the training sample is too small or not representative, the codebooks may not describe the real vector distribution well. That can increase distance distortion and reduce recall.
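The training loop can be sketched with a plain k-means (Lloyd's algorithm) run once per segment position. The text does not name a specific clustering algorithm, so k-means is an assumption here, as are the function names, the iteration count, and the random Gaussian sample standing in for real embeddings.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10, seed=0):
    """Basic Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the previous centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def train_codebooks(sample, num_segments, num_centroids):
    """Learn one codebook per segment position from a training sample."""
    seg_dims = len(sample[0]) // num_segments
    codebooks = []
    for m in range(num_segments):
        segment_points = [v[m * seg_dims:(m + 1) * seg_dims] for v in sample]
        codebooks.append(kmeans(segment_points, num_centroids, seed=m))
    return codebooks

# Synthetic stand-in for a training sample: 200 vectors of 8 dimensions.
rng = random.Random(42)
sample = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(sample, num_segments=4, num_centroids=16)

print(len(codebooks), len(codebooks[0]), len(codebooks[0][0]))  # 4 16 2
```

Production systems typically use far larger samples and an optimized clustering routine; the structure of the loop is the point of this sketch.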
Codebook Overhead
PQ compression is not free of overhead.
The system must store codebooks, and it may also store original vectors for retrieval, reconstruction, or final rescoring.
Even with this overhead, PQ can dramatically reduce the memory used for search-time vector representations.
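A rough memory budget makes the overhead concrete. This sketch assumes one-byte codes, 32-bit float centroids, and a single set of codebooks shared by the whole collection; the one-million-vector collection and the function names are illustrative.

```python
def codebook_bytes(dims, num_centroids, bytes_per_value=4):
    """All codebooks together hold num_centroids full-length vectors' worth of floats."""
    return num_centroids * dims * bytes_per_value

def index_memory(num_vectors, dims, num_segments, num_centroids=256,
                 keep_originals=False):
    codes = num_vectors * num_segments  # one byte per segment code
    books = codebook_bytes(dims, num_centroids)
    originals = num_vectors * dims * 4 if keep_originals else 0
    return {"codes": codes, "codebooks": books, "originals": originals,
            "total": codes + books + originals}

compressed_only = index_memory(1_000_000, 768, 128)
with_originals = index_memory(1_000_000, 768, 128, keep_originals=True)

print(compressed_only["codebooks"])  # 786432
print(compressed_only["total"])      # 128786432
print(with_originals["total"])       # 3200786432
```

In this example the codebooks are small next to the codes, while keeping original vectors in the same memory tier dominates the total.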
Why PQ Is Lossy
PQ is lossy because it replaces exact numeric values with approximate centroids.
Many different original segments can map to the same centroid. Once compressed, those different segments may become indistinguishable at the code level.
This is how PQ saves memory, and it is also why search quality must be measured after compression.
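The loss is easy to see on a toy codebook. In this sketch (made-up centroid values, squared Euclidean distance assumed), two different segments receive the same code, and decoding returns the centroid rather than either original.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode_segment(segment, codebook):
    return min(range(len(codebook)),
               key=lambda i: squared_distance(segment, codebook[i]))

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]  # hypothetical centroids

segment_a = [0.9, 1.2]
segment_b = [1.1, 0.8]
code_a = encode_segment(segment_a, codebook)
code_b = encode_segment(segment_b, codebook)

decoded = codebook[code_a]  # reconstruction is the centroid, not the original
error_a = squared_distance(segment_a, decoded)

print(code_a, code_b)  # 1 1
print(decoded)         # [1.0, 1.0]
print(error_a > 0)     # True
```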
Compression Ratio vs Recall
The main trade-off is compression ratio versus recall.
Fewer segments, or fewer centroids per segment, usually save more memory but lose more information. More segments or more centroids preserve more detail but use more memory.
There is no universally best setting. The right configuration depends on the embedding model, dimensionality, dataset distribution, recall target, and latency budget.
Compression Ratio vs Latency
PQ can improve latency because compressed vectors are smaller and friendlier to memory caches.
However, latency can also increase if the system must expand candidate pools, fetch full vectors, or rescore many candidates to recover recall.
The practical result depends on the whole search path, not just the compressed byte size.
PQ and Full-Precision Rescoring
Many systems use PQ for candidate selection and full-precision vectors for final scoring.
The compressed codes help identify likely matches cheaply. Then the system fetches original vectors for the best candidates and recomputes distances more accurately.
This approach can preserve much of the memory benefit while improving final result quality.
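A two-stage search can be sketched as follows. The first stage here scores codes with per-query lookup tables of query-to-centroid distances, a common approach that the text does not spell out, so treat it as an assumption. The second stage recomputes exact distances for the shortlisted candidates. All names and data values are illustrative.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_distance_tables(query, codebooks):
    """tables[m][centroid_id] = distance from query segment m to that centroid."""
    seg_dims = len(query) // len(codebooks)
    return [[squared_distance(query[m * seg_dims:(m + 1) * seg_dims], c)
             for c in codebook]
            for m, codebook in enumerate(codebooks)]

def approximate_distance(code, tables):
    return sum(tables[m][centroid_id] for m, centroid_id in enumerate(code))

def search(query, codes, originals, codebooks, k, num_candidates):
    tables = build_distance_tables(query, codebooks)
    # Stage 1: rank every vector by its cheap, approximate PQ distance.
    ranked = sorted(range(len(codes)),
                    key=lambda i: approximate_distance(codes[i], tables))
    candidates = ranked[:num_candidates]
    # Stage 2: rescore only the candidates with full-precision vectors.
    rescored = sorted(candidates,
                      key=lambda i: squared_distance(query, originals[i]))
    return rescored[:k]

# Toy setup: 2 segments of 1 dimension, 2 centroids per segment.
codebooks = [[[0.0], [1.0]], [[0.0], [1.0]]]
originals = [[0.1, 0.1], [0.4, 0.2], [0.9, 0.8], [1.0, 0.1]]
codes = [bytes([0, 0]), bytes([0, 0]), bytes([1, 1]), bytes([1, 0])]

print(search([0.35, 0.25], codes, originals, codebooks, k=1, num_candidates=2))
# [1]
```

Vectors 0 and 1 share the same code, so the approximate stage cannot separate them; rescoring with the original values picks vector 1 as the true nearest neighbor.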
When PQ Compression Helps Most
PQ compression is most useful when:
- the vector collection is large
- vectors have many dimensions
- memory cost is a bottleneck
- exact distances are not required for every candidate
- the system can train representative codebooks
- the application can benchmark recall after compression
When PQ Compression Needs Caution
Use PQ carefully when:
- recall requirements are very strict
- the dataset is small enough that compression is unnecessary
- vector distribution changes often
- new embeddings differ from the training sample
- there is no rescoring path
- filters make candidate pools very small
PQ Compared With Simple Rounding
PQ is not the same as rounding every vector dimension.
Simple rounding compresses each dimension independently. Product quantization compresses groups of dimensions using learned centroids for each segment position.
That segment-level learning is what lets PQ capture common local patterns in the embedding space.
What to Measure
Before adopting PQ compression, measure:
- raw vector memory before compression
- compressed code memory after compression
- codebook overhead
- whether original vectors are still stored
- recall at the target k
- latency with and without rescoring
- training and conversion time
- behavior after new data is added
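Recall at k is the metric most directly affected by compression. A minimal sketch, assuming the exact top-k IDs come from a brute-force search over uncompressed vectors and the approximate IDs come from the PQ-backed index (the ID lists below are made up):

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k results that the approximate search returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

exact_top5 = [3, 7, 1, 9, 4]   # hypothetical brute-force results
approx_top5 = [3, 1, 8, 7, 2]  # hypothetical PQ search results

print(recall_at_k(approx_top5, exact_top5, k=5))  # 0.6
```

Averaging this value over a representative query set, before and after compression, gives a direct view of what PQ costs in result quality.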
Summary
Product quantization is a vector compression method that splits embeddings into segments, learns centroid codebooks for those segments, and stores compact centroid IDs instead of full floating-point values.
PQ can reduce memory dramatically, especially for large high-dimensional vector collections. The trade-off is that compression is approximate, so recall, latency, training quality, and rescoring behavior must be tested with real data.