Product quantization is a vector compression method used in similarity search systems. It reduces the memory needed to store and search embeddings by replacing full-precision vector values with compact learned codes.
The short version: product quantization divides a vector into smaller parts, learns representative values for each part, and stores references to those representatives instead of storing every original floating-point value.
Why Product Quantization Exists
Vector embeddings can be large. A single 1536-dimensional embedding stored as 32-bit floats uses about 6 KB before accounting for index overhead, metadata, replication, and caches.
At millions or billions of vectors, raw embeddings can dominate memory and infrastructure cost. Product quantization exists to reduce that footprint while keeping vector search useful.
Product Quantization in Simple Terms
Product quantization is a lossy compression technique.
Instead of storing the exact coordinates of every vector, the system stores an approximate representation. That approximation is much smaller, but it may slightly distort distances between vectors.
This creates the central trade-off: lower memory usage in exchange for possible recall loss.
How Product Quantization Represents a Vector
A normal embedding is a long list of numbers:
[0.12, -0.44, 0.03, ...]
Product quantization splits that list into multiple smaller segments, sometimes called sub-vectors or subspaces.
For example, a 128-dimensional vector might be split into 32 segments of 4 dimensions each. Each segment is compressed independently.
What the Training Step Does
Before PQ can compress vectors, it needs a training step.
The system looks at sample vectors and learns representative centroids for each segment position. These centroids form a codebook.
The codebook is the lookup table that later lets the system replace each vector segment with a compact identifier.
What Gets Stored After Compression
After training, each segment of a vector is matched to its nearest learned centroid.
Instead of storing the original segment values, the system stores the centroid ID. A long floating-point vector becomes a short sequence of IDs.
This is why PQ can reduce memory dramatically. The stored representation is no longer a full list of 32-bit floating-point numbers.
A Simple Example
Suppose a vector has 768 dimensions stored as 32-bit floats. The raw vector uses:
768 x 4 bytes = 3072 bytes
If PQ represents that vector using 128 one-byte segment codes, the compressed representation uses roughly:
128 x 1 byte = 128 bytes
The exact savings depend on configuration and codebook overhead, but the reduction can be large enough to change the economics of a vector database.
Why Product Quantization Is Lossy
PQ is lossy because many different original segment values can map to the same centroid.
Once a segment is replaced by a centroid ID, the compressed representation no longer contains all of the original numeric detail. The vector is now an approximation.
That approximation is usually good enough for candidate search, but it can change distance calculations and ranking order.
How PQ Affects Recall
Recall can drop if compression causes the system to miss relevant nearest neighbors.
This happens when the compressed representation makes two vectors look less similar than they really are, or makes a less relevant vector look closer than it should.
More segments, better training data, and rescoring with original vectors can help reduce recall loss.
How PQ Affects Latency
Product quantization can improve latency by reducing memory bandwidth and cache pressure.
Compressed vectors are smaller, so more candidate representations can fit in memory. In some systems, query-time search first uses compressed vectors to find candidates, then uses full vectors for final ranking.
However, rescoring and disk reads can add latency, so the final effect depends on the implementation and configuration.
Product Quantization and Rescoring
Many systems use PQ for a rough first pass, then rescore the best candidates using uncompressed vectors.
This two-stage approach works like this:
- use compressed vectors to find likely candidates
- fetch the original vectors for those candidates
- recompute distances using full precision
- return the reranked results
Rescoring helps recover recall, but only for candidates that survived the compressed search stage.
Where Product Quantization Fits
Product quantization is most useful when vector memory is a limiting factor.
It is commonly used with approximate nearest neighbor indexes, including graph-based and cluster-based search systems. In some designs, PQ compresses vectors inside an HNSW-style index. In others, it is combined with IVF-style cluster routing.
The goal is the same: keep search practical as vector collections grow.
When Product Quantization Helps
PQ helps when:
- the vector collection is large
- embeddings have many dimensions
- RAM cost is high
- slightly approximate ranking is acceptable
- candidate rescoring is available
- the system can train a representative codebook
When Product Quantization May Be Risky
PQ may be risky when:
- very high recall is mandatory
- the dataset is too small for useful training
- the vector distribution changes heavily over time
- the embedding model changes frequently
- there is no rescoring step
- compression settings are chosen without benchmarking
Product Quantization vs Other Compression
Product quantization is one form of vector quantization. Other approaches include scalar quantization, binary quantization, and rotational quantization.
The key difference is that PQ compresses groups of dimensions using learned segment-level codebooks. It is not simply rounding each dimension independently.
What to Benchmark
Before using PQ in production, benchmark:
- memory reduction
- recall at the target
k - p95 and p99 latency
- query throughput
- training time
- index conversion time
- rescoring cost
- behavior after new data is added
Product quantization is only useful if the memory savings are worth the accuracy and operational trade-offs.
Summary
Product quantization is a lossy method for compressing vector embeddings. It splits vectors into smaller segments, learns centroids for each segment, and stores compact centroid IDs instead of full floating-point values.
This can reduce memory dramatically, especially for large high-dimensional vector collections. The trade-off is that compressed vectors are approximate, so recall and ranking quality must be measured carefully.