Product Quantization (PQ): Vector Compression Explained

Product quantization, often shortened to PQ, is a vector compression technique. It reduces the size of vector embeddings by replacing groups of floating-point dimensions with compact codes.

In vector databases, PQ is useful because embeddings can consume large amounts of memory. Compressing them can lower infrastructure cost and make large-scale approximate nearest neighbor search more practical.

What PQ Compresses

A vector embedding is usually stored as a list of numbers. In many systems, each number is a 32-bit floating-point value.

For example, a 768-dimensional vector stored as 32-bit floats uses:

768 dimensions x 4 bytes = 3072 bytes

That is only one vector. At millions of vectors, the raw vector memory alone becomes large before accounting for graph edges, posting lists, metadata, caches, or replicas.
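The arithmetic above can be scaled to a whole collection with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the ten-million-vector collection size is an assumed example, not a figure from any particular system.

```python
BYTES_PER_FLOAT32 = 4

def raw_vector_bytes(num_vectors: int, dims: int) -> int:
    """Bytes needed to store uncompressed 32-bit float vectors."""
    return num_vectors * dims * BYTES_PER_FLOAT32

one_vector = raw_vector_bytes(1, 768)
ten_million = raw_vector_bytes(10_000_000, 768)  # assumed collection size

print(one_vector)          # 3072
print(ten_million / 1e9)   # 30.72 (decimal gigabytes)
```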

The Core Compression Idea

PQ does not store every original vector coordinate.

Instead, it splits the vector into smaller parts and replaces each part with the ID of a learned representative value.

The compressed vector becomes a sequence of IDs. These IDs are much smaller than the original floating-point values.

Segments and Sub-Vectors

The first step in PQ compression is segmentation.

A vector is divided into equal-sized segments, also called sub-vectors or subspaces. Each segment contains a contiguous group of dimensions.

For example:

128-dimensional vector
32 segments
4 dimensions per segment

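Each segment is compressed separately. This is why the technique is called product quantization: the set of vectors it can represent is the Cartesian product of the per-segment codebooks, so a few small codebooks combine into a very large number of possible reconstructed vectors.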
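Segmentation itself is only slicing. The sketch below uses plain Python lists and a hypothetical helper name, and it assumes the dimension divides evenly by the segment count, as in the 128-dimension example above.

```python
def split_into_segments(vector, num_segments):
    """Split a vector into equal-sized contiguous segments."""
    if len(vector) % num_segments != 0:
        raise ValueError("dimension must be divisible by the segment count")
    seg_dims = len(vector) // num_segments
    return [vector[i * seg_dims:(i + 1) * seg_dims] for i in range(num_segments)]

vector = [float(i) for i in range(128)]   # stand-in for a real embedding
segments = split_into_segments(vector, 32)

print(len(segments), len(segments[0]))  # 32 4
print(segments[1])                      # [4.0, 5.0, 6.0, 7.0]
```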

Codebooks and Centroids

For each segment position, PQ learns a set of representative centroids from training data.

These centroids form a codebook. The codebook is a lookup table that says, for this segment position, these are the representative segment values the system can use.

If a codebook holds 256 centroids, each centroid ID fits in one byte, because one byte can encode 256 distinct values.
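The relationship between codebook size and code size generalizes beyond 256. As a small illustration, with a hypothetical function name, the number of bits needed for one centroid ID is the smallest bit width that can count every centroid:

```python
def bits_per_code(num_centroids: int) -> int:
    """Smallest number of bits that can index every centroid in a codebook."""
    if num_centroids < 1:
        raise ValueError("a codebook needs at least one centroid")
    return max(1, (num_centroids - 1).bit_length())

print(bits_per_code(256))     # 8
print(bits_per_code(16))      # 4
print(bits_per_code(65536))   # 16
```

Whether a system actually packs codes narrower than one byte is an implementation choice; this sketch only shows the lower bound.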

How a Segment Becomes a Code

After training, each vector segment is compared to the centroids in the matching codebook.

The nearest centroid is selected, and the system stores that centroid ID instead of the original segment values.
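The nearest-centroid step can be sketched as follows. The three-centroid codebook is a made-up toy, the helper names are hypothetical, and squared Euclidean distance is an assumption; a real system may use a different metric.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def encode_segment(segment, codebook):
    """Return the ID (index) of the centroid closest to the segment."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(segment, codebook[i]))

# Toy codebook for one segment position (real codebooks are learned).
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
]
segment = [0.9, 1.1, 0.8, 1.0]

print(encode_segment(segment, codebook))  # 1
```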

For a segment of four 32-bit floats, the original storage is:

4 dimensions x 4 bytes = 16 bytes

If that segment is encoded as a one-byte centroid ID, the compressed storage is:

1 byte

That segment-level example is a 16:1 reduction before considering codebook overhead.

Vector-Level Memory Savings

The savings become clearer at the full-vector level.

A 768-dimensional vector stored as floats uses 3072 bytes. If PQ stores it as 128 one-byte segment codes, the compressed representation is roughly:

128 bytes

That works out to 6 dimensions per segment and a 24:1 reduction for the codes alone. The exact ratio depends on segment count, code size, codebook overhead, and whether original vectors are also stored for rescoring or reconstruction.
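To compare configurations quickly, the code-only ratio can be computed directly. This sketch uses hypothetical function names, ignores codebook overhead and any stored originals, and assumes codes are packed and rounded up to whole bytes.

```python
def pq_code_bytes(num_segments: int, bits_per_code: int = 8) -> int:
    """Bytes for one PQ-encoded vector, rounded up to whole bytes."""
    return (num_segments * bits_per_code + 7) // 8

def compression_ratio(dims, num_segments, bits_per_code=8, bytes_per_value=4):
    """Raw vector size divided by PQ code size (codes only)."""
    return (dims * bytes_per_value) / pq_code_bytes(num_segments, bits_per_code)

print(pq_code_bytes(128))            # 128
print(compression_ratio(768, 128))   # 24.0
print(compression_ratio(128, 32))    # 16.0
```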

Why PQ Needs Training

PQ needs training because the centroids should reflect the shape of the vector data.

The system samples vectors, splits them into segments, and clusters each segment position to learn representative centroids.

If the training sample is too small or not representative, the codebooks may not describe the real vector distribution well. That can increase distance distortion and reduce recall.
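The training loop described above can be sketched with a plain Lloyd's k-means written against the standard library only. Everything here is illustrative: the function names are hypothetical, the Gaussian random vectors stand in for real embeddings, and the tiny sizes (8 dimensions, 4 segments, 16 centroids) are chosen so the example runs instantly. Production systems typically rely on an optimized clustering library instead.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            buckets[nearest].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def train_codebooks(vectors, num_segments, k):
    """Learn one codebook per segment position."""
    seg_dims = len(vectors[0]) // num_segments
    codebooks = []
    for s in range(num_segments):
        segment_sample = [v[s * seg_dims:(s + 1) * seg_dims] for v in vectors]
        codebooks.append(kmeans(segment_sample, k, seed=s))
    return codebooks

rng = random.Random(42)
training_vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = train_codebooks(training_vectors, num_segments=4, k=16)

print(len(codebooks), len(codebooks[0]), len(codebooks[0][0]))  # 4 16 2
```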

Codebook Overhead

PQ compression is not free of overhead.

The system must store codebooks, and it may also store original vectors for retrieval, reconstruction, or final rescoring.

Even with this overhead, PQ can dramatically reduce the memory used for search-time vector representations.
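A rough size estimate shows why codebooks are usually a small part of the total. The sketch assumes centroids are stored as 32-bit floats, one codebook per segment position, and codebooks shared by every vector in the collection; the function names and the one-million-vector collection are illustrative.

```python
def codebook_bytes(dims, num_centroids=256, bytes_per_value=4):
    # Each of the num_segments codebooks holds num_centroids centroids of
    # (dims / num_segments) values, so the segment count cancels out.
    return dims * num_centroids * bytes_per_value

def code_bytes(num_vectors, num_segments, bytes_per_code=1):
    return num_vectors * num_segments * bytes_per_code

books = codebook_bytes(768)
codes = code_bytes(1_000_000, 128)   # assumed collection size

print(books)   # 786432
print(codes)   # 128000000
```

Under these assumptions the codebooks add well under 1% on top of the codes. Stored original vectors, if kept in memory, would dominate both.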

Why PQ Is Lossy

PQ is lossy because it replaces exact numeric values with approximate centroids.

Many different original segments can map to the same centroid. Once compressed, those different segments may become indistinguishable at the code level.

This is how PQ saves memory, and it is also why search quality must be measured after compression.
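A toy case makes the information loss concrete. With a made-up two-centroid codebook and squared Euclidean distance (both assumptions for illustration), two different segments receive the same code and can no longer be told apart:

```python
def nearest(segment, codebook):
    """Index of the closest centroid under squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((x - c) ** 2 for x, c in zip(segment, codebook[i])))

codebook = [[0.0, 0.0], [1.0, 1.0]]   # hypothetical 2-centroid codebook
a = [0.9, 1.2]
b = [1.1, 0.7]

code_a, code_b = nearest(a, codebook), nearest(b, codebook)
print(code_a, code_b)       # 1 1
print(codebook[code_a])     # [1.0, 1.0] is the reconstruction for both
```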

Compression Ratio vs Recall

The main trade-off is compression ratio versus recall.

Fewer segments or smaller codebooks compress more aggressively: they usually save more memory but lose more information. More segments preserve more detail but use more memory.

There is no universally best setting. The right configuration depends on the embedding model, dimensionality, dataset distribution, recall target, and latency budget.

Compression Ratio vs Latency

PQ can improve latency because compressed vectors are smaller and friendlier to memory caches.

However, latency can also increase if the system must expand candidate pools, fetch full vectors, or rescore many candidates to recover recall.

The practical result depends on the whole search path, not just the compressed byte size.

PQ and Full-Precision Rescoring

Many systems use PQ for candidate selection and full-precision vectors for final scoring.

The compressed codes help identify likely matches cheaply. Then the system fetches original vectors for the best candidates and recomputes distances more accurately.

When the original vectors live on disk or in slower storage rather than in memory, this approach can preserve much of the memory benefit while improving final result quality.
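The two-stage flow can be sketched as below. Stage one scores every vector from its codes using a per-query lookup table of segment-to-centroid distances; stage two rescores a shortlist against the original vectors. The function names, the toy codebooks, and the use of squared Euclidean distance are all assumptions made for illustration, and the exhaustive scan stands in for whatever index a real system uses.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_distance_table(query, codebooks):
    """table[s][c] = squared distance from query segment s to centroid c."""
    seg_dims = len(query) // len(codebooks)
    return [
        [squared_distance(query[s * seg_dims:(s + 1) * seg_dims], centroid)
         for centroid in codebook]
        for s, codebook in enumerate(codebooks)
    ]

def approximate_distance(code, table):
    return sum(table[s][c] for s, c in enumerate(code))

def search(query, codes, originals, codebooks, k, num_candidates):
    table = build_distance_table(query, codebooks)
    # Stage 1: rank all vectors using only their compact codes.
    shortlist = sorted(range(len(codes)),
                       key=lambda i: approximate_distance(codes[i], table))[:num_candidates]
    # Stage 2: rescore the shortlist with full-precision vectors.
    return sorted(shortlist, key=lambda i: squared_distance(query, originals[i]))[:k]

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # segment 0
    [[0.0, 0.0], [1.0, 1.0]],   # segment 1
]
originals = [
    [0.1, 0.0, 0.9, 1.0],
    [1.0, 0.9, 1.0, 1.1],
    [0.0, 0.1, 0.1, 0.0],
    [0.6, 0.6, 0.9, 0.9],
]
codes = [[0, 1], [1, 1], [0, 0], [1, 1]]   # PQ codes of the originals
query = [0.7, 0.7, 1.0, 1.0]

print(search(query, codes, originals, codebooks, k=1, num_candidates=2))  # [3]
```

In this toy data, vectors 1 and 3 share the same codes, so the compressed stage cannot separate them; only the rescoring step identifies vector 3 as the closer match.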

When PQ Compression Helps Most

PQ compression is most useful when:

the vector collection is large
vectors have many dimensions
memory cost is a bottleneck
exact distances are not required for every candidate
the system can train representative codebooks
the application can benchmark recall after compression

When PQ Compression Needs Caution

Use PQ carefully when:

recall requirements are very strict
the dataset is small enough that compression is unnecessary
vector distribution changes often
new embeddings differ from the training sample
there is no rescoring path
filters make candidate pools very small

PQ Compared With Simple Rounding

PQ is not the same as rounding every vector dimension.

Simple rounding, a form of scalar quantization, compresses each dimension independently. Product quantization compresses groups of dimensions using learned centroids for each segment position.

That segment-level learning is what lets PQ capture common local patterns in the embedding space.

What to Measure

Before adopting PQ compression, measure:

raw vector memory before compression
compressed code memory after compression
codebook overhead
whether original vectors are still stored
recall at the target k
latency with and without rescoring
training and conversion time
behavior after new data is added
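Of the measurements listed above, recall at k is the one most often computed by hand. A minimal sketch, assuming ranked ID lists from an exact search and from the PQ-based search are already available (the IDs below are made up):

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors found in the approximate top-k."""
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / len(truth)

exact = [7, 2, 9, 4, 1]    # ground truth from full-precision search
approx = [7, 9, 3, 2, 8]   # results from the compressed index

print(recall_at_k(approx, exact, k=5))  # 0.6
```

Averaging this value over a representative set of queries gives a single number to compare before and after compression.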

Summary

Product quantization is a vector compression method that splits embeddings into segments, learns centroid codebooks for those segments, and stores compact centroid IDs instead of full floating-point values.

PQ can reduce memory dramatically, especially for large high-dimensional vector collections. The trade-off is that compression is approximate, so recall, latency, training quality, and rescoring behavior must be tested with real data.