Product quantization
Last reviewed
Apr 30, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 4,080 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Apr 30, 2026
Sources
16 citations
Review status
Source-backed
Revision
v1 · 4,080 words
Add missing citations, update stale details, or suggest a clearer explanation.
Product quantization (PQ) is a vector-compression technique for approximate nearest-neighbor (ANN) search in high-dimensional spaces. It splits each D-dimensional input vector into M equal sub-vectors, runs k-means independently on each sub-space to learn a small codebook of K centroids, and replaces every sub-vector with the index of its nearest centroid. The resulting code is M * log_2(K) bits long, typically M bytes when K = 256, which is one to two orders of magnitude smaller than the raw float32 vector. Distances between a full-precision query and a compressed database vector can then be estimated in time linear in M using precomputed lookup tables, an operation known as asymmetric distance computation (ADC).
PQ was introduced by Hervé Jégou, Matthijs Douze, and Cordelia Schmid at Inria in a 2011 paper in IEEE Transactions on Pattern Analysis and Machine Intelligence, which remains the foundational reference for the method [1]. Since then, PQ and its variants have become a core building block of modern vector databases and large-scale retrieval systems, including FAISS, Milvus, Pinecone, Weaviate, pgvector, and Google's ScaNN. PQ is also a standard component of billion-scale similarity search pipelines for image retrieval, recommender systems, and the retrieval-augmented generation stack used by large language models.
Many machine-learning systems represent items as dense vector embeddings in R^D, with D typically between 100 and 3,000. Search over these vectors reduces to a k-nearest-neighbors problem: given a query vector, return the k database vectors with smallest distance under a chosen metric, usually Euclidean or cosine.
For billion-scale corpora, exact nearest-neighbor search is infeasible on two axes:
O(N * D) floating-point operations. At billion scale this is hundreds of milliseconds per query even on optimized hardware, far too slow for interactive use.Approximate methods trade a small amount of recall for orders-of-magnitude gains in memory or speed. PQ attacks the memory axis directly: a typical configuration with D = 128, M = 8, and 8 bits per sub-quantizer compresses each vector from 512 bytes to 8 bytes, a 64x reduction. With ADC, distance computation also becomes faster, since per-query work reduces to a constant number of table lookups per database vector instead of a full dot product.
The Inria group originally developed PQ for content-based image retrieval over millions of SIFT descriptors, but the same machinery generalizes to any dense embedding space. Combined with an inverted file (IVF) coarse partition, PQ scales to billion-vector indexes on a single server, a setting demonstrated by FAISS [4][1].
A vector quantizer is a function q: R^D -> C that maps each input vector x to a codeword c_i drawn from a finite codebook C = {c_1, ..., c_K}. The optimal quantizer in the mean-squared sense minimizes the expected reconstruction error:
E[||x - q(x)||^2]
Lloyd's algorithm (k-means) provides a local minimum. Each x requires log_2(K) bits to store as the index of its assigned centroid. The challenge is that high-quality quantization in D dimensions needs K exponential in D. For D = 128 and 8 bytes per code (K = 2^64), training and lookup with a single codebook are not tractable.
PQ sidesteps the curse of dimensionality by factoring the codebook as a Cartesian product. The space R^D is decomposed into the product of M orthogonal sub-spaces, each of dimension D / M. A separate sub-quantizer q_m: R^(D/M) -> C_m is trained on each sub-space, with its own small codebook C_m of size K. The full quantizer is:
q(x) = (q_1(x_1), q_2(x_2), ..., q_M(x_M))
where x = [x_1; x_2; ...; x_M] is the concatenation of the M sub-vectors. The effective codebook has K^M entries but is stored implicitly as M separate codebooks of size K. Training runs M independent k-means problems, each in D / M dimensions with K clusters. With K = 256 and b = 8 bits per sub-quantizer, each sub-vector index fits in a single byte and the total code length is M bytes [1].
Given a database vector x:
x into M sub-vectors of dimension D / M.x_m, find the nearest centroid in C_m and emit its 8-bit index i_m.(i_1, i_2, ..., i_M) as M bytes.A database of N vectors at D = 128 floats per vector occupies 512 * N bytes uncompressed. With M = 8, the PQ-compressed database occupies 8 * N bytes, plus a small fixed overhead for the codebooks (M * K * (D / M) * 4 = 16 KB for M = 8, K = 256, D = 128). A 1-billion-vector dataset that needs 512 GB raw fits in 8 GB compressed [1].
The utility of PQ depends on being able to estimate ||q - x||^2 from q and the code of x without first reconstructing x. The Jégou paper proposes two estimators:
d_SDC(q, x) = sum_m ||q_m(q_m_subvec) - q_m(x_m)||^2. Inter-centroid distances are precomputed once into an M * K * K table. At query time, each compressed database vector requires M table lookups, one per sub-space.d_ADC(q, x) = sum_m ||q_m_subvec - q_m(x_m)||^2. For each query, the M * K lookup table of distances ||q_m_subvec - c_m_k||^2 is computed once. Each database vector then takes M lookups and M - 1 additions to score.ADC has lower bias and is the default in practice, since paying M * K distance computations per query (a few thousand operations) is negligible compared to scoring millions of database codes. SDC is occasionally preferred when query latency is dominated by small batched inner loops over compressed codes, since the lookup tables can be cached more aggressively. The original paper reports that ADC outperforms SDC by 5-10 percentage points of recall at the same code size on SIFT-1M [1].
For very large N, scoring every code is still too slow. The standard production setup is IVFADC, also known as IVF-PQ:
q_c with K_coarse centroids (e.g., 1,024 to 65,536) using ordinary k-means on the full D-dimensional space.x, compute its coarse cell q_c(x) and the residual r = x - q_c(x).nprobe coarse cells nearest to the query. For each selected cell, compute the residual q - q_c(cell) and score the codes in that cell using ADC against the residual.The coarse quantizer prunes the search to a small fraction nprobe / K_coarse of the database. The PQ residual encoding compresses what remains. Typical values are K_coarse = sqrt(N) rounded to a power of two and nprobe between 1 and a few hundred. On SIFT-1M with K_coarse = 1024, M = 8, nprobe = 8, the original paper reports recall@100 around 0.9 with index size 8 MB instead of 512 MB [1].
PQ training runs M independent k-means jobs, each on a sub-sample of the data (typically 256,000 to 1 million vectors). Wall-clock cost is dominated by the coarse quantizer when K_coarse is large, but is generally a one-time offline operation.
PQ has spawned a family of related methods that improve on its accuracy under different assumptions. The most important are summarized below.
| Method | Year | Authors | Idea |
|---|---|---|---|
| Product quantization (PQ) | 2011 | Jégou, Douze, Schmid | Independent k-means on M sub-spaces [1] |
| Inverted Multi-Index (IMI) | 2012 | Babenko, Lempitsky | Two coarse quantizers, Cartesian product of cells gives finer partition [3] |
| Optimized PQ (OPQ) | 2013-2014 | Ge, He, Ke, Sun | Learn an orthogonal rotation R so dimensions distribute uniformly across sub-spaces [2] |
| Cartesian k-means (CKM) | 2013 | Norouzi, Fleet | Concurrent formulation equivalent to OPQ [5] |
| Locally Optimized PQ (LOPQ) | 2014 | Kalantidis, Avrithis | Learn a separate rotation and codebook per IVF cell [6] |
| Distance-Encoded PQ (DPQ) | 2014 | Heo, Lin, Yoon | Spend some bits encoding the quantized distance to the centroid [7] |
| Additive Quantization (AQ) | 2014 | Babenko, Lempitsky | Encode each vector as a sum of codewords from M codebooks; no orthogonality assumption [8] |
| Composite Quantization (CQ) | 2014 | Zhang et al. | Constrain inter-codebook inner products for fast distance computation |
| Residual Vector Quantization (RVQ) | classical, modern variants | Various | Cascade of quantizers; each stage encodes the residual of the previous one |
| Anisotropic Vector Quantization | 2020 | Guo et al. (Google ScaNN) | Loss function that penalizes the parallel residual component more than the orthogonal one [9] |
The quality of PQ depends on the choice of sub-space decomposition. If a few dimensions carry most of the variance, splitting them into the same sub-vector wastes most of the codebook budget on a different sub-space. Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun introduced OPQ to learn an orthogonal rotation R of the input space so that variance is balanced across sub-spaces and intra-sub-space dimensions are decorrelated. The objective is to jointly minimize quantization distortion over R and the M codebooks:
min_{R, C_1, ..., C_M} sum_x ||R x - q(R x)||^2
subject to R^T R = I. The CVPR 2013 conference paper proposes both a parametric solution that assumes a Gaussian distribution and a non-parametric iterative solution that alternates between updating R (an orthogonal Procrustes step) and re-running k-means in the rotated sub-spaces. The TPAMI 2014 journal version expands the analysis [2]. Norouzi and Fleet's Cartesian k-means at the same CVPR 2013 derives the same family of models independently [5].
OPQ closes a substantial fraction of the recall gap to full-precision search on benchmarks where the raw embedding has correlated dimensions, and is now standard in FAISS as IndexPQ(...) with OPQMatrix pre-rotation, or as the OPQ<m>_<d>,IVF<k>,PQ<m> index factory string.
Artem Babenko and Victor Lempitsky's Inverted Multi-Index (IMI), presented at CVPR 2012, replaces the single coarse quantizer of IVFADC with two coarse quantizers, one per half of the input. The Cartesian product of their cells yields K_coarse^2 implicit sub-cells, often a thousand times more than a flat IVF can afford, while keeping training cheap. The list traversal is reordered using a multi-sequence algorithm that walks over candidate cells in increasing-distance order. On SIFT-1B, IMI achieves a denser partition and higher recall than flat IVF at similar memory and query cost [3]. IMI is implemented in FAISS as IndexIVFPQ with two-level coarse quantization (IndexIMI2).
LOPQ extends IVFADC by learning a separate orthogonal rotation and PQ codebook for each coarse cell, rather than a single global one. Cells in different parts of the embedding space tend to have different local covariance structure, so a global rotation is suboptimal. The Kalantidis and Avrithis CVPR 2014 paper shows LOPQ improves recall on SIFT-1B and Yahoo Flickr Creative Commons 100M with constant memory overhead per cell [6]. The original implementation was open-sourced by Yahoo.
Additive Quantization (AQ) drops the orthogonal sub-space assumption entirely. Each vector is approximated as a sum of M codewords drawn from M codebooks of size K, with full freedom in how the codebooks span R^D. The encoding problem becomes combinatorial and is solved with beam search; training alternates between codeword updates and re-encoding. AQ delivers lower distortion than PQ at the same code size, at the cost of more expensive encoding [8].
Residual Vector Quantization (RVQ) is a cascaded variant: the first stage quantizes x, the second quantizes x - c_1, the third quantizes x - c_1 - c_2, and so on. Classical RVQ predates PQ and has been used for speech and image coding for decades. In modern deep learning, RVQ is the workhorse of neural audio codecs such as Google's SoundStream and Meta's EnCodec, where it is jointly trained with the encoder and decoder networks.
Fred André, Anne-Marie Kermarrec, and Nicolas Le Scouarnec showed in their VLDB 2015 paper that PQ throughput on commodity CPUs is bottlenecked not by FLOPs but by the cost of repeated cache accesses to the M * K lookup table. Their PQ Fast Scan algorithm shrinks the table to fit in SIMD registers, replaces cache lookups with in-register PSHUFB-style permutations, and processes 16 or 32 codes per instruction. The result is a 4x to 6x speed-up at identical recall, reducing scan time on 25 million vectors from about 74 ms to about 13 ms [10]. The same group's follow-up work, Quick ADC and Quicker ADC, extends the trick to wider SIMD and 4-bit codes. Most modern PQ implementations, including FAISS's IndexPQFastScan and IndexIVFPQFastScan, ship Fast Scan kernels.
Performance on PQ-style indexes is usually reported as recall@R: the fraction of queries for which the true nearest neighbor (or one of the true top-k) appears in the top R results returned by the approximate index. The original Jégou paper benchmarks SIFT-1M (D = 128, N = 1 million) and SIFT-1B (N = 1 billion):
M = 8 (8-byte codes), recall@100 reaches about 0.59 with plain ADC and around 0.89 with IVFADC at K_coarse = 1024, nprobe = 8 [1].M = 16 (16-byte codes), recall@100 climbs above 0.95 on SIFT-1M IVFADC.M = 16 and a coarse quantizer of 8,192 cells, IVFADC retrieves the true nearest neighbor in the top 100 candidates for the majority of queries while consuming roughly 12 GB of RAM for the entire 1-billion-vector index.OPQ, IMI, and LOPQ each push these numbers further, generally improving recall by 5 to 15 percentage points at the same code size. ScaNN's anisotropic loss is reported to outperform OPQ on glove-100-angular and other modern benchmarks by another factor of two in queries-per-second at fixed recall [9].
The Johnson, Douze, and Jégou 2017 paper on FAISS demonstrates GPU implementations of IVF-PQ that perform billion-scale k-NN at single-digit milliseconds per query, with peak k-selection running at 55% of theoretical hardware bandwidth. The same paper reports building a high-accuracy k-NN graph over 95 million Flickr images in 35 minutes on 8 GPUs, and a 1-billion-vector graph in under 12 hours on 4 GPUs [4].
PQ-based indexes occupy a different point in the recall-memory-throughput trade-off than graph-based methods like HNSW or hashing methods like LSH. The table below summarizes the main families.
| Method | Index type | Memory per vector | Query speed | Recall | Training | Billion-scale |
|---|---|---|---|---|---|---|
| Flat (brute force) | Exhaustive | 4 * D bytes (float32) | O(N * D) per query | 100% | none | impractical |
| LSH | Random or learned hash buckets | tens of bytes | sub-linear | moderate | minimal | yes, with sharding |
| IVF-Flat | k-means cells, full vectors in lists | 4 * D + 8 bytes | nprobe / K_coarse * N | high (up to 100% with full lists) | k-means | yes, but memory-heavy |
| PQ (flat) | Compressed codes only | M bytes (often 8-32) | O(N) table lookups | low to moderate | k-means per sub-space | yes, fits in RAM |
| IVF-PQ (IVFADC) | Coarse quantizer + PQ residuals | M + small bytes | (nprobe / K_coarse) * N lookups | moderate to high | coarse k-means + sub-space k-means | yes, default at 1B scale |
| HNSW | Multi-layer proximity graph | 4 * D + ~64-200 bytes for graph | O(log N) distance evals per query | very high | incremental | up to 100M comfortably |
| HNSW + PQ | Graph over compressed codes | M + graph bytes | O(log N) lookups | moderate to high | k-means + graph build | yes, hybrid |
| ScaNN | IVF + anisotropic PQ | M bytes per residual | (nprobe / K_coarse) * N | very high at given memory | anisotropic loss training | yes |
| DiskANN | SSD-resident graph | small RAM, large SSD | O(log N) SSD reads | very high | offline graph build | designed for billion scale |
In practice, HNSW wins when memory is plentiful and high recall (above 0.95) is required, while PQ-family methods win when memory is the binding constraint or when the corpus exceeds RAM. Hybrids that build HNSW or NSG graphs over PQ-compressed vectors, then re-rank the top candidates against full-precision vectors, are common in production systems that need both high recall and low memory.
Nearly every billion-scale vector search deployment in industry uses PQ in some form. The table summarizes the main implementations.
| System | First released | Owner | PQ usage |
|---|---|---|---|
| FAISS | 2017 | Meta AI Research | IndexPQ, IndexIVFPQ, IndexHNSWPQ, IndexIVFPQFastScan, OPQ pre-rotation; reference implementation [4] |
| Milvus | 2019 | Zilliz | IVF_PQ, IVF_SQ8, HNSW_PQ index types; uses FAISS and Knowhere internally |
| ScaNN | 2020 | Anisotropic vector quantization, the state of the art on glove-100-angular [9] | |
| Pinecone | 2021 | Pinecone | Internal indexes use IVF-PQ-like compression alongside graph methods |
| Weaviate | 2019 | Weaviate B.V. | HNSW with optional PQ compression and 1-bit binary quantization |
| Qdrant | 2021 | Qdrant Solutions | HNSW with optional PQ, scalar, and binary quantization |
| Vespa | open-sourced 2017 | Yahoo / Verizon | HNSW with int8 scalar quantization; PQ available for recommendation workloads |
| Vertex AI Matching Engine | 2021 | Google Cloud | Powered by ScaNN under the hood |
| pgvector | 2021 | Andrew Kane | IVFFlat and HNSW; community has added IVFPQ patches and external plugins |
| Elasticsearch / OpenSearch | 2022 | Elastic / OpenSearch Project | HNSW, with PQ via Faiss engine and scalar quantization native |
| LOPQ | 2014 | Yahoo | Reference open-source implementation of locally optimized PQ |
FAISS, released by Facebook AI Research in March 2017, remains the dominant open-source library. Its index factory mini-language lets users compose pipelines such as OPQ32_128,IVF65536,PQ32x8: rotate inputs with a 32-block OPQ matrix into 128 dimensions, partition with a 65,536-cell IVF, then encode residuals as 32-byte PQ codes. The FAISS GPU backend accelerates both training and search by one to two orders of magnitude over CPU [4]. Milvus, launched by Zilliz in October 2019, packages FAISS and a custom engine called Knowhere behind a database-style API. Google's ScaNN, open-sourced in July 2020, demonstrated that an anisotropic loss function biased toward inner-product preservation could outperform OPQ on retrieval workloads where queries follow a known distribution; it powers Vertex AI Matching Engine [9].
Dense retrieval since 2020 has made PQ-style compression a default consideration for any vector index scaling beyond a few million entries. In retrieval-augmented generation, document chunks are embedded with models such as text-embedding-3-large, BGE, or E5, producing vectors of dimension 768 to 3,072. A typical RAG corpus of 100 million chunks at 1,024 dimensions consumes about 400 GB raw; with M = 64, 8-bit PQ codes, the index fits in roughly 6.4 GB. The recall hit is usually repaired by re-ranking the top few hundred PQ candidates against full-precision vectors held on disk or in a tiered cache.
Recommender systems at Spotify, Pinterest, and eBay have used PQ-style compression in their candidate generators for years, typically combined with two-tower neural networks. Image and video retrieval at Google, Meta, and Yandex relies on PQ for both product image search and content moderation pipelines.
PQ codes are also sometimes used as discrete tokens. The vector quantization step inside VQ-VAE and RVQ-based neural audio codecs can be viewed as a form of PQ where the codebooks are jointly learned with neural encoders. Learned quantization is an active research area, with work on differentiable PQ, joint training with retrieval objectives, and hybrid sparse-dense indexes combining BM25 with PQ-compressed dense retrieval.
PQ can be cast as a constrained vector quantization problem. An unconstrained quantizer with K_total = K^M centroids would minimize E[||x - c_i||^2]. PQ adds the constraint that each centroid c_i factorizes as a concatenation of M sub-centroids, one per sub-space, and that each sub-centroid is shared across all K^(M-1) global centroids that agree on the other sub-spaces. The problem decomposes into M independent k-means problems precisely when the sub-spaces are decorrelated and have equal variance, the assumption that motivates OPQ. Pre-rotating the data with an orthogonal R does not change the global geometry but redistributes the per-sub-space error so the sum is minimized [1][2].
M, and exact retrieval requires re-ranking with full-precision vectors.||q - x||^2 = 2 - 2 q.x, which introduces extra distortion if normalization is not preserved by the codebook.