# AQLM (Additive Quantization of Language Models)

> Source: https://aiwiki.ai/wiki/aqlm
> Updated: 2026-06-08
> Categories: AI Infrastructure, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

AQLM, short for Additive Quantization of Language Models, is a weight-only post-training [quantization](/wiki/quantization) method that compresses the weights of a [large language model](/wiki/large_language_model) to roughly 2 to 3 bits per parameter while retaining most of the original accuracy. It was introduced in "Extreme Compression of Large Language Models via Additive Quantization" by Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, [Elias Frantar](/wiki/elias_frantar), [Artem Babenko](/wiki/artem_babenko), and [Dan Alistarh](/wiki/dan_alistarh), a collaboration involving Yandex Research, the Institute of Science and Technology Austria (ISTA), and Neural Magic. The paper was first posted to arXiv in January 2024 and presented at the International Conference on Machine Learning (ICML) in 2024. [1] AQLM adapts Additive Quantization (AQ), a multi-codebook technique from the approximate nearest-neighbor-search and information-retrieval literature, to the task of representing groups of neural-network weights as sums of learned codewords. When it appeared, it was reported to be the first scheme that is Pareto-optimal below 3 bits per parameter, meaning that for a fixed memory budget in that range it produced a more accurate model than any prior approach. [1]

## Overview

A trained transformer stores most of its parameters as 16-bit floating-point numbers, so a 70-billion-parameter model needs roughly 140 GB just to hold its weights. Weight-only [post-training quantization](/wiki/post_training_quantization) replaces those 16-bit values with low-bit codes after training has finished, using only a small calibration set to fit the encoding, without re-running the original training pipeline. Because autoregressive decoding is memory-bandwidth bound, shrinking the weights both fits larger models into a given amount of memory and can speed up generation, since fewer bytes must be read for each token. [1]
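The storage arithmetic above can be checked with a few lines of Python. This is a minimal sketch that counts the weights only (no codebooks, activations, or KV cache); the function name `weight_memory_gb` is made up for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Return the storage needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70e9
for bits in (16, 4, 3, 2):
    # 16 bits corresponds to the unquantized fp16 model.
    print(f"{bits:>2} bits/param -> {weight_memory_gb(params_70b, bits):6.1f} GB")
```

At 2 bits per parameter the same 70-billion-parameter model needs about one eighth of the fp16 footprint, before any codebook overhead.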

AQLM operates one linear layer at a time, like the earlier [GPTQ](/wiki/gptq) family, but it changes the representation. Instead of rounding each weight to a scalar grid, it quantizes short groups of weights jointly, encoding each group as a sum of vectors drawn from several learned codebooks. This is a form of [vector quantization](/wiki/vector_quantization), and the specific variant, where the codeword is a sum of entries from multiple codebooks rather than a single lookup, is what gives the method its name. The authors pair this representation with two ingredients: an input-adaptive layer-wise calibration that fits the codes and codebooks to the data, and an end-to-end fine-tuning step that jointly optimizes the codebook parameters across each transformer block. [1] AQLM was integrated into Hugging Face Transformers (version 4.38.0) and PEFT (version 0.9.0), and the authors released prequantized Llama and Mistral models. [1][7]
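Loading a prequantized checkpoint through the Transformers integration might look like the sketch below. It assumes that `transformers` (4.38.0 or later), `accelerate`, and the `aqlm` inference package are installed, and that the checkpoint name shown is available on the Hugging Face Hub; the name and the helper functions are illustrative assumptions, not taken from the paper.

```python
# Assumed checkpoint name; substitute any prequantized AQLM model you have access to.
MODEL_ID = "ISTA-DASLab/Llama-2-7b-AQLM-2Bit-1x16-hf"


def load_aqlm_model(model_id: str = MODEL_ID):
    """Load a prequantized AQLM checkpoint with the standard Transformers API."""
    # Imported lazily so that defining this helper does not require the libraries.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    return tokenizer, model


def generate(prompt: str, max_new_tokens: int = 32) -> str:
    """Generate a short continuation with the quantized model."""
    tokenizer, model = load_aqlm_model()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

No quantization arguments are passed here, on the assumption that the quantization settings are stored in the checkpoint's configuration and picked up by `from_pretrained`.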

## Background: additive quantization

Additive Quantization was proposed by Artem Babenko and Victor Lempitsky in "Additive Quantization for Extreme Vector Compression" at CVPR 2014 as a way to store high-dimensional descriptor vectors compactly for similarity search. [2] It generalizes Product Quantization (PQ), the standard compressed-index method, which splits a vector into disjoint subvectors and quantizes each one with its own small codebook. PQ is fast but its sub-codebooks are forced to be orthogonal, which limits accuracy. Additive Quantization drops that restriction: it learns M full-dimensional codebooks, each containing a set of candidate codewords, and approximates a vector as the sum of one codeword selected from each codebook. Because the codebooks overlap in the same space, the sum can fit the data much more tightly than PQ for the same number of bits, at the cost of a harder encoding problem. [2]
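The difference between the two decoders can be sketched in NumPy. The codebooks here are random placeholders rather than learned ones, and the sizes are chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_codebooks, codebook_size = 8, 2, 4

# Product Quantization: each codebook covers its own disjoint slice of the vector.
sub_dim = dim // num_codebooks
pq_codebooks = rng.normal(size=(num_codebooks, codebook_size, sub_dim))


def pq_decode(codes):
    """Concatenate one sub-codeword per codebook."""
    return np.concatenate([pq_codebooks[m, codes[m]] for m in range(num_codebooks)])


# Additive Quantization: every codebook holds full-dimensional codewords.
aq_codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))


def aq_decode(codes):
    """Sum one full-dimensional codeword per codebook."""
    return sum(aq_codebooks[m, codes[m]] for m in range(num_codebooks))


codes = [3, 1]  # same number of index bits in both schemes
print(pq_decode(codes).shape, aq_decode(codes).shape)
```

Both decoders consume the same codes, but in the additive case every coordinate of the output depends on every index, which is what makes the encoding search harder.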

Finding the best combination of codewords for a given vector is combinatorial and NP-hard in general, since the choice in one codebook interacts with the choices in the others. AQ solves it approximately with beam search, keeping a shortlist of the most promising partial code assignments and extending them codebook by codebook. The same family of multi-codebook quantization (MCQ) methods underpins much of modern vector retrieval. AQLM's contribution was to recognize that LLM weight matrices are also large collections of vectors that can be compressed this way, and to make the codebooks adapt to the layer's behavior rather than to raw reconstruction error alone. [1][2]
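A beam search of this kind can be sketched as follows. This is a simplified illustration that minimizes plain squared reconstruction error for a single vector; the function name and the `beam_width` parameter are assumptions, and the original AQ formulation (and AQLM's calibrated variant) differ in detail.

```python
import numpy as np


def beam_search_encode(x, codebooks, beam_width=4):
    """Approximately pick one codeword per codebook so that their sum is close to x.

    x: (dim,) target vector; codebooks: (M, K, dim) array.
    Returns (codes, squared_error) for the best assignment found.
    """
    num_codebooks, codebook_size, _ = codebooks.shape
    # Each beam entry is (codes chosen so far, running sum of their codewords).
    beam = [((), np.zeros_like(x, dtype=float))]
    for m in range(num_codebooks):
        candidates = []
        for codes, partial in beam:
            for k in range(codebook_size):
                new_sum = partial + codebooks[m, k]
                err = float(np.sum((x - new_sum) ** 2))
                candidates.append((err, codes + (k,), new_sum))
        # Keep only the most promising partial assignments.
        candidates.sort(key=lambda c: c[0])
        beam = [(codes, s) for _, codes, s in candidates[:beam_width]]
    best_codes, best_sum = beam[0]
    return best_codes, float(np.sum((x - best_sum) ** 2))


rng = np.random.default_rng(0)
codebooks = rng.normal(size=(3, 4, 8))
x = codebooks[0, 2] + codebooks[1, 0] + codebooks[2, 3]
print(beam_search_encode(x, codebooks, beam_width=4))
```

A wider beam explores more combinations at higher cost; when `beam_width` is at least K to the power M the search becomes exhaustive.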

## How AQLM works

### Additive codebook representation

AQLM reshapes each weight matrix into many short groups of g consecutive weights (g is typically 8). Every group is approximated by a sum of M codewords, one taken from each of M learned codebooks, where each codebook holds 2^B codewords (so a single index into it costs B bits). A group is therefore stored as M indices; in addition, the method keeps a small set of scale factors (one per output unit in the paper), and the codebooks themselves are stored once and shared across the whole matrix. The nominal cost is M times B divided by g bits per weight, plus the amortized cost of the codebooks and scales. [1]
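Decoding such a representation back into a dense matrix can be sketched in NumPy. The array layout, the function name `aqlm_style_decode`, and the use of one scale per output row are assumptions made for this illustration; the released kernels use their own packed formats.

```python
import numpy as np


def aqlm_style_decode(codes, codebooks, scales):
    """Rebuild a dense weight matrix from additive codes.

    codes:     (d_out, d_in // g, M) integer indices
    codebooks: (M, 2**B, g) codewords
    scales:    (d_out,) one scale per output row (assumed layout)
    """
    d_out, num_groups, num_codebooks = codes.shape
    group_size = codebooks.shape[-1]
    groups = np.zeros((d_out, num_groups, group_size))
    for m in range(num_codebooks):
        # Look up one codeword per group in codebook m and add it to the running sum.
        groups += codebooks[m][codes[:, :, m]]
    return groups.reshape(d_out, num_groups * group_size) * scales[:, None]


rng = np.random.default_rng(0)
d_out, d_in, g, M, B = 4, 16, 8, 2, 8
codebooks = rng.normal(size=(M, 2**B, g))            # random stand-ins for learned codebooks
codes = rng.integers(0, 2**B, size=(d_out, d_in // g, M))
scales = rng.uniform(0.5, 1.5, size=d_out)
W_hat = aqlm_style_decode(codes, codebooks, scales)
print(W_hat.shape)
```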

The configurations are written in "M x B" notation. For 2 bits per weight the common settings are "1x16" (one codebook of 2^16 codewords over 8-weight groups, which is single-codebook vector quantization) and "2x8" (two codebooks of 2^8 codewords each, the genuinely additive case, where a group is the sum of two codewords). Both give 16 index bits per 8 weights, that is, 2 bits per weight before overhead; the smaller 2x8 codebooks are faster to look up while 1x16 is slightly more accurate. Larger or additional codebooks raise the rate toward 3 or 4 bits. The codebook overhead is why measured rates come out a little above nominal, for example 2.02 to 2.07 bits per parameter for the 2-bit settings. [1]
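The bit accounting can be made concrete with a small calculation. This sketch assumes 16-bit codebook entries, one 16-bit scale per output row, and a hypothetical 8192 x 8192 layer; it illustrates where the overhead comes from and is not meant to reproduce the model-wide figures reported in the paper.

```python
def bits_per_weight(d_out, d_in, num_codebooks, code_bits, group_size=8,
                    param_bits=16):
    """Return (nominal, effective) bits per weight for one d_out x d_in layer."""
    num_weights = d_out * d_in
    # M indices of B bits for every group of g weights.
    index_bits = num_weights // group_size * num_codebooks * code_bits
    # Each codebook stores 2**B codewords of g values.
    codebook_bits = num_codebooks * 2**code_bits * group_size * param_bits
    # Assumed: one 16-bit scale per output row.
    scale_bits = d_out * param_bits
    nominal = num_codebooks * code_bits / group_size
    effective = (index_bits + codebook_bits + scale_bits) / num_weights
    return nominal, effective


for name, (m, b) in {"1x16": (1, 16), "2x8": (2, 8)}.items():
    nominal, effective = bits_per_weight(8192, 8192, m, b)
    print(f"{name}: nominal {nominal:.2f}, effective {effective:.3f} bits/weight")
```

Under these assumptions the single large 1x16 codebook contributes noticeably more overhead than the two small 2x8 codebooks, and the overhead shrinks as the layer gets larger.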

### Layer-wise calibration

AQLM fits each layer to a small calibration set rather than to the weights in isolation. For a layer with weights W and calibration inputs X, it minimizes the error in the layer's output, the squared norm of (W minus W_hat) times X, which is equivalent to weighting the weight error by the input second-moment matrix X times X transpose, the same proxy-Hessian objective used by GPTQ and [QuIP](/wiki/quip). [1] This is the "input-adaptive" part: the codes and codebooks are pushed to be accurate on the weights that actually drive the layer's outputs.
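The equivalence between the output error and the second-moment-weighted weight error can be verified numerically. The shapes and the perturbed `W_hat` below are arbitrary stand-ins, not real layer data.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_samples = 6, 16, 64
W = rng.normal(size=(d_out, d_in))
W_hat = W + 0.05 * rng.normal(size=W.shape)   # stand-in for a quantized layer
X = rng.normal(size=(d_in, n_samples))        # calibration inputs, one per column

# Error measured on the layer outputs.
output_error = np.linalg.norm(W @ X - W_hat @ X, ord="fro") ** 2

# Same quantity written as a weight error weighted by X X^T.
H = X @ X.T
delta = W - W_hat
weighted_error = np.trace(delta @ H @ delta.T)

print(np.isclose(output_error, weighted_error))
```

Because the second form depends on the calibration data only through `H`, the matrix can be accumulated once per layer and reused throughout the optimization.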

The optimization alternates two phases. First, with the codebooks held fixed, AQLM searches for the best discrete codes by casting the assignment as inference in a Markov random field and running beam search, exactly the encoding step inherited from classic AQ. Second, with the codes held fixed, it updates the continuous codebook vectors and scales by gradient descent (Adam) to reduce the calibrated error. These two steps are repeated until the layer's reconstruction converges. [1]
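The alternating structure can be illustrated on a toy problem. This sketch deliberately simplifies the method: it minimizes plain reconstruction error instead of the calibrated objective, uses an exhaustive code search as a stand-in for beam search, and replaces the Adam updates with a closed-form least-squares solve. All function names are made up for the example.

```python
import itertools
import numpy as np


def assign_codes(groups, codebooks):
    """Code step: exhaustive search over code tuples (stand-in for beam search)."""
    M, K, _ = codebooks.shape
    combos = np.array(list(itertools.product(range(K), repeat=M)))    # (K**M, M)
    sums = sum(codebooks[m][combos[:, m]] for m in range(M))          # (K**M, g)
    dists = ((groups[:, None, :] - sums[None, :, :]) ** 2).sum(-1)    # (N, K**M)
    return combos[dists.argmin(axis=1)]                               # (N, M)


def update_codebooks(groups, codes, M, K):
    """Codebook step: least-squares fit with codes fixed (stand-in for Adam)."""
    N, g = groups.shape
    A = np.zeros((N, M * K))
    for m in range(M):
        A[np.arange(N), m * K + codes[:, m]] = 1.0   # one-hot selection per codebook
    C, *_ = np.linalg.lstsq(A, groups, rcond=None)   # (M*K, g)
    return C.reshape(M, K, g)


def reconstruction_error(groups, codes, codebooks):
    recon = sum(codebooks[m][codes[:, m]] for m in range(codebooks.shape[0]))
    return float(((groups - recon) ** 2).sum())


rng = np.random.default_rng(0)
groups = rng.normal(size=(256, 8))       # 256 weight groups of size g = 8
M, K = 2, 4
codebooks = rng.normal(size=(M, K, 8))
errors = []
for _ in range(5):
    codes = assign_codes(groups, codebooks)
    errors.append(reconstruction_error(groups, codes, codebooks))
    codebooks = update_codebooks(groups, codes, M, K)
    errors.append(reconstruction_error(groups, codes, codebooks))
print([round(e, 1) for e in errors])
```

In this toy version each step is optimal for the variables it changes, so the recorded error never increases from one step to the next.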

### Block-wise fine-tuning

A purely per-layer fit lets errors from successive layers accumulate. AQLM therefore adds a fine-tuning pass that operates at the granularity of a whole transformer block. After all the linear layers in a block have been quantized, it jointly optimizes the block's continuous parameters (the codebook vectors, the scale factors, and the remaining non-quantized parameters such as the RMSNorm weights) so that the quantized layers compensate for one another instead of each minimizing its own error. The discrete codes stay frozen during this step. This end-to-end calibration over the block is responsible for a large share of AQLM's accuracy at 2 bits and is one of the two innovations the paper highlights. [1]

## Results

On the Llama 2 family, AQLM stays close to the 16-bit model at 3 bits and remains usable at 2 bits, where naive methods collapse. The following WikiText2 and C4 perplexities are from the AQLM paper (lower is better), with the 16-bit baselines for reference. [1]

| Llama 2 | fp16 Wiki2 | AQLM ~3-bit Wiki2 | AQLM ~2-bit Wiki2 | AQLM ~2-bit C4 |
| --- | --- | --- | --- | --- |
| 7B | 5.12 | 5.46 | 6.64 | 8.56 |
| 13B | 4.57 | 4.82 | 5.65 | 7.51 |
| 70B | 3.12 | 3.36 | 3.94 | 5.72 |

At roughly 3 bits the degradation is small, about a quarter to a third of a perplexity point, and the gaps widen only in the extreme 2-bit regime, most of all on the smallest model. The separation between methods is clearest at 2 bits on the 70-billion-parameter model:

| Method (Llama 2 70B, ~2-bit) | Bits/param | WikiText2 perplexity |
| --- | --- | --- |
| fp16 baseline | 16 | 3.12 |
| GPTQ | 2 | 123.9 |
| QuIP (original) | 2 | 6.33 |
| QuIP# | 2.01 | 4.16 |
| AQLM | 2.07 | 3.94 |

Sources: AQLM paper for AQLM and QuIP#; QuIP paper for GPTQ and original QuIP. [1][4]

Scalar methods such as GPTQ break down at 2 bits, while AQLM and the lattice-codebook method QuIP# are the only entries that stay near the baseline. [1][4] AQLM also ships GPU and CPU kernels: on an RTX 3090 its 2-bit layers run faster than the corresponding fp16 layers, generating about 32 tokens per second for a 2-bit Llama 2 7B, and the authors report matching or beating optimized fp16 throughput at a far smaller memory footprint. [1]

## Relationship to other quantization methods

AQLM sits in the line of per-layer post-training quantizers that began with OBQ and GPTQ (Frantar et al., ICLR 2023), whose Hessian-weighted layer objective it reuses. [3] Where GPTQ and AWQ round weights to scalar grids and remain strong mainly at 4 bits, AQLM follows QuIP and QuIP# into the sub-3-bit regime by quantizing groups of weights jointly. [1][4][5] The contrast with QuIP# (Tseng et al., ICML 2024) is instructive: both appeared in early 2024 and reach near-baseline accuracy at 2 bits, but they make different bets. QuIP# first rotates the weights with a randomized [Hadamard transform](/wiki/hadamard_transform) to remove outliers and then quantizes 8-weight blocks to a fixed E8-lattice codebook, whereas AQLM learns its codebooks from the data and sums several of them. As each method added end-to-end fine-tuning, their published numbers converged, with the two trading the lead across model sizes. In raw decoding speed, QuIP#'s lattice lookups were reported to be faster than AQLM's multi-codebook lookups for small models, so AQLM tends to trade some inference speed for accuracy. [4][5]

The same group followed AQLM with PV-Tuning (Malinovskii et al., NeurIPS 2024), which goes beyond fine-tuning schemes that update only the continuous parameters or rely on straight-through estimation by also optimizing the discrete codes directly; applied on top of AQLM-style representations it achieved what the authors call the first clearly Pareto-optimal 2-bit Llama 2 models. [6] AQLM also helped establish learned multi-codebook quantization as a leading paradigm for extreme compression, influencing later vector-quantization methods such as Microsoft's VPTQ (EMNLP 2024) and GPTVQ, and running parallel to the trellis-coded QTIP from the QuIP authors. [5][6]

## Limitations

The chief practical cost of AQLM is calibration time. The beam-search encoding, alternating codebook updates, and block-wise fine-tuning are far more compute-intensive than one-shot methods like GPTQ, so quantizing a large model can take many GPU-hours, and the largest models can require on the order of a day of compute. [1] Inference also depends on custom kernels: although the released 2-bit kernels beat fp16 on memory-bound decoding, the additive multi-codebook lookups are heavier than scalar or lattice dequantization, which is why competing methods reported higher token throughput on some models. [1][5] Accuracy still degrades relative to the 16-bit model at 2 bits, most visibly on smaller models such as 7B, and the best results depend on the fine-tuning step rather than the layer-wise fit alone. Like all calibration-based quantizers, it needs representative calibration data and optimizes a local proxy objective, and its effective bitrate sits slightly above the nominal target because the codebooks and scales must themselves be stored. [1]

## References

1. Egiazarian, Vage; Panferov, Andrei; Kuznedelev, Denis; Frantar, Elias; Babenko, Artem; Alistarh, Dan. "Extreme Compression of Large Language Models via Additive Quantization." arXiv:2401.06118 (January 11, 2024); Proceedings of the 41st International Conference on Machine Learning (ICML 2024). https://arxiv.org/abs/2401.06118
2. Babenko, Artem; Lempitsky, Victor. "Additive Quantization for Extreme Vector Compression." IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014), pp. 931-938. https://openaccess.thecvf.com/content_cvpr_2014/html/Babenko_Additive_Quantization_for_2014_CVPR_paper.html
3. Frantar, Elias; Ashkboos, Saleh; Hoefler, Torsten; Alistarh, Dan. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv:2210.17323 (October 2022); ICLR 2023. https://arxiv.org/abs/2210.17323
4. Chee, Jerry; Cai, Yaohui; Kuleshov, Volodymyr; De Sa, Christopher. "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." arXiv:2307.13304 (July 2023); NeurIPS 2023. https://arxiv.org/abs/2307.13304
5. Tseng, Albert; Chee, Jerry; Sun, Qingyao; Kuleshov, Volodymyr; De Sa, Christopher. "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks." arXiv:2402.04396 (February 2024); ICML 2024. https://arxiv.org/abs/2402.04396
6. Malinovskii, Vladimir; et al. "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression." arXiv:2405.14852 (May 2024); NeurIPS 2024. https://arxiv.org/abs/2405.14852
7. Egiazarian, Vage; et al. AQLM reference implementation and prequantized models. GitHub, Vahe1994/AQLM. https://github.com/Vahe1994/AQLM