# QuIP / QuIP#

> Source: https://aiwiki.ai/wiki/quip
> Updated: 2026-06-08
> Categories: AI Infrastructure, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

QuIP (Quantization with Incoherence Processing) is a family of weight-only post-training [quantization](/wiki/quantization) methods for [large language models](/wiki/large_language_model) developed in the RelaxML group at Cornell University. The original QuIP, introduced by Jerry Chee, Yaohui Cai, [Volodymyr Kuleshov](/wiki/volodymyr_kuleshov), and [Christopher De Sa](/wiki/christopher_de_sa) in "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" and presented at NeurIPS 2023, was the first method to produce usable models at roughly two bits per weight, supported by a theoretical analysis of why such aggressive compression can work. [1] Its successor QuIP# (written "QuIP-sharp"), by Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa in "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks" (ICML 2024), sharpened the approach with the randomized [Hadamard transform](/wiki/hadamard_transform), lattice codebooks, and fine-tuning, and set the state of the art for extreme weight compression in the 2 to 4 bit range when it appeared. [2] The unifying idea is incoherence processing: rotating the weight and curvature matrices by random orthogonal transforms so that outliers are spread out and the weights become much easier to round.

## Overview

Both methods address the same engineering problem. A trained transformer stores most of its parameters as 16-bit floating point numbers, so a 70-billion-parameter model needs about 140 GB just to hold its weights. Weight-only [post-training quantization](/wiki/post_training_quantization) replaces those 16-bit weights with low-bit integers or codebook indices after training has finished, without any access to the original training pipeline, using only a small calibration set to fit the rounding. Activations are kept in higher precision. Because autoregressive decoding is bandwidth bound, shrinking the weights both fits larger models into a given amount of memory and speeds up generation, since fewer bytes must be read for each token. [1][2]
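The memory arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the helper name `weight_memory_gb` is made up for illustration, and the figure counts weights alone, ignoring codebooks, unquantized layers, activations, and the KV cache.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# A 70-billion-parameter model at several weight precisions.
for bits in (16, 4, 3, 2):
    print(f"{bits:>2} bits per weight: {weight_memory_gb(70e9, bits):.2f} GB")
```

At two bits per weight the same 70-billion-parameter model needs 17.5 GB for its weights, an eightfold reduction from the 16-bit figure.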

QuIP and QuIP# operate one linear layer at a time. For each weight matrix they minimize a local proxy for the error that quantization introduces into that layer's output, then move on. What distinguishes them from earlier per-layer methods is a preprocessing step, incoherence processing, that conditions the matrices before rounding, plus, in QuIP#, a codebook that quantizes groups of weights jointly rather than one at a time.

## Background: the outlier problem

The central obstacle to low-bit quantization is that the values being compressed are not uniformly distributed. A small number of weight and activation coordinates carry magnitudes far larger than the rest. These outliers stretch the numeric range that a quantizer must cover, so when only a handful of levels are available, almost all of the ordinary values collapse onto the same few codes and precision is wasted on the extremes. At 8 and 4 bits, methods such as [GPTQ](/wiki/gptq) and [AWQ](/wiki/awq) tame this well enough to keep accuracy close to the original model. At 2 and 3 bits the problem becomes severe: naive round-to-nearest quantization of OPT-30B at two bits per weight produces a WikiText2 perplexity above 41,000, meaning the model is destroyed, against a 16-bit baseline of about 9.6. [1]
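A toy experiment illustrates how outliers waste a low-bit grid. The sketch below uses synthetic Gaussian weights and two hand-placed outliers (both are assumptions for illustration, not data from the paper) and applies a plain uniform 2-bit round-to-nearest quantizer whose range is set by the largest magnitude.

```python
import numpy as np


def round_to_nearest(w, bits):
    """Uniform quantizer whose grid spans [-max|w|, +max|w|]."""
    levels = 2 ** bits
    scale = np.abs(w).max()
    step = 2 * scale / (levels - 1)  # spacing between adjacent levels
    codes = np.clip(np.round((w + scale) / step), 0, levels - 1)
    return codes * step - scale, codes.astype(int)


rng = np.random.default_rng(0)
bulk = rng.standard_normal(4096)               # ordinary weights
spiky = np.concatenate([bulk, [60.0, -60.0]])  # same weights plus two outliers


def bulk_error(w):
    """Relative 2-bit rounding error and codes used on the 4096 ordinary weights."""
    w_hat, codes = round_to_nearest(w, bits=2)
    err = np.linalg.norm(w_hat[:4096] - w[:4096]) / np.linalg.norm(w[:4096])
    return float(err), int(np.unique(codes[:4096]).size)


err_clean, codes_clean = bulk_error(bulk)
err_spiky, codes_spiky = bulk_error(spiky)
print(f"no outliers:   relative error {err_clean:.2f}, codes used {codes_clean}")
print(f"with outliers: relative error {err_spiky:.2f}, codes used {codes_spiky}")
```

With the outliers present, the grid must stretch to cover them, so the ordinary weights fall onto fewer codes and their reconstruction error grows sharply.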

Most prior approaches attack outliers head-on, for example by keeping a few salient channels in high precision, by per-channel scaling (AWQ), or by learning clipping thresholds. QuIP takes a different route: instead of protecting outliers, it removes the coordinate structure that creates them in the first place.

## Incoherence processing

QuIP formalizes the per-layer objective as minimizing the quadratic proxy loss tr((W_hat - W) H (W_hat - W)^T), where W is the original weight matrix, W_hat its quantized version, and H is a proxy Hessian equal to the second moment of the layer's input activations measured on calibration data. This is the same layerwise reconstruction objective used by GPTQ. [1]
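A short NumPy sketch makes the objective concrete. It builds H as the second moment of some synthetic calibration inputs and checks that the trace expression equals the mean squared change in the layer's output. The function names, the toy sizes, and the crude stand-in quantizer are all assumptions for illustration.

```python
import numpy as np


def proxy_hessian(X):
    """Second moment of the layer inputs; X has shape (num_samples, n)."""
    return X.T @ X / X.shape[0]


def proxy_loss(W_hat, W, H):
    """tr((W_hat - W) H (W_hat - W)^T) for weight matrices of shape (m, n)."""
    D = W_hat - W
    return float(np.trace(D @ H @ D.T))


rng = np.random.default_rng(0)
X = rng.standard_normal((512, 16))  # 512 calibration inputs, 16 features
W = rng.standard_normal((8, 16))
W_hat = np.round(W * 2) / 2         # stand-in quantizer: grid with step 0.5

H = proxy_hessian(X)
loss = proxy_loss(W_hat, W, H)

# The same quantity computed directly from the layer outputs.
direct = float(np.mean(np.sum((X @ (W_hat - W).T) ** 2, axis=1)))
print(loss, direct)
```

The two numbers agree up to floating-point error, which is why H can stand in for the calibration data once it has been accumulated.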

QuIP calls a weight matrix incoherent when none of its entries is unusually large relative to the typical magnitude, made precise by bounding the maximum absolute entry by a small multiple of the root-mean-square entry. It calls a Hessian incoherent when its eigenvectors are not aligned with the coordinate axes, so that no single input coordinate dominates. Incoherence in this sense amounts to the absence of outliers. QuIP's key observation is that quantization is provably and empirically easier when both the weights and the Hessian are incoherent. [1]
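One way to make the definition tangible is to compute a max-to-RMS ratio. The sketch below is an illustration under assumptions: the function names are hypothetical, and for the Hessian it measures the largest eigenvector entry against 1/sqrt(n), the RMS entry of any unit-norm eigenvector in n dimensions. A value near 1 means very flat and a value near the maximum means a single coordinate dominates.

```python
import numpy as np


def weight_incoherence(W):
    """Largest absolute entry divided by the root-mean-square entry."""
    return float(np.abs(W).max() / np.sqrt(np.mean(W ** 2)))


def hessian_incoherence(H):
    """Largest eigenvector entry relative to 1/sqrt(n)."""
    _, Q = np.linalg.eigh(H)
    return float(np.abs(Q).max() * np.sqrt(H.shape[0]))


rng = np.random.default_rng(0)
W_smooth = rng.standard_normal((64, 64))
W_spiky = W_smooth.copy()
W_spiky[3, 7] = 200.0                   # one synthetic outlier weight

H_axis = np.diag(np.arange(1.0, 65.0))  # eigenvectors are the coordinate axes
A = rng.standard_normal((64, 64))
H_mixed = A @ A.T / 64                  # eigenvectors in general position

print(weight_incoherence(W_smooth), weight_incoherence(W_spiky))
print(hessian_incoherence(H_axis), hessian_incoherence(H_mixed))
```

The axis-aligned Hessian reaches the worst possible value, sqrt(64) = 8, while the spiky weight matrix scores far higher than the smooth one.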

To enforce this, QuIP multiplies the weight matrix on both sides by random orthogonal matrices, so that W becomes U W V^T, and transforms the Hessian to V H V^T, using the same V so that the proxy loss is left unchanged. Because the transforms are orthogonal they can be folded into the surrounding computation and inverted at inference time, so the network computes the same function. With high probability the rotation makes the matrices incoherent: it takes a few concentrated outliers and smears their energy across all coordinates, turning a spiky distribution into a smooth, near-Gaussian one that a low-bit grid can represent evenly. The effect is dramatic: with incoherence processing, even plain round-to-nearest becomes viable at two bits, cutting the OPT-30B WikiText2 perplexity from more than 41,000 to about 12.0. [1]
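The three properties claimed here (the proxy loss is preserved, the network function is preserved, and outliers are spread out) can be checked numerically. The sketch below uses dense random orthogonal matrices drawn by QR factorization, synthetic weights with one planted outlier, and plain rounding as a stand-in quantizer. All of these are assumptions for illustration, not the structured transforms QuIP uses in practice.

```python
import numpy as np


def random_orthogonal(n, rng):
    """Random orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))  # flip column signs; Q stays orthogonal


def proxy_loss(W_hat, W, H):
    D = W_hat - W
    return float(np.trace(D @ H @ D.T))


def max_over_rms(M):
    return float(np.abs(M).max() / np.sqrt(np.mean(M ** 2)))


rng = np.random.default_rng(0)
m, n = 32, 64
W = rng.standard_normal((m, n))
W[5, 9] = 80.0                      # one synthetic outlier weight
X = rng.standard_normal((256, n))
X[:, 9] *= 20.0                     # one synthetic outlier input channel
H = X.T @ X / X.shape[0]

U, V = random_orthogonal(m, rng), random_orthogonal(n, rng)
W_rot = U @ W @ V.T                 # rotated weights
H_rot = V @ H @ V.T                 # Hessian in the same rotated basis

W_hat_rot = np.round(W_rot)         # stand-in quantizer in the rotated basis
W_hat = U.T @ W_hat_rot @ V         # quantized weights mapped back

x = rng.standard_normal(n)
y_direct = W @ x
y_rotated = U.T @ (W_rot @ (V @ x)) # rotate input, multiply, rotate output back

print(proxy_loss(W_hat_rot, W_rot, H_rot), proxy_loss(W_hat, W, H))
print(max_over_rms(W), max_over_rms(W_rot))
```

The two loss values match, the two outputs match, and the max-to-RMS ratio of the rotated weights is much smaller than that of the original matrix.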

## QuIP

The original QuIP pairs incoherence processing with an adaptive rounding routine called LDLQ, named for the LDL (Cholesky-style) decomposition of the Hessian H that drives it. LDLQ rounds the columns of the weight matrix one at a time, and after fixing each column it feeds the resulting rounding error forward into the columns not yet processed through a linear correction derived from the LDL factors. This lets later weights compensate for the mistakes made on earlier ones. The paper proves that LDLQ is optimal, in both a worst-case and an average-case sense, within the broad class of adaptive rounding methods that use linear feedback, and it shows that GPTQ's rounding belongs to this same family, which places the widely used method on a firmer theoretical footing. [1]
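The column-by-column feedback can be sketched in NumPy. This is a minimal illustration under assumptions, not the reference implementation: it factors H as M diag(d) M^T with M unit upper triangular, uses the strictly upper part of M as the feedback matrix, and rounds to an unbounded integer grid as a stand-in for a real low-bit grid. The names `unit_upper_ldl` and `ldlq` and the toy data are hypothetical.

```python
import numpy as np


def unit_upper_ldl(H):
    """Factor H = M @ diag(d) @ M.T with M unit upper triangular."""
    # Cholesky of the index-reversed matrix, reversed back, is upper triangular.
    L = np.linalg.cholesky(H[::-1, ::-1])
    R = L[::-1, ::-1]                 # H = R @ R.T with R upper triangular
    r = np.diag(R)
    return R / r, r ** 2              # divide column j by R[j, j]


def ldlq(W, H, quantize=np.round):
    """Round columns left to right, feeding earlier errors into later columns."""
    M, _ = unit_upper_ldl(H)
    U = M - np.eye(H.shape[0])        # strictly upper triangular feedback
    W_hat = np.zeros_like(W)
    for k in range(W.shape[1]):
        err = W[:, :k] - W_hat[:, :k]  # rounding errors already committed
        W_hat[:, k] = quantize(W[:, k] + err @ U[:k, k])
    return W_hat


def proxy_loss(W_hat, W, H):
    D = W_hat - W
    return float(np.trace(D @ H @ D.T))


rng = np.random.default_rng(0)
n = 64
# Correlated toy activations: a low-rank signal plus a little noise.
X = rng.standard_normal((256, 8)) @ rng.standard_normal((8, n))
X += 0.1 * rng.standard_normal((256, n))
H = X.T @ X / 256 + 1e-3 * np.eye(n)  # small damping keeps H positive definite
W = 2.0 * rng.standard_normal((32, n))

loss_rtn = proxy_loss(np.round(W), W, H)
loss_ldlq = proxy_loss(ldlq(W, H), W, H)
print(f"round-to-nearest: {loss_rtn:.1f}   feedback rounding: {loss_ldlq:.1f}")
```

On this correlated toy Hessian the feedback version reaches a lower proxy loss than independent rounding, because later columns absorb part of the error made on earlier ones.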

For incoherence processing to be cheap, QuIP does not draw a fully dense random orthogonal matrix. It uses a Kronecker product of two smaller random orthogonal factors, which lets the multiply run in roughly O(n(p + q)) operations for an n = pq dimensional space rather than O(n^2). [1] QuIP was the first quantization algorithm with an end-to-end theoretical analysis that scales to LLM-sized models, and the analysis quantifies how incoherence reduces the proxy loss. Empirically it delivered the first usable two-bit LLMs: on Llama 2 70B it reaches a WikiText2 perplexity of 6.33 at two bits, against 3.32 for the 16-bit model and 123.9 for GPTQ (called OPTQ in the paper) at the same bit width. [1] The main practical drawback was inference overhead, since the Kronecker rotation added work on the activation path, raising per-token latency on OPT-66B from about 53 ms for GPTQ to about 81 ms. [1]
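The saving from the Kronecker structure comes from a standard identity: multiplying by the Kronecker product of A and B is the same as reshaping the vector into a p-by-q matrix X and computing A X B^T. The sketch below checks this against the dense product. The helper names and the sizes are illustrative assumptions.

```python
import numpy as np


def random_orthogonal(n, rng):
    Q, R = np.linalg.qr(rng.standard_normal((n, n)))
    return Q * np.sign(np.diag(R))


def kron_apply(A, B, x):
    """Compute np.kron(A, B) @ x without forming the n-by-n matrix."""
    p, q = A.shape[0], B.shape[0]
    X = x.reshape(p, q)             # row-major view of the length-pq vector
    # A @ X costs about n*p multiplies and (.) @ B.T about n*q, with n = p*q.
    return (A @ X @ B.T).reshape(-1)


rng = np.random.default_rng(0)
p, q = 8, 16
A, B = random_orthogonal(p, rng), random_orthogonal(q, rng)
x = rng.standard_normal(p * q)

fast = kron_apply(A, B, x)
dense = np.kron(A, B) @ x           # reference: explicit 128-by-128 matrix
print(np.allclose(fast, dense))
```

Because both factors are orthogonal, their Kronecker product is orthogonal too, so the transform preserves vector norms.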

## QuIP#

QuIP# keeps incoherence processing but improves it on three fronts. [2]

First, it replaces QuIP's Kronecker random orthogonal matrices with the randomized Hadamard transform (RHT), the product of a Hadamard matrix, whose entries are all plus or minus one and which is scaled by 1/sqrt(n) to make it orthogonal, and a random diagonal of sign flips. The RHT can be applied with a fast Walsh-Hadamard transform in O(n log n) time, is cheaper than the Kronecker approach at both quantization and inference time, and comes with tighter incoherence guarantees. Because the transform is so cheap, it can be done on the fly during the forward pass, which removes the latency penalty that burdened the original QuIP. [2]
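A compact NumPy sketch of the transform on a single vector follows. It assumes a power-of-two length and the Sylvester (natural-order) Hadamard matrix, and the function names are hypothetical. It illustrates the idea only and is unrelated to the fused kernels used in practice.

```python
import numpy as np


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = np.array(x, dtype=float)
    n = y.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        y = y.reshape(-1, 2, h)       # split each block of 2h into two halves
        a, b = y[:, 0, :], y[:, 1, :]
        y = np.stack((a + b, a - b), axis=1).reshape(n)  # butterfly step
        h *= 2
    return y / np.sqrt(n)             # scaling makes the transform orthogonal


def rht(x, signs):
    """Randomized Hadamard transform: random sign flips, then Hadamard."""
    return fwht(signs * x)


def rht_inverse(y, signs):
    """The scaled Hadamard matrix is its own inverse, so undo in reverse order."""
    return signs * fwht(y)


rng = np.random.default_rng(0)
n = 256
signs = rng.choice([-1.0, 1.0], size=n)
x = np.zeros(n)
x[17] = 10.0                          # a single extreme outlier
y = rht(x, signs)
print(np.abs(y).max(), np.linalg.norm(y))
```

The single outlier of magnitude 10 is spread into 256 entries that all have magnitude 10/16 = 0.625, while the vector norm stays at 10.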

Second, QuIP# moves from scalar rounding to [vector quantization](/wiki/vector_quantization). After incoherence processing the weights look approximately like independent, identically distributed Gaussian values, a ball-shaped distribution in which most of the mass sits near a spherical shell rather than at the corners of a hypercube. Rounding each weight separately to a grid wastes bits on the empty corners. Instead QuIP# quantizes groups of eight weights at once to the nearest point of a codebook built on the E8 lattice, the lattice that gives the densest sphere packing in eight dimensions and therefore an efficient eight-dimensional quantizer for Gaussian sources. Its E8P ("E8 padded") codebook spends two bits per weight, sixteen bits across the eight-weight block, yet exploits the lattice's symmetry so that all 2^16 entries are generated from a stored table of only 256 codewords (about 1 KiB), keeping the lookup small and cache-friendly. Higher rates of three and four bits per weight are reached by quantizing the residual with a second lattice-codebook stage, a form of residual vector quantization. [2][5]
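The benefit of rounding eight weights jointly can be illustrated with the classic nearest-point rule for the E8 lattice, written as the union of D8 (integer vectors with an even coordinate sum) and D8 shifted by one half. Note the assumptions: this sketch rounds to the infinite lattice, not to the finite E8P codebook that QuIP# actually stores, and the scaled Gaussian blocks are synthetic. E8 has the same number of points per unit volume as the integer grid, so the comparison with scalar rounding is at equal density.

```python
import numpy as np


def nearest_d8(x):
    """Nearest point of D8: integer vectors whose coordinates sum to an even number."""
    r = np.round(x)
    if int(r.sum()) % 2 != 0:
        # Wrong parity: re-round the coordinate with the largest rounding
        # error in the other direction, which is the cheapest repair.
        k = int(np.argmax(np.abs(x - r)))
        r[k] += 1.0 if x[k] > r[k] else -1.0
    return r


def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2)."""
    x = np.asarray(x, dtype=float)
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    return a if np.sum((x - a) ** 2) <= np.sum((x - b) ** 2) else b


rng = np.random.default_rng(0)
blocks = 4.0 * rng.standard_normal((2000, 8))  # synthetic 8-weight blocks

err_e8 = np.array([np.sum((b - nearest_e8(b)) ** 2) for b in blocks])
err_grid = np.sum((blocks - np.round(blocks)) ** 2, axis=1)
print(f"mean squared error per block: E8 {err_e8.mean():.3f}, "
      f"integer grid {err_grid.mean():.3f}")
```

Averaged over many blocks, the lattice gives a lower squared error than rounding each coordinate separately, which is the packing advantage the codebook exploits.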

Third, QuIP# adds a fine-tuning step to capture interactions between layers that the purely per-layer proxy cannot. Within each transformer block it tunes the parameters left unquantized so far to reduce activation error, and across the network it fine-tunes remaining components such as the sign vectors, layer norms, and the language-model head so that the quantized layers compensate for one another rather than accumulating error independently. [2]

## Results

QuIP# produces the strongest reported two-bit numbers of its generation. On Llama 2 it nearly matches the 16-bit model at four bits and stays close even at two bits (WikiText2 perplexity, 2048-token context):

| Llama 2 model | fp16 | QuIP# 4-bit | QuIP# 3-bit | QuIP# 2-bit |
| --- | --- | --- | --- | --- |
| 7B | 5.47 | 5.56 | 5.79 | 6.66 |
| 13B | 4.88 | 4.95 | 5.10 | 5.74 |
| 70B | 3.32 | 3.38 | 3.56 | 4.16 |

Source: QuIP# paper. [2]

The gap between methods is clearest at two bits on the largest model, where incoherence processing and lattice codebooks separate the viable methods from the rest:

| Method (2-bit, Llama 2 70B) | WikiText2 perplexity |
| --- | --- |
| fp16 baseline | 3.32 |
| GPTQ / OPTQ | 123.9 |
| OmniQuant | 7.81 |
| QuIP (original) | 6.33 |
| QuIP# | 4.16 |

Sources: QuIP and QuIP# papers. [1][2]

QuIP# also made low-bit inference fast in practice. Its fused CUDA kernels sustain more than half of a GPU's peak memory bandwidth, so on an RTX 4090 a two-bit Llama 2 7B runs at roughly 170 tokens per second, well above the additive-quantization baseline AQLM at around 20 tokens per second for the same model, and a two-bit Llama 2 70B runs at about 33 tokens per second. [2] The authors released more than forty prequantized models on Hugging Face under the relaxml organization, including Llama 1 and Llama 2 from 7B to 70B and Mistral 7B. [5] A notable empirical finding was that three-bit QuIP# models can scale better than four-bit ones, hinting that the best accuracy per bit may lie below four bits. [2][5]

## Relationship to other quantization methods

QuIP sits within the line of per-layer post-training quantizers that began with OBQ and GPTQ. GPTQ (Frantar et al., ICLR 2023) introduced the layerwise Hessian-based proxy and error-feedback rounding that QuIP generalizes; the QuIP paper shows GPTQ to be a member of the adaptive-rounding-with-linear-feedback class for which LDLQ is optimal. [1][7] AWQ (Lin et al., 2023) protects salient weight channels by per-channel scaling rather than rotation, and OmniQuant (Shao et al., 2023) learns clipping and scaling parameters; both are strong at 4 bits but fall behind QuIP# at 2 bits. [2][8][9] AQLM (Egiazarian et al., 2024) is a contemporaneous additive-quantization method that reaches similar accuracy at two bits but at lower inference speed. [2][10]

The incoherence-by-rotation idea proved broadly influential. Rotation-based schemes such as QuaRot and SpinQuant adopted Hadamard or learned orthogonal transforms to suppress outliers, extending the technique to activations and the KV cache rather than weights alone. [11] The direct successor from the same Cornell group is QTIP (Tseng et al., NeurIPS 2024), which keeps incoherence processing but replaces the eight-dimensional lattice codebook with trellis-coded quantization, escaping the exponential cost that caps vector quantization near eight dimensions and pushing effective quantization to hundreds of dimensions for further quality gains. [6] Together, QuIP, QuIP#, and QTIP trace the path by which incoherence processing became a standard ingredient of extreme LLM weight compression.

## References

1. Chee, Jerry; Cai, Yaohui; Kuleshov, Volodymyr; De Sa, Christopher. "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." arXiv:2307.13304 (July 25, 2023); Advances in Neural Information Processing Systems 36 (NeurIPS 2023). https://arxiv.org/abs/2307.13304
2. Tseng, Albert; Chee, Jerry; Sun, Qingyao; Kuleshov, Volodymyr; De Sa, Christopher. "QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks." arXiv:2402.04396 (February 6, 2024); Proceedings of the 41st International Conference on Machine Learning (ICML 2024). https://arxiv.org/abs/2402.04396
3. "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." NeurIPS 2023 poster page. https://neurips.cc/virtual/2023/poster/69982
4. Chee, Jerry; et al. QuIP reference implementation. GitHub, Cornell-RelaxML/QuIP. https://github.com/Cornell-RelaxML/QuIP
5. Tseng, Albert; et al. QuIP# reference implementation, codebook details, and prequantized model zoo. GitHub, Cornell-RelaxML/quip-sharp. https://github.com/Cornell-RelaxML/quip-sharp
6. Tseng, Albert; Sun, Qingyao; Hou, David; De Sa, Christopher. "QTIP: Quantization with Trellises and Incoherence Processing." arXiv:2406.11235 (June 17, 2024); NeurIPS 2024. https://arxiv.org/abs/2406.11235
7. Frantar, Elias; Ashkboos, Saleh; Hoefler, Torsten; Alistarh, Dan. "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv:2210.17323 (October 2022); ICLR 2023. https://arxiv.org/abs/2210.17323
8. Lin, Ji; Tang, Jiaming; Tang, Haotian; et al. "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978 (June 2023); MLSys 2024. https://arxiv.org/abs/2306.00978
9. Shao, Wenqi; Chen, Mengzhao; Zhang, Zhaoyang; et al. "OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models." arXiv:2308.13137 (August 2023); ICLR 2024. https://arxiv.org/abs/2308.13137
10. Egiazarian, Vage; Panferov, Andrei; Kuznedelev, Denis; et al. "Extreme Compression of Large Language Models via Additive Quantization (AQLM)." arXiv:2401.06118 (January 2024); ICML 2024. https://arxiv.org/abs/2401.06118
11. Ashkboos, Saleh; Mohtashami, Amirkeivan; Croci, Maximilian L.; et al. "QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs." arXiv:2404.00456 (March 2024); NeurIPS 2024. https://arxiv.org/abs/2404.00456

