KV-cache quantization
Last reviewed: Jun 8, 2026
Sources: 8 citations
Review status: Source-backed
Revision: v1 · 2,231 words
KV cache quantization is a family of large language model inference optimizations that store the attention key and value (KV) cache in low-bit numeric formats, typically 2 to 4 bits per value, instead of the 16-bit floating-point (FP16) representation used during ordinary decoding. Because the KV cache grows linearly with both sequence length and batch size and quickly becomes the dominant consumer of GPU memory in long-context and high-throughput serving, compressing it lets a fixed amount of hardware hold longer contexts, larger batches, and more concurrent requests. The approach is a specialization of quantization to the particular statistics of cached keys and values, and it sits alongside other KV-cache-reduction strategies such as token eviction and attention architectures that produce smaller caches in the first place.
The label covers several named techniques published mostly in 2023 and 2024. The two most cited are KIVI, a tuning-free 2-bit method, and KVQuant, a near-lossless scheme aimed at extreme context lengths; others include WKVQuant, GEAR, and Atom. These methods differ in granularity (per-channel versus per-token), in whether they use uniform or non-uniform number formats, in how they isolate outliers, and in whether they target the cache alone or quantize weights and activations as well.
During autoregressive generation a transformer caches the key and value vectors it computes for every past token at every layer, so that each new token can attend over the stored history without recomputing it. This cache is what makes incremental decoding fast, but its size is governed by
cache_bytes = 2 x layers x kv_heads x head_dim x sequence_length x batch_size x bytes_per_value,
where the leading factor of 2 accounts for storing both keys and values. Two properties make this expensive. First, the cache scales linearly with sequence length and batch size, so doubling the context or doubling the number of concurrent requests doubles the memory. Second, unlike model weights, which are fixed, the cache is allocated per request and per token at run time, so it competes directly with the memory available for batching.
A worked example shows the scale. Take the dimensions of Llama-2-70B under standard multi-head attention: 80 layers, 64 key/value heads, and a head dimension of 128. (The released Llama-2-70B uses grouped-query attention with 8 key/value heads, which makes every figure here 8x smaller.) In 16-bit precision each token then needs 2 x 80 x 64 x 128 x 2 bytes, about 2.5 MB. A single 32,768-token sequence consumes roughly 80 GB, the entire memory of an 80 GB GPU before any weights are loaded, and a batch of eight 4,096-token requests needs the same amount. Storing the cache at 4 bits cuts this by about 4x relative to 16-bit, and at 2 bits by about 8x, which translates directly into longer feasible contexts and larger batches.
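The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch of the size formula only: the function name `kv_cache_bytes` is made up for illustration, and the calculation ignores the scales and zero points that a real quantized cache also stores.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bits_per_value):
    """Cache size in bytes, following the formula in the text."""
    bytes_per_value = bits_per_value / 8
    # Leading 2: one tensor for keys and one for values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_value

MIB = 1024 ** 2
GIB = 1024 ** 3

# Dimensions used in the worked example (64 key/value heads).
shape = dict(layers=80, kv_heads=64, head_dim=128)

per_token = kv_cache_bytes(**shape, seq_len=1, batch_size=1, bits_per_value=16)
print(per_token / MIB)  # 2.5

for bits in (16, 4, 2):
    total = kv_cache_bytes(**shape, seq_len=32_768, batch_size=1, bits_per_value=bits)
    print(bits, total / GIB)  # 16 -> 80.0, 4 -> 20.0, 2 -> 10.0
```

Changing `kv_heads` or `batch_size` in the call shows how each term in the formula scales the total.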
Quantization maps a group of high-precision numbers onto a small set of discrete levels described by a scale and a zero point. The central design choice for the KV cache is the granularity, meaning which values share a scale. Per-tensor quantization shares one scale across everything and is cheap but inaccurate; per-token quantization gives each token's vector its own scale; per-channel quantization gives each feature dimension its own scale across tokens. A second choice is the number format: uniform integer levels, or non-uniform levels placed to match the data distribution.
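As an illustration of the scale and zero point, the sketch below quantizes one group of values onto uniform integer levels and maps them back. It assumes one common convention, in which the zero point is stored as the group's minimum value; the function names are illustrative and not taken from any particular library.

```python
def quantize(values, bits):
    """Asymmetric uniform quantization of one group sharing a scale and zero point."""
    levels = (1 << bits) - 1               # highest integer code, e.g. 3 for 2 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo                # lo plays the role of the zero point

def dequantize(codes, scale, zero):
    """Map integer codes back to approximate real values."""
    return [c * scale + zero for c in codes]

group = [-1.0, -0.2, 0.1, 0.7, 2.0]
codes, scale, zero = quantize(group, bits=2)
restored = dequantize(codes, scale, zero)
print(codes)      # [0, 1, 1, 2, 3]
print(restored)   # [-1.0, 0.0, 0.0, 1.0, 2.0]
```

Every value lands within half a scale step of its original, so a group with a narrower range gets a smaller scale and a smaller error. That is why the choice of which values share a scale matters.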
The decisive empirical finding behind modern KV-cache methods is that keys and values have different distributions and need different schemes. Key vectors contain persistent outlier channels: a few feature dimensions carry magnitudes far larger than the rest, and these "hot" channels are roughly the same from one token to the next. Value vectors show no such consistent channel structure. This asymmetry motivates quantizing keys per-channel, so that an outlier channel's wide range is confined to its own scale rather than inflating the scale of well-behaved channels, while quantizing values per-token [1][2]. Per-token value quantization also limits error propagation, because the attention output is a weighted sum (a mixer) over value vectors, and keeping each token's value self-contained stops one token's error from contaminating the others [1].
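The effect of a persistent outlier channel on keys can be reproduced on synthetic data. The sketch below is a toy illustration, not a measurement on a real model: the key matrix is random, and channel 3 is given an artificially large offset to stand in for a "hot" channel. It compares the reconstruction error of per-token and per-channel scales at 4 bits.

```python
import random

def quantize_dequantize(group, bits):
    """Round-trip one group through uniform levels sharing a scale and zero point."""
    levels = (1 << bits) - 1
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in group]

def per_token(matrix, bits):
    # Each row (one token's vector) gets its own scale.
    return [quantize_dequantize(row, bits) for row in matrix]

def per_channel(matrix, bits):
    # Each column (one feature dimension across tokens) gets its own scale.
    columns = [quantize_dequantize(list(col), bits) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

def mean_abs_error(original, restored):
    total = sum(abs(a - b) for ro, rr in zip(original, restored) for a, b in zip(ro, rr))
    return total / (len(original) * len(original[0]))

random.seed(0)
tokens, channels, outlier_channel = 64, 16, 3
keys = [[random.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(tokens)]
for row in keys:
    row[outlier_channel] += 40.0   # synthetic persistent outlier channel

err_token = mean_abs_error(keys, per_token(keys, bits=4))
err_channel = mean_abs_error(keys, per_channel(keys, bits=4))
print(f"per-token error:   {err_token:.3f}")
print(f"per-channel error: {err_channel:.3f}")
```

With per-token scales the outlier stretches every row's range, so the ordinary channels are rounded coarsely. With per-channel scales the outlier only affects its own column.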
Per-channel key quantization introduces a systems problem: a per-channel scale needs statistics along the token dimension, but during decoding tokens arrive one at a time. Methods solve this in two ways. KIVI keeps the most recent tokens in a small full-precision residual buffer and quantizes keys in fixed-size groups once enough tokens have accumulated. KVQuant instead calibrates the per-channel scales and non-uniform levels offline on a sample of data and applies them at run time [1][2].
KIVI ("KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache"), by Zirui Liu and colleagues and presented at ICML 2024, is a 2-bit method that needs no training or calibration, which is what "tuning-free" denotes [1]. It quantizes the key cache per channel and the value cache per token, following the asymmetry described above. To make per-channel key quantization compatible with streaming decode, KIVI splits the cache into a quantized "grouped" part and a full-precision "residual" of the most recent tokens: new tokens enter the residual at full precision, and once the residual reaches a set length they are quantized in groups and moved into the grouped part. The released configuration uses a group size of 32 and a residual length of 128 tokens [1].
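The grouped-plus-residual layout for keys can be sketched as a small class. This is a simplified illustration under stated assumptions, not the released KIVI implementation: the class name `StreamingKeyCache` and its methods are hypothetical, codes are kept as Python integers rather than packed bits, and the value cache and attention kernels are left out. The defaults follow the group size of 32 and residual length of 128 quoted above.

```python
import random

class StreamingKeyCache:
    """Toy key cache: per-channel quantized groups plus a full-precision residual."""

    def __init__(self, bits=2, group_size=32, residual_length=128):
        assert residual_length % group_size == 0
        self.bits = bits
        self.group_size = group_size
        self.residual_length = residual_length
        self.groups = []     # each entry: (codes_by_channel, scales, zeros)
        self.residual = []   # most recent key vectors, kept in full precision

    def append(self, key):
        self.residual.append(list(key))
        if len(self.residual) == self.residual_length:
            self._flush()

    def _flush(self):
        # Quantize the full residual in fixed-size groups, then empty it.
        for start in range(0, self.residual_length, self.group_size):
            block = self.residual[start:start + self.group_size]
            self.groups.append(self._quantize_per_channel(block))
        self.residual = []

    def _quantize_per_channel(self, block):
        levels = (1 << self.bits) - 1
        codes_by_channel, scales, zeros = [], [], []
        for channel in zip(*block):   # one feature dimension across the group's tokens
            lo, hi = min(channel), max(channel)
            scale = (hi - lo) / levels if hi > lo else 1.0
            codes_by_channel.append(
                [min(levels, max(0, round((v - lo) / scale))) for v in channel])
            scales.append(scale)
            zeros.append(lo)
        return codes_by_channel, scales, zeros

    def keys(self):
        """Dequantized history followed by the full-precision residual."""
        out = []
        for codes_by_channel, scales, zeros in self.groups:
            channels = [[c * s + z for c in codes]
                        for codes, s, z in zip(codes_by_channel, scales, zeros)]
            out.extend(list(token) for token in zip(*channels))
        return out + self.residual

random.seed(0)
stream = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(300)]
cache = StreamingKeyCache()
for key in stream:
    cache.append(key)
print(len(cache.groups), len(cache.residual))   # 8 44
```

After 300 tokens the residual has been flushed twice, at 128 and 256 tokens. That leaves eight quantized groups of 32 tokens and the 44 newest tokens still in full precision.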
KIVI reports up to 2.6x lower peak memory (counting model weights), which it converts into batch sizes up to 4x larger and end-to-end throughput gains of 2.35x to 3.47x on Llama, Falcon, and Mistral models, while keeping accuracy on language-modeling and long-context benchmarks close to the 16-bit baseline [1].
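KVQuant ("KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization"), by Coleman Hooper and colleagues at the University of California, Berkeley and presented at NeurIPS 2024, targets near-lossless low-bit caches and extreme context lengths [2]. It combines four techniques: per-channel quantization of keys; quantizing keys before the rotary positional embedding (RoPE) is applied; non-uniform datatypes whose levels are fitted offline to the data distribution; and dense-and-sparse quantization, which stores a small set of outliers separately from the low-bit dense values [2].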
KVQuant reports less than 0.1 perplexity degradation with 3-bit quantization on WikiText-2 and C4 for LLaMA, Llama-2, Llama-3, and Mistral models, and with custom kernels it reaches speedups of up to roughly 1.7x over the 16-bit baseline. By shrinking the cache it enables serving LLaMA-7B with a 1 million token context on a single A100-80GB GPU, and up to 10 million tokens on an 8-GPU system, which is the source of the paper's title [2].
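Two of the ideas this article associates with KVQuant, offline-calibrated non-uniform levels and separate storage for outliers, can be illustrated on synthetic data. The sketch below is a stand-in under stated assumptions rather than the paper's procedure: it fits levels with plain one-dimensional k-means, skips per-channel normalization and pre-RoPE handling, and keeps outliers in an ordinary dictionary. All function names are hypothetical.

```python
import random

def nearest(levels, v):
    """Index of the level closest to v."""
    return min(range(len(levels)), key=lambda i: abs(levels[i] - v))

def fit_levels(samples, bits, iterations=15):
    """Plain 1-D k-means (Lloyd's algorithm) standing in for offline calibration."""
    k = 1 << bits
    ordered = sorted(samples)
    # Start from evenly spaced quantiles of the calibration data.
    levels = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in samples:
            buckets[nearest(levels, v)].append(v)
        levels = [sum(b) / len(b) if b else lv for b, lv in zip(buckets, levels)]
    return sorted(levels)

def encode(vector, levels, outlier_fraction=0.01):
    """Dense codes into `levels`, plus a sparse dict of full-precision outliers."""
    n_out = max(1, int(len(vector) * outlier_fraction))
    by_magnitude = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    sparse = {i: vector[i] for i in by_magnitude[:n_out]}
    dense = [0 if i in sparse else nearest(levels, v) for i, v in enumerate(vector)]
    return dense, sparse

def decode(dense, sparse, levels):
    return [sparse.get(i, levels[c]) for i, c in enumerate(dense)]

def mse(vector, levels):
    restored = decode(*encode(vector, levels), levels)
    return sum((a - b) ** 2 for a, b in zip(vector, restored)) / len(vector)

random.seed(0)
calibration = [random.gauss(0.0, 1.0) for _ in range(2000)]
levels = fit_levels(calibration, bits=3)
lo, hi = min(calibration), max(calibration)
uniform = [lo + i * (hi - lo) / 7 for i in range(8)]   # 3-bit uniform baseline

vector = [random.gauss(0.0, 1.0) for _ in range(512)]
vector[10] = 25.0                                       # synthetic outlier
print(f"uniform levels MSE:     {mse(vector, uniform):.4f}")
print(f"non-uniform levels MSE: {mse(vector, levels):.4f}")
```

Because the fitted levels cluster where the data is dense, they give a lower error than evenly spaced levels at the same bit width. The synthetic outlier is restored exactly from the sparse set instead of stretching the dense range.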
| Method | Venue | Target | Typical bits | Distinctive idea |
|---|---|---|---|---|
| KIVI | ICML 2024 | KV cache | 2 | tuning-free; per-channel keys, per-token values; full-precision residual window |
| KVQuant | NeurIPS 2024 | KV cache | 2 to 4 | pre-RoPE per-channel keys; non-uniform datatypes; dense-and-sparse outliers; up to 10M context |
| WKVQuant | arXiv 2024 | weights + KV cache | ~2 | past-only quantization; two-dimensional quantization |
| GEAR | arXiv 2024 | KV cache | ~2 | low-rank residual approximation plus sparse outlier correction |
| Atom | MLSys 2024 | weights + activations + KV | 4 | end-to-end low-bit serving system with fused kernels |
WKVQuant (Yue et al., 2024) jointly quantizes weights and the KV cache rather than the cache alone. It introduces past-only quantization, which keeps the current step's keys and values in full precision and quantizes only the already-cached past, plus a two-dimensional quantization strategy and a cross-block reconstruction regularizer; the result approaches weight-only accuracy while saving memory comparable to full weight-and-activation quantization [3].
GEAR ("GEnerative inference with Approximation error Reduction", Kang et al., 2024) models the error left by ultra-low-bit quantization explicitly: it quantizes the bulk of similarly scaled entries, approximates the coherent part of the quantization error with a low-rank matrix obtained by singular value decomposition, and corrects the remaining incoherent outliers with a sparse matrix. It reports up to 2.38x higher throughput and up to 2.29x lower peak memory than prior alternatives [4].
Atom (Zhao et al., MLSys 2024) is a serving-oriented low-bit scheme that quantizes weights, activations, and the KV cache together to 4 bits with mixed precision and fused kernels, reporting up to 7.73x the throughput of FP16; here KV cache quantization is one component of an end-to-end system rather than the whole method [5].
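GEAR's three-part decomposition (quantized bulk, low-rank error term, sparse outliers) can be sketched in pure Python on a small random matrix. This is a simplified illustration, not the authors' implementation: it uses a rank-1 term found by power iteration as a stand-in for a truncated singular value decomposition, picks outliers by magnitude before quantizing, and uses hypothetical function names throughout.

```python
import random

def quantize_rows(matrix, bits):
    """Uniform per-row quantization, returned already dequantized."""
    levels = (1 << bits) - 1
    out = []
    for row in matrix:
        lo, hi = min(row), max(row)
        scale = (hi - lo) / levels if hi > lo else 1.0
        out.append([round((v - lo) / scale) * scale + lo for v in row])
    return out

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rank_one(residual, iterations=10):
    """Rank-1 approximation of `residual` by power iteration (stand-in for SVD)."""
    cols = len(residual[0])
    transposed = [list(col) for col in zip(*residual)]
    v = [cols ** -0.5] * cols
    for _ in range(iterations):
        w = matvec(transposed, matvec(residual, v))   # (R^T R) v
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    u = matvec(residual, v)                           # sigma * u for unit-norm v
    return [[ui * vj for vj in v] for ui in u]

def compress(x, bits=2, n_outliers=8):
    """Return (quantized, low_rank, sparse); together they approximate x."""
    ranked = sorted(((abs(v), i, j) for i, row in enumerate(x) for j, v in enumerate(row)),
                    reverse=True)
    sparse = {(i, j): x[i][j] for _, i, j in ranked[:n_outliers]}
    # Outliers are zeroed in the dense part so they do not stretch a row's range.
    dense = [[0.0 if (i, j) in sparse else v for j, v in enumerate(row)]
             for i, row in enumerate(x)]
    quantized = quantize_rows(dense, bits)
    residual = [[0.0 if (i, j) in sparse else d - q
                 for j, (d, q) in enumerate(zip(drow, qrow))]
                for i, (drow, qrow) in enumerate(zip(dense, quantized))]
    return quantized, rank_one(residual), sparse

def reconstruct(quantized, low_rank, sparse, use_low_rank=True):
    return [[sparse[(i, j)] if (i, j) in sparse else q + (l if use_low_rank else 0.0)
             for j, (q, l) in enumerate(zip(qrow, lrow))]
            for i, (qrow, lrow) in enumerate(zip(quantized, low_rank))]

def squared_error(a, b):
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

random.seed(0)
x = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]
parts = compress(x)
err_plain = squared_error(x, reconstruct(*parts, use_low_rank=False))
err_gear = squared_error(x, reconstruct(*parts, use_low_rank=True))
print(f"quantized + sparse:            {err_plain:.2f}")
print(f"quantized + low-rank + sparse: {err_gear:.2f}")
```

On random data the quantization error has little shared structure, so the rank-1 term removes only part of it. The low-rank term pays off most when the error is coherent across tokens, which is the case GEAR targets.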
Quantization is one of three broad families for shrinking the KV cache, and the families are largely complementary.
Eviction and sparsity methods reduce the number of tokens kept rather than the bits per token. H2O (Heavy-Hitter Oracle, Zhang et al., 2023) observes that accumulated attention scores follow a power-law distribution and keeps only a budget of high-impact "heavy hitter" tokens plus a recent window, discarding the rest [6]. StreamingLLM and similar schemes keep a few initial "sink" tokens together with a sliding window. Such policies can be layered on top of quantization.
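Both eviction policies reduce to choosing which token positions to keep. The sketch below shows that selection step only, assuming the accumulated attention scores are already available. The function names and the example scores are illustrative and are not taken from the H2O or StreamingLLM code.

```python
def heavy_hitter_keep(accumulated_scores, heavy_budget, recent_window):
    """Keep the most recent tokens plus the highest-scoring older tokens."""
    n = len(accumulated_scores)
    recent = set(range(max(0, n - recent_window), n))
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)[:heavy_budget]
    return sorted(recent | set(heavy))

def sink_window_keep(n_tokens, sinks, window):
    """Keep a few initial 'sink' tokens plus a sliding window of recent tokens."""
    first = set(range(min(sinks, n_tokens)))
    last = set(range(max(0, n_tokens - window), n_tokens))
    return sorted(first | last)

scores = [9.0, 0.1, 0.2, 5.0, 0.1, 0.3, 0.2, 0.4, 0.1, 0.2]
print(heavy_hitter_keep(scores, heavy_budget=2, recent_window=3))  # [0, 3, 7, 8, 9]
print(sink_window_keep(len(scores), sinks=2, window=3))            # [0, 1, 7, 8, 9]
```

The kept positions could then be stored in a quantized cache, which is the sense in which eviction and quantization can be layered.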
Architectural methods change attention so that the cache is smaller by construction. Grouped-query attention and multi-query attention share key and value heads across query heads, cutting the kv_heads term in the formula above; multi-head latent attention, introduced with DeepSeek-V2, compresses keys and values into a single low-rank latent vector that is cached in place of the full per-head tensors. Because these reduce kv_heads or the effective head dimension, they multiply with quantization's bit reduction rather than competing with it.
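The multiplicative relationship is simple arithmetic. As a small sketch with a hypothetical helper name, sharing 64 query heads across 8 key/value heads and storing the cache at 4 bits instead of 16 combine as follows:

```python
def cache_reduction(query_heads, kv_heads, baseline_bits, cache_bits):
    """Combined shrink factor from sharing KV heads and lowering bit width."""
    head_factor = query_heads / kv_heads      # architectural saving
    bit_factor = baseline_bits / cache_bits   # quantization saving
    return head_factor * bit_factor

print(cache_reduction(query_heads=64, kv_heads=8, baseline_bits=16, cache_bits=4))  # 32.0
```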
Finally, full-model quantization methods such as SpinQuant, QuaRot, and QServe quantize weights and activations and also store the KV cache at 4 or 8 bits as part of end-to-end low-bit inference, distinguishing them from the cache-specialized methods above.
Across these methods the consistent finding is that the KV cache can be taken to 3 or 4 bits with negligible quality loss, and to 2 bits with small loss, given the right per-channel and per-token split together with explicit outlier handling [1][2]. The memory saved is direct: a 4-bit cache takes roughly one quarter, and a 2-bit cache roughly one eighth, of the 16-bit footprint, which is why the same GPU can hold far longer contexts or much larger batches.
The gains are not free. Quantizing and dequantizing the cache adds compute, and at small batch sizes this overhead can make generation slower rather than faster, because with a small cache the time per decoding step is dominated by other costs, such as reading the model weights, rather than by cache traffic; the throughput wins appear at large batch or long context, where the cache, not arithmetic, is the bottleneck. Hugging Face's measurements illustrate the trade-off: a 4-bit cache gives roughly 2.5x memory saving and lets an 80 GB A100 reach about 128k tokens versus 40k with a 16-bit cache, while combining cache and weight quantization can cut speed substantially at higher batch sizes [7]. Aggressive 2-bit settings also rely on keeping a small full-precision residual or a sparse outlier set, so the effective bit-rate is slightly above the nominal value.
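The gap between nominal and effective bit-rate can be estimated with a short calculation. The sketch below is a rough model under stated assumptions: it counts only a full-precision residual and a fraction of full-precision outliers, and it ignores per-group scales, zero points, and the indices needed to locate sparse outliers. The function name is hypothetical.

```python
def effective_bits(seq_len, nominal_bits, residual_length=0,
                   outlier_fraction=0.0, full_bits=16):
    """Average stored bits per cached value."""
    kept_full = min(seq_len, residual_length)   # newest tokens, full precision
    quantized = seq_len - kept_full
    per_quantized = (1 - outlier_fraction) * nominal_bits + outlier_fraction * full_bits
    return (quantized * per_quantized + kept_full * full_bits) / seq_len

# 2-bit cache with a 128-token full-precision residual.
print(effective_bits(32_768, 2, residual_length=128))   # 2.0546875
print(effective_bits(4_096, 2, residual_length=128))    # 2.4375
# 2-bit cache with 1% of values kept as 16-bit outliers (about 2.14 bits).
print(effective_bits(4_096, 2, outlier_fraction=0.01))
```

The residual's share shrinks as the context grows, so the overhead matters least in the long-context setting where low-bit caches are most useful.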
In production, the most widely deployed form is more conservative than the research frontier. Serving stacks such as vLLM and TensorRT-LLM expose 8-bit FP8 KV caches (and, in TensorRT-LLM, INT8 and INT4), running attention directly in the quantized format to roughly halve cache memory with minimal accuracy impact [8]. Hugging Face Transformers added 2-bit and 4-bit KV cache backends, via the quanto and HQQ libraries, in May 2024, bringing KIVI-style cache quantization to a mainstream library [7]. The 2-bit to 4-bit research methods remain most attractive for the longest contexts and largest batches, where their extra memory savings outweigh the added complexity and dequantization cost.