A KV cache (key-value cache) is a memory optimization technique used during transformer inference that stores previously computed key and value tensors from the attention mechanism so they do not need to be recalculated at each generation step. During autoregressive text generation, a large language model produces tokens one at a time, and each new token must attend to every previous token in the sequence. Without caching, the model would redundantly recompute the key and value projections for all prior tokens at every step, resulting in computation that grows quadratically with sequence length. By storing and reusing these intermediate results, the KV cache reduces per-step computation from O(n) projection operations to O(1), where only the new token's key and value vectors need to be computed and appended to the cache.
The KV cache is one of the most fundamental optimizations in modern LLM serving. Every major inference framework, including vLLM, TensorRT-LLM, and Hugging Face Transformers, implements KV caching by default. However, the cache introduces its own challenge: memory consumption. For large models with long context windows, the KV cache can consume tens of gigabytes of GPU memory, often exceeding the memory required by the model weights themselves. This has motivated an active area of research into KV cache compression, eviction, quantization, and memory management techniques.
To understand why the KV cache exists, it is necessary to understand how autoregressive language models generate text.
A decoder-only transformer (such as GPT, LLaMA, or Claude) generates text one token at a time. At each step t, the model takes all tokens produced so far (x_1, x_2, ..., x_t) and predicts the probability distribution over the next token x_{t+1}. This process is inherently sequential: the model cannot produce token t+1 until it has produced token t.
Inside each transformer layer, the self-attention mechanism computes three matrices from the input embeddings:

- Queries (Q), obtained by multiplying the input by a learned projection W_Q; each token's query asks which other tokens are relevant to it.
- Keys (K), obtained via W_K; each token's key describes what it contains, and is compared against queries.
- Values (V), obtained via W_V; the values carry the content that is actually aggregated according to the attention weights.
The attention output is computed as:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Because the model is causal (each token can only attend to itself and preceding tokens), at generation step t the model computes the query for the new token and must compare it against the keys of all t tokens. The resulting attention weights are then applied to the values of all t tokens.
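The following minimal NumPy sketch illustrates this formula with a causal mask; the function name `causal_attention` and the shapes are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d_head).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V                                    # (seq_len, d_head)
```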
Without KV caching, the model would recompute the K and V projections for every token at every generation step. At step t, this means computing t key vectors and t value vectors, even though the key and value vectors for tokens 1 through t-1 are identical to what they were at step t-1. This redundant computation is the problem that KV caching solves.
The KV cache operates by storing the key and value tensors computed at each generation step and reusing them in subsequent steps. The process unfolds in two distinct phases.
When the user submits a prompt, the model processes all input tokens in parallel during the prefill phase (also called the prompt encoding phase). For each transformer layer, the model computes the key and value projections for every token in the prompt and stores them in the cache. This phase is compute-bound because it involves large matrix multiplications across the full prompt length. The prefill phase determines the time to first token (TTFT), the latency the user experiences before seeing any output.
After the prefill phase, the model enters the decode phase, generating output tokens one at a time. At each step:

- The model computes the query, key, and value projections for the single new token only.
- The new key and value vectors are appended to the cache at every layer.
- The new token's query attends over the cached keys and values of all previous tokens (plus its own).
- The resulting output is used to predict the next token, and the loop repeats.
This means that at decode step t, only a single token's K and V projections need to be computed, while the K and V projections for all previous tokens are simply read from the cache. The decode phase is memory-bandwidth-bound rather than compute-bound: the bottleneck is loading the cached key and value tensors from GPU memory (HBM) into the compute units, not the arithmetic itself.
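A minimal single-layer, single-head sketch of the two phases might look like the following; the cache layout and function names are illustrative and not taken from any framework.

```python
import numpy as np

def attend(q, K, V):
    """Attention of a single query vector q over all cached keys/values."""
    d_k = q.shape[-1]
    scores = K @ q / np.sqrt(d_k)                 # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                            # (d_head,)

def prefill(prompt_x, W_q, W_k, W_v):
    """Process the whole prompt in parallel and build the initial cache."""
    K_cache = prompt_x @ W_k                      # (prompt_len, d_head)
    V_cache = prompt_x @ W_v
    return K_cache, V_cache

def decode_step(x_new, K_cache, V_cache, W_q, W_k, W_v):
    """Compute K/V for the new token only, append, and attend over the cache."""
    q = x_new @ W_q
    K_cache = np.vstack([K_cache, x_new @ W_k])   # append one row, reuse the rest
    V_cache = np.vstack([V_cache, x_new @ W_v])
    out = attend(q, K_cache, V_cache)
    return out, K_cache, V_cache
```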
Consider a single-layer, single-head transformer given the prompt "The cat sat", which then generates the continuation "on the ...":
| Step | Input token | Cache before step (K vectors) | Computation | Cache after step (K vectors) |
|---|---|---|---|---|
| 1 (prefill) | "The", "cat", "sat" | Empty | Compute K, V for all 3 tokens; store in cache | K_the, K_cat, K_sat |
| 2 (decode) | "on" | K_the, K_cat, K_sat | Compute K_on, V_on; append to cache; query attends over all 4 | K_the, K_cat, K_sat, K_on |
| 3 (decode) | "the" | K_the, K_cat, K_sat, K_on | Compute K_the2, V_the2; append; query attends over all 5 | K_the, K_cat, K_sat, K_on, K_the2 |
In a real model, this process happens independently at every layer and every attention head, so the total cache stores K and V tensors for every (layer, head, position) combination.
The KV cache stores two tensors (K and V) at each layer for each token. The memory required per token is:
KV cache per token (bytes) = 2 x n_layers x n_kv_heads x d_head x bytes_per_element
where:

- the factor of 2 accounts for storing both a key and a value tensor,
- n_layers is the number of transformer layers,
- n_kv_heads is the number of key-value heads (equal to the number of query heads in standard multi-head attention, smaller with GQA or MQA),
- d_head is the dimension of each attention head,
- bytes_per_element is the numeric precision (2 for FP16/BF16, 1 for INT8 or FP8).
Since n_kv_heads x d_head often equals the model's hidden dimension d_model (in the standard multi-head attention case), this simplifies to:
KV cache per token (bytes) = 2 x n_layers x d_model x bytes_per_element
The total KV cache memory for a batch of sequences is:
Total KV cache (bytes) = batch_size x seq_length x 2 x n_layers x n_kv_heads x d_head x bytes_per_element
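A small helper that evaluates these formulas reproduces the figures in the table that follows; the function name and defaults are illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head,
                   seq_length, batch_size=1, bytes_per_element=2):
    """Total KV cache size: 2 (K and V) x layers x kv_heads x d_head per token."""
    per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_element
    return batch_size * seq_length * per_token

# LLaMA 2 7B: 32 layers, 32 KV heads, d_head 128, FP16, 4,096-token sequence
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)   # ~2.0 GB

# LLaMA 2 70B with GQA: 80 layers, only 8 KV heads, d_head 128
print(kv_cache_bytes(80, 8, 128, 4096) / 2**30)    # ~1.25 GB
```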
The following table shows approximate KV cache sizes for popular models processing a single sequence of 4,096 tokens in FP16 (2 bytes per element):
| Model | Parameters | Layers | Heads (Q/KV) | d_head | d_model | KV cache per token | KV cache (4K tokens) |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 7B | 32 | 32/32 | 128 | 4,096 | 0.5 MB | 2.0 GB |
| LLaMA 2 13B | 13B | 40 | 40/40 | 128 | 5,120 | 0.8 MB | 3.2 GB |
| LLaMA 2 70B | 70B | 80 | 64/8 | 128 | 8,192 | 0.3 MB | 1.25 GB |
| Mistral 7B | 7B | 32 | 32/8 | 128 | 4,096 | 0.125 MB | 0.5 GB |
| GPT-3 175B | 175B | 96 | 96/96 | 128 | 12,288 | 4.5 MB | 18 GB |
Note that LLaMA 2 70B uses grouped-query attention with only 8 KV heads (instead of 64), reducing its KV cache by 8x compared to what it would be with standard multi-head attention. Similarly, Mistral 7B uses 8 KV heads instead of 32, achieving a 4x reduction.
KV cache memory grows linearly along three axes:

- Batch size: each concurrent sequence needs its own cache.
- Sequence length: every prompt or generated token adds one cache entry per layer.
- Model size: more layers and a larger KV width (n_kv_heads x d_head) increase the per-token cost.
For production serving systems, the KV cache is often the dominant consumer of GPU memory. In some configurations, the KV cache uses more memory than the model weights themselves, particularly with large batch sizes or long context lengths.
KV caching provides a large computational saving, but it also shifts the decode-phase bottleneck from computation to memory bandwidth.
Without KV caching, generating a sequence of length T requires approximately T x (T/2) total key-value projection operations across all steps (the sum 1 + 2 + ... + T). With KV caching, only T total key-value projection operations are needed (one per step). This changes the computational complexity of the projection step from O(T^2) to O(T). In practice, this translates to a 3-5x speedup in end-to-end generation time, depending on model size and hardware.
During the decode phase, the primary bottleneck is not computation but memory bandwidth. Each decode step requires reading the entire KV cache from GPU HBM to the compute units. For a model with a 2 GB KV cache, every single token generation requires reading 2 GB of data from memory. On an NVIDIA A100 GPU with 2 TB/s memory bandwidth, this means each token takes at minimum 1 ms just for the memory read, regardless of how fast the compute is. This is why the decode phase is described as memory-bandwidth-bound.
The tension between needing the KV cache for speed and its large memory footprint has motivated numerous optimization techniques.
Multi-query attention, proposed by Noam Shazeer in 2019, reduces the KV cache by sharing a single set of key and value projections across all query heads. In standard multi-head attention with h heads, each head has its own K and V projections, resulting in h sets of key-value pairs per layer. MQA replaces these with a single shared K and V head, reducing the KV cache by a factor of h.
For a model with 32 attention heads, MQA reduces the KV cache by 32x. The trade-off is a small degradation in model quality, since all query heads must work with the same key and value representations. MQA was adopted by PaLM (Google, 2022) and Falcon (TII, 2023).
Grouped-query attention, introduced by Ainslie et al. (2023), is a compromise between standard multi-head attention and MQA. Instead of one shared KV head (MQA) or h independent KV heads (MHA), GQA uses g groups of KV heads, where 1 < g < h. Each group of h/g query heads shares one set of key-value projections.
| Attention variant | Query heads | KV heads | KV cache reduction factor |
|---|---|---|---|
| Multi-head attention (MHA) | h | h | 1x (baseline) |
| Grouped-query attention (GQA) | h | g | h/g |
| Multi-query attention (MQA) | h | 1 | h |
GQA has become the dominant attention variant in modern LLMs. LLaMA 2 70B uses 8 KV groups (8x reduction), Mistral 7B uses 8 KV heads with 32 query heads (4x reduction), and models in the Gemma, Qwen, and LLaMA 3 families all use GQA. Research shows that models originally trained with MHA can be "uptrained" with GQA using only a fraction of the original training compute, achieving quality close to MHA while providing inference efficiency closer to MQA.
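As a sketch of the mechanics (shapes and names are illustrative), GQA can be implemented by caching only g KV heads and repeating each one across its group of query heads at attention time:

```python
import numpy as np

def expand_kv_heads(K, V, n_query_heads):
    """Repeat each cached KV head across its group of query heads.

    K, V: (n_kv_heads, seq_len, d_head). Returns (n_query_heads, seq_len, d_head).
    """
    n_kv_heads = K.shape[0]
    group_size = n_query_heads // n_kv_heads      # h / g query heads per KV head
    K_expanded = np.repeat(K, group_size, axis=0)
    V_expanded = np.repeat(V, group_size, axis=0)
    return K_expanded, V_expanded

# Mistral-7B-like shapes: 32 query heads, but only 8 KV heads live in the cache
K = np.zeros((8, 1024, 128))
V = np.zeros((8, 1024, 128))
K32, V32 = expand_kv_heads(K, V, n_query_heads=32)
print(K32.shape)   # (32, 1024, 128)
```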
Multi-head latent attention, introduced in DeepSeek-V2 (2024), takes a fundamentally different approach to KV cache compression. Rather than reducing the number of KV heads, MLA compresses the key and value representations into a low-dimensional latent vector before storing them in the cache. At inference time, the compressed latent is projected back to produce full-dimensional keys and values for each head.
Concretely, MLA replaces the standard W_KV projection with a low-rank factorization: the input is first projected down to a small latent vector c (the "compressed KV"), and only c is cached. When attention needs to be computed, c is projected back up to produce the full K and V tensors. This allows every head to have unique K and V representations (preserving the expressiveness of standard MHA) while caching only the small latent vector.
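A simplified sketch of the idea, ignoring MLA's decoupled RoPE path and using made-up dimensions, is a down-projection whose output is cached and up-projections applied when attention is computed:

```python
import numpy as np

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

# Learned projections (random here; trained jointly with the model in practice)
W_down = np.random.randn(d_model, d_latent) * 0.02           # compress input to latent
W_up_k = np.random.randn(d_latent, n_heads * d_head) * 0.02  # latent -> per-head keys
W_up_v = np.random.randn(d_latent, n_heads * d_head) * 0.02  # latent -> per-head values

def mla_cache_entry(x):
    """Cache only the low-dimensional latent, not full per-head K/V."""
    return x @ W_down                                         # (d_latent,)

def mla_expand(c):
    """Reconstruct full-dimensional per-head keys and values from the latent."""
    K = (c @ W_up_k).reshape(n_heads, d_head)
    V = (c @ W_up_v).reshape(n_heads, d_head)
    return K, V

# Per-token cache: 512 latent values instead of 2 * 32 * 128 = 8,192 values
```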
DeepSeek-V2 reported a 93.3% reduction in KV cache size compared to standard MHA, with maximum generation throughput increasing by 5.76x compared to DeepSeek 67B. Ablation studies showed that MLA maintained quality closer to full MHA than GQA did, making it a quality-preserving approach despite the aggressive compression. MLA has since been adopted by DeepSeek-V3, DeepSeek-R1, Kimi K2, and several other models.
Sliding window attention limits each token's attention to a fixed window of w preceding tokens instead of the full sequence. This allows the KV cache to be bounded at a fixed size regardless of how long the generated sequence becomes.
Mistral 7B (Mistral AI, 2023) popularized this approach with a window size of w = 4,096 while supporting a context length of 8,192 tokens. The implementation uses a rolling buffer cache: a fixed-size cache of w entries where older entries are overwritten in a circular fashion as new tokens are generated. This means the cache never grows beyond w entries, providing predictable and bounded memory usage.
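A rolling buffer cache can be sketched as below; this is a simplified illustration, not Mistral's actual implementation.

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache that overwrites the oldest entry once full."""

    def __init__(self, window, d_head):
        self.window = window
        self.K = np.zeros((window, d_head))
        self.V = np.zeros((window, d_head))
        self.count = 0                       # total tokens seen so far

    def append(self, k, v):
        slot = self.count % self.window      # circular write position
        self.K[slot] = k
        self.V[slot] = v
        self.count += 1

    def contents(self):
        """K/V for the most recent min(count, window) tokens (in rotated slot order)."""
        n = min(self.count, self.window)
        return self.K[:n], self.V[:n]
```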
A key insight is that information from tokens beyond the window is not entirely lost. Because each transformer layer applies sliding window attention, a token at layer k can indirectly access information from tokens up to k x w positions away through the cascading effect of intermediate representations. With 32 layers and w = 4,096, Mistral's theoretical attention span reaches approximately 131,072 tokens.
Combined with GQA (8 KV heads instead of 32), Mistral 7B achieves a combined 8x reduction in peak KV cache memory compared to a standard MHA model with the same hidden dimension and full-length caching.
PagedAttention, introduced by Kwon et al. at SOSP 2023, applies ideas from operating system virtual memory management to the KV cache. The core problem it addresses is memory fragmentation: standard inference systems allocate contiguous GPU memory for the maximum possible sequence length for each request, even though most sequences do not use the full allocation. This leads to 60-80% of allocated KV cache memory being wasted.
PagedAttention solves this by dividing the KV cache into fixed-size blocks (typically 16 tokens per block) that can be stored anywhere in GPU memory, similar to how an OS manages virtual memory pages:

- Logical blocks represent a sequence's KV cache as the sequence sees it: contiguous and in order.
- Physical blocks are fixed-size chunks of GPU memory that can live anywhere and be filled in any order.
- A per-sequence block table maps logical blocks to physical blocks, just as a page table maps virtual pages to physical frames.
As a sequence grows, new physical blocks are allocated on demand. When a sequence finishes, its blocks are freed and can be reused by other sequences. The only wasted memory is in the last partially filled block of each sequence (at most block_size - 1 tokens).
PagedAttention also enables memory sharing between sequences. If two requests share the same prompt prefix (common in chat applications with system prompts), their block tables can point to the same physical blocks for the shared portion, using copy-on-write semantics. This further reduces memory usage.
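A toy version of the bookkeeping might look like this; the block size, names, and structure are illustrative, not vLLM's actual data structures.

```python
BLOCK_SIZE = 16   # tokens per block

class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a shared pool."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class SequenceCache:
    """Per-sequence block table: logical block index -> physical block id."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the previous one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        # Return all physical blocks to the shared pool when the sequence finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
```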
vLLM, the open-source inference engine built around PagedAttention, reduces KV cache memory waste to under 4% and improves serving throughput by 2-4x compared to systems like FasterTransformer and Orca. vLLM has become one of the most widely used LLM serving frameworks.
KV cache quantization reduces memory consumption by storing cached keys and values in lower numerical precision. While model weights are commonly quantized to INT8 or INT4 for inference, quantizing the KV cache presents unique challenges because keys and values exhibit different statistical properties than weights, often with more outlier channels.
Several approaches have been developed:
| Method | Precision | Compression ratio | Key technique |
|---|---|---|---|
| Standard FP16 cache | 16-bit | 1x (baseline) | No compression |
| INT8 KV cache | 8-bit | 2x | Per-tensor or per-channel quantization |
| KVQuant (2024) | 2-4 bit | 4-8x | Per-channel key quantization before RoPE; non-uniform quantization |
| Coupled Quantization (2024) | 1-2 bit | 8-16x | Exploits interdependence between channels |
| KIVI (2024) | 2-bit | 8x | Per-channel key quantization, per-token value quantization |
A particularly notable result is KVQuant (Hooper et al., 2024), which achieved 8x KV cache compression (2-bit effective precision) while maintaining model quality, enabling a LLaMA-7B model to serve a context length of 1 million tokens on a single A100 GPU.
Quantizing the KV cache is orthogonal to other KV cache reduction techniques (GQA, MLA, etc.) and can be combined with them for multiplicative savings. For example, combining GQA (4x reduction) with INT4 quantization (4x reduction) yields a 16x total reduction in KV cache memory.
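As an illustration of the simplest variant, a symmetric per-token INT8 scheme (a sketch, not any of the published methods above) stores an 8-bit tensor plus one FP16 scale per token:

```python
import numpy as np

def quantize_int8_per_token(x):
    """Symmetric per-token INT8 quantization of a (seq_len, d) tensor."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0   # one scale per token
    scale = np.maximum(scale, 1e-8)                         # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int8(q, scale):
    return q.astype(np.float16) * scale

K = np.random.randn(1024, 128).astype(np.float16)
q, s = quantize_int8_per_token(K)
K_restored = dequantize_int8(q, s)   # ~2x smaller cache, small reconstruction error
```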
Token eviction methods reduce the KV cache by selectively removing cached entries for tokens deemed less important. The challenge is identifying which tokens can be safely evicted without degrading generation quality.
H2O (Heavy-Hitter Oracle): Zhang et al. (2023) observed that a small fraction of tokens accumulate disproportionately high attention scores across generation steps. These "heavy hitter" tokens are critical to preserve. H2O maintains a cache containing both recent tokens (a sliding window) and identified heavy hitters, evicting tokens that are neither recent nor heavily attended. With only 20% of tokens retained as heavy hitters, H2O can improve throughput by up to 29x on leading inference systems.
StreamingLLM: Xiao et al. (2023) discovered that the first few tokens in a sequence ("attention sinks") receive anomalously high attention scores regardless of their semantic content. This happens because softmax attention weights must sum to 1, and the model learns to "dump" excess attention onto initial tokens. StreamingLLM preserves these initial attention sink tokens plus a sliding window of recent tokens, enabling theoretically infinite-length generation with a fixed-size cache.
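In pseudocode, a sink-plus-window eviction policy (a simplified sketch of the StreamingLLM idea, not the reference implementation) keeps two regions of the cache:

```python
def positions_to_keep(seq_len, num_sink_tokens=4, window=1024):
    """Indices of cached tokens retained: initial attention sinks + recent window."""
    if seq_len <= num_sink_tokens + window:
        return list(range(seq_len))                 # nothing to evict yet
    sinks = list(range(num_sink_tokens))            # first few tokens are never evicted
    recent = list(range(seq_len - window, seq_len)) # sliding window of recent tokens
    return sinks + recent

# With a 10,000-token history, only 4 + 1,024 cache entries survive
print(len(positions_to_keep(10_000)))   # 1028
```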
Adaptive compression: More recent approaches adaptively decide which tokens and layers need full-precision caching versus which can be compressed or evicted, recognizing that different layers and attention heads have different sensitivity to cache precision.
Because the prefill phase is compute-bound while the decode phase is memory-bandwidth-bound, some serving systems separate these two phases onto different hardware. The prefill phase runs on GPUs optimized for throughput (processing large batches of prompt tokens), while the decode phase runs on GPUs optimized for memory bandwidth. The KV cache computed during prefill is transferred to the decode GPU. This approach, used in systems like Splitwise and DistServe, can improve overall serving efficiency by 1.5-2x.
The KV cache is directly tied to a model's ability to handle long contexts. As context windows have grown from 2,048 tokens (GPT-3, 2020) to 128,000 tokens (GPT-4 Turbo, 2023) to 1,000,000+ tokens (Gemini 1.5, 2024), KV cache memory requirements have grown proportionally.
For a LLaMA-70B-class model using GQA with 8 KV heads in FP16, the KV cache memory at various context lengths is:
| Context length | KV cache (batch size 1) | KV cache (batch size 32) |
|---|---|---|
| 4,096 | 1.25 GB | 40 GB |
| 32,768 | 10 GB | 320 GB |
| 131,072 | 40 GB | 1,280 GB |
| 1,000,000 | 305 GB | 9,766 GB |
At a context length of 1 million tokens with batch size 32, the KV cache alone would require nearly 10 TB of memory. This illustrates why KV cache optimization is not optional for long-context serving but a hard requirement. Techniques like KV cache quantization, eviction, and MLA are what make million-token context windows practically feasible.
A basic implementation appends new K and V tensors to Python lists or concatenates tensors at each step. While simple, this approach causes frequent memory allocations and copies. Production systems typically pre-allocate the cache to the maximum expected sequence length at the start, then fill in entries as tokens are generated. Pre-allocation avoids the overhead of dynamic memory management but wastes memory when sequences are shorter than the maximum. PagedAttention addresses this trade-off by combining dynamic allocation with efficient memory management.
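A sketch of the pre-allocated pattern (shapes and names illustrative) writes each new token's K/V into a fixed buffer instead of concatenating:

```python
import numpy as np

class PreallocatedKVCache:
    """KV cache allocated once for max_seq_len and filled in place."""

    def __init__(self, n_layers, n_kv_heads, d_head, max_seq_len, dtype=np.float16):
        shape = (n_layers, max_seq_len, n_kv_heads, d_head)
        self.K = np.zeros(shape, dtype=dtype)
        self.V = np.zeros(shape, dtype=dtype)
        self.length = 0

    def append(self, k, v):
        """k, v: (n_layers, n_kv_heads, d_head) for the newest token."""
        self.K[:, self.length] = k     # in-place write, no reallocation or copy
        self.V[:, self.length] = v
        self.length += 1

    def view(self):
        """K/V for the tokens cached so far."""
        return self.K[:, :self.length], self.V[:, :self.length]
```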
When using rotary position embeddings (RoPE), which encode position information directly into the key and query vectors, the cached keys already have positional information baked in. This means cached keys do not need to be re-encoded when new tokens are generated. However, certain KV cache quantization methods (such as KVQuant) quantize keys before applying RoPE because RoPE distorts the channel-wise distribution, making post-RoPE quantization less effective.
For models distributed across multiple GPUs using tensor parallelism, the KV cache is also distributed. Each GPU stores the cache for its assigned attention heads. With GQA, the number of KV heads limits how many GPUs can participate in the KV cache distribution. For example, a model with 8 KV heads can distribute its cache across at most 8 GPUs (one KV head per GPU). This constraint influences the choice of parallelism strategy for large-scale serving.
In continuous batching (also called in-flight batching or iteration-level scheduling), the serving system manages a shared pool of KV cache memory across all active requests. When a request finishes, its cache memory is immediately recycled for new requests. This is more efficient than static batching, where memory is reserved for a fixed batch of requests even after some have completed. vLLM, TensorRT-LLM, and other modern serving frameworks all use continuous batching with PagedAttention for cache management.
Imagine you are reading a long story out loud, and after each sentence you need to answer a question about everything you have read so far. Without a KV cache, you would have to re-read the entire story from the beginning every time you finish a new sentence. That would be very slow if the story is hundreds of pages long.
A KV cache is like taking notes as you read. After each sentence, you write down the important parts (the "keys" and "values") on a notepad. When you need to answer a question about the story, you just look at your notes instead of re-reading everything. The more of the story you read, the more notes you have, and your notepad gets bigger. If your notepad gets too big, you might need tricks to keep it manageable, like only keeping notes about the most recent pages (sliding window) or writing in smaller handwriting (quantization).