A KV cache (key-value cache) is a memory optimization technique used during autoregressive inference in transformer-based models. It stores the key and value matrices computed by the self-attention mechanism for all previously generated tokens, so that these values do not need to be recomputed at each decoding step. Without a KV cache, generating each new token would require recomputing the attention over the entire preceding sequence, making inference prohibitively slow for large language models. The KV cache is one of the most fundamental mechanisms enabling practical deployment of modern LLMs, and optimizing its memory footprint has become a major area of research in inference optimization.
Autoregressive language models generate text one token at a time. At each step, the model takes the full sequence of tokens produced so far and computes attention over them to predict the next token. In a standard transformer, the attention operation for a given query token involves computing dot products against all key vectors in the sequence and then using the resulting weights to aggregate the corresponding value vectors.
Without caching, the computational cost of generating a sequence of length L is quadratic in L, because at step t, the model must process all t previous tokens. The KV cache eliminates this redundancy. Since the key and value projections for earlier tokens do not change (they depend only on the input embeddings and the fixed model weights), they can be computed once and stored. At each new decoding step, only the key and value vectors for the newest token are computed and appended to the cache. The attention computation for the new token then reads from the full cache, but no previous-token computations are repeated [1].
This reduces the per-step cost of the key-value projections from O(t * d^2) to O(d^2), where d is the model dimension: without caching, all t previous tokens must be pushed through the projection matrices at step t, while with caching only the newest token is projected, making this step constant with respect to sequence length. The attention dot-product computation still scales linearly with sequence length, but the expensive matrix multiplications for projecting past tokens are avoided entirely.
In a multi-head attention layer, the input hidden states H are projected into queries Q, keys K, and values V using learned weight matrices:

Q = H * W_Q,  K = H * W_K,  V = H * W_V
The attention output is then computed as softmax(QK^T / sqrt(d_k)) * V, where d_k is the dimension per head.
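As a concrete illustration of these formulas, here is a minimal pure-Python sketch of scaled dot-product attention for a single head and a single query token (unbatched, with keys and values as plain lists of vectors):

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # One query vector against all cached keys/values (single head).
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
              for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# A query that strongly matches the first key mostly retrieves the first value.
out = attend([10.0, 0.0],
             keys=[[1.0, 0.0], [0.0, 1.0]],
             values=[[1.0, 0.0], [0.0, 1.0]])
```

In a real model this runs in parallel over heads and batch entries as a single matrix multiplication; the point here is only that the keys and values are read, never modified, which is what makes them cacheable.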
During inference, the model processes the input prompt (the "prefill" phase) by computing K and V for every token in the prompt simultaneously. These K and V tensors are stored in the KV cache. When generation begins (the "decode" phase), each new token produces a single new row of K and a single new row of V. These are concatenated to the cached K and V tensors. The query for the new token is then multiplied against the full (cached + new) K matrix, and the resulting attention weights are applied to the full V matrix [2].
This means the KV cache grows by one row per layer per head for each generated token. The cache is typically maintained as a pair of tensors per layer, with dimensions [batch_size, num_kv_heads, sequence_length, head_dim].
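The two phases can be sketched with a toy cache for one layer and one head, using Python lists in place of the [batch_size, num_kv_heads, sequence_length, head_dim] tensors a real implementation would hold:

```python
class KVCache:
    """Toy KV cache for a single layer and head; real caches hold one
    pair of tensors per layer with shape
    [batch_size, num_kv_heads, sequence_length, head_dim]."""
    def __init__(self):
        self.keys, self.values = [], []

    def prefill(self, prompt_keys, prompt_values):
        # Prefill phase: K and V for every prompt token arrive at once.
        self.keys.extend(prompt_keys)
        self.values.extend(prompt_values)

    def append(self, k, v):
        # Decode phase: exactly one new (k, v) row per generated token.
        self.keys.append(k)
        self.values.append(v)

cache = KVCache()
cache.prefill([[0.1], [0.2], [0.3]], [[1.0], [2.0], [3.0]])  # 3-token prompt
for step in range(2):                                         # 2 decode steps
    cache.append([0.4 + step], [4.0 + step])
print(len(cache.keys))  # → 5
```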
The two phases of inference have very different computational profiles. The prefill phase processes all prompt tokens in parallel and is compute-bound, meaning it fully utilizes the GPU's arithmetic throughput. The decode phase generates tokens one at a time and is memory-bandwidth-bound, since the main bottleneck is reading the KV cache from GPU memory. This distinction is critical for understanding why KV cache optimization has such a large impact on overall inference performance [3].
The total memory required for the KV cache can be calculated with the following formula [4]:
KV cache size (bytes) = 2 * num_layers * num_kv_heads * head_dim * sequence_length * batch_size * bytes_per_element
The factor of 2 accounts for both the key and value tensors. The term num_kv_heads refers to the number of key-value heads, which equals num_attention_heads in standard multi-head attention but is reduced in multi-query attention and grouped-query attention. The bytes_per_element depends on the numerical precision: 2 bytes for FP16 or BF16, 1 byte for FP8, and 0.5 bytes for INT4.
Note that num_kv_heads * head_dim often equals the model's hidden size (d_model) in standard multi-head attention, so the formula can also be written as:
KV cache size (bytes) = 2 * num_layers * d_model * sequence_length * batch_size * bytes_per_element
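The formula translates directly into a short helper; the example reproduces the 0.5 GB Llama 3 8B figure from the table below (32 layers, 8 KV heads, head_dim 128, 4K context, BF16):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element=2):
    # The leading 2 accounts for the separate key and value tensors.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)

gib = kv_cache_bytes(32, 8, 128, 4096, 1, 2) / 2**30
print(gib)  # → 0.5
```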
The KV cache grows linearly with both sequence length and batch size. Doubling the context window doubles the KV cache memory. Doubling the batch size also doubles it. This linear scaling, while better than the quadratic scaling of attention score computation, still creates substantial memory pressure for large models serving long contexts or many concurrent requests.
The following table shows KV cache memory requirements for several popular models at a sequence length of 4,096 tokens, batch size of 1, using BF16 (2 bytes per element):
| Model | Layers | KV Heads | Head Dim | KV Cache (4K seq, BS=1) | KV Cache (128K seq, BS=1) |
|---|---|---|---|---|---|
| Llama 3 8B | 32 | 8 (GQA) | 128 | 0.5 GB | 16 GB |
| Llama 3 70B | 80 | 8 (GQA) | 128 | 1.25 GB | 40 GB |
| Llama 3 405B | 126 | 8 (GQA) | 128 | 2.0 GB | 63 GB |
| Mistral 7B | 32 | 8 (GQA) | 128 | 0.5 GB | 16 GB |
| GPT-3 175B | 96 | 96 (MHA) | 128 | 18 GB | 576 GB |
| Falcon 40B | 60 | 1 (MQA) | 64 | 0.06 GB | 1.88 GB |
The contrast between models using MHA and those using GQA or MQA is stark. GPT-3's standard multi-head attention requires roughly 14 times more KV cache memory than Llama 3 70B at the same sequence length (the ratio of layers times KV heads: 96 * 96 versus 80 * 8), despite both being large models. Falcon 40B's use of multi-query attention produces an extremely compact cache. This demonstrates why attention head design has become a primary lever for controlling inference memory [4].
For short sequences, the KV cache is a small fraction of total GPU memory. But at long context lengths, it can rival or exceed the model weight memory. For example, Llama 3 70B in BF16 requires approximately 140 GB for model weights. At a 128K sequence length with batch size 1, the KV cache adds another 40 GB. For production serving with many concurrent requests, the aggregate KV cache across all active sequences often consumes most of the available GPU memory [5].
The growing importance of the KV cache has produced a rich ecosystem of optimization methods. These can be grouped into several categories.
| Optimization | Category | Key Idea | Memory Reduction | Adopted By |
|---|---|---|---|---|
| PagedAttention / vLLM | Memory management | Manage KV cache in non-contiguous pages, like virtual memory | Reduces waste to <4% (from 60-80% in naive allocation) | vLLM, SGLang, TensorRT-LLM |
| Quantized KV cache | Compression | Store K,V in FP8 or INT4 instead of FP16/BF16 | 2-4x reduction | vLLM, TensorRT-LLM, llama.cpp |
| Sliding window attention | Architectural | Attend only to the last W tokens; cache never exceeds W entries | Bounded cache size | Mistral 7B (W=4096), Gemma |
| Multi-query attention (MQA) | Architectural | Share a single KV head across all query heads | num_heads x reduction | PaLM, Falcon, StarCoder |
| Grouped-query attention (GQA) | Architectural | Share KV heads among groups of query heads | (num_heads / num_groups) x reduction | Llama 3, Mistral, Gemini |
| Prefix caching | Reuse | Cache and reuse KV entries for shared prompt prefixes across requests | Avoids redundant prefill computation | vLLM (automatic prefix caching), SGLang |
| Attention sinks / StreamingLLM | Eviction | Keep initial "sink" tokens plus a sliding window of recent tokens | Enables infinite-length generation with fixed cache | StreamingLLM (MIT HAN Lab) |
| Selective eviction (H2O, SnapKV) | Eviction | Evict low-importance KV entries based on attention scores | 70-90% reduction possible | Research systems |
| KV cache offloading | Offloading | Move inactive KV entries to CPU RAM or SSD | Extends effective capacity beyond GPU memory | FlexGen, InfiniGen |
PagedAttention, introduced by Kwon et al. in 2023, applies the concept of virtual memory paging to the KV cache. In naive implementations, each request's KV cache is allocated as a contiguous block of GPU memory sized for the maximum possible sequence length. This leads to severe internal fragmentation, because most sequences are much shorter than the maximum. Measurements showed that 60-80% of KV cache memory was wasted in such systems [6].
PagedAttention divides the KV cache into fixed-size blocks (pages) that can be stored anywhere in GPU memory. Each request maintains a block table (analogous to a page table in an operating system) that maps logical KV cache positions to physical block locations. The attention kernel follows these indirection pointers to read the correct blocks. This reduces memory waste to under 4% and enables 2-4x higher throughput by allowing more concurrent requests to fit in memory [6].
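The block-table idea can be sketched as follows; the allocation policy is deliberately simplified, and the pool of physical block ids stands in for GPU memory:

```python
BLOCK_SIZE = 16  # tokens per physical block

class BlockTable:
    """Toy per-request page table in the style of PagedAttention:
    logical token positions map to arbitrary physical blocks."""
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks  # shared pool of physical block ids
        self.blocks = []                # this request's blocks, logical order
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is claimed only when the last one is full,
        # so at most one block per request is partially empty.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.free_blocks.pop())
        self.num_tokens += 1

    def locate(self, pos):
        # Translate a logical cache position to (physical block, offset).
        return self.blocks[pos // BLOCK_SIZE], pos % BLOCK_SIZE

pool = list(range(1024))   # stand-in for the GPU-wide block pool
table = BlockTable(pool)
for _ in range(40):        # a 40-token sequence
    table.append_token()
print(len(table.blocks))   # → 3 (three blocks of 16)
```

Because blocks need not be contiguous, the worst-case waste per request is one partially filled block rather than an entire max-length allocation.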
vLLM, the open-source serving engine built around PagedAttention, has become one of the most widely used LLM serving systems. It also supports automatic prefix caching, where KV cache blocks for shared prompt prefixes are reused across requests, further improving memory efficiency for applications that use common system prompts.
Storing the KV cache in lower precision is a straightforward way to reduce its memory footprint. Most production models compute attention in FP16 or BF16, but the KV cache can often be quantized to FP8 or even INT4 with minimal impact on output quality. vLLM supports FP8 KV cache quantization with both per-tensor and per-attention-head scaling strategies [7].
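vLLM's actual FP8 paths use hardware floating-point formats and fused kernels; purely as an illustration of the general scale-and-round scheme, here is a hypothetical per-head symmetric INT8 sketch (not vLLM's implementation):

```python
def quantize_head(values, num_bits=8):
    # Symmetric integer quantization with one scale per head; the scale
    # is stored alongside the cache and applied when the cache is read.
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero head: any scale round-trips correctly
    return [round(v / scale) for v in values], scale

def dequantize_head(q, scale):
    return [qi * scale for qi in q]

q, s = quantize_head([0.5, -1.0, 0.25])
restored = dequantize_head(q, s)  # close to the original values
```

Per-head (rather than per-tensor) scales cost a few extra bytes but track the differing value ranges across heads, which is why both granularities are offered in practice.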
NVIDIA has introduced NVFP4 KV cache support, which stores keys and values in 4-bit floating point, achieving a 4x reduction compared to FP16. This is particularly beneficial for long-context inference and large batch sizes, where the KV cache dominates memory usage. Research from 2025 demonstrated that combining FP4 quantization with PagedAttention can yield up to 64x compression with over 90% accuracy retention on standard benchmarks [8].
Sliding window attention restricts each token to attend only to the W most recent tokens, where W is the window size. This places a hard upper bound on the KV cache size: regardless of the total sequence length, each layer stores at most W key-value pairs. Mistral 7B uses a sliding window of 4,096 tokens. During inference, once the cache reaches the window size, old entries are evicted as new ones are added, maintaining constant memory usage [9].
The limitation is that information beyond the window boundary is not directly accessible through attention, though in practice, information can propagate through the layers of the network. Early layers can attend to their own window, and later layers can attend to tokens that already incorporate information from earlier windows, creating an effective receptive field larger than W.
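The rolling-buffer behavior is easy to sketch; a tiny window of 4 is used here for readability (Mistral's is 4,096):

```python
from collections import deque

class SlidingWindowCache:
    """Rolling buffer: at most `window` (key, value) pairs are retained,
    so cache memory is constant regardless of sequence length."""
    def __init__(self, window):
        self.entries = deque(maxlen=window)  # oldest entry drops automatically

    def append(self, k, v):
        self.entries.append((k, v))

cache = SlidingWindowCache(window=4)
for t in range(10):
    cache.append(f"k{t}", f"v{t}")
print(len(cache.entries), cache.entries[0])  # → 4 ('k6', 'v6')
```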
Multi-query attention (MQA), proposed by Shazeer in 2019, reduces the number of key and value heads to just one, shared across all query heads. This yields a dramatic reduction in KV cache size (proportional to the number of attention heads), but can reduce model quality because all query heads are forced to use the same key-value representation [10].
Grouped-query attention (GQA), introduced by Ainslie et al. in 2023, provides a middle ground. Query heads are divided into groups, and each group shares a single set of key-value heads. If the model has h query heads and g KV head groups, the KV cache is reduced by a factor of h/g compared to standard MHA. Most modern LLMs, including Llama 3, Mistral, and Gemini, use GQA because it offers most of the memory savings of MQA with minimal quality loss [11].
For Llama 3 70B, standard multi-head attention with 64 KV heads would require approximately 320 GB of KV cache at 128K context (BF16), while its GQA configuration with 8 KV heads requires only about 40 GB, an 8x reduction.
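A minimal sketch of the head-sharing index map follows; real implementations instead broadcast the cached KV heads across each group at attention time (e.g. via repeat_interleave in PyTorch), but the grouping is the same:

```python
def kv_head_for_query_head(q_head, num_query_heads, num_kv_heads):
    # Query heads are split into contiguous groups; each group reads
    # a single shared KV head.
    group_size = num_query_heads // num_kv_heads
    return q_head // group_size

# Llama 3 70B-style config: 64 query heads, 8 KV heads -> groups of 8.
mapping = [kv_head_for_query_head(q, 64, 8) for q in range(64)]
print(mapping[:10])  # → [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

With num_kv_heads = num_query_heads this degenerates to standard MHA, and with num_kv_heads = 1 it becomes MQA, which is why GQA is described as an interpolation between the two.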
Research from MIT's HAN Lab discovered that autoregressive language models allocate a disproportionately large amount of attention to the very first tokens in a sequence, regardless of their semantic content. These initial tokens act as "attention sinks," absorbing excess attention probability because the softmax function requires attention weights to sum to one. When these initial tokens are evicted from a sliding window cache, model performance degrades sharply [12].
StreamingLLM exploits this observation by maintaining a small set of initial "sink" tokens (typically four) alongside a sliding window of recent tokens. This simple combination enables LLMs to generate text over effectively infinite sequence lengths with a fixed-size KV cache, achieving up to 22.2x speedup over the alternative of recomputing the full sliding window from scratch. The approach was published at ICLR 2024 [12].
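The retention policy reduces to a simple rule over positions; the window of 8 below is a toy value (the paper uses windows of a few thousand tokens alongside roughly four sinks):

```python
def streaming_keep_positions(seq_len, num_sinks=4, window=8):
    # Retain the first `num_sinks` positions (attention sinks) plus the
    # most recent `window` positions; everything in between is evicted.
    keep = set(range(min(num_sinks, seq_len)))
    keep |= set(range(max(0, seq_len - window), seq_len))
    return sorted(keep)

print(streaming_keep_positions(20))
# → [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```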
Prefix caching stores the KV cache for common prompt prefixes and reuses them across multiple requests. In many production applications, requests share the same system prompt or few-shot examples. Without prefix caching, the KV cache for these shared tokens is recomputed for every request. With prefix caching, the shared portion is computed once and its KV cache blocks are shared (read-only) across all requests that use the same prefix [6].
vLLM implements automatic prefix caching, which uses a hash-based lookup to detect when a new request's prompt shares a prefix with a cached sequence. SGLang uses a radix tree data structure for the same purpose. Prefix caching is especially valuable for retrieval-augmented generation (RAG) pipelines and chat applications with long system prompts.
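A simplified sketch of hash-based prefix matching follows; the block size and hashing details are illustrative, not vLLM's exact scheme, but the chaining is the key idea: because each block's key commits to everything before it, equal keys imply an identical prefix.

```python
import hashlib

BLOCK = 16  # tokens per cached block

def block_keys(token_ids):
    """Chained hash per full block of the prompt."""
    keys, prev = [], b""
    for i in range(0, len(token_ids) // BLOCK * BLOCK, BLOCK):
        h = hashlib.sha256(prev + repr(token_ids[i:i + BLOCK]).encode())
        keys.append(h.digest())
        prev = keys[-1]
    return keys

def reusable_blocks(new_prompt, cached_prompt):
    # Count leading blocks whose KV cache can be shared read-only.
    count = 0
    for a, b in zip(block_keys(new_prompt), block_keys(cached_prompt)):
        if a != b:
            break
        count += 1
    return count

shared_system_prompt = list(range(40))  # 40 shared leading tokens
n = reusable_blocks(shared_system_prompt + [7], shared_system_prompt + [9])
print(n)  # → 2 (two full 16-token blocks are shared; the tails differ)
```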
Several research methods selectively evict low-importance KV cache entries based on attention patterns. H2O (Heavy Hitter Oracle) identifies tokens that receive the most cumulative attention and retains only those, along with recent tokens. SnapKV uses attention score patterns observed during the prefill phase to decide which tokens to keep for the decode phase. RazorAttention retains the full cache only for a small set of retrieval-oriented attention heads and compresses the cache for the remaining heads [13].
PyramidKV and related methods exploit the observation that earlier transformer layers tend to have more diffuse attention patterns (needing more cached tokens) while later layers are more focused (needing fewer). They allocate cache budgets in a pyramidal fashion, with more cache for early layers and less for later ones [13].
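PyramidKV's actual allocation rule is derived from measured attention statistics; the pyramidal shape of the resulting budgets can be caricatured with a simple linear decay:

```python
def pyramid_budgets(num_layers, total_entries):
    # Linearly decaying weights: layer 0 (diffuse attention) gets the
    # largest share, the last layer (focused attention) the smallest.
    weights = [num_layers - i for i in range(num_layers)]
    total_weight = sum(weights)
    return [total_entries * w // total_weight for w in weights]

print(pyramid_budgets(4, 1000))  # → [400, 300, 200, 100]
```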
Modern LLM serving stacks combine multiple KV cache optimizations simultaneously. A typical production deployment might use GQA at the architecture level, PagedAttention for memory management, FP8 quantization for compression, and prefix caching for repeated prompts. The layering of these techniques is what makes serving models with 100K+ context lengths economically viable.
A newer trend in 2025-2026 is disaggregated serving, where the prefill and decode phases are handled by different GPU pools. Since prefill is compute-bound and decode is memory-bandwidth-bound, optimizing for both simultaneously on the same hardware is difficult. In a disaggregated architecture, the KV cache generated during prefill on one GPU is transferred to a decode GPU, often over high-speed interconnects. This allows each GPU pool to be optimized for its specific bottleneck. Systems like DistServe and Splitwise implement this approach [14].
For extremely long contexts or memory-constrained environments, KV cache entries can be offloaded to CPU RAM or even SSD storage. FlexGen pioneered this approach by treating the KV cache as a hierarchical storage problem across GPU memory, CPU memory, and disk. More recent systems like InfiniGen use speculative prefetching, predicting which KV cache entries will be needed for the next decoding step and pre-loading them from CPU to GPU memory to hide the transfer latency [15].
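The tiering logic can be sketched with dictionaries standing in for GPU and CPU memory; this is a blocking, on-demand version of what systems like InfiniGen do speculatively:

```python
class TwoTierCache:
    """Toy hierarchical KV store: the newest blocks stay in 'GPU' memory,
    older ones are offloaded to 'CPU' and fetched back on demand."""
    def __init__(self, gpu_capacity):
        self.gpu_capacity = gpu_capacity
        self.gpu, self.cpu = {}, {}   # block_id -> block data

    def _evict_oldest(self, protect=None):
        # Offload the oldest resident block (lowest id) to the CPU tier.
        victim = min(b for b in self.gpu if b != protect)
        self.cpu[victim] = self.gpu.pop(victim)

    def add(self, block_id, data):
        self.gpu[block_id] = data
        if len(self.gpu) > self.gpu_capacity:
            self._evict_oldest()

    def fetch(self, block_id):
        # A real system would prefetch this transfer ahead of the decode
        # step instead of blocking on it here.
        if block_id not in self.gpu:
            self.gpu[block_id] = self.cpu.pop(block_id)
            if len(self.gpu) > self.gpu_capacity:
                self._evict_oldest(protect=block_id)
        return self.gpu[block_id]

cache = TwoTierCache(gpu_capacity=2)
for b in range(4):
    cache.add(b, f"kv-block-{b}")   # blocks 0 and 1 spill to the CPU tier
cache.fetch(0)                      # brought back, displacing block 2
```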
As of early 2026, KV cache optimization remains one of the most active areas of LLM systems research. Several trends define the current landscape:
Extremely low-bit quantization (4-bit and below) for KV caches has moved from research to production, with NVIDIA providing native FP4 support and vLLM integrating quantized KV cache as a standard feature. The combination of FP4 KV caches with GQA-based architectures means that a model like Llama 3 70B can serve 128K context windows with around 10 GB of KV cache memory, a dramatic improvement over the roughly 320 GB that a hypothetically equivalent MHA model in FP16 would require.
Dynamic cache management systems now make intelligent decisions about which cache entries to retain, evict, or offload based on predicted access patterns and content importance. Layer-wise adaptive allocation, where different transformer layers receive different cache budgets, has emerged as a promising direction with methods like CAKE and Ada-KV.
The integration of KV cache optimization into standard serving frameworks has matured considerably. What was cutting-edge research in 2023 (PagedAttention, prefix caching) is now table-stakes functionality in production serving stacks. The frontier has moved to more sophisticated techniques: cross-request cache sharing, hierarchical storage management, and hardware-aware cache scheduling.
Looking ahead, the continued growth of context windows (now regularly exceeding 1 million tokens in models like Claude and Gemini) ensures that KV cache optimization will remain essential. Each doubling of context length doubles the potential cache size, creating a persistent demand for more efficient management strategies.