A KV cache (key-value cache) is a memory optimization technique used during transformer inference that stores previously computed key and value tensors from the attention mechanism so they do not need to be recalculated at each generation step. During autoregressive text generation, a large language model produces tokens one at a time, and each new token must attend to every previous token in the sequence. Without caching, the model would redundantly recompute the key and value projections for all prior tokens at every step, resulting in computation that grows quadratically with sequence length. By storing and reusing these intermediate results, the KV cache reduces per-step computation from O(n) projection operations to O(1), where only the new token's key and value vectors need to be computed and appended to the cache.
The KV cache is one of the most fundamental optimizations in modern LLM serving. Every major inference framework, including vLLM, TensorRT-LLM, and Hugging Face Transformers, implements KV caching by default. However, the cache introduces its own challenge: memory consumption. For large models with long context windows, the KV cache can consume tens of gigabytes of GPU memory, often exceeding the memory required by the model weights themselves. This has motivated an active area of research into KV cache compression, eviction, quantization, and memory management techniques.
To understand why the KV cache exists, it is necessary to understand how autoregressive language models generate text.
A decoder-only transformer (such as GPT, LLaMA, or Claude) generates text one token at a time. At each step t, the model takes all tokens produced so far (x_1, x_2, ..., x_t) and predicts the probability distribution over the next token x_{t+1}. This process is inherently sequential: the model cannot produce token t+1 until it has produced token t.
Inside each transformer layer, the self-attention mechanism computes three matrices from the input embeddings:

- Queries (Q), obtained by multiplying the input by a learned projection W_Q; each token's query asks which other tokens are relevant to it.
- Keys (K), obtained via W_K; each token's key describes what it contains, and is compared against queries.
- Values (V), obtained via W_V; the values carry the content that is actually aggregated according to the attention weights.
The attention output is computed as:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Because the model is causal (each token can only attend to itself and preceding tokens), at generation step t the model computes the query for the new token and must compare it against the keys of all t tokens. The resulting attention weights are then applied to the values of all t tokens.
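The following minimal NumPy sketch illustrates this formula with a causal mask; the function name `causal_attention` and the shapes are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask.

    Q, K, V: arrays of shape (seq_len, d_head).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V                                    # (seq_len, d_head)
```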
Without KV caching, the model would recompute the K and V projections for every token at every generation step. At step t, this means computing t key vectors and t value vectors, even though the key and value vectors for tokens 1 through t-1 are identical to what they were at step t-1. This redundant computation is the problem that KV caching solves.
The KV cache operates by storing the key and value tensors computed at each generation step and reusing them in subsequent steps. The process unfolds in two distinct phases.
When the user submits a prompt, the model processes all input tokens in parallel during the prefill phase (also called the prompt encoding phase). For each transformer layer, the model computes the key and value projections for every token in the prompt and stores them in the cache. This phase is compute-bound because it involves large matrix multiplications across the full prompt length. The prefill phase determines the time to first token (TTFT), the latency the user experiences before seeing any output.
After the prefill phase, the model enters the decode phase, generating output tokens one at a time. At each step:

- The model computes the query, key, and value projections for the single new token only.
- The new key and value vectors are appended to the cache at every layer.
- The new token's query attends over the cached keys and values of all previous tokens (plus its own).
- The resulting output is used to predict the next token, and the loop repeats.
This means that at decode step t, only a single token's K and V projections need to be computed, while the K and V projections for all previous tokens are simply read from the cache. The decode phase is memory-bandwidth-bound rather than compute-bound: the bottleneck is loading the cached key and value tensors from GPU memory (HBM) into the compute units, not the arithmetic itself.
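A minimal single-layer, single-head sketch of the two phases might look like the following; the cache layout and function names are illustrative and not taken from any framework.

```python
import numpy as np

def attend(q, K, V):
    """Attention of a single query vector q over all cached keys/values."""
    d_k = q.shape[-1]
    scores = K @ q / np.sqrt(d_k)                 # (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                            # (d_head,)

def prefill(prompt_x, W_q, W_k, W_v):
    """Process the whole prompt in parallel and build the initial cache."""
    K_cache = prompt_x @ W_k                      # (prompt_len, d_head)
    V_cache = prompt_x @ W_v
    return K_cache, V_cache

def decode_step(x_new, K_cache, V_cache, W_q, W_k, W_v):
    """Compute K/V for the new token only, append, and attend over the cache."""
    q = x_new @ W_q
    K_cache = np.vstack([K_cache, x_new @ W_k])   # append one row, reuse the rest
    V_cache = np.vstack([V_cache, x_new @ W_v])
    out = attend(q, K_cache, V_cache)
    return out, K_cache, V_cache
```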
Consider a single-layer, single-head transformer given the prompt "The cat sat", which then generates the continuation "on the ...":
| Step | Input token | Cache before step (K vectors) | Computation | Cache after step (K vectors) |
|---|---|---|---|---|
| 1 (prefill) | "The", "cat", "sat" | Empty | Compute K, V for all 3 tokens; store in cache | K_the, K_cat, K_sat |
| 2 (decode) | "on" | K_the, K_cat, K_sat | Compute K_on, V_on; append to cache; query attends over all 4 | K_the, K_cat, K_sat, K_on |
| 3 (decode) | "the" | K_the, K_cat, K_sat, K_on | Compute K_the2, V_the2; append; query attends over all 5 | K_the, K_cat, K_sat, K_on, K_the2 |
In a real model, this process happens independently at every layer and every attention head, so the total cache stores K and V tensors for every (layer, head, position) combination.
The KV cache stores two tensors (K and V) at each layer for each token. The memory required per token is:
KV cache per token (bytes) = 2 x n_layers x n_kv_heads x d_head x bytes_per_element
where:

- the factor of 2 accounts for storing both a key and a value tensor,
- n_layers is the number of transformer layers,
- n_kv_heads is the number of key-value heads (equal to the number of query heads in standard multi-head attention, smaller with GQA or MQA),
- d_head is the dimension of each attention head,
- bytes_per_element is the numeric precision (2 for FP16/BF16, 1 for INT8 or FP8).
Since n_kv_heads x d_head often equals the model's hidden dimension d_model (in the standard multi-head attention case), this simplifies to:
KV cache per token (bytes) = 2 x n_layers x d_model x bytes_per_element
The total KV cache memory for a batch of sequences is:
Total KV cache (bytes) = batch_size x seq_length x 2 x n_layers x n_kv_heads x d_head x bytes_per_element
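A small helper that evaluates these formulas reproduces the figures in the table that follows; the function name and defaults are illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head,
                   seq_length, batch_size=1, bytes_per_element=2):
    """Total KV cache size: 2 (K and V) x layers x kv_heads x d_head per token."""
    per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_element
    return batch_size * seq_length * per_token

# LLaMA 2 7B: 32 layers, 32 KV heads, d_head 128, FP16, 4,096-token sequence
print(kv_cache_bytes(32, 32, 128, 4096) / 2**30)   # ~2.0 GB

# LLaMA 2 70B with GQA: 80 layers, only 8 KV heads, d_head 128
print(kv_cache_bytes(80, 8, 128, 4096) / 2**30)    # ~1.25 GB
```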
The following table shows approximate KV cache sizes for popular models processing a single sequence of 4,096 tokens in FP16 (2 bytes per element):
| Model | Parameters | Layers | Heads (Q/KV) | d_head | d_model | KV cache per token | KV cache (4K tokens) |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 7B | 32 | 32/32 | 128 | 4,096 | 0.5 MB | 2.0 GB |
| LLaMA 2 13B | 13B | 40 | 40/40 | 128 | 5,120 | 0.8 MB | 3.2 GB |
| LLaMA 2 70B | 70B | 80 | 64/8 | 128 | 8,192 | 0.3 MB | 1.25 GB |
| Mistral 7B | 7B | 32 | 32/8 | 128 | 4,096 | 0.125 MB | 0.5 GB |
| GPT-3 175B | 175B | 96 | 96/96 | 128 | 12,288 | 4.5 MB | 18 GB |
Note that LLaMA 2 70B uses grouped-query attention with only 8 KV heads (instead of 64), reducing its KV cache by 8x compared to what it would be with standard multi-head attention. Similarly, Mistral 7B uses 8 KV heads instead of 32, achieving a 4x reduction.
KV cache memory grows linearly along three axes:

- Batch size: each concurrent sequence needs its own cache.
- Sequence length: every prompt or generated token adds one cache entry per layer.
- Model size: more layers and a larger KV width (n_kv_heads x d_head) increase the per-token cost.
For production serving systems, the KV cache is often the dominant consumer of GPU memory. In some configurations, the KV cache uses more memory than the model weights themselves, particularly with large batch sizes or long context lengths.
KV caching provides a large computational saving, but it also shifts the decode-phase bottleneck from computation to memory bandwidth.
Without KV caching, generating a sequence of length T requires approximately T x (T/2) total key-value projection operations across all steps (the sum 1 + 2 + ... + T). With KV caching, only T total key-value projection operations are needed (one per step). This changes the computational complexity of the projection step from O(T^2) to O(T). In practice, this translates to a 3-5x speedup in end-to-end generation time, depending on model size and hardware.
During the decode phase, the primary bottleneck is not computation but memory bandwidth. Each decode step requires reading the entire KV cache from GPU HBM to the compute units. For a model with a 2 GB KV cache, every single token generation requires reading 2 GB of data from memory. On an NVIDIA A100 GPU with 2 TB/s memory bandwidth, this means each token takes at minimum 1 ms just for the memory read, regardless of how fast the compute is. This is why the decode phase is described as memory-bandwidth-bound.
The tension between needing the KV cache for speed and its large memory footprint has motivated numerous optimization techniques.
Multi-query attention, proposed by Noam Shazeer in 2019, reduces the KV cache by sharing a single set of key and value projections across all query heads. In standard multi-head attention with h heads, each head has its own K and V projections, resulting in h sets of key-value pairs per layer. MQA replaces these with a single shared K and V head, reducing the KV cache by a factor of h.
For a model with 32 attention heads, MQA reduces the KV cache by 32x. The trade-off is a small degradation in model quality, since all query heads must work with the same key and value representations. MQA was adopted by PaLM (Google, 2022) and Falcon (TII, 2023).
Grouped-query attention, introduced by Ainslie et al. (2023), is a compromise between standard multi-head attention and MQA. Instead of one shared KV head (MQA) or h independent KV heads (MHA), GQA uses g groups of KV heads, where 1 < g < h. Each group of h/g query heads shares one set of key-value projections.
| Attention variant | Query heads | KV heads | KV cache reduction factor |
|---|---|---|---|
| Multi-head attention (MHA) | h | h | 1x (baseline) |
| Grouped-query attention (GQA) | h | g | h/g |
| Multi-query attention (MQA) | h | 1 | h |
GQA has become the dominant attention variant in modern LLMs. LLaMA 2 70B uses 8 KV groups (8x reduction), Mistral 7B uses 8 KV heads with 32 query heads (4x reduction), and models in the Gemma, Qwen, and LLaMA 3 families all use GQA. Research shows that models originally trained with MHA can be "uptrained" with GQA using only a fraction of the original training compute, achieving quality close to MHA while providing inference efficiency closer to MQA.
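As a sketch of the mechanics (shapes and names are illustrative), GQA can be implemented by caching only g KV heads and repeating each one across its group of query heads at attention time:

```python
import numpy as np

def expand_kv_heads(K, V, n_query_heads):
    """Repeat each cached KV head across its group of query heads.

    K, V: (n_kv_heads, seq_len, d_head). Returns (n_query_heads, seq_len, d_head).
    """
    n_kv_heads = K.shape[0]
    group_size = n_query_heads // n_kv_heads      # h / g query heads per KV head
    K_expanded = np.repeat(K, group_size, axis=0)
    V_expanded = np.repeat(V, group_size, axis=0)
    return K_expanded, V_expanded

# Mistral-7B-like shapes: 32 query heads, but only 8 KV heads live in the cache
K = np.zeros((8, 1024, 128))
V = np.zeros((8, 1024, 128))
K32, V32 = expand_kv_heads(K, V, n_query_heads=32)
print(K32.shape)   # (32, 1024, 128)
```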
Multi-head latent attention, introduced in DeepSeek-V2 (2024), takes a fundamentally different approach to KV cache compression. Rather than reducing the number of KV heads, MLA compresses the key and value representations into a low-dimensional latent vector before storing them in the cache. At inference time, the compressed latent is projected back to produce full-dimensional keys and values for each head.
Concretely, MLA replaces the standard W_KV projection with a low-rank factorization: the input is first projected down to a small latent vector c (the "compressed KV"), and only c is cached. When attention needs to be computed, c is projected back up to produce the full K and V tensors. This allows every head to have unique K and V representations (preserving the expressiveness of standard MHA) while caching only the small latent vector.
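A simplified sketch of the idea, ignoring MLA's decoupled RoPE path and using made-up dimensions, is a down-projection whose output is cached and up-projections applied when attention is computed:

```python
import numpy as np

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

# Learned projections (random here; trained jointly with the model in practice)
W_down = np.random.randn(d_model, d_latent) * 0.02           # compress input to latent
W_up_k = np.random.randn(d_latent, n_heads * d_head) * 0.02  # latent -> per-head keys
W_up_v = np.random.randn(d_latent, n_heads * d_head) * 0.02  # latent -> per-head values

def mla_cache_entry(x):
    """Cache only the low-dimensional latent, not full per-head K/V."""
    return x @ W_down                                         # (d_latent,)

def mla_expand(c):
    """Reconstruct full-dimensional per-head keys and values from the latent."""
    K = (c @ W_up_k).reshape(n_heads, d_head)
    V = (c @ W_up_v).reshape(n_heads, d_head)
    return K, V

# Per-token cache: 512 latent values instead of 2 * 32 * 128 = 8,192 values
```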
DeepSeek-V2 reported a 93.3% reduction in KV cache size compared to standard MHA, with maximum generation throughput increasing by 5.76x compared to DeepSeek 67B. Ablation studies showed that MLA maintained quality closer to full MHA than GQA did, making it a quality-preserving approach despite the aggressive compression. MLA has since been adopted by DeepSeek-V3, DeepSeek-R1, Kimi K2, and several other models.
Sliding window attention limits each token's attention to a fixed window of w preceding tokens instead of the full sequence. This allows the KV cache to be bounded at a fixed size regardless of how long the generated sequence becomes.
Mistral 7B (Mistral AI, 2023) popularized this approach with a window size of w = 4,096 while supporting a context length of 8,192 tokens. The implementation uses a rolling buffer cache: a fixed-size cache of w entries where older entries are overwritten in a circular fashion as new tokens are generated. This means the cache never grows beyond w entries, providing predictable and bounded memory usage.
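A rolling buffer cache can be sketched as below; this is a simplified illustration, not Mistral's actual implementation.

```python
import numpy as np

class RollingKVCache:
    """Fixed-size KV cache that overwrites the oldest entry once full."""

    def __init__(self, window, d_head):
        self.window = window
        self.K = np.zeros((window, d_head))
        self.V = np.zeros((window, d_head))
        self.count = 0                       # total tokens seen so far

    def append(self, k, v):
        slot = self.count % self.window      # circular write position
        self.K[slot] = k
        self.V[slot] = v
        self.count += 1

    def contents(self):
        """K/V for the most recent min(count, window) tokens (in rotated slot order)."""
        n = min(self.count, self.window)
        return self.K[:n], self.V[:n]
```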
A key insight is that information from tokens beyond the window is not entirely lost. Because each transformer layer applies sliding window attention, a token at layer k can indirectly access information from tokens up to k x w positions away through the cascading effect of intermediate representations. With 32 layers and w = 4,096, Mistral's theoretical attention span reaches approximately 131,072 tokens.
Combined with GQA (8 KV heads instead of 32), Mistral 7B achieves a combined 8x reduction in peak KV cache memory compared to a standard MHA model with the same hidden dimension and full-length caching.
PagedAttention, introduced by Kwon et al. at SOSP 2023, applies ideas from operating system virtual memory management to the KV cache. The core problem it addresses is memory fragmentation: standard inference systems allocate contiguous GPU memory for the maximum possible sequence length for each request, even though most sequences do not use the full allocation. This leads to 60-80% of allocated KV cache memory being wasted.
PagedAttention solves this by dividing the KV cache into fixed-size blocks (typically 16 tokens per block) that can be stored anywhere in GPU memory, similar to how an OS manages virtual memory pages:

- Logical blocks represent a sequence's KV cache as the sequence sees it: contiguous and in order.
- Physical blocks are fixed-size chunks of GPU memory that can live anywhere and be filled in any order.
- A per-sequence block table maps logical blocks to physical blocks, just as a page table maps virtual pages to physical frames.
As a sequence grows, new physical blocks are allocated on demand. When a sequence finishes, its blocks are freed and can be reused by other sequences. The only wasted memory is in the last partially filled block of each sequence (at most block_size - 1 tokens).
PagedAttention also enables memory sharing between sequences. If two requests share the same prompt prefix (common in chat applications with system prompts), their block tables can point to the same physical blocks for the shared portion, using copy-on-write semantics. This further reduces memory usage.
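A toy version of the bookkeeping might look like this; the block size, names, and structure are illustrative, not vLLM's actual data structures.

```python
BLOCK_SIZE = 16   # tokens per block

class BlockAllocator:
    """Hands out fixed-size physical KV blocks from a shared pool."""

    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))

    def allocate(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class SequenceCache:
    """Per-sequence block table: logical block index -> physical block id."""

    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the previous one fills up.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        # Return all physical blocks to the shared pool when the sequence finishes.
        for block_id in self.block_table:
            self.allocator.release(block_id)
        self.block_table.clear()
```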
vLLM, the open-source inference engine built around PagedAttention, reduces KV cache memory waste to under 4% and improves serving throughput by 2-4x compared to systems like FasterTransformer and Orca. vLLM has become one of the most widely used LLM serving frameworks.
KV cache quantization reduces memory consumption by storing cached keys and values in lower numerical precision. While model weights are commonly quantized to INT8 or INT4 for inference, quantizing the KV cache presents unique challenges because keys and values exhibit different statistical properties than weights, often with more outlier channels.
Several approaches have been developed:
| Method | Precision | Compression ratio | Key technique |
|---|---|---|---|
| Standard FP16 cache | 16-bit | 1x (baseline) | No compression |
| INT8 KV cache | 8-bit | 2x | Per-tensor or per-channel quantization |
| KVQuant (2024) | 2-4 bit | 4-8x | Per-channel key quantization before RoPE; non-uniform quantization |
| Coupled Quantization (2024) | 1-2 bit | 8-16x | Exploits interdependence between channels |
| KIVI (2024) | 2-bit | 8x | Per-channel key quantization, per-token value quantization |
A particularly notable result is KVQuant (Hooper et al., 2024), which achieved 8x KV cache compression (2-bit effective precision) while maintaining model quality, enabling a LLaMA-7B model to serve a context length of 1 million tokens on a single A100 GPU.
Quantizing the KV cache is orthogonal to other KV cache reduction techniques (GQA, MLA, etc.) and can be combined with them for multiplicative savings. For example, combining GQA (4x reduction) with INT4 quantization (4x reduction) yields a 16x total reduction in KV cache memory.
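As an illustration of the simplest variant, a symmetric per-token INT8 scheme (a sketch, not any of the published methods above) stores an 8-bit tensor plus one FP16 scale per token:

```python
import numpy as np

def quantize_int8_per_token(x):
    """Symmetric per-token INT8 quantization of a (seq_len, d) tensor."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0   # one scale per token
    scale = np.maximum(scale, 1e-8)                         # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int8(q, scale):
    return q.astype(np.float16) * scale

K = np.random.randn(1024, 128).astype(np.float16)
q, s = quantize_int8_per_token(K)
K_restored = dequantize_int8(q, s)   # ~2x smaller cache, small reconstruction error
```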
Token eviction methods reduce the KV cache by selectively removing cached entries for tokens deemed less important. The challenge is identifying which tokens can be safely evicted without degrading generation quality.
H2O (Heavy-Hitter Oracle): Zhang et al. (2023) observed that a small fraction of tokens accumulate disproportionately high attention scores across generation steps. These "heavy hitter" tokens are critical to preserve. H2O maintains a cache containing both recent tokens (a sliding window) and identified heavy hitters, evicting tokens that are neither recent nor heavily attended. With only 20% of tokens retained as heavy hitters, H2O can improve throughput by up to 29x on leading inference systems.
StreamingLLM: Xiao et al. (2023) discovered that the first few tokens in a sequence ("attention sinks") receive anomalously high attention scores regardless of their semantic content. This happens because softmax attention weights must sum to 1, and the model learns to "dump" excess attention onto initial tokens. StreamingLLM preserves these initial attention sink tokens plus a sliding window of recent tokens, enabling theoretically infinite-length generation with a fixed-size cache.
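In pseudocode, a sink-plus-window eviction policy (a simplified sketch of the StreamingLLM idea, not the reference implementation) keeps two regions of the cache:

```python
def positions_to_keep(seq_len, num_sink_tokens=4, window=1024):
    """Indices of cached tokens retained: initial attention sinks + recent window."""
    if seq_len <= num_sink_tokens + window:
        return list(range(seq_len))                 # nothing to evict yet
    sinks = list(range(num_sink_tokens))            # first few tokens are never evicted
    recent = list(range(seq_len - window, seq_len)) # sliding window of recent tokens
    return sinks + recent

# With a 10,000-token history, only 4 + 1,024 cache entries survive
print(len(positions_to_keep(10_000)))   # 1028
```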
Adaptive compression: More recent approaches adaptively decide which tokens and layers need full-precision caching versus which can be compressed or evicted, recognizing that different layers and attention heads have different sensitivity to cache precision.
Because the prefill phase is compute-bound while the decode phase is memory-bandwidth-bound, some serving systems separate these two phases onto different hardware. The prefill phase runs on GPUs optimized for throughput (processing large batches of prompt tokens), while the decode phase runs on GPUs optimized for memory bandwidth. The KV cache computed during prefill is transferred to the decode GPU. This approach, used in systems like Splitwise and DistServe, can improve overall serving efficiency by 1.5-2x.
The KV cache is directly tied to a model's ability to handle long contexts. As context windows have grown from 2,048 tokens (GPT-3, 2020) to 128,000 tokens (GPT-4 Turbo, 2023) to 1,000,000+ tokens (Gemini 1.5, 2024), KV cache memory requirements have grown proportionally.
For a LLaMA-70B-class model using GQA with 8 KV heads in FP16, the KV cache memory at various context lengths is:
| Context length | KV cache (batch size 1) | KV cache (batch size 32) |
|---|---|---|
| 4,096 | 1.25 GB | 40 GB |
| 32,768 | 10 GB | 320 GB |
| 131,072 | 40 GB | 1,280 GB |
| 1,000,000 | 305 GB | 9,766 GB |
At a context length of 1 million tokens with batch size 32, the KV cache alone would require nearly 10 TB of memory. This illustrates why KV cache optimization is not optional for long-context serving but a hard requirement. Techniques like KV cache quantization, eviction, and MLA are what make million-token context windows practically feasible.
A basic implementation appends new K and V tensors to Python lists or concatenates tensors at each step. While simple, this approach causes frequent memory allocations and copies. Production systems typically pre-allocate the cache to the maximum expected sequence length at the start, then fill in entries as tokens are generated. Pre-allocation avoids the overhead of dynamic memory management but wastes memory when sequences are shorter than the maximum. PagedAttention addresses this trade-off by combining dynamic allocation with efficient memory management.
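A sketch of the pre-allocated pattern (shapes and names illustrative) writes each new token's K/V into a fixed buffer instead of concatenating:

```python
import numpy as np

class PreallocatedKVCache:
    """KV cache allocated once for max_seq_len and filled in place."""

    def __init__(self, n_layers, n_kv_heads, d_head, max_seq_len, dtype=np.float16):
        shape = (n_layers, max_seq_len, n_kv_heads, d_head)
        self.K = np.zeros(shape, dtype=dtype)
        self.V = np.zeros(shape, dtype=dtype)
        self.length = 0

    def append(self, k, v):
        """k, v: (n_layers, n_kv_heads, d_head) for the newest token."""
        self.K[:, self.length] = k     # in-place write, no reallocation or copy
        self.V[:, self.length] = v
        self.length += 1

    def view(self):
        """K/V for the tokens cached so far."""
        return self.K[:, :self.length], self.V[:, :self.length]
```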
When using rotary position embeddings (RoPE), which encode position information directly into the key and query vectors, the cached keys already have positional information baked in. This means cached keys do not need to be re-encoded when new tokens are generated. However, certain KV cache quantization methods (such as KVQuant) quantize keys before applying RoPE because RoPE distorts the channel-wise distribution, making post-RoPE quantization less effective.
For models distributed across multiple GPUs using tensor parallelism, the KV cache is also distributed. Each GPU stores the cache for its assigned attention heads. With GQA, the number of KV heads limits how many GPUs can participate in the KV cache distribution. For example, a model with 8 KV heads can distribute its cache across at most 8 GPUs (one KV head per GPU). This constraint influences the choice of parallelism strategy for large-scale serving.
In continuous batching (also called in-flight batching or iteration-level scheduling), the serving system manages a shared pool of KV cache memory across all active requests. When a request finishes, its cache memory is immediately recycled for new requests. This is more efficient than static batching, where memory is reserved for a fixed batch of requests even after some have completed. vLLM, TensorRT-LLM, and other modern serving frameworks all use continuous batching with PagedAttention for cache management.
Imagine you are reading a long story out loud, and after each sentence you need to answer a question about everything you have read so far. Without a KV cache, you would have to re-read the entire story from the beginning every time you finish a new sentence. That would be very slow if the story is hundreds of pages long.
A KV cache is like taking notes as you read. After each sentence, you write down the important parts (the "keys" and "values") on a notepad. When you need to answer a question about the story, you just look at your notes instead of re-reading everything. The more of the story you read, the more notes you have, and your notepad gets bigger. If your notepad gets too big, you might need tricks to keep it manageable, like only keeping notes about the most recent pages (sliding window) or writing in smaller handwriting (quantization).