KV Cache
Last reviewed
May 24, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v5 · 7,479 words
A KV cache (key-value cache) is a memory optimization technique used during transformer inference that stores previously computed key and value tensors from the attention mechanism so they do not need to be recalculated at each generation step. During autoregressive text generation, a large language model produces tokens one at a time, and each new token must attend to every previous token in the sequence. Without caching, the model would redundantly recompute the key and value projections for all prior tokens at every step, resulting in computation that grows quadratically with sequence length. By storing and reusing these intermediate results, the KV cache reduces per-step computation from O(n) projection operations to O(1), where only the new token's key and value vectors need to be computed and appended to the cache.[1]
The KV cache is one of the most fundamental optimizations in modern LLM serving. Every major inference framework, including vLLM, TensorRT-LLM, Hugging Face Text Generation Inference, and SGLang, implements KV caching by default.[2][3][4] However, the cache introduces its own challenge: memory consumption. For large models with long context windows, the KV cache can consume tens of gigabytes of GPU memory, often exceeding the memory required by the model weights themselves.[5] This has motivated an active area of research into KV cache compression, eviction, quantization, and memory management techniques, as well as commercial features such as prompt caching that expose KV cache reuse directly to API users.[6][7][8]
To understand why the KV cache exists, it is necessary to understand how autoregressive language models generate text.
A decoder-only transformer (such as GPT, LLaMA, or Claude) generates text one token at a time. At each step t, the model takes all tokens produced so far (x_1, x_2, ..., x_t) and predicts the probability distribution over the next token x_{t+1}. This process is inherently sequential: the model cannot produce token t+1 until it has produced token t.
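Inside each transformer layer, the self-attention mechanism computes three matrices from the input embeddings: queries (Q), keys (K), and values (V), each obtained by multiplying the input by a learned projection matrix.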
The attention output is computed as:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
The construction is causal in decoder-only models: each token can only attend to itself and preceding tokens. At generation step t the model computes the query for the new token and must compare it against the keys of all t tokens. The resulting attention weights are then applied to the values of all t tokens.
Without KV caching, the model would recompute the K and V projections for every token at every generation step. At step t, this means computing t key vectors and t value vectors, even though the key and value vectors for tokens 1 through t-1 are identical to what they were at step t-1. This redundant computation is the problem that KV caching solves.[1]
The KV cache operates by storing the key and value tensors computed at each generation step and reusing them in subsequent steps. The process unfolds in two distinct phases.
When the user submits a prompt, the model processes all input tokens in parallel during the prefill phase (also called the prompt encoding phase). For each transformer layer, the model computes the key and value projections for every token in the prompt and stores them in the cache. This phase is compute-bound because it involves large matrix multiplications across the full prompt length. The prefill phase determines the time to first token (TTFT), the latency the user experiences before seeing any output.[9]
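After the prefill phase, the model enters the decode phase, generating output tokens one at a time. At each step, the model computes the query, key, and value projections for the newest token only, appends the new key and value to the cache, attends with the new query over all cached keys and values, and uses the result to predict the next token.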
This means that at decode step t, only a single token's K and V projections need to be computed, while the K and V projections for all previous tokens are simply read from the cache. The decode phase is memory-bandwidth-bound rather than compute-bound: the bottleneck is loading the cached key and value tensors from GPU memory (HBM) into the compute units, not the arithmetic itself.[1][10] Shazeer (2019) identified this memory-bandwidth bottleneck as the original motivation for multi-query attention.[11]
Consider a single-layer, single-head transformer that receives the prompt "The cat sat" and then generates the tokens "on" and "the":
| Step | Input token | Cache before step (K vectors) | Computation | Cache after step (K vectors) |
|---|---|---|---|---|
| 1 (prefill) | "The", "cat", "sat" | Empty | Compute K, V for all 3 tokens; store in cache | K_the, K_cat, K_sat |
| 2 (decode) | "on" | K_the, K_cat, K_sat | Compute K_on, V_on; append to cache; query attends over all 4 | K_the, K_cat, K_sat, K_on |
| 3 (decode) | "the" | K_the, K_cat, K_sat, K_on | Compute K_the2, V_the2; append; query attends over all 5 | K_the, K_cat, K_sat, K_on, K_the2 |
In a real model, this process happens independently at every layer and every attention head, so the total cache stores K and V tensors for every (layer, head, position) combination.
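The walkthrough above can be checked with a minimal sketch. It assumes a single layer, a single head, and toy 2-dimensional projection matrices; the names `KVCache`, `step_without_cache`, and the weight values are illustrative and not taken from any framework. It compares the attention output for the newest token when K and V are recomputed from scratch against the output when they are read from a cache.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    # W is a list of rows; returns the matrix-vector product W @ x.
    return [dot(row, x) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    # Scaled dot-product attention for one query over all given positions.
    weights = softmax([dot(q, k) / math.sqrt(len(q)) for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Toy projection matrices (illustrative values only).
W_Q = [[0.5, -0.2], [0.1, 0.3]]
W_K = [[0.4, 0.1], [-0.3, 0.2]]
W_V = [[1.0, 0.0], [0.5, 0.5]]

def step_without_cache(tokens):
    # Recomputes K and V for every token seen so far: t projections at step t.
    keys = [matvec(W_K, x) for x in tokens]
    values = [matvec(W_V, x) for x in tokens]
    return attend(matvec(W_Q, tokens[-1]), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Only the new token is projected; earlier K and V come from the cache.
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.keys, self.values)

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0]]
cache = KVCache()
matches = []
for t in range(1, len(embeddings) + 1):
    cached_out = cache.step(embeddings[t - 1])
    full_out = step_without_cache(embeddings[:t])
    matches.append(all(math.isclose(a, b) for a, b in zip(cached_out, full_out)))
print(matches)
```

Because the query belongs to the last position, attending over every cached entry already respects the causal constraint, so no explicit mask is needed in this sketch.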
The idea of caching keys and values during autoregressive decoding predates many of the systems that popularized it. The original Transformer paper by Vaswani et al. (2017) introduced the encoder-decoder architecture with multi-head self-attention but did not single out KV caching as a separate technique: it was implicit in the design of incremental decoding.[12] Early implementations in Tensor2Tensor, HuggingFace transformers, and Fairseq exposed a past_key_values argument that returned cached projections from previous decoding steps, making the optimization explicit at the API level.
The first paper to formally describe KV caching as the central bottleneck of LLM inference was Shazeer's 2019 work on multi-query attention, which observed that the memory-bandwidth cost of repeatedly loading the large key and value tensors was the dominant cost of incremental decoding.[11] Pope et al. (2023) at Google extended this analysis to TPU v4 deployments of PaLM-scale models, providing a closed-form analytical model that accounts for KV cache memory, parameter memory, and attention flops.[10] The 2023 SOSP paper by Kwon et al. shifted the framing from per-request KV cache to system-level memory management, treating the cache as a shared resource that can be paged, shared, and copied on write.[13] That paper introduced vLLM and established the now-standard block-based approach used by most modern serving stacks.
The KV cache stores two tensors (K and V) at each layer for each token. The memory required per token is:
KV cache per token (bytes) = 2 x n_layers x n_kv_heads x d_head x bytes_per_element
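where n_layers is the number of transformer layers, n_kv_heads is the number of key-value heads per layer, d_head is the dimension of each head, bytes_per_element is the storage size of one cached value (2 for FP16), and the leading factor of 2 accounts for storing both K and V.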
Since n_kv_heads x d_head often equals the model's hidden dimension d_model (in the standard multi-head attention case), this simplifies to:
KV cache per token (bytes) = 2 x n_layers x d_model x bytes_per_element
The total KV cache memory for a batch of sequences is:
Total KV cache (bytes) = batch_size x seq_length x 2 x n_layers x n_kv_heads x d_head x bytes_per_element
The following table shows approximate KV cache sizes for popular models processing a single sequence of 4,096 tokens in FP16 (2 bytes per element):
| Model | Parameters | Layers | Heads (Q/KV) | d_head | d_model | KV cache per token | KV cache (4K tokens) |
|---|---|---|---|---|---|---|---|
| LLaMA 2 7B | 7B | 32 | 32/32 | 128 | 4,096 | 0.5 MB | 2.0 GB |
| LLaMA 2 13B | 13B | 40 | 40/40 | 128 | 5,120 | 0.78 MB | 3.1 GB |
| LLaMA 2 70B | 70B | 80 | 64/8 | 128 | 8,192 | 0.3 MB | 1.25 GB |
| Llama 3 70B | 70B | 80 | 64/8 | 128 | 8,192 | 0.3 MB | 1.25 GB |
| Mistral 7B | 7B | 32 | 32/8 | 128 | 4,096 | 0.125 MB | 0.5 GB |
| GPT-3 175B | 175B | 96 | 96/96 | 128 | 12,288 | 4.5 MB | 18 GB |
Note that LLaMA 2 70B uses grouped-query attention with only 8 KV heads (instead of 64), reducing its KV cache by 8x compared to what it would be with standard multi-head attention.[14] Llama 3 and Llama 3.1 70B inherit the same shape (80 layers, 8 KV heads, head dimension 128) and reach about 2.5 GB at 8K tokens, 10 GB at 32K tokens, and roughly 40 GB at 128K tokens for a single sequence in FP16.[15][16] Mistral 7B uses 8 KV heads instead of 32, achieving a 4x reduction.[17]
Llama 3.1 405B has 126 transformer layers, 128 query heads, 8 KV heads, head dimension 128, and a 128K-token context window.[15][16] In FP16 the per-token KV cache footprint is 2 x 126 x 8 x 128 x 2 = 516,096 bytes (about 0.49 MB). For a single 128K-token sequence, the cache alone occupies roughly 64 GB in FP16; FP8 halves that to about 32 GB.[18] A batch of 32 such sequences in FP16 requires approximately 2 TB of KV cache memory, which is why large-context production deployments rely on quantization, GQA, and offloading.
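The sizing formulas above translate directly into a small calculator. The sketch below is a plain restatement of the formula; the function names are hypothetical, and sizes are reported in binary units (1 GiB = 1024^3 bytes), which is the convention the table's figures follow.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_element=2):
    # The leading 2 accounts for storing both K and V at every layer.
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_element

def kv_cache_bytes(batch_size, seq_length, n_layers, n_kv_heads, d_head,
                   bytes_per_element=2):
    per_token = kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head,
                                         bytes_per_element)
    return batch_size * seq_length * per_token

GIB = 1024 ** 3

# LLaMA 2 7B: 32 layers, 32 KV heads, head dimension 128, FP16, 4,096 tokens.
print(kv_cache_bytes(1, 4096, 32, 32, 128) / GIB)        # 2.0

# Llama 3.1 405B: 126 layers, 8 KV heads, head dimension 128, FP16.
print(kv_cache_bytes_per_token(126, 8, 128))             # 516096
print(kv_cache_bytes(1, 128 * 1024, 126, 8, 128) / GIB)  # 63.0
```

Passing `bytes_per_element=1` models an FP8 or INT8 cache and halves every figure.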
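KV cache memory grows linearly along three axes: batch size, sequence length, and model size (the product of layer count, KV head count, and head dimension).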
For production serving systems, the KV cache is often the dominant consumer of GPU memory. In some configurations, the KV cache uses more memory than the model weights themselves, particularly with large batch sizes or long context lengths.[13]
KV caching provides two distinct benefits.
Without KV caching, generating a sequence of length T requires approximately T x (T/2) total key-value projection operations across all steps (the sum 1 + 2 + ... + T). With KV caching, only T total key-value projection operations are needed (one per step). This changes the computational complexity of the projection step from O(T^2) to O(T). In practice, this translates to a 3-5x speedup in end-to-end generation time, depending on model size and hardware.[11]
During the decode phase, the primary bottleneck is not computation but memory bandwidth. Each decode step requires reading the entire KV cache from GPU HBM to the compute units. For a model with a 2 GB KV cache, every single token generation requires reading 2 GB of data from memory. On an NVIDIA A100 GPU with 2 TB/s memory bandwidth, this means each token takes at minimum 1 ms just for the memory read, regardless of how fast the compute is.[11][19] On an NVIDIA H100 with about 3.35 TB/s of HBM3 bandwidth the floor is lower, but the cache reads remain the dominant cost during decode for any model that fits in HBM.[20] This is why the decode phase is described as memory-bandwidth-bound.
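Both estimates in this section are simple enough to reproduce. The following sketch counts key-value projection operations with and without caching and computes the bandwidth-imposed lower bound on per-token decode latency; it is a back-of-the-envelope model (function names are hypothetical) that ignores weight reads, kernel launch overhead, and compute time.

```python
def projections_without_cache(T):
    # Step t recomputes K and V for all t tokens: 1 + 2 + ... + T.
    return sum(range(1, T + 1))

def projections_with_cache(T):
    # Each step projects only the newest token.
    return T

def min_decode_latency_seconds(kv_cache_bytes, bandwidth_bytes_per_second):
    # Lower bound: every decode step must stream the whole cache from HBM once.
    return kv_cache_bytes / bandwidth_bytes_per_second

print(projections_without_cache(1000), projections_with_cache(1000))  # 500500 1000
print(min_decode_latency_seconds(2e9, 2e12))  # 0.001, i.e. 1 ms for 2 GB at 2 TB/s
```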
The tension between needing the KV cache for speed and its large memory footprint has motivated numerous optimization techniques. The most common approaches fall into four categories: attention-head sharing (MQA, GQA, MLA), eviction or selection (sliding window, H2O, StreamingLLM, SnapKV, FastGen), quantization (INT8, KIVI, KVQuant), and system-level memory management (PagedAttention, vAttention, RadixAttention, prefix caching).
Multi-query attention, proposed by Noam Shazeer in 2019, reduces the KV cache by sharing a single set of key and value projections across all query heads.[11] In standard multi-head attention with h heads, each head has its own K and V projections, resulting in h sets of key-value pairs per layer. MQA replaces these with a single shared K and V head, reducing the KV cache by a factor of h.
For a model with 32 attention heads, MQA reduces the KV cache by 32x. The trade-off is a small degradation in model quality, since all query heads must work with the same key and value representations. MQA was adopted by PaLM (Google, 2022) and Falcon (TII, 2023).
Grouped-query attention, introduced by Ainslie, Lee-Thorp, de Jong, Zemlyanskiy, Lebron, and Sanghai in 2023, is a compromise between standard multi-head attention and MQA.[14] Instead of one shared KV head (MQA) or h independent KV heads (MHA), GQA uses g groups of KV heads, where 1 < g < h. Each group of h/g query heads shares one set of key-value projections.
| Attention variant | Query heads | KV heads | KV cache reduction factor |
|---|---|---|---|
| Multi-head attention (MHA) | h | h | 1x (baseline) |
| Grouped-query attention (GQA) | h | g | h/g |
| Multi-query attention (MQA) | h | 1 | h |
GQA has become the dominant attention variant in modern LLMs. LLaMA 2 70B uses 8 KV groups (8x reduction), Mistral 7B uses 8 KV heads with 32 query heads (4x reduction), and models in the Gemma, Qwen, and Llama 3 families all use GQA.[14][15][17] The Ainslie et al. paper showed that models originally trained with MHA can be "uptrained" with GQA using only 5% of the original pre-training compute, achieving quality close to MHA while providing inference efficiency closer to MQA.[14]
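The grouping can be made concrete with a short sketch. It assumes that consecutive query heads form a group, which is a common convention but an assumption here rather than something the text specifies; the function names are hypothetical.

```python
def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    # Each block of (n_query_heads / n_kv_heads) consecutive query heads
    # shares one KV head.
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def cache_reduction_factor(n_query_heads, n_kv_heads):
    # KV cache size relative to MHA shrinks by h / g.
    return n_query_heads // n_kv_heads

# 32 query heads with 8 KV heads (the Mistral 7B shape): 4 query heads per KV head.
mapping = [kv_head_for_query_head(i, 32, 8) for i in range(32)]
print(mapping[:8])                    # [0, 0, 0, 0, 1, 1, 1, 1]
print(cache_reduction_factor(32, 8))  # 4
print(cache_reduction_factor(64, 8))  # 8
```

Setting `n_kv_heads=1` reproduces MQA, and setting it equal to `n_query_heads` reproduces MHA.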
Multi-head latent attention, introduced in DeepSeek-V2 (2024), takes a fundamentally different approach to KV cache compression.[21] Rather than reducing the number of KV heads, MLA compresses the key and value representations into a low-dimensional latent vector before storing them in the cache. At inference time, the compressed latent is projected back to produce full-dimensional keys and values for each head.
Concretely, MLA replaces the standard W_KV projection with a low-rank factorization: the input is first projected down to a small latent vector c (the "compressed KV"), and only c is cached. When attention needs to be computed, c is projected back up to produce the full K and V tensors. In DeepSeek-V2 the latent dimension is set to 512, while the model has 128 heads with head dimension 128. Uncompressed MHA would cache 2 x 128 x 128 = 32,768 values per token per layer, so caching only the 512-dimensional latent is roughly a 64x compression of the per-token KV state before accounting for the additional decoupled RoPE channels.[21]
DeepSeek-V2 reported a 93.3% reduction in KV cache size and a 5.76x increase in maximum generation throughput, both relative to DeepSeek 67B.[21] Ablation studies showed that MLA maintained quality closer to full MHA than GQA did, making it a quality-preserving approach despite the aggressive compression. MLA has since been adopted by DeepSeek V3, DeepSeek-R1, Kimi K2, and several other models.[22][23] The DeepSeek team also developed a "weight absorption" trick that avoids materializing the full-rank K and V tensors at inference time by folding the up-projection into the score computation.[21]
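The core idea of caching a latent instead of full keys and values can be sketched with toy dimensions. This is a simplified illustration, not DeepSeek's implementation: the class name, the sizes, and the deterministic weight values are all hypothetical, and the sketch omits the decoupled RoPE channels and the weight-absorption trick.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

D_MODEL, D_LATENT, N_HEADS, D_HEAD = 4, 2, 2, 3  # toy sizes for illustration

# Hypothetical weights, filled deterministically for the example.
W_DOWN = [[0.1 * (i + j + 1) for j in range(D_MODEL)] for i in range(D_LATENT)]
W_UP_K = [[0.01 * (i - j) for j in range(D_LATENT)] for i in range(N_HEADS * D_HEAD)]
W_UP_V = [[0.02 * (i + 2 * j) for j in range(D_LATENT)] for i in range(N_HEADS * D_HEAD)]

class LatentKVCache:
    def __init__(self):
        self.latents = []  # one D_LATENT vector per token is all that is stored

    def append(self, x):
        # Down-project the token's hidden state and cache only the latent.
        self.latents.append(matvec(W_DOWN, x))

    def materialize(self):
        # Up-project every cached latent into full keys and values on demand.
        keys = [matvec(W_UP_K, c) for c in self.latents]
        values = [matvec(W_UP_V, c) for c in self.latents]
        return keys, values

cache = LatentKVCache()
cache.append([1.0, 0.0, 0.5, -1.0])
cache.append([0.2, 0.3, 0.0, 1.0])
keys, values = cache.materialize()

cached_per_token = D_LATENT           # latent cache: 2 values per token
mha_per_token = 2 * N_HEADS * D_HEAD  # uncompressed K and V: 12 values per token
print(cached_per_token, mha_per_token)
```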
The following table summarizes the four main attention designs with respect to KV cache footprint, quality, and adoption. Cache reduction is relative to vanilla MHA at the same hidden size.
| Variant | Year | Cache reduction (vs MHA) | Quality | Production adopters |
|---|---|---|---|---|
| Multi-head attention (MHA) | 2017 | 1x | Baseline | Original Transformer, GPT-2/3, LLaMA 1 |
| Multi-query attention (MQA) | 2019 | h x (e.g., 32x) | Slight degradation | PaLM, Falcon |
| Grouped-query attention (GQA) | 2023 | h/g x (e.g., 4 to 8x) | Near-MHA | Llama 2/3, Mistral, Mixtral, Gemma, Qwen, Phi-3 |
| Multi-head latent attention (MLA) | 2024 | ~57x (DeepSeek-V2 configuration, including decoupled RoPE channels) | Slightly better than MHA in ablations | DeepSeek-V2/V3/R1, Kimi K2 |
Sliding window attention limits each token's attention to a fixed window of w preceding tokens instead of the full sequence. This allows the KV cache to be bounded at a fixed size regardless of how long the generated sequence becomes.
Mistral 7B (Mistral AI, 2023) popularized this approach with a window size of w = 4,096 while supporting a context length of 8,192 tokens.[17] The implementation uses a rolling buffer cache: a fixed-size cache of w entries where older entries are overwritten in a circular fashion as new tokens are generated. Concretely, at time step i the key and value for that token are written to position i mod w in the buffer. This means the cache never grows beyond w entries, providing predictable and bounded memory usage.[17]
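A minimal sketch of such a rolling buffer follows. The class and method names are hypothetical, and keys and values are stood in for by strings; only the indexing scheme (slot = position mod window) is taken from the description above.

```python
class RollingKVCache:
    def __init__(self, window):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window
        self.n_seen = 0  # total number of tokens written so far

    def append(self, key, value):
        # Token at position i overwrites slot i mod window.
        slot = self.n_seen % self.window
        self.keys[slot] = key
        self.values[slot] = value
        self.n_seen += 1

    def visible(self):
        # (position, key, value) for the tokens still cached, oldest first.
        start = max(0, self.n_seen - self.window)
        return [(i, self.keys[i % self.window], self.values[i % self.window])
                for i in range(start, self.n_seen)]

cache = RollingKVCache(window=4)
for i in range(6):
    cache.append(f"k{i}", f"v{i}")

print([pos for pos, _, _ in cache.visible()])  # [2, 3, 4, 5]
print(cache.keys)                              # ['k4', 'k5', 'k2', 'k3']
```

The storage never exceeds `window` entries, at the cost of dropping direct access to tokens older than the window.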
A key insight is that information from tokens beyond the window is not entirely lost. Because each transformer layer applies sliding window attention, a token at layer k can indirectly access information from tokens up to k x w positions away through the cascading effect of intermediate representations. With 32 layers and w = 4,096, Mistral's theoretical attention span reaches approximately 131,072 tokens.[17]
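GQA (8 KV heads instead of 32) contributes a 4x reduction, and the rolling buffer (4,096 cached positions instead of 8,192) contributes a further 2x at the full context length, so Mistral 7B achieves a combined 8x reduction in peak KV cache memory compared to a standard MHA model with the same hidden dimension and full-length caching.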
PagedAttention, introduced by Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, and Stoica at SOSP 2023, applies ideas from operating system virtual memory management to the KV cache.[13] The core problem it addresses is memory fragmentation: standard inference systems allocate contiguous GPU memory for the maximum possible sequence length for each request, even though most sequences do not use the full allocation. According to the paper, this leads to 60-80% of allocated KV cache memory being wasted by internal and external fragmentation in baseline systems such as FasterTransformer and Orca.[13]
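PagedAttention solves this by dividing the KV cache into fixed-size blocks (typically 16 tokens per block) that can be stored anywhere in GPU memory, similar to how an OS manages virtual memory pages: each sequence's cache is a list of logical blocks, and a per-sequence block table maps every logical block to a physical block that need not be contiguous with its neighbors.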
As a sequence grows, new physical blocks are allocated on demand. When a sequence finishes, its blocks are freed and can be reused by other sequences. The only wasted memory is in the last partially filled block of each sequence (at most block_size - 1 tokens).
PagedAttention also enables memory sharing between sequences. If two requests share the same prompt prefix (common in chat applications with system prompts) or if a single request uses parallel sampling (multiple completions for one prompt), their block tables can point to the same physical blocks for the shared portion, using copy-on-write semantics. This further reduces memory usage.[13]
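The bookkeeping behind on-demand allocation and copy-on-write sharing can be sketched as follows. This models only the block tables and reference counts, not the tensor data or attention kernels, and it is not vLLM's actual code: `BlockAllocator`, `Sequence`, and their methods are hypothetical names.

```python
class BlockAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.ref_count = {}  # physical block id -> number of sequences using it

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV cache blocks")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block):
        self.ref_count[block] += 1
        return block

    def release(self, block):
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            del self.ref_count[block]
            self.free.append(block)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        alloc = self.allocator
        if self.num_tokens % alloc.block_size == 0:
            # Last block is full (or there is none yet): allocate on demand.
            self.block_table.append(alloc.allocate())
        elif alloc.ref_count[self.block_table[-1]] > 1:
            # Copy-on-write: a shared block gets a private copy before a write.
            alloc.release(self.block_table[-1])
            self.block_table[-1] = alloc.allocate()
        self.num_tokens += 1

    def fork(self):
        # The child points at the same physical blocks as the parent.
        child = Sequence(self.allocator)
        child.block_table = [self.allocator.share(b) for b in self.block_table]
        child.num_tokens = self.num_tokens
        return child

    def release_all(self):
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table, self.num_tokens = [], 0

allocator = BlockAllocator(num_blocks=8, block_size=4)
parent = Sequence(allocator)
for _ in range(6):          # 6 tokens occupy 2 blocks; the second is half full
    parent.append_token()
child = parent.fork()       # e.g. a second sample for the same prompt
child.append_token()        # writes into the shared partial block, so it is copied
print(parent.block_table, child.block_table, len(allocator.free))
```

After the fork, the full first block stays shared and only the partially filled block is duplicated, which is where the memory saving for shared prefixes comes from.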
vLLM, the open-source inference engine built around PagedAttention, reduces KV cache memory waste to under 4% and improves serving throughput by 2 to 4x compared to systems like FasterTransformer and Orca according to the paper's evaluation.[13] Subsequent vLLM releases, including the V1 engine that became the default in mid-2025, further reduced scheduling overhead at high concurrency and added speculative decoding integrations.[2] By 2025, vLLM was reported as the serving engine behind production deployments at Meta, IBM, Cohere, and many third-party LLM providers.[2]
vAttention (Prabhu, Nayak, Mohan, Ramjee, and Panwar, Microsoft Research, 2024) argues that the non-contiguous block layout of PagedAttention complicates attention kernel implementation and reduces portability.[24] The system decouples virtual and physical memory using CUDA's cuMemMap APIs: each sequence is given a contiguous virtual address range, and physical pages are mapped on demand. Attention kernels see the cache as a contiguous tensor, allowing unmodified FlashAttention and FlashInfer kernels to be used. The authors report higher throughput than PagedAttention-based kernels on several workloads.[24]
Most modern KV cache optimizations are paired with continuous batching (also called iteration-level scheduling or in-flight batching), introduced by Yu et al. at OSDI 2022 in the Orca paper.[25] Continuous batching schedules at the granularity of a single decode iteration: when one request finishes, its KV cache slot is reclaimed and a new request joins the batch immediately. Orca reported up to 36.9x throughput improvement over FasterTransformer at equivalent latency targets.[25] vLLM, TGI, TensorRT-LLM, and SGLang all build on this scheduling model.
KV cache quantization reduces memory consumption by storing cached keys and values in lower numerical precision. Keys and values exhibit different statistical properties than weights (often with more outlier channels), so dedicated KV-cache quantization schemes are necessary.
| Method | Year | Precision | Compression ratio | Key technique |
|---|---|---|---|---|
| Standard FP16 cache | 2017 | 16-bit | 1x (baseline) | No compression |
| INT8 KV cache | various | 8-bit | 2x | Per-tensor or per-channel quantization |
| KIVI (ICML 2024) | 2024 | 2-bit | ~8x | Per-channel key quantization, per-token value quantization |
| KVQuant (NeurIPS 2024) | 2024 | 2 to 4 bit | 4 to 8x | Per-channel key quantization before RoPE; non-uniform quantization; dense-and-sparse |
| Coupled Quantization | 2024 | 1 to 2 bit | 8 to 16x | Exploits interdependence between channels |
| FP8 KV cache (TRT-LLM, vLLM) | 2024+ | 8-bit | 2x | Hardware-supported FP8 quantization on Hopper/Blackwell |
KVQuant (Hooper et al., NeurIPS 2024) combines per-channel key quantization, pre-RoPE key quantization, non-uniform per-layer datatypes, and per-vector dense-and-sparse separation of outliers.[26] The paper reports less than 0.1 perplexity degradation at 3-bit precision on Wikitext-2 and C4 for LLaMA, Llama-2, Llama-3, and Mistral, and demonstrates serving LLaMA-7B at up to 1 million tokens of context on a single A100-80GB and 10 million tokens on an 8-GPU system, with custom CUDA kernels yielding up to 1.7x speedups.[26]
KIVI (Liu et al., ICML 2024) demonstrates that asymmetric quantization is necessary because keys and values have different distributions: keys should be quantized per channel while values should be quantized per token.[27] With 2-bit quantization, KIVI enables Llama, Falcon, and Mistral models to maintain near-baseline quality while using 2.6x less peak memory and supporting up to 4x larger batch sizes, yielding 2.35x to 3.47x throughput on real workloads.[27]
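The effect of the grouping axis can be illustrated with a generic min-max quantizer. The sketch below is not KIVI's algorithm (which also uses group-wise quantization and keeps a full-precision residual window); it only shows, on a toy key matrix with one large-magnitude channel, why quantizing keys per channel loses less than quantizing them per token. All names and values are hypothetical.

```python
def fake_quantize(values, bits=2):
    # Asymmetric min-max quantization to 2**bits levels, returned dequantized
    # so the rounding error can be measured directly.
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def quantize_per_token(matrix, bits=2):
    # Rows are tokens: each token gets its own scale and zero point.
    return [fake_quantize(row, bits) for row in matrix]

def quantize_per_channel(matrix, bits=2):
    # Columns are channels: each channel gets its own scale and zero point.
    return transpose([fake_quantize(col, bits) for col in transpose(matrix)])

def max_abs_error(original, restored):
    return max(abs(a - b)
               for row_o, row_r in zip(original, restored)
               for a, b in zip(row_o, row_r))

# Toy keys: 4 tokens x 3 channels, with channel 1 acting as an outlier channel.
keys = [[0.1, 8.0, -0.2],
        [0.2, 9.0, 0.1],
        [0.0, 10.0, 0.4],
        [0.3, 11.0, 0.7]]

per_channel_err = max_abs_error(keys, quantize_per_channel(keys))
per_token_err = max_abs_error(keys, quantize_per_token(keys))
print(per_channel_err < per_token_err)  # True
```

With per-token grouping, the outlier channel stretches every row's range and the small channels collapse onto a single level; per-channel grouping isolates the outlier.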
Quantization is orthogonal to other KV cache reduction techniques (GQA, MLA, etc.) and can be combined for multiplicative savings: GQA (4x) combined with INT4 quantization (4x) yields 16x total reduction.
Token eviction methods reduce the KV cache by selectively removing cached entries for tokens deemed less important. The challenge is identifying which tokens can be safely evicted without degrading generation quality.
H2O (Heavy-Hitter Oracle). Zhang et al. (NeurIPS 2023) observed that a small fraction of tokens accumulate disproportionately high attention scores across generation steps. H2O maintains a cache of both recent tokens (a sliding window) and identified heavy hitters, evicting tokens that are neither recent nor heavily attended. The authors formalize the eviction problem as a dynamic submodular optimization. With only 20% of tokens retained, H2O improves throughput by up to 29x over DeepSpeed Zero-Inference and HuggingFace Accelerate, and 3x over FlexGen on OPT-6.7B and OPT-30B.[28]
StreamingLLM. Xiao, Tian, Chen, Han, and Lewis (ICLR 2024) discovered that the first few tokens in a sequence ("attention sinks") receive anomalously high attention scores regardless of their semantic content.[29] This happens because softmax attention weights must sum to 1, and the model learns to dump excess attention onto initial tokens. StreamingLLM preserves these initial attention sink tokens plus a sliding window of recent tokens, enabling stable generation up to 4 million tokens with a fixed-size cache. Compared to a sliding window with recomputation baseline, StreamingLLM achieves up to 22.2x speedup, and the method has been integrated into NVIDIA TRT-LLM and on-device stacks.[29]
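The retention rules of H2O and StreamingLLM can each be written as a function that returns the positions to keep. Both functions below are simplified sketches with hypothetical names and illustrative default sizes: the H2O version makes a one-shot selection from accumulated attention scores rather than the paper's step-by-step greedy eviction, and the StreamingLLM version ignores the position re-indexing the method applies inside the cache.

```python
def h2o_keep_positions(accumulated_scores, n_recent, n_heavy):
    # Keep the most recent tokens plus the older tokens that have received
    # the most attention in total ("heavy hitters").
    n = len(accumulated_scores)
    recent = set(range(max(0, n - n_recent), n))
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)[:n_heavy]
    return sorted(recent | set(heavy))

def streaming_keep_positions(n_tokens, n_sink=4, window=8):
    # Keep the first n_sink tokens (attention sinks) plus a recent window.
    sinks = list(range(min(n_sink, n_tokens)))
    recent_start = max(len(sinks), n_tokens - window)
    return sinks + list(range(recent_start, n_tokens))

scores = [5.0, 0.1, 0.2, 3.0, 0.1, 0.3, 0.2, 0.4]
print(h2o_keep_positions(scores, n_recent=2, n_heavy=2))  # [0, 3, 6, 7]
print(streaming_keep_positions(20))  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```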
SnapKV. Li et al. (NeurIPS 2024) observed that each attention head consistently focuses on specific prompt features and that this pattern can be predicted from a small observation window at the end of the prompt.[30] SnapKV uses a clustering pooling mechanism to select and compress the most important attention features, achieving a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency at 16K input tokens, and passing a 380,000-token Needle-in-a-Haystack pressure test.[30]
FastGen. Ge, Zhang, Liu, and collaborators (2023) showed that different attention heads have different preferred eviction strategies: some heads focus on local context, some on special tokens, and some attend broadly.[31] FastGen profiles each head briefly and then uses a different eviction policy per head, requiring no retraining.
MiniCache. Liu et al. (2024) compress the KV cache along the depth dimension by merging similar key-value states between adjacent middle-to-deep layers.[32] The states are decomposed into magnitude and direction, with directions interpolated and a token retention strategy preserving highly distinct pairs. On ShareGPT, LLaMA-2-7B with 4-bit MiniCache reaches up to 5.02x compression ratio and 41% memory reduction versus the FP16 full cache baseline.[32]
Adaptive policies (TOVA, Scissorhands, KVCompose, and others) further refine these ideas by deciding per token, per head, or per layer how aggressively to compress.
Because prefill is compute-bound while decode is memory-bandwidth-bound, some serving systems separate these phases onto different hardware. Prefill runs on GPUs optimized for throughput, decode on GPUs optimized for memory bandwidth, and the KV cache computed during prefill is transferred to the decode GPU.
Splitwise (Microsoft Research) and DistServe (Zhong et al., 2024) were among the first systems to fully disaggregate prefill and decode. DistServe reports up to 4.48x goodput or 10.2x tighter SLOs compared to colocated baselines.[33] Mooncake, the KVCache-centric architecture behind Moonshot AI's Kimi service, treats KVCache as a first-class scheduling entity stored across GPU, DRAM, and SSD tiers, and reports handling 75% more requests under real workloads.[34] By 2025 most production-grade serving frameworks (NVIDIA Dynamo, llm-d, Ray Serve LLM, SGLang, vLLM, LMCache, Mooncake) had adopted some form of prefill-decode disaggregation.[34]
When the KV cache no longer fits in GPU HBM, modern serving systems spill it down a memory hierarchy: HBM, host DRAM, local NVMe SSD, then network-attached storage. SGLang HiCache organizes these tiers as L1/L2/L3 layers of an extended RadixAttention tree with asynchronous prefetching.[4] LMCache, an open-source KV cache layer for enterprise-scale LLM inference, stores cache blocks across HBM, CPU memory, local disk, remote disk, and Redis, transferring over Ethernet, RDMA, or NVLink, and integrates with vLLM for cross-instance KV cache reuse.[35] CacheBlend (Yao et al., EuroSys 2025) allows reuse of KV caches for arbitrary chunks of a RAG context (not only prefixes), selectively recomputing the KV cache of a small set of critical tokens to preserve quality and reducing TTFT by 3x; paired with vLLM it speeds up RAG by 4.5x.[36] KVSwap (2025) targets on-device LLMs by aggressively offloading KV cache to disk, using compact in-memory metadata to predict which entries to preload, and reports up to 1.8x throughput on NVMe and 4.1x on eMMC at 32K context length.[37]
Prefix caching detects that a new request shares a prompt prefix with a previous request and reuses the KV blocks already computed for the shared portion. vLLM implements automatic prefix caching with block-level hashing on top of PagedAttention.[2] SGLang's RadixAttention organizes the KV cache as a radix tree where each node is the cache of a consecutive span of tokens; common prefixes share nodes, and an LRU policy evicts unused branches.[4] RadixAttention works at token-level granularity (rather than fixed block boundaries) and is particularly effective for multi-turn dialogues, agent workflows, and structured generation. TensorRT-LLM provides similar functionality through its KV cache reuse API with a priority-based eviction policy.[3]
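Block-level prefix matching can be sketched with chained hashes. This is an illustration in the spirit of block-hash prefix caching, not the implementation of any of the systems named above: the class, the hash construction, and the 4-token block size are assumptions made for the example, and eviction is omitted.

```python
import hashlib

BLOCK_SIZE = 4  # illustrative; real systems typically use larger blocks

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    # Hash each full block together with its parent's hash, so that a block's
    # identity depends on the whole prefix and not only on its own tokens.
    hashes, parent = [], b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes

class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> placeholder for a physical KV block

    def match_and_insert(self, token_ids):
        # Returns how many leading tokens can reuse already cached KV blocks,
        # then registers the remaining full blocks of this request.
        hashes = block_hashes(token_ids)
        n_hit = 0
        for h in hashes:
            if h not in self.blocks:
                break
            n_hit += 1
        for h in hashes[n_hit:]:
            self.blocks[h] = object()
        return n_hit * BLOCK_SIZE

system_prompt = list(range(8))  # shared 8-token prefix
cache = PrefixCache()
print(cache.match_and_insert(system_prompt + [100, 101, 102, 103]))  # 0
print(cache.match_and_insert(system_prompt + [200, 201, 202, 203]))  # 8
```

Only whole blocks are matched, which is the granularity limitation that a token-level radix tree avoids.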
In 2024 the major frontier-model API providers exposed KV cache reuse to API users as a billable feature, typically under the name "prompt caching" or "context caching." This is the user-visible product surface of prefix caching.
Anthropic launched prompt caching for Claude 3.5 Sonnet, Claude 3 Opus, and Claude 3 Haiku in public beta on August 14, 2024, with general availability on December 17, 2024.[6][38] The feature uses cache_control markers to designate the portion of the prompt that should be cached. Cache writes are charged at 1.25x the base input rate for a 5-minute TTL and 2x for a 1-hour extended TTL, while cache reads are charged at 0.1x of the base input rate. Anthropic reports up to 90% cost reduction and up to 85% latency reduction on long prompts.[6]
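The multipliers above imply a simple cost model. The sketch below uses only the 1.25x write and 0.1x read multipliers stated in the text; it assumes, as a simplification, that the first request writes the cache and every later request reads it within the TTL. The function name and the token counts are hypothetical, and costs are expressed in units of the base input price per token.

```python
def input_cost(cached_tokens, uncached_tokens, n_requests,
               write_multiplier=1.25, read_multiplier=0.10):
    # Returns (cost without caching, cost with caching) in base-price units.
    without_cache = n_requests * (cached_tokens + uncached_tokens)
    with_cache = (cached_tokens * write_multiplier                       # first request writes
                  + (n_requests - 1) * cached_tokens * read_multiplier  # later requests read
                  + n_requests * uncached_tokens)                        # fresh suffix every time
    return without_cache, with_cache

# A 10,000-token shared prefix, 100 fresh tokens per request, 10 requests.
without_cache, with_cache = input_cost(10_000, 100, 10)
print(without_cache, with_cache)        # 101000 22500.0
print(1 - with_cache / without_cache)   # about 0.78, i.e. roughly 78% cheaper
```

A single request is more expensive with caching (1.25x on the prefix); under these multipliers the write premium is recovered from the second request onward.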
OpenAI introduced Prompt Caching on October 1, 2024, automatically applying it to GPT-4o, GPT-4o mini, o1-preview, and o1-mini.[7] The implementation requires no API changes, applies to prompts longer than 1,024 tokens, caches in 128-token increments, and discounts cached input tokens by 50% (later increased to 90% for some models).[7] Caches are cleared after 5 to 10 minutes of inactivity and always within one hour.
Google's Gemini API offered explicit context caching from May 2024 and added implicit caching for Gemini 2.5 models in 2025, billing cache reads at 10% of input price and adding a per-hour storage charge (about $4.50 per million tokens per hour for Pro, $1.00 for Flash).[8] The discount grew from 75% on Gemini 2.0 to 90% on Gemini 2.5 and later models.[8]
A complementary line of work optimizes the attention computation itself rather than the cache layout. FlashAttention (Dao, Fu, Ermon, Rudra, and Re, NeurIPS 2022) is an IO-aware exact attention algorithm that tiles the attention computation over GPU on-chip SRAM, avoiding materialization of the full T x T attention matrix in HBM.[39] FlashAttention-2 (Dao, 2023) improved work partitioning, achieving roughly 2x speedup over FlashAttention with up to 225 TFLOPs/s and 72% model flop utilization on A100.[40] FlashAttention-3 (Shah, Bikshandi, Zhang, Thakkar, Ramani, and Dao, NeurIPS 2024) exploits Hopper-specific features through warp-specialization, interleaved softmax and matmul, and block-quantized FP8 with incoherent processing.[41] It reports about 1.5 to 2.0x speedup over FlashAttention-2 in FP16, reaching up to 740 TFLOPs/s (75% utilization on H100), and 1.2 PFLOPs/s in FP8 with 2.6x lower numerical error than a baseline FP8 attention.[41] FlashAttention kernels are the standard backend for vLLM, SGLang, and TGI for both prefill and paged decode.
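The tiling idea rests on an online softmax: attention can be accumulated one tile of keys at a time while keeping only a running maximum, a running normalizer, and a running weighted sum. The sketch below demonstrates that recurrence for a single query in plain Python and checks it against a direct computation. It illustrates the arithmetic only, not FlashAttention's GPU kernels or memory layout, and the function names are hypothetical.

```python
import math

def attention_reference(q, keys, values):
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z
            for j in range(len(values[0]))]

def attention_tiled(q, keys, values, tile=2):
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            acc = [a + p * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.3, -0.7]
keys = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.5], [0.0, -3.0], [2.0, 2.0]]
values = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [0.5, 0.5], [-2.0, 4.0]]
print(attention_reference(q, keys, values))
print(attention_tiled(q, keys, values, tile=2))
```

The two results agree up to floating-point rounding, and the tiled version never holds more than one tile of scores at a time.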
The KV cache is directly tied to a model's ability to handle long contexts. As context windows have grown from 2,048 tokens (GPT-3, 2020) to 128,000 tokens (GPT-4 Turbo, 2023) to 1,000,000+ tokens (Gemini, 2024), KV cache memory requirements have grown proportionally.[42]
For a Llama-3-70B-class model using GQA with 8 KV heads in FP16, the KV cache memory at various context lengths is:
| Context length | KV cache (batch size 1) | KV cache (batch size 32) |
|---|---|---|
| 4,096 | 1.25 GB | 40 GB |
| 32,768 | 10 GB | 320 GB |
| 131,072 | 40 GB | 1,280 GB |
| 1,000,000 | 305 GB | 9,766 GB |
At a context length of 1 million tokens with batch size 32, the KV cache alone would require nearly 10 TB of memory. This illustrates why KV cache optimization is not optional for long-context serving but a hard requirement.[15][16] Techniques like KV cache quantization, eviction, MLA, and hierarchical offloading are what make million-token context windows practically feasible.[26][21][34]
Ring attention and other sequence-parallel attention algorithms distribute the KV cache across GPUs in a ring topology, enabling extremely long contexts to be processed without any single GPU holding the entire cache.
A basic implementation appends new K and V tensors to Python lists or concatenates tensors at each step. While simple, this approach causes frequent memory allocations and copies. Production systems typically pre-allocate the cache to the maximum expected sequence length at the start, then fill in entries as tokens are generated. Pre-allocation avoids the overhead of dynamic memory management but wastes memory when sequences are shorter than the maximum. PagedAttention addresses this trade-off by combining dynamic allocation with efficient block-based memory management; vAttention provides a different solution using CUDA virtual memory primitives.[13][24]
When using rotary position embeddings (RoPE), which encode position information directly into the key and query vectors, the cached keys already have positional information baked in. This means cached keys do not need to be re-encoded when new tokens are generated. However, certain KV cache quantization methods (such as KVQuant) quantize keys before applying RoPE because RoPE distorts the channel-wise distribution and makes post-RoPE quantization less effective.[26] DeepSeek-V2's MLA likewise decouples a small number of "rotary" KV channels from the compressed latent path so RoPE can be applied without disturbing the low-rank compression.[21]
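The reason rotated keys can be cached is that the attention score between a rotated query and a rotated key depends only on their relative offset. The sketch below checks this numerically. It assumes the interleaved pairing of channels and a base of 10000; some implementations pair channels differently, and the helper names `rope` and `dot` are illustrative.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of `vec` by position-dependent angles."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # frequency for the pair (i, i + 1)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 0.25, 2.0]
k = [1.5, 0.5, -0.75, 1.0]

# Same relative offset (3 positions) at two different absolute positions.
score_early = dot(rope(q, 10), rope(k, 7))
score_late = dot(rope(q, 103), rope(k, 100))
```

Here `score_early` and `score_late` agree up to floating-point error, so a key rotated once at its own position and stored in the cache remains valid for every later query.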
For models distributed across multiple GPUs using tensor parallelism, the KV cache is also distributed. Each GPU stores the cache for its assigned attention heads. With GQA, the number of KV heads limits how many GPUs can participate in the KV cache distribution. For example, a model with 8 KV heads can distribute its cache across at most 8 GPUs (one KV head per GPU) unless KV heads are replicated across devices, which spends extra memory on duplicate cache entries. This constraint influences the choice of parallelism strategy for large-scale serving.[10]
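As a small illustration of this constraint, the following sketch assigns KV heads to tensor-parallel ranks and rejects configurations that would require replication. The function `shard_kv_heads` is hypothetical and assumes an even, contiguous split.

```python
def shard_kv_heads(num_kv_heads, tp_size):
    """Return the list of KV head indices whose cache lives on each rank."""
    if tp_size > num_kv_heads:
        # More ranks than KV heads: an even split is impossible without
        # storing the same head's cache on several GPUs.
        raise ValueError("tp_size exceeds num_kv_heads; heads would need replication")
    if num_kv_heads % tp_size != 0:
        raise ValueError("num_kv_heads must be divisible by tp_size")
    per_rank = num_kv_heads // tp_size
    return [list(range(rank * per_rank, (rank + 1) * per_rank))
            for rank in range(tp_size)]
```

With 8 KV heads, `shard_kv_heads(8, 4)` places two heads on each of four GPUs, while `shard_kv_heads(8, 16)` raises an error.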
In continuous batching (also called in-flight batching or iteration-level scheduling), the serving system manages a shared pool of KV cache memory across all active requests.[25] When a request finishes, its cache memory is immediately recycled for new requests. This is more efficient than static batching, where memory is reserved for a fixed batch of requests even after some have completed. vLLM, TensorRT-LLM, SGLang, and Hugging Face TGI all use continuous batching with paged or radix-style cache management.[2][3][4]
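A shared pool with immediate recycling can be sketched with a free list of fixed-size blocks and a per-request block table. The class below is a toy model for illustration only; `BlockPool` and its methods are hypothetical names and do not reflect the internals of vLLM, TensorRT-LLM, SGLang, or TGI.

```python
class BlockPool:
    """Toy shared KV block pool: requests borrow blocks and return them when done."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # ids of unused blocks
        self.tables = {}                     # request id -> list of block ids
        self.lengths = {}                    # request id -> tokens stored

    def append_token(self, req_id):
        """Account for one more token, taking a new block only when needed."""
        table = self.tables.setdefault(req_id, [])
        n = self.lengths.get(req_id, 0)
        if n == len(table) * self.block_size:   # current blocks are full
            if not self.free:
                raise MemoryError("no free KV blocks; request must wait")
            table.append(self.free.pop())
        self.lengths[req_id] = n + 1

    def finish(self, req_id):
        """Return all of a finished request's blocks to the pool immediately."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

In this model a new request that fails with `MemoryError` can proceed as soon as any other request calls `finish`, without waiting for the rest of the batch, which is the behavior that distinguishes continuous batching from static batching.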
Block size is a key tuning knob for paged KV caches. Larger blocks improve the efficiency of attention kernels (more contiguous memory per kernel launch), but reduce the granularity of prefix sharing because only entire blocks can be shared between requests.[3] TensorRT-LLM exposes a configurable block size and observes that this is a workload-dependent trade-off: chat workloads with long shared system prompts often benefit from smaller blocks, while batch-completion workloads benefit from larger blocks.[3]
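The granularity effect is simple arithmetic: if only whole blocks can be shared, the reusable part of a common prefix is rounded down to a multiple of the block size. The helper below is an illustrative sketch, and the block sizes shown are examples and not the defaults of any particular system.

```python
def shareable_tokens(shared_prefix_len, block_size):
    """Tokens of a shared prefix that fall inside completely filled blocks."""
    return (shared_prefix_len // block_size) * block_size

if __name__ == "__main__":
    prefix = 1000  # tokens common to two requests
    for block_size in (16, 64, 128, 256):
        reused = shareable_tokens(prefix, block_size)
        print(f"block size {block_size:>3}: reuse {reused} tokens, "
              f"recompute {prefix - reused}")
```

For a 1,000-token shared prefix, blocks of 16 tokens allow 992 tokens to be reused, while blocks of 256 tokens allow only 768, leaving 232 tokens to be recomputed for every request.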
KV cache management has become a defining feature of LLM serving stacks. Notable systems as of 2025-2026:
These systems also work alongside speculative decoding techniques that further reduce the per-token cost of decode by drafting multiple candidate tokens at once.
KV caching makes autoregressive generation feasible at scale, but it has limitations. For long-context serving, KV cache memory frequently exceeds parameter memory and is the binding constraint on batch size during decode. Techniques like MLA, KIVI, KVQuant, and CacheBlend each address a slice of the problem, but no single technique closes the gap to "memory-free" long-context inference.[21][26][27][36]
Aggressive eviction (H2O, SnapKV) and quantization (KIVI, KVQuant) can preserve perplexity on standard benchmarks while degrading on long-context tasks such as needle-in-a-haystack retrieval, multi-document summarization, and tool-use chains. Adaptive policies that pick a per-head or per-layer strategy are an active research area.[28][30][31][32]
Prefix caching is brittle to small prompt edits: a single token change near the start of a long prompt invalidates the entire downstream cache. Techniques like CacheBlend that allow non-prefix reuse are emerging but require careful quality control.[36] Disaggregation reduces prefill-decode interference but increases system complexity, KV cache transfer costs, and failure surface; Mooncake and Dynamo invest heavily in fast KV cache transport (RDMA, NVLink, NIC offload) to make disaggregation pay off.[34] Finally, different KV cache compression papers report results on different models, datasets, sequence lengths, and quality metrics, making cross-paper comparison difficult; survey papers such as Li et al. (2024) provide consolidated taxonomies but the field lacks a canonical leaderboard.[44]
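The brittleness of prefix caching noted above follows from how cached blocks are typically identified. In the sketch below, each block's key is a hash of its own tokens chained with the key of the preceding block, so any edit changes the key of the edited block and of every block after it. This is a hypothetical scheme written for illustration; the function names and the use of SHA-256 are assumptions and not a description of any specific framework.

```python
import hashlib

def block_keys(token_ids, block_size):
    """Chained hash per full block: each key depends on all preceding tokens."""
    keys = []
    parent = b""
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys

def reusable_blocks(cached_keys, new_keys):
    """Number of leading blocks whose keys still match the cached prompt."""
    count = 0
    for old, new in zip(cached_keys, new_keys):
        if old != new:
            break
        count += 1
    return count

original = list(range(64))                 # stand-in for a tokenized prompt
cached = block_keys(original, block_size=8)

early_edit = original.copy()
early_edit[3] = 999                        # change one token in the first block
late_edit = original.copy()
late_edit[60] = 999                        # change one token in the last block

hits_early = reusable_blocks(cached, block_keys(early_edit, 8))
hits_late = reusable_blocks(cached, block_keys(late_edit, 8))
```

In this example `hits_early` is 0 and `hits_late` is 7 out of 8 blocks: the same single-token change is nearly free at the end of the prompt and invalidates everything at the start.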
Imagine you are reading a long story out loud, and after each sentence you need to answer a question about everything you have read so far. Without a KV cache, you would have to re-read the entire story from the beginning every time you finish a new sentence. That would be very slow if the story is hundreds of pages long.
A KV cache is like taking notes as you read. After each sentence, you write down the important parts (the "keys" and "values") on a notepad. When you need to answer a question, you just look at your notes instead of re-reading everything. The more you read, the bigger your notepad gets. If it gets too big, you can shrink it by only keeping recent pages (sliding window), writing in smaller handwriting (quantization), or only keeping the most important pages (heavy hitters).