# KV Cache

> Source: https://aiwiki.ai/wiki/kv_cache
> Updated: 2026-07-11
> Categories: AI Inference, Deep Learning, Machine Learning, Transformer Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

## What is a KV cache?

A KV cache (key-value cache) is a memory optimization technique used during [transformer](/wiki/transformer) inference that stores previously computed key and value tensors from the [attention](/wiki/attention) mechanism so they do not need to be recalculated at each generation step. During autoregressive text generation, a [large language model](/wiki/large_language_model) produces tokens one at a time, and each new token must attend to every previous token in the sequence. Without caching, the model would redundantly recompute the key and value projections for all prior tokens at every step, resulting in computation that grows quadratically with sequence length. By storing and reusing these intermediate results, the KV cache reduces per-step computation from $$O(n)$$ projection operations to $$O(1)$$, where only the new token's key and value vectors need to be computed and appended to the cache.[^1] In short, the KV cache is the single optimization that makes the decode (token-by-token) phase of LLM inference affordable, and its memory footprint, which can run to tens of gigabytes per long-context request, is the binding constraint on how many requests a server can run at once.[^5][^13]

The KV cache is one of the most fundamental optimizations in modern LLM serving. Every major inference framework, including [vLLM](/wiki/vllm), [TensorRT](/wiki/tensorrt)-LLM, Hugging Face [Text Generation Inference](/wiki/huggingface_tgi), and [SGLang](/wiki/sglang), implements KV caching by default.[^2][^3][^4] However, the cache introduces its own challenge: memory consumption. For large models with long context windows, the KV cache can consume tens of gigabytes of GPU memory, often exceeding the memory required by the model weights themselves.[^5] This has motivated an active area of research into KV cache compression, eviction, [quantization](/wiki/quantization), and memory management techniques, as well as commercial features such as [prompt caching](/wiki/prompt_caching) that expose KV cache reuse directly to API users.[^6][^7][^8]

### At a glance

| Property | Detail |
|---|---|
| What it caches | Per-token key (K) and value (V) tensors at every transformer layer |
| Why it exists | Avoids recomputing K and V for all prior tokens at each decode step |
| Compute saving | Projection work per step drops from $$O(n)$$ to $$O(1)$$; end-to-end generation roughly 3-5x faster[^11] |
| Decode bottleneck | Memory bandwidth, not arithmetic: each token reads the whole cache from HBM[^1][^10] |
| Memory growth | Linear in sequence length × batch size × layers × KV heads |
| First formal analysis | Shazeer, "Fast Transformer Decoding" (2019), motivating multi-query attention[^11] |
| System-level framing | Kwon et al., PagedAttention / vLLM (SOSP 2023)[^13] |

## Background: how does autoregressive generation work?

To understand why the KV cache exists, it is necessary to understand how autoregressive [language models](/wiki/language_model) generate text.

A decoder-only transformer (such as [GPT](/wiki/gpt), [LLaMA](/wiki/llama), or [Claude](/wiki/claude)) generates text one token at a time. At each step t, the model takes all tokens produced so far (x_1, x_2, ..., x_t) and predicts the probability distribution over the next token x_{t+1}. This process is inherently sequential: the model cannot produce token t+1 until it has produced token t.

Inside each transformer layer, the [self-attention](/wiki/self_attention) mechanism computes three matrices from the input embeddings:

- **Queries (Q):** what the current token is looking for
- **Keys (K):** what each token offers as a matching signal
- **Values (V):** the content each token contributes to the output

The attention output is computed as:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$

The construction is causal in decoder-only models: each token can only attend to itself and preceding tokens. At generation step t the model computes the query for the new token and must compare it against the keys of all t tokens. The resulting attention weights are then applied to the values of all t tokens.

Without KV caching, the model would recompute the K and V projections for every token at every generation step. At step t, this means computing t key vectors and t value vectors, even though the key and value vectors for tokens 1 through t-1 are identical to what they were at step t-1. This redundant computation is the problem that KV caching solves.[^1]

## How does the KV cache work?

The KV cache operates by storing the key and value tensors computed at each generation step and reusing them in subsequent steps. The process unfolds in two distinct phases.

### Prefill phase

When the user submits a prompt, the model processes all input tokens in parallel during the **prefill phase** (also called the prompt encoding phase). For each transformer layer, the model computes the key and value projections for every token in the prompt and stores them in the cache. This phase is compute-bound because it involves large matrix multiplications across the full prompt length. The prefill phase determines the **time to first token (TTFT)**, the latency the user experiences before seeing any output.[^9]

### Decode phase

After the prefill phase, the model enters the **decode phase**, generating output tokens one at a time. At each step:

1. The model computes Q, K, and V projections for only the new token.
2. The new K and V vectors are appended to the existing cache.
3. The query for the new token attends over all cached keys (including the newly appended one) to produce attention weights.
4. The attention weights are applied to all cached values to produce the output for that position.

This means that at decode step t, only a single token's K and V projections need to be computed, while the K and V projections for all previous tokens are simply read from the cache. The decode phase is **memory-bandwidth-bound** rather than compute-bound: the bottleneck is loading the cached key and value tensors from GPU memory (HBM) into the compute units, not the arithmetic itself.[^1][^10] Shazeer (2019) identified this memory-bandwidth bottleneck as the original motivation for multi-query attention, writing that the dominant cost of incremental decoding is "the memory-bandwidth cost of repeatedly loading the large 'keys' and 'values' tensors" at each step.[^11]
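The four decode steps can be sketched in NumPy for a single layer and a single head. This is a minimal illustration rather than any framework's implementation: the `KVCache` class, the weight matrices `W_q`, `W_k`, `W_v`, and the toy dimensions are all assumptions made for the example, and prefill is simulated token by token for brevity. The loop also checks that cached decoding gives the same output as recomputing K and V from scratch.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Toy single-layer, single-head cache (hypothetical helper)."""
    def __init__(self, d_head):
        self.K = np.empty((0, d_head))
        self.V = np.empty((0, d_head))

    def append(self, k, v):
        # Real systems preallocate memory; vstack copies and is used only for clarity.
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

def decode_step(x_new, W_q, W_k, W_v, cache):
    """One decode step: project only the new token, then attend over the cache."""
    q, k, v = x_new @ W_q, x_new @ W_k, x_new @ W_v   # step 1
    cache.append(k, v)                                # step 2
    scores = cache.K @ q / np.sqrt(q.shape[-1])       # step 3
    return softmax(scores) @ cache.V                  # step 4

def attend_without_cache(X, W_q, W_k, W_v):
    """Reference: re-project K and V for every token, output for the last position."""
    q = X[-1] @ W_q
    K, V = X @ W_k, X @ W_v
    return softmax(K @ q / np.sqrt(q.shape[-1])) @ V

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
X = rng.standard_normal((5, d_model))   # embeddings for 5 tokens

cache = KVCache(d_head)
for t in range(len(X)):
    out_cached = decode_step(X[t], W_q, W_k, W_v, cache)
    out_full = attend_without_cache(X[: t + 1], W_q, W_k, W_v)
    assert np.allclose(out_cached, out_full)
```

The cached path performs one K and one V projection per step, while the reference path performs t of each at step t; the outputs are identical because the earlier projections never change.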

### Step-by-step example

Consider a single-layer, single-head transformer that receives the prompt "The cat sat" and then generates "on" followed by "the":

| Step | Input token | Cache before step (K vectors) | Computation | Cache after step (K vectors) |
|---|---|---|---|---|
| 1 (prefill) | "The", "cat", "sat" | Empty | Compute K, V for all 3 tokens; store in cache | K_the, K_cat, K_sat |
| 2 (decode) | "on" | K_the, K_cat, K_sat | Compute K_on, V_on; append to cache; query attends over all 4 | K_the, K_cat, K_sat, K_on |
| 3 (decode) | "the" | K_the, K_cat, K_sat, K_on | Compute K_the2, V_the2; append; query attends over all 5 | K_the, K_cat, K_sat, K_on, K_the2 |

In a real model, this process happens independently at every layer and every attention head, so the total cache stores K and V tensors for every (layer, head, position) combination.

## History and origins

The idea of caching keys and values during autoregressive decoding predates many of the systems that popularized it. The original Transformer paper by Vaswani et al. (2017) introduced the encoder-decoder architecture with multi-head [self-attention](/wiki/self_attention) but did not single out KV caching as a separate technique: it was implicit in the design of incremental decoding.[^12] Early implementations in Tensor2Tensor, HuggingFace `transformers`, and Fairseq carried cached projections from previous decoding steps between calls (in HuggingFace `transformers`, through the `past_key_values` argument), making the optimization explicit at the API level.

The first paper to formally describe KV caching as the central bottleneck of LLM inference was Shazeer's 2019 work on multi-query attention, which observed that the memory-bandwidth cost of repeatedly loading the large keys and values tensors was the dominant cost of incremental decoding.[^11] Pope et al. (2023) at Google extended this analysis to TPU v4 deployments of PaLM-scale models, providing a closed-form analytical model that accounts for KV cache memory, parameter memory, and attention flops.[^10] The 2023 SOSP paper by Kwon et al. shifted the framing from per-request KV cache to system-level memory management, treating the cache as a shared resource that can be paged, shared, and copied on write.[^13] That paper introduced [vLLM](/wiki/vllm) and established the now-standard block-based approach used by most modern serving stacks.

## Memory analysis

### Per-token memory formula

The KV cache stores two tensors (K and V) at each layer for each token. The memory required per token is:

$$
\text{KV cache per token (bytes)} = 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{bytes\_per\_element}
$$

where:
- 2 accounts for both K and V matrices
- n_layers is the number of transformer layers
- n_kv_heads is the number of key-value heads (equals n_heads for standard [multi-head attention](/wiki/multi-head_self-attention), fewer for [grouped-query attention](/wiki/grouped-query_attention))
- d_head is the dimension of each attention head
- bytes_per_element depends on the data type (2 for FP16/BF16, 4 for FP32, 1 for INT8)

Under standard multi-head attention, where $$n_{\text{kv\_heads}} \times d_{\text{head}}$$ equals the model's hidden dimension $$d_{\text{model}}$$, this simplifies to the following (the simplification does not hold for models with fewer KV heads than query heads):

$$
\text{KV cache per token (bytes)} = 2 \times n_{\text{layers}} \times d_{\text{model}} \times \text{bytes\_per\_element}
$$

### Total memory formula

The total KV cache memory for a batch of sequences is:

$$
\text{Total KV cache (bytes)} = \text{batch\_size} \times \text{seq\_length} \times 2 \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{bytes\_per\_element}
$$
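The formula translates directly into a small helper. The sketch below is illustrative only: the function name `kv_cache_bytes` and its parameter names are chosen for this example, and the sizes are reported in binary units (1 MiB = 2^20 bytes, 1 GiB = 2^30 bytes).

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head,
                   seq_len=1, batch_size=1, bytes_per_element=2):
    """Total KV cache size in bytes; the factor 2 covers both K and V."""
    return (batch_size * seq_len * 2 * n_layers
            * n_kv_heads * d_head * bytes_per_element)

MiB, GiB = 2**20, 2**30

# A model with 32 layers, 32 KV heads, and head dimension 128 in FP16:
per_token = kv_cache_bytes(32, 32, 128)                 # 524,288 bytes = 0.5 MiB
at_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)       # 2 GiB

# Cutting the KV heads from 32 to 8 shrinks the cache by 4x:
per_token_gqa = kv_cache_bytes(32, 8, 128)              # 131,072 bytes = 0.125 MiB
```

Because every factor enters multiplicatively, halving any one of them (for example, switching from FP16 to an 8-bit format) halves the total.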

### Example calculations

The following table shows approximate KV cache sizes for popular models processing a single sequence of 4,096 tokens in FP16 (2 bytes per element):

| Model | Parameters | Layers | Heads (Q/KV) | d_head | d_model | KV cache per token | KV cache (4K tokens) |
|---|---|---|---|---|---|---|---|
| [LLaMA 2](/wiki/llama) 7B | 7B | 32 | 32/32 | 128 | 4,096 | 0.5 MB | 2.0 GB |
| [LLaMA 2](/wiki/llama) 13B | 13B | 40 | 40/40 | 128 | 5,120 | 0.78 MB | 3.1 GB |
| [LLaMA 2](/wiki/llama) 70B | 70B | 80 | 64/8 | 128 | 8,192 | 0.3 MB | 1.25 GB |
| [Llama 3](/wiki/llama_3) 70B | 70B | 80 | 64/8 | 128 | 8,192 | 0.3 MB | 1.25 GB |
| [Mistral](/wiki/mistral) 7B | 7B | 32 | 32/8 | 128 | 4,096 | 0.125 MB | 0.5 GB |
| [GPT-3](/wiki/gpt3) 175B | 175B | 96 | 96/96 | 128 | 12,288 | 4.5 MB | 18 GB |

Note that LLaMA 2 70B uses [grouped-query attention](/wiki/grouped-query_attention) with only 8 KV heads (instead of 64), reducing its KV cache by 8x compared to what it would be with standard multi-head attention.[^14][^45] Llama 3 and Llama 3.1 70B inherit the same shape (80 layers, 8 KV heads, head dimension 128) and reach about 2.5 GB at 8K tokens, 10 GB at 32K tokens, and roughly 40 GB at 128K tokens for a single sequence in FP16.[^15][^16] Mistral 7B uses 8 KV heads instead of 32, achieving a 4x reduction.[^17]

### Llama 3.1 405B numerical example

Llama 3.1 405B has 126 transformer layers, 128 query heads, 8 KV heads, head dimension 128, and a 128K-token context window.[^15][^16] In FP16 the per-token KV cache footprint is $$2 \times 126 \times 8 \times 128 \times 2 = 516{,}096$$ bytes (about 0.49 MB). For a single 128K-token sequence, the cache alone occupies roughly 64 GB in FP16; FP8 halves that to about 32 GB.[^18] A batch of 32 such sequences in FP16 requires approximately 2 TB of KV cache memory, which is why large-context production deployments rely on quantization, GQA, and offloading.

### Memory scaling behavior

KV cache memory grows linearly along three axes:

- **Sequence length:** Doubling the context window doubles the cache. A model with a 128K context window needs 32x more cache memory than the same model with a 4K window.
- **Batch size:** Serving 64 concurrent requests requires 64x the cache of a single request.
- **Model depth:** Deeper models (more layers) require proportionally more cache.

For production serving systems, the KV cache is often the dominant consumer of GPU memory. In some configurations, the KV cache uses more memory than the model weights themselves, particularly with large batch sizes or long context lengths.[^13]

## Why does KV caching matter for performance?

KV caching provides two distinct benefits.

### Computational savings

Without KV caching, generating a sequence of length T requires $$1 + 2 + \cdots + T = T(T+1)/2 \approx T^2/2$$ total key-value projection operations across all steps, because step t re-projects all t tokens. With KV caching, only T total key-value projection operations are needed (one per step). This changes the computational complexity of the projection step from $$O(T^2)$$ to $$O(T)$$. In practice, this translates to a 3-5x speedup in end-to-end generation time, depending on model size and hardware.[^11]
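The two operation counts can be compared with a few lines of Python. This is a counting sketch only: the function names are made up for the example, and it counts per-token K/V projections, not the attention-score computation, which still scans all cached tokens at every step.

```python
def kv_projections_without_cache(T):
    # Step t re-projects all t tokens seen so far.
    return sum(range(1, T + 1))

def kv_projections_with_cache(T):
    # Step t projects only the newest token.
    return T

T = 1000
without = kv_projections_without_cache(T)   # 500,500
with_cache = kv_projections_with_cache(T)   # 1,000
ratio = without / with_cache                # 500.5
```

The ratio grows linearly with T, so the saving in projection work is larger for longer generations.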

### Memory bandwidth bottleneck

During the decode phase, the primary bottleneck is not computation but memory bandwidth. Each decode step requires reading the entire KV cache from GPU HBM to the compute units. For a model with a 2 GB KV cache, every single token generation requires reading 2 GB of data from memory. On an [NVIDIA A100](/wiki/nvidia_a100) GPU with 2 TB/s memory bandwidth, this means each token takes at minimum 1 ms just for the memory read, regardless of how fast the compute is.[^11][^19] On an [NVIDIA H100](/wiki/nvidia_h100) with about 3.35 TB/s of HBM3 bandwidth the floor is lower, but the cache reads remain the dominant cost during decode for any model that fits in HBM.[^20] This is why the decode phase is described as memory-bandwidth-bound.
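The latency floor follows from dividing the cache size by the memory bandwidth. The helper below is a back-of-the-envelope sketch with a hypothetical name; it ignores the model weights, which must also be read at every step, so real per-token latency is higher.

```python
def min_cache_read_ms(cache_bytes, bandwidth_bytes_per_s):
    """Lower bound on per-token latency from streaming the KV cache once."""
    return cache_bytes / bandwidth_bytes_per_s * 1e3

a100_ms = min_cache_read_ms(2e9, 2.0e12)    # 2 GB cache at 2 TB/s: about 1 ms
h100_ms = min_cache_read_ms(2e9, 3.35e12)   # same cache at 3.35 TB/s: about 0.6 ms
```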

## KV cache optimization techniques

The tension between needing the KV cache for speed and its large memory footprint has motivated numerous optimization techniques. The most common approaches fall into four categories: attention-head sharing (MQA, GQA, MLA), eviction or selection (sliding window, H2O, StreamingLLM, SnapKV, FastGen), quantization (INT8, KIVI, KVQuant), and system-level memory management (PagedAttention, vAttention, RadixAttention, prefix caching).

### Multi-query attention (MQA)

[Multi-query attention](/wiki/grouped_query_attention), proposed by Noam Shazeer in 2019, reduces the KV cache by sharing a single set of key and value projections across all query heads.[^11] In standard multi-head attention with h heads, each head has its own K and V projections, resulting in h sets of key-value pairs per layer. As the paper puts it, in the multi-query variant "the keys and values are shared across all of the different attention 'heads'," reducing the KV cache by a factor of h.[^11]

For a model with 32 attention heads, MQA reduces the KV cache by 32x. The trade-off is a small degradation in model quality, since all query heads must work with the same key and value representations. MQA was adopted by [PaLM](/wiki/palm) (Google, 2022) and Falcon (TII, 2023).

### Grouped-query attention (GQA)

[Grouped-query attention](/wiki/grouped-query_attention), introduced by Ainslie, Lee-Thorp, de Jong, Zemlyanskiy, Lebron, and Sanghai in 2023, is a compromise between standard multi-head attention and MQA.[^14] The paper describes GQA as "a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads."[^14] Instead of one shared KV head (MQA) or h independent KV heads (MHA), GQA uses g groups of KV heads, where $$1 < g < h$$. Each group of $$h/g$$ query heads shares one set of key-value projections.

| Attention variant | Query heads | KV heads | KV cache reduction factor |
|---|---|---|---|
| Multi-head attention (MHA) | $$h$$ | $$h$$ | 1x (baseline) |
| Grouped-query attention (GQA) | $$h$$ | $$g$$ | $$h/g$$ |
| Multi-query attention (MQA) | $$h$$ | 1 | $$h$$ |

GQA has become the dominant attention variant in modern LLMs. [LLaMA 2](/wiki/llama) 70B uses 8 KV groups (8x reduction), [Mistral 7B](/wiki/mistral_7b) uses 8 KV heads with 32 query heads (4x reduction), and models in the [Gemma](/wiki/gemma), [Qwen](/wiki/qwen), and Llama 3 families all use GQA.[^14][^15][^17] The Ainslie et al. paper showed that models originally trained with MHA can be "uptrained" with GQA "using 5% of original pre-training compute," achieving quality close to MHA while providing inference efficiency closer to MQA.[^14]
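The head-sharing scheme can be illustrated with a short NumPy sketch for a single decode step. The function `gqa_attention` and the toy shapes are assumptions for this example and do not reproduce any specific model's code; the point is that the cache holds only g sets of keys and values while all h query heads are still evaluated.

```python
import numpy as np

def gqa_attention(q, K, V):
    """q: (h, d) queries for one new token; K, V: (g, n, d) cached KV heads."""
    h, d = q.shape
    g = K.shape[0]
    assert h % g == 0, "query heads must divide evenly into KV groups"
    group = h // g
    out = np.empty_like(q)
    for head in range(h):
        kv = head // group                      # query head -> shared KV head
        scores = K[kv] @ q[head] / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[head] = w @ V[kv]
    return out

rng = np.random.default_rng(0)
h, g, n, d = 8, 2, 5, 4                         # 8 query heads, 2 KV heads
q = rng.standard_normal((h, d))
K = rng.standard_normal((g, n, d))
V = rng.standard_normal((g, n, d))
out = gqa_attention(q, K, V)                    # shape (8, 4)

cached_elements = K.size + V.size               # 2 * g * n * d
mha_elements = 2 * h * n * d                    # what MHA would cache
```

Setting g equal to h recovers standard multi-head attention, and setting g to 1 recovers multi-query attention.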

### Multi-head latent attention (MLA)

[Multi-head latent attention](/wiki/multi-head_latent_attention), introduced in [DeepSeek-V2](/wiki/deepseek) (2024), takes a fundamentally different approach to KV cache compression.[^21] The DeepSeek-V2 paper states that "MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector."[^21] Rather than reducing the number of KV heads, MLA compresses the key and value representations into a low-dimensional latent vector before storing them in the cache. At inference time, the compressed latent is projected back to produce full-dimensional keys and values for each head.

Concretely, MLA replaces the standard $$W_{KV}$$ projection with a low-rank factorization: the input is first projected down to a small latent vector c (the "compressed KV"), and only c is cached. When attention needs to be computed, c is projected back up to produce the full K and V tensors. In DeepSeek-V2 the latent dimension is set to 512, while the model has 128 heads with head dimension 128, so uncompressed MHA would cache $$2 \times 128 \times 128 = 32{,}768$$ elements per token per layer. Caching the 512-element latent instead is a 64x compression, or roughly 57x once the additional 64 decoupled RoPE channels are counted.[^21]

DeepSeek-V2 reported a 93.3% reduction in KV cache size and a 5.76x increase in maximum generation throughput compared to DeepSeek 67B.[^21] Ablation studies in the paper reported MLA performing slightly better than full MHA, while GQA and MQA trailed MHA, making MLA a quality-preserving approach despite the aggressive compression. MLA has since been adopted by [DeepSeek V3](/wiki/deepseek_v3), [DeepSeek-R1](/wiki/deepseek_r1), [Kimi K2](/wiki/kimi_k2), and several other models.[^22][^23] The DeepSeek team also developed a "weight absorption" trick that avoids materializing the full-rank K and V tensors at inference time by folding the up-projection into the score computation.[^21]
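The down-projection and up-projection can be sketched in NumPy. This is a simplified illustration under stated assumptions: the matrix names (`W_down`, `W_up_k`, `W_up_v`) and the toy dimensions are invented for the example, and the sketch omits the decoupled RoPE channels and the weight-absorption trick.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 64, 4, 16, 8   # toy sizes for illustration

W_down = rng.standard_normal((d_model, d_latent))            # compress
W_up_k = rng.standard_normal((d_latent, n_heads * d_head))   # expand to keys
W_up_v = rng.standard_normal((d_latent, n_heads * d_head))   # expand to values

X = rng.standard_normal((10, d_model))   # hidden states for 10 tokens
latent_cache = X @ W_down                # (10, d_latent): the only tensor stored

# At attention time, full per-head keys and values are rebuilt from the latent.
K = (latent_cache @ W_up_k).reshape(10, n_heads, d_head)
V = (latent_cache @ W_up_v).reshape(10, n_heads, d_head)

mha_elements_per_token = 2 * n_heads * d_head   # K and V for every head
mla_elements_per_token = d_latent               # one shared latent vector
compression = mha_elements_per_token / mla_elements_per_token
```

With these toy sizes the cache stores 8 values per token instead of 128, a 16x reduction, at the cost of extra matrix multiplications when the keys and values are rebuilt.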

### Comparison of attention variants

The following table summarizes the four main attention designs with respect to KV cache footprint, quality, and adoption. Cache reduction is relative to vanilla MHA at the same hidden size.

| Variant | Year | Cache reduction (vs MHA) | Quality | Production adopters |
|---|---|---|---|---|
| Multi-head attention (MHA) | 2017 | 1x | Baseline | Original Transformer, GPT-2/3, LLaMA 1 |
| Multi-query attention (MQA) | 2019 | h x (e.g., 32x) | Slight degradation | PaLM, Falcon |
| Grouped-query attention (GQA) | 2023 | h/g x (e.g., 4 to 8x) | Near-MHA | Llama 2/3, Mistral, Mixtral, Gemma, Qwen, Phi-3 |
| Multi-head latent attention (MLA) | 2024 | ~57x in DeepSeek-V2 | Slightly better than MHA in ablations | DeepSeek-V2/V3/R1, Kimi K2 |

### Sliding window attention

[Sliding window attention](/wiki/sliding_window_attention) limits each token's attention to a fixed window of w preceding tokens instead of the full sequence. This allows the KV cache to be bounded at a fixed size regardless of how long the generated sequence becomes.

[Mistral 7B](/wiki/mistral_7b) (Mistral AI, 2023) popularized this approach with a window size of $$w = 4{,}096$$ while supporting a context length of 8,192 tokens.[^17] The implementation uses a **rolling buffer cache**: a fixed-size cache of w entries where older entries are overwritten in a circular fashion as new tokens are generated. Concretely, at time step i the key and value for that token are written to position $$i \bmod w$$ in the buffer. This means the cache never grows beyond w entries, providing predictable and bounded memory usage.[^17]
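The modular write rule can be sketched as a small ring buffer. The class below is an illustrative toy, not Mistral's implementation: `RollingKVCache` and its methods are hypothetical names, and each stored entry stands in for one token's key-value pair.

```python
class RollingKVCache:
    """Fixed-size ring buffer holding the most recent `window` entries."""

    def __init__(self, window):
        self.window = window
        self.slots = [None] * window
        self.count = 0                      # total tokens seen so far

    def append(self, kv):
        # Token i overwrites slot i mod window.
        self.slots[self.count % self.window] = kv
        self.count += 1

    def contents(self):
        """Cached entries in chronological order."""
        if self.count <= self.window:
            return self.slots[: self.count]
        start = self.count % self.window    # oldest surviving entry
        return self.slots[start:] + self.slots[:start]

cache = RollingKVCache(window=4)
for token_index in range(6):
    cache.append(token_index)               # store the index as a stand-in

recent = cache.contents()                   # [2, 3, 4, 5]
```

After six tokens with a window of four, tokens 0 and 1 have been overwritten and memory use has stayed constant.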

A key insight is that information from tokens beyond the window is not entirely lost. Because each transformer layer applies sliding window attention, a token at layer k can indirectly access information from tokens up to $$k \times w$$ positions away through the cascading effect of intermediate representations. With 32 layers and $$w = 4{,}096$$, Mistral's theoretical attention span reaches approximately 131,072 tokens.[^17]

Combined with GQA (8 KV heads instead of 32, a 4x reduction), the 4,096-entry rolling buffer halves the cache at the full 8,192-token context, so Mistral 7B achieves a combined 8x reduction in peak KV cache memory compared to a standard MHA model with the same hidden dimension and full-length caching.

### PagedAttention and vLLM

PagedAttention, introduced by Kwon, Li, Zhuang, Sheng, Zheng, Yu, Gonzalez, Zhang, and Stoica at SOSP 2023, applies ideas from operating system virtual memory management to the KV cache.[^13] The core problem it addresses is memory fragmentation: standard inference systems allocate contiguous GPU memory for the maximum possible sequence length for each request, even though most sequences do not use the full allocation. According to the paper, this leads to 60-80% of allocated KV cache memory being wasted by internal and external fragmentation in baseline systems such as FasterTransformer and Orca.[^13]

PagedAttention solves this by dividing the KV cache into fixed-size **blocks** (typically 16 tokens per block) that can be stored anywhere in GPU memory, similar to how an OS manages virtual memory pages:

1. **Logical blocks** represent a contiguous view of the KV cache for each sequence.
2. **Physical blocks** are the actual GPU memory allocations, which may be non-contiguous.
3. A **block table** maps logical blocks to physical blocks, analogous to a page table in virtual memory.

As a sequence grows, new physical blocks are allocated on demand. When a sequence finishes, its blocks are freed and can be reused by other sequences. The only wasted memory is in the last partially filled block of each sequence (at most block_size - 1 tokens).
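The block-table bookkeeping can be sketched in plain Python. This is a simplified model of the idea, not vLLM's code: `BlockAllocator` and its methods are hypothetical, only block IDs are tracked (no tensors), and running out of blocks simply raises an error where a real scheduler would preempt or swap.

```python
class BlockAllocator:
    """Maps each sequence's logical blocks to physical block IDs on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free physical blocks
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # no block yet, or last one is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        # Physical location of the new token: (block id, offset within block).
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")                   # 17 tokens need 2 blocks
blocks_used = len(alloc.tables["a"])
wasted_slots = blocks_used * alloc.block_size - alloc.lengths["a"]
```

Here the second block holds a single token, so 15 slots are unused, which matches the stated worst case of block_size - 1 wasted tokens per sequence.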

PagedAttention also enables **memory sharing** between sequences. If two requests share the same prompt prefix (common in chat applications with system prompts) or if a single request uses parallel sampling (multiple completions for one prompt), their block tables can point to the same physical blocks for the shared portion, using copy-on-write semantics. This further reduces memory usage.[^13]

[vLLM](/wiki/vllm), the open-source [inference](/wiki/inference) engine built around PagedAttention, reduces KV cache memory waste to under 4% and, in the words of the paper, "improves the throughput of popular LLMs by 2-4x with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca."[^13] Subsequent vLLM releases, including the V1 engine that became the default in mid-2025, further reduced scheduling overhead at high concurrency and added speculative decoding integrations.[^2] By 2025, vLLM was reported as the serving engine behind production deployments at Meta, IBM, Cohere, and many third-party LLM providers.[^2]

### vAttention and CUDA virtual memory

vAttention (Prabhu, Nayak, Mohan, Ramjee, and Panwar, Microsoft Research, 2024) argues that the non-contiguous block layout of PagedAttention complicates attention kernel implementation and reduces portability.[^24] The system decouples virtual and physical memory using CUDA's `cuMemMap` APIs: each sequence is given a contiguous virtual address range, and physical pages are mapped on demand. Attention kernels see the cache as a contiguous tensor, allowing unmodified FlashAttention and FlashInfer kernels to be used. The authors report higher throughput than PagedAttention-based kernels on several workloads.[^24]

### Continuous batching and Orca

Most modern KV cache optimizations are paired with [continuous batching](/wiki/continuous_batching) (also called iteration-level scheduling or in-flight batching), introduced by Yu et al. at OSDI 2022 in the Orca paper.[^25] Continuous batching schedules at the granularity of a single decode iteration: when one request finishes, its KV cache slot is reclaimed and a new request joins the batch immediately. Orca reported up to 36.9x throughput improvement over FasterTransformer at equivalent latency targets.[^25] vLLM, TGI, TensorRT-LLM, and SGLang all build on this scheduling model.

### KV cache quantization

KV cache [quantization](/wiki/quantization) reduces memory consumption by storing cached keys and values in lower numerical precision. Keys and values exhibit different statistical properties than weights (often with more outlier channels), so dedicated KV-cache quantization schemes are necessary.

| Method | Year | Precision | Compression ratio | Key technique |
|---|---|---|---|---|
| Standard FP16 cache | 2017 | 16-bit | 1x (baseline) | No compression |
| INT8 KV cache | various | 8-bit | 2x | Per-tensor or per-channel quantization |
| KIVI (ICML 2024) | 2024 | 2-bit | ~8x | Per-channel key quantization, per-token value quantization |
| KVQuant (NeurIPS 2024) | 2024 | 2 to 4 bit | 4 to 8x | Per-channel key quantization before RoPE; non-uniform quantization; dense-and-sparse |
| Coupled Quantization | 2024 | 1 to 2 bit | 8 to 16x | Exploits interdependence between channels |
| FP8 KV cache (TRT-LLM, vLLM) | 2024+ | 8-bit | 2x | Hardware-supported FP8 quantization on Hopper/Blackwell |

KVQuant (Hooper et al., NeurIPS 2024) combines per-channel key quantization, pre-RoPE key quantization, non-uniform per-layer datatypes, and per-vector dense-and-sparse separation of outliers.[^26] The paper reports less than 0.1 perplexity degradation at 3-bit precision on Wikitext-2 and C4 for LLaMA, Llama-2, Llama-3, and Mistral, and demonstrates serving LLaMA-7B at up to 1 million tokens of context on a single A100-80GB and 10 million tokens on an 8-GPU system, with custom CUDA kernels yielding up to 1.7x speedups.[^26]

KIVI (Liu et al., ICML 2024) demonstrates that asymmetric quantization is necessary because keys and values have different distributions: keys should be quantized per channel while values should be quantized per token.[^27] With 2-bit quantization, KIVI enables Llama, Falcon, and Mistral models to maintain near-baseline quality while using 2.6x less peak memory and supporting up to 4x larger batch sizes, yielding 2.35x to 3.47x throughput on real workloads.[^27]

Quantization is orthogonal to other KV cache reduction techniques (GQA, MLA, etc.) and can be combined for multiplicative savings: GQA (4x) combined with INT4 quantization (4x) yields 16x total reduction.
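The difference between per-channel and per-token quantization comes down to which axis the scale factors are computed over. The NumPy sketch below uses simple symmetric INT8 quantization to show both; it is not the KIVI or KVQuant algorithm (those use lower bit widths, grouping, and outlier handling), and the function names are chosen for this example.

```python
import numpy as np

def quantize_int8(x, axis):
    """Symmetric INT8 quantization with one scale per slice reduced over `axis`."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)          # avoid division by zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
K = rng.standard_normal((128, 64)).astype(np.float32)   # (tokens, channels)
V = rng.standard_normal((128, 64)).astype(np.float32)

K_q, K_scale = quantize_int8(K, axis=0)   # per channel: one scale per column
V_q, V_scale = quantize_int8(V, axis=1)   # per token: one scale per row

K_err = np.abs(dequantize(K_q, K_scale) - K).max()
V_err = np.abs(dequantize(V_q, V_scale) - V).max()
```

Each INT8 element occupies one byte, half the size of an FP16 element, and the rounding error of any entry is bounded by half of its scale factor.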

### Token eviction and cache compression

Token eviction methods reduce the KV cache by selectively removing cached entries for tokens deemed less important. The challenge is identifying which tokens can be safely evicted without degrading generation quality.

**H2O (Heavy-Hitter Oracle).** Zhang et al. (NeurIPS 2023) observed that a small fraction of tokens accumulate disproportionately high attention scores across generation steps. H2O maintains a cache of both recent tokens (a sliding window) and identified heavy hitters, evicting tokens that are neither recent nor heavily attended. The authors formalize the eviction problem as a dynamic submodular optimization. With only 20% of tokens retained, H2O improves throughput by up to 29x over DeepSpeed Zero-Inference and HuggingFace Accelerate, and 3x over FlexGen on OPT-6.7B and OPT-30B.[^28]
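The retention rule (recent tokens plus heavy hitters) can be sketched as a one-shot selection over accumulated attention scores. This is a simplification for illustration, not the paper's algorithm, which evicts greedily at each decode step; the function name and the example scores are invented.

```python
def h2o_keep_indices(acc_scores, n_recent, n_heavy):
    """Keep the last `n_recent` positions plus the `n_heavy` highest-scoring older ones."""
    n = len(acc_scores)
    recent = set(range(max(0, n - n_recent), n))
    older = [i for i in range(n) if i not in recent]
    heavy = sorted(older, key=lambda i: acc_scores[i], reverse=True)[:n_heavy]
    return sorted(recent | set(heavy))

# Accumulated attention received by each of 8 cached positions (made-up values).
acc_scores = [5.0, 0.1, 3.0, 0.2, 0.3, 0.1, 0.4, 0.2]
kept = h2o_keep_indices(acc_scores, n_recent=2, n_heavy=2)   # [0, 2, 6, 7]
```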

**StreamingLLM.** Xiao, Tian, Chen, Han, and Lewis (ICLR 2024) discovered that the first few tokens in a sequence ("attention sinks") receive anomalously high attention scores regardless of their semantic content.[^29] This happens because [softmax](/wiki/softmax) attention weights must sum to 1, and the model learns to dump excess attention onto initial tokens. StreamingLLM preserves these initial attention sink tokens plus a sliding window of recent tokens, enabling stable generation up to 4 million tokens with a fixed-size cache. Compared to a sliding window with recomputation baseline, StreamingLLM achieves up to 22.2x speedup, and the method has been integrated into NVIDIA TRT-LLM and on-device stacks.[^29]
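The retention pattern of attention sinks plus a recent window reduces to a simple index selection. The sketch below shows only which positions stay in the cache; the function name and default sizes are assumptions for the example, and it leaves out the way positions are assigned relative to the cache rather than the original text.

```python
def streaming_keep_indices(n_tokens, n_sink=4, window=8):
    """Positions retained: the first `n_sink` tokens plus the last `window` tokens."""
    sinks = range(min(n_sink, n_tokens))
    recent = range(max(n_sink, n_tokens - window), n_tokens)
    return list(sinks) + list(recent)

kept = streaming_keep_indices(20)   # [0, 1, 2, 3, 12, 13, ..., 19]
cache_size = len(kept)              # 12, regardless of how large n_tokens grows
```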

**SnapKV.** Li et al. (NeurIPS 2024) observed that each attention head consistently focuses on specific prompt features and that this pattern can be predicted from a small observation window at the end of the prompt.[^30] SnapKV uses a clustering pooling mechanism to select and compress the most important attention features, achieving a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency at 16K input tokens, and passing a 380,000-token Needle-in-a-Haystack pressure test.[^30]

**FastGen.** Ge, Zhang, Liu, and collaborators (2023) showed that different attention heads have different preferred eviction strategies: some heads focus on local context, some on special tokens, and some attend broadly.[^31] FastGen profiles each head briefly and then uses a different eviction policy per head, requiring no retraining.

**MiniCache.** Liu et al. (2024) compress the KV cache along the depth dimension by merging similar key-value states between adjacent middle-to-deep layers.[^32] The states are decomposed into magnitude and direction, with directions interpolated and a token retention strategy preserving highly distinct pairs. On ShareGPT, LLaMA-2-7B with 4-bit MiniCache reaches up to 5.02x compression ratio and 41% memory reduction versus the FP16 full cache baseline.[^32]

Adaptive policies (TOVA, Scissorhands, KVCompose, and others) further refine these ideas by deciding per token, per head, or per layer how aggressively to compress.

### Disaggregated prefill and decode

Because prefill is compute-bound while decode is memory-bandwidth-bound, some serving systems separate these phases onto different hardware. Prefill runs on GPUs optimized for throughput, decode on GPUs optimized for memory bandwidth, and the KV cache computed during prefill is transferred to the decode GPU.

Splitwise (Microsoft Research) and DistServe (Zhong et al., 2024) were among the first systems to fully disaggregate prefill and decode. DistServe reports up to 4.48x goodput or 10.2x tighter SLOs compared to colocated baselines.[^33] Mooncake, the KVCache-centric architecture behind Moonshot AI's [Kimi](/wiki/kimi) service, treats KVCache as a first-class scheduling entity stored across GPU, DRAM, and SSD tiers, and reports handling 75% more requests under real workloads.[^34] By 2025 most production-grade serving frameworks (NVIDIA Dynamo, llm-d, Ray Serve LLM, SGLang, vLLM, LMCache, Mooncake) had adopted some form of prefill-decode disaggregation.[^34]

### Hierarchical and offloaded KV caches

When the KV cache no longer fits in GPU HBM, modern serving systems spill it down a memory hierarchy: HBM, host DRAM, local NVMe SSD, then network-attached storage. SGLang HiCache organizes these tiers as L1/L2/L3 layers of an extended RadixAttention tree with asynchronous prefetching.[^4] LMCache, an open-source KV cache layer for enterprise-scale LLM inference, stores cache blocks across HBM, CPU memory, local disk, remote disk, and Redis, transferring over Ethernet, RDMA, or NVLink, and integrates with vLLM for cross-instance KV cache reuse.[^35] CacheBlend (Yao et al., EuroSys 2025) allows reuse of KV caches for arbitrary chunks of a RAG context (not only prefixes), selectively recomputing the KV cache of a small set of critical tokens to preserve quality and reducing TTFT by 3x; paired with vLLM it speeds up RAG by 4.5x.[^36] KVSwap (2025) targets on-device LLMs by aggressively offloading KV cache to disk, using compact in-memory metadata to predict which entries to preload, and reports up to 1.8x throughput on NVMe and 4.1x on eMMC at 32K context length.[^37]

### Prefix caching and RadixAttention

Prefix caching detects that a new request shares a prompt prefix with a previous request and reuses the KV blocks already computed for the shared portion. vLLM implements automatic prefix caching with block-level hashing on top of PagedAttention.[^2] SGLang's RadixAttention organizes the KV cache as a radix tree where each node is the cache of a consecutive span of tokens; common prefixes share nodes, and an LRU policy evicts unused branches.[^4] RadixAttention works at token-level granularity (rather than fixed block boundaries) and is particularly effective for multi-turn dialogues, agent workflows, and structured generation. TensorRT-LLM provides similar functionality through its KV cache reuse API with a priority-based eviction policy.[^3]
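
Block-level prefix matching can be sketched with chained hashes, where each block's hash also covers every block before it. The sketch below is illustrative only: `PrefixCache`, `block_hashes`, and the 4-token block size are hypothetical, and the hashing scheme is an assumption in the spirit of block-level hashing rather than vLLM's actual implementation.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative; real engines use larger blocks)


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one hash per full block; each hash also covers all earlier blocks."""
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> handle of the stored KV block (an int here)

    def insert(self, token_ids):
        for h in block_hashes(token_ids):
            self.blocks.setdefault(h, len(self.blocks))

    def cached_prefix_len(self, token_ids):
        """Number of leading tokens whose KV blocks can be reused."""
        matched = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break
            matched += BLOCK_SIZE
        return matched


cache = PrefixCache()
system_prompt = list(range(100, 109))            # 9 shared tokens
cache.insert(system_prompt + [1, 2, 3])          # first request
reused = cache.cached_prefix_len(system_prompt + [7, 8, 9, 10])
print(reused)  # 8: only the two full blocks inside the 9-token shared prefix match
```

The ninth shared token is recomputed because it falls in a block whose remaining tokens differ. A token-level radix tree avoids that rounding, at the cost of a more complex data structure.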

### What is prompt caching, and how does it expose the KV cache to API users?

In 2024 the major frontier-model API providers exposed KV cache reuse to API users as a billable feature, typically under the name "prompt caching" or "context caching." This is the user-visible product surface of prefix caching.

[Anthropic](/wiki/anthropic) launched [prompt caching](/wiki/prompt_caching) in public beta on August 14, 2024 for [Claude](/wiki/claude) 3.5 Sonnet and Claude 3 Haiku, with Claude 3 Opus support announced as coming soon, and made the feature generally available on December 17, 2024.[^6][^38] The feature uses `cache_control` markers to designate the portion of the prompt that should be cached. Cache writes are charged at 1.25x the base input rate for a 5-minute TTL and 2x for a 1-hour extended TTL, while cache reads are charged at 0.1x the base input rate. Anthropic reports up to 90% cost reduction and up to 85% latency reduction on long prompts.[^6]
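
The effect of these multipliers on a repeated long prompt can be worked through with a small calculation. This is a rough sketch under stated assumptions: the function `prompt_cost`, the $3.00 per million tokens base price, and the token counts are hypothetical, every call is assumed to arrive within the 5-minute TTL, and output tokens are ignored.

```python
def prompt_cost(prefix_tokens, suffix_tokens, calls, base_price_per_mtok,
                write_mult=1.25, read_mult=0.10):
    """Return (uncached_cost, cached_cost) for `calls` requests sharing one prefix."""
    per_token = base_price_per_mtok / 1_000_000
    uncached = calls * (prefix_tokens + suffix_tokens) * per_token
    # The first call writes the prefix to the cache; later calls read it.
    cached = (prefix_tokens * write_mult
              + (calls - 1) * prefix_tokens * read_mult
              + calls * suffix_tokens) * per_token
    return uncached, cached


# Hypothetical workload: a 100,000-token document queried 10 times.
uncached, cached = prompt_cost(prefix_tokens=100_000, suffix_tokens=500,
                               calls=10, base_price_per_mtok=3.00)
savings = 1 - cached / uncached   # roughly 0.78 under these assumptions
```

With a single call the cached variant costs more than the uncached one, because the write premium is never recovered by a later read.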

OpenAI introduced Prompt Caching on October 1, 2024, automatically applying it to GPT-4o, GPT-4o mini, o1-preview, and o1-mini.[^7] The implementation requires no API changes, applies to prompts longer than 1,024 tokens, caches in 128-token increments, and discounts cached input tokens by 50% (later increased to 90% for some models).[^7] Caches are cleared after 5 to 10 minutes of inactivity and always within one hour.

Google's Gemini API offered explicit context caching from May 2024 and added implicit caching for Gemini 2.5 models in 2025. On Gemini 2.5, cache reads are billed at 10% of the input price, plus a per-hour storage charge (about $4.50 per million tokens per hour for Pro, $1.00 for Flash).[^8] The cache-read discount grew from 75% on Gemini 2.0 to 90% on Gemini 2.5 and later models.[^8]

## FlashAttention and the prefill/decode kernels

A complementary line of work optimizes the attention computation itself rather than the cache layout. [FlashAttention](/wiki/flash_attention) (Dao, Fu, Ermon, Rudra, and Ré, NeurIPS 2022) is an IO-aware exact attention algorithm that tiles the attention computation over GPU on-chip SRAM, avoiding materialization of the full $$T \times T$$ attention matrix in HBM.[^39] FlashAttention-2 (Dao, 2023) improved work partitioning, achieving roughly 2x speedup over FlashAttention with up to 225 TFLOPs/s and 72% model FLOPs utilization on A100.[^40] [FlashAttention-3](/wiki/flash_attention_3) (Shah, Bikshandi, Zhang, Thakkar, Ramani, and Dao, NeurIPS 2024) exploits Hopper-specific features through warp specialization, interleaved softmax and matmul, and block-quantized FP8 with incoherent processing.[^41] It reports about 1.5 to 2.0x speedup over FlashAttention-2 in FP16, reaching up to 740 TFLOPs/s (75% utilization on H100), and 1.2 PFLOPs/s in FP8 with 2.6x lower numerical error than a baseline FP8 attention.[^41] FlashAttention-style kernels are a common attention backend in vLLM, SGLang, and TGI for both prefill and paged decode.
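
The idea that lets attention be tiled without materializing all scores is an online softmax: a running maximum, running normalizer, and running output are rescaled as each tile of keys and values arrives. The sketch below shows only that arithmetic for a single query in plain Python; the function names are hypothetical, and it omits the SRAM blocking, query tiling, and GPU specifics that make the real kernels fast.

```python
import math


def _scores(q, keys):
    scale = 1.0 / math.sqrt(len(q))
    return [scale * sum(a * b for a, b in zip(q, k)) for k in keys]


def naive_attention(q, keys, values):
    """Reference: compute every score, then softmax and weight the values."""
    scores = _scores(q, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])                   # unnormalized output
    for start in range(0, len(keys), tile):
        v_tile = values[start:start + tile]
        scores = _scores(q, keys[start:start + tile])
        new_max = max(running_max, max(scores))
        rescale = math.exp(running_max - new_max)  # corrects earlier partial sums
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, v_tile))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.5, -1.0, 2.0, 0.0]
keys = [[1.0, 0.0, 0.5, 2.0], [0.0, 1.0, -1.0, 0.5], [2.0, 2.0, 0.0, -1.0],
        [0.3, -0.7, 1.5, 0.1], [-1.0, 0.2, 0.4, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5], [-1.0, 3.0]]
full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile=2)   # matches `full` up to rounding
```

During decode the "keys" and "values" here are exactly the entries read from the KV cache, which is why the same tiling applies to paged cache blocks.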

## How does the KV cache relate to context length?

The KV cache is directly tied to a model's ability to handle long contexts. As context windows have grown from 2,048 tokens (GPT-3, 2020) to 128,000 tokens ([GPT-4 Turbo](/wiki/gpt-4), 2023) to 1,000,000+ tokens ([Gemini](/wiki/gemini), 2024), KV cache memory requirements have grown proportionally.[^42]

For a Llama-3-70B-class model (80 layers, head dimension 128) using GQA with 8 KV heads in FP16, the KV cache memory at various context lengths is as follows, with GB meaning $$2^{30}$$ bytes:

| Context length | KV cache (batch size 1) | KV cache (batch size 32) |
|---|---|---|
| 4,096 | 1.25 GB | 40 GB |
| 32,768 | 10 GB | 320 GB |
| 131,072 | 40 GB | 1,280 GB |
| 1,000,000 | 305 GB | 9,766 GB |

At a context length of 1 million tokens with batch size 32, the KV cache alone would require nearly 10 TB of memory. This illustrates why KV cache optimization is not optional for long-context serving but a hard requirement.[^15][^16] Techniques like KV cache quantization, eviction, MLA, and hierarchical offloading are what make million-token context windows practically feasible.[^26][^21][^34]
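
The table values follow from a single product, reproduced here as a short calculation. The layer count (80) and head dimension (128) are taken from the published Llama 3 70B configuration; the function name `kv_cache_bytes` is hypothetical, and the table's "GB" figures correspond to binary gibibytes.

```python
def kv_cache_bytes(seq_len, batch_size=1, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """KV cache size in bytes; the leading 2 counts one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch_size


GIB = 1024 ** 3
for seq_len in (4_096, 32_768, 131_072, 1_000_000):
    single = kv_cache_bytes(seq_len) / GIB
    batched = kv_cache_bytes(seq_len, batch_size=32) / GIB
    print(f"{seq_len:>9,} tokens: {single:8.2f} GiB (batch 1), {batched:9.2f} GiB (batch 32)")
```

Each token costs 320 KiB under these settings, so the cache grows linearly in both context length and batch size.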

[Ring attention](/wiki/ring_attention) and other sequence-parallel attention algorithms distribute the KV cache across GPUs in a ring topology, enabling extremely long contexts to be processed without any single GPU holding the entire cache.

## Implementation details

### Pre-allocation vs. dynamic allocation

A basic implementation appends new K and V tensors to Python lists or concatenates tensors at each step. While simple, this approach causes frequent memory allocations and copies. Production systems typically **pre-allocate** the cache to the maximum expected sequence length at the start, then fill in entries as tokens are generated. Pre-allocation avoids the overhead of dynamic memory management but wastes memory when sequences are shorter than the maximum. PagedAttention addresses this trade-off by combining dynamic allocation with efficient block-based memory management; vAttention provides a different solution using CUDA virtual memory primitives.[^13][^24]
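
The pre-allocation strategy can be sketched with the standard-library `array` type standing in for a GPU tensor. This is a simplified illustration for a single layer: the class `PreallocatedKVCache` and its flattened layout are hypothetical, and a production system would use framework tensors on the device.

```python
from array import array


class PreallocatedKVCache:
    """One layer's cache: `max_len` slots of `width` floats each for K and for V."""

    def __init__(self, max_len, width):
        self.max_len = max_len
        self.width = width                                # n_kv_heads * head_dim, flattened
        self.k = array("f", [0.0]) * (max_len * width)    # allocated once, up front
        self.v = array("f", [0.0]) * (max_len * width)
        self.length = 0                                   # tokens written so far

    def append(self, k_new, v_new):
        if self.length == self.max_len:
            raise RuntimeError("KV cache is full")
        if len(k_new) != self.width or len(v_new) != self.width:
            raise ValueError("expected one flattened vector of length `width`")
        start = self.length * self.width
        self.k[start:start + self.width] = array("f", k_new)   # in-place write
        self.v[start:start + self.width] = array("f", v_new)
        self.length += 1

    def keys(self):
        # memoryview slices share the underlying buffer, so nothing is copied.
        return memoryview(self.k)[: self.length * self.width]

    def values(self):
        return memoryview(self.v)[: self.length * self.width]


cache = PreallocatedKVCache(max_len=4, width=2)
cache.append([1.0, 2.0], [3.0, 4.0])
cache.append([5.0, 6.0], [7.0, 8.0])
filled_keys = cache.keys().tolist()   # [1.0, 2.0, 5.0, 6.0]
```

The two unused slots in this example are the waste the paragraph describes: they stay reserved for the lifetime of the request whether or not the sequence ever reaches `max_len`.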

### Position embeddings and the cache

When using [rotary position embeddings](/wiki/rotary_position_embedding) (RoPE), which encode position information directly into the key and query vectors, the cached keys already have positional information baked in. This means cached keys do not need to be re-encoded when new tokens are generated. However, certain KV cache quantization methods (such as KVQuant) quantize keys *before* applying RoPE because RoPE distorts the channel-wise distribution and makes post-RoPE quantization less effective.[^26] DeepSeek-V2's MLA likewise decouples a small number of "rotary" KV channels from the compressed latent path so RoPE can be applied without disturbing the low-rank compression.[^21]
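
The reason rotated keys can be cached once and never re-encoded is that the attention score between a rotated query and a rotated key depends only on their relative offset. The sketch below checks this numerically; it assumes the adjacent-pair rotation convention and a base of 10000, and the function names are hypothetical.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.2, 0.7, -1.0, 0.3]

cached_key = rope(k, 2)                   # rotated once, when token 2 was processed
score_near = dot(rope(q, 5), cached_key)  # query at position 5, offset 3

# Shifting both positions by 100 leaves the offset, and therefore the score, unchanged.
score_shifted = dot(rope(q, 105), rope(k, 102))
```

Only the new query needs rotating at each decode step; every cached key keeps the rotation it received at its own position.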

### Multi-GPU and distributed caching

For models distributed across multiple GPUs using tensor parallelism, the KV cache is also distributed. Each GPU stores the cache for its assigned attention heads. With GQA, the number of KV heads limits how many GPUs can participate in the KV cache distribution. For example, a model with 8 KV heads can shard its cache across at most 8 GPUs (one KV head per GPU); beyond that, KV heads must be replicated, which adds GPUs without shrinking the per-GPU cache. This constraint influences the choice of parallelism strategy for large-scale serving.[^10]
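
The head-count limit can be made concrete with a small planning helper. This is an illustrative sketch: `kv_shard_plan` is a hypothetical function, and replicating KV heads once the tensor-parallel size exceeds the KV head count is assumed here as one common workaround rather than a requirement of any particular framework.

```python
def kv_shard_plan(n_kv_heads, tp_size):
    """Return (kv_heads_per_gpu, replicas_per_head) for a tensor-parallel group."""
    if n_kv_heads % tp_size == 0:
        return n_kv_heads // tp_size, 1          # clean split, no duplication
    if tp_size % n_kv_heads == 0:
        return 1, tp_size // n_kv_heads          # each head copied to several GPUs
    raise ValueError("tp_size and n_kv_heads must divide one another")


N_KV_HEADS = 8
for tp_size in (1, 2, 4, 8, 16):
    heads, replicas = kv_shard_plan(N_KV_HEADS, tp_size)
    share = heads / N_KV_HEADS   # fraction of the full KV cache held by each GPU
    print(f"TP={tp_size:>2}: {heads} KV head(s) per GPU, {replicas} replica(s), "
          f"per-GPU cache share {share:.3f}")
```

Going from 8 to 16 GPUs leaves the per-GPU share at one eighth, so the extra devices add compute but no KV cache headroom per request.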

### Cache management in batched serving

In continuous batching (also called in-flight batching or iteration-level scheduling), the serving system manages a shared pool of KV cache memory across all active requests.[^25] When a request finishes, its cache memory is immediately recycled for new requests. This is more efficient than static batching, where memory is reserved for a fixed batch of requests even after some have completed. vLLM, TensorRT-LLM, SGLang, and Hugging Face TGI all use continuous batching with paged or radix-style cache management.[^2][^3][^4]
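
A shared block pool with immediate recycling can be sketched in a few lines. The class `BlockPool` and its method names are hypothetical, and the sketch tracks only block ownership; a real scheduler would also decide whether to queue, preempt, or swap a request when the pool is exhausted.

```python
class BlockPool:
    """Fixed pool of KV blocks shared by all in-flight requests."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # ids of unused blocks
        self.tables = {}                      # request id -> list of block ids
        self.lengths = {}                     # request id -> tokens stored

    def append_token(self, req_id):
        table = self.tables.setdefault(req_id, [])
        length = self.lengths.get(req_id, 0)
        if length == len(table) * self.block_size:   # last block is full, or none yet
            if not self.free:
                raise RuntimeError("no free KV blocks: request must wait or be preempted")
            table.append(self.free.pop())
        self.lengths[req_id] = length + 1

    def finish(self, req_id):
        # Blocks return to the pool at once, ready for any waiting request.
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)


pool = BlockPool(num_blocks=4, block_size=2)
for _ in range(5):
    pool.append_token("a")        # 5 tokens occupy 3 blocks
for _ in range(2):
    pool.append_token("b")        # 2 tokens occupy the last free block
pool.finish("a")                  # frees 3 blocks without waiting for "b"
pool.append_token("c")            # a new request can start immediately
```

Under static batching, request "c" would have had to wait until both "a" and "b" were done, because their memory would be released together.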

### Block size and reuse trade-offs

Block size is a key tuning knob for paged KV caches. Larger blocks improve the efficiency of attention kernels (more contiguous memory per kernel launch), but reduce the granularity of prefix sharing because only entire blocks can be shared between requests.[^3] TensorRT-LLM exposes a configurable block size and observes that this is a workload-dependent trade-off: chat workloads with long shared system prompts often benefit from smaller blocks, while batch-completion workloads benefit from larger blocks.[^3]
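
The sharing side of this trade-off is simple arithmetic: only whole blocks of a shared prefix can be reused, so the reusable length is the prefix length rounded down to a multiple of the block size. The helper name and the 1,000-token prefix below are hypothetical.

```python
def shareable_tokens(shared_prefix_len, block_size):
    """Tokens of a shared prefix covered by whole blocks, the only reusable part."""
    return (shared_prefix_len // block_size) * block_size


prefix_len = 1_000
reuse = {bs: shareable_tokens(prefix_len, bs) for bs in (16, 64, 128, 256)}
# reuse == {16: 992, 64: 960, 128: 896, 256: 768}
```

For this prefix, moving from 16-token to 256-token blocks leaves 232 instead of 8 shared tokens to be recomputed per request, which has to be weighed against the kernel efficiency gained from larger blocks.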

## Adoption and ecosystem

KV cache management has become a defining feature of LLM serving stacks. Notable systems as of 2025-2026:

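- **[vLLM](/wiki/vllm).** Open-source serving engine built around PagedAttention; used in production at Meta, IBM, Cohere, and Mistral AI. The vLLM team reported up to 24x higher throughput than Hugging Face Transformers and up to 3.5x higher than TGI at launch, and the PagedAttention paper reports 2 to 4x over FasterTransformer and Orca at the same latency.[^2][^13]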
- **[SGLang](/wiki/sglang).** Open-source framework from LMSYS with RadixAttention and HiCache for hierarchical, token-level KV cache management.[^4][^43]
- **[TensorRT](/wiki/tensorrt)-LLM.** NVIDIA's optimized serving stack with paged KV cache, in-flight batching, KV cache reuse, and priority-based eviction; targets Hopper and Blackwell GPUs.[^3]
- **[Text Generation Inference](/wiki/huggingface_tgi).** Hugging Face's production server with continuous batching and paged attention.
- **MLC-LLM.** Compiles models for diverse backends (CUDA, ROCm, Metal, WebGPU) with paged KV cache and quantization.
- **Mooncake.** Powers Kimi (Moonshot AI) with KVCache-centric disaggregated architecture and a global KVCache store.[^34]
- **LMCache.** Pluggable KV cache layer that adds hierarchical storage, cross-instance reuse, and CacheBlend RAG acceleration to vLLM.[^35][^36]

These systems also work alongside [speculative decoding](/wiki/speculative_decoding) techniques that further reduce the per-token cost of decode by drafting multiple candidate tokens at once.

## Limitations and open problems

KV caching makes autoregressive generation feasible at scale, but it has limitations. For long-context serving, KV cache memory frequently exceeds parameter memory and is the binding constraint on batch size during decode. Techniques like MLA, KIVI, KVQuant, and CacheBlend each address a slice of the problem but no single technique closes the gap to "memory-free" long-context inference.[^21][^26][^27][^36]

Aggressive eviction (H2O, SnapKV) and quantization (KIVI, KVQuant) can preserve perplexity on standard benchmarks while degrading on long-context tasks such as needle-in-a-haystack retrieval, multi-document summarization, and tool-use chains. Adaptive policies that pick a per-head or per-layer strategy are an active research area.[^28][^30][^31][^32]

Prefix caching is brittle to small prompt edits: a single token change near the start of a long prompt invalidates the entire downstream cache. Techniques like CacheBlend that allow non-prefix reuse are emerging but require careful quality control.[^36] Disaggregation reduces prefill-decode interference but increases system complexity, KV cache transfer costs, and failure surface; Mooncake and Dynamo invest heavily in fast KV cache transport (RDMA, NVLink, NIC offload) to make disaggregation pay off.[^34] Finally, different KV cache compression papers report results on different models, datasets, sequence lengths, and quality metrics, making cross-paper comparison difficult; survey papers such as Li et al. (2024) provide consolidated taxonomies but the field lacks a canonical leaderboard.[^44]

## Explain like I'm 5 (ELI5)

Imagine you are reading a long story out loud, and after each sentence you need to answer a question about everything you have read so far. Without a KV cache, you would have to re-read the entire story from the beginning every time you finish a new sentence. That would be very slow if the story is hundreds of pages long.

A KV cache is like taking notes as you read. After each sentence, you write down the important parts (the "keys" and "values") on a notepad. When you need to answer a question, you just look at your notes instead of re-reading everything. The more you read, the bigger your notepad gets. If it gets too big, you can shrink it by only keeping recent pages (sliding window), writing in smaller handwriting (quantization), or only keeping the most important pages (heavy hitters).

## See also

- [attention](/wiki/attention)
- [self attention](/wiki/self_attention)
- [multi-head self-attention](/wiki/multi-head_self-attention)
- [grouped-query attention](/wiki/grouped-query_attention)
- [multi-head latent attention](/wiki/multi-head_latent_attention)
- [sliding window attention](/wiki/sliding_window_attention)
- [vllm](/wiki/vllm)
- [sglang](/wiki/sglang)
- [tensorrt](/wiki/tensorrt)
- [huggingface tgi](/wiki/huggingface_tgi)
- [flash attention](/wiki/flash_attention)
- [flash attention 3](/wiki/flash_attention_3)
- [rotary position embedding](/wiki/rotary_position_embedding)
- [continuous batching](/wiki/continuous_batching)
- [speculative decoding](/wiki/speculative_decoding)
- [prompt caching](/wiki/prompt_caching)
- [quantization](/wiki/quantization)
- [ring attention](/wiki/ring_attention)
- [transformer](/wiki/transformer)
- [large language model](/wiki/large_language_model)

## References

[^1]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I., "Attention Is All You Need", *Advances in Neural Information Processing Systems (NeurIPS)*, 2017-12-04. https://arxiv.org/abs/1706.03762. Accessed 2026-05-24.
[^2]: vLLM project, "vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs", GitHub, 2026. https://github.com/vllm-project/vllm. Accessed 2026-05-24.
[^3]: NVIDIA, "KV cache reuse", TensorRT-LLM Documentation, 2025. https://nvidia.github.io/TensorRT-LLM/advanced/kv-cache-reuse.html. Accessed 2026-05-24.
[^4]: LMSYS, "Fast and Expressive LLM Inference with RadixAttention and SGLang", LMSYS Blog, 2024-01-17. https://www.lmsys.org/blog/2024-01-17-sglang/. Accessed 2026-05-24.
[^5]: NVIDIA, "Mastering LLM Techniques: Inference Optimization", NVIDIA Technical Blog, 2023-11-17. https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/. Accessed 2026-05-24.
[^6]: Anthropic, "Prompt caching with Claude", Anthropic News, 2024-08-14. https://www.anthropic.com/news/prompt-caching. Accessed 2026-05-24.
[^7]: OpenAI, "Prompt Caching in the API", OpenAI Blog, 2024-10-01. https://openai.com/index/api-prompt-caching/. Accessed 2026-05-24.
[^8]: Google, "Context caching, generateContent API", Gemini API Documentation, 2026. https://ai.google.dev/gemini-api/docs/caching. Accessed 2026-05-24.
[^9]: NVIDIA, "Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill", NVIDIA Technical Blog, 2024-08-01. https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/. Accessed 2026-05-24.
[^10]: Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., and Dean, J., "Efficiently Scaling Transformer Inference", *Proceedings of Machine Learning and Systems (MLSys)*, 2023-06-04. https://arxiv.org/abs/2211.05102. Accessed 2026-05-24.
[^11]: Shazeer, N., "Fast Transformer Decoding: One Write-Head Is All You Need", arXiv preprint arXiv:1911.02150, 2019-11-06. https://arxiv.org/abs/1911.02150. Accessed 2026-05-24.
[^12]: Vaswani, A. et al., "Attention Is All You Need", arXiv preprint arXiv:1706.03762, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-05-24.
[^13]: Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I., "Efficient Memory Management for Large Language Model Serving with PagedAttention", *Proceedings of the 29th Symposium on Operating Systems Principles (SOSP)*, 2023-09-12. https://arxiv.org/abs/2309.06180. Accessed 2026-05-24.
[^14]: Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S., "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints", *Proceedings of EMNLP*, 2023-12-06. https://arxiv.org/abs/2305.13245. Accessed 2026-05-24.
[^15]: Meta AI, "Introducing Llama 3.1: Our most capable models to date", Meta AI Blog, 2024-07-23. https://ai.meta.com/blog/meta-llama-3-1/. Accessed 2026-05-24.
[^16]: Hugging Face, "Llama 3.1, 405B, 70B & 8B with multilinguality and long context", Hugging Face Blog, 2024-07-23. https://huggingface.co/blog/llama31. Accessed 2026-05-24.
[^17]: Jiang, A. Q., Sablayrolles, A., Mensch, A., et al., "Mistral 7B", arXiv preprint arXiv:2310.06825, 2023-10-10. https://arxiv.org/abs/2310.06825. Accessed 2026-05-24.
[^18]: AMD, "Llama-3.3-70B-Instruct-FP8-KV model card", Hugging Face, 2025. https://huggingface.co/amd/Llama-3.3-70B-Instruct-FP8-KV. Accessed 2026-05-24.
[^19]: NVIDIA, "NVIDIA A100 Tensor Core GPU Architecture", NVIDIA Whitepaper, 2020. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf. Accessed 2026-05-24.
[^20]: NVIDIA, "NVIDIA H100 Tensor Core GPU Architecture", NVIDIA Whitepaper, 2022. https://resources.nvidia.com/en-us-tensor-core. Accessed 2026-05-24.
[^21]: DeepSeek-AI, "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model", arXiv preprint arXiv:2405.04434, 2024-05-07. https://arxiv.org/abs/2405.04434. Accessed 2026-05-24.
[^22]: DeepSeek-AI, "DeepSeek-V3 Technical Report", arXiv preprint arXiv:2412.19437, 2024-12-27. https://arxiv.org/abs/2412.19437. Accessed 2026-05-24.
[^23]: Red Hat, "How we optimized vLLM for DeepSeek-R1", Red Hat Developer, 2025-03-19. https://developers.redhat.com/articles/2025/03/19/how-we-optimized-vllm-deepseek-r1. Accessed 2026-05-24.
[^24]: Prabhu, R., Nayak, A., Mohan, J., Ramjee, R., and Panwar, A., "vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention", arXiv preprint arXiv:2405.04437, 2024-05-07. https://arxiv.org/abs/2405.04437. Accessed 2026-05-24.
[^25]: Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-G., "Orca: A Distributed Serving System for Transformer-Based Generative Models", *Proceedings of OSDI*, 2022-07-11. https://www.usenix.org/conference/osdi22/presentation/yu. Accessed 2026-05-24.
[^26]: Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A., "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization", *Advances in Neural Information Processing Systems (NeurIPS)*, 2024-12-10. https://arxiv.org/abs/2401.18079. Accessed 2026-05-24.
[^27]: Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X., "KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache", *Proceedings of ICML*, 2024-07-21. https://arxiv.org/abs/2402.02750. Accessed 2026-05-24.
[^28]: Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Ré, C., Barrett, C., Wang, Z., and Chen, B., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models", *Advances in Neural Information Processing Systems (NeurIPS)*, 2023-12-10. https://arxiv.org/abs/2306.14048. Accessed 2026-05-24.
[^29]: Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M., "Efficient Streaming Language Models with Attention Sinks", *International Conference on Learning Representations (ICLR)*, 2024-05-07. https://arxiv.org/abs/2309.17453. Accessed 2026-05-24.
[^30]: Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D., "SnapKV: LLM Knows What You are Looking for Before Generation", *Advances in Neural Information Processing Systems (NeurIPS)*, 2024-12-10. https://arxiv.org/abs/2404.14469. Accessed 2026-05-24.
[^31]: Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J., "Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs", arXiv preprint arXiv:2310.01801, 2023-10-03. https://arxiv.org/abs/2310.01801. Accessed 2026-05-24.
[^32]: Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., and Zhuang, B., "MiniCache: KV Cache Compression in Depth Dimension for Large Language Models", arXiv preprint arXiv:2405.14366, 2024-05-23. https://arxiv.org/abs/2405.14366. Accessed 2026-05-24.
[^33]: Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H., "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving", *Proceedings of OSDI*, 2024-07-10. https://arxiv.org/abs/2401.09670. Accessed 2026-05-24.
[^34]: Qin, R., Li, Z., He, W., Zhang, M., Wu, Y., Zheng, W., and Xu, X., "Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving", arXiv preprint arXiv:2407.00079, 2024-06-28. https://arxiv.org/abs/2407.00079. Accessed 2026-05-24.
[^35]: Cheng, Y., Du, K., Hu, J., Lu, Z., Pang, Y., et al., "LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference", arXiv preprint arXiv:2510.09665, 2025-10-10. https://arxiv.org/abs/2510.09665. Accessed 2026-05-24.
[^36]: Yao, J., Li, H., Liu, Y., Stoica, I., et al., "CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion", *Proceedings of EuroSys*, 2025-03-30. https://arxiv.org/abs/2405.16444. Accessed 2026-05-24.
[^37]: KVSwap authors, "KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference", arXiv preprint arXiv:2511.11907, 2025-11-15. https://arxiv.org/abs/2511.11907. Accessed 2026-05-24.
[^38]: Anthropic, "Prompt caching", Claude API Documentation, 2026. https://platform.claude.com/docs/en/build-with-claude/prompt-caching. Accessed 2026-05-24.
[^39]: Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", *Advances in Neural Information Processing Systems (NeurIPS)*, 2022-12-06. https://arxiv.org/abs/2205.14135. Accessed 2026-05-24.
[^40]: Dao, T., "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning", arXiv preprint arXiv:2307.08691, 2023-07-17. https://arxiv.org/abs/2307.08691. Accessed 2026-05-24.
[^41]: Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T., "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision", *Advances in Neural Information Processing Systems (NeurIPS)*, 2024-12-10. https://arxiv.org/abs/2407.08608. Accessed 2026-05-24.
[^42]: Google DeepMind, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context", Google DeepMind Blog, 2024-02-15. https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/. Accessed 2026-05-24.
[^43]: LMSYS, "SGLang HiCache: Fast Hierarchical KV Caching with Your Favorite Storage Backends", LMSYS Blog, 2025-09-10. https://www.lmsys.org/blog/2025-09-10-sglang-hicache/. Accessed 2026-05-24.
[^44]: Li, H., Li, Y., Tian, A., et al., "A Survey on Large Language Model Acceleration based on KV Cache Management", arXiv preprint arXiv:2412.19442, 2024-12-27. https://arxiv.org/abs/2412.19442. Accessed 2026-05-24.
[^45]: Touvron, H., Martin, L., Stone, K., et al., "Llama 2: Open Foundation and Fine-Tuned Chat Models", arXiv preprint arXiv:2307.09288, 2023-07-18. https://arxiv.org/abs/2307.09288. Accessed 2026-05-24.