H2O (Heavy-Hitter Oracle for KV Cache)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,151 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,151 words
Add missing citations, update stale details, or suggest a clearer explanation.
H2O (Heavy-Hitter Oracle) is a training-free, runtime KV cache eviction policy for autoregressive large language model inference. It identifies a small subset of "Heavy Hitter" tokens (the ones that consistently receive disproportionately large attention weights from later queries) and combines them with a sliding window of recent tokens to form a fixed-size cache, evicting everything else. Introduced by Zhenyu Zhang and collaborators at NeurIPS 2023, H2O reports up to a five-fold reduction in KV cache memory at roughly 20% cache budget while matching full-cache accuracy on standard benchmarks, with reported end-to-end throughput improvements of up to 29x over DeepSpeed Zero-Inference and Hugging Face Accelerate and up to 3x over FlexGen on OPT models.[^1][^2][^3] The technique formulates eviction as a dynamic submodular maximization problem and provides an approximation guarantee for its greedy algorithm.[^1][^4]
| Field | Value |
|---|---|
| First public release | June 24, 2023 (arXiv v1)[^1] |
| Final arXiv revision | December 18, 2023 (v3)[^1] |
| Venue | 37th Conference on Neural Information Processing Systems (NeurIPS 2023)[^5] |
| Lead author | Zhenyu Zhang (Stanford University / UT Austin)[^1] |
| Reference implementation | github.com/FMInference/H2O[^2] |
| License (reference code) | MIT License[^2] |
| Validated on | OPT, LLaMA, GPT-NeoX[^1][^3] |
| Reported cache budget | ~20% of full KV cache (10% heavy hitters + 10% recent)[^3] |
In a Transformer decoder, every generated token depends on the keys and values of all preceding tokens through self-attention. To avoid recomputing those projections on each step, modern inference engines cache them in GPU memory; this structure is the KV cache. The cache size scales linearly with the number of layers, the number of attention heads, the head dimension, the sequence length, and the batch size. For OPT-30B at sequence length 1024 and batch size 128 the H2O authors estimate the KV cache exceeds 180 GB, larger than the model weights themselves.[^3] For long-context workloads or large batches the cache can dominate both memory footprint and the cost of each attention step because the keys and values for every retained token must be streamed through high-bandwidth memory on each decoding pass.[^3][^6]
This bottleneck has motivated a family of techniques collectively called KV cache compression: quantization of cached tensors, low-rank compression, layer-wise sharing, structural changes such as Grouped-Query Attention and Multi-Query Attention, paging and reuse with systems like PagedAttention in vLLM and SGLang, and the family of eviction policies to which H2O belongs.[^6][^7]
Empirical measurements in the H2O paper show that attention score matrices in pre-trained LLMs are extremely sparse during generation, with sparsity above 95% in almost all layers.[^3] In other words, on any given decoding step, a large fraction of the cached keys receive near-zero softmax mass from the current query. This suggests that the cache stores far more state than any single forward pass actually consumes, and motivates the question of whether one can identify which keys will matter ahead of time and discard the rest.
The two most obvious eviction strategies are full cache (the baseline, no compression) and a pure recency or local window (keep only the most recent k tokens, equivalent to sliding window attention applied at inference time). H2O reports that the local-window baseline collapses quickly: on long-context generation tasks the local strategy fails to match full-cache accuracy even when given a budget several times larger than what H2O needs.[^3] The companion paper Scissorhands, presented at the same NeurIPS 2023, independently hypothesized that token importance is persistent across decoding steps and demonstrated up to a five-fold KV cache reduction by keeping the historically most-attended tokens.[^8] StreamingLLM, published shortly afterward, observed that initial tokens act as "attention sinks" that absorb excess softmax mass; keeping a handful of those sinks together with a sliding window stabilizes streaming inference to multi-million-token sequences.[^9] H2O slots between these two: it does not require a model to be retrained, but rather than only keeping recent tokens or only keeping initial sinks, it explicitly tracks which tokens have accumulated the largest attention weight and protects them.
Earlier related work in attention sparsification, particularly sparse attention patterns such as the strided and fixed-window schemes from the Sparse Transformer, attempted to reduce both compute and memory by restricting which positions can attend to which, but those schemes operate at training time and impose a static pattern. The H2O paper benchmarks variants of these static patterns and reports that pairing them with heavy-hitter selection recovers most of the accuracy that the pattern alone gives up: a strided Sparse Transformer baseline at 20% budget shows a 35 percentage-point accuracy drop in their setup, but combining the same pattern with heavy-hitter scoring narrows that gap to roughly 2 points.[^3] This is one of the empirical arguments that motivates dynamic importance-based eviction over fixed structural sparsity.
The central empirical claim of H2O is that, when the per-row attention scores are summed across query positions, the resulting distribution over keys follows a power-law: a small subset of tokens accounts for the bulk of the total mass.[^3] The authors call these high-mass tokens "Heavy Hitters" (abbreviated H2 in the paper, hence the chemical-flavored name H2O for the algorithm). Two further observations make the heavy hitter idea practical for streaming generation:
At each decoding step i, H2O maintains a fixed cache budget k. For every token j currently in the cache it tracks an accumulated attention score, which is the running sum of the (softmax-normalized) attention weight that j has received from queries at every step since j entered the cache.[^3] Let o_i = D_i^{-1} exp(Q_i K_{S_{i-1}}^T) denote the normalized attention vector at step i, where S_{i-1} is the current cache. The cache score function is
F_score(T) = sum over s in T of o_s,
i.e. the total attention mass landing on the candidate cache set T at the current step.[^3]
While the number of stored tokens is at or below k, H2O simply appends each new key and value. Once the budget is reached, on each new token i:
G_i = S_{i-1} ∪ {i} (current cache plus the new token).v in G_i, compute F_score(G_i \ {v}), the cache score if v were dropped.u whose removal preserves the largest F_score, i.e. u = argmax_{v in G_i} F_score(G_i \ {v}).S_i = G_i \ {u}.In practice the budget is split into two parts: roughly half is reserved for the most recent tokens (a sliding window for local context) and the rest is allocated to the heavy hitters selected by the accumulated score.[^3] The standard configuration cited in the paper sets the total budget at approximately 20% of the full sequence length, with about 10% recent tokens and 10% heavy hitters, which suffices to match full-cache accuracy on most benchmarks.[^3]
The authors formulate KV cache eviction as a dynamic submodular maximization problem.[^1][^3] In the static submodular setting, the classical greedy algorithm achieves a (1 - 1/e) ≈ 0.632 approximation of the optimal k-element subset selection. H2O extends this analysis to a setting where the underlying set function changes at every decoding step. Their Theorem 4.4 (informal statement) shows that under mild assumptions on the attention scheme being approximately submodular, the greedy eviction policy generates cache sets S_tilde_i satisfying
f(S_tilde_i) >= (1 - alpha)(1 - 1/e) * max_{|S|=k} f(S) - beta,
where alpha and beta are small problem-dependent constants.[^3] The result is not a tight optimality proof for arbitrary attention matrices, since true submodularity of attention is an empirical rather than provable property, but it gives a principled reason to expect a low-cost greedy procedure to be competitive with much more expensive search.
The accumulated score is updated incrementally at the same time as the attention softmax is computed, so the per-step overhead is on the order of the existing attention work over the cached tokens plus a single argmax. No retraining, fine-tuning, or auxiliary network is required, which is the principal practical advantage of the method: any pre-trained decoder can be served with H2O eviction by replacing the cache management layer.[^2][^3]
The reference implementation in h2o_flexgen is built on top of the FlexGen high-throughput offloading engine.[^2] Two design choices reduce engineering overhead: a circular queue stores the last-K tokens so that recency tracking does not require buffer shuffles, and evictions overwrite the slot in place rather than triggering a memory move.[^3] A separate h2o_hf package wraps Hugging Face Transformers and provides both an attention-masking simulation mode (useful for accuracy studies because no real eviction is performed) and a true cache-dropping mode that frees memory at runtime.[^2]
The simulation mode is important methodologically because it lets researchers separate the question of "does the heavy-hitter selection signal correctly identify the important tokens" from the question of "does the eviction implementation correctly free memory at runtime." In simulation mode the model still allocates the full cache but multiplies the attention output for non-selected tokens by zero, mimicking the effect of eviction on the forward pass without changing the memory layout; in the real eviction mode the same selection rule is used to physically drop entries from the cache buffer. The two modes give matching downstream-task numbers in the paper's experiments, which is taken as evidence that the simulation results carry over to the deployed system.[^3]
The paper evaluates H2O on three model families (OPT 6.7B / 30B / 66B / 175B, LLaMA 7B / 13B, and GPT-NeoX 20B) and a battery of tasks drawn from HELM and lm-eval-harness, including COPA, HellaSwag, OpenBookQA, PIQA, RTE, WinoGrande, MathQA, and the XSUM and CNN/DailyMail summarization datasets.[^3] At a 20% cache budget H2O matches full-cache accuracy within a percentage point on most tasks. Selected published numbers for OPT-30B are:
| Task | Full cache | H2O (20% budget) | Local window only (20% budget) |
|---|---|---|---|
| COPA | 85.0 | 84.0 | 48.0 |
| PIQA | 78.51 | 78.45 | 55.82 |
| WinoGrande | 70.24 | 69.06 | 49.17 |
| OpenBookQA | 43.2 | 43.0 | 25.2 |
Values reported by the H2O paper.[^3] The local-window column illustrates the failure mode of pure recency: at the same budget it loses double-digit accuracy on every task in the table.
An ablation in the paper isolates the two components of the cache: a heavy-hitter-only configuration (no recent tokens) is much closer to full cache than a recent-only configuration, but combining heavy hitters with a small sliding window yields the best accuracy at fixed budget.[^3]
The headline system result is a 5x reduction in KV cache memory at roughly the same generation quality, validated on OPT-6.7B and OPT-30B.[^1][^3] At sequence lengths used in the paper this translates into throughput improvements when measured against existing inference systems on the same hardware:
| Baseline | Reported throughput speedup vs baseline (OPT-6.7B / 30B) |
|---|---|
| Hugging Face Accelerate | up to 29x[^1][^3] |
| DeepSpeed Zero-Inference | up to 29x[^1][^3] |
| FlexGen | up to 3x[^1][^3] |
| Same batch size, FlexGen (A100) | up to 1.9x latency reduction[^3] |
Because much of the gain comes from being able to fit a larger effective batch in the same memory budget, the largest speedups are in regimes where the previous baseline was either out-of-memory or forced to use a tiny batch. On an A100 with input length 7000 plus 1024 generated tokens, batch size 1, the paper reports H2O reduces per-request latency from roughly 57 seconds (FlexGen) to roughly 50 seconds, and at input 2048 plus 2048 with batch 24 it raises throughput from 494 to 919 tokens per second; at batch 64, FlexGen runs out of memory while H2O continues to serve.[^3]
The authors report that H2O composes orthogonally with 4-bit weight quantization: applying H2O eviction on top of a 4-bit OPT-30B preserves accuracy on the same downstream tasks within experimental noise.[^3] Later work has noted, however, that the interaction between low-bit KV cache quantization and heavy-hitter selection is more delicate, because the noise introduced by quantizing the cached attention scores can perturb which tokens are identified as heavy hitters; this observation motivated follow-up systems such as Q-Hitter.[^10]
H2O sits within a broader family of cache compression and eviction policies developed in 2023 and 2024. The contemporaneous and immediately following methods most often compared with it are Scissorhands, StreamingLLM, and SnapKV.
| Method | Year | Selection signal | Cache contents | Reported memory reduction or budget | Training required |
|---|---|---|---|---|---|
| H2O | 2023[^1] | Accumulated attention score (heavy hitters) + recency | Heavy hitters + sliding window | ~5x at 20% budget[^1][^3] | No |
| Scissorhands | 2023[^8] | Persistence of historical attention | Pivotal tokens at fixed budget | ~5x; ~20x combined with 4-bit weights[^8] | No |
| StreamingLLM | 2023[^9] | Fixed "attention sink" positions + recency | First few tokens + sliding window | Constant-size cache, scales to ~4M tokens[^9] | Optional sink token at pretraining |
| SnapKV | 2024[^11] | Pooled attention from end-of-prompt observation window | Clustered important KVs per head | ~8x memory and ~3.6x decode speed on 16k inputs[^11] | No |
The relationships between these methods are most easily described by what each one assumes is the right summary of "importance."
A handful of other systems are closely related at the level of motivation but operate by different mechanisms. PagedAttention in vLLM does not evict, but virtualizes the cache into page-sized blocks so memory is allocated only as needed.[^12] Infini-Attention compresses old KV states into a fixed-size memory rather than evicting them. Grouped-Query Attention and Multi-Query Attention reduce per-token cache size by sharing keys and values across heads, an orthogonal axis to eviction. These can be combined with H2O.
H2O has become one of the most-cited reference points in the KV-cache compression literature. It is the standard "attention-based heavy-hitter" baseline in surveys and follow-on papers, alongside SnapKV and StreamingLLM.[^7][^10] A representative example, MarkTechPost's 2026 round-up of the leading KV cache compression techniques for LLM inference, lists H2O among the top approaches and notes its lineage and continued use as a comparison point in 2024-2026 work.[^7] Specific follow-on systems include Q-Hitter (MLSys 2024), which targets the interaction between heavy-hitter selection and KV cache quantization and is explicitly framed as "a better token oracle" relative to H2O,[^10] and a number of fragility, reasoning-aware, and head-aware variants that adopt H2O's accumulated-attention score as the importance signal but vary the protection or budget rules.[^13]
H2O was proposed for integration into the vLLM serving engine in issue #3532 in March 2024; that proposal was eventually closed as not planned, with the maintainers prioritizing prefix caching and paging rather than per-token eviction.[^14] At the time of writing the most widely used production deployment vector for H2O remains the reference h2o_flexgen and h2o_hf implementations published under the MIT license.[^2] The algorithm's training-free nature means it can in principle be applied to any decoder-only LLM, and the academic literature reports H2O-style eviction running on OPT, LLaMA (1, 2, 3), Llama 2, and Mistral checkpoints in third-party reproductions.[^7][^13]
The H2O paper and subsequent work identify a number of limitations.
Reliance on a submodularity assumption. The (1 - 1/e) approximation guarantee depends on a submodularity property of the attention-based score function. The authors acknowledge this is an empirical rather than rigorously proved property of softmax attention, and the formal bound therefore holds modulo a parameter alpha that captures how far the actual scheme deviates from exact submodularity.[^3]
Sensitivity to budget on long-context, multi-step reasoning. While H2O matches full-cache accuracy on the lm-eval-harness suite at 20% budget, more recent evaluations focused on multi-step reasoning find that all attention-based eviction methods, including H2O, can degrade more sharply when the budget is pushed substantially below the prompt length on reasoning-heavy workloads. In some of these settings the recency component of the cache turns out to be more protective than the heavy-hitter component, an opposite ordering from what holds on shorter benchmarks.[^13]
Greedy local statistics may miss future-relevant tokens. Because the accumulated score is computed online from past queries only, the algorithm can in principle evict a token that has been historically ignored but would have become important to a future query. The persistence observations in the paper argue this is uncommon in practice, but it is not ruled out, and methods such as SnapKV that look at an explicit observation window of queries near the end of the prompt have shown stronger results on long-prompt question answering for exactly this reason.[^11]
Interaction with KV cache quantization. Combining H2O selection with low-bit quantization of the cached keys and values can perturb the heavy-hitter set, because the quantization noise affects which tokens are scored highest. Q-Hitter explicitly investigates this issue and proposes a modified oracle.[^10]
No reduction in the prefill cost. H2O's eviction operates per decoding step; the prefill pass still computes full attention over the prompt, which means the method does not address the latency of prompt processing itself. Systems such as SnapKV that compress at the end of prefill are complementary on that axis.[^11]
Online overhead of the accumulated score. Although the additional bookkeeping is small relative to the attention computation itself, the algorithm does add an argmax and a running sum over the cache budget on each step. For very tight budgets and very fast decoders the overhead is negligible, but in settings where the per-step compute is dominated by a small number of matmuls (for example heavily quantized smaller models running on accelerators with extremely fast tensor cores) the relative overhead grows. The reference implementation amortizes this by fusing the score update with the attention softmax pass, an optimization that is implementation-specific and not automatic for arbitrary serving stacks.[^2][^3]
Single-head global scoring. The original H2O algorithm computes one accumulated score per token, shared across attention heads, rather than per-head scores. Per-head importance can differ substantially in modern decoder LLMs, and subsequent work, including SnapKV and various head-aware variants, demonstrates measurable gains from per-head selection on long-context question answering.[^11][^13] The per-head choice is a trade-off: scoring per head increases bookkeeping linearly in the number of heads but better tracks the actual attention patterns of methods such as Grouped-Query Attention.
H2O contributed three things that are now standard reference points in the KV-cache compression literature:
Together with PagedAttention (which addresses cache fragmentation rather than redundancy), Scissorhands (which formalizes persistence of importance), StreamingLLM (which formalizes attention sinks), and SnapKV (which formalizes per-head observation-window selection), H2O is one of the building blocks of the contemporary toolkit for serving large language models under tight memory budgets.[^7][^9][^11][^12] The technique is orthogonal to quantization, pruning, GQA and MQA head sharing, Speculative Decoding, and inference engines such as vLLM, SGLang, and TensorRT, so it can be stacked with most of them. Within the broader effort to make long-context LLM serving cheaper, H2O occupies the position of a simple, training-free, theoretically grounded eviction baseline against which newer cache-compression methods are typically compared.