H2O (Heavy-Hitter Oracle for KV Cache)

AI Inference Large Language Models

21 min read

Updated Jul 23, 2026

Suggest edit History Talk

RawGraph

Last edited

Jul 23, 2026

Fact-checked

In review queue

Sources

14 citations

Revision

v4 · 4,147 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

H2O (Heavy-Hitter Oracle) is a training-free, runtime KV cache eviction policy for autoregressive large language model inference. It identifies a small subset of "Heavy Hitter" tokens (the ones that consistently receive disproportionately large attention weights from later queries) and combines them with a sliding window of recent tokens to form a fixed-size cache, evicting everything else. Introduced by Zhenyu Zhang and collaborators at NeurIPS 2023, H2O reports up to a five-fold reduction in KV cache memory at roughly 20% cache budget while matching full-cache accuracy on standard benchmarks, with reported end-to-end throughput improvements of up to 29x over DeepSpeed Zero-Inference and Hugging Face Accelerate and up to 3x over FlexGen on OPT models.^[1]^[2]^[3] The technique formulates eviction as a dynamic submodular maximization problem and provides an approximation guarantee for its greedy algorithm.^[1]^[4]

Field	Value
First public release	June 24, 2023 (arXiv v1)^[1]
Final arXiv revision	December 18, 2023 (v3)^[1]
Venue	37th Conference on Neural Information Processing Systems (NeurIPS 2023)^[5]
Lead author	Zhenyu Zhang (Stanford University / UT Austin)^[1]
Reference implementation	github.com/FMInference/H2O^[2]
License (reference code)	MIT License^[2]
Validated on	OPT, LLaMA, GPT-NeoX^[1]^[3]
Reported cache budget	~20% of full KV cache (10% heavy hitters + 10% recent)^[3]

Background

KV cache memory in autoregressive inference

In a Transformer decoder, every generated token depends on the keys and values of all preceding tokens through self-attention. To avoid recomputing those projections on each step, modern inference engines cache them in GPU memory; this structure is the KV cache. The cache size scales linearly with the number of layers, the number of attention heads, the head dimension, the sequence length, and the batch size. For OPT-30B at sequence length 1024 and batch size 128 the H2O authors estimate the KV cache exceeds 180 GB, larger than the model weights themselves.^[3] For long-context workloads or large batches the cache can dominate both memory footprint and the cost of each attention step because the keys and values for every retained token must be streamed through high-bandwidth memory on each decoding pass.^[3]^[6]

This bottleneck has motivated a family of techniques collectively called KV cache compression: quantization of cached tensors, low-rank compression, layer-wise sharing, structural changes such as Grouped-Query Attention and Multi-Query Attention, paging and reuse with systems like PagedAttention in vLLM and SGLang, and the family of eviction policies to which H2O belongs.^[6]^[7]

Sparsity of attention at inference

Empirical measurements in the H2O paper show that attention score matrices in pre-trained LLMs are extremely sparse during generation, with sparsity above 95% in almost all layers.^[3] In other words, on any given decoding step, a large fraction of the cached keys receive near-zero softmax mass from the current query. This suggests that the cache stores far more state than any single forward pass actually consumes, and motivates the question of whether one can identify which keys will matter ahead of time and discard the rest.

Prior eviction baselines

The two most obvious eviction strategies are full cache (the baseline, no compression) and a pure recency or local window (keep only the most recent k tokens, equivalent to sliding window attention applied at inference time). H2O reports that the local-window baseline collapses quickly: on long-context generation tasks the local strategy fails to match full-cache accuracy even when given a budget several times larger than what H2O needs.^[3] The companion paper Scissorhands, presented at the same NeurIPS 2023, independently hypothesized that token importance is persistent across decoding steps and demonstrated up to a five-fold KV cache reduction by keeping the historically most-attended tokens.^[8] StreamingLLM, published shortly afterward, observed that initial tokens act as "attention sinks" that absorb excess softmax mass; keeping a handful of those sinks together with a sliding window stabilizes streaming inference to multi-million-token sequences.^[9] H2O slots between these two: it does not require a model to be retrained, but rather than only keeping recent tokens or only keeping initial sinks, it explicitly tracks which tokens have accumulated the largest attention weight and protects them.

Earlier related work in attention sparsification, particularly sparse attention patterns such as the strided and fixed-window schemes from the Sparse Transformer, attempted to reduce both compute and memory by restricting which positions can attend to which, but those schemes operate at training time and impose a static pattern. The H2O paper benchmarks variants of these static patterns and reports that pairing them with heavy-hitter selection recovers most of the accuracy that the pattern alone gives up: a strided Sparse Transformer baseline at 20% budget shows a 35 percentage-point accuracy drop in their setup, but combining the same pattern with heavy-hitter scoring narrows that gap to roughly 2 points.^[3] This is one of the empirical arguments that motivates dynamic importance-based eviction over fixed structural sparsity.

The Heavy Hitters observation

The central empirical claim of H2O is that, when the per-row attention scores are summed across query positions, the resulting distribution over keys follows a power-law: a small subset of tokens accounts for the bulk of the total mass.^[3] The authors call these high-mass tokens "Heavy Hitters" (abbreviated H2 in the paper, hence the chemical-flavored name H2O for the algorithm). Two further observations make the heavy hitter idea practical for streaming generation:

Co-occurrence correlation. Heavy hitters are not random outliers. They correspond strongly to frequently co-occurring tokens in the underlying text, so they tend to be semantically meaningful anchors rather than artifacts of a particular query.^[3]
Persistence. A token that is a heavy hitter in the first few decoding steps tends to remain a heavy hitter throughout generation, and a local statistic computed from only the preceding tokens is empirically almost as good as the (unavailable) global statistic that would require knowing the future.^[3] This persistence claim was articulated independently and given a more formal treatment in Scissorhands as the "persistence of importance hypothesis."^[8]
Necessity. Ablating heavy hitters, even while keeping every other token, causes severe performance degradation on downstream tasks, demonstrating that the heavy-hitter set carries a disproportionate share of the predictive signal.^[3]

The H2O algorithm

Score function

At each decoding step i, H2O maintains a fixed cache budget k. For every token j currently in the cache it tracks an accumulated attention score, which is the running sum of the (softmax-normalized) attention weight that j has received from queries at every step since j entered the cache.^[3] Let o_i = D_i^{-1} exp(Q_i K_{S_{i-1}}^T) denote the normalized attention vector at step i, where S_{i-1} is the current cache. The cache score function is

F_score(T) = sum over s in T of o_s,

i.e. the total attention mass landing on the candidate cache set T at the current step.^[3]

Eviction step

While the number of stored tokens is at or below k, H2O simply appends each new key and value. Once the budget is reached, on each new token i:

Form the candidate set G_i = S_{i-1} ∪ {i} (current cache plus the new token).
For each v in G_i, compute F_score(G_i \ {v}), the cache score if v were dropped.
Pick the token u whose removal preserves the largest F_score, i.e. u = argmax_{v in G_i} F_score(G_i \ {v}).
Set S_i = G_i \ {u}.

In practice the budget is split into two parts: roughly half is reserved for the most recent tokens (a sliding window for local context) and the rest is allocated to the heavy hitters selected by the accumulated score.^[3] The standard configuration cited in the paper sets the total budget at approximately 20% of the full sequence length, with about 10% recent tokens and 10% heavy hitters, which suffices to match full-cache accuracy on most benchmarks.^[3]

Theoretical guarantee

The authors formulate KV cache eviction as a dynamic submodular maximization problem.^[1]^[3] In the static submodular setting, the classical greedy algorithm achieves a (1 - 1/e) ≈ 0.632 approximation of the optimal k-element subset selection. H2O extends this analysis to a setting where the underlying set function changes at every decoding step. Their Theorem 4.4 (informal statement) shows that under mild assumptions on the attention scheme being approximately submodular, the greedy eviction policy generates cache sets S_tilde_i satisfying

f(S_tilde_i) >= (1 - alpha)(1 - 1/e) * max_{|S|=k} f(S) - beta,

where alpha and beta are small problem-dependent constants.^[3] The result is not a tight optimality proof for arbitrary attention matrices, since true submodularity of attention is an empirical rather than provable property, but it gives a principled reason to expect a low-cost greedy procedure to be competitive with much more expensive search.

Cost and online operation

The accumulated score is updated incrementally at the same time as the attention softmax is computed, so the per-step overhead is on the order of the existing attention work over the cached tokens plus a single argmax. No retraining, fine-tuning, or auxiliary network is required, which is the principal practical advantage of the method: any pre-trained decoder can be served with H2O eviction by replacing the cache management layer.^[2]^[3]

Implementation in FlexGen

The reference implementation in h2o_flexgen is built on top of the FlexGen high-throughput offloading engine.^[2] Two design choices reduce engineering overhead: a circular queue stores the last-K tokens so that recency tracking does not require buffer shuffles, and evictions overwrite the slot in place rather than triggering a memory move.^[3] A separate h2o_hf package wraps Hugging Face Transformers and provides both an attention-masking simulation mode (useful for accuracy studies because no real eviction is performed) and a true cache-dropping mode that frees memory at runtime.^[2]

The simulation mode is important methodologically because it lets researchers separate the question of "does the heavy-hitter selection signal correctly identify the important tokens" from the question of "does the eviction implementation correctly free memory at runtime." In simulation mode the model still allocates the full cache but multiplies the attention output for non-selected tokens by zero, mimicking the effect of eviction on the forward pass without changing the memory layout; in the real eviction mode the same selection rule is used to physically drop entries from the cache buffer. The two modes give matching downstream-task numbers in the paper's experiments, which is taken as evidence that the simulation results carry over to the deployed system.^[3]

Empirical results

Accuracy on standard tasks

The paper evaluates H2O on three model families (OPT 6.7B / 30B / 66B / 175B, LLaMA 7B / 13B, and GPT-NeoX 20B) and a battery of tasks drawn from HELM and lm-eval-harness, including COPA, HellaSwag, OpenBookQA, PIQA, RTE, WinoGrande, MathQA, and the XSUM and CNN/DailyMail summarization datasets.^[3] At a 20% cache budget H2O matches full-cache accuracy within a percentage point on most tasks. Selected published numbers for OPT-30B are:

Task	Full cache	H2O (20% budget)	Local window only (20% budget)
COPA	85.0	84.0	48.0
PIQA	78.51	78.45	55.82
WinoGrande	70.24	69.06	49.17
OpenBookQA	43.2	43.0	25.2

Values reported by the H2O paper.^[3] The local-window column illustrates the failure mode of pure recency: at the same budget it loses double-digit accuracy on every task in the table.

An ablation in the paper isolates the two components of the cache: a heavy-hitter-only configuration (no recent tokens) is much closer to full cache than a recent-only configuration, but combining heavy hitters with a small sliding window yields the best accuracy at fixed budget.^[3]

Memory and throughput

The headline system result is a 5x reduction in KV cache memory at roughly the same generation quality, validated on OPT-6.7B and OPT-30B.^[1]^[3] At sequence lengths used in the paper this translates into throughput improvements when measured against existing inference systems on the same hardware:

Baseline	Reported throughput speedup vs baseline (OPT-6.7B / 30B)
Hugging Face Accelerate	up to 29x^[1]^[3]
DeepSpeed Zero-Inference	up to 29x^[1]^[3]
FlexGen	up to 3x^[1]^[3]
Same batch size, FlexGen (A100)	up to 1.9x latency reduction^[3]

Because much of the gain comes from being able to fit a larger effective batch in the same memory budget, the largest speedups are in regimes where the previous baseline was either out-of-memory or forced to use a tiny batch. On an A100 with input length 7000 plus 1024 generated tokens, batch size 1, the paper reports H2O reduces per-request latency from roughly 57 seconds (FlexGen) to roughly 50 seconds, and at input 2048 plus 2048 with batch 24 it raises throughput from 494 to 919 tokens per second; at batch 64, FlexGen runs out of memory while H2O continues to serve.^[3]

Compatibility with quantization

The authors report that H2O composes orthogonally with 4-bit weight quantization: applying H2O eviction on top of a 4-bit OPT-30B preserves accuracy on the same downstream tasks within experimental noise.^[3] Later work has noted, however, that the interaction between low-bit KV cache quantization and heavy-hitter selection is more delicate, because the noise introduced by quantizing the cached attention scores can perturb which tokens are identified as heavy hitters; this observation motivated follow-up systems such as Q-Hitter.^[10]

Comparison with other KV cache eviction methods

H2O sits within a broader family of cache compression and eviction policies developed in 2023 and 2024. The contemporaneous and immediately following methods most often compared with it are Scissorhands, StreamingLLM, and SnapKV.

Method	Year	Selection signal	Cache contents	Reported memory reduction or budget	Training required
H2O	2023^[1]	Accumulated attention score (heavy hitters) + recency	Heavy hitters + sliding window	~5x at 20% budget^[1]^[3]	No
Scissorhands	2023^[8]	Persistence of historical attention	Pivotal tokens at fixed budget	~5x; ~20x combined with 4-bit weights^[8]	No
StreamingLLM	2023^[9]	Fixed "attention sink" positions + recency	First few tokens + sliding window	Constant-size cache, scales to ~4M tokens^[9]	Optional sink token at pretraining
SnapKV	2024^[11]	Pooled attention from end-of-prompt observation window	Clustered important KVs per head	~8x memory and ~3.6x decode speed on 16k inputs^[11]	No

The relationships between these methods are most easily described by what each one assumes is the right summary of "importance."

Scissorhands keeps the historically most-attended tokens under the persistence of importance hypothesis. H2O's accumulated-attention score is closely related, and the two methods report similar 5x memory reductions on overlapping benchmarks. The H2O paper additionally provides the submodular formulation and approximation guarantee.^[3]^[8]
StreamingLLM addresses a different failure mode. It does not attempt to identify semantically important tokens; it observes that the first few tokens act as a sink for excess softmax mass and that retaining them, together with a sliding window, prevents the perplexity blowup that ordinary window attention suffers when the sequence grows beyond the cache. It can therefore stream to multi-million-token contexts but does not by itself reduce cache memory below the chosen window plus sink size. H2O does not preserve initial tokens specifically, but its heavy hitters often coincide with positions that, in practice, attract sink-like attention.^[3]^[9]
SnapKV runs the prefilled prompt through a final "observation window" and uses pooled attention from that window to vote for which earlier positions to keep per attention head. It targets prompt compression rather than per-decoding-step eviction and reports stronger compression on long prompts (around 92% reduction at 1024-token capacity on its benchmarks).^[11] H2O computes a flat cumulative score over all queries seen so far rather than a per-head pooled signal from a localized observation window; this makes H2O simpler and effective for general decoding, while SnapKV-style per-head selection has shown advantages on long-prompt question answering. Subsequent surveys of KV cache compression group H2O and SnapKV together as the dominant attention-based eviction approaches, with H2O often used as the standard baseline in 2024 and 2025 papers.^[7]^[10]

A handful of other systems are closely related at the level of motivation but operate by different mechanisms. PagedAttention in vLLM does not evict, but virtualizes the cache into page-sized blocks so memory is allocated only as needed.^[12] Infini-Attention compresses old KV states into a fixed-size memory rather than evicting them. Grouped-Query Attention and Multi-Query Attention reduce per-token cache size by sharing keys and values across heads, an orthogonal axis to eviction. These can be combined with H2O.

Adoption and follow-on work

H2O has become one of the most-cited reference points in the KV-cache compression literature. It is the standard "attention-based heavy-hitter" baseline in surveys and follow-on papers, alongside SnapKV and StreamingLLM.^[7]^[10] A representative example, MarkTechPost's 2026 round-up of the leading KV cache compression techniques for LLM inference, lists H2O among the top approaches and notes its lineage and continued use as a comparison point in 2024-2026 work.^[7] Specific follow-on systems include Q-Hitter (MLSys 2024), which targets the interaction between heavy-hitter selection and KV cache quantization and is explicitly framed as "a better token oracle" relative to H2O,^[10] and a number of fragility, reasoning-aware, and head-aware variants that adopt H2O's accumulated-attention score as the importance signal but vary the protection or budget rules.^[13]

H2O was proposed for integration into the vLLM serving engine in issue #3532 in March 2024; that proposal was eventually closed as not planned, with the maintainers prioritizing prefix caching and paging rather than per-token eviction.^[14] At the time of writing the most widely used production deployment vector for H2O remains the reference h2o_flexgen and h2o_hf implementations published under the MIT license.^[2] The algorithm's training-free nature means it can in principle be applied to any decoder-only LLM, and the academic literature reports H2O-style eviction running on OPT, LLaMA (1, 2, 3), Llama 2, and Mistral checkpoints in third-party reproductions.^[7]^[13]

Limitations and criticisms

The H2O paper and subsequent work identify a number of limitations.

Reliance on a submodularity assumption. The (1 - 1/e) approximation guarantee depends on a submodularity property of the attention-based score function. The authors acknowledge this is an empirical rather than rigorously proved property of softmax attention, and the formal bound therefore holds modulo a parameter alpha that captures how far the actual scheme deviates from exact submodularity.^[3]

Sensitivity to budget on long-context, multi-step reasoning. While H2O matches full-cache accuracy on the lm-eval-harness suite at 20% budget, more recent evaluations focused on multi-step reasoning find that all attention-based eviction methods, including H2O, can degrade more sharply when the budget is pushed substantially below the prompt length on reasoning-heavy workloads. In some of these settings the recency component of the cache turns out to be more protective than the heavy-hitter component, an opposite ordering from what holds on shorter benchmarks.^[13]

Greedy local statistics may miss future-relevant tokens. Because the accumulated score is computed online from past queries only, the algorithm can in principle evict a token that has been historically ignored but would have become important to a future query. The persistence observations in the paper argue this is uncommon in practice, but it is not ruled out, and methods such as SnapKV that look at an explicit observation window of queries near the end of the prompt have shown stronger results on long-prompt question answering for exactly this reason.^[11]

Interaction with KV cache quantization. Combining H2O selection with low-bit quantization of the cached keys and values can perturb the heavy-hitter set, because the quantization noise affects which tokens are scored highest. Q-Hitter explicitly investigates this issue and proposes a modified oracle.^[10]

No reduction in the prefill cost. H2O's eviction operates per decoding step; the prefill pass still computes full attention over the prompt, which means the method does not address the latency of prompt processing itself. Systems such as SnapKV that compress at the end of prefill are complementary on that axis.^[11]

Online overhead of the accumulated score. Although the additional bookkeeping is small relative to the attention computation itself, the algorithm does add an argmax and a running sum over the cache budget on each step. For very tight budgets and very fast decoders the overhead is negligible, but in settings where the per-step compute is dominated by a small number of matmuls (for example heavily quantized smaller models running on accelerators with extremely fast tensor cores) the relative overhead grows. The reference implementation amortizes this by fusing the score update with the attention softmax pass, an optimization that is implementation-specific and not automatic for arbitrary serving stacks.^[2]^[3]

Single-head global scoring. The original H2O algorithm computes one accumulated score per token, shared across attention heads, rather than per-head scores. Per-head importance can differ substantially in modern decoder LLMs, and subsequent work, including SnapKV and various head-aware variants, demonstrates measurable gains from per-head selection on long-context question answering.^[11]^[13] The per-head choice is a trade-off: scoring per head increases bookkeeping linearly in the number of heads but better tracks the actual attention patterns of methods such as Grouped-Query Attention.

H2O contributed three things that are now standard reference points in the KV-cache compression literature:

The empirical observation that accumulated attention follows a power-law distribution and that the heavy-tail positions can be identified online with a running sum of softmax weights.^[3]
The cache-design pattern of combining heavy hitters with a sliding window of recent tokens, which has been adopted, with variations, by most subsequent eviction systems.^[7]^[10]^[13]
A theoretical framing of streaming eviction as dynamic submodular maximization, giving an approximation guarantee that is unusual in this otherwise heuristic literature.^[1]^[3]

Together with PagedAttention (which addresses cache fragmentation rather than redundancy), Scissorhands (which formalizes persistence of importance), StreamingLLM (which formalizes attention sinks), and SnapKV (which formalizes per-head observation-window selection), H2O is one of the building blocks of the contemporary toolkit for serving large language models under tight memory budgets.^[7]^[9]^[11]^[12] The technique is orthogonal to quantization, pruning, GQA and MQA head sharing, Speculative Decoding, and inference engines such as vLLM, SGLang, and TensorRT, so it can be stacked with most of them. Within the broader effort to make long-context LLM serving cheaper, H2O occupies the position of a simple, training-free, theoretically grounded eviction baseline against which newer cache-compression methods are typically compared.

References

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen, "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models", arXiv, 2023-06-24 (v1) / 2023-12-18 (v3). https://arxiv.org/abs/2306.14048. Accessed 2026-05-20. ↩
FMInference, "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (reference implementation)", GitHub, 2023. https://github.com/FMInference/H2O. Accessed 2026-05-20. ↩
Zhenyu Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (HTML version)", arXiv, 2023-12-18. https://arxiv.org/html/2306.14048v3. Accessed 2026-05-20. ↩
Hugging Face, "Paper page: H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models", Hugging Face, 2023. https://huggingface.co/papers/2306.14048. Accessed 2026-05-20. ↩
NeurIPS, "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (poster page)", NeurIPS, 2023. https://neurips.cc/virtual/2023/poster/71645. Accessed 2026-05-20. ↩
NeurIPS Proceedings, "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (conference paper)", NeurIPS Proceedings, 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/6ceefa7b15572587b78ecfcebb2827f8-Paper-Conference.pdf. Accessed 2026-05-20. ↩
MarkTechPost, "Top 10 KV Cache Compression Techniques for LLM Inference: Reducing Memory Overhead Across Eviction, Quantization, and Low-Rank Methods", MarkTechPost, 2026-04-29. https://www.marktechpost.com/2026/04/29/top-10-kv-cache-compression-techniques-for-llm-inference-reducing-memory-overhead-across-eviction-quantization-and-low-rank-methods/. Accessed 2026-05-20. ↩
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava, "Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time", arXiv, 2023-05-26. https://arxiv.org/abs/2305.17118. Accessed 2026-05-20. ↩
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, "Efficient Streaming Language Models with Attention Sinks", arXiv, 2023-09-29. https://arxiv.org/abs/2309.17453. Accessed 2026-05-20. ↩
Zhenyu Zhang et al., "Q-Hitter: A Better Token Oracle for Efficient LLM Inference via Sparse-Quantized KV Cache", Proceedings of MLSys, 2024. https://proceedings.mlsys.org/paper_files/paper/2024/file/bbb7506579431a85861a05fff048d3e1-Paper-Conference.pdf. Accessed 2026-05-20. ↩
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen, "SnapKV: LLM Knows What You are Looking for Before Generation", arXiv, 2024-04-22. https://arxiv.org/abs/2404.14469. Accessed 2026-05-20. ↩
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, Ion Stoica, "Efficient Memory Management for Large Language Model Serving with PagedAttention", arXiv, 2023-09-12. https://arxiv.org/abs/2309.06180. Accessed 2026-05-20. ↩
Anonymous authors, "Hold Onto That Thought: Assessing KV Cache Compression On Reasoning", OpenReview, 2025. https://openreview.net/pdf?id=OtZtLYAdQY. Accessed 2026-05-20. ↩
vllm-project, "Feature: Implement H2O (Heavy-Hitter Oracle) (Issue #3532)", GitHub, 2024. https://github.com/vllm-project/vllm/issues/3532. Accessed 2026-05-20. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

3 revisions by 1 contributor · full history

Suggest edit

What links here

Speculative Decoding

Background

KV cache memory in autoregressive inference

Sparsity of attention at inference

Prior eviction baselines

The Heavy Hitters observation

The H2O algorithm

Score function

Eviction step

Theoretical guarantee

Cost and online operation

Implementation in FlexGen

Empirical results

Accuracy on standard tasks

Memory and throughput

Compatibility with quantization

Comparison with other KV cache eviction methods

Adoption and follow-on work

Limitations and criticisms

Related work and significance

See also

References

Improve this article

Related Articles

Speculative Decoding

Fireworks AI

QLoRA

AWQ (Activation-aware Weight Quantization)

StreamingLLM

Lookahead Decoding

What links here

Related Articles

Speculative Decoding

Fireworks AI

QLoRA

AWQ (Activation-aware Weight Quantization)

StreamingLLM

Lookahead Decoding