StreamingLLM

Updated Jul 16, 2026

StreamingLLM is an inference-time technique that allows pretrained transformer language models, originally trained with a finite attention window, to process input streams of effectively unlimited length while using a constant-size key-value cache. The method, introduced by Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis in the paper Efficient Streaming Language Models with Attention Sinks, rests on a single empirical observation: in autoregressive transformers, the first few tokens of a sequence accumulate disproportionately large attention scores regardless of their semantic content, behaving as "attention sinks" that absorb the otherwise unallocated softmax probability mass.^[1]^[2] By retaining the key-value (KV) states of those initial sink tokens together with a sliding window of recent tokens, StreamingLLM enables stable decoding across millions of tokens without fine-tuning, modifying weights, or extending the model's positional encoding scheme.^[1]^[3] The paper was first posted to arXiv on 29 September 2023 and accepted to the International Conference on Learning Representations (ICLR) 2024.^[1]^[4]

The method has had wide downstream impact in both research and production inference systems. The reference implementation at mit-han-lab/streaming-llm was followed by a community drop-in package (attention_sinks by Tom Aarsen), a SinkCache abstraction merged into Hugging Face Transformers, and TensorRT-LLM and SwiftInfer ports that bring the method to production GPU servers.^[3]^[5]^[6]^[7]^[8] The attention sink concept itself has since been adopted, in modified form, by architectures such as OpenAI's gpt-oss models, which replace the role of the initial sink tokens with a per-head learnable scalar in the softmax denominator.^[9]^[10] As of 2026, sink-based KV-cache strategies are treated as a standard primitive of long-context inference, alongside sliding window attention, PagedAttention, and KV-cache eviction policies such as H2O.^[11]^[12]

Background

The problem: streaming inference with a finite cache

A standard decoder-only transformer such as Llama 2, MPT, Falcon, or Pythia uses multi-head self-attention over all previously generated tokens at every decoding step.^[1] During autoregressive generation, the keys and values produced by each layer for each prior token are cached in the KV cache so they need not be recomputed.^[1]^[13] Two costs grow with sequence length:

Memory. The KV cache size scales linearly with the number of stored tokens; for a long stream this rapidly exceeds available GPU memory.^[1]
Compute. During prefill, attention cost is quadratic in sequence length; during decoding, each step attends over every cached token, so the per-step cost grows linearly with cache length.^[1]
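
To make the memory cost concrete, the sketch below estimates KV-cache size from a model's shape. The function name and the Llama-2-7B shape values (32 layers, 32 KV heads, head dimension 128, 16-bit values) are assumptions supplied for illustration; they are not taken from the paper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens."""
    # The leading 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

# Assumed Llama-2-7B shape: 32 layers, 32 KV heads, head_dim 128, fp16 values.
LLAMA2_7B = dict(num_layers=32, num_kv_heads=32, head_dim=128)

for tokens in (1_024, 4_096, 1_000_000):
    gib = kv_cache_bytes(tokens, **LLAMA2_7B) / 2**30
    print(f"{tokens:>9} tokens -> {gib:8.2f} GiB")
```

Under these assumptions the cache grows by half a mebibyte per token, so an unbounded stream exhausts any fixed memory budget, whereas a fixed-size cache does not.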

A second problem is length generalisation: most LLMs are pretrained on sequences of a fixed maximum length (4,096 tokens for Llama 2, 2,048 for many earlier models), and quality degrades or collapses entirely when the input exceeds that bound, even if the KV cache could in principle hold more tokens.^[1]^[5] For streaming workloads such as multi-turn chat, continuous voice transcription, or any always-on assistant, both problems compound: the model must run for hours or days without resetting state, and the input grows continuously.

Why window attention alone fails

The obvious fix is to evict old KV entries once the cache fills, retaining only the most recent L tokens. This is the window attention baseline. Empirically, however, simple window attention collapses catastrophically once eviction begins. In the StreamingLLM paper, Llama-2-13B with a 1,024-entry window applied to a 20,000-token stream reaches a perplexity of roughly 5,158, compared with about 5.4 for an oracle baseline that recomputes attention over the full sliding window at every step.^[1]^[13] In practical chat demos, Llama-2-7B and Mistral-7B with naive window eviction begin emitting broken Unicode and incoherent text within about 1,000 tokens of the eviction threshold.^[5]^[13]

The puzzle is that the model still has the same L most-recent tokens in cache; only the first few tokens have been discarded. Yet performance falls off a cliff. StreamingLLM's central contribution is to identify why this happens and to fix it with a near-trivial change to the cache policy.

Author backgrounds and institutional setting

The paper was led by Guangxuan Xiao, an EECS graduate student at MIT working in Song Han's group on efficient deployment of large language models, with co-authors Yuandong Tian and Mike Lewis at Meta AI, Beidi Chen at Carnegie Mellon University, and Song Han at MIT and NVIDIA.^[1]^[20] Han also held an affiliation with the MIT-IBM Watson AI Lab at the time of publication, and Lewis was the senior author.^[20] The work fit into a broader research direction at MIT HAN Lab focused on making frontier LLMs cheaper to run, alongside related projects such as SmoothQuant (post-training quantisation) and AWQ (activation-aware weight quantisation).^[4]^[20] Beidi Chen's group at CMU had concurrent interests in attention-sparsification techniques: the H2O paper (NeurIPS 2023, discussed below), which lists Chen among its authors, appeared at around the same time.^[12] An MIT News profile in February 2024 framed the work as a practical fix for chatbots that "crash" after extended interaction, quoting Xiao on its motivation for "persistently deployed" assistants.^[20]

The attention sink observation

Empirical pattern

Xiao and colleagues plot per-token, per-layer attention maps from Llama-2 and related models on long sequences and observe that, beyond the first two layers, an enormous share of attention probability is directed at the first few tokens of the sequence, irrespective of what those tokens encode semantically.^[1]^[14] Substituting the natural beginning-of-sequence content with arbitrary placeholder text (newlines, punctuation) preserves the effect, confirming that the phenomenon is positional rather than content-driven.^[14] The authors call these initial positions attention sinks.

A softmax-driven explanation

The mechanism the paper proposes follows from the softmax normalisation in attention.^[14] Standard scaled-dot-product attention writes

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

so the softmax weights (the attention scores) for any query must sum to exactly 1. When a head has no semantically relevant token to attend to at a given step, it cannot simply output zero attention; the probability mass must land somewhere. Through training, the model learns that the safest place to dump unused attention is on tokens that are guaranteed to be present and reachable from every query position. The first tokens of the sequence satisfy both criteria: they appear in every training example and, because of the causal mask, they are visible to every subsequent position. Over training, the keys at those positions are sculpted so that they cleanly absorb the unused probability mass without contaminating the value aggregation.^[1]^[14]

This explanation predicts that attention sinks should appear wherever a softmax is used over a fixed-position prefix. The paper verifies the prediction in two encoder settings, BERT and the Vision Transformer (ViT), and reports similar concentration of attention at the [CLS] or first-patch token.^[1]^[14] Subsequent interpretability work has linked the phenomenon to "no-op" attention heads and to a particular form of residual-stream regulation.^[9]^[14]

Why removing sinks destroys the model

Once the KV entries for the sink tokens are evicted from the cache by simple window attention, the model still emits a softmax that sums to 1, but it now has nowhere semantically harmless to send the previously unused mass. That mass instead flows onto the surviving recent tokens, distorting their effective contribution to the value aggregation and propagating bad activations through every subsequent layer. The collapse in perplexity is the macroscopic consequence of this distortion. The fix, conceptually, is to ensure the sink positions remain in cache at all times so that the attention mechanism has a legitimate place to discharge unused probability mass.^[1]^[13]
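
The redistribution described above can be seen in a toy calculation. This is only an illustrative sketch: the logits are invented, and a single hypothetical head stands in for the behaviour the paper reports across layers.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention logits for one query: index 0 is a sink token with a
# large learned logit; the other four are recent tokens the head does not need.
logits_with_sink = [6.0, 0.1, 0.0, -0.2, 0.3]
with_sink = softmax(logits_with_sink)

# Window attention evicts the sink, so the same recent-token logits are
# renormalised on their own and must absorb all of the probability mass.
without_sink = softmax(logits_with_sink[1:])

print("sink weight:", round(with_sink[0], 3))
print("recent-token mass with sink:", round(sum(with_sink[1:]), 3))
print("recent-token mass without sink:", round(sum(without_sink), 3))
```

With the sink present, the recent tokens together receive only a small share of the attention; once it is evicted, the same tokens receive all of it, even though their logits have not changed.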

How StreamingLLM works

Cache layout

StreamingLLM stores two contiguous KV regions per layer:

Region	Contents	Size
Attention sink	KV states for the first `S` tokens of the stream	typically `S = 4`
Rolling window	KV states for the most recent `W` tokens	typically `W = 1,020`

For an incoming token at position t > S + W, the cache continues to hold the first S tokens unchanged while the window slides by one position, evicting the oldest non-sink entry. Notationally the paper uses S + W to describe the configuration; the default in published code is 4 + 1020 = 1024 total cached tokens.^[1]^[3]^[6] Ablations in the paper find that S = 1 or S = 2 is sometimes insufficient to fully stabilise perplexity, while S = 4 is sufficient for every model tested.^[1]
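
As a minimal sketch of this layout, the helper below lists which absolute stream positions still have KV states in cache after a given number of tokens. The function name and the toy sizes are assumptions chosen for readability, not part of the published code.

```python
def cached_positions(num_seen, sink_size=4, window_size=1020):
    """Absolute stream positions whose KV states remain cached after num_seen tokens."""
    if num_seen <= sink_size + window_size:
        return list(range(num_seen))          # cache not yet full: nothing evicted
    sinks = list(range(sink_size))            # first S tokens are never evicted
    recent = list(range(num_seen - window_size, num_seen))  # last W tokens
    return sinks + recent

# Toy configuration (S = 4, W = 4) so the result is easy to read.
print(cached_positions(6, sink_size=4, window_size=4))
print(cached_positions(10, sink_size=4, window_size=4))
```

With the default 4 + 1020 configuration the returned list never exceeds 1,024 entries, however long the stream becomes.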

Positional encoding inside the cache

A subtle but crucial detail concerns positional encoding. StreamingLLM computes positions inside the cache, not in the original token stream.^[1]^[13] If the cache always holds 1,024 tokens, then those tokens occupy positional indices 0 to 1,023 from the model's perspective, regardless of where they originated in the absolute stream. This keeps every query's distance to the attention sinks small, which matters for relative-position schemes that decay with distance.

Concrete implications by encoding type:

Rotary Position Embedding (RoPE). Keys are cached before rotation; the rotation matrix is reapplied at each decode step using the cache-internal position, not the absolute position.^[1]^[13]
ALiBi. The linear bias is computed from cache-internal distances. The contiguous bias is applied rather than a "jumping" bias that would reflect the absolute stream position.^[1]

Without this trick, RoPE would rotate the cached sink keys to a position thousands of steps away from the current query, undoing the sink's effect; the model would still nominally have the sink tokens in cache, but its positional machinery would render them unreachable.^[13]
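
The effect of cache-internal positions on RoPE can be sketched in pure Python. The helper names, vectors, and the small 8-entry cache are hypothetical, and the rotation uses the interleaved-pair formulation of RoPE, which may differ in layout from any particular library's implementation.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to an even-length vector at a position."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # one frequency per coordinate pair
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cache_positions(absolute_positions):
    """Re-index cached entries 0..n-1 in cache order, ignoring stream offsets."""
    return list(range(len(absolute_positions)))

# Hypothetical cache: 4 sinks plus the 4 most recent tokens of a long stream.
absolute = [0, 1, 2, 3, 999_996, 999_997, 999_998, 999_999]
internal = cache_positions(absolute)

q = [1.0, 0.0, 0.5, -0.5]        # query of the newest token (made-up values)
k_sink = [0.3, 0.7, -0.2, 0.1]   # un-rotated key stored for the first sink

score_internal = dot(rope_rotate(q, internal[-1]), rope_rotate(k_sink, internal[0]))
score_absolute = dot(rope_rotate(q, absolute[-1]), rope_rotate(k_sink, absolute[0]))
# RoPE scores depend only on the position offset, so an offset of 7 gives the
# same score wherever it occurs in the sequence.
score_shifted = dot(rope_rotate(q, 107), rope_rotate(k_sink, 100))

print(score_internal, score_shifted, score_absolute)
```

Because keys are stored un-rotated, the query sees the sink at an offset of 7 under cache-internal indexing, rather than an offset of 999,999 under absolute indexing.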

Pseudocode

A simplified view of the decode-time policy looks like the following:

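def stream_decode(tokens, compute_qkv, attention, S=4, W=1020):
    """Decode a stream while holding at most S + W KV entries (sketch).

    compute_qkv and attention are caller-supplied placeholders for the model.
    """
    cache = []  # (key, value) pairs; keys are stored before positional rotation
    for token in tokens:
        q_t, k_t, v_t = compute_qkv(token)
        if len(cache) < S + W:
            cache.append((k_t, v_t))
        else:
            # keep first S entries (sinks) unchanged; evict oldest non-sink
            cache = cache[:S] + cache[S + 1:] + [(k_t, v_t)]
        keys, values = zip(*cache)
        # positions for RoPE/ALiBi are cache indices, not absolute stream offsets
        positions = range(len(cache))
        yield attention(q_t, keys, values, positions)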

The total cache footprint is therefore constant in t. With cached KVs, each decode step attends over at most S + W entries, so attention costs O(S + W) per token rather than the O(t) of an unbounded cache, giving constant per-token cost.^[1]

Sink-token pretraining variant

The vanilla StreamingLLM recipe repurposes whatever tokens happen to sit at the start of the stream as the sinks; in a chat application this is typically a system prompt or beginning-of-sequence token. The paper additionally explores a small pretraining change: prepend a single dedicated, learnable sink token to every training example.^[1]^[14] A 160M-parameter model pretrained with this modification:

converges with loss curves indistinguishable from a vanilla model;^[1]
achieves comparable or slightly improved zero-shot scores on seven NLP benchmarks;^[1]
enables stable streaming with S = 1 (only the dedicated sink), where the vanilla model needs S = 4 repurposed content tokens to be stable;^[1]^[14]
reaches a streaming perplexity of 18.01 in the 1 + 1023 configuration, compared with 18.49 for the vanilla baseline.^[1]

The paper also evaluates a "Zero Sink" variant inspired by the "SoftMax-Off-by-One" proposal of Evan Miller, which adds 1 to the softmax denominator. This partially mitigates the sink phenomenon but does not eliminate the need for multiple repurposed sink tokens.^[1]^[14]
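
Written out, the off-by-one softmax mentioned above takes the following form. The symbols x_i (the attention logits for one query) and N (the number of visible keys) are notation introduced here for illustration.

```latex
\mathrm{softmax}_1(x)_i = \frac{e^{x_i}}{1 + \sum_{j=1}^{N} e^{x_j}},
\qquad
\sum_{i=1}^{N} \mathrm{softmax}_1(x)_i
  = 1 - \frac{1}{1 + \sum_{j=1}^{N} e^{x_j}} < 1 .
```

Because the weights can sum to less than 1, a head may leave part of its attention unassigned. The extra 1 in the denominator equals e^0, so the variant behaves like an additional key whose logit is fixed at zero.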

Results in the original paper

Perplexity on long streams

The headline experiment runs each candidate model on streams of up to 4 million tokens drawn from PG19 and other long-document corpora, measuring sliding-window perplexity. With the StreamingLLM cache layout, Llama-2-7B/13B/70B, MPT-7B/30B, Falcon-7B/40B, and Pythia-2.8B/6.9B/12B all exhibit perplexity curves that remain flat and close to the recomputation oracle across the entire stream.^[1]^[3] Pure window attention without sinks diverges within a few thousand tokens of the eviction threshold.^[1] Dense attention either runs out of memory or fails for inputs longer than the training context.^[1]
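
For readers unfamiliar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below uses invented log-probabilities purely to show the scale of the numbers involved; it is not the paper's evaluation harness.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative natural-log likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical per-token log-probabilities, for illustration only.
healthy = [math.log(0.25)] * 8       # model assigns p = 0.25 to every true token
collapsed = [math.log(0.0002)] * 8   # model assigns p = 0.0002 to every true token

print(perplexity(healthy), perplexity(collapsed))
```

A perplexity in the thousands therefore means the model assigns, on average, well under a tenth of a percent of probability to each correct next token, which is why the window-attention collapse is described as catastrophic.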

Streaming question answering

On the ARC-Challenge and ARC-Easy benchmarks reformulated as a continuous stream, StreamingLLM matches the one-shot baseline at roughly 71-91% accuracy, where dense attention is out of memory and window attention scores near zero (1-3%).^[1] These numbers do not represent improved reasoning over a one-shot baseline; they show that the method preserves the model's underlying capability when run as a stream, which is what dense attention and window attention each fail to do for different reasons.^[1]^[13]

Throughput

StreamingLLM is compared with a "sliding window with recomputation" baseline, which is the standard oracle: at every decode step, recompute attention from scratch over the last W tokens. That oracle has O(T · W^2) time complexity for a stream of length T, dominated by the recomputed prefill. StreamingLLM, by reusing cached KVs and only computing one new KV per step, runs in O(T · W) and delivers up to a 22.2x end-to-end speedup on long streams while matching perplexity.^[1]^[3]^[13] A concrete data point from MIT's coverage: at a 4,096-token cache size, the sliding-window-with-recomputation baseline requires roughly 1,411 milliseconds per generated token, while StreamingLLM produces a token in about 65 milliseconds, a 21.7x ratio that closely matches the 22.2x figure quoted in the paper.^[20]
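
The two complexity expressions and the quoted latency ratio can be checked with simple arithmetic. The cost functions below are a rough model that counts only attention-score computations and ignores constants and all other work; the stream length and cache size are hypothetical.

```python
def recompute_cost(stream_len, window):
    """Score computations when every step re-encodes the last `window` tokens."""
    return stream_len * window * window

def streaming_cost(stream_len, window):
    """Score computations when each step attends once over a cached window."""
    return stream_len * window

T, W = 100_000, 1_024  # hypothetical stream length and cache size
asymptotic_ratio = recompute_cost(T, W) // streaming_cost(T, W)
print("asymptotic ratio:", asymptotic_ratio)

# Latency figures quoted above for a 4,096-token cache (milliseconds per token).
measured_ratio = 1411 / 65
print("measured ratio:", round(measured_ratio, 1))
```

The asymptotic ratio equals W, while the measured speedup is far smaller, which is expected: real decoding time includes feed-forward layers and memory traffic that this count leaves out.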

Models and configurations evaluated

Family	Sizes evaluated	Position encoding
Llama-2	7B, 13B, 70B	RoPE
MPT	7B, 30B	ALiBi
Falcon	7B, 40B	RoPE
Pythia	2.8B, 6.9B, 12B	RoPE

Source: original paper, Tables 1-3.^[1] All families showed the attention sink phenomenon, and all worked with the same 4 + 1020 cache configuration without retraining.^[1]^[3]

Reference implementation

The reference code is released by the MIT HAN Lab at github.com/mit-han-lab/streaming-llm under an MIT license, built on top of transformers==4.33.0.^[3] It provides:

modified attention modules for Llama, MPT, Falcon, and Pythia that implement the sink-plus-window KV layout and the cache-internal positional rotation for RoPE/ALiBi;^[3]
a streaming chat demo (streaming_chat.py);^[3]
a perplexity evaluation harness over PG19 and other long corpora.^[3]

The repository links the paper to its ICLR 2024 publication and provides the canonical BibTeX entry for citation.^[3] As of 2026 it remains the reference codebase for the technique, with hundreds of forks and external integrations.

Adoption in libraries and runtimes

Hugging Face `attention_sinks` (Tom Aarsen)

Shortly after the paper's release, Tom Aarsen of Hugging Face published the `attention_sinks` Python package as a drop-in replacement for the `transformers` API.^[5]^[6] Switching `from transformers import AutoModel` to `from attention_sinks import AutoModel` is enough to wrap any supported model with the StreamingLLM cache policy:

from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="auto",
    attention_sink_size=4,
    attention_sink_window_size=4092,
)

The package defaults to attention_sink_size=4 and attention_sink_window_size=1020, matching the paper. Community contributions extended support beyond the originally evaluated four families to Llama, Mistral, Falcon, MPT, GPT-NeoX (Pythia), GPT-J, Qwen, StableLM, BTLM, and Yi.^[6] Endless-generation tests on Llama-2-7B show stable fluency past 10,000 tokens, where stock transformers runs out of VRAM or emits broken Unicode and pure window attention corrupts output after about 1,000 tokens.^[5]

Hugging Face Transformers `SinkCache`

The pattern was subsequently upstreamed into the main transformers library through PR #26681 (December 2023), which introduced a generic Cache abstraction and a concrete SinkCache implementation.^[7]^[15] SinkCache retains a configurable number of initial sinks and a rolling window of recent KV entries, exposing the StreamingLLM behaviour as a standard library feature available without third-party packages.^[7] A community Hugging Face Space (transformers-community/sink_cache) demonstrates the integration.^[7]

TensorRT-LLM and SwiftInfer

NVIDIA's TensorRT-LLM compiler stack integrated StreamingLLM as a supported KV-cache mode shortly after release.^[4] The Colossal-AI team released SwiftInfer, a TensorRT-LLM-based reimplementation of the StreamingLLM algorithm optimised for production multi-round dialogue, which reports an additional 46% inference-performance improvement over the vanilla StreamingLLM implementation on top of the original 22.2x speedup over the sliding-window-with-recomputation baseline.^[8]^[16] SwiftInfer is open-sourced at github.com/hpcaitech/SwiftInfer.^[8]

Adoption summary

Integration	Maintainer	Notes
`mit-han-lab/streaming-llm`	MIT HAN Lab	Reference research code, paper-faithful^[3]
`attention_sinks` package	Tom Aarsen (Hugging Face)	Drop-in replacement for `transformers` API, broad model coverage^[5]^[6]
`transformers.SinkCache`	Hugging Face	Upstream `Cache` abstraction, December 2023^[7]
TensorRT-LLM	NVIDIA	Supported KV-cache mode^[4]
SwiftInfer	HPC-AI / Colossal-AI	TensorRT-LLM port with 46% throughput improvement^[8]^[16]

Related work and follow-up research

Relationship to sliding window attention

StreamingLLM is best understood as a fix to plain window attention rather than a new architectural primitive. Mistral 7B uses sliding window attention (SWA) with a 4,096-token window and an 8,192-token context length, and its multi-layer stack lets information from tokens outside the window still propagate by aggregation across layers.^[11]^[17] However, SWA in production deployments still suffers from the attention sink failure mode once the stream length exceeds the cache, which is why open-source serving stacks combine SWA with a sink retention policy (often called "SWA + sinks") to reach truly streaming behaviour.^[11]^[12] The `attention_sinks` package explicitly supports this hybrid for Mistral by exposing both the window size and the sink size.^[6]

Heavy-Hitter Oracle (H2O)

The Heavy-Hitter Oracle (H2O) of Zhang et al. (NeurIPS 2023) is a contemporaneous KV-cache compression scheme that selects which entries to keep by tracking accumulated attention scores: tokens that have historically attracted significant attention from later positions are retained as "heavy hitters" alongside a window of recent tokens.^[12] H2O reports matching full-cache performance at 20% of the cache, with throughput gains of 3x to 29x.^[12] Empirically the heavy-hitter set often overlaps the StreamingLLM sink set in its first few entries, but the two methods make different bets: StreamingLLM uses a fixed positional rule (first S tokens + last W tokens), while H2O uses an attention-statistic rule that adapts per query.^[12] Subsequent inference systems combine them: a sink-preserving prefix plus an H2O-style selector for the window has been explored in multiple follow-ups.^[12]^[16]
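
The difference between the two retention rules can be sketched on a toy cache. This is a deliberately simplified illustration, not H2O's actual implementation: the function names are hypothetical, the scores are invented, and the selection is done once over a flat list rather than per head during decoding.

```python
def h2o_keep(accumulated_attention, num_heavy, num_recent):
    """Cache indices to keep: the recent window plus the top heavy hitters.

    accumulated_attention[i] is the total attention token i has received so far.
    """
    n = len(accumulated_attention)
    recent = set(range(max(0, n - num_recent), n))
    older = [i for i in range(n) if i not in recent]
    # Highest accumulated score first; ties broken by earlier position.
    older.sort(key=lambda i: (-accumulated_attention[i], i))
    return sorted(recent | set(older[:num_heavy]))

def streaming_keep(n, sink_size, window_size):
    """StreamingLLM's positional rule, for comparison."""
    sinks = set(range(min(sink_size, n)))
    recent = set(range(max(0, n - window_size), n))
    return sorted(sinks | recent)

scores = [9.1, 0.2, 0.1, 4.0, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3]  # hypothetical statistics
print(h2o_keep(scores, num_heavy=2, num_recent=3))
print(streaming_keep(len(scores), sink_size=2, window_size=3))
```

In this example both rules keep position 0, but the statistic-driven rule also keeps a mid-stream token that attracted attention, whereas the positional rule keeps the second token regardless of its score.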

Follow-up methods and infinite-context architectures

Several lines of work build on the attention sink observation:

Infini-Attention (Munkhdalai et al., Google, 2024) compresses arbitrarily long contexts into a bounded-size compressive memory while reading from a sliding window, achieving infinite-context inference under a fixed memory budget. Its sliding-window component faces the same eviction-of-sinks failure mode that motivated StreamingLLM and uses related stabilisation strategies.^[18]
Sink-aware pretraining. Several papers since 2024 explore variants of the dedicated learnable sink token, generalising the paper's pretraining experiment to bigger models and analysing its effect on attention head specialisation.^[9]^[14]
gpt-oss attention sinks. OpenAI's open-weight gpt-oss family released in 2025 implements an attention sink as a per-head learnable scalar added inside the softmax denominator, so that softmax(qk) becomes effectively softmax over [a_1, ..., a_T, sink]. This serves the same purpose as a StreamingLLM-style positional sink (a place for the head to discharge unused attention mass) but does not require any tokens to be kept in cache.^[9]^[10]^[19] The change is reported to make gpt-oss substantially more amenable to aggressive 4-bit weight quantisation.^[10]^[19]
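
The learnable-scalar idea in the last item can be sketched as follows. This is an illustration of the mechanism as described in the text, not code from gpt-oss; the function name, logits, and sink value are hypothetical.

```python
import math

def softmax_with_sink(logits, sink_logit):
    """Attention weights over real tokens when a learned scalar joins the denominator."""
    m = max(max(logits), sink_logit)           # subtract the max for stability
    exps = [math.exp(a - m) for a in logits]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]           # the sink's own share is discarded

# Hypothetical logits for one head that finds nothing relevant to attend to.
logits = [0.1, 0.0, -0.2, 0.3]
weights = softmax_with_sink(logits, sink_logit=4.0)
print("mass left on real tokens:", round(sum(weights), 3))

# With the sink logit at minus infinity the ordinary softmax is recovered.
plain = softmax_with_sink(logits, sink_logit=float("-inf"))
print("mass without a sink:", round(sum(plain), 3))
```

Since the sink has no key or value of its own, nothing needs to be held in the KV cache for it, which is the property the text contrasts with positional sinks.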

Connection to ViT "register tokens"

A parallel line of vision research arrived at a structurally similar fix. Darcet et al., in "Vision Transformers Need Registers" (2024), observed that supervised and self-supervised Vision Transformers develop high-norm outlier tokens in low-information background patches of an image, producing noisy attention maps; they showed that prepending a small number of empty learnable register tokens to the patch sequence absorbs the artefacts and yields cleaner feature maps and better dense prediction.^[21] The mechanism is essentially the same as StreamingLLM's pretraining sink-token variant: a designated set of input positions soaks up attention that the model has no genuine use for, freeing the content tokens from being co-opted into the same role.^[9]^[21] Later "training-free" follow-ups in 2025 generalised the technique to off-the-shelf vision models without retraining, by identifying the small set of "register neurons" that drive the artefacts.^[21] The convergence of evidence across NLP and vision strengthens the interpretation of attention sinks as a generic consequence of softmax-normalised attention rather than a quirk of any particular model family.^[9]^[14]^[21]

Where StreamingLLM does not help

The paper is explicit that StreamingLLM does not extend a model's effective context window.^[1]^[5] Tokens that fall outside the sink and window regions are simply gone; the model cannot retrieve information from them. As a consequence, the technique is unsuitable for tasks that require reasoning over the full history, such as long-document summarisation or needle-in-a-haystack retrieval over millions of tokens.^[5]^[13] Appendix C of the paper confirms that question-answering accuracy drops to 0% when the relevant context lies outside the cache.^[1] StreamingLLM is best understood as a stability mechanism for continuous decoding, not as a long-context comprehension method.

Limitations

The most important limitations follow directly from the design:

No new long-range capability. A Llama-2-7B model running under StreamingLLM can attend only to the sink and window tokens currently in its cache, which cannot exceed its 4,096-token pretraining window; older content is irretrievably evicted.^[1]^[5]
No fix for poor long-document reasoning. Tasks like book summarisation, multi-document question answering, or RAG over very long retrieved contexts derive no benefit, because the relevant information will typically be evicted.^[1]^[5]
Cache-internal positions can confuse downstream tooling. Because positions are computed inside the cache rather than from the absolute stream offset, any external code that depends on absolute positions (for example, a logging tool that tries to map cached KV entries back to source tokens) must be aware of the remapping. A known issue in Hugging Face Transformers documents how SinkCache interacts with post-RoPE key caching in some configurations.^[7]^[15]
Pretraining-variant cost. The cleanest version of the method, with a dedicated learnable sink token, requires pretraining from scratch with the modification; it cannot be applied retroactively to off-the-shelf weights, although the vanilla repurposing variant can.^[1]^[14]
Interaction with other KV compression. Combining StreamingLLM with quantisation, attention-based eviction (H2O), or paged caches requires care to ensure the sink entries are excluded from compression and eviction.^[7]^[12]^[15]

A user who needs the model to remember information from earlier in a long stream must combine StreamingLLM with an external memory mechanism (retrieval, summarisation buffers, or compressive memory schemes such as Infini-Attention).^[18]

Significance

StreamingLLM crystallised two ideas that have since become standard vocabulary in long-context inference. First, the observation that initial tokens function as positional attention sinks gave a clean explanation for a longstanding curiosity in transformer attention maps and predicted, accurately, that interfering with those positions would destabilise generation.^[1]^[14] Second, the fix demonstrated that a non-trivial fraction of "long-context" failures during streaming are caused by KV-cache bookkeeping rather than by the model's representational capacity, and that those failures can be repaired with a roughly five-line change to the cache policy.^[3]^[13]

The technique's practical reach is unusually broad for a single ICLR paper. It is implemented in the reference codebase, in a widely used third-party package, in the main Hugging Face Transformers library, in NVIDIA TensorRT-LLM, and in production runtimes like SwiftInfer, and it underlies the streaming behaviour of multiple chat assistants built on Llama-2, Mistral 7B, and similar models.^[3]^[5]^[6]^[7]^[4]^[8] The underlying observation has further generalised into architectural decisions in subsequent models: gpt-oss replaces positional sinks with learnable softmax-denominator scalars to obtain the same stabilisation property without the cache cost.^[9]^[10]^[19] Surveys of inference optimization now treat sink-aware KV caching as a default consideration when designing streaming deployments.^[11]^[12]

A second-order effect of the paper has been pedagogical. The attention sink story is one of the most cited examples of a deployment failure mode that was hidden in plain sight in published attention visualisations for years, and is sometimes used in interpretability courses to illustrate the value of looking at attention norms and not only attention probabilities.^[9]^[14] The MIT News writeup of the paper emphasised that the practical value is not a bigger model or a longer "true" context window but the ability to deploy existing weights persistently, in always-on assistants, code-generation copilots, and editing tools, without the slow drift into incoherence or out-of-memory crashes that plagued early multi-turn chat systems.^[20]

References

1. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, "Efficient Streaming Language Models with Attention Sinks", arXiv (v4), 2024-04-07. https://arxiv.org/abs/2309.17453. Accessed 2026-05-20.
2. Guangxuan Xiao et al., "Efficient Streaming Language Models with Attention Sinks (v1)", arXiv, 2023-09-29. https://arxiv.org/abs/2309.17453v1. Accessed 2026-05-20.
3. MIT HAN Lab, "streaming-llm: [ICLR 2024] Efficient Streaming Language Models with Attention Sinks", GitHub, 2024. https://github.com/mit-han-lab/streaming-llm. Accessed 2026-05-20.
4. Song Han Lab, "StreamingLLM project page", MIT HAN Lab, 2024. https://hanlab.mit.edu/projects/streamingllm. Accessed 2026-05-20.
5. Tom Aarsen, "Attention Sinks in LLMs for endless fluency", Hugging Face Blog, 2023-12. https://huggingface.co/blog/tomaarsen/attention-sinks. Accessed 2026-05-20.
6. Tom Aarsen, "attention_sinks: Extend existing LLMs to infinite-length inputs without sacrificing efficiency or performance", GitHub, 2023-2024. https://github.com/tomaarsen/attention_sinks. Accessed 2026-05-20.
7. Hugging Face community, "transformers-community/sink_cache", Hugging Face, 2024. https://huggingface.co/transformers-community/sink_cache. Accessed 2026-05-20.
8. HPC-AI Tech, "SwiftInfer: Efficient AI Inference & Serving", GitHub, 2024. https://github.com/hpcaitech/SwiftInfer. Accessed 2026-05-20.
9. MIT HAN Lab, "How Attention Sinks Keep Language Models Stable", HAN Lab blog, 2024. https://hanlab.mit.edu/blog/streamingllm. Accessed 2026-05-20.
10. Adam Zweiger, "Attention sinks in gpt-oss: for each attention head, a learned scalar in the softmax denominator", post on X, 2025-08. https://x.com/AdamZweiger/status/1952799642636148917. Accessed 2026-05-20.
11. Michael Brenndoerfer, "Mistral Architecture: Sliding Window Attention & Efficient LLM Design", mbrenndoerfer.com, 2024. https://mbrenndoerfer.com/writing/mistral-architecture-sliding-window-attention. Accessed 2026-05-20.
12. Zhenyu Zhang et al., "H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models", arXiv:2306.14048, NeurIPS 2023. https://arxiv.org/pdf/2306.14048. Accessed 2026-05-20.
13. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, "Efficient Streaming Language Models with Attention Sinks (HTML version)", arXiv, 2024. https://arxiv.org/html/2309.17453v3. Accessed 2026-05-20.
14. OpenReview, "Efficient Streaming Language Models with Attention Sinks", ICLR 2024 Conference Paper. https://openreview.net/forum?id=NG7sS51zVF. Accessed 2026-05-20.
15. Tom Aarsen et al., "Generate: New Cache abstraction and Attention Sinks support (PR #26681)", Hugging Face Transformers, GitHub, 2023-12. https://github.com/huggingface/transformers/pull/26681. Accessed 2026-05-20.
16. HPC-AI Tech, "Inference Performance Improved by 46%, Open Source Solution Breaks the Length Limit of LLM for Multi-Round Conversations", HPC-AI blog, 2024-01. https://hpc-ai.com/blog/Colossal-AI-SwiftInfer. Accessed 2026-05-20.
17. Albert Jiang et al., "Mistral 7B", arXiv:2310.06825, 2023-10. https://arxiv.org/abs/2310.06825. Accessed 2026-05-20.
18. Tsendsuren Munkhdalai, Manaal Faruqui, Siddharth Gopal, "Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention", arXiv:2404.07143, 2024-04. https://arxiv.org/abs/2404.07143. Accessed 2026-05-20.
19. Sebastian Raschka, "From GPT-2 to gpt-oss: Analyzing the Architectural Advances", Ahead of AI / Substack, 2025-08. https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the. Accessed 2026-05-20.
20. Adam Zewe, "A new way to let AI chatbots converse all day without crashing", MIT News, 2024-02-13. https://news.mit.edu/2024/new-way-let-ai-chatbots-converse-all-day-without-crashing-0213. Accessed 2026-05-20.
21. Timothee Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski, "Vision Transformers Need Registers", arXiv:2309.16588, ICLR 2024. https://arxiv.org/abs/2309.16588. Accessed 2026-05-20.
