StreamingLLM
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,307 words
StreamingLLM is an inference-time technique that allows pretrained transformer language models, originally trained with a finite attention window, to process input streams of effectively unlimited length while using a constant-size key-value cache. The method, introduced by Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis in the paper Efficient Streaming Language Models with Attention Sinks, rests on a single empirical observation: in autoregressive transformers, the first few tokens of a sequence accumulate disproportionately large attention scores regardless of their semantic content, behaving as "attention sinks" that absorb the otherwise unallocated softmax probability mass.[^1][^2] By retaining the key-value (KV) states of those initial sink tokens together with a sliding window of recent tokens, StreamingLLM enables stable decoding across millions of tokens without fine-tuning, modifying weights, or extending the model's positional encoding scheme.[^1][^3] The paper was first posted to arXiv on 29 September 2023 and accepted to the International Conference on Learning Representations (ICLR) 2024.[^1][^4]
The method has had wide downstream impact in both research and production inference systems. The reference implementation at mit-han-lab/streaming-llm was followed by a community drop-in package (attention_sinks by Tom Aarsen), a SinkCache abstraction merged into Hugging Face Transformers, and TensorRT-LLM and SwiftInfer ports that bring the method to production GPU servers.[^3][^5][^6][^7][^8] The attention sink concept itself has since been adopted, in modified form, by architectures such as OpenAI's gpt-oss models, which replace the role of the initial sink tokens with a per-head learnable scalar in the softmax denominator.[^9][^10] As of 2026, sink-based KV-cache strategies are treated as a standard primitive of long-context inference, alongside sliding window attention, PagedAttention, and KV-cache eviction policies such as H2O.[^11][^12]
A standard decoder-only transformer such as Llama 2, MPT, Falcon, or Pythia uses multi-head self-attention over all previously generated tokens at every decoding step.[^1] During autoregressive generation, the keys and values produced by each layer for each prior token are cached in the KV cache so they need not be recomputed.[^1][^13] Two costs grow with sequence length:
A second problem is length generalisation: most LLMs are pretrained on sequences of a fixed maximum length (4,096 tokens for Llama 2, 2,048 for many earlier models), and quality degrades or collapses entirely when the input exceeds that bound, even if the KV cache could in principle hold more tokens.[^1][^5] For streaming workloads, multi-turn chat, continuous voice transcription, or any always-on assistant, both problems compound: the model must run for hours or days without resetting state, and the input continuously grows.
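To make the memory side of the problem concrete, the following sketch estimates KV-cache size as a function of cached tokens. The shape used (32 layers, 32 key-value heads, head dimension 128, 16-bit values, batch size 1) is an assumption chosen to resemble Llama-2-7B and is not taken from this article; `kv_cache_bytes` is a hypothetical helper name.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Approximate bytes needed to cache keys and values for num_tokens tokens.

    Defaults are an assumed Llama-2-7B-like shape in fp16, batch size 1.
    """
    # Factor 2 accounts for storing both a key and a value per head per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


GIB = 1024 ** 3

per_token_bytes = kv_cache_bytes(1)
gib_at_4k = kv_cache_bytes(4096) / GIB
gib_at_1m = kv_cache_bytes(1_000_000) / GIB

print(f"per token: {per_token_bytes} bytes")
print(f"4,096 tokens: {gib_at_4k:.1f} GiB")
print(f"1,000,000 tokens: {gib_at_1m:.1f} GiB")
```

Under these assumptions the cache grows linearly with the number of tokens kept, which is why an unbounded stream cannot simply retain everything.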
The obvious fix is to evict old KV entries once the cache fills, retaining only the most recent L tokens. This is the window attention baseline. Empirically, however, simple window attention collapses catastrophically once eviction begins. In the StreamingLLM paper, Llama-2-13B with a 1,024-entry window applied to a 20,000-token stream reaches a perplexity of roughly 5,158, compared with about 5.4 for an oracle baseline that recomputes attention over the full sliding window at every step.[^1][^13] In practical chat demos, Llama-2-7B and Mistral-7B with naive window eviction begin emitting broken Unicode and incoherent text within about 1,000 tokens of the eviction threshold.[^5][^13]
The puzzle is that the model still has the same L most-recent tokens in cache; only the first few tokens have been discarded. Yet performance falls off a cliff. StreamingLLM's central contribution is to identify why this happens and to fix it with a near-trivial change to the cache policy.
The paper was led by Guangxuan Xiao, an EECS graduate student at MIT working in Song Han's group on efficient deployment of large language models, with co-authors Yuandong Tian and Mike Lewis at Meta AI, Beidi Chen at Carnegie Mellon University, and Song Han at MIT and NVIDIA.[^1][^20] Han also held an affiliation with the MIT-IBM Watson AI Lab at the time of publication, and Lewis was the senior author.[^20] The work fit into a broader research direction at MIT HAN Lab focused on making frontier LLMs cheaper to run, alongside related projects such as SmoothQuant (post-training quantisation) and AWQ (activation-aware weight quantisation).[^4][^20] Beidi Chen's group at CMU had concurrent interests in attention-sparsification techniques, and the H2O paper (NeurIPS 2023, discussed below), which appeared at around the same time, shares an author with StreamingLLM.[^12] An MIT News profile in February 2024 framed the work as a practical fix for chatbots that "crash" after extended interaction, quoting Xiao on its motivation for "persistently deployed" assistants.[^20]
Xiao and colleagues plot per-token, per-layer attention maps from Llama-2 and related models on long sequences and observe that, beyond the first two layers, an enormous share of attention probability is directed at the first few tokens of the sequence, irrespective of what those tokens encode semantically.[^1][^14] Substituting the natural beginning-of-sequence content with arbitrary placeholder text (newlines, punctuation) preserves the effect, confirming that the phenomenon is positional rather than content-driven.[^14] The authors call these initial positions attention sinks.
The mechanism the paper proposes follows from the softmax normalisation in attention.[^14] Standard scaled-dot-product attention writes
attention = softmax(Q K^T / sqrt(d_k))
so the attention scores for any query must sum to exactly 1. When a head has no semantically relevant token to attend to at a given step, it cannot simply output zero attention; the probability mass must land somewhere. Through training, the model learns that the safest place to dump unused attention is on tokens that are guaranteed to be present and reachable from every query position. The first tokens of the sequence satisfy both criteria: they appear in every training example and, because of the causal mask, they are visible to every subsequent position. Over training, the keys at those positions are sculpted so that they cleanly absorb the unused probability mass without contaminating the value aggregation.[^1][^14]
This explanation predicts that attention sinks should appear wherever a softmax is used over a fixed-position prefix. The paper verifies the prediction in two encoder settings, BERT and the Vision Transformer (ViT), and reports similar concentration of attention at the [CLS] or first-patch token.[^1][^14] Subsequent interpretability work has linked the phenomenon to "no-op" attention heads and to a particular form of residual-stream regulation.[^9][^14]
Once the KV entries for the sink tokens are evicted from the cache by simple window attention, the model still emits a softmax that sums to 1, but it now has nowhere semantically harmless to send the previously unused mass. That mass instead flows onto the surviving recent tokens, distorting their effective contribution to the value aggregation and propagating bad activations through every subsequent layer. The collapse in perplexity is the macroscopic consequence of this distortion. The fix, conceptually, is to ensure the sink positions remain in cache at all times so that the attention mechanism has a legitimate place to discharge unused probability mass.[^1][^13]
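The redistribution argument can be illustrated with a toy softmax. The logits below are invented for illustration (index 0 plays the role of a sink with a large score); they are not measurements from any model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical attention logits for one query:
# index 0 is the sink position, indices 1..4 are recent tokens.
logits = [6.0, 1.0, 0.5, 0.8, 1.2]

with_sink = softmax(logits)          # sink still in the cache
without_sink = softmax(logits[1:])   # sink evicted by plain window attention

sink_share = with_sink[0]
recent_share_before = sum(with_sink[1:])
recent_share_after = sum(without_sink)

print(f"mass on sink: {sink_share:.3f}")
print(f"mass on recent tokens, sink kept: {recent_share_before:.3f}")
print(f"mass on recent tokens, sink evicted: {recent_share_after:.3f}")
```

With the sink present, the recent tokens share only a small fraction of the probability mass; once it is evicted, the same tokens are forced to absorb all of it, which is the distortion the paragraph above describes.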
StreamingLLM stores two contiguous KV regions per layer:
| Region | Contents | Size |
|---|---|---|
| Attention sink | KV states for the first S tokens of the stream | typically S = 4 |
| Rolling window | KV states for the most recent W tokens | typically W = 1,020 |
For an incoming token at position t > S + W, the cache continues to hold the first S tokens unchanged while the window slides by one position, evicting the oldest non-sink entry. Notationally the paper uses S + W to describe the configuration; the default in published code is 4 + 1020 = 1024 total cached tokens.[^1][^3][^6] Ablations in the paper find that S = 1 or S = 2 is sometimes insufficient to fully stabilise perplexity, while S = 4 is sufficient for every model tested.[^1]
A subtle but crucial detail concerns positional encoding. StreamingLLM computes positions inside the cache, not in the original token stream.[^1][^13] If the cache always holds 1,024 tokens, then those tokens occupy positional indices 0 to 1,023 from the model's perspective, regardless of where they originated in the absolute stream. This keeps every query's distance to the attention sinks small, which matters for relative-position schemes that decay with distance.
Concrete implications by encoding type:
Without this trick, RoPE would rotate the cached sink keys to a position thousands of steps away from the current query, undoing the sink's effect; the model would still nominally have the sink tokens in cache, but its positional machinery would render them unreachable.[^13]
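A small sketch of the cache-relative indexing described above, assuming 0-indexed stream positions and a cache that already includes the current token. The helper name `position_ids` is hypothetical and does not come from any of the implementations discussed here.

```python
def position_ids(t, num_sinks=4, window=1020):
    """Return (absolute, in_cache) positions after token t (0-indexed) is cached.

    absolute: where each cached token sat in the original stream.
    in_cache: the index the model is shown, assigned by slot in the cache.
    """
    n = t + 1  # tokens seen so far
    if n <= num_sinks + window:
        absolute = list(range(n))
    else:
        absolute = list(range(num_sinks)) + list(range(n - window, n))
    in_cache = list(range(len(absolute)))
    return absolute, in_cache


absolute, in_cache = position_ids(t=9, num_sinks=4, window=4)
print(absolute)  # stream positions held in the cache
print(in_cache)  # positions used for the positional encoding

# Distance from the newest token to the first sink, far into a long stream.
abs_long, cache_long = position_ids(t=1_000_000)
distance_absolute = abs_long[-1] - abs_long[0]
distance_in_cache = cache_long[-1] - cache_long[0]
print(distance_absolute, distance_in_cache)
```

The distance seen by the positional encoding stays bounded by the cache size, however far the stream has advanced.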
A simplified view of the decode-time policy looks like the following:

    for t in stream:
        k_t, v_t = compute_kv(token_t)
        if len(cache) < S + W:
            cache.append((k_t, v_t))
        else:
            # keep first S entries (sinks) unchanged; evict oldest non-sink
            cache = cache[:S] + cache[S+1:] + [(k_t, v_t)]
        # recompute relative positions for RoPE/ALiBi based on cache index
        output_t = attention(query_t, cache_keys, cache_values)

The total cache footprint is therefore constant in t, and attention at each decoding step touches only the S + W cached entries, costing O(S + W) rather than O(t), so the per-token cost is constant.[^1]
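The eviction policy can also be written as a small runnable class. This is a minimal sketch, not the reference implementation: `SinkKVCache` is a hypothetical name, the entries are opaque placeholders standing in for per-layer (key, value) tensors, and positional re-indexing is left out.

```python
from collections import deque


class SinkKVCache:
    """Keeps the first num_sinks entries forever plus a rolling window."""

    def __init__(self, num_sinks=4, window=1020):
        self.num_sinks = num_sinks
        self.sinks = []                     # never evicted
        self.recent = deque(maxlen=window)  # deque drops its oldest item when full

    def append(self, kv):
        # The first num_sinks entries become sinks; everything else rolls.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        """Cached entries in order: sinks first, then the recent window."""
        return self.sinks + list(self.recent)

    def __len__(self):
        return len(self.sinks) + len(self.recent)


cache = SinkKVCache(num_sinks=2, window=3)
for token_id in range(10):
    cache.append(token_id)  # token_id stands in for (k_t, v_t)

final_entries = cache.entries()
print(final_entries, len(cache))
```

Using a bounded `deque` for the window avoids rebuilding a list on every step; a tensor-based implementation would more likely slice or roll a preallocated buffer.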
The vanilla StreamingLLM recipe repurposes whatever tokens happen to sit at the start of the stream as the sinks; in a chat application this is typically a system prompt or beginning-of-sequence token. The paper additionally explores a small pretraining change: prepend a single dedicated, learnable sink token to every training example.[^1][^14] A 160M-parameter model pretrained with this modification:
- is stable with S = 1 (only the dedicated sink), where the vanilla model needs S = 4 repurposed content tokens to be stable;[^1][^14]
- reaches a lower perplexity in the 1 + 1023 configuration, compared with 18.49 for the vanilla baseline.[^1]

The paper also evaluates a "Zero Sink" variant inspired by the "SoftMax-Off-by-One" proposal of Evan Miller, which adds 1 to the softmax denominator. This partially mitigates the sink phenomenon but does not eliminate the need for multiple repurposed sink tokens.[^1][^14]
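As an illustration of the "add 1 to the denominator" idea, the sketch below implements a softmax whose weights may sum to less than 1. The function name and the example logits are assumptions made for this illustration, not code from the paper.

```python
import math


def softmax_off_by_one(logits):
    """exp(x_i) / (1 + sum_j exp(x_j)), computed in a numerically stable way."""
    # Shift by m >= 0 so the implicit "+1" term becomes exp(-m) and nothing overflows.
    m = max(max(logits), 0.0)
    exps = [math.exp(x - m) for x in logits]
    denom = math.exp(-m) + sum(exps)
    return [e / denom for e in exps]


# All logits strongly negative: the head can attend to almost nothing.
idle_total = sum(softmax_off_by_one([-4.0, -5.0, -3.5]))

# Large logits: behaviour approaches the ordinary softmax.
busy_total = sum(softmax_off_by_one([10.0, 9.0]))

print(f"total weight, idle head: {idle_total:.4f}")
print(f"total weight, busy head: {busy_total:.5f}")
```

The extra 1 in the denominator behaves like an always-present key with logit 0 and a zero value vector, which is why the variant is described as a "Zero Sink".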
The headline experiment runs each candidate model on streams of up to 4 million tokens drawn from PG19 and other long-document corpora, measuring sliding-window perplexity. With the StreamingLLM cache layout, Llama-2-7B/13B/70B, MPT-7B/30B, Falcon-7B/40B, and Pythia-2.8B/6.9B/12B all exhibit perplexity curves that remain flat and close to the recomputation oracle across the entire stream.[^1][^3] Pure window attention without sinks diverges within a few thousand tokens of the eviction threshold.[^1] Dense attention either runs out of memory or fails for inputs longer than the training context.[^1]
On the ARC-Challenge and ARC-Easy benchmarks reformulated as a continuous stream, StreamingLLM matches the one-shot baseline at roughly 71-91% accuracy, where dense attention is out of memory and window attention scores near zero (1-3%).[^1] These numbers do not represent improved reasoning over a one-shot baseline; they show that the method preserves the model's underlying capability when run as a stream, which is what dense attention and window attention each fail to do for different reasons.[^1][^13]
StreamingLLM is compared with a "sliding window with recomputation" baseline, which is the standard oracle: at every decode step, recompute attention from scratch over the last W tokens. That oracle has O(T · W^2) time complexity for a stream of length T, dominated by the recomputed prefill. StreamingLLM, by reusing cached KVs and only computing one new KV per step, runs in O(T · W) and delivers up to a 22.2x end-to-end speedup at long streams while matching perplexity.[^1][^3][^13] A concrete data point from MIT's coverage: at a 4,096-token cache size, the sliding-window-with-recomputation baseline requires roughly 1,411 milliseconds per generated token, while StreamingLLM produces a token in about 65 milliseconds, a 21.7x ratio that closely matches the 22.2x figure quoted in the paper.[^20]
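The two complexity classes and the quoted latency ratio can be checked with a few lines of arithmetic. The operation counts are a deliberately crude model (one unit per query-key pair) introduced here as an assumption; only the 1,411 ms and 65 ms figures come from the text above.

```python
def ops_recompute(stream_len, window):
    """Sliding window with recomputation: W x W pair scores at every step."""
    return stream_len * window ** 2


def ops_streaming(stream_len, cache_size):
    """Cached decoding: one new query against cache_size cached keys per step."""
    return stream_len * cache_size


T, W = 1_000_000, 4096
count_ratio = ops_recompute(T, W) / ops_streaming(T, W)

# Per-token latencies quoted above for a 4,096-token cache.
latency_ratio = 1411 / 65

print(f"operation-count ratio: {count_ratio:.0f}")
print(f"measured latency ratio: {latency_ratio:.1f}x")
```

The operation-count ratio equals W, far larger than the measured 21.7x; counts of this kind ignore parallelism and every non-attention cost, so they indicate scaling behaviour rather than wall-clock speedup.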
| Family | Sizes evaluated | Position encoding |
|---|---|---|
| Llama-2 | 7B, 13B, 70B | RoPE |
| MPT | 7B, 30B | ALiBi |
| Falcon | 7B, 40B | RoPE |
| Pythia | 2.8B, 6.9B, 12B | RoPE |
Source: original paper, Tables 1-3.[^1] All families showed the attention sink phenomenon, and all worked with the same 4 + 1020 cache configuration without retraining.[^1][^3]
The reference code is released by the MIT HAN Lab at github.com/mit-han-lab/streaming-llm under an MIT license, built on top of transformers==4.33.0.[^3] It provides:
- … (streaming_chat.py);[^3]

The repository links the paper to its ICLR 2024 publication and provides the canonical BibTeX entry for citation.[^3] As of 2026 it remains the reference codebase for the technique, with hundreds of forks and external integrations.
attention_sinks (Tom Aarsen)

Shortly after the paper's release, Tom Aarsen of Hugging Face published the attention_sinks Python package as a drop-in replacement for the transformers API.[^5][^6] Switching `from transformers import AutoModel` to `from attention_sinks import AutoModel` is enough to wrap any supported model with the StreamingLLM cache policy:

    from attention_sinks import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-chat-hf",
        device_map="auto",
        attention_sink_size=4,            # initial sink tokens kept in the cache
        attention_sink_window_size=4092,  # rolling window of recent tokens
    )

The package defaults to attention_sink_size=4 and attention_sink_window_size=1020, matching the paper. Community contributions extended support beyond the originally evaluated four families to Llama, Mistral, Falcon, MPT, GPT-NeoX (Pythia), GPT-J, Qwen, StableLM, BTLM, and Yi.[^6] Endless-generation tests on Llama-2-7B show stable fluency past 10,000 tokens, where stock transformers runs out of VRAM or emits broken Unicode and pure window attention corrupts output after about 1,000 tokens.[^5]
SinkCache

The pattern was subsequently upstreamed into the main transformers library through PR #26681 (December 2023), which introduced a generic Cache abstraction and a concrete SinkCache implementation.[^7][^15] SinkCache retains a configurable number of initial sinks and a rolling window of recent KV entries, exposing the StreamingLLM behaviour as a standard library feature available without third-party packages.[^7] A community Hugging Face Space (transformers-community/sink_cache) demonstrates the integration.[^7]
NVIDIA's TensorRT-LLM compiler stack integrated StreamingLLM as a supported KV-cache mode shortly after release.[^4] The Colossal-AI team released SwiftInfer, a TensorRT-LLM-based reimplementation of the StreamingLLM algorithm optimised for production multi-round dialogue, which reports an additional 46% inference-performance improvement over the vanilla StreamingLLM implementation on top of the original 22.2x speedup over the sliding-window-with-recomputation baseline.[^8][^16] SwiftInfer is open-sourced at github.com/hpcaitech/SwiftInfer.[^8]
| Integration | Maintainer | Notes |
|---|---|---|
| mit-han-lab/streaming-llm | MIT HAN Lab | Reference research code, paper-faithful[^3] |
| attention_sinks package | Tom Aarsen (Hugging Face) | Drop-in replacement for transformers API, broad model coverage[^5][^6] |
| transformers.SinkCache | Hugging Face | Upstream Cache abstraction, December 2023[^7] |
| TensorRT-LLM | NVIDIA | Supported KV-cache mode[^4] |
| SwiftInfer | HPC-AI / Colossal-AI | TensorRT-LLM port with 46% throughput improvement[^8][^16] |
StreamingLLM is best understood as a fix to plain window attention rather than a new architectural primitive. Mistral 7B famously uses sliding window attention (SWA) with a 4,096-token window over an 8,192-token context, and its multi-layer stack lets information from earlier tokens still propagate by aggregation across layers.[^11][^17] However, SWA in production deployments still suffers from the attention sink failure mode once the stream length exceeds the cache, which is why open-source serving stacks combine SWA with a sink retention policy (often called "SWA + sinks") to reach truly streaming behaviour.[^11][^12] The attention_sinks package explicitly supports this hybrid for Mistral by exposing both the window size and the sink size.[^6]
The Heavy-Hitter Oracle (H2O) of Zhang et al. (NeurIPS 2023) is a contemporaneous KV-cache compression scheme that selects which entries to keep by tracking accumulated attention scores: tokens that have historically attracted significant attention from later positions are retained as "heavy hitters" alongside a recent-window of tokens.[^12] H2O reports matching full-cache performance at 20% of the cache and 3-29x throughput gains.[^12] Empirically the heavy-hitter set often overlaps the StreamingLLM sink set in its first few entries, but the two methods make different bets: StreamingLLM uses a fixed positional rule (first S tokens + last W tokens), while H2O uses an attention-statistic rule that adapts per query.[^12] Subsequent inference systems combine them: a sink-preserving prefix plus an H2O-style selector for the window has been explored in multiple follow-ups.[^12][^16]
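The difference between the two retention rules can be sketched as two index-selection functions. Both are simplified illustrations with hypothetical names: the heavy-hitter version selects once from a given score list, whereas a faithful H2O implementation would maintain its statistics incrementally during decoding. The score values are invented.

```python
def keep_positional(num_tokens, num_sinks, window):
    """StreamingLLM-style rule: first num_sinks tokens plus the last window tokens."""
    kept = set(range(min(num_sinks, num_tokens)))
    kept |= set(range(max(0, num_tokens - window), num_tokens))
    return sorted(kept)


def keep_heavy_hitters(accumulated_attention, num_heavy, window):
    """Simplified heavy-hitter rule: recent window plus the top-scoring older tokens."""
    n = len(accumulated_attention)
    recent = set(range(max(0, n - window), n))
    older = [i for i in range(n) if i not in recent]
    # Rank older tokens by the attention they have accumulated so far.
    heavy = sorted(older, key=lambda i: accumulated_attention[i], reverse=True)
    return sorted(recent | set(heavy[:num_heavy]))


# Hypothetical accumulated attention per token; token 0 behaves like a sink.
scores = [9.0, 0.2, 0.1, 3.5, 0.3, 0.2, 0.4, 0.1]

positional = keep_positional(len(scores), num_sinks=2, window=3)
heavy = keep_heavy_hitters(scores, num_heavy=2, window=3)
print(positional)
print(heavy)
```

In this toy case both rules retain token 0, but the score-based rule spends its second slot on token 3 instead of token 1, showing how the positional rule is fixed while the statistic-based rule follows the data.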
Several lines of work build on the attention sink observation:
- OpenAI's gpt-oss models add a per-head learnable scalar to the softmax denominator, so that softmax(qk) becomes effectively softmax over [a_1, ..., a_T, sink]. This serves the same purpose as a StreamingLLM-style positional sink (a place for the head to discharge unused attention mass) but does not require any tokens to be kept in cache.[^9][^10][^19] The change is reported to make gpt-oss substantially more amenable to aggressive 4-bit weight quantisation.[^10][^19]

A parallel line of vision research arrived at a structurally similar fix. Darcet et al., in "Vision Transformers Need Registers" (2024), observed that supervised and self-supervised Vision Transformers develop high-norm outlier tokens in low-information background patches of an image, producing noisy attention maps; they showed that prepending a small number of empty learnable register tokens to the patch sequence absorbs the artefacts and yields cleaner feature maps and dense prediction.[^21] The mechanism is essentially the same as StreamingLLM's pretraining sink-token variant: a designated set of input positions soaks up attention that the model has no genuine use for, freeing the content tokens from being co-opted into the same role.[^9][^21] Later "training-free" follow-ups in 2025 generalised the technique to off-the-shelf vision models without retraining, by identifying the small set of "register neurons" that drive the artefacts.[^21] The convergence of evidence across NLP and vision strengthens the interpretation of attention sinks as a generic consequence of softmax-normalised attention rather than a quirk of any particular model family.[^9][^14][^21]
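The learnable-scalar formulation mentioned in this section, softmax over [a_1, ..., a_T, sink], can be sketched as follows. This is an illustration of the formula only, not the gpt-oss implementation; the function name, the per-head sink values and the logits are all invented for the example.

```python
import math


def softmax_with_sink(logits, sink_logit):
    """Softmax over logits plus one extra sink logit; the sink's share is discarded."""
    m = max(max(logits), sink_logit)
    exps = [math.exp(x - m) for x in logits]
    # The sink contributes to the denominator but has no value vector to read.
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


# Two hypothetical heads attending over the same three tokens.
head_logits = [
    [0.1, -0.2, 0.0],  # head 0: nothing stands out
    [4.0, 1.0, 0.5],   # head 1: one clearly relevant token
]
head_sinks = [3.0, -2.0]  # assumed per-head scalars (learned in a real model)

weights = [softmax_with_sink(l, s) for l, s in zip(head_logits, head_sinks)]
token_mass = [sum(w) for w in weights]

for h, mass in enumerate(token_mass):
    print(f"head {h}: mass on real tokens = {mass:.3f}")
```

A head with a large sink value can leave most of its probability mass unassigned, while a head with a small one behaves almost like ordinary softmax attention, and no cached token has to be reserved for the purpose.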
The paper is explicit that StreamingLLM does not extend a model's effective context window.[^1][^5] Tokens that fall outside the sink and window regions are simply gone; the model cannot retrieve information from them. As a consequence, the technique is unsuitable for tasks that require reasoning over the full history, such as long-document summarisation or needle-in-a-haystack retrieval over millions of tokens.[^5][^13] Appendix C of the paper confirms that question-answering accuracy drops to 0% when the relevant context lies outside the cache.[^1] StreamingLLM is best understood as a stability mechanism for continuous decoding, not as a long-context comprehension method.
The most important limitations follow directly from the design:
- SinkCache interacts with post-RoPE key caching in some configurations.[^7][^15]

A user who needs the model to remember information from earlier in a long stream must combine StreamingLLM with an external memory mechanism (retrieval, summarisation buffers, or compressive memory schemes such as Infini-Attention).[^18]
StreamingLLM crystallised two ideas that have since become standard vocabulary in long-context inference. First, the observation that initial tokens function as positional attention sinks gave a clean explanation for a longstanding curiosity in transformer attention maps and predicted, accurately, that interfering with those positions would destabilise generation.[^1][^14] Second, the fix demonstrated that a non-trivial fraction of "long-context" failures during streaming are caused by KV-cache bookkeeping rather than by the model's representational capacity, and that those failures can be repaired with a roughly five-line change to the cache policy.[^3][^13]
The technique's practical reach is unusually broad for a single ICLR paper. It is implemented in the reference codebase, in a widely used third-party package, in the main Hugging Face Transformers library, in NVIDIA TensorRT-LLM, and in production runtimes like SwiftInfer, and it underlies the streaming behaviour of multiple chat assistants built on Llama-2, Mistral 7B, and similar models.[^3][^5][^6][^7][^4][^8] The underlying observation has further generalised into architectural decisions in subsequent models: gpt-oss replaces positional sinks with learnable softmax-denominator scalars to obtain the same stabilisation property without the cache cost.[^9][^10][^19] Surveys of inference optimization now treat sink-aware KV caching as a default consideration when designing streaming deployments.[^11][^12]
A second-order effect of the paper has been pedagogical. The attention sink story is one of the most cited examples of a deployment failure mode that was hidden in plain sight in published attention visualisations for years, and is sometimes used in interpretability courses to illustrate the value of looking at attention norms and not only attention probabilities.[^9][^14] The MIT News writeup of the paper emphasised that the practical value is not a bigger model or a longer "true" context window but the ability to deploy existing weights persistently, in always-on assistants, code-generation copilots, and editing tools, without the slow drift into incoherence or out-of-memory crashes that plagued early multi-turn chat systems.[^20]