Attention sink

Updated Jun 28, 2026

Overview

An attention sink is an empirical phenomenon in Transformer language models in which a large fraction of the attention weight in many heads concentrates on a few tokens at the very start of the sequence, most often the first token, even when those tokens carry no information relevant to the current prediction. The term was introduced in 2023 by Guangxuan Xiao and colleagues in the paper that proposed StreamingLLM, where they observed that discarding the key and value states of these initial tokens sharply degrades a model, while keeping them restores its performance ^[1]. The behavior is now generally understood as a structural artifact of the softmax normalization inside attention rather than a signal of semantic importance, and it has direct consequences for KV cache eviction, streaming inference, and quantization.

Attention sinks matter because they break an intuitive assumption behind many efficiency methods, namely that low-information tokens can be dropped or coarsely approximated. The sink tokens look unimportant by content yet are load-bearing for the computation, so naive sliding-window caching, cache eviction, or aggressive quantization of those positions can cause large quality losses. Conversely, once the mechanism is understood it can be exploited: StreamingLLM uses a handful of preserved sink tokens to let a fixed-size cache process effectively unbounded input.

What does an attention sink look like?

In a decoder-only large language model with causal masking, every position attends only to itself and to earlier positions. When the per-head attention maps are visualized, deeper layers show a striking pattern: across nearly all query positions a heavy band of attention weight lands on the first few token positions, independent of what those tokens actually are ^[1]. Substituting the first token with an arbitrary or meaningless token leaves the band intact, which is why the positions, not their content, are called sinks. In many models the role is filled by the beginning-of-sequence (BOS) token, but it does not have to be.
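One way to put a number on the band is to average the weight that each query places on the first few key positions. The following is a minimal sketch under stated assumptions: the function name `sink_mass`, its parameters, and the toy matrix are illustrative and do not come from any of the cited papers.

```python
def sink_mass(attn, k=1, skip_queries=None):
    """Average attention weight that queries place on the first k key positions.

    attn is a list of rows; attn[q][j] is the weight query q puts on key j,
    with attn[q][j] == 0 for j > q (causal mask) and each row summing to 1.
    Queries at positions below skip_queries are ignored, because a query that
    can only see the first k tokens puts all of its weight there trivially.
    """
    if skip_queries is None:
        skip_queries = k
    rows = attn[skip_queries:]
    if not rows:
        raise ValueError("need at least one query beyond the first k positions")
    return sum(sum(row[:k]) for row in rows) / len(rows)


# Toy 4x4 causal attention map with a heavy first column (hypothetical numbers).
toy_attn = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.80, 0.05, 0.15, 0.00],
    [0.70, 0.05, 0.05, 0.20],
]
first_token_mass = sink_mass(toy_attn, k=1)
```

In this toy map the three queries that can see more than one token place an average of about 0.8 of their weight on position zero. Applying the same measurement per layer and per head to a real model's attention maps is one way to see where the sink is strongest.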

The effect is not tied to one architecture. Xiao et al. documented it across the Llama-2, MPT, Falcon, and Pythia model families, and later studies confirmed it in most pretrained decoder-only LLMs ^[1]^[3]. Two refinements followed. First, sinks are not always confined to position zero: under a prefix language-modeling objective they can spread across the prefix, and in some models an early period or newline token also acts as a secondary sink ^[3]^[4]. Second, the sink is strongest in the middle and upper layers and is typically weak or absent in the first layer or two, which is consistent with it being something the network constructs rather than a property of the input ^[3]^[4].

Why do attention sinks form?

The standard explanation centers on the normalization built into attention. A softmax converts the query-key scores at a position into a probability distribution that must sum to one over all visible tokens. When a head has no strongly relevant token to read from, it still cannot emit an empty distribution; the weight has to go somewhere. The most convenient destination is a position visible from every query, and in a causal model the earliest tokens are visible to all later ones, so the model learns to route unused attention mass there. The sink therefore behaves as a learned no-op or bias: a near-constant value vector that a head can mix in with little effect on its output while still satisfying the sum-to-one constraint ^[1]^[3].
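The sum-to-one constraint can be seen in a few lines. This is a toy illustration with made-up scores, not a measurement from any model: a head whose scores are all equally low still has to hand out a full unit of weight, and a single high-scoring key at position zero absorbs almost all of it.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# A head that finds no relevant token: all content scores are equally low.
content_scores = [-4.0, -4.0, -4.0, -4.0]
no_sink = softmax(content_scores)   # forced to spread weight: 0.25 each

# Same head, but the key at position 0 has learned a large score (a sink).
with_sink = softmax([6.0] + content_scores[1:])
```

With the sink in place, position zero takes more than 99.9 percent of the weight and the remaining tokens receive almost nothing, which is the "stay quiet" behavior the head could not express otherwise.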

This view was sharpened by an independent observation from Evan Miller, whose July 2023 note "Attention Is Off By One" argued that the softmax denominator is the culprit ^[2]. Miller proposed adding one to the denominator, written softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), which is equivalent to inserting a synthetic null position whose score is fixed at zero and whose value vector is zero. With this extra term a head can drive all of its real attention weights toward zero and stay quiet, so it no longer needs a real token to serve as a sink. Miller called the variant "quiet attention" and noted that his motivation was to remove the activation outliers that make models hard to quantize ^[2]. Xiao et al. tested essentially this idea under the name "Zero Sink" and found that it helps but does not fully remove the model's reliance on initial tokens ^[1].
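The off-by-one formula and its "extra null position" reading can be checked numerically. The sketch below implements softmax1 as written above and compares it with an ordinary softmax over the same scores plus one appended zero logit whose weight is then discarded. The function names and the example scores are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax1(scores):
    """exp(x_i) / (1 + sum_j exp(x_j)), computed stably."""
    m = max(max(scores), 0.0)             # include the implicit zero logit
    exps = [math.exp(s - m) for s in scores]
    denom = math.exp(-m) + sum(exps)      # exp(0 - m) is the "+1" term
    return [e / denom for e in exps]


scores = [-5.0, -6.0, -5.5]
quiet = softmax1(scores)                  # weights sum to well under 1
padded = softmax(scores + [0.0])[:-1]     # append a zero logit, then drop it
```

Here `quiet` and `padded` agree to floating-point precision, and the real weights sum to roughly 0.013 rather than 1, so the head contributes almost nothing to the output.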

A 2024 empirical study by Xiangming Gu and colleagues at Sea AI Lab and the National University of Singapore traced when and why the sink forms ^[3]. They found that it is not present at initialization but emerges after a model has been optimized on enough data, typically appearing within the first one thousand to two thousand training steps and strengthening thereafter; it is encouraged by weight decay and a larger learning rate and is essentially unaffected by batch size. Crucially, they characterized the sink as acting like a key bias that stores excess attention probability while contributing little to the value-weighted output. When they replaced softmax with normalization-free alternatives such as sigmoid attention, models up to one billion parameters trained without developing attention sinks at all, supporting the claim that the normalization, not the data, is the root cause ^[3].
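To see why removing the normalization removes the pressure to build a sink, consider a minimal sketch of element-wise sigmoid weights. It is a simplification for illustration: published sigmoid-attention variants differ in scaling and bias terms, and the function name and scores here are assumptions.

```python
import math


def sigmoid_weights(scores):
    """Element-wise sigmoid: each weight depends only on its own score."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


idle_head = sigmoid_weights([-8.0, -8.0, -8.0])   # every weight near zero
busy_head = sigmoid_weights([4.0, -8.0, 3.0])     # two tokens selected at once
```

Because no row has to sum to one, the idle head's weights total about 0.001 and the busy head's total nearly 2. A head with nothing to read can simply output near-zero weights, with no need for a token to absorb the remainder.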

How does StreamingLLM use attention sinks?

The phenomenon was discovered while solving a practical problem: running an LLM over a stream of text far longer than its training context without letting the cache grow without bound. The obvious fix, sliding window attention, keeps only the keys and values of the most recent tokens. Xiao et al. showed that this collapses as soon as the text outgrows the window and the first tokens are evicted: removing them removes the sinks the model depends on, and perplexity spikes ^[1].

Their method, StreamingLLM, fixes this by permanently keeping the KV entries of a small number of initial tokens as dedicated attention sinks, alongside a rolling window of the most recent tokens. They found that retaining four initial tokens is enough to recover full performance whereas one or two are not, and that this combined cache of sink tokens plus recent tokens lets a fixed-size cache extrapolate to very long inputs. With this scheme, Llama-2, MPT, Falcon, and Pythia models processed up to four million tokens with stable perplexity, and in the streaming setting StreamingLLM ran up to 22.2 times faster than a sliding-window baseline that recomputes attention from scratch at each step ^[1]. The method requires no fine-tuning; it changes only which KV entries are kept and how positions are encoded within the cache. The work was carried out by researchers at MIT, Meta AI, Carnegie Mellon University, and NVIDIA, was submitted on September 29, 2023, and appeared at ICLR 2024.
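The cache policy can be sketched as an index-selection rule. This is not the authors' code: the function names are hypothetical, the window of eight tokens is a toy value, and only the choice of four sink tokens comes from the text above.

```python
def streaming_keep_indices(seq_len, n_sink=4, window=8):
    """Original token indices retained in a sink-plus-window KV cache."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))           # nothing to evict yet
    sinks = list(range(n_sink))               # pinned initial tokens
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


def cache_positions(kept):
    """Positions assigned by slot in the cache, not by original index."""
    return list(range(len(kept)))


kept = streaming_keep_indices(seq_len=20, n_sink=4, window=8)
positions = cache_positions(kept)
```

After 20 tokens the cache holds original indices 0 to 3 and 12 to 19, and the positional encoding sees them as contiguous positions 0 to 11. The cache size stays at `n_sink + window` entries however long the stream runs.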

StreamingLLM also showed that the sink can be designed in rather than discovered. Pre-training with a single dedicated learnable placeholder token at the start of every sequence, which the paper calls a "Learnable Sink," let models stream stably using just that one token instead of needing four arbitrary ones ^[1]. This idea has since reached production architectures. OpenAI's gpt-oss models, the open-weight gpt-oss-120b and gpt-oss-20b released on August 5, 2025, give each attention head a learned scalar sink logit that is appended to the attention scores before the softmax. It is not a real token but an extra entry in the denominator, so a head can send probability mass to the sink and effectively skip attending to any real token. This is a learned, per-head version of the off-by-one relaxation Miller described, in which the extra logit is fixed at zero ^[5].
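A minimal sketch of the per-head sink logit follows. It illustrates the idea as described here and is not the gpt-oss reference implementation; the function name, the return shape, and the example values are assumptions.

```python
import math


def softmax_with_sink(scores, sink_logit):
    """Softmax over the scores plus one extra logit.

    Returns the weights on the real tokens and, separately, the share of
    probability absorbed by the sink. The sink has no value vector, so its
    share is simply dropped from the head's output.
    """
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    sink_exp = math.exp(sink_logit - m)
    denom = sink_exp + sum(exps)
    return [e / denom for e in exps], sink_exp / denom


weights, sink_share = softmax_with_sink([-2.0, -3.0, -2.5], sink_logit=1.0)
```

With these numbers the sink absorbs about 91 percent of the probability mass. Setting `sink_logit=0.0` recovers the off-by-one softmax, and in a trained model the logit would be a learned parameter per head.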

The table below summarizes the main responses to the phenomenon, which divide into methods that keep and use sinks and methods that prevent them from forming.

Approach	Year	What it does	Relationship to sinks
StreamingLLM (keep sink tokens)	2023	Pin the first few KV entries plus a rolling window	Exploits sinks to enable constant-memory streaming ^[1]
Learnable sink token	2023	Pre-train with one dedicated placeholder token	Replaces four ad-hoc sinks with one designed slot ^[1]
SoftMax1 / Zero Sink (off-by-one)	2023	Add a +1 term to the softmax denominator	Lets a head stay quiet, reducing the need for a sink ^[1]^[2]
Per-head sink logit (gpt-oss)	2025	Learned scalar appended to each head's scores	Built-in sink that uses no real token ^[5]
Sigmoid / normalization-free attention	2024	Drops the sum-to-one constraint	Sinks do not emerge ^[3]
Preserve-First-N / KVSink	2024 to 2025	Keep sink tokens at higher precision when quantizing	Protects sinks from quantization error ^[6]
Register tokens (vision Transformers)	2023	Add learnable register tokens	Absorbs the visual analog of a sink ^[7]

Why do attention sinks matter for quantization, streaming, and KV cache?

Attention sinks complicate the three main families of inference optimization.

For KV-cache eviction and sparsity, any policy that decides which past tokens to keep must protect the sink positions. Heavy-hitter and sliding-window schemes that drop low-attention or old tokens will silently delete sinks unless they special-case them, which is why StreamingLLM and later eviction methods explicitly pin the first few tokens ^[1].
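The special-casing amounts to exempting the first few indices from whatever score the eviction policy uses. The sketch below is a generic illustration, not any specific published method; the `evict` function, its parameters, and the scores are hypothetical.

```python
def evict(scores, budget, n_pinned=4):
    """Choose which cached token indices to keep under a fixed budget.

    scores[i] is any per-token importance estimate (hypothetical here).
    The first n_pinned indices are kept unconditionally; the remaining
    budget goes to the highest-scoring other tokens.
    """
    n = len(scores)
    pinned = list(range(min(n_pinned, n)))
    if budget < len(pinned):
        raise ValueError("budget is smaller than the number of pinned tokens")
    rest = sorted(range(len(pinned), n), key=lambda i: scores[i], reverse=True)
    chosen = rest[: budget - len(pinned)]
    return sorted(pinned + chosen)


# Content-based scores that rate the first tokens as unimportant.
scores = [0.01, 0.02, 0.01, 0.03, 0.40, 0.10, 0.55, 0.05, 0.60, 0.30]
kept = evict(scores, budget=7, n_pinned=4)
unprotected = evict(scores, budget=7, n_pinned=0)
```

With pinning, the cache keeps indices 0 to 3 plus the three best-scoring later tokens. Without it, the same budget discards positions 0, 1, and 2, which is the silent deletion described above.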

For streaming and long-context serving, the sink is what makes constant-memory generation feasible at all: a fixed cache of sink tokens plus a recent window approximates full attention well enough to keep perplexity flat over millions of tokens ^[1].

For quantization, the difficulty is that sink positions coincide with the largest activation outliers in the model. The "massive activations" studied by Mingjie Sun and colleagues are a small number of hidden-state entries whose magnitudes are orders of magnitude above the rest; in LLaMA2-7B they occupy two fixed feature dimensions (1415 and 2533) and appear on the starting token and the first delimiter token (the first period or newline), and they act as the implicit bias that drives the attention concentration ^[4]. Sun et al. report that the largest such activation in LLaMA2-7B is roughly 2,000 in magnitude, about 10,000 times the median activation, and that zeroing just four of these values out of millions collapses the model, which is direct evidence that the handful of sink positions are functionally load-bearing rather than noise ^[4]. Because these extreme values land on the sink tokens, quantizing those positions to low bit-widths injects large errors. Practical KV cache quantization methods therefore keep the first few tokens in higher precision, a heuristic often called Preserve-First-N; methods such as KVSink (2025) improve on it by predicting which tokens are sinks rather than assuming they are only the first N, since sinks can also appear later in the sequence ^[6].
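A toy example shows why one outlier is so damaging under a shared quantization scale. The numbers are invented for illustration, with the outlier magnitude borrowed from the figure quoted above. The 4-bit symmetric quantizer is a generic textbook scheme, not the method of any cited paper.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric uniform quantization with one shared scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    return [round(v / scale) * scale for v in values]


def max_error(a, b):
    """Largest absolute element-wise difference between two lists."""
    return max(abs(x - y) for x, y in zip(a, b))


# One feature channel across 6 tokens; token 0 carries a sink-like outlier.
channel = [2000.0, 0.3, -0.2, 0.1, -0.4, 0.2]

# Shared scale: the outlier sets the step size, so ordinary values round to 0.
naive = quantize_dequantize(channel)
err_naive = max_error(naive[1:], channel[1:])

# Preserve-First-N with N = 1: token 0 stays in full precision.
preserved = channel[:1] + quantize_dequantize(channel[1:])
err_preserved = max_error(preserved[1:], channel[1:])
```

With the outlier included, the step size is about 286, every ordinary value collapses to zero, and the worst-case error equals the largest ordinary value (0.4). Keeping token 0 in full precision shrinks the step to about 0.057 and the worst-case error to under 0.03.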

How do attention sinks relate to massive activations and vision Transformers?

The massive-activations line of work by Sun et al. (2024), presented at the first Conference on Language Modeling, gives the mechanistic counterpart to the attention-level description: early feed-forward layers inflate the norm of the first token's hidden state, and that large, nearly input-independent vector becomes the bias that the attention of later layers pools onto ^[4]. The two phenomena are widely regarded as two views of the same effect, one expressed in activation space and one in attention space.

The same artifact appears outside language. In "Vision Transformers Need Registers," Timothée Darcet and colleagues found that vision Transformers spawn high-norm tokens in uninformative background patches and repurpose them for internal computation, the visual analog of an attention sink; adding a few learnable register tokens gives the model a clean place to store that information and yields smoother feature maps ^[7]. This parallels StreamingLLM's learnable sink and gpt-oss's per-head sink logit: all three give the network an explicit scratch slot so that it stops commandeering real tokens.

Finally, the topic connects to the broader study of attention efficiency. Sliding-window attention, KV-cache compression, and cache-eviction policies all have to account for sinks, and several recent attention variants, including sigmoid attention, gated attention, and softmax-plus-one schemes, are motivated in part by the goal of removing them ^[2]^[3]^[5].

References

[1] Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. "Efficient Streaming Language Models with Attention Sinks." arXiv:2309.17453, September 2023; ICLR 2024. https://arxiv.org/abs/2309.17453
[2] Miller, E. "Attention Is Off By One." July 2023. https://www.evanmiller.org/attention-is-off-by-one.html
[3] Gu, X., Pang, T., Du, C., Liu, Q., Zhang, F., Du, C., Wang, Y., and Lin, M. "When Attention Sink Emerges in Language Models: An Empirical View." arXiv:2410.10781, October 2024; ICLR 2025 (Spotlight). https://arxiv.org/abs/2410.10781
[4] Sun, M., Chen, X., Kolter, J. Z., and Liu, Z. "Massive Activations in Large Language Models." arXiv:2402.17762, February 2024; COLM 2024. https://arxiv.org/abs/2402.17762
[5] OpenAI. "gpt-oss-120b and gpt-oss-20b Model Card." arXiv:2508.10925, August 2025. https://openai.com/index/introducing-gpt-oss/
[6] "KVSink: Understanding and Enhancing the Preservation of Attention Sinks in KV Cache Quantization for LLMs." arXiv:2508.04257, August 2025. https://arxiv.org/abs/2508.04257
[7] Darcet, T., Oquab, M., Mairal, J., and Bojanowski, P. "Vision Transformers Need Registers." arXiv:2309.16588, September 2023; ICLR 2024. https://arxiv.org/abs/2309.16588
