Attention sink
Last reviewed
Sources
7 citations
Review status
Source-backed
Revision
v1 · 2,040 words
An attention sink is an empirical phenomenon in Transformer language models in which a large fraction of each attention head's weight concentrates on a few tokens at the very start of the sequence, most often the first token, even when those tokens carry no information relevant to the current prediction. The term was introduced in 2023 by Guangxuan Xiao and colleagues in the paper that proposed StreamingLLM, where they observed that discarding the key and value states of these initial tokens sharply degrades a model while keeping them restores its performance [1]. The behavior is now understood as a structural artifact of the softmax normalization inside attention rather than a signal of semantic importance, and it has direct consequences for KV cache eviction, streaming inference, and quantization.
Attention sinks matter because they break an intuitive assumption behind many efficiency methods, namely that low-information tokens can be dropped or coarsely approximated. The sink tokens look unimportant by content yet are load-bearing for the computation, so naive sliding-window caching, cache eviction, or aggressive quantization of those positions can cause large quality losses. Conversely, once the mechanism is understood it can be exploited: StreamingLLM uses a handful of preserved sink tokens to let a fixed-size cache process effectively unbounded input.
In a decoder-only large language model with causal masking, every position attends only to itself and to earlier positions. When the per-head attention maps are visualized, deeper layers show a striking pattern: across nearly all query positions a heavy band of attention weight lands on the first few token positions, independent of what those tokens actually are [1]. Substituting the first token with an arbitrary or meaningless token leaves the band intact, which is why the positions, not their content, are called sinks. In many models the role is filled by the beginning-of-sequence (BOS) token, but it does not have to be.
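One simple way to quantify the band described above is to average the attention weight that later queries place on the first few key positions. The sketch below assumes a single head's causal attention map is available as a list of rows; the function name `sink_mass` and the example numbers are illustrative and not taken from any of the cited papers.

```python
def sink_mass(attn, num_sink=1):
    """Average attention weight that queries place on the first `num_sink` keys.

    `attn[q][k]` is the weight query q gives key k in one causal head, so
    attn[q][k] == 0 for k > q and every row sums to 1. Queries at positions
    below `num_sink` are skipped because they can only see sink positions.
    """
    rows = attn[num_sink:]
    return sum(sum(row[:num_sink]) for row in rows) / len(rows)


# Hypothetical 4x4 causal attention map (made-up numbers, not a real model).
attn = [
    [1.00, 0.00, 0.00, 0.00],
    [0.90, 0.10, 0.00, 0.00],
    [0.80, 0.05, 0.15, 0.00],
    [0.70, 0.05, 0.05, 0.20],
]

mass = sink_mass(attn)  # average of 0.90, 0.80 and 0.70
print(f"mean attention on position 0: {mass:.2f}")
```

A value near one for a head means almost all of its attention is parked on the first position; comparing the value across layers is one way to see where the sink is strongest.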
The effect is not tied to one architecture. Xiao et al. documented it across the Llama-2, MPT, Falcon, and Pythia model families, and later studies confirmed it in most pretrained decoder-only LLMs [1][3]. Two refinements followed. First, sinks are not always confined to position zero: under a prefix language-modeling objective they can spread across the prefix, and in some models an early period or newline token also acts as a secondary sink [3][4]. Second, the sink is strongest in the middle and upper layers and is typically weak or absent in the first layer or two, which is consistent with it being something the network constructs rather than a property of the input [3][4].
The standard explanation centers on the normalization built into attention. A softmax converts the query-key scores at a position into a probability distribution that must sum to one over all visible tokens. When a head has no strongly relevant token to read from, it still cannot emit an empty distribution; the weight has to go somewhere. The most convenient destination is a position visible from every query, and in a causal model the earliest tokens are visible to all later ones, so the model learns to route unused attention mass there. The sink therefore behaves as a learned no-op or bias: a near-constant value vector that a head can mix in with little effect on its output while still satisfying the sum-to-one constraint [1][3].
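The sum-to-one constraint can be seen in a toy calculation. The sketch below uses made-up scalar scores and values for a single query; it assumes, for illustration only, that position 0 has learned a high score and carries a zero value, which is a simplification of the near-constant value vector described above.

```python
import math


def softmax(scores):
    """Standard softmax: weights are positive and always sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(weights, values):
    """Weighted sum of (scalar) values, standing in for the head output."""
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical scalar values per position; position 0 is the assumed sink.
values = [0.0, 1.0, -2.0, 3.0, 0.5]

# A head that finds nothing relevant: every score is equally low.
scores_no_sink = [-4.0, -4.0, -4.0, -4.0, -4.0]
# The same head once position 0 has learned a high score.
scores_with_sink = [2.0, -4.0, -4.0, -4.0, -4.0]

uniform = softmax(scores_no_sink)    # forced to spread weight evenly
sinked = softmax(scores_with_sink)   # nearly all weight parks on position 0

out_uniform = attend(uniform, values)  # averages irrelevant values into the output
out_sinked = attend(sinked, values)    # stays close to zero
```

Without a sink, the idle head still injects the average of unrelated values into the residual stream; with a sink, the same head contributes almost nothing, which is the no-op behavior the paragraph describes.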
This view was sharpened by an independent observation from Evan Miller, whose July 2023 note "Attention Is Off By One" argued that the softmax denominator is the culprit [2]. Miller proposed adding one to the denominator, written softmax1(x)_i = exp(x_i) / (1 + sum_j exp(x_j)), which is equivalent to adding a synthetic null position whose score is fixed at zero and whose value vector is zero. With this extra term a head can drive all of its real attention weights toward zero and stay quiet, so it no longer needs a real token to serve as a sink. Miller called the variant "quiet attention" and noted that his motivation was to remove the activation outliers that make models hard to quantize [2]. Xiao et al. tested essentially this idea under the name "Zero Sink" and found that it helps but does not fully remove the model's reliance on initial tokens [1].
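A minimal sketch of the off-by-one softmax follows, written directly from the formula above. The stabilizing shift and the helper names are implementation choices made here, not part of Miller's note.

```python
import math


def softmax(scores):
    """Standard softmax, used here only for comparison."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax1(scores):
    """Off-by-one softmax: exp(x_i) / (1 + sum_j exp(x_j))."""
    # Shift every logit, including the implicit zero logit, by the same
    # amount for numerical stability; the "+1" then becomes exp(-shift).
    shift = max(max(scores), 0.0)
    exps = [math.exp(s - shift) for s in scores]
    denom = math.exp(-shift) + sum(exps)
    return [e / denom for e in exps]


idle_scores = [-4.0, -4.0, -4.0, -4.0]
quiet = softmax1(idle_scores)          # weights sum to far less than one
confident = softmax1([10.0, 0.0])      # weights sum to almost one

# Equivalence with a null position: prepend a zero logit, apply the
# ordinary softmax, then drop the null position's weight.
via_null = softmax([0.0] + idle_scores)[1:]
```

When all real scores are low the weights nearly vanish, and when one score is high the result approaches the ordinary softmax, so the extra term only matters for heads that have nothing to attend to.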
A 2024 empirical study by Xiangming Gu and colleagues at Sea AI Lab and the National University of Singapore traced when and why the sink forms [3]. They found that it is not present at initialization but emerges after a model has been optimized on enough data, typically appearing within the first one thousand to two thousand training steps and strengthening thereafter; it is encouraged by weight decay and a larger learning rate and is essentially unaffected by batch size. Crucially, they characterized the sink as acting like a key bias that stores excess attention probability while contributing little to the value-weighted output. When they replaced softmax with normalization-free alternatives such as sigmoid attention, models up to one billion parameters trained without developing attention sinks at all, supporting the claim that the normalization, not the data, is the root cause [3].
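To illustrate why a normalization-free kernel removes the pressure to form a sink, the sketch below applies an elementwise sigmoid to the scores. It is a bare-bones illustration under that assumption; scaling or bias terms that practical sigmoid-attention implementations may use are omitted, and it is not the cited study's code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sigmoid_weights(scores):
    """Each weight depends only on its own score; nothing forces a sum of one."""
    return [sigmoid(s) for s in scores]


idle = sigmoid_weights([-6.0, -6.0, -6.0, -6.0])   # all weights near zero
mixed = sigmoid_weights([4.0, -6.0, -6.0, -6.0])   # one weight near one

# Raising one score leaves the other weights unchanged, unlike softmax,
# where mass moved to one position must be taken from the others.
unchanged = idle[1:] == mixed[1:]
```

Because an idle head can simply output near-zero weights everywhere, there is no leftover probability mass that needs a parking spot.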
The phenomenon was discovered while solving a practical problem: running an LLM over a stream of text far longer than its training context while keeping the KV cache bounded. The obvious fix, sliding window attention, keeps only the most recent tokens' keys and values. Xiao et al. showed that this collapses the moment the text exceeds the window, because evicting the first tokens removes the sinks the model depends on and perplexity spikes [1].
Their method, StreamingLLM, fixes this by permanently keeping a small number of initial-token KV entries as dedicated attention sinks, alongside a rolling window of the most recent tokens. They found that retaining four initial tokens is enough to recover full performance whereas one or two are not, and that the combined cache (for example, four sink tokens plus a window of recent tokens) stays fixed in size while still handling very long inputs. With this scheme, Llama-2, MPT, Falcon, and Pythia models processed up to four million tokens with stable perplexity, and in the streaming setting StreamingLLM ran up to 22.2 times faster than a sliding-window baseline that recomputes attention from scratch at each step [1]. The method requires no fine-tuning; it changes only which KV entries are kept and how positions are encoded within the window. The work was carried out by researchers at MIT, Meta AI, Carnegie Mellon University, and NVIDIA, was submitted on September 29, 2023, and appeared at ICLR 2024.
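The retention rule can be sketched as a small data structure. The class below is a hypothetical illustration, not the StreamingLLM code: the name `SinkWindowCache`, the window size of eight, and the use of integers as stand-ins for key-value pairs are all assumptions made for the example.

```python
from collections import deque


class SinkWindowCache:
    """Keeps the first `num_sink` entries forever plus the last `window` entries."""

    def __init__(self, num_sink=4, window=8):
        self.num_sink = num_sink
        self.sinks = []                      # pinned initial-token entries
        self.recent = deque(maxlen=window)   # oldest entry drops off automatically

    def append(self, kv_entry):
        # The first `num_sink` entries are pinned; everything later rolls.
        if len(self.sinks) < self.num_sink:
            self.sinks.append(kv_entry)
        else:
            self.recent.append(kv_entry)

    def entries(self):
        return self.sinks + list(self.recent)

    def cache_positions(self):
        # Positions are assigned by slot inside the cache rather than by the
        # token's index in the original text, so they never exceed the cache size.
        return list(range(len(self.sinks) + len(self.recent)))


cache = SinkWindowCache(num_sink=4, window=8)
for token_index in range(1000):
    cache.append(token_index)  # an integer stands in for a (key, value) pair

kept = cache.entries()               # tokens 0-3 plus tokens 992-999
positions = cache.cache_positions()  # 0 through 11
```

However long the stream runs, the cache holds at most `num_sink + window` entries, which is what makes the memory cost constant.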
StreamingLLM also showed that the sink can be designed in rather than discovered. Pre-training with a single dedicated learnable placeholder token at the start of every sequence, which the paper calls a "Learnable Sink," let models stream stably using just that one token instead of needing four arbitrary ones [1]. This idea has since reached production architectures. OpenAI's gpt-oss models, the open-weight gpt-oss-120b and gpt-oss-20b released on August 5, 2025, give each attention head a learned scalar sink logit that is appended to the attention scores before the softmax. It is not a real token but an extra entry in the denominator, so a head can send probability mass to the sink and effectively skip attending to any real token. This is a learned, per-head version of the off-by-one relaxation Miller described [5].
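The per-head sink logit can be sketched as an ordinary softmax over the real scores plus one extra entry whose weight is then discarded. This is an illustration of the idea only, not code from gpt-oss; the function name and the example logit values are assumptions.

```python
import math


def attention_with_sink_logit(scores, sink_logit):
    """Softmax over real-token scores plus one extra (learned) logit.

    Returns the weights for the real tokens and, separately, the weight
    absorbed by the sink entry, which has no value vector and is dropped.
    """
    logits = list(scores) + [sink_logit]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return weights[:-1], weights[-1]


idle_scores = [-4.0, -4.0, -4.0, -4.0]

# A head with a high sink logit (hypothetical value) can stay almost silent.
real_w, sink_w = attention_with_sink_logit(idle_scores, sink_logit=2.0)

# A sink logit of zero reproduces the fixed "+1" in the denominator.
off_by_one_w, _ = attention_with_sink_logit(idle_scores, sink_logit=0.0)

# A very negative sink logit recovers the ordinary softmax.
plain_w, plain_sink = attention_with_sink_logit(idle_scores, sink_logit=-1e9)
```

Making the logit a trainable parameter lets each head decide how much mass it wants to be able to discard, instead of fixing that amount in advance.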
The table below summarizes the main responses to the phenomenon, which divide into methods that keep and use sinks and methods that prevent them from forming.
| Approach | Year | What it does | Relationship to sinks |
|---|---|---|---|
| StreamingLLM (keep sink tokens) | 2023 | Pin the first few KV entries plus a rolling window | Exploits sinks to enable constant-memory streaming [1] |
| Learnable sink token | 2023 | Pre-train with one dedicated placeholder token | Replaces four ad-hoc sinks with one designed slot [1] |
| SoftMax1 / Zero Sink (off-by-one) | 2023 | Add a +1 term to the softmax denominator | Lets a head stay quiet, reducing the need for a sink [1][2] |
| Per-head sink logit (gpt-oss) | 2025 | Learned scalar appended to each head's scores | Built-in sink that uses no real token [5] |
| Sigmoid / normalization-free attention | 2024 | Drops the sum-to-one constraint | Sinks do not emerge [3] |
| Preserve-First-N / KVSink | 2024 to 2025 | Keep sink tokens at higher precision when quantizing | Protects sinks from quantization error [6] |
| Register tokens (vision Transformers) | 2023 | Add learnable register tokens | Absorbs the visual analog of a sink [7] |
Attention sinks complicate three families of inference optimization: KV-cache eviction and sparsity, streaming and long-context serving, and quantization.
For KV-cache eviction and sparsity, any policy that decides which past tokens to keep must protect the sink positions. Heavy-hitter and sliding-window schemes that drop low-attention or old tokens will silently delete sinks unless they special-case them, which is why StreamingLLM and later eviction methods explicitly pin the first few tokens [1].
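The special-casing can be sketched as a selection function that pins the first few positions before ranking the rest. The importance scores here are hypothetical outputs of some unspecified eviction policy, and `select_kept_positions` is an illustrative name, not an API from any library.

```python
def select_kept_positions(importance, budget, num_sink=4):
    """Return sorted cache positions to keep, always pinning the first `num_sink`.

    `importance[i]` is whatever score the eviction policy assigns to position i;
    higher means more worth keeping. `budget` is the total number of positions kept.
    """
    n = len(importance)
    pinned = list(range(min(num_sink, n, budget)))
    remaining = budget - len(pinned)
    # Rank only the non-pinned positions by the policy's score.
    candidates = sorted(range(len(pinned), n), key=lambda i: importance[i], reverse=True)
    return sorted(pinned + candidates[:remaining])


# Hypothetical scores from a policy that rates the first two tokens as unimportant.
importance = [0.0, 0.0, 0.9, 0.1, 0.8, 0.2, 0.7, 0.3]

with_pinning = select_kept_positions(importance, budget=4, num_sink=2)
without_pinning = select_kept_positions(importance, budget=4, num_sink=0)
```

With pinning the kept set starts with positions 0 and 1; without it the policy keeps only the highest-scoring later positions and the sink entries are lost.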
For streaming and long-context serving, the sink is what makes constant-memory generation feasible at all: a fixed cache of sink tokens plus a recent window approximates full attention well enough to keep perplexity flat over millions of tokens [1].
For quantization, the difficulty is that sink positions coincide with the largest activation outliers in the model. The "massive activations" studied by Mingjie Sun and colleagues are a small number of hidden-state entries whose magnitudes are orders of magnitude above the rest; in LLaMA2-7B they occupy two fixed feature dimensions (1415 and 2533) and appear on the starting token and the first delimiter token, and they act as the implicit bias that drives the attention concentration [4]. Because these extreme values land on the sink tokens, quantizing those positions to low bit-widths injects large errors. Practical KV cache quantization methods therefore keep the first few tokens in higher precision, a heuristic often called Preserve-First-N; methods such as KVSink (2025) improve on it by predicting which tokens are sinks rather than assuming they are only the first N, since sinks can also appear later in the sequence [6].
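A toy example shows why a Preserve-First-N rule helps. The sketch assumes symmetric round-to-nearest quantization with one scale per token and uses made-up four-dimensional vectors, with a single large outlier on the first token standing in for a massive activation; real KV cache quantizers and KVSink differ in their details.

```python
def quantize_dequantize(vec, bits=4):
    """Symmetric per-vector round-to-nearest quantization, then dequantization."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(x) for x in vec) / qmax
    if scale == 0.0:
        return list(vec)
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in vec]


def quantize_cache(tokens, bits=4, preserve_first_n=0):
    """Quantize every token vector except the first `preserve_first_n`."""
    return [
        list(t) if i < preserve_first_n else quantize_dequantize(t, bits)
        for i, t in enumerate(tokens)
    ]


def max_abs_error(original, approx):
    return max(abs(a - b) for o, q in zip(original, approx) for a, b in zip(o, q))


# Hypothetical cache: token 0 carries one huge entry, the others are ordinary.
tokens = [
    [500.0, 0.3, -0.2, 0.1],
    [0.4, -0.3, 0.2, 0.1],
    [0.1, 0.5, -0.4, 0.2],
]

err_all = max_abs_error(tokens, quantize_cache(tokens, preserve_first_n=0))
err_preserved = max_abs_error(tokens, quantize_cache(tokens, preserve_first_n=1))
```

The outlier stretches the first token's quantization scale so far that its remaining entries all round to zero, while the ordinary tokens quantize with small error; keeping that one token in full precision removes the dominant error term.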
The massive-activations line of work by Sun et al. (2024), presented at the first Conference on Language Modeling, gives the mechanistic counterpart to the attention-level description: early feed-forward layers inflate the norm of the first token's hidden state, and that large, nearly input-independent vector becomes the bias that the attention of later layers pools onto [4]. The two phenomena are widely regarded as two views of the same effect, one expressed in activation space and one in attention space.
The same artifact appears outside language. In "Vision Transformers Need Registers," Timothée Darcet and colleagues found that vision Transformers spawn high-norm tokens in uninformative background patches and repurpose them for internal computation, the visual analog of an attention sink; adding a few learnable register tokens gives the model a clean place to store that information and yields smoother feature maps [7]. This parallels StreamingLLM's learnable sink and gpt-oss's per-head sink logit: all three give the network an explicit scratch slot so that it stops commandeering real tokens.
Finally, the topic connects to the broader study of attention efficiency. Sliding-window attention, KV-cache compression, and cache-eviction policies all have to account for sinks, and several recent attention variants, including sigmoid attention, gated attention, and softmax-plus-one schemes, are motivated in part by the goal of removing them [2][3][5].