Linear Attention

Model Architecture Transformer Models

27 min read

Updated Jun 22, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 22, 2026

Fact-checked

In review queue

Sources

20 citations

Revision

v3 · 5,422 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Linear attention is a family of sub-quadratic attention mechanisms that replaces the softmax dot-product operation of standard Transformer self-attention with a feature-map-based factorization, reducing the cost of processing a sequence from quadratic to linear in its length. By writing attention as phi(Q) (phi(K)^T V), where phi is a non-negative kernel feature map, the computation can be reordered to scale linearly rather than quadratically with sequence length, cutting complexity from O(N^2 d) to O(N d^2).^[1] The formulation also reveals a natural recurrent state representation that allows autoregressive decoding in constant time and memory per token, blurring the line between Transformers and recurrent neural networks.^[1] First popularized in the 2020 ICML paper "Transformers are RNNs" by Katharopoulos, Vyas, Pappas and Fleuret, linear attention has since been refined through gating, decay, delta-rule updates and structured state space duality, with variants such as RWKV, RetNet, Mamba-2, GLA and Gated DeltaNet narrowing the historical quality gap to softmax attention.^[1]^[2]^[3]^[4]^[5]^[6]

The original paper reports that its linear transformers "achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences."^[1]

What is linear attention in one sentence?

Linear attention removes the softmax from self-attention and replaces the exponential similarity exp(q . k) with an explicit inner product phi(q)^T phi(k), so that matrix-product associativity lets the model accumulate a single fixed-size d x d state instead of materializing the full N x N attention matrix, giving O(N) time and O(1) memory per token at inference.^[1] The five-word version, from the title of the founding paper, is that "Transformers are RNNs".^[1]

History

The history of linear attention can be traced through three overlapping waves of research. The earliest precursors come from the 1990s, when Schmidhuber proposed fast-weight programmers as an alternative to standard recurrent networks. In a 1992 Neural Computation paper, a slow controller network was trained by gradient descent to write context-dependent fast weights into a second feedforward network via additive outer products of self-generated key and value patterns.^[9] This is, up to renaming of variables, the same operation that linear attention performs at each token, although the connection to attention as understood in modern Transformers was not made explicit until decades later.^[3] An earlier 1991 technical report by Schmidhuber describes the same idea under a different name.^[9]

The second wave coincided with the rise of the Transformer. After Vaswani et al. introduced softmax self-attention in 2017, the quadratic cost of long-sequence training became a recognized bottleneck.^[7] Several teams in 2019 and 2020 proposed efficient-attention mechanisms that broadly fell into three families: sparse approximations of the attention matrix, low-rank projections, and kernel-based linearizations.^[8] In June 2020 Wang and collaborators released Linformer, which projected keys and values to a low-rank subspace; later the same month Katharopoulos, Vyas, Pappas and Fleuret submitted the seminal "Transformers are RNNs" paper introducing the modern ELU+1 feature-map linear attention and the recurrent state interpretation.^[1]^[16] In September 2020 Choromanski and collaborators released the Performer paper, providing the first kernel approximation of softmax with provable error bounds.^[10]

The third wave, from 2021 onward, focused on closing the quality gap. Schlag, Irie and Schmidhuber formalized the equivalence to fast weight programmers and introduced the delta-rule update.^[3] In 2023 a series of more expressive variants appeared: RWKV applied linear attention with per-channel exponential decay at large scale, and RetNet introduced explicit decay and a triple parallel/recurrent/chunkwise computation form for industrial-scale training.^[4]^[17] GLA in late 2023 added data-dependent gates and a hardware-aware kernel.^[2] In 2024 Mamba 2 unified linear attention with selective state space models through the SSD framework, and a wave of delta-rule variants culminated in Gated DeltaNet, which combines gating with the delta update.^[5]^[6]^[14] By 2025, hybrid architectures interleaving linear-attention or SSM layers with periodic softmax layers had been adopted at the foundation-model scale by teams releasing MiniMax-01, Kimi Linear and certain Qwen variants, signalling broader industrial uptake of the approach.^[18]

Background

Standard self-attention in the transformer architecture computes outputs as softmax(QK^T / sqrt(d_k)) V, where the N x N matrix QK^T is materialized for every layer and every head.^[7] This makes time and memory scale as O(N^2 d), which becomes prohibitive when the sequence length N reaches tens or hundreds of thousands of tokens.^[1]^[7] Many efficient-attention proposals appeared between 2019 and 2021, including sparse, low-rank and locality-sensitive approximations, but most retained an explicit softmax over a reduced set of token pairs.^[8]

Linear attention takes a different route. Instead of approximating the softmax, it removes the softmax altogether and replaces the exponential kernel exp(q . k) with an explicit feature map phi chosen so that sim(q, k) = phi(q)^T phi(k) is non-negative.^[1] Because the new similarity is bilinear, the order of matrix products can be changed: rather than computing (QK^T) V, the algorithm computes phi(K)^T V first (an outer-product sum of size d x d), and then multiplies by phi(Q). The cost of building the intermediate is O(N d^2) instead of O(N^2 d), which is favorable whenever N >> d.^[1]

The conceptual roots of this trick go back further. In 1992 Jurgen Schmidhuber introduced "fast weight programmers", in which a slow network learns by gradient descent to write key-value outer products into a separate fast-weight matrix that another network reads from.^[3]^[9] Schlag, Irie and Schmidhuber later showed in 2021 that linearized self-attention is mathematically equivalent to that classical construction, describing the fast-weight write as a sequence of "additive outer products of self-invented activation patterns (today called keys and values)", and motivating a richer class of update rules.^[3]

How does linear attention reduce the cost from quadratic to linear?

From softmax to feature maps

Let Q, K, V be the query, key and value matrices of shape N x d. The i-th output of softmax attention is

y_i = sum_j  exp(q_i^T k_j) / sum_l exp(q_i^T k_l)  *  v_j

Katharopoulos et al. generalize the exponential kernel to any non-negative similarity sim(q, k), giving

y_i = sum_j  sim(q_i, k_j) v_j  /  sum_j sim(q_i, k_j)

If sim factorizes as sim(q, k) = phi(q)^T phi(k) for some feature map phi: R^d -> R^r with non-negative outputs, the numerator and denominator both rearrange:

y_i = phi(q_i)^T ( sum_j phi(k_j) v_j^T )  /  phi(q_i)^T ( sum_j phi(k_j) )

The two sums no longer depend on i, so they can be precomputed once per layer at cost O(N r d) and reused for every query.^[1] The complexity drops from O(N^2 d) to O(N r d), which equals O(N d^2) for the natural choice r = d.^[1] The authors describe the move as expressing "the self-attention as a linear dot-product of kernel feature maps" and exploiting "the associativity property of matrix products to reduce the complexity from O(N^2) to O(N)".^[1]

Causal masking and recurrence

For autoregressive language modeling the sum must respect the causal mask, so that position i only sees positions j <= i. Define the cumulative running statistics

S_i = S_{i-1} + phi(k_i) v_i^T
z_i = z_{i-1} + phi(k_i)
y_i = (phi(q_i)^T S_i) / (phi(q_i)^T z_i)

S_i is a d x d (or r x d) matrix and z_i is a d-vector; both are constant in size with respect to sequence length.^[1] These updates make linear attention behave like a recurrent neural network whose hidden state is the matrix S_i, which is why Katharopoulos et al. titled their paper "Transformers are RNNs".^[1] At inference time each new token costs O(d^2) arithmetic and O(d^2) memory, independent of how many tokens have already been generated, in contrast to the linearly growing KV cache of softmax Transformers.^[1]

The same paper reports speedups of up to 4000 times for autoregressive prediction on very long sequences compared to a standard Transformer of comparable size, while attaining comparable likelihood on tasks tested in the paper.^[1]

Classical feature maps

Several feature maps appear repeatedly in the literature:

Feature map	Definition	Source
ELU+1	`phi(x) = elu(x) + 1`, applied elementwise	Katharopoulos et al. 2020^[1]
Positive random features (FAVOR+)	`phi(x) = exp(-
Deterministic Parameter-Free Projection (DPFP)	Concatenation of products of ReLU shifts of `x`	Schlag, Irie, Schmidhuber 2021^[3]
Polynomial features	`phi(x) = (1, x, x (x) x, ...)` truncated to a fixed order	Kacham et al., Arora et al.^[11]^[12]
Cosine / sine projections	`phi(x) = [cos(W x); sin(W x)]` for orthogonal `W`	Performer variants^[10]
Learned MLP	`phi(x) = MLP(x)`, trained to mimic softmax spikes	Hedgehog (Zhang et al. 2024)^[11]

The ELU+1 choice was motivated by the need for a strictly positive output without requiring a particular kernel approximation; it is cheap, deterministic and the default for many later studies of pure linear attention.^[1] The Performer paper instead derives a positive random-feature decomposition of the exact softmax kernel, giving an unbiased estimator of softmax attention rather than an unrelated similarity.^[10] DPFP is a deterministic finite-dimensional approximation that the Schlag et al. authors found to balance simplicity and effectiveness on language modeling and translation tasks.^[3]

Associativity and the kernel trick

The core algebraic step that makes linear attention work is the reassociation of three matrix products. In softmax attention, computing softmax(QK^T) V requires materializing QK^T, an N x N matrix, before normalizing across rows; this matrix cannot be avoided because the softmax couples all entries in a row through the normalizer. In linear attention, the absence of a softmax makes the bilinear similarity phi(q)^T phi(k) a true inner product, and matrix multiplication is associative, so (phi(Q) phi(K)^T) V = phi(Q) (phi(K)^T V). Choosing the right-associated form means the algorithm builds a r x d matrix phi(K)^T V rather than an N x N matrix, eliminating the quadratic intermediate.^[1] This is a special case of the kernel trick familiar from kernel methods in classical machine learning, where an inner product in a feature space replaces an explicit pairwise similarity computation in input space.^[3]^[10]

The kernel-trick view also clarifies what Performer adds over generic linear attention. The Performer feature map is not arbitrary: it is constructed so that E[phi(q)^T phi(k)] equals the softmax kernel exp(q^T k) for random orthogonal projection matrices, with the expectation taken over those random matrices.^[10] As a result, FAVOR+ is an unbiased Monte Carlo estimator of softmax attention rather than a different attention mechanism, and the approximation error is controlled by the number of random features. Generic linear attentions such as ELU+1 or DPFP are not estimators of softmax; they define their own kernel and let training reshape the network around it.^[1]^[3]

Chunkwise parallel form

A pure recurrence is hardware-unfriendly because it serializes the time dimension. The chunkwise parallel form, popularized by GLA, RetNet, Mamba-2 and DeltaNet, splits the sequence of length L into L/C chunks of size C.^[2]^[4]^[5]^[13] Within each chunk the quadratic form phi(Q) phi(K)^T V is used (allowing fused GEMM kernels), while between chunks only the final hidden state S is propagated via the recurrence.^[2] This interpolates between the parallel form (one big chunk) and the recurrent form (chunk size one), allowing implementations to choose chunk size based on the hardware and sequence length.^[2] The Gated Linear Attention paper develops a hardware-efficient instantiation called FlashLinearAttention that the authors report to be faster than Flash Attention-2 as a standalone layer even on short sequences such as 1K tokens.^[2]

The recurrence S_i = S_{i-1} + phi(k_i) v_i^T makes the equivalence to a particular kind of recurrent neural network explicit: instead of a fixed-size vector hidden state as in an LSTM or GRU, the state is a matrix that grows by an outer product at every step.^[1]^[3] This view connects to the older fast-weight literature in two ways. First, Schmidhuber's 1992 Neural Computation paper proposed fast weight memories updated by exactly this kind of additive outer product, with a slow controller producing the keys and values.^[9] Second, Schlag, Irie and Schmidhuber proved in 2021 that the additive update is identical, up to renaming, to a linearized self-attention layer, and they then proposed a delta rule extension that replaces the additive write with a Widrow-Hoff style error-correcting write.^[3]

The delta-rule update is

S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T

where beta_t is a per-token learning-rate that the network predicts.^[3]^[14] This is exactly one step of gradient descent on the per-key reconstruction error || S k_t - v_t ||^2, which makes the matrix S behave like an associative memory trained on the fly during the forward pass.^[14] Yang et al. later showed how to parallelize this update over sequence length by writing the recurrence using generalized Householder transformations, giving DeltaNet a chunkwise parallel training algorithm with the same hardware properties as plain linear attention.^[14]

The recurrent perspective also clarifies why linear attention sits between Transformers and classical RNNs. Each new token writes an outer product into a finite memory whose capacity is bounded by d^2; this contrasts with softmax attention, which retains every prior token in the KV cache and can address it exactly through the softmax. Linear attention thus trades a growing exact memory for a fixed-size compressed memory, with cost O(1) per token at inference but a constant-bandwidth bottleneck on how much can be remembered.^[1]^[15]

What are the main variants of linear attention?

Performer (FAVOR+)

The Performer architecture, by Choromanski and collaborators at Google Research, was introduced at ICLR 2021 and approximates the softmax kernel with positive orthogonal random features (FAVOR+, Fast Attention Via positive Orthogonal Random features).^[10] Unlike the ELU+1 map, FAVOR+ provides an unbiased estimator of softmax attention with provable uniform convergence and low variance, which made it the first widely-adopted linear-attention variant that could approximate, rather than replace, the original Transformer kernel.^[10] Performer is fully compatible with pretrained Transformer weights, allowing softmax models to be converted to a linear-time form post-hoc.^[10]

Linformer

Linformer, by Wang and collaborators in mid-2020, also achieves linear complexity but through a different mechanism.^[16] Rather than removing the softmax, Linformer projects the N x d key and value matrices through learned k x N matrices to produce k x d proxies, then applies standard softmax attention on the reduced length-k sequence.^[16] It is therefore strictly speaking a low-rank attention method rather than a feature-map linear attention, but it is often grouped with linear-attention work because it brings the cost to O(N k d).^[16] One consequence is that Linformer is non-causal by default and not as readily reformulated as a recurrence.^[16]

RWKV

RWKV (Receptance Weighted Key Value) was introduced in 2023 by Peng and a large open-source collaboration, and combines a transformer-style attention block with an RNN-style hidden state.^[17] Its time-mixing block uses a variant of linear attention in which the addition of new key-value outer products is multiplied by an exponential decay along the time dimension, controlled by a learnable per-channel decay vector w and a bonus vector u for the current token.^[17] The result is a recurrence that the authors describe as a linear-cost RNN with transformer-level performance, and they trained models up to 14 billion parameters which the paper describes as the largest dense RNN trained at the time of publication.^[17] RWKV-7 is a more recent generation in the same family.

RetNet

RetNet (Retentive Network), proposed in July 2023 by Sun and collaborators at Microsoft Research and Tsinghua, presents the retention operator as a linear attention with explicit exponential decay along the sequence.^[4] Each retention layer admits three mathematically equivalent computation modes: a parallel form for training (similar in cost to softmax attention), a recurrent form for O(1) token-by-token inference, and a chunkwise recurrent form that interpolates between the two for long-sequence inference.^[4] The authors position the model as a successor architecture for large language models and report better inference throughput than softmax Transformers paired with Flash Attention, with competitive language-modeling perplexity.^[4]

Gated Linear Attention (GLA)

GLA, introduced by Yang, Wang, Shen, Panda and Kim in December 2023 and published at ICML 2024, augments the standard linear-attention recurrence with a data-dependent gate G_t that multiplies the running state before each update.^[2] The recurrence becomes

S_t = G_t (.) S_{t-1} + k_t v_t^T

where (.) denotes a structured (typically diagonal or outer-product) decay matrix produced from the input.^[2] Compared to RetNet's fixed exponential decay, this lets the model selectively forget context based on the current token. The accompanying FlashLinearAttention kernel is a hardware-aware implementation that the authors found to outperform Flash Attention-2 even on short sequences.^[2] GLA was reported to be competitive with the LLaMA-architecture Transformer, RetNet and Mamba in moderate-scale language modeling experiments, with strong length-generalization behavior when extending from 2K to over 20K tokens.^[2]

DeltaNet and Gated DeltaNet

Yang, Wang, Zhang, Shen and Kim published in June 2024 a hardware-efficient algorithm for training DeltaNet, the delta-rule linear attention of Schlag et al., parallelized across sequence length via a reparameterization using Householder transformations.^[14] At the 1.3B parameter scale, the authors reported that DeltaNet trained on 100 billion tokens outperformed Mamba and GLA on perplexity and zero-shot downstream tasks, and that DeltaNet-attention hybrid models outperformed Transformer baselines.^[14]

Gated DeltaNet by Yang, Kautz and Hatamizadeh, published at ICLR 2025, combines DeltaNet's delta-rule update with GLA-style gating to produce a recurrence that can both forget memory rapidly (via the gate) and update specific keys precisely (via the delta rule).^[6] The paper reports consistent improvement over Mamba 2 and DeltaNet on language modeling, common-sense reasoning, in-context retrieval, length extrapolation and long-context understanding benchmarks, with hybrid architectures pairing Gated DeltaNet layers with sliding window attention or Mamba 2 layers performing the best.^[6]

Mamba 2 and structured state space duality

Mamba 2, published at ICML 2024 by Dao and Gu under the title "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality", established a precise theoretical bridge between selective state space models and linear attention.^[5] The authors argue that SSMs and attention "are actually quite closely related", connected "through various decompositions of a well-studied class of structured semiseparable matrices".^[5] The State Space Duality (SSD) framework expresses both families as decompositions of these matrices, with linear attention corresponding to the dual quadratic form and selective SSMs corresponding to the dual linear (matrix-multiply-free) form.^[5] Under this view, the two computation modes are exactly the same model evaluated by different algorithms, in the same way that linear attention has both a recurrent and a parallel form.^[5] Mamba 2's core layer is described in the paper as a refinement of Mamba's selective SSM that is two to eight times faster while remaining competitive with Transformers on language modeling.^[5]

BASED: balancing recall and throughput

In February 2024, Arora and collaborators at Stanford released BASED (Building Attention Systems for Efficient Decoding), a linear-attention model designed explicitly to address the associative-recall weakness of prior linear-attention and SSM models.^[12] The authors observed that on synthetic recall benchmarks and in-context-learning tasks, models such as Mamba, RWKV, Hyena and RetNet underperform standard Transformers, even when their language-modeling perplexity is comparable.^[12] BASED combines a sliding-window softmax attention with a Taylor-series linear attention (a polynomial-feature approximation of softmax) and interleaves these layers throughout the network.^[12] At inference time the model decodes without a KV cache for the linear-attention component, which Arora et al. report yields up to 24 times higher throughput than a Transformer of comparable parameter count using Flash Attention 2.^[12] The paper frames the result as a tunable tradeoff between recall (favored by softmax) and throughput (favored by linear attention), with hybrid models occupying intermediate points on a Pareto frontier.^[12]

Hedgehog and softmax mimicry

In February 2024, Zhang, Bhatia, Kumbong and Re proposed Hedgehog at ICLR 2024 to attack the quality gap from a different angle.^[11] They identified two properties of softmax that prior linear-attention feature maps lack: low-entropy ("spiky") attention weights that concentrate mass on a small number of relevant tokens, and dot-product monotonicity (attention weights should be monotone in q . k).^[11] Hedgehog replaces the fixed feature map by a small trainable MLP, trained either from scratch or to imitate the softmax of a teacher Transformer.^[11] The authors reported recovering more than 99 percent of standard Transformer quality, including converting pretrained GPT-2 and Llama-2 7B models into linear-attention form with substantially smaller quality drops than prior linear attentions.^[11]

Comparison table

Variant	Year	Mechanism	Decay/gate	Distinctive feature
Linear Transformer	2020	`phi(x) = elu(x) + 1`	None	Original `O(N d^2)` formulation^[1]
Performer	2020/21	Positive random features	None	Unbiased softmax estimator^[10]
DeltaNet (Schlag)	2021	DPFP kernel + delta rule	None	Error-correcting writes^[3]
RWKV	2023	Linear attention with channel decay	Fixed exponential	Open-source RNN-Transformer hybrid^[17]
RetNet	2023	Retention with explicit decay	Fixed exponential	Three computation modes^[4]
GLA	2023	Linear attention with data-dependent gate	Learned per-token	FlashLinearAttention kernel^[2]
Mamba 2	2024	Selective SSM (SSD dual)	Learned diagonal	2-8x faster than Mamba^[5]
DeltaNet (Yang)	2024	Delta rule, parallelized	None	Householder reparameterization^[14]
Hedgehog	2024	Learned MLP feature map	None	Mimics softmax spikiness^[11]
Gated DeltaNet	2024/25	Delta rule + gate	Learned per-token	Combines both ideas^[6]

Where is linear attention used in practice?

Open-source implementations of linear attention exist in several frameworks. The original Linear Transformer is published with reference code by the EPFL Idiap group as the fast-transformers library, which includes the ELU+1 feature map and a CUDA implementation of the causal cumulative-sum kernel.^[1] Performer has a maintained PyTorch port (performer-pytorch) and a JAX implementation that follows the FAVOR+ paper.^[10] DeltaNet and GLA are distributed by the flash-linear-attention project on GitHub, which packages chunkwise kernels and Triton implementations of GLA, DeltaNet, RetNet and related recurrences.^[2]^[14] Gated DeltaNet's official implementation is released by NVIDIA Labs.^[6]

Linear-attention layers also appear inside larger publicly released models. RWKV is the most prominent example of a model family trained from scratch on a pure linear-attention recurrence, with checkpoints released openly by the RWKV project.^[17] MiniMax-Text-01 interleaves linear-attention blocks (described by the authors as a Lightning Attention variant) with periodic softmax-attention blocks, in a hybrid model.^[18] RetNet was released with checkpoints by Microsoft Research. Recent industrial-scale models such as Kimi Linear and certain Qwen variants are also reported to use hybrid architectures combining linear attention or selective SSM layers with conventional attention for short-context fidelity.^[18]

What is linear attention used for?

Linear attention is most attractive when sequence length is the dominant cost driver. Long-document modeling, code repositories spanning hundreds of thousands of tokens, audio sequences sampled at high frame rates, genomic sequences, and streaming inference workloads all stress the quadratic scaling of softmax attention.^[1]^[4] In these regimes, the constant-time-per-token recurrent form of linear attention removes the linearly growing KV cache that otherwise dominates serving memory in autoregressive decoding.^[1]^[4]^[17]

A second area is on-device or low-memory inference. Because the per-token state is O(d^2) rather than O(N d), a 7B linear-attention model can in principle decode tokens with memory bounded independent of context length, which is valuable for mobile and embedded deployment.^[17] RWKV in particular has been used as a research vehicle for this kind of deployment.^[17]

A third use is as a building block in hybrid architectures, where most layers are linear-attention or SSM layers and a small minority are softmax attention. Such hybrids combine the long-range efficiency of linear recurrences with the precise retrieval capacity of softmax attention, and have been adopted in recent industrial systems.^[6]^[14]^[18] DeltaNet, Gated DeltaNet and Mamba 2 hybrids with full attention or sliding window attention are reported in their respective papers to match or exceed pure Transformer baselines on language modeling benchmarks at moderate scale.^[6]^[14]

Beyond language modeling, linear attention has been applied to vision (e.g. linear-attention vision transformers), speech, time-series modeling and reinforcement learning settings where the sequence length or rollout length is the primary cost concern.^[8]

A practical lens on the tradeoff is what Arora et al. call the recall-throughput frontier: any model must trade some amount of in-context recall capacity for some amount of decoding throughput.^[12] Softmax attention occupies the high-recall, low-throughput end of this frontier because it stores every past token exactly but pays a linearly growing KV cache. Pure linear attention with a small state occupies the high-throughput, low-recall end because per-token cost is constant but the matrix state can store only a bounded number of key-value pairs without interference. Variants with larger feature dimensions, decay, gating or the delta rule move along this frontier; hybrid models combine layers from different points on the frontier and let the network decide where to allocate softmax-style precision. The pace of recent improvements suggests that hybrid linear-attention architectures, rather than pure replacements for softmax, will be the dominant form of efficient long-context modeling in the near term.^[6]^[12]^[14]^[18]

What are the limitations of linear attention?

Despite the recent improvements, linear attention historically lagged softmax attention on language modeling perplexity and on retrieval-heavy tasks, and the quality gap remains a major topic of study.^[11]^[14] Several specific weaknesses have been characterized in the literature:

Limited expressivity. Because the per-step write is a rank-one outer product into a fixed-size matrix, the cumulative state can interfere with itself; the model cannot in general remember an unbounded number of distinct key-value pairs exactly.^[3]^[11] Schlag et al. analyze this as a memory capacity bottleneck and propose the delta rule as a partial fix.^[3]
Non-spiky attention weights. Zhang et al. show that prior linear attentions tend to produce flat, high-entropy attention distributions instead of the spiky distributions that softmax models converge to on language modeling, which correlates with degraded performance on tasks that require attending sharply to a single past token.^[11]
Loss of dot-product monotonicity. The same Hedgehog analysis shows that linear-attention feature maps may not be monotone in q . k, breaking an invariant that softmax preserves and that downstream layers appear to rely on.^[11]
Not injective. Linear attention is not injective in the queries, so different queries can produce identical outputs (semantic confusion), in contrast to softmax which is injective for distinct argument vectors.^[15]
Local-modeling deficiencies. Several studies note that linear attention without explicit decay or gating struggles to model strong local dependencies, motivating decay terms in RetNet, RWKV and GLA and convolutional pre-processing in some vision linear-attention variants.^[15]^[4]^[17]
Hardware-efficient training is non-trivial. The naive causal recurrence is sequential and slow on GPUs; achieving hardware efficiency required the chunkwise parallel algorithm and dedicated kernels such as FlashLinearAttention.^[2]^[14]

These limitations explain why early linear-attention models, despite excellent asymptotic complexity, often failed to match same-size Transformers in perplexity and downstream accuracy. The most recent variants (GLA, DeltaNet, Mamba 2 and Gated DeltaNet) close most of the gap on moderate-scale benchmarks and outperform Transformers on length-extrapolation tasks, but pure linear-attention models at very large scale are still less battle-tested than softmax Transformers, and most production systems that adopt linear attention do so in a hybrid with at least some softmax layers.^[2]^[6]^[14]^[18]

How does linear attention relate to the attention sink phenomenon?

A related point of contrast is the attention sink phenomenon described by Xiao and collaborators in late 2023.^[19] In softmax Transformers, a large fraction of attention heads systematically place a disproportionate share of their attention mass on a small set of early tokens (often the first token), regardless of those tokens' semantic content.^[19] This behavior is thought to arise because softmax forces attention weights to sum to one, so heads that have nothing useful to attend to must place their mass somewhere; the model learns to park unused attention on tokens with low-magnitude value vectors.^[19] StreamingLLM, the system Xiao et al. propose, exploits this by keeping the first few tokens together with a sliding window during streaming decoding, allowing softmax models trained on a fixed window to generalize to multi-million-token streams.^[19]

Linear attention does not impose the sum-to-one constraint, and empirical studies report that linear-attention models and other unnormalized variants such as sigmoid-attention models do not develop the same kind of attention sinks even at the hundreds-of-millions-of-parameters scale.^[20] This is sometimes cited as evidence that the sink is a softmax-specific artifact rather than a property of attention as such, and as one of the structural differences that linear-attention designers must keep in mind: the failure mode that the sink relieves in softmax models (heads with nothing to attend to) instead manifests in linear-attention models as state-overwriting interference, which gating and the delta rule attempt to address.^[15]^[20]

Linear attention is one strand of a broader effort to escape the quadratic cost of softmax attention in the Transformer. Adjacent approaches include:

Low-rank attention (Linformer): replaces the N x N matrix with a low-rank approximation, sharing linear scaling but retaining a softmax.^[16]
Sparse attention (sparse attention, BigBird, Longformer): keeps softmax but restricts the set of token pairs attended to.^[8]
Sliding-window attention (sliding window attention): a special case of sparse attention used in Mistral and many recent open models.^[8]
Hardware-aware exact attention (Flash Attention): keeps softmax and quadratic FLOPs but reduces memory and wall-clock by tiling.
State space models (state space model, Mamba, Mamba 2): a parallel family of recurrent sequence models that, via the SSD framework, share a duality with linear attention.^[5]
Linear-attention vision transformers and other domain-specific variants extend the same trick to images and other modalities.^[8]

References

Katharopoulos, Vyas, Pappas, Fleuret, "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention", arXiv preprint and ICML 2020, 2020-06-29. https://arxiv.org/abs/2006.16236. Accessed 2026-06-22. ↩
Yang, Wang, Shen, Panda, Kim, "Gated Linear Attention Transformers with Hardware-Efficient Training", arXiv:2312.06635 and ICML 2024, 2023-12-11. https://arxiv.org/abs/2312.06635. Accessed 2026-06-22. ↩
Schlag, Irie, Schmidhuber, "Linear Transformers Are Secretly Fast Weight Programmers", arXiv:2102.11174 and ICML 2021, 2021-02-22. https://arxiv.org/abs/2102.11174. Accessed 2026-06-22. ↩
Sun, Dong, Huang, Ma, Xia, Xue, Wang, Wei, "Retentive Network: A Successor to Transformer for Large Language Models", arXiv:2307.08621, Microsoft Research, 2023-07-17. https://arxiv.org/abs/2307.08621. Accessed 2026-06-22. ↩
Dao, Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality", arXiv:2405.21060 and ICML 2024, 2024-05-31. https://arxiv.org/abs/2405.21060. Accessed 2026-06-22. ↩
Yang, Kautz, Hatamizadeh, "Gated Delta Networks: Improving Mamba2 with Delta Rule", arXiv:2412.06464 and ICLR 2025, 2024-12-09. https://arxiv.org/abs/2412.06464. Accessed 2026-06-22. ↩
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, "Attention Is All You Need", arXiv:1706.03762, 2017-06-12. https://arxiv.org/abs/1706.03762. Accessed 2026-06-22. ↩
Tay, Dehghani, Bahri, Metzler, "Efficient Transformers: A Survey", arXiv:2009.06732 and ACM Computing Surveys, 2020-09-14. https://arxiv.org/abs/2009.06732. Accessed 2026-06-22. ↩
Schmidhuber, "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks", Neural Computation 4(1):131-139, MIT Press, 1992-01-01. https://direct.mit.edu/neco/article/4/1/131/5620/Learning-to-Control-Fast-Weight-Memories-An. Accessed 2026-06-22. ↩
Choromanski, Likhosherstov, Dohan, Song, Gane, Sarlos, Hawkins, Davis, Mohiuddin, Kaiser, Belanger, Colwell, Weller, "Rethinking Attention with Performers", arXiv:2009.14794 and ICLR 2021, 2020-09-30. https://arxiv.org/abs/2009.14794. Accessed 2026-06-22. ↩
Zhang, Bhatia, Kumbong, Re, "The Hedgehog and the Porcupine: Expressive Linear Attentions with Softmax Mimicry", arXiv:2402.04347 and ICLR 2024, 2024-02-06. https://arxiv.org/abs/2402.04347. Accessed 2026-06-22. ↩
Arora, Eyuboglu, Zhang, Timalsina, Alberti, Zinsley, Zou, Rudra, Re, "Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff", arXiv:2402.18668, 2024-02-28. https://arxiv.org/abs/2402.18668. Accessed 2026-06-22. ↩
Hua, Dai, Liu, Le, "Transformer Quality in Linear Time", arXiv:2202.10447, 2022-02-21. https://arxiv.org/abs/2202.10447. Accessed 2026-06-22. ↩
Yang, Wang, Zhang, Shen, Kim, "Parallelizing Linear Transformers with the Delta Rule over Sequence Length", arXiv:2406.06484 and NeurIPS 2024, 2024-06-10. https://arxiv.org/abs/2406.06484. Accessed 2026-06-22. ↩
Han, Pan, Han, Song, Huang, "Bridging the Divide: Reconsidering Softmax and Linear Attention", arXiv:2412.06590, 2024-12-09. https://arxiv.org/abs/2412.06590. Accessed 2026-06-22. ↩
Wang, Li, Khabsa, Fang, Ma, "Linformer: Self-Attention with Linear Complexity", arXiv:2006.04768, 2020-06-08. https://arxiv.org/abs/2006.04768. Accessed 2026-06-22. ↩
Peng, Alcaide, Anthony et al., "RWKV: Reinventing RNNs for the Transformer Era", arXiv:2305.13048 and EMNLP Findings 2023, 2023-05-22. https://arxiv.org/abs/2305.13048. Accessed 2026-06-22. ↩
MiniMax, "MiniMax-01: Scaling Foundation Models with Lightning Attention", arXiv:2501.08313, 2025-01-14. https://arxiv.org/abs/2501.08313. Accessed 2026-06-22. ↩
Xiao, Tian, Chen, Han, Lewis, "Efficient Streaming Language Models with Attention Sinks", arXiv:2309.17453 and ICLR 2024, 2023-09-29. https://arxiv.org/abs/2309.17453. Accessed 2026-06-22. ↩
Gu, Xiao, Tang et al., "When Attention Sink Emerges in Language Models: An Empirical View", arXiv:2410.10781 and ICLR 2025, 2024-10-14. https://arxiv.org/abs/2410.10781. Accessed 2026-06-22. ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

2 revisions by 1 contributor · full history

Suggest edit

What links here

Best Open-Source LLMs DeepSeek vs Llama vs Qwen Gated DeltaNet Infini-Attention Jet-Nemotron Lightning Attention Long-context language models Multi-Head Self-Attention Qwen3-Next Qwen3.5 Qwen3.6 RNN RWKV-7 (Goose)Retentive Network (RetNet)Test-Time Training (TTT)ThunderKittens Titans (neural architecture)YOCO (You Only Cache Once)xLSTM

What is linear attention in one sentence?

History

Background

How does linear attention reduce the cost from quadratic to linear?

From softmax to feature maps

Causal masking and recurrence

Classical feature maps

Associativity and the kernel trick

Chunkwise parallel form

How is linear attention related to recurrent neural networks?

What are the main variants of linear attention?

Performer (FAVOR+)

Linformer

RWKV

RetNet

Gated Linear Attention (GLA)

DeltaNet and Gated DeltaNet

Mamba 2 and structured state space duality

BASED: balancing recall and throughput

Hedgehog and softmax mimicry

Comparison table

Where is linear attention used in practice?

What is linear attention used for?

What are the limitations of linear attention?

How does linear attention relate to the attention sink phenomenon?

Related Work

See also

References

Improve this article

Related Articles

Multi-head Latent Attention

Multi-Head Self-Attention

Rotary Position Embedding

Self-attention

Cross-attention

Mixture of Depths

What links here

Related Articles

Multi-head Latent Attention

Multi-Head Self-Attention

Rotary Position Embedding

Self-attention

Cross-attention

Mixture of Depths

What links here