See also: Machine learning terms
Attention is a family of techniques in machine learning that allow a model to focus on specific parts of an input while making predictions. Rather than compressing an entire input into a single fixed-size representation, attention mechanisms let the model dynamically weigh the importance of different input elements at each step of computation. This selective focus mirrors, in a loose sense, how biological attention works: irrelevant information is suppressed while relevant information is amplified.
Originally developed for neural machine translation, attention has become a foundational building block across nearly every area of deep learning. It is the core operation inside the Transformer architecture, which underpins large language models such as GPT-4, LLaMA, and Claude, as well as vision models, speech systems, and generative image models. Understanding attention is essential for understanding modern artificial intelligence.
Before attention was introduced, sequence-to-sequence (seq2seq) models for tasks like machine translation relied on an encoder-decoder framework built from recurrent neural networks (RNNs). The encoder processed the source sentence token by token and compressed it into a single fixed-length context vector, which the decoder then used to generate the target sentence. Sutskever, Vinyals, and Le (2014) demonstrated that this approach could achieve strong results with LSTM networks, but the fixed-length bottleneck caused performance to degrade on longer sentences because a single vector could not adequately capture all the information in a long input sequence.
The first widely recognized attention mechanism for neural networks was proposed by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate," published at ICLR 2015. The key insight was to replace the fixed-length context vector with a dynamic one: instead of forcing the encoder to compress the entire source sentence into a single vector, the decoder could look back at all encoder hidden states and select the most relevant ones at each generation step.
Bahdanau attention works as follows. For each decoder time step t, the mechanism computes an alignment score e_{tj} between the previous decoder hidden state s_{t-1} and each encoder hidden state h_j using a learned feedforward network:
e_{tj} = v^T tanh(W_s * s_{t-1} + W_h * h_j)
These scores are normalized through a softmax function to produce attention weights alpha_{tj}. The context vector c_t is then the weighted sum of encoder hidden states:
c_t = sum_j(alpha_{tj} * h_j)
Because the alignment scores are computed using an additive combination passed through a neural network, this variant is often called additive attention. The approach yielded translation quality comparable to the state-of-the-art phrase-based system on English-to-French translation, and crucially, it did not suffer the same degradation on long sentences that earlier encoder-decoder models exhibited.
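The following NumPy sketch illustrates one Bahdanau-style decoding step. The dimensions and random parameters are purely illustrative stand-ins for trained weights; the names W_s, W_h, and v follow the formula above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """Bahdanau-style additive attention for one decoder step.

    s_prev: previous decoder hidden state, shape (d_dec,)
    H:      encoder hidden states, shape (n, d_enc)
    W_s, W_h, v: parameters of the learned alignment network
    Returns the context vector c_t and the attention weights alpha.
    """
    # e_{tj} = v^T tanh(W_s s_{t-1} + W_h h_j), computed for all j at once
    scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v  # shape (n,)
    alpha = softmax(scores)                           # attention weights
    c_t = alpha @ H                                   # weighted sum of encoder states
    return c_t, alpha

# Toy example with random parameters (illustrative sizes only)
rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 5, 8, 8, 16
H = rng.standard_normal((n, d_enc))
s_prev = rng.standard_normal(d_dec)
W_s = rng.standard_normal((d_att, d_dec))
W_h = rng.standard_normal((d_att, d_enc))
v = rng.standard_normal(d_att)
c_t, alpha = additive_attention(s_prev, H, W_s, W_h, v)
print(alpha.round(3), c_t.shape)  # weights sum to 1; c_t has shape (d_enc,)
```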
In 2015, Minh-Thang Luong, Hieu Pham, and Christopher Manning published "Effective Approaches to Attention-based Neural Machine Translation," which introduced several refinements and alternatives to Bahdanau attention. Luong et al. proposed two broad classes of attention: global attention, which considers all source positions at every decoding step, and local attention, which first predicts an aligned source position and then attends only to a small window around it.
Luong attention also introduced multiple scoring functions for computing alignment:
| Scoring function | Formula | Notes |
|---|---|---|
| Dot product | score(s_t, h_j) = s_t^T h_j | Simplest; no extra parameters |
| General | score(s_t, h_j) = s_t^T W_a h_j | Learned weight matrix W_a |
| Concat (additive) | score(s_t, h_j) = v^T tanh(W[s_t; h_j]) | Similar to Bahdanau |
A key implementation difference is that Luong attention uses the current decoder hidden state s_t to compute alignment scores, whereas Bahdanau attention uses the previous state s_{t-1}. Because the dot product and general scoring functions rely on matrix multiplication rather than a feedforward network, Luong attention is sometimes called multiplicative attention and tends to be computationally faster.
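A minimal sketch of the two multiplicative scoring functions, assuming a single decoder step and illustrative shapes; the resulting scores are softmaxed and used to average the encoder states exactly as in Bahdanau attention.

```python
import numpy as np

def luong_scores(s_t, H, W_a=None, mode="dot"):
    """Luong-style multiplicative alignment scores for one decoder step.

    s_t: current decoder hidden state, shape (d,)
    H:   encoder hidden states, shape (n, d)
    W_a: learned (d, d) matrix for the "general" variant
    """
    if mode == "dot":      # score(s_t, h_j) = s_t^T h_j
        return H @ s_t
    if mode == "general":  # score(s_t, h_j) = s_t^T W_a h_j
        return H @ (W_a.T @ s_t)
    raise ValueError(f"unknown mode: {mode}")
```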
The attention mechanism reached its most influential form in the 2017 paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. Published at NeurIPS 2017, this paper introduced the Transformer architecture, which dispenses with recurrence and convolutions entirely and relies solely on attention mechanisms. As of 2025, the paper has been cited more than 173,000 times, making it one of the most cited papers in the history of computer science.
The Transformer achieved 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the previous best by more than 2 BLEU points. On English-to-French, it set a new single-model record of 41.8 BLEU after training for only 3.5 days on eight GPUs.
The Transformer formalized attention using the vocabulary of queries, keys, and values (Q, K, V). The analogy is drawn from information retrieval: a query represents what the model is looking for, keys represent the items available to attend to, and values hold the content that will be retrieved. In self-attention, all three are derived from the same input sequence through learned linear projections:
Q = X * W_Q, K = X * W_K, V = X * W_V
where X is the input matrix (each row is a token embedding) and W_Q, W_K, W_V are learned weight matrices.
The core computation in the Transformer is scaled dot-product attention:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where d_k is the dimensionality of the key vectors.
The formula works in three stages: first, the matrix product Q * K^T computes a similarity score between every query and every key; second, the scores are divided by sqrt(d_k) and passed through a softmax, yielding attention weights that sum to 1 for each query; third, those weights form a weighted sum over the value vectors, producing the output for each position.
Why scale by sqrt(d_k)? When d_k is large, the dot products between queries and keys tend to grow in magnitude. If the individual elements of Q and K are independent random variables with mean 0 and variance 1, then their dot product has mean 0 and variance d_k. Large-magnitude dot products push the softmax function into regions where it has extremely small gradients, slowing or stalling learning. Dividing by sqrt(d_k) normalizes the variance of the dot products back to 1, keeping the softmax in a region with healthier gradients.
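The computation fits in a few lines of NumPy. The sketch below is illustrative rather than optimized; the optional mask argument is an addition (not part of the original formulation) that later sections of this article can use for causal or windowed attention. The short experiment at the end checks the variance argument for the sqrt(d_k) factor with random Q and K.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask: optional boolean (n_q, n_k) array; False entries are blocked.
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # stage 1: scaled similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hide disallowed positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # stage 2: row-wise softmax
    return weights @ V                           # stage 3: weighted sum of values

# Why sqrt(d_k): dot products of unit-variance random vectors have variance d_k
rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 10000, 512))
print(np.var((q * k).sum(-1)))                  # ~512 without scaling
print(np.var((q * k).sum(-1) / np.sqrt(512)))   # ~1 with scaling
```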
Self-attention (also called intra-attention) is the application of the attention mechanism within a single sequence. Each token generates a query, a key, and a value. Every token's query is compared against every other token's key, and the resulting attention weights determine how much each token's value contributes to the output representation at that position.
Self-attention has a critical advantage over recurrent layers: it connects every pair of positions in a sequence with a constant number of operations (O(1) path length), whereas an RNN requires O(n) sequential steps to propagate information from one end of the sequence to the other. This makes self-attention far more effective at capturing long-range dependencies and also far more parallelizable during training.
Rather than performing a single attention computation with full-dimensional queries, keys, and values, the Transformer uses multi-head attention. The idea is to run h attention "heads" in parallel, each operating on a different learned linear projection of Q, K, and V into lower-dimensional subspaces:
head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
Each head can learn to attend to different types of relationships. For instance, one head might focus on syntactic dependencies (e.g., subject-verb agreement) while another captures semantic similarity. In the original Transformer (d_model = 512, h = 8), each head operates on d_k = d_v = 64 dimensional projections. The concatenated outputs are projected back to d_model dimensions through a final weight matrix W_O.
Multi-head attention adds no computational overhead compared to single-head attention with the full dimensionality, because the per-head dimensionality is reduced proportionally. The benefit is representational: multiple heads allow the model to jointly attend to information from different representation subspaces at different positions.
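A sketch of multi-head self-attention, building on the scaled_dot_product_attention function above. It slices the full projected Q, K, V into h chunks rather than keeping per-head weight matrices, which is equivalent to h separate lower-dimensional projections; batching is omitted for clarity.

```python
import numpy as np  # uses scaled_dot_product_attention defined above

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention without batching (shapes illustrative).

    X: (n, d_model); all weight matrices: (d_model, d_model).
    Slicing the projected Q, K, V into h chunks is equivalent to using
    h separate per-head projection matrices.
    """
    d_k = X.shape[1] // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = [scaled_dot_product_attention(Q[:, i*d_k:(i+1)*d_k],
                                          K[:, i*d_k:(i+1)*d_k],
                                          V[:, i*d_k:(i+1)*d_k])
             for i in range(h)]                   # h parallel attention heads
    return np.concatenate(heads, axis=-1) @ W_O  # concat, then final projection
```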
Cross-attention (also called encoder-decoder attention) is a variant where the queries come from one sequence and the keys and values come from a different sequence. In the original Transformer, the decoder generates queries from its own hidden states, while the keys and values come from the encoder's output. This allows each position in the decoder to attend to all positions in the encoder, enabling the model to align source and target representations.
Cross-attention is the mechanism that bridges two different modalities or sequences. In machine translation, when the decoder generates the French word "chat," cross-attention allows it to focus heavily on the encoder representation of the English word "cat." The same principle applies beyond text: in text-to-image diffusion models like Stable Diffusion, cross-attention connects the text prompt (providing keys and values from a text encoder) with the image latent representation (providing queries), enabling the generated image to reflect the textual description.
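Reusing the same function, cross-attention is simply attention where Q comes from one sequence and K, V from another. In this illustrative snippet the learned projections are omitted and the shapes are arbitrary.

```python
# Cross-attention: queries from one sequence, keys/values from another
# (learned projections omitted; reuses scaled_dot_product_attention above).
rng = np.random.default_rng(0)
text = rng.standard_normal((7, 64))      # e.g., 7 prompt-token embeddings
latents = rng.standard_normal((16, 64))  # e.g., 16 image-latent positions
out = scaled_dot_product_attention(latents, text, text)
print(out.shape)  # (16, 64): each latent position mixes text values
```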
Standard self-attention has time and memory complexity of O(n^2 * d), where n is the sequence length and d is the model dimension. The quadratic dependence on n arises because every token must compute attention scores against every other token, producing an n x n attention matrix. For short sequences (a few hundred tokens), this is manageable. But for long documents, high-resolution images, or genomic sequences that may span tens of thousands to millions of positions, the O(n^2) cost becomes a serious bottleneck.
This quadratic cost has motivated a large body of research into efficient attention variants. Approaches fall into several categories:
| Strategy | Examples | Complexity | Trade-off |
|---|---|---|---|
| Sparse attention | Longformer, BigBird, Sparse Transformer | O(n * sqrt(n)) or O(n * w) | Restricts attention to subsets of tokens |
| Sliding window | Mistral, Longformer local | O(n * w) | Only nearby tokens attend to each other |
| Linear attention | Linear Transformer, RWKV | O(n * d^2) | Approximates softmax via kernel features |
| Low-rank approximation | Linformer, Performer | O(n * k * d) | Projects keys/values to lower dimension |
| State space models | Mamba, S4, S5 | O(n * d) | Replaces attention with recurrence |
| IO-aware optimization | FlashAttention | O(n^2 * d) (exact) | Same result, fewer memory transfers |
| Distributed sequence parallelism | Ring Attention | O(n^2 * d / p) per device | Splits across p devices along sequence dim |
FlashAttention, introduced by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re in their 2022 paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," addresses the quadratic memory cost of attention not by approximating the computation but by rethinking how it interacts with GPU hardware.
The key insight is that standard attention implementations are bottlenecked not by arithmetic operations but by memory transfers between GPU high-bandwidth memory (HBM) and the on-chip SRAM (the GPU's fast but small cache). Standard implementations materialize the full n x n attention matrix in HBM, which requires O(n^2) memory reads and writes.
FlashAttention avoids materializing the full attention matrix by using a tiling strategy: it splits Q, K, and V into blocks, loads each block from HBM into SRAM, computes the attention for that block in fast on-chip memory, and writes only the final output back to HBM. A carefully designed online softmax normalization algorithm allows blocks to be processed incrementally without ever needing the full attention matrix in memory.
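The following NumPy sketch shows the online-softmax math, not the CUDA kernel: it streams over KV blocks while maintaining only a running row-max, a running normalizer, and a partial output, and it reproduces the direct computation exactly.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Tiled attention with online softmax (the math behind FlashAttention).

    K and V are processed in blocks; only a running row-max m, running
    normalizer l, and partial output O are kept, so the full n x n score
    matrix is never materialized.
    """
    d = Q.shape[-1]
    O = np.zeros((Q.shape[0], V.shape[-1]))
    m = np.full(Q.shape[0], -np.inf)
    l = np.zeros(Q.shape[0])
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j+block].T / np.sqrt(d)   # scores for this KV block only
        m_new = np.maximum(m, S.max(axis=1))
        corr = np.exp(m - m_new)              # rescales earlier partial sums
        P = np.exp(S - m_new[:, None])
        O = O * corr[:, None] + P @ V[j:j+block]
        l = l * corr + P.sum(axis=1)
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 256, 32))
print(np.allclose(blockwise_attention(Q, K, V),
                  scaled_dot_product_attention(Q, K, V)))  # True: exact
```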
The result is exact attention (not an approximation) that uses O(n) memory instead of O(n^2) and achieves wall-clock speedups of 2 to 4 times over standard implementations. FlashAttention enabled a 15% end-to-end speedup on BERT-large training and a 3x speedup on GPT-2 with sequence length 1024.
FlashAttention-2 (Tri Dao, 2023) introduced further optimizations: it reduces the number of non-matmul FLOPs, parallelizes attention across the sequence-length dimension in addition to the batch and head dimensions, and partitions work between the warps of each thread block to cut shared-memory reads and writes.
These changes yielded roughly a 2x speedup over the original FlashAttention, reaching 50 to 73% of the theoretical maximum FLOPs/s on NVIDIA A100 GPUs.
FlashAttention-3 (Tri Dao and Jay Shah, 2024) targets NVIDIA Hopper architecture GPUs (H100). It exploits three hardware-specific features: the asynchrony of the Tensor Cores and the Tensor Memory Accelerator (TMA), which lets warp-specialized kernels overlap computation with data movement; interleaving of block-wise matmul and softmax operations, so that the non-matmul softmax work hides behind matrix multiplies; and hardware support for FP8 low precision, combined with block quantization and incoherent processing to control numerical error.
FlashAttention-3 achieves up to 840 TFLOPs/s in BF16 on H100 (85% utilization), roughly 1.5 to 2x faster than FlashAttention-2. In FP8 mode, it reaches 1.3 PFLOPs/s while producing 2.6x lower numerical error than a baseline FP8 attention implementation. The paper was published at NeurIPS 2024.
FlashAttention-4 (Zadouri, Shah, Hohnerbach, Liu, Thakkar, and Dao, 2026) extends the FlashAttention line to NVIDIA Blackwell GPUs (B200) and is published at MLSys 2026. The central challenge it addresses is asymmetric hardware scaling: from H100 to B200, BF16 tensor core throughput grows from 1 to 2.25 PFLOPs/s, while special function units for computing exponentials and shared memory bandwidth remain static. FlashAttention-4 overcomes this bottleneck through several innovations, among them a software implementation of the exponential function that relieves pressure on the special function units and a ping-pong scheduling scheme that keeps the tensor cores occupied (see the version comparison table below).
Written in CuTe-DSL (CUTLASS's Python kernel DSL), FlashAttention-4 achieves up to 1,605 TFLOPs/s on B200 in BF16 (71% utilization), which is 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton.
| Version | Year | Target GPU | Peak throughput | Utilization | Key innovation |
|---|---|---|---|---|---|
| FlashAttention | 2022 | A100 | N/A | ~35% | IO-aware tiling |
| FlashAttention-2 | 2023 | A100 | N/A | 50-73% | Better parallelism, warp partitioning |
| FlashAttention-3 | 2024 | H100 | 840 TFLOPs/s (BF16) | ~85% | Async execution, FP8, warp specialization |
| FlashAttention-4 | 2026 | B200 | 1,605 TFLOPs/s (BF16) | ~71% | Software exponential, ping-pong scheduling |
Multi-Query Attention was proposed by Noam Shazeer in his 2019 paper "Fast Transformer Decoding: One Write-Head Is All You Need." The central observation is that during autoregressive inference, the primary performance bottleneck on modern accelerators is the memory bandwidth required to load the key-value (KV) cache, not the arithmetic computation itself.
MQA addresses this by having all query heads share a single set of key and value projections. Instead of h independent key and value heads (as in standard multi-head attention), there is just one key head and one value head. Each query head still has its own projection, so the model retains h different query perspectives, but the KV cache is reduced by a factor of h.
In practice, MQA speeds up inference decoding substantially with only a small quality degradation. It was adopted in models such as PaLM (Google, 2022) and Falcon (TII, 2023).
Grouped-Query Attention, introduced by Ainslie et al. (2023), is a compromise between standard multi-head attention (MHA) and multi-query attention (MQA). Instead of sharing a single KV head across all query heads (MQA) or having unique KV heads for every query head (MHA), GQA divides the query heads into g groups, where each group shares one set of key and value projections.
| Variant | Query heads | KV heads | KV cache size |
|---|---|---|---|
| Multi-Head Attention (MHA) | h | h | h * d_k * 2 * n |
| Multi-Query Attention (MQA) | h | 1 | 1 * d_k * 2 * n |
| Grouped-Query Attention (GQA) | h | g | g * d_k * 2 * n |
When g = h, GQA reduces to MHA. When g = 1, GQA reduces to MQA. By choosing an intermediate g, GQA achieves most of the inference speed benefits of MQA while maintaining quality closer to MHA. Meta adopted GQA with 8 KV groups in Llama 2 70B (2023), and it has since become the default attention variant in most production large language models.
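A sketch of GQA under assumed per-head input shapes, reusing the scaled_dot_product_attention function from earlier; setting g = h or g = 1 recovers MHA and MQA, respectively.

```python
import numpy as np  # reuses scaled_dot_product_attention from above

def grouped_query_attention(Q, K, V, g):
    """GQA with h query heads sharing g KV heads (shapes illustrative).

    Q: (n, h, d_k); K, V: (n, g, d_k). g = h gives MHA, g = 1 gives MQA.
    """
    n, h, d_k = Q.shape
    group_size = h // g
    heads = [scaled_dot_product_attention(Q[:, i],
                                          K[:, i // group_size],
                                          V[:, i // group_size])
             for i in range(h)]            # each head reads its group's KV
    return np.concatenate(heads, axis=-1)  # (n, h * d_k), before output projection
```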
Multi-Head Latent Attention, introduced in DeepSeek-V2 (2024), takes a fundamentally different approach to KV cache reduction. Rather than reducing the number of KV heads (as in MQA and GQA), MLA compresses the key and value representations into a low-dimensional latent vector before caching. At inference time, the compressed representation is projected back to produce unique keys and values for each head.
The mechanism works as follows. Given an input token x_n, MLA first compresses it into a latent representation:
c^{KV}_n = W^{DKV} * x_n
where W^{DKV} is a down-projection matrix that maps the model dimension d to a much smaller latent dimension d_c. This compact vector c^{KV} is stored in the KV cache instead of the full key and value vectors. When attention is computed, separate up-projection matrices W^{UK} and W^{UV} reconstruct unique keys and values for each head:
K = W^{UK} * C^{KV}, V = W^{UV} * C^{KV}
A key challenge is compatibility with rotary positional embeddings (RoPE). Standard RoPE entangles positional information with content information, which would prevent the "absorption trick" that lets MLA fold the up-projection matrices into the query projection and avoid actually decompressing the KV cache during inference. DeepSeek solved this with decoupled RoPE: separate query and key vectors are introduced specifically for positional encoding, keeping the main latent keys isolated from rotation matrices.
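The compression path can be sketched as follows. Names such as W_DKV mirror the formulas above, but the dimensions are illustrative, and both decoupled RoPE and the inference-time absorption trick are omitted.

```python
import numpy as np

def mla_compress_and_expand(x, W_DKV, W_UK, W_UV):
    """Multi-head latent attention, KV-compression side only (a sketch).

    x: (n, d_model) token representations.
    W_DKV: (d_c, d_model) down-projection; only C (n, d_c) is cached.
    W_UK, W_UV: (h * d_k, d_c) up-projections applied at attention time.
    """
    C = x @ W_DKV.T  # latent cache: n * d_c values vs. n * 2 * h * d_k
    K = C @ W_UK.T   # (n, h * d_k): unique per-head keys, reshaped by caller
    V = C @ W_UV.T   # (n, h * d_k): unique per-head values
    return C, K, V
```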
In DeepSeek-V2, MLA achieved a 93.3% reduction in KV cache size compared to standard MHA while maintaining (and sometimes exceeding) model quality, and it increased maximum generation throughput by 5.76 times. DeepSeek-V3 uses d_h = 128, H = 128, and d_c = 512, giving a compression ratio of 32. MLA was also used in DeepSeek-V3 and DeepSeek-R1, and subsequent research (TransMLA, 2025) has explored enabling MLA in any Transformer-based LLM.
Differential attention, introduced by Ye et al. at Microsoft Research and Tsinghua University in their 2024 paper "Differential Transformer" (ICLR 2025 Oral), rethinks how attention scores are computed. The core idea is to calculate attention as the difference between two separate softmax attention maps, which cancels out shared noise and promotes sparser, more focused attention patterns.
The mechanism partitions the query and key projections into two groups and computes two independent softmax distributions:
DiffAttn(X) = (softmax(Q_1 * K_1^T / sqrt(d)) - lambda * softmax(Q_2 * K_2^T / sqrt(d))) * V
where [Q_1; Q_2] = X * W^Q and [K_1; K_2] = X * W^K. The scalar lambda is a learnable parameter re-parameterized as lambda = exp(lambda_q1 * lambda_k1) - exp(lambda_q2 * lambda_k2) + lambda_init, with lambda_init set by layer depth (about 0.2 at the first layer, increasing toward 0.8 in deeper layers).
The subtraction acts as noise cancellation. In standard attention, many tokens receive small but non-negligible attention weight, diluting the signal. Differential attention subtracts these common noise patterns, causing attention to concentrate on genuinely relevant tokens. The authors demonstrated that at a 64K context length, Diff Transformer allocated 0.27 to 0.40 normalized attention to answer-relevant spans in retrieval tasks, compared to just 0.03 for a standard Transformer.
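A single-head sketch of the published formula, with lambda passed as a plain scalar rather than the learned re-parameterization, and shapes kept illustrative.

```python
import numpy as np

def diff_attention(X, W_Q, W_K, W_V, lam):
    """Differential attention, single head (a sketch of the formula above).

    The projected queries and keys are split in half into (Q1, Q2) and
    (K1, K2); the output is the difference of two softmax attention maps
    applied to the same values.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1] // 2
    Q1, Q2 = Q[:, :d], Q[:, d:]
    K1, K2 = K[:, :d], K[:, d:]
    def sm(S):  # row-wise softmax
        e = np.exp(S - S.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    A1 = sm(Q1 @ K1.T / np.sqrt(d))
    A2 = sm(Q2 @ K2.T / np.sqrt(d))
    return (A1 - lam * A2) @ V  # shared noise in A1 and A2 cancels out
```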
Experimental results across model sizes from 830M to 13.1B parameters showed consistent improvements. A 6.8B Diff Transformer matched the validation loss of an 11B standard Transformer, requiring only 62.2% of the parameters. On downstream tasks, the mechanism improved hallucination detection accuracy (53% vs. 44% on XSum summarization), in-context learning performance (21.6% gain on 150-class classification), and key information retrieval from long contexts.
Native Sparse Attention (NSA), introduced by DeepSeek in February 2025, is a hardware-aligned sparse attention mechanism designed to be natively trainable end-to-end. Unlike earlier sparse attention methods that used fixed or heuristic patterns, NSA learns which tokens to attend to during pretraining and is specifically designed to exploit modern GPU memory hierarchies. The paper won a Best Paper award at ACL 2025.
NSA processes input sequences through three parallel attention branches that are combined via learned gating scores: a compression branch that aggregates consecutive tokens into coarse-grained block summaries, a selection branch that retains only the most relevant fine-grained token blocks, and a sliding window branch that covers local context.
The kernel design achieves near-optimal arithmetic intensity through group-centric query loading, shared KV fetching across heads, and Triton-based grid scheduling. On 64K-length sequences, NSA achieves 9.0x forward speedup, 6.0x backward speedup, and 11.6x decoding speedup over full attention. Despite this efficiency, NSA matched or exceeded full attention quality on general benchmarks (average score 0.456 vs. 0.443), long-context tasks (LongBench 0.469 vs. 0.437), and chain-of-thought reasoning.
Sliding window attention restricts each token's attention to a fixed local window of w neighboring tokens rather than the full sequence. This reduces the per-layer complexity from O(n^2) to O(n * w), where w is typically much smaller than n.
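The window itself is just a banded causal mask. The sketch below builds it in NumPy, in a form usable as the mask argument of the scaled_dot_product_attention sketch earlier in this article.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask where token i may attend to tokens (i - w + 1) .. i.

    Combines causality with a local window of size w.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

print(sliding_window_mask(5, 2).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [0 1 1 0 0]
#  [0 0 1 1 0]
#  [0 0 0 1 1]]
```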
Mistral 7B (Mistral AI, 2023) popularized sliding window attention with a window size of 4,096 tokens. A crucial insight is that stacking multiple layers of sliding window attention extends the effective receptive field: at layer k, a token can indirectly attend to information from up to k * w positions away. With 32 layers and w = 4,096, Mistral's theoretical attention span reaches approximately 131,000 tokens. Mistral also uses a rolling buffer KV cache limited to w entries, which halves cache memory requirements for sequence lengths of 8,192 compared to a full cache.
More recent models have adopted hybrid architectures that interleave sliding window layers with full (global) attention layers. Gemma 2 (Google, 2024) used a 1:1 ratio of local and global attention layers with a 4,096-token window. Gemma 3 (Google, 2025) shifted to a 5:1 ratio (five local layers for every one global layer) and reduced the window to just 1,024 tokens. This design slashes attention computation by roughly 5x and trims KV cache memory from about 60% of total memory to approximately 15%, while still supporting 128K context lengths through RoPE frequency rescaling on the global layers.
Ring Attention (Liu, Zaharia, and Abbeel, ICLR 2024) is a distributed sequence parallelism technique that enables processing of extremely long sequences by splitting them across multiple devices arranged in a ring topology. Rather than requiring each device to hold the full KV pairs for the entire sequence, Ring Attention distributes the sequence into blocks, with each device responsible for one block's queries.
The mechanism works by overlapping communication and computation. Each device computes blockwise attention between its local query block and a visiting KV block, while simultaneously sending that KV block to the next device in the ring and receiving the next KV block from the previous device. Because block computation takes longer than block transfers, the communication is fully hidden, adding no overhead compared to standard attention.
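The following single-process simulation sketches the ring schedule; the overlap of communication and computation is not modeled. Each simulated device updates its running online-softmax state (the same update as in the blockwise_attention sketch above) against the visiting KV block, after which the blocks rotate one position.

```python
import numpy as np  # reuses scaled_dot_product_attention from above

def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Single-process simulation of Ring Attention across p 'devices'."""
    p = len(Q_blocks)
    d = Q_blocks[0].shape[-1]
    states = [[np.zeros_like(Qp, dtype=float),       # partial output O
               np.full(Qp.shape[0], -np.inf),        # running row-max m
               np.zeros(Qp.shape[0])]                # running normalizer l
              for Qp in Q_blocks]
    kv = list(zip(K_blocks, V_blocks))
    for _ in range(p):                    # p rotations visit every KV block
        for dev in range(p):
            Kb, Vb = kv[dev]
            O, m, l = states[dev]
            S = Q_blocks[dev] @ Kb.T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=1))
            corr = np.exp(m - m_new)
            P = np.exp(S - m_new[:, None])
            states[dev] = [O * corr[:, None] + P @ Vb,
                           m_new,
                           l * corr + P.sum(axis=1)]
        kv = kv[1:] + kv[:1]              # rotate KV blocks around the ring
    return np.concatenate([O / l[:, None] for O, m, l in states])

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 128, 16))
Qb, Kb, Vb = (np.split(a, 4) for a in (Q, K, V))  # 4 simulated devices
print(np.allclose(ring_attention(Qb, Kb, Vb),
                  scaled_dot_product_attention(Q, K, V)))  # True
```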
Ring Attention enables training and inference on sequences that are up to p times longer than what a single device can handle, where p is the number of devices. On 32 A100 GPUs, a 7B model can process over 1 million tokens (a 32x improvement), and on TPUv4-1024, a 3B model can train with 16 million tokens (a 512x increase over prior methods).
The Longformer (Beltagy, Peters, and Cohan, 2020) combines local sliding window attention with global attention on a small number of designated tokens. The local window handles nearby context efficiently, while the global tokens (for example, the [CLS] token in classification tasks) can attend to and be attended by all positions in the sequence. The paper also introduced dilated sliding window attention, which increases the receptive field by attending to every k-th token within the window rather than consecutive tokens. Longformer scales linearly with sequence length and was pretrained to handle up to 4,096 tokens.
BigBird (Zaheer et al., 2020, NeurIPS) extends the Longformer approach by adding random attention to the combination of local and global attention. Each token attends to r randomly selected tokens in addition to its local window and the global tokens. The authors proved theoretically that BigBird's sparse attention pattern is a universal approximator of sequence functions and is Turing complete, preserving the theoretical expressiveness of full attention. BigBird handles sequences up to 8x longer than what standard Transformers can process on the same hardware.
Katharopoulos et al. (2020) proposed linear attention in their paper "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." The core idea is to replace the softmax in standard attention with a decomposable kernel function. Standard attention computes:
Attention(Q, K, V) = softmax(Q * K^T) * V
Linear attention rewrites this using feature maps phi:
LinearAttention(Q, K, V) = (phi(Q) * (phi(K)^T * V)) / (phi(Q) * sum(phi(K)^T))
By applying the associative property of matrix multiplication, the computation avoids materializing the n x n attention matrix entirely. The term phi(K)^T * V produces a d x d matrix (independent of sequence length n), reducing the complexity from O(n^2 * d) to O(n * d^2). For typical model dimensions where d is much smaller than n, this is a significant improvement.
Katharopoulos et al. used phi(x) = elu(x) + 1 as the feature map. An added benefit is that this formulation admits a recurrent implementation, drawing a direct connection between Transformers and RNNs and enabling efficient autoregressive generation.
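A sketch of linear attention with that feature map; note that the n x n attention matrix never appears, only the d x d term phi(K)^T V.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Linear attention with the elu(x) + 1 feature map (a sketch).

    Associativity lets us precompute phi(K)^T V, a (d, d_v) matrix
    independent of sequence length, giving O(n * d^2) total cost.
    """
    def phi(x):  # elu(x) + 1, elementwise and strictly positive
        return np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                # (d, d_v), independent of n
    Z = Qf @ Kf.sum(axis=0)      # (n,) normalizer: phi(q_i) . sum_j phi(k_j)
    return (Qf @ KV) / Z[:, None]
```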
Structured state space models (SSMs) represent a fundamentally different approach to sequence modeling that avoids the attention mechanism altogether. Mamba (Gu and Dao, 2023) is the most prominent example. While traditional SSMs use fixed parameters, Mamba makes the state transition matrices input-dependent ("selective"), allowing the model to dynamically decide what information to retain and what to discard, analogous to the gating mechanisms in LSTMs.
Mamba achieves linear-time complexity O(n) in sequence length, enjoys 5x higher inference throughput than Transformers, and scales to million-length sequences. The Mamba-3B model matches or outperforms Transformers of twice its size on language modeling benchmarks. However, recent research suggests that SSMs and attention have complementary strengths: attention excels at precise recall from context ("needle in a haystack" tasks), while SSMs excel at compression and efficiency for long sequences. Hybrid architectures like Jamba (AI21 Labs, 2024) combine both, merging Transformer, Mamba, and Mixture-of-Experts layers to achieve performance comparable to Llama 2 70B with 2 to 7x longer context windows and 3x higher throughput.
During autoregressive generation (producing one token at a time), a language model must compute attention over all previously generated tokens. Without optimization, this means recomputing the key and value projections for every past token at every generation step, leading to redundant computation that grows quadratically with sequence length.
The KV cache solves this by storing the key and value vectors from all previous time steps. At each new generation step, only the key and value for the new token need to be computed and appended to the cache. The query for the new token then attends over all cached keys and values. This eliminates the redundant projections (only the new token is projected, rather than all n past tokens), though the attention computation itself still requires O(n * d) work per step.
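A single decoding step with a cache might look like the following single-head sketch; weights, dimensions, and the decode_step name are illustrative.

```python
import numpy as np

def decode_step(x_new, W_K, W_V, W_Q, cache_K, cache_V):
    """One autoregressive step with a KV cache (single head, a sketch).

    Only the new token's projections are computed; past keys and values
    come from the cache, which grows by one entry per generated token.
    """
    k = x_new @ W_K                    # (d,) projections for the new token only
    v = x_new @ W_V
    q = x_new @ W_Q
    cache_K = np.vstack([cache_K, k])  # append to the cache
    cache_V = np.vstack([cache_V, v])
    scores = cache_K @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ cache_V, cache_K, cache_V

d = 16
rng = np.random.default_rng(0)
W_K, W_V, W_Q = rng.standard_normal((3, d, d))
cache_K, cache_V = np.zeros((0, d)), np.zeros((0, d))
for x in rng.standard_normal((5, d)):  # five decode steps
    out, cache_K, cache_V = decode_step(x, W_K, W_V, W_Q, cache_K, cache_V)
print(cache_K.shape)  # (5, 16): one cached key per generated token
```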
The main challenge is memory: the KV cache grows linearly with sequence length, model width, and batch size. For a large model like Llama 2 70B with a context window of 4,096 tokens, the KV cache alone can consume tens of gigabytes of GPU memory. Several strategies address this:
| Strategy | Mechanism | Typical reduction |
|---|---|---|
| Multi-query / grouped-query attention | Reduce number of KV heads | KV cache reduced by factor of h (MQA) or h/g (GQA) |
| Multi-head latent attention (MLA) | Compress KV into low-rank latent vector | 93.3% cache reduction (DeepSeek-V2) |
| KV cache quantization | Store cached keys/values in FP8 or INT4 | 2-4x memory reduction |
| PagedAttention | Virtual-memory-style non-contiguous cache blocks | Waste reduced from 60-80% to under 4% |
| Sliding window caches | Limit cache to fixed window size | Bounded memory regardless of sequence length |
| Token eviction and compression | Selectively remove or merge less important cached tokens | Variable, task-dependent |
PagedAttention (Kwon et al., 2023), used in the vLLM serving framework, deserves special mention. It borrows ideas from operating system virtual memory to manage cache memory in non-contiguous blocks, reducing fragmentation. Standard implementations waste 60 to 80% of KV cache memory due to fragmentation; PagedAttention reduces waste to under 4% and improves serving throughput by 2 to 4x.
The Vision Transformer (ViT), introduced by Dosovitskiy et al. in their 2020 paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ICLR 2021), demonstrated that a pure Transformer applied directly to sequences of image patches can match or exceed the performance of state-of-the-art convolutional neural networks (CNNs) on image classification.
ViT works by splitting an image into fixed-size patches (typically 16x16 pixels), flattening each patch into a vector, projecting it to the model dimension through a linear embedding, prepending a learnable [CLS] token, and adding positional embeddings. The resulting sequence of patch embeddings is then processed by a standard Transformer encoder with multi-head self-attention.
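The patch-extraction step is simple array reshaping. The sketch below produces the 196-token sequence for a 224 x 224 RGB image with 16 x 16 patches; the learned linear embedding, [CLS] token, and positional embeddings would follow.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns (num_patches, patch * patch * C): the token sequence a ViT
    embeds linearly before prepending [CLS] and adding positions.
    """
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14 x 14 patches of 16 x 16 x 3 values
```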
Self-attention in ViT allows every patch to attend to every other patch, enabling the model to capture global relationships across the entire image from the very first layer. In contrast, CNNs build global understanding only gradually through stacked local convolution layers. ViT attention maps can be visualized as heatmaps showing which image regions each patch attends to most strongly, revealing that the model often learns to focus on semantically meaningful regions (e.g., the face of an animal, the outline of an object).
ViT has since spawned many variants, including DeiT (Data-efficient Image Transformers, Touvron et al., 2021), the Swin Transformer (Liu et al., 2021), which uses shifted window attention for efficiency, and BEiT (Bao et al., 2021).
Modern text-to-image diffusion models like Stable Diffusion, DALL-E, and Imagen rely heavily on attention mechanisms within their denoising U-Net or Transformer backbone.
Two types of attention are used:
Self-attention operates within the image latent representations at multiple spatial resolutions (e.g., 64x64, 32x32, 16x16). It allows the model to capture global spatial coherence, ensuring that distant parts of the image remain consistent (e.g., the lighting direction is the same on both sides of a scene). Self-attention in diffusion models primarily preserves geometric and structural details.
Cross-attention connects the text prompt to the image generation process. Text embeddings (produced by a text encoder such as CLIP or T5) provide the keys and values, while the image latent features provide the queries. This mechanism controls which regions of the image correspond to which words in the prompt. For example, cross-attention ensures that the concept "red" is spatially aligned with "apple" rather than with "table" when generating an image from the prompt "a red apple on a table."
Researchers have leveraged attention maps in diffusion models for various applications, including prompt-to-prompt image editing (manipulating cross-attention maps to edit specific objects), attention-based layout control, and interpretability analysis.
The following table summarizes major attention variants, their key characteristics, and the models that adopt them.
| Variant | Year | Authors | Key idea | Complexity | Notable models |
|---|---|---|---|---|---|
| Additive (Bahdanau) | 2014 | Bahdanau, Cho, Bengio | Learned alignment via feedforward network | O(n * m) | Early NMT systems |
| Multiplicative (Luong) | 2015 | Luong, Pham, Manning | Dot-product and general scoring functions | O(n * m) | Early NMT systems |
| Scaled dot-product | 2017 | Vaswani et al. | Q, K, V formulation with sqrt(d_k) scaling | O(n^2 * d) | Transformer, BERT, GPT |
| Multi-head attention | 2017 | Vaswani et al. | Parallel attention in h subspaces | O(n^2 * d) | All Transformers |
| Sparse (Longformer) | 2020 | Beltagy et al. | Local + global + dilated windows | O(n * w) | Longformer |
| Sparse (BigBird) | 2020 | Zaheer et al. | Local + global + random | O(n * (w + r + g)) | BigBird |
| Linear attention | 2020 | Katharopoulos et al. | Kernel feature map, avoid n x n matrix | O(n * d^2) | Linear Transformer |
| Multi-query (MQA) | 2019 | Shazeer | Single shared KV head | O(n^2 * d) | PaLM, Falcon |
| Grouped-query (GQA) | 2023 | Ainslie et al. | g shared KV groups | O(n^2 * d) | Llama 2, Mistral |
| FlashAttention | 2022 | Dao et al. | IO-aware tiling, exact, O(n) memory | O(n^2 * d) compute, O(n) memory | Widely adopted |
| FlashAttention-2 | 2023 | Dao | Better parallelism and warp partitioning | O(n^2 * d) compute, O(n) memory | Widely adopted |
| FlashAttention-3 | 2024 | Dao, Shah | Hopper async, FP8, warp specialization | O(n^2 * d) compute, O(n) memory | H100 deployments |
| FlashAttention-4 | 2026 | Zadouri, Shah, Dao et al. | Blackwell pipelining, software exponential | O(n^2 * d) compute, O(n) memory | B200 deployments |
| Sliding window | 2023 | Mistral AI | Fixed local window with rolling KV cache | O(n * w) | Mistral 7B, Gemma 3 |
| Multi-head latent (MLA) | 2024 | DeepSeek | Latent compression of KV cache | O(n^2 * d) | DeepSeek-V2, V3, R1 |
| Differential attention | 2024 | Ye et al. (Microsoft) | Difference of two softmax maps | O(n^2 * d) | Diff Transformer |
| Native sparse (NSA) | 2025 | DeepSeek | Trained sparse: compress + select + slide | O(n * (n/r + k + w)) | DeepSeek (research) |
| Ring Attention | 2024 | Liu, Zaharia, Abbeel | Distributed blockwise ring topology | O(n^2 * d / p) per device | Long-context training |
| Selective SSM (Mamba) | 2023 | Gu, Dao | Input-dependent state space model | O(n * d) | Mamba, Jamba |
Attention mechanisms have been adopted across virtually every domain of machine learning: natural language processing, computer vision, speech recognition and synthesis, text-to-image generation, reinforcement learning, recommender systems, and computational biology, where AlphaFold applies attention to protein structure prediction.
One practical advantage of attention mechanisms is that attention weights can be inspected to gain insight into what the model is focusing on. Attention maps can be visualized as heatmaps, where brighter entries indicate stronger attention between two positions.
However, researchers have cautioned against over-interpreting attention weights. Jain and Wallace (2019) showed that attention weights often do not correlate well with other measures of feature importance and that alternative attention distributions can produce identical predictions. Attention weights indicate how information flows through the network but do not necessarily indicate which inputs are causally important for the output. More rigorous interpretability methods, such as probing classifiers and gradient-based attribution, are typically needed to draw reliable conclusions.
Imagine you are in a classroom and the teacher asks a question. You look around the room for clues. Some classmates are whispering the answer, some are drawing on their notebooks, and some are looking out the window. Attention is like choosing to listen more closely to the classmates who seem to know the answer and ignoring the ones looking out the window. In machine learning, attention lets the computer do something similar: it decides which pieces of information are most helpful for the task at hand and pays more attention to those, while downplaying the rest.