Attention
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v10 · 8,060 words
See also: Machine learning terms
This article gives a high-level overview of attention as a family of mechanisms in machine learning. For deeper treatments, see the dedicated pages on [[self_attention]], [[multi-head_self-attention]], [[cross_attention]], [[bahdanau_attention]], [[attention_is_all_you_need]], [[flash_attention]], [[grouped_query_attention]], [[multi-head_latent_attention]], and [[paged_attention]].
Attention is a family of techniques in machine learning that allow a model to focus on specific parts of an input while making predictions.[^1] Rather than compressing an entire input into a single fixed-size representation, attention mechanisms let the model dynamically weigh the importance of different input elements at each step of computation. This selective focus mirrors, in a loose sense, how biological attention works: irrelevant information is suppressed while relevant information is amplified.[^2]
Mathematically, attention can be viewed as weighted aggregation: given a query that describes what is being looked for and a set of (key, value) pairs that describe what is available, the mechanism produces a weighted sum of the values, with weights derived from how well each key matches the query.[^3] This simple operation — soft, differentiable lookup — has proven extraordinarily expressive when stacked into deep networks.
Originally developed in 2014 for neural machine translation, attention has become the foundational building block of modern deep learning.[^4] It is the core operation inside the [[transformer]] architecture introduced in 2017, which underpins large language models such as the [[gpt]] series, [[llama]], [[claude]], and [[deepseek]] models, as well as Vision Transformers, diffusion models for image and video generation, and protein-structure systems such as [[alphafold]]. Understanding attention is essential for understanding modern artificial intelligence.
Although the term "attention" in machine learning is a metaphor rather than a literal model of brain function, the concept is deeply rooted in cognitive psychology and neuroscience. Selective attention — the ability to focus mental resources on a subset of available stimuli — has been studied empirically since at least the 1950s. The classic "cocktail party effect" described by Colin Cherry (1953) demonstrated that listeners can selectively attend to one conversation in a crowded room while suppressing others, providing one of the earliest experimental frameworks for attention research.[^5]
A particularly influential theoretical framework was feature integration theory, introduced by Anne Treisman and Garry Gelade in their 1980 paper "A Feature-Integration Theory of Attention" in Cognitive Psychology.[^6] Treisman and Gelade proposed that visual processing proceeds in two stages: a parallel, pre-attentive stage in which simple features (color, orientation, motion) are detected automatically across the visual field, followed by an attentive stage in which focal attention "binds" these features into coherent object representations. Without attention, the theory predicts that features can become incorrectly conjoined, producing illusory conjunctions — for example, perceiving a red O and a green X as a green O and a red X. Their experimental findings supported the role of focused attention as a binding mechanism for object perception.
In neuroscience, attention is associated with networks involving the prefrontal cortex and the parietal cortex, particularly the dorsal attention network and the ventral attention network identified by Maurizio Corbetta and Gordon Shulman (2002).[^7] Top-down attention is directed by goals and expectations, while bottom-up attention is captured by salient stimuli. Computational models of biological attention — such as the saliency maps of Itti, Koch, and Niebur (1998)[^8] — predate machine learning attention and inspired some early work on visual attention in deep networks.
It is important to note that machine-learning attention does not closely model these biological mechanisms. The shared name is largely metaphorical: in both cases, a limited resource is allocated selectively over inputs, but the underlying mathematics and biology differ substantially. Still, the cognitive-science framing has motivated several design choices, including the idea that attention should be soft (continuous, differentiable) rather than hard (a one-of-many selection that would not admit gradient-based learning).[^9]
Before attention was introduced, sequence-to-sequence (seq2seq) models for tasks like machine translation relied on an encoder-decoder framework built from recurrent neural networks (RNNs). The encoder processed the source sentence token by token and compressed it into a single fixed-length context vector, which the decoder then used to generate the target sentence. Sutskever, Vinyals, and Le (2014) demonstrated that this approach could achieve strong results with LSTM networks,[^10] but the fixed-length bottleneck caused performance to degrade on longer sentences because a single vector could not adequately capture all the information in a long input sequence.
The first widely recognized attention mechanism for neural networks was proposed by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate," initially posted to arXiv in September 2014 (arXiv:1409.0473) and published at ICLR 2015.[^4] See also [[bahdanau]] and [[bahdanau_attention]] for the dedicated page. The key insight was to replace the fixed-length context vector with a dynamic one: instead of forcing the encoder to compress the entire source sentence into a single vector, the decoder could look back at all encoder hidden states and select the most relevant ones at each generation step.
Bahdanau attention works as follows. For each decoder time step t, the mechanism computes an alignment score e_{t,j} between the previous decoder hidden state s_{t-1} and each encoder hidden state h_j using a learned feedforward network:
e_{t,j} = v^T tanh(W_s s_{t-1} + W_h h_j)
These scores are normalized through a softmax function to produce attention weights alpha_{t,j}. The context vector c_t is then the weighted sum of encoder hidden states:
c_t = sum_j alpha_{t,j} h_j
Because the alignment scores are computed using an additive combination passed through a neural network, this variant is often called additive attention.[^4] The approach yielded translation quality comparable to the state-of-the-art phrase-based system on English-to-French translation, and crucially, it did not suffer the same degradation on long sentences that earlier encoder-decoder models exhibited. Bahdanau et al. also showed qualitatively that the alignment weights recovered linguistically reasonable word alignments, anticipating the use of attention as an interpretability tool.
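The two equations above can be sketched for a single decoder step in NumPy. This is a minimal illustration rather than the authors' implementation: the dimensions and the randomly initialized `W_s`, `W_h`, and `v` are placeholders for parameters that would normally be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """One decoder step of additive attention.

    s_prev: (d_s,) previous decoder state s_{t-1}
    H:      (n, d_h) encoder hidden states h_1..h_n
    W_s:    (d_a, d_s), W_h: (d_a, d_h), v: (d_a,)
    """
    # e_{t,j} = v^T tanh(W_s s_{t-1} + W_h h_j), evaluated for all j at once
    scores = np.tanh(W_s @ s_prev + H @ W_h.T) @ v   # shape (n,)
    alpha = softmax(scores)                           # attention weights alpha_{t,j}
    context = alpha @ H                               # c_t = sum_j alpha_{t,j} h_j
    return context, alpha

# Illustrative sizes only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
n, d_s, d_h, d_a = 5, 4, 6, 8
H = rng.normal(size=(n, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_a, d_s))
W_h = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)

context, alpha = additive_attention(s_prev, H, W_s, W_h, v)
```

The weights `alpha` form a probability distribution over the `n` source positions, and `context` has the same dimensionality as one encoder state.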
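In 2015, Minh-Thang Luong, Hieu Pham, and Christopher Manning published "Effective Approaches to Attention-based Neural Machine Translation" (arXiv:1508.04025, EMNLP 2015), which introduced several refinements and alternatives to Bahdanau attention.[^11] Luong et al. proposed two broad classes of attention: global attention, which attends to all source positions at every decoding step, and local attention, which attends only to a small window of source positions around a predicted alignment point.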
Luong attention also introduced multiple scoring functions for computing alignment:
| Scoring function | Formula | Notes |
|---|---|---|
| Dot product | score(s_t, h_j) = s_t^T h_j | Simplest; no extra parameters |
| General | score(s_t, h_j) = s_t^T W_a h_j | Learned weight matrix W_a |
| Concat (additive) | score(s_t, h_j) = v^T tanh(W [s_t; h_j]) | Similar to Bahdanau |
A key implementation difference is that Luong attention uses the current decoder hidden state s_t to compute alignment scores, whereas Bahdanau attention uses the previous state s_{t-1}. Because the dot product and general scoring functions rely on matrix multiplication rather than a feedforward network, Luong attention is sometimes called multiplicative attention and tends to be computationally faster.[^11]
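The three scoring functions in the table can be written side by side as follows. This is a sketch under the assumption that the decoder and encoder states share one dimensionality `d` (required by the plain dot product); the matrices are random stand-ins for learned parameters.

```python
import numpy as np

def score_dot(s_t, H):
    # s_t^T h_j for every encoder state h_j (rows of H)
    return H @ s_t

def score_general(s_t, H, W_a):
    # s_t^T W_a h_j, computed as h_j . (W_a^T s_t) for every j
    return H @ (W_a.T @ s_t)

def score_concat(s_t, H, W, v):
    # v^T tanh(W [s_t; h_j]); s_t is tiled so it can be concatenated to each h_j
    S = np.tile(s_t, (H.shape[0], 1))
    return np.tanh(np.concatenate([S, H], axis=1) @ W.T) @ v

rng = np.random.default_rng(0)
n, d, d_a = 5, 4, 8          # illustrative sizes
H = rng.normal(size=(n, d))
s_t = rng.normal(size=d)
W_a = rng.normal(size=(d, d))
W = rng.normal(size=(d_a, 2 * d))
v = rng.normal(size=d_a)

dot_scores = score_dot(s_t, H)
general_scores = score_general(s_t, H, W_a)
concat_scores = score_concat(s_t, H, W, v)
```

With `W_a` set to the identity matrix, the general score reduces to the dot-product score, which is one way to see that the former generalizes the latter.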
The attention mechanism reached its most influential form in the 2017 paper "[[attention_is_all_you_need]]" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin (arXiv:1706.03762, NeurIPS 2017).[^3] See also [[vaswani]] for the lead author. The paper introduced the [[transformer]] architecture, which dispenses with recurrence and convolutions entirely and relies solely on attention mechanisms. As of 2025, the paper had been cited more than 173,000 times, making it one of the most cited papers in the history of computer science.[^12]
The Transformer achieved 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the previous best by more than 2 BLEU points. On English-to-French, it set a new single-model record of 41.8 BLEU after training for only 3.5 days on eight P100 GPUs.[^3]
Critically, Vaswani et al. demonstrated that self-attention layers, when stacked, can replace recurrence and convolution as the primary mechanism for sequence modeling. This unlocked unprecedented parallelism during training (because all positions can be processed simultaneously, unlike RNNs) and led to the cascade of model-scale advances that followed, culminating in modern large language models.
The Transformer formalized attention using the vocabulary of queries, keys, and values (Q, K, V).[^3] The analogy is drawn from information retrieval: a query represents what the model is looking for, keys describe the items available to attend to, and values hold the content that will be retrieved. In self-attention, all three are derived from the same input sequence through learned linear projections:
Q = X W_Q, K = X W_K, V = X W_V
where X is the input matrix (each row is a token embedding) and W_Q, W_K, W_V are learned weight matrices. In cross-attention, Q is derived from one sequence and K, V from another.
The core computation in the Transformer is scaled dot-product attention:[^3]
Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V
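where d_k is the dimensionality of the key vectors. The formula works in three stages: first, the product Q K^T scores every query against every key; second, the scores are divided by sqrt(d_k) and passed through a row-wise softmax so that each query's weights sum to 1; third, the weights multiply V to produce a weighted sum of values for each query.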
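Scaled dot-product self-attention, from the projections Q = X W_Q, K = X W_K, V = X W_V through to the weighted sum, can be sketched in a few lines of NumPy. The sizes and random weight matrices below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # score every query against every key, then scale
    weights = softmax(scores, axis=-1)   # normalize each query's scores into weights
    return weights @ V, weights          # weighted sum of the values

rng = np.random.default_rng(0)
n, d_model, d_k = 6, 8, 4                # illustrative sizes
X = rng.normal(size=(n, d_model))        # one row per token
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
out, weights = scaled_dot_product_attention(Q, K, V)
```

A useful sanity check is that an all-zero query scores every key equally, so its output is simply the mean of the value vectors.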
Why scale by sqrt(d_k)? Vaswani et al. explicitly motivate the scaling factor in Section 3.2.1 of the 2017 paper.[^3] When d_k is large, dot products between queries and keys tend to grow in magnitude. If the individual elements of Q and K are independent random variables with mean 0 and variance 1, then their dot product has mean 0 and variance d_k. Large-magnitude dot products push the softmax into regions where it has extremely small gradients, slowing or stalling learning. Dividing by sqrt(d_k) normalizes the variance of the dot products back to 1, keeping the softmax in a region with healthier gradients. This scaling is a load-bearing detail of the original Transformer formulation.
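The variance argument can be checked numerically. The following sketch draws query and key vectors with independent standard-normal entries (the assumption made in the argument above) and estimates the variance of their dot product before and after scaling; the sample count and the two values of d_k are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 10000
variances = {}  # d_k -> (variance of q.k, variance of q.k / sqrt(d_k))

for d_k in (16, 256):
    q = rng.normal(size=(samples, d_k))
    k = rng.normal(size=(samples, d_k))
    dots = np.sum(q * k, axis=1)                 # one dot product per sample
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
```

The unscaled variance should come out close to d_k in each case, while the scaled variance should stay close to 1 regardless of d_k.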
Rather than performing a single attention computation with full-dimensional queries, keys, and values, the Transformer uses [[multi_head_attention]] (often written multi-head attention or MHA).[^3] The idea is to run h attention "heads" in parallel, each operating on a different learned linear projection of Q, K, and V into lower-dimensional subspaces:
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
Each head can learn to attend to different types of relationships. For instance, one head might focus on syntactic dependencies (e.g., subject-verb agreement) while another captures semantic similarity. In the original Transformer (d_model = 512, h = 8), each head operates on d_k = d_v = 64 dimensional projections. The concatenated outputs are projected back to d_model dimensions through a final weight matrix W_O.
Multi-head attention adds essentially no computational overhead compared to single-head attention with the full dimensionality, because the per-head dimensionality is reduced proportionally. The benefit is representational: multiple heads allow the model to jointly attend to information from different representation subspaces at different positions.
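A compact way to see why the cost matches single-head attention is to implement all heads with one fused projection per role and then split the result. In the sketch below, the per-head matrices W_i^Q, W_i^K, W_i^V are assumed to be stored as contiguous column blocks of `W_Q`, `W_K`, `W_V`; that layout is a common implementation convention rather than something the formulas require. The sizes follow the original Transformer (d_model = 512, h = 8), with random weights as placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_head = d_model // h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_head): head i takes columns i*d_head .. (i+1)*d_head
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)      # (h, n, n)
    heads = softmax(scores) @ V                               # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)     # Concat(head_1, ..., head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 10, 512, 8
X = rng.normal(size=(n, d_model))
# Scaled random weights stand in for learned parameters.
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
```

Each head works on 64-dimensional projections, so the total size of the projection matrices is the same as for one 512-dimensional head.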
A central distinction concerns where the queries, keys, and values come from. See [[self_attention]] for the dedicated article.
Self-attention has a critical advantage over recurrent layers: it connects every pair of positions in a sequence with a constant number of operations (O(1) path length), whereas an RNN requires O(n) sequential steps to propagate information from one end of the sequence to the other.[^3] This short path makes self-attention far more effective at capturing long-range dependencies, and because all positions are processed at once, it is also more parallelizable during training.
For autoregressive language modeling — predicting the next token given the previous tokens — the model must not be allowed to "see the future." This is enforced by masked self-attention, also called causal attention.[^3] In implementation, the attention score matrix is augmented with a triangular mask: entries above the main diagonal (corresponding to attending to future positions) are set to negative infinity before the softmax, so they receive zero attention weight.
Mask_{i,j} = 0 if j <= i, and -inf if j > i
This simple mask is what makes models like [[gpt]] and [[llama]] autoregressive: at training time the model sees the entire sequence at once but is structurally prevented from leaking information backward from later positions. At inference time, tokens are generated one at a time, and each new token attends over all preceding tokens via the [[kv_cache]] (see below).
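The masking step can be sketched directly: scores above the main diagonal are replaced with negative infinity before the softmax. The example below is a single-head illustration with arbitrary sizes and random inputs.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)                      # exp(-inf) evaluates to 0, so masked entries vanish
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True where j > i
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k = 5, 4
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out, weights = causal_attention(Q, K, V)
```

Because position i never receives weight from positions after it, changing the last token's key or value leaves the outputs at all earlier positions unchanged.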
Encoder-only models such as [[bert]] (Devlin et al., 2018) use bidirectional attention: every token attends to every other token in the sequence, with no causal mask.[^13] To train such models without trivial copying, BERT replaces selected tokens with a special [MASK] symbol and trains the model to predict the original token from context — the masked language modeling objective. Bidirectional attention is well-suited to representation learning and discriminative tasks (classification, named-entity recognition, span extraction) but is not directly suited to open-ended text generation, which is why the [[gpt]] family (causal) is the dominant architecture for generative language models.
The encoder layers of encoder-decoder Transformers (e.g., the original Transformer for translation, [[t5]]) also use bidirectional self-attention; the decoder uses causal self-attention together with cross-attention back to the encoder.
Standard multi-head attention as introduced by Vaswani et al. (2017) gives each of the h heads its own query, key, and value projections.[^3] The KV cache during autoregressive inference therefore stores h sets of key and value vectors per token. This is memory-intensive at large model scale; the variants below trade off some quality for substantial KV-cache savings.
Multi-Query Attention was proposed by Noam Shazeer in his 2019 paper "Fast Transformer Decoding: One Write-Head Is All You Need" (arXiv:1911.02150).[^14] See also [[multi_query_attention]]. The central observation is that during autoregressive inference, the primary performance bottleneck on modern accelerators is the memory bandwidth required to load the key-value cache, not the arithmetic computation itself.
MQA addresses this by having all query heads share a single set of key and value projections. Instead of h independent key and value heads (as in MHA), there is just one key head and one value head. Each query head still has its own projection, so the model retains h different query perspectives, but the KV cache is reduced by a factor of h. In practice, MQA speeds up inference decoding substantially with only a small quality degradation. It was adopted in [[palm]] (Google, 2022) and [[falcon]] (TII, 2023).
[[grouped_query_attention]] (GQA), introduced by Ainslie et al. (2023, arXiv:2305.13245, EMNLP 2023), is a compromise between MHA and MQA.[^15] Instead of sharing a single KV head across all query heads (MQA) or having unique KV heads for every query head (MHA), GQA divides the query heads into g groups, where each group shares one set of key and value projections.
| Variant | Query heads | KV heads | KV cache size |
|---|---|---|---|
| Multi-Head Attention (MHA) | h | h | h * d_k * 2 * n |
| Grouped-Query Attention (GQA) | h | g | g * d_k * 2 * n |
| Multi-Query Attention (MQA) | h | 1 | 1 * d_k * 2 * n |
GQA generalizes both extremes: when g = h, GQA reduces to MHA; when g = 1, GQA reduces to MQA. By choosing an intermediate g, GQA achieves most of the inference speed benefits of MQA while maintaining quality closer to MHA. Meta adopted GQA with 8 KV groups in Llama 2 70B (2023), and it has since become the default attention variant in most production large language models, including [[llama]] 3, Mistral, and many others.[^15]
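The sharing pattern can be sketched by repeating each KV head across the query heads in its group. The assignment of consecutive query heads to the same KV head is an assumption made here for illustration; the tensor sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (h, n, d); K and V: (g, n, d), with h divisible by g."""
    h, g = Q.shape[0], K.shape[0]
    # Each KV head serves h // g consecutive query heads.
    K_rep = np.repeat(K, h // g, axis=0)            # (h, n, d)
    V_rep = np.repeat(V, h // g, axis=0)
    scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_rep

rng = np.random.default_rng(0)
h, g, n, d = 8, 2, 5, 16
Q = rng.normal(size=(h, n, d))
K = rng.normal(size=(g, n, d))      # only g KV heads need to be cached
V = rng.normal(size=(g, n, d))
out = grouped_query_attention(Q, K, V)
```

Passing K and V with g = h gives ordinary multi-head attention, and g = 1 gives multi-query attention, matching the two extremes described above.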
[[mla]] (Multi-head Latent Attention, also covered at [[multi-head_latent_attention]]) was introduced in DeepSeek-V2 (arXiv:2405.04434, May 2024).[^16] MLA takes a fundamentally different approach to KV cache reduction: rather than reducing the number of KV heads (as in MQA and GQA), MLA compresses the key and value representations into a low-dimensional latent vector before caching. At inference time, the compressed representation is projected back to produce unique keys and values for each head.
Given an input token x_n, MLA first compresses it into a latent representation:
c^{KV}_n = W^{DKV} x_n
where W^{DKV} is a down-projection matrix mapping the model dimension d to a much smaller latent dimension d_c. This compact vector is stored in the KV cache instead of the full key and value vectors. When attention is computed, separate up-projection matrices W^{UK} and W^{UV} reconstruct unique keys and values for each head.
A key challenge is compatibility with [[rotary_position_embedding]] (RoPE). Standard RoPE entangles positional information with content, which would prevent the "absorption trick" that lets MLA fold the up-projection matrices into the query projection and avoid actually decompressing the KV cache during inference. DeepSeek solved this with decoupled RoPE: separate query and key vectors are introduced specifically for positional encoding, keeping the main latent keys isolated from rotation matrices.[^16]
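The absorption idea can be illustrated for a single head with positional encoding left out entirely. The sketch below is a deliberate simplification, not DeepSeek's implementation: the sizes are arbitrary, the matrices are random, and the decoupled RoPE path is omitted. It shows that scoring against decompressed keys gives the same result as folding W^{UK} into the query and scoring against the cached latents directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_c, d_h, n = 64, 8, 16, 10          # model dim, latent dim, head dim, cached tokens

W_DKV = rng.normal(size=(d_c, d))       # down-projection W^{DKV}
W_UK = rng.normal(size=(d_h, d_c))      # up-projection to keys for one head, W^{UK}
W_UV = rng.normal(size=(d_h, d_c))      # up-projection to values for one head, W^{UV}

X = rng.normal(size=(n, d))             # inputs for the cached tokens
q = rng.normal(size=d_h)                # query of the current token for this head

C = X @ W_DKV.T                         # cached latents c^{KV}_n, shape (n, d_c)

# Naive path: decompress keys from the cache, then score.
K = C @ W_UK.T                          # (n, d_h)
scores_naive = K @ q

# Absorbed path: fold W^{UK} into the query and score against the latents.
q_absorbed = W_UK.T @ q                 # (d_c,)
scores_absorbed = C @ q_absorbed

V = C @ W_UV.T                          # values are reconstructed the same way
```

Only `C` has to be stored: here that is n * d_c numbers instead of the 2 * n * d_h needed for this head's keys and values, and the saving grows with the number of heads because all heads share the same latent.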
In the DeepSeek-V2 report, MLA matched (and sometimes exceeded) the quality of standard MHA, and DeepSeek-V2 reduced KV-cache size by 93.3% and raised maximum generation throughput to 5.76 times relative to the earlier DeepSeek 67B model. DeepSeek-V3 uses d_h = 128, H = 128, and d_c = 512, so the cached latent is 32 times narrower than the H * d_h = 16,384-dimensional keys (or values) it replaces for each token. MLA was used in DeepSeek-V3 and DeepSeek-R1; subsequent research (TransMLA, 2025) has explored enabling MLA in any Transformer-based LLM.
Differential attention, introduced by Ye et al. at Microsoft Research and Tsinghua University in their 2024 paper "Differential Transformer" (arXiv:2410.05258, ICLR 2025 Oral), rethinks how attention scores are computed.[^17] The mechanism partitions the query and key projections into two groups and computes two independent softmax distributions:
DiffAttn(X) = ( softmax(Q_1 K_1^T / sqrt(d)) - lambda * softmax(Q_2 K_2^T / sqrt(d)) ) V
The subtraction acts as noise cancellation: many tokens in standard attention receive small but non-negligible weight, diluting the signal. Differential attention subtracts these common noise patterns, causing attention to concentrate on genuinely relevant tokens. Experiments across model sizes from 830M to 13.1B parameters showed consistent improvements: a 6.8B Diff Transformer matched the validation loss of an 11B standard Transformer.[^17]
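The formula can be sketched as follows. This is a simplified single-head version: `lam` is treated as a fixed scalar, whereas the paper learns it, and any additional per-head normalization is left out. Sizes and weights are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def diff_attention(X, W_Q, W_K, W_V, lam):
    # Split the query and key projections into two groups of equal width.
    Q1, Q2 = np.split(X @ W_Q, 2, axis=-1)
    K1, K2 = np.split(X @ W_K, 2, axis=-1)
    V = X @ W_V
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    weights = A1 - lam * A2              # difference of two softmax maps
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, 2 * d))
W_K = rng.normal(size=(d_model, 2 * d))
W_V = rng.normal(size=(d_model, 2 * d))

out, weights = diff_attention(X, W_Q, W_K, W_V, lam=0.5)
```

Unlike ordinary attention weights, the combined weights can be negative, and each row sums to 1 - lam rather than 1.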
[[flash_attention]], introduced by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré in their 2022 paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (arXiv:2205.14135), addresses the quadratic memory cost of attention by rethinking how the computation interacts with GPU hardware rather than by approximating the math.[^18]
The key insight is that standard attention implementations are bottlenecked not by arithmetic operations but by memory transfers between GPU high-bandwidth memory (HBM) and the on-chip SRAM. Standard implementations materialize the full n x n attention matrix in HBM, which requires O(n^2) memory reads and writes. FlashAttention avoids this by tiling: it splits Q, K, and V into blocks, loads each block from HBM into SRAM, computes the attention for that block in fast on-chip memory, and writes only the final output back to HBM. A carefully designed online softmax normalization algorithm allows blocks to be processed incrementally without ever needing the full attention matrix in memory.
The result is exact attention (not an approximation) that uses O(n) memory instead of O(n^2) and achieves wall-clock speedups of 2 to 4 times over standard implementations.[^18]
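The online softmax idea can be shown in NumPy, although NumPy cannot reproduce the HBM/SRAM behavior that gives FlashAttention its speed. The sketch below only tiles over keys and values (the real kernel also tiles over queries) and keeps a running maximum and running denominator per query, so the full n x n matrix is never formed. The block size is an arbitrary choice.

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))      # unnormalized running output
    row_max = np.full(n, -np.inf)        # running max score per query
    row_sum = np.zeros(n)                # running softmax denominator per query

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                      # scores against this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)         # rescale earlier accumulations
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

rng = np.random.default_rng(0)
n, d = 10, 8
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
tiled = tiled_attention(Q, K, V, block=4)
```

The result agrees with attention computed from the fully materialized score matrix up to floating-point rounding, which is the sense in which the method is exact.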
[[sparse_attention]] approaches restrict each token's attention to a subset of positions rather than the full sequence.
Katharopoulos et al. (2020) proposed linear attention in "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (ICML 2020).[^26] The core idea replaces the softmax with a decomposable kernel function. Using feature maps phi:
LinearAttention(Q, K, V) = ( phi(Q) ( phi(K)^T V ) ) / ( phi(Q) sum(phi(K)^T) )
By exploiting the associative property of matrix multiplication, the computation avoids materializing the n x n attention matrix. The product phi(K)^T V produces a d x d matrix (independent of n), reducing complexity from O(n^2 d) to O(n d^2). Katharopoulos et al. used phi(x) = elu(x) + 1 and showed a direct connection between Transformers and RNNs that enables efficient autoregressive generation.
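The reordering can be sketched for the non-causal case as follows, using the feature map phi(x) = elu(x) + 1 mentioned above. Sizes and inputs are illustrative.

```python
import numpy as np

def phi(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    phi_Q, phi_K = phi(Q), phi(K)
    KV = phi_K.T @ V                  # (d, d_v) summary, independent of sequence length
    Z = phi_K.sum(axis=0)             # (d,) normalizer, sum of phi(K) over positions
    return (phi_Q @ KV) / (phi_Q @ Z)[:, None]

rng = np.random.default_rng(0)
n, d = 12, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
```

Multiplying in the other order, phi(Q) phi(K)^T first, gives the same output but forms an n x n matrix along the way; the function above never does.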
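Related approaches include Performer (Choromanski et al., 2020), which approximates softmax attention with random features (FAVOR+).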
Ring Attention (Liu, Zaharia, and Abbeel, ICLR 2024, arXiv:2310.01889) is a distributed sequence-parallelism technique that enables processing of extremely long sequences by splitting them across devices arranged in a ring topology.[^30] Each device computes blockwise attention between its local query block and a visiting KV block, while simultaneously sending that KV block to the next device in the ring and receiving the next KV block from the previous device. When the computation on each block takes at least as long as transferring it, communication is fully overlapped with computation.
Ring Attention enables training and inference on sequences up to p times longer than what a single device can handle, where p is the number of devices. On 32 A100 GPUs, a 7B model can process over 1 million tokens; on TPUv4-1024, a 3B model can train with 16 million tokens.[^30]
Native Sparse Attention (NSA), introduced by DeepSeek in February 2025 (arXiv:2502.11089, Best Paper at ACL 2025), is a hardware-aligned sparse attention mechanism designed to be natively trainable end-to-end.[^31] NSA processes inputs through three parallel attention branches combined via learned gating: compressed attention (coarse-grained blocks), selected attention (top-n important blocks at full precision), and sliding-window attention (local recent tokens). On 64K sequences, NSA achieves 9.0x forward speedup, 6.0x backward speedup, and 11.6x decoding speedup while matching or exceeding full-attention quality.
Attention is intrinsically permutation-equivariant: scaled dot-product attention treats its inputs as an unordered set, so positional information must be injected externally for sequence modeling. The choice of position encoding has become a major design lever in modern Transformers, and several encodings are tightly coupled to specific attention variants.
These position encodings interact with attention in different ways: RoPE rotates Q and K, ALiBi biases the score matrix, and absolute encodings simply add to the token representation. The interaction is non-trivial — for example, MLA's compatibility with RoPE required the decoupled-RoPE design described above.
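As an illustration of how RoPE acts on Q and K, the sketch below rotates pairs of dimensions by an angle proportional to the token position. The base of 10000 and the pairing of dimension i with dimension i + d/2 are assumptions following one common convention; implementations differ in how they pair dimensions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even length d) as if it sat at integer position pos."""
    half = x.shape[-1] // 2
    theta = base ** (-np.arange(half) / half)   # one rotation frequency per pair
    angle = pos * theta
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate(
        [x1 * np.cos(angle) - x2 * np.sin(angle),
         x1 * np.sin(angle) + x2 * np.cos(angle)],
        axis=-1,
    )

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

# Same relative offset (2) at two different absolute positions.
score_near = rope(q, 3) @ rope(k, 1)
score_far = rope(q, 10) @ rope(k, 8)
```

The two scores are equal because the dot product of a rotated query and a rotated key depends only on the difference between their positions, which is what makes the encoding relative.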
During autoregressive generation (producing one token at a time), a [[language_model]] must compute attention over all previously generated tokens. Without optimization, this means recomputing the key and value projections for every past token at every generation step, leading to redundant computation that grows quadratically with sequence length.
The [[kv_cache]] solves this by storing the key and value vectors from all previous time steps. At each new generation step, only the key and value for the new token need to be computed and appended to the cache. The query for the new token then attends over all cached keys and values. This reduces the per-step projection cost from O(n d^2) to O(d^2), since only one token is projected instead of n, though the attention computation itself still requires O(n d) per step.
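A single-head decoding loop with a cache can be sketched as below. The `KVCache` class and `decode_step` function are hypothetical names introduced for this illustration; a production implementation would preallocate memory rather than grow arrays with `np.vstack`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

class KVCache:
    """Stores the keys and values of all tokens seen so far (illustrative only)."""
    def __init__(self, d_k, d_v):
        self.K = np.empty((0, d_k))
        self.V = np.empty((0, d_v))

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

def decode_step(x, W_Q, W_K, W_V, cache):
    q = x @ W_Q
    cache.append(x @ W_K, x @ W_V)        # only the new token is projected
    scores = cache.K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache.V      # attend over every cached position

rng = np.random.default_rng(0)
n, d, d_k = 6, 8, 4
X = rng.normal(size=(n, d))               # token representations, fed one at a time
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

cache = KVCache(d_k, d_k)
outputs = np.stack([decode_step(x, W_Q, W_K, W_V, cache) for x in X])
```

Feeding tokens one at a time through the cache produces the same outputs as running causal self-attention over the whole sequence at once.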
The main challenge is memory: the KV cache grows linearly with sequence length, model width, number of layers, and batch size. For a large model such as Llama 2 70B with a context window of 4,096 tokens, the cache takes on the order of a gigabyte per sequence in 16-bit precision even with grouped-query attention, so batched serving can consume tens of gigabytes of GPU memory. Several strategies address this:
| Strategy | Mechanism | Typical reduction |
|---|---|---|
| MQA / GQA | Reduce number of KV heads | KV cache reduced by factor of h (MQA) or h/g (GQA) |
| MLA | Compress KV into low-rank latent vector | 93.3% cache reduction (DeepSeek-V2) |
| KV cache quantization | Store cached K/V in FP8 or INT4 | 2-4x memory reduction |
| [[paged_attention]] (PagedAttention) | Virtual-memory-style non-contiguous cache blocks | Waste reduced from 60-80% to under 4% |
| Sliding-window caches | Limit cache to fixed window size | Bounded memory regardless of sequence length |
| Token eviction / compression | Selectively remove or merge less important cached tokens | Variable, task-dependent |
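The memory figures involved can be estimated with simple arithmetic. The sketch below assumes the published Llama 2 70B configuration (80 layers, 64 query heads, 8 KV heads, head dimension 128) and 16-bit storage; the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    # The leading 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed Llama 2 70B shape: 80 layers, head_dim 128, 4,096-token context.
gqa_bytes = kv_cache_bytes(80, 8, 128, 4096)     # 8 KV heads (GQA, as deployed)
mha_bytes = kv_cache_bytes(80, 64, 128, 4096)    # 64 KV heads (hypothetical MHA variant)
batched_gqa_bytes = kv_cache_bytes(80, 8, 128, 4096, batch=32)

print(gqa_bytes / 2**30)          # 1.25 (GiB per sequence)
print(mha_bytes / 2**30)          # 10.0 (GiB per sequence)
print(batched_gqa_bytes / 2**30)  # 40.0 (GiB for a batch of 32)
```

Under these assumptions GQA cuts the per-sequence cache by a factor of 8, and halving `bytes_per_elem` (for example with FP8 storage) would halve it again.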
PagedAttention (Kwon et al., 2023, SOSP 2023), used in the [[vllm]] serving framework, deserves special mention.[^34] It borrows ideas from operating-system virtual memory to manage cache memory in non-contiguous blocks, reducing fragmentation. Standard implementations waste 60 to 80% of KV-cache memory; PagedAttention reduces waste to under 4% and improves serving throughput by 2 to 4x. [[radix_attention]] (Zheng et al., 2024) extends this by sharing prefix KV blocks across requests in a radix tree, accelerating multi-turn conversation and structured generation.
Practical attention implementations in modern training and inference stacks rely on a handful of complementary techniques:
Standard self-attention has time complexity O(n^2 d) and, when the attention matrix is materialized, memory complexity O(n^2), where n is sequence length and d is model dimension.[^3] The quadratic dependence on n is the fundamental bottleneck for very long contexts. Although FlashAttention reduces the extra memory to O(n) by never storing the full n x n attention matrix, the compute cost remains quadratic for exact attention. This is why sparse, linear, and state-space alternatives remain active research areas.
Empirically, long-context Transformers face several distinct failure modes:
One practical advantage of attention mechanisms is that attention weights can be inspected to gain insight into what the model is focusing on. Attention maps are typically visualized as heatmaps, where brighter entries indicate stronger attention between two positions. Bahdanau et al. (2014) and Vaswani et al. (2017) used such visualizations to argue that the attention mechanism recovers linguistically interpretable patterns (e.g., word alignments in translation, head-dependent relations in parsing).[^4][^3]
However, researchers have cautioned against over-interpreting attention weights. Jain and Wallace (2019) showed that attention weights often do not correlate well with other measures of feature importance and that alternative attention distributions can produce identical predictions — challenging the view that attention is itself an explanation.[^38] Subsequent work (Wiegreffe and Pinter, 2019, EMNLP) clarified the conditions under which attention can or cannot be interpreted as explanation.[^39] Attention weights indicate how information flows through the network but do not necessarily indicate which inputs are causally important for the output; more rigorous interpretability methods — such as probing classifiers and gradient-based attribution — are typically needed to draw reliable conclusions.
A line of research aims to replace attention entirely with mechanisms that have linear or sub-quadratic complexity while retaining the parallelism and expressivity of Transformers.
Recent benchmarks suggest a nuanced picture: attention excels at precise recall from context (the "needle in a haystack" task and related associative recall), while SSMs and linear recurrences excel at compression and efficiency over long sequences.[^45] Hybrid architectures are an attempt to combine the strengths of both.
The following table summarizes major attention variants, their key characteristics, and the models that adopt them.
| Variant | Year | Authors | Key idea | Complexity | Notable models |
|---|---|---|---|---|---|
| Additive (Bahdanau) | 2014 | Bahdanau, Cho, Bengio | Learned alignment via feedforward net | O(n m) | Early NMT |
| Multiplicative (Luong) | 2015 | Luong, Pham, Manning | Dot-product / general scoring | O(n m) | Early NMT |
| Scaled dot-product | 2017 | Vaswani et al. | Q, K, V with sqrt(d_k) scaling | O(n^2 d) | All Transformers |
| Multi-head attention | 2017 | Vaswani et al. | h parallel heads in subspaces | O(n^2 d) | All Transformers |
| Sparse (Longformer) | 2020 | Beltagy et al. | Local + global + dilated | O(n w) | Longformer |
| Sparse (BigBird) | 2020 | Zaheer et al. | Local + global + random | O(n (w+r+g)) | BigBird |
| Linear attention | 2020 | Katharopoulos et al. | Kernel feature map | O(n d^2) | Linear Transformer |
| Performer | 2020 | Choromanski et al. | Random features (FAVOR+) | O(n d^2) | Performer |
| Multi-query (MQA) | 2019 | Shazeer | Single shared KV head | O(n^2 d) | PaLM, Falcon |
| Grouped-query (GQA) | 2023 | Ainslie et al. | g shared KV groups | O(n^2 d) | Llama 2, Llama 3, Mistral |
| FlashAttention | 2022 | Dao et al. | IO-aware tiling, exact, O(n) memory | O(n^2 d) compute | Widely adopted |
| FlashAttention-2 | 2023 | Dao | Better parallelism, warp partitioning | O(n^2 d) compute | Widely adopted |
| FlashAttention-3 | 2024 | Dao, Shah | Hopper async, FP8, warp specialization | O(n^2 d) compute | H100 deployments |
| FlashAttention-4 | 2026 | Zadouri, Shah, Dao et al. | Blackwell pipelining | O(n^2 d) compute | B200 deployments |
| Sliding window | 2023 | Mistral AI | Local window + rolling KV cache | O(n w) | Mistral 7B, Gemma 3 |
| Multi-head latent (MLA) | 2024 | DeepSeek | Latent compression of KV cache | O(n^2 d) | DeepSeek-V2, V3, R1 |
| Differential attention | 2024 | Ye et al. | Difference of two softmax maps | O(n^2 d) | Diff Transformer |
| Native sparse (NSA) | 2025 | DeepSeek | Trained sparse: compress + select + slide | O(n (n/r + k + w)) | DeepSeek (research) |
| Ring Attention | 2024 | Liu, Zaharia, Abbeel | Distributed ring sequence parallel | O(n^2 d / p) per device | Long-context training |
| Selective SSM (Mamba) | 2023 | Gu, Dao | Input-dependent SSM | O(n d) | Mamba, Jamba |
The Vision Transformer (ViT, Dosovitskiy et al., 2020, ICLR 2021, arXiv:2010.11929) demonstrated that a pure Transformer applied directly to sequences of image patches can match or exceed state-of-the-art convolutional neural networks (CNNs) on image classification.[^46] ViT splits an image into fixed-size patches (typically 16x16 pixels), flattens each patch into a vector, projects it to the model dimension, prepends a learnable [CLS] token, and adds positional embeddings, then processes the resulting sequence with a standard Transformer encoder using multi-head self-attention.
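The patch-to-token pipeline described above can be sketched in NumPy. The 224 x 224 input size, the model dimension of 64, and the random `W_E`, `cls_token`, and `pos_embed` arrays are illustrative assumptions standing in for a real image and learned parameters.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)      # one vector per patch

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
d_model = 64

patches = patchify(image)                                   # (196, 768)
W_E = rng.normal(size=(patches.shape[1], d_model)) * 0.02   # patch projection
cls_token = rng.normal(size=(1, d_model)) * 0.02            # learnable [CLS] token
pos_embed = rng.normal(size=(patches.shape[0] + 1, d_model)) * 0.02

tokens = np.concatenate([cls_token, patches @ W_E], axis=0) + pos_embed
```

The resulting `tokens` array has one row per patch plus one for the [CLS] token, and is what a standard Transformer encoder would then process.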
Self-attention in ViT allows every patch to attend to every other patch, capturing global relationships across the entire image from the very first layer — in contrast to CNNs which build global understanding only gradually through stacked local convolutions. ViT has since spawned many variants, including DeiT (Touvron et al., 2021), Swin Transformer (Liu et al., 2021) which uses shifted-window attention for efficiency, and BEiT (Bao et al., 2021).
Modern text-to-image diffusion models like Stable Diffusion, DALL-E, and Imagen rely heavily on attention. Self-attention operates within the image latent representations at multiple spatial resolutions, preserving global geometric coherence. Cross-attention connects the text prompt (providing keys and values from a text encoder such as CLIP or T5) to the image latent features (providing queries), controlling which regions of the image correspond to which words.[^47] Researchers have leveraged cross-attention maps for prompt-to-prompt editing, attention-based layout control, and interpretability analysis.
Attention mechanisms have been adopted across virtually every domain of machine learning:
Imagine you are in a classroom and the teacher asks a question. You look around the room for clues. Some classmates are whispering the answer, some are drawing in their notebooks, and some are looking out the window. Attention is like choosing to listen more closely to the classmates who seem to know the answer and ignoring the ones looking out the window. In machine learning, attention lets the computer do something similar: it decides which pieces of information are most helpful for the task at hand and pays more attention to those, while downplaying the rest.