Attention

See also: Machine learning terms

This article gives a high-level overview of attention as a family of mechanisms in machine learning. For deeper treatments, see the dedicated pages on self-attention, multi-head self-attention, cross-attention, Bahdanau attention, Attention Is All You Need, FlashAttention, grouped-query attention, multi-head latent attention, and PagedAttention.

What is the attention mechanism?

The attention mechanism is a neural-network operation that lets a model focus on the most relevant parts of its input by computing a weighted sum of values, where the weight on each value is set by how well its key matches a query.[3] First introduced for neural machine translation in 2014 and generalized in the 2017 transformer paper "Attention Is All You Need," attention is the core building block of nearly every modern AI system, from large language models to image generators and protein-folding models.[3][4] The 2017 paper that established it has been cited more than 173,000 times and was ranked by Nature in 2025 as the seventh most-cited research paper of the 21st century.[12][50]

Attention is a family of techniques in machine learning that allow a model to focus on specific parts of an input while making predictions.[1] Rather than compressing an entire input into a single fixed-size representation, attention mechanisms let the model dynamically weigh the importance of different input elements at each step of computation. This selective focus mirrors, in a loose sense, how biological attention works: irrelevant information is suppressed while relevant information is amplified.[2]

Mathematically, attention can be viewed as weighted aggregation: given a query that describes what is being looked for and a set of (key, value) pairs that describe what is available, the mechanism produces a weighted sum of the values, with weights derived from how well each key matches the query.[3] This simple operation, a soft and differentiable lookup, has proven extraordinarily expressive when stacked into deep networks.

Originally developed in 2014 for neural machine translation, attention has become the foundational building block of modern deep learning.[4] It is the core operation inside the transformer architecture introduced in 2017, which underpins large language models such as the GPT series, Llama, Claude, and DeepSeek models, as well as Vision Transformers, diffusion models for image and video generation, and protein-structure systems such as AlphaFold. Understanding attention is essential for understanding modern artificial intelligence.

Cognitive science origins

Although the term "attention" in machine learning is a metaphor rather than a literal model of brain function, the concept is deeply rooted in cognitive psychology and neuroscience. Selective attention, the ability to focus mental resources on a subset of available stimuli, has been studied empirically since at least the 1950s. The classic "cocktail party effect" described by Colin Cherry (1953) demonstrated that listeners can selectively attend to one conversation in a crowded room while suppressing others, providing one of the earliest experimental frameworks for attention research.[5]

A particularly influential theoretical framework was feature integration theory, introduced by Anne Treisman and Garry Gelade in their 1980 paper "A Feature-Integration Theory of Attention" in Cognitive Psychology.[6] Treisman and Gelade proposed that visual processing proceeds in two stages: a parallel, pre-attentive stage in which simple features (color, orientation, motion) are detected automatically across the visual field, followed by an attentive stage in which focal attention "binds" these features into coherent object representations. Without attention, the theory predicts that features can become incorrectly conjoined, producing illusory conjunctions, for example, perceiving a red O and a green X as a green O and a red X. Their experimental findings supported the role of focused attention as a binding mechanism for object perception.

In neuroscience, attention is associated with networks involving the prefrontal cortex and the parietal cortex, particularly the dorsal attention network and the ventral attention network identified by Maurizio Corbetta and Gordon Shulman (2002).[7] Top-down attention is directed by goals and expectations, while bottom-up attention is captured by salient stimuli. Computational models of biological attention, such as the saliency maps of Itti, Koch, and Niebur (1998)[8], predate machine learning attention and inspired some early work on visual attention in deep networks.

It is important to note that machine-learning attention does not closely model these biological mechanisms. The shared name is largely metaphorical: in both cases, a limited resource is allocated selectively over inputs, but the underlying mathematics and biology differ substantially. Still, the cognitive-science framing has motivated several design choices, including the idea that attention should be soft (continuous, differentiable) rather than hard (a one-of-many selection that would not admit gradient-based learning).[9]

History in neural networks

Early sequence-to-sequence models

Before attention was introduced, sequence-to-sequence (seq2seq) models for tasks like machine translation relied on an encoder-decoder framework built from recurrent neural networks (RNNs). The encoder processed the source sentence token by token and compressed it into a single fixed-length context vector, which the decoder then used to generate the target sentence. Sutskever, Vinyals, and Le (2014) demonstrated that this approach could achieve strong results with LSTM networks,[10] but the fixed-length bottleneck caused performance to degrade on longer sentences because a single vector could not adequately capture all the information in a long input sequence.

When was attention invented? Bahdanau attention (2014, additive)

The first widely recognized attention mechanism for neural networks was proposed by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate," initially posted to arXiv on September 1, 2014 (arXiv:1409.0473) and published at ICLR 2015.[4] See also the dedicated pages on Bahdanau and Bahdanau attention. The key insight was to replace the fixed-length context vector with a dynamic one: instead of forcing the encoder to compress the entire source sentence into a single vector, the decoder could look back at all encoder hidden states and select the most relevant ones at each generation step. The authors diagnosed the core problem directly, writing that "the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture," and proposed instead "allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word."[4]

Bahdanau attention works as follows. For each decoder time step t, the mechanism computes an alignment score e_{t,j} between the previous decoder hidden state s_{t-1} and each encoder hidden state h_j using a learned feedforward network:

e_{t,j} = v^T tanh(W_s s_{t-1} + W_h h_j)

These scores are normalized through a softmax function to produce attention weights alpha_{t,j}. The context vector c_t is then the weighted sum of encoder hidden states:

c_t = sum_j alpha_{t,j} h_j

Because the alignment scores are computed using an additive combination passed through a neural network, this variant is often called additive attention.[4] The approach yielded translation quality comparable to the state-of-the-art phrase-based system on English-to-French translation, and crucially, it did not suffer the same degradation on long sentences that earlier encoder-decoder models exhibited. Bahdanau et al. also showed qualitatively that the alignment weights recovered linguistically reasonable word alignments, anticipating the use of attention as an interpretability tool.
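
The two formulas above can be sketched in a few lines of NumPy. This is an illustrative single decoder step, not the authors' implementation: the dimension names (d_s, d_h, d_a), the random weights, and the function name additive_attention are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """One decoder step of additive attention.

    s_prev: (d_s,)    previous decoder state s_{t-1}
    H:      (n, d_h)  encoder hidden states h_1..h_n
    W_s: (d_a, d_s), W_h: (d_a, d_h), v: (d_a,)
    """
    # e_{t,j} = v^T tanh(W_s s_{t-1} + W_h h_j), for all j at once.
    scores = np.tanh(W_s @ s_prev + H @ W_h.T) @ v   # shape (n,)
    alpha = softmax(scores)                           # attention weights alpha_{t,j}
    context = alpha @ H                               # c_t = sum_j alpha_{t,j} h_j
    return context, alpha

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
n, d_s, d_h, d_a = 5, 4, 6, 8
s_prev = rng.normal(size=d_s)
H = rng.normal(size=(n, d_h))
W_s = rng.normal(size=(d_a, d_s))
W_h = rng.normal(size=(d_a, d_h))
v = rng.normal(size=d_a)

context, alpha = additive_attention(s_prev, H, W_s, W_h, v)
```

The weights alpha form a probability distribution over the n source positions, and the context vector has the same dimensionality as one encoder state.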

Luong attention (2015, multiplicative variants)

In 2015, Minh-Thang Luong, Hieu Pham, and Christopher Manning published "Effective Approaches to Attention-based Neural Machine Translation" (arXiv:1508.04025, EMNLP 2015), which introduced several refinements and alternatives to Bahdanau attention.[11] Luong et al. proposed two broad classes of attention:

  • Global attention, which attends to all source positions (similar to Bahdanau attention but architecturally simpler).
  • Local attention, which attends only to a small window of source positions around an aligned position p_t, reducing computational cost.

Luong attention also introduced multiple scoring functions for computing alignment:

| Scoring function | Formula | Notes |
|---|---|---|
| Dot product | score(s_t, h_j) = s_t^T h_j | Simplest; no extra parameters |
| General | score(s_t, h_j) = s_t^T W_a h_j | Learned weight matrix W_a |
| Concat (additive) | score(s_t, h_j) = v^T tanh(W [s_t; h_j]) | Similar to Bahdanau |

A key implementation difference is that Luong attention uses the current decoder hidden state s_t to compute alignment scores, whereas Bahdanau attention uses the previous state s_{t-1}. Because the dot product and general scoring functions rely on matrix multiplication rather than a feedforward network, Luong attention is sometimes called multiplicative attention and tends to be computationally faster.[11]
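
The three scoring functions can be written side by side as a minimal NumPy sketch. The function names, the shapes, and the random test data are assumptions for illustration; each function returns one score per encoder state.

```python
import numpy as np

def score_dot(s_t, H):
    # s_t^T h_j for every row h_j of H; needs s_t and h_j to share a dimension.
    return H @ s_t

def score_general(s_t, H, W_a):
    # s_t^T W_a h_j, with W_a of shape (d_s, d_h).
    return H @ W_a.T @ s_t

def score_concat(s_t, H, W, v):
    # v^T tanh(W [s_t; h_j]), with W of shape (d_a, d_s + d_h).
    s_rep = np.broadcast_to(s_t, (H.shape[0], s_t.shape[0]))
    return np.tanh(np.concatenate([s_rep, H], axis=1) @ W.T) @ v

rng = np.random.default_rng(0)
n, d, d_a = 5, 4, 7                      # toy sizes, chosen arbitrarily
s_t = rng.normal(size=d)
H = rng.normal(size=(n, d))
W_a = rng.normal(size=(d, d))
W = rng.normal(size=(d_a, 2 * d))
v = rng.normal(size=d_a)

dot_scores = score_dot(s_t, H)
general_scores = score_general(s_t, H, W_a)
concat_scores = score_concat(s_t, H, W, v)
```

With W_a set to the identity matrix, the general score reduces to the dot-product score, which is one way to see the two as members of the same multiplicative family.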

Vaswani et al. 2017: Transformer and self-attention

The attention mechanism reached its most influential form in the 2017 paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin (arXiv:1706.03762, submitted June 12, 2017, NeurIPS 2017).[3] See also the page on Vaswani, the first-listed author. The paper introduced the transformer architecture, which dispenses with recurrence and convolutions entirely and relies solely on attention mechanisms. In the abstract, the authors state: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."[3] As of 2025, the paper had been cited more than 173,000 times, and a Nature analysis spanning five major citation databases ranked it the seventh most-cited research paper of the 21st century, making it one of the most cited papers in the history of computer science.[12][50]

The Transformer achieved 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the previous best (including ensembles) by more than 2 BLEU points. On English-to-French, it set a new single-model state-of-the-art of 41.8 BLEU after training for only 3.5 days on eight P100 GPUs, a small fraction of the training cost of the best models in the literature.[3]

Critically, Vaswani et al. demonstrated that self-attention layers, when stacked, can replace recurrence and convolution as the primary mechanism for sequence modeling. This unlocked unprecedented parallelism during training (because all positions can be processed simultaneously, unlike RNNs) and led to the cascade of model-scale advances that followed, culminating in modern large language models.

Mathematical formulation

Queries, keys, and values

The Transformer formalized attention using the vocabulary of queries, keys, and values (Q, K, V).[3] The analogy is drawn from information retrieval: a query represents what the model is looking for, keys describe the items available to attend to, and values hold the content that will be retrieved. In self-attention, all three are derived from the same input sequence through learned linear projections:

Q = X W_Q,    K = X W_K,    V = X W_V

where X is the input matrix (each row is a token embedding) and W_Q, W_K, W_V are learned weight matrices. In cross-attention, Q is derived from one sequence and K, V from another.

Scaled dot-product attention

The core computation in the Transformer is scaled dot-product attention:[3]

Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V

where d_k is the dimensionality of the key vectors. The formula works in three stages:

  1. Compute similarity scores. The dot product Q K^T produces a matrix of raw attention scores, where each entry measures the similarity between a query and a key.
  2. Scale. The scores are divided by sqrt(d_k).
  3. Softmax and aggregate. The scaled scores pass through a softmax function to produce a probability distribution (the attention weights), which is then used to take a weighted sum of the value vectors.

Why scale by sqrt(d_k)? Vaswani et al. explicitly motivate the scaling factor in Section 3.2.1 of the 2017 paper.[3] When d_k is large, dot products between queries and keys tend to grow in magnitude. If the individual elements of Q and K are independent random variables with mean 0 and variance 1, then their dot product has mean 0 and variance d_k. Large-magnitude dot products push the softmax into regions where it has extremely small gradients, slowing or stalling learning. Dividing by sqrt(d_k) normalizes the variance of the dot products back to 1, keeping the softmax in a region with healthier gradients. This scaling is a load-bearing detail of the original Transformer formulation.
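
The three stages and the variance argument can be checked numerically. The sketch below is a plain NumPy illustration under the same assumption the paragraph makes (independent entries with mean 0 and variance 1); the sample count and d_k = 256 are arbitrary choices for the demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # stabilize before exponentiating
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # stages 1 and 2: similarity, then scale
    weights = softmax(scores, axis=-1)   # stage 3a: one distribution per query
    return weights @ V, weights          # stage 3b: weighted sum of values

rng = np.random.default_rng(0)
d_k = 256
q = rng.normal(size=(10_000, d_k))       # unit-variance query samples
k = rng.normal(size=(10_000, d_k))       # unit-variance key samples

raw = (q * k).sum(axis=1)                # 10,000 unscaled dot products
scaled = raw / np.sqrt(d_k)
print(raw.var(), scaled.var())           # roughly d_k and roughly 1

out, weights = scaled_dot_product_attention(q[:4], k[:6], rng.normal(size=(6, 32)))
```

Each row of weights sums to 1, and the empirical variance of the unscaled dot products sits near d_k while the scaled version sits near 1, matching the argument above.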

Multi-head attention

Rather than performing a single attention computation with full-dimensional queries, keys, and values, the Transformer uses multi-head attention (MHA).[3] The idea is to run h attention "heads" in parallel, each operating on a different learned linear projection of Q, K, and V into lower-dimensional subspaces:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O

Each head can learn to attend to different types of relationships. For instance, one head might focus on syntactic dependencies (e.g., subject-verb agreement) while another captures semantic similarity. In the original Transformer (d_model = 512, h = 8), each head operates on d_k = d_v = 64 dimensional projections. The concatenated outputs are projected back to d_model dimensions through a final weight matrix W_O.

Multi-head attention adds essentially no computational overhead compared to single-head attention with the full dimensionality, because the per-head dimensionality is reduced proportionally. The benefit is representational: multiple heads allow the model to jointly attend to information from different representation subspaces at different positions.
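
A compact way to see why the cost stays roughly constant is to implement all heads with one fused projection and a reshape. The NumPy sketch below is for self-attention and uses the original sizes (d_model = 512, h = 8); storing the per-head matrices W_i^Q as column blocks of a single W_q is an implementation assumption, as are the random weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, h):
    n, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_k); head i owns columns i*d_k:(i+1)*d_k.
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, n, n)
    heads = softmax(scores, axis=-1) @ V                # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o                                 # back to d_model

rng = np.random.default_rng(0)
n, d_model, h = 4, 512, 8
X = rng.normal(size=(n, d_model))
# Scaled random weights stand in for learned parameters.
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                      for _ in range(4))

out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, h)
```

The projections are the same size as in a single full-width head; only the softmax is applied per 64-dimensional slice instead of once over all 512 dimensions.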

Types of attention

How does self-attention differ from cross-attention?

A central distinction concerns where the queries, keys, and values come from. See self-attention for the dedicated article.

  • Self-attention (also called intra-attention) applies the attention mechanism within a single sequence. Each token generates a query, a key, and a value; every token's query is compared against every token's key (including its own), and the resulting attention weights determine how much each token's value contributes to the output representation at that position.[3] Self-attention is what powers the encoder and decoder stacks of modern Transformers.
  • Cross-attention (also called encoder-decoder attention) is a variant where the queries come from one sequence and the keys and values come from a different sequence. In the original Transformer, the decoder generates queries from its own hidden states, while the keys and values come from the encoder's output, allowing each position in the decoder to attend to all positions in the encoder.[3] Cross-attention is the standard mechanism for bridging modalities (e.g., text-to-image diffusion models, text-conditioned speech synthesis) and for retrieval-augmented systems.

In short, self-attention relates a sequence to itself, while cross-attention relates one sequence (the queries) to another (the keys and values). Self-attention has a critical advantage over recurrent layers: it connects every pair of positions in a sequence with a constant number of sequential operations (O(1) path length), whereas an RNN requires O(n) sequential steps to propagate information from one end of the sequence to the other.[3] This makes self-attention more effective at capturing long-range dependencies and more parallelizable during training.

Masked / causal attention (autoregressive)

For autoregressive language modeling (predicting the next token given the previous tokens), the model must not be allowed to "see the future." This is enforced by masked self-attention, also called causal attention.[3] In implementation, the attention score matrix is augmented with a triangular mask: entries above the main diagonal (corresponding to attending to future positions) are set to negative infinity before the softmax, so they receive zero attention weight.

Mask_{i,j} = 0      if j <= i
           = -inf   if j  > i

This simple mask is what makes models like GPT and Llama autoregressive: at training time the model sees the entire sequence at once but is structurally prevented from leaking information backward from later positions. At inference time, tokens are generated one at a time, and each new token attends over all preceding tokens via the KV cache (see below).
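
The mask definition above translates directly into code. The following NumPy sketch builds the additive mask and then perturbs the last token to show that earlier outputs do not change; using the raw inputs as Q, K, and V (no learned projections) is a simplification made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(n):
    # 0 where j <= i (on or below the diagonal), -inf where j > i.
    return np.triu(np.full((n, n), -np.inf), k=1)

def causal_self_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(n)   # mask added before softmax
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
out, weights = causal_self_attention(X, X, X)

# Change only the final token; outputs at earlier positions must be unaffected.
X_future = X.copy()
X_future[-1] += 10.0
out_future, _ = causal_self_attention(X_future, X_future, X_future)
```

Every weight above the diagonal is exactly zero, so position i is a function of tokens 0 through i only.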

Bidirectional attention (BERT)

Encoder-only models such as BERT (Devlin et al., 2018) use bidirectional attention: every token attends to every other token in the sequence, with no causal mask.[13] To train such models without trivial copying, BERT replaces selected tokens with a special [MASK] symbol and trains the model to predict the original token from context: the masked language modeling objective. Bidirectional attention is well-suited to representation learning and discriminative tasks (classification, named-entity recognition, span extraction) but is not directly suited to open-ended text generation, which is why the causal GPT-style architecture is dominant for generative language models.

The encoder layers of encoder-decoder Transformers (e.g., the original Transformer for translation, T5) also use bidirectional self-attention; the decoder uses causal self-attention together with cross-attention back to the encoder.

Variants

Multi-head attention (MHA, 2017)

Standard multi-head attention as introduced by Vaswani et al. (2017) gives each of the h heads its own query, key, and value projections.[3] The KV cache during autoregressive inference therefore stores h sets of key and value vectors per token. This is memory-intensive at large model scale; the variants below trade off some quality for substantial KV-cache savings.

Multi-query attention (MQA, Shazeer 2019)

Multi-query attention was proposed by Noam Shazeer in his 2019 paper "Fast Transformer Decoding: One Write-Head Is All You Need" (arXiv:1911.02150).[14] See also the dedicated page on multi-query attention. The central observation is that during autoregressive inference, the primary performance bottleneck on modern accelerators is the memory bandwidth required to load the key-value cache, not the arithmetic computation itself.

MQA addresses this by having all query heads share a single set of key and value projections. Instead of h independent key and value heads (as in MHA), there is just one key head and one value head. Each query head still has its own projection, so the model retains h different query perspectives, but the KV cache is reduced by a factor of h. In practice, MQA speeds up inference decoding substantially with only a small quality degradation. It was adopted in PaLM (Google, 2022) and Falcon (TII, 2023).

Grouped-query attention (GQA, Ainslie 2023)

Grouped-query attention (GQA), introduced by Ainslie et al. (2023, arXiv:2305.13245, EMNLP 2023), is a compromise between MHA and MQA.[15] Instead of sharing a single KV head across all query heads (MQA) or having unique KV heads for every query head (MHA), GQA divides the query heads into g groups, where each group shares one set of key and value projections.

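| Variant | Query heads | KV heads | KV cache size (elements per layer for n tokens) |
|---|---|---|---|
| Multi-Head Attention (MHA) | h | h | h * d_k * 2 * n |
| Grouped-Query Attention (GQA) | h | g | g * d_k * 2 * n |
| Multi-Query Attention (MQA) | h | 1 | 1 * d_k * 2 * n |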

GQA generalizes both extremes: when g = h, GQA reduces to MHA; when g = 1, GQA reduces to MQA. By choosing an intermediate g, GQA achieves most of the inference speed benefits of MQA while maintaining quality closer to MHA. Meta adopted GQA with 8 KV groups in Llama 2 70B (2023), and it has since become the default attention variant in most production large language models, including Llama 3, Mistral, and many others.[15]
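
The sharing pattern and the cache-size column of the table can be sketched as follows. This is a NumPy illustration that assumes already-projected Q, K, and V tensors and assigns consecutive query heads to the same group; the helper names and toy sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (h, n, d_k); K, V: (g, n, d_k), with h divisible by g."""
    h, g = Q.shape[0], K.shape[0]
    # Each run of h // g consecutive query heads reads the same KV head.
    K_rep = np.repeat(K, h // g, axis=0)
    V_rep = np.repeat(V, h // g, axis=0)
    scores = Q @ K_rep.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V_rep

def kv_cache_elements(kv_heads, d_k, n):
    # Keys plus values (factor 2) for n tokens in a single layer.
    return kv_heads * d_k * 2 * n

rng = np.random.default_rng(0)
h, g, n, d_k = 8, 2, 5, 16               # toy sizes for illustration
Q = rng.normal(size=(h, n, d_k))
K = rng.normal(size=(g, n, d_k))
V = rng.normal(size=(g, n, d_k))

out = grouped_query_attention(Q, K, V)
mha_cache = kv_cache_elements(h, d_k, n)   # g = h
gqa_cache = kv_cache_elements(g, d_k, n)
mqa_cache = kv_cache_elements(1, d_k, n)   # g = 1
```

Only the g key and value heads need to be cached; the repeat is a read-time broadcast, so the cache shrinks by a factor of h / g relative to MHA.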

Multi-head Latent Attention (MLA, DeepSeek 2024)

Multi-head Latent Attention (MLA; see also the dedicated multi-head latent attention page) was introduced in DeepSeek-V2 (arXiv:2405.04434, May 2024).[16] MLA takes a fundamentally different approach to KV cache reduction: rather than reducing the number of KV heads (as in MQA and GQA), MLA compresses the key and value representations into a low-dimensional latent vector before caching. At inference time, the compressed representation is projected back to produce unique keys and values for each head.

Given an input token x_n, MLA first compresses it into a latent representation:

c^{KV}_n = W^{DKV} x_n

where W^{DKV} is a down-projection matrix mapping the model dimension d to a much smaller latent dimension d_c. This compact vector is stored in the KV cache instead of the full key and value vectors. When attention is computed, separate up-projection matrices W^{UK} and W^{UV} reconstruct unique keys and values for each head.

A key challenge is compatibility with rotary position embedding (RoPE). Standard RoPE entangles positional information with content, which would prevent the "absorption trick" that lets MLA fold the up-projection matrices into the query projection and avoid actually decompressing the KV cache during inference. DeepSeek solved this with decoupled RoPE: separate query and key vectors are introduced specifically for positional encoding, keeping the main latent keys isolated from rotation matrices.[16]

The DeepSeek-V2 paper reports a 93.3% reduction in KV-cache size and a 5.76-fold increase in maximum generation throughput relative to the earlier DeepSeek 67B model, with MLA matching (and sometimes exceeding) the quality of standard MHA. DeepSeek-V3 uses d_h = 128, H = 128, and d_c = 512, so the cached latent is 32 times smaller than the per-token keys alone (H * d_h = 16,384 dimensions), before counting the small decoupled-RoPE key. MLA was used in DeepSeek-V3 and DeepSeek-R1; subsequent research (TransMLA, 2025) has explored enabling MLA in any Transformer-based LLM.
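
The compression and the absorption idea can be illustrated with a toy NumPy sketch. It deliberately ignores RoPE and the decoupled positional keys, uses small made-up dimensions, and is not DeepSeek's implementation; the matrix names follow the notation above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_c, H, d_h, n = 64, 8, 4, 16, 5     # toy sizes, chosen for illustration

W_DKV = rng.normal(size=(d_c, d)) / np.sqrt(d)         # down-projection
W_UK = rng.normal(size=(H, d_h, d_c)) / np.sqrt(d_c)   # per-head key up-projection
W_UV = rng.normal(size=(H, d_h, d_c)) / np.sqrt(d_c)   # per-head value up-projection

X = rng.normal(size=(n, d))             # n token representations
C = X @ W_DKV.T                         # (n, d_c): the only thing that is cached

# Explicit path: rebuild per-head keys and values from the cached latents.
K = np.einsum("hkc,nc->hnk", W_UK, C)   # (H, n, d_h)
V = np.einsum("hkc,nc->hnk", W_UV, C)   # (H, n, d_h)

# Absorbed path: q^T (W_UK c) == (W_UK^T q)^T c, so scores can be taken
# against the latents directly without materializing K.
q = rng.normal(size=(H, d_h))                      # one query vector per head
scores_explicit = np.einsum("hk,hnk->hn", q, K)
q_absorbed = np.einsum("hk,hkc->hc", q, W_UK)      # (H, d_c)
scores_absorbed = q_absorbed @ C.T                 # (H, n)

cached_per_token = d_c
uncompressed_per_token = 2 * H * d_h               # keys plus values, all heads
```

Both paths give the same scores, which is why the up-projection can be folded into the query side as long as no position-dependent rotation sits between the latent and the key.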

Differential attention (2024)

Differential attention, introduced by Ye et al. at Microsoft Research and Tsinghua University in their 2024 paper "Differential Transformer" (arXiv:2410.05258, ICLR 2025 Oral), rethinks how attention scores are computed.[17] The mechanism partitions the query and key projections into two groups and computes two independent softmax distributions:

DiffAttn(X) = ( softmax(Q_1 K_1^T / sqrt(d)) - lambda * softmax(Q_2 K_2^T / sqrt(d)) ) V

The subtraction acts as noise cancellation: many tokens in standard attention receive small but non-negligible weight, diluting the signal. Differential attention subtracts these common noise patterns, causing attention to concentrate on genuinely relevant tokens. Experiments across model sizes from 830M to 13.1B parameters showed consistent improvements: a 6.8B Diff Transformer matched the validation loss of an 11B standard Transformer.[17]
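
A minimal single-head NumPy sketch of the DiffAttn formula is shown below. It treats lambda as a plain scalar argument and omits the per-head normalization and multi-head structure of the full architecture; the shapes and the function name are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, W_q, W_k, W_v, lam):
    # Split the projected queries and keys into two halves of width d each.
    Q1, Q2 = np.split(X @ W_q, 2, axis=-1)
    K1, K2 = np.split(X @ W_k, 2, axis=-1)
    V = X @ W_v
    d = Q1.shape[-1]
    A1 = softmax(Q1 @ K1.T / np.sqrt(d))
    A2 = softmax(Q2 @ K2.T / np.sqrt(d))
    weights = A1 - lam * A2              # difference of two softmax maps
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d = 6, 16, 8                 # toy sizes for illustration
X = rng.normal(size=(n, d_model))
W_q = rng.normal(size=(d_model, 2 * d)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, 2 * d)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d)) / np.sqrt(d_model)

out, weights = diff_attention(X, W_q, W_k, W_v, lam=0.5)
out_plain, weights_plain = diff_attention(X, W_q, W_k, W_v, lam=0.0)
```

With lam = 0 the mechanism falls back to ordinary softmax attention on the first half of the projections; with lam > 0 individual weights can become negative and each row sums to 1 - lam rather than 1.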

Efficiency improvements

Flash Attention v1/v2/v3 (Dao 2022-2024)

FlashAttention, introduced by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré in their 2022 paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (arXiv:2205.14135), addresses the quadratic memory cost of attention by rethinking how the computation interacts with GPU hardware, without approximating the math.[18]

The key insight is that standard attention implementations are bottlenecked not by arithmetic operations but by memory transfers between GPU high-bandwidth memory (HBM) and the on-chip SRAM. Standard implementations materialize the full n x n attention matrix in HBM, which requires O(n^2) memory reads and writes. FlashAttention avoids this by tiling: it splits Q, K, and V into blocks, loads each block from HBM into SRAM, computes the attention for that block in fast on-chip memory, and writes only the final output back to HBM. A carefully designed online softmax normalization algorithm allows blocks to be processed incrementally without ever needing the full attention matrix in memory.

The result is exact attention (not an approximation) that uses O(n) memory instead of O(n^2) and achieves wall-clock speedups of 2 to 4 times over standard implementations.[18]
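
The online-softmax idea can be demonstrated on the CPU. The NumPy sketch below tiles only over key/value blocks and says nothing about real GPU memory traffic; it shows just that processing blocks incrementally with running maxima and sums reproduces exact attention. The block size and function names are assumptions.

```python
import numpy as np

def reference_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[1])              # full n x n score matrix
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))
    row_max = np.full(n, -np.inf)    # running max score per query row
    row_sum = np.zeros(n)            # running softmax denominator per row
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                  # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)          # rescale earlier accumulators
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]                  # normalize once at the end

rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 8))
tiled = tiled_attention(Q, K, V, block=4)
exact = reference_attention(Q, K, V)
```

The largest intermediate in the tiled version is n x block rather than n x n, and the two results agree to floating-point precision.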

  • FlashAttention-2 (Tri Dao, 2023, arXiv:2307.08691) reduced non-matmul FLOPs by restructuring the algorithm to spend a higher fraction of time on matrix multiplications, improved parallelism across thread blocks, and refined warp partitioning. These changes yielded roughly a 2x speedup over FlashAttention v1, reaching 50 to 73% of theoretical maximum FLOPs/s on NVIDIA A100 GPUs.[19]
  • FlashAttention-3 (Tri Dao and Jay Shah, 2024, NeurIPS 2024) targets NVIDIA Hopper GPUs (H100). It exploits asynchronous execution of Tensor Cores and the Tensor Memory Accelerator via warp specialization, interleaved block-wise matmul and softmax operations, and FP8 low-precision computation with block quantization. It achieves up to 840 TFLOPs/s in BF16 on H100 (about 85% utilization), roughly 1.5 to 2x faster than FlashAttention-2.[20] See FlashAttention-3 for the dedicated v3 article.
  • FlashAttention-4 (Zadouri, Shah, Hohnerbach, Liu, Thakkar, Dao, 2026, MLSys 2026) extends the line to NVIDIA Blackwell GPUs (B200). It introduces ping-pong scheduling, software exponential emulation using polynomial approximation on FMA units, and conditional online softmax rescaling. Written in CuTe-DSL, it reaches up to 1,605 TFLOPs/s in BF16 on B200 (about 71% utilization), 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton.[21]

Sparse and sliding-window attention (Longformer, Mistral)

Sparse attention approaches restrict each token's attention to a subset of positions rather than the full sequence.

  • Longformer (Beltagy, Peters, and Cohan, 2020, arXiv:2004.05150) combines local sliding-window attention with global attention on a small number of designated tokens (e.g., the [CLS] token for classification). It also introduced dilated sliding-window attention. Longformer scales linearly with sequence length and was pretrained for up to 4,096 tokens.[22]
  • BigBird (Zaheer et al., 2020, NeurIPS) extends Longformer by adding random attention. The authors proved theoretically that BigBird's sparse pattern is a universal approximator of sequence functions and Turing complete.[23]
  • Mistral 7B (Mistral AI, 2023) uses sliding-window attention with a window size of 4,096 tokens; stacking 32 layers gives a theoretical receptive field of about 131,000 tokens. A rolling-buffer KV cache limited to the window bounds cache memory; the Mistral 7B paper reports an 8x reduction in cache memory at a sequence length of 32K tokens.[24]
  • Hybrid local-global designs: Gemma 2 (Google, 2024) uses a 1:1 ratio of local and global attention layers with a 4,096-token window. Gemma 3 (Google, 2025) shifted to a 5:1 ratio with a 1,024-token window, reducing attention compute by roughly 5x and KV cache memory from about 60% to 15% of total memory while still supporting 128K context lengths via RoPE frequency rescaling on the global layers.[25]
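
A sliding-window pattern is just a different additive mask. The NumPy sketch below assumes a causal window that includes the current token (so each row allows at most `window` positions), which is one common convention rather than the definition used by any specific model; the receptive-field arithmetic uses the Mistral 7B figures quoted above.

```python
import numpy as np

def sliding_window_mask(n, window):
    i = np.arange(n)[:, None]    # query positions
    j = np.arange(n)[None, :]    # key positions
    # Allowed: causal (j <= i) and within the last `window` positions.
    allowed = (j <= i) & (j > i - window)
    return np.where(allowed, 0.0, -np.inf)

mask = sliding_window_mask(6, 3)
allowed_per_row = np.isfinite(mask).sum(axis=1)

# Information can travel one window per layer, so stacking layers widens
# the theoretical receptive field multiplicatively.
window, layers = 4096, 32
receptive_field = window * layers
```

Because each row has at most `window` finite entries, attention cost grows linearly with sequence length instead of quadratically.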

Linear attention (Performer, Linformer)

Katharopoulos et al. (2020) proposed linear attention in "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (ICML 2020).[26] The core idea replaces the softmax with a decomposable kernel function. Using feature maps phi:

LinearAttention(Q, K, V) = ( phi(Q) ( phi(K)^T V ) ) / ( phi(Q) sum(phi(K)^T) )

By exploiting the associative property of matrix multiplication, the computation avoids materializing the n x n attention matrix. The product phi(K)^T V produces a d x d matrix (independent of n), reducing complexity from O(n^2 d) to O(n d^2). Katharopoulos et al. used phi(x) = elu(x) + 1 and showed a direct connection between Transformers and RNNs that enables efficient autoregressive generation.
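
The reordering can be verified numerically. The NumPy sketch below implements the non-causal form with phi(x) = elu(x) + 1 and compares it against the same kernel attention computed the quadratic way; the function names and random inputs are illustrative assumptions.

```python
import numpy as np

def elu_plus_one(x):
    # elu(x) + 1 equals x + 1 for x > 0 and exp(x) otherwise; always positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    phi_Q, phi_K = elu_plus_one(Q), elu_plus_one(K)
    KV = phi_K.T @ V              # (d, d_v): size does not depend on n
    Z = phi_K.sum(axis=0)         # (d,): normalizer summary
    return (phi_Q @ KV) / (phi_Q @ Z)[:, None]

def quadratic_reference(Q, K, V):
    A = elu_plus_one(Q) @ elu_plus_one(K).T        # explicit (n, n) similarities
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d = 50, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

The two results match, but the linear version never forms an n x n matrix: its intermediates are d x d and d-dimensional summaries of the keys and values.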

Related approaches include:

  • Linformer (Wang et al., 2020, arXiv:2006.04768) projects keys and values to a lower-dimensional space, achieving O(n) complexity but at the cost of fixing a maximum sequence length.[27]
  • Performer (Choromanski et al., 2020, ICLR 2021, arXiv:2009.14794) uses random feature maps (FAVOR+) to approximate softmax attention with provable accuracy bounds at O(n) cost.[28]
  • Reformer (Kitaev, Kaiser, and Levskaya, 2020, arXiv:2001.04451) uses locality-sensitive hashing to attend only to nearby items in hash space.[29]

Ring attention (Liu 2023)

Ring Attention (Liu, Zaharia, and Abbeel, ICLR 2024, arXiv:2310.01889) is a distributed sequence-parallelism technique that enables processing of extremely long sequences by splitting them across devices arranged in a ring topology.[30] Each device computes blockwise attention between its local query block and a visiting KV block, while simultaneously sending that KV block to the next device in the ring and receiving the next KV block from the previous device. When the computation on each block takes at least as long as the block transfer, communication is fully overlapped with compute and adds no overhead.

Ring Attention enables training and inference on sequences up to p times longer than what a single device can handle, where p is the number of devices. On 32 A100 GPUs, a 7B model can process over 1 million tokens; on TPUv4-1024, a 3B model can train with 16 million tokens.[30]

Native Sparse Attention (2025)

Native Sparse Attention (NSA), introduced by DeepSeek in February 2025 (arXiv:2502.11089, Best Paper at ACL 2025), is a hardware-aligned sparse attention mechanism designed to be natively trainable end-to-end.[31] NSA processes inputs through three parallel attention branches combined via learned gating: compressed attention (coarse-grained blocks), selected attention (top-n important blocks at full precision), and sliding-window attention (local recent tokens). On 64K sequences, NSA achieves 9.0x forward speedup, 6.0x backward speedup, and 11.6x decoding speedup while matching or exceeding full-attention quality.

Position encoding pairing

Attention is intrinsically permutation-equivariant: scaled dot-product attention treats its inputs as an unordered set, so positional information must be injected externally for sequence modeling. The choice of position encoding has become a major design lever in modern Transformers, and several encodings are tightly coupled to specific attention variants.

  • Sinusoidal absolute position embeddings, used in the original Transformer, add a deterministic sinusoidal vector to each token embedding before the first attention layer.[3]
  • Learned absolute position embeddings, used by bert and gpt-2, replace the sinusoid with a learned vector per position.[13]
  • Rotary position embedding (RoPE), introduced by Su et al. (2021, arXiv:2104.09864), rotates the query and key vectors by a position-dependent angle inside each attention head.[32] RoPE has become the default in Llama, Mistral, DeepSeek, Qwen, and most modern LLMs. The fact that RoPE acts directly on Q and K means it composes naturally with attention variants such as GQA, MLA (with decoupled RoPE), and sliding-window attention.
  • ALiBi (Attention with Linear Biases, Press et al., 2022, arXiv:2108.12409) adds a fixed linear penalty to the attention scores based on the distance between query and key positions.[33] ALiBi enables length extrapolation: models trained at one context length can be evaluated at much longer lengths without re-training.

These position encodings interact with attention in different ways: RoPE rotates Q and K, ALiBi biases the score matrix, and absolute encodings simply add to the token representation. The interaction is non-trivial; for example, MLA's compatibility with RoPE required the decoupled-RoPE design described above.
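
The difference between rotating Q and K and biasing the score matrix can be made concrete. The NumPy sketch below pairs adjacent feature dimensions for RoPE and uses a single hypothetical slope for ALiBi; real implementations differ in pairing layout and use one slope per head.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate adjacent feature pairs of x (shape (d,), d even) by angles that grow with pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)    # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def alibi_bias(n, slope):
    # Penalty proportional to query-key distance, for causal positions j <= i.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return -slope * np.maximum(i - j, 0)

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

# Same relative offset (2) at two different absolute positions.
score_near = rope(q, 3) @ rope(k, 1)
score_far = rope(q, 10) @ rope(k, 8)

bias = alibi_bias(4, slope=0.5)    # slope value is an arbitrary example
```

After rotation, the query-key dot product depends only on the relative offset between the two positions, which is why RoPE needs no change to the score computation itself, whereas the ALiBi matrix is simply added to the scores before the softmax.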

KV cache (inference optimization)

During autoregressive generation (producing one token at a time), a language model must compute attention over all previously generated tokens. Without optimization, this means recomputing the key and value projections for every past token at every generation step, leading to redundant computation that grows quadratically with sequence length.

The KV cache solves this by storing the key and value vectors from all previous time steps. At each new generation step, only the key and value for the new token need to be computed and appended to the cache. The query for the new token then attends over all cached keys and values. This reduces the per-step key/value projection cost from O(n d^2) to O(d^2) per layer, though the attention computation itself still requires O(n d) per step.
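A single-head NumPy sketch makes the equivalence concrete: decoding token by token with a cache gives the same outputs as recomputing causal attention over the whole prefix. The class name `KVCache` and the weight matrices are illustrative assumptions, not any framework's API.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(X, Wq, Wk, Wv):
    """No cache: project every token again and apply a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    mask = np.tril(np.ones_like(scores, dtype=bool))
    return softmax(np.where(mask, scores, -np.inf)) @ V

class KVCache:
    def __init__(self):
        self.K, self.V = [], []

    def step(self, x, Wq, Wk, Wv):
        """Project only the new token, append its key/value, attend over the cache."""
        self.K.append(x @ Wk)
        self.V.append(x @ Wv)
        K, V = np.stack(self.K), np.stack(self.V)
        q = x @ Wq
        return softmax(K @ q / np.sqrt(q.shape[0])) @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

cache = KVCache()
stepwise = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])
print(np.allclose(stepwise, full_causal_attention(X, Wq, Wk, Wv)))
```

Each call to `step` performs one key and one value projection, while the attention itself still touches every cached entry, matching the cost breakdown above.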

The main challenge is memory: the KV cache grows linearly with sequence length, number of layers, KV width (KV heads times head dimension), and batch size. For a large model such as Llama 2 70B with a context window of 4,096 tokens, the KV cache can consume tens of gigabytes of GPU memory once many sequences are served in one batch. Several strategies address this:

| Strategy | Mechanism | Typical reduction |
|---|---|---|
| MQA / GQA | Reduce number of KV heads | KV cache reduced by factor of h (MQA) or h/g (GQA) |
| MLA | Compress KV into low-rank latent vector | 93.3% cache reduction (DeepSeek-V2) |
| KV cache quantization | Store cached K/V in FP8 or INT4 | 2-4x memory reduction |
| Paged attention (PagedAttention) | Virtual-memory-style non-contiguous cache blocks | Waste reduced from 60-80% to under 4% |
| Sliding-window caches | Limit cache to fixed window size | Bounded memory regardless of sequence length |
| Token eviction / compression | Selectively remove or merge less important cached tokens | Variable, task-dependent |
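The cache size and the effect of reducing the number of KV heads can be estimated with simple arithmetic. The sketch below assumes a configuration in the style of a 70B-parameter model (80 layers, head dimension 128, 16-bit cache entries) and compares 64 KV heads with 8; these numbers are illustrative assumptions and ignore allocator overhead.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values (hence the factor 2) for every layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3

# Assumed configuration: 80 layers, head_dim 128, 4,096-token context, FP16/BF16 cache.
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=4096)

print(f"64 KV heads: {mha / GIB:.2f} GiB per sequence")
print(f" 8 KV heads: {gqa / GIB:.2f} GiB per sequence")
print(f"reduction factor: {mha // gqa}x")
```

Multiplying the per-sequence figure by the batch size shows how quickly the cache reaches tens of gigabytes, and why a reduction in KV heads translates directly into a proportional memory saving.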

PagedAttention (Kwon et al., SOSP 2023), used in the vLLM serving framework, deserves special mention.[34] It borrows ideas from operating-system virtual memory to manage cache memory in non-contiguous blocks, reducing fragmentation. According to the authors, earlier serving systems waste 60 to 80% of KV-cache memory through fragmentation and over-reservation; PagedAttention reduces waste to under 4% and improves serving throughput by 2 to 4x. RadixAttention (Zheng et al., 2024) extends this by sharing prefix KV blocks across requests in a radix tree, accelerating multi-turn conversation and structured generation.
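The block-table idea can be illustrated with a toy allocator in plain Python. This is a sketch of the bookkeeping only, under the assumption of fixed-size blocks handed out on demand; the class `PagedKVAllocator` is hypothetical and does not mirror vLLM's internal classes.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps logical blocks to arbitrary physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:       # last block is full, or none allocated yet
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Reserved but unused token slots (at most block_size - 1 per sequence)."""
        reserved = sum(len(t) for t in self.block_tables.values()) * self.block_size
        return reserved - sum(self.lengths.values())

alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for seq in ["a", "b", "a", "b", "a", "b", "a", "a"]:   # two interleaved requests
    alloc.append_token(seq)
print(alloc.block_tables, alloc.wasted_slots())
```

Because blocks are allocated only when the previous one fills up, waste is limited to the tail of each sequence's last block instead of a full pre-reserved maximum-length buffer.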

Implementation tricks

Practical attention implementations in modern training and inference stacks rely on a handful of complementary techniques:

  • Mixed-precision training: matrix multiplications are computed in BF16 or FP16 while the softmax and gradient accumulation use FP32, balancing numerical stability with throughput.[35] On H100 and B200, FP8 attention (used by FlashAttention-3 and -4) further increases throughput, with careful scaling and incoherent processing to bound numerical error.
  • Triton kernels: many research attention kernels (including Triton implementations of FlashAttention and DeepSeek's NSA kernels) are written in OpenAI Triton, a Python-like DSL that targets GPUs and abstracts away much of the CUDA boilerplate.[36] Triton has become a common choice for custom attention kernels in research and is increasingly used in production.
  • CUTLASS and CuTe-DSL: NVIDIA's CUTLASS library and its Python-based CuTe-DSL provide high-performance GEMM building blocks that underpin FlashAttention-3 and -4 on Hopper and Blackwell GPUs.
  • Paged attention (vLLM): as described above, PagedAttention enables high-throughput serving by managing the KV cache as virtual-memory pages, making it feasible to serve many concurrent requests with shared prefixes.[34]
  • Continuous batching: serving frameworks such as vLLM and TensorRT-LLM use continuous batching (also called in-flight batching), where new requests join an ongoing batch as previous ones finish, substantially increasing GPU utilization for autoregressive workloads.
  • Speculative decoding: speculative-decoding and lookahead techniques generate several candidate tokens cheaply (typically with a small draft model) and verify them with the larger target model in a single forward pass, increasing tokens per second without changing the underlying attention algorithm.
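The draft-then-verify loop behind speculative decoding can be sketched with toy next-token functions. The sketch assumes greedy decoding, where a drafted token is accepted only if it equals the target model's own choice; `target_next` and `draft_next` are hypothetical stand-ins for real models, and the verification is written as a loop where a real system would score all drafted positions in one batched pass.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding with toy next-token functions."""
    tokens = list(prompt)
    goal = len(prompt) + num_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks each proposed position in order.
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if expected != draft[i]:
                break            # first mismatch: keep the target's token, drop the rest
        tokens.extend(accepted)
    return tokens[:goal]

# Toy models: the draft agrees with the target except at every third length.
target = lambda t: (3 * sum(t) + len(t)) % 7
draft = lambda t: target(t) if len(t) % 3 else 0

print(speculative_decode(target, draft, prompt=[1, 2], num_tokens=10))
```

Every emitted token is the one the target model would have chosen on its own, so under greedy decoding the output is unchanged; the speedup comes from accepting several tokens per target pass when the draft is accurate.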

Limitations

O(n^2) memory and compute

Standard self-attention has time complexity O(n^2 d) and, when the attention matrix is materialized, memory complexity O(n^2), where n is sequence length and d is model dimension.[3] The quadratic dependence on n is the fundamental bottleneck for very long contexts. FlashAttention reduces the extra memory to O(n) by never storing the full n x n attention matrix, but the compute cost remains quadratic for exact attention. This is why sparse, linear, and state-space alternatives remain active research areas.
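The reason exact attention does not need the full n x n matrix is the online softmax: keys and values can be processed block by block while keeping a running maximum, a running denominator, and a running output per query. The NumPy sketch below shows only this numerical idea, under the simplifying assumption that just the key/value axis is tiled; it is not the FlashAttention kernel, which also tiles queries and manages GPU on-chip memory.

```python
import numpy as np

def naive_attention(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[1])               # materializes the full n x n matrix
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Exact attention using running (max, sum, output) statistics per query row."""
    n, d = Q.shape
    m = np.full(n, -np.inf)                         # running row maximum
    l = np.zeros(n)                                 # running softmax denominator
    O = np.zeros((n, V.shape[1]))                   # running unnormalized output
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                   # only an n x block tile of scores
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                   # rescale earlier partial results
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        O = O * scale[:, None] + P @ Vb
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))
```

The number of multiply-adds is the same in both functions, which is why the compute cost stays quadratic even though the extra storage per query is a constant number of statistics.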

Long-context challenges

Empirically, long-context Transformers face several distinct failure modes:

  • Lost in the middle: Liu et al. (2023, arXiv:2307.03172) showed that even capable LLMs are markedly less accurate at retrieving information from the middle of a long context compared to the beginning or end, producing a U-shaped accuracy curve.[37]
  • Position-encoding extrapolation: many position encodings struggle to generalize beyond the training context length. Techniques like RoPE frequency scaling (NTK-aware scaling, YaRN), ALiBi, and position interpolation address this with varying success.
  • Softmax attention dilution: as context length grows, individual attention weights become smaller, making it harder to pick out a few important tokens. Differential attention[17] and learned sparse attention[31] are partial remedies.
  • Throughput and latency: even with FlashAttention, prefill and decode latency grow with context length, motivating the variants surveyed above.
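The dilution effect listed above follows directly from the softmax normalization. As a minimal illustration, assume one "needle" key whose logit exceeds all other logits by a fixed gap, with the remaining n - 1 logits equal; the function name and the gap value are arbitrary choices for this sketch.

```python
import numpy as np

def needle_weight(n, gap):
    """Softmax weight on one key whose logit is `gap` above n - 1 equal distractors."""
    logits = np.zeros(n)
    logits[0] = gap
    w = np.exp(logits - logits.max())
    return (w / w.sum())[0]

for n in (1_000, 10_000, 100_000):
    print(n, float(needle_weight(n, gap=5.0)))
```

In this setting the weight equals e^gap / (e^gap + n - 1), so with a fixed logit gap it shrinks roughly in proportion to 1/n; keeping the weight constant requires the gap to grow like log n.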

Visualization and interpretation

Are attention weights an explanation?

One practical advantage of attention mechanisms is that attention weights can be inspected to gain insight into what the model is focusing on. Attention maps are typically visualized as heatmaps, where brighter entries indicate stronger attention between two positions. Bahdanau et al. (2014) and Vaswani et al. (2017) used such visualizations to argue that the attention mechanism recovers linguistically interpretable patterns (e.g., word alignments in translation, head-dependent relations in parsing).[4][3]

However, researchers have cautioned against over-interpreting attention weights. Jain and Wallace (2019) showed that attention weights often do not correlate well with other measures of feature importance and that alternative attention distributions can produce identical predictions, challenging the view that attention is itself an explanation.[38] Subsequent work (Wiegreffe and Pinter, 2019, EMNLP) clarified the conditions under which attention can or cannot be interpreted as explanation.[39] Attention weights indicate how information flows through the network but do not necessarily indicate which inputs are causally important for the output; more rigorous interpretability methods, such as probing classifiers and gradient-based attribution, are typically needed to draw reliable conclusions.

Alternatives to attention

A line of research aims to replace attention entirely with mechanisms that have linear or sub-quadratic complexity while retaining the parallelism and expressivity of Transformers.

  • State-space models (Mamba): Mamba (Gu and Dao, 2023, arXiv:2312.00752) makes the parameters of a structured state-space model input-dependent (selective), enabling dynamic information routing.[40] The authors report linear scaling in sequence length, 5x higher inference throughput than Transformers, and improving quality on sequences up to a million tokens long; Mamba-3B outperforms Transformers of the same size and matches Transformers of twice its size on language-modeling benchmarks. Recent hybrid architectures such as Jamba (AI21 Labs, 2024) interleave Mamba, attention, and Mixture-of-Experts layers.
  • Linear RNNs (RWKV): RWKV (Peng et al., 2023, arXiv:2305.13048) is an RNN-style architecture with linear complexity in sequence length that can nonetheless be trained in parallel like a Transformer.[41] RWKV aims to combine the parallel training of Transformers with the inference efficiency of RNNs; the project has released open-weight models from roughly 100M to 14B parameters.
  • Hyena: Hyena (Poli et al., ICML 2023, arXiv:2302.10866) replaces attention with a recurrence built from long convolutions and data-controlled gating.[42] Hyena has subquadratic complexity; the authors report Transformer-level language-modeling quality at sequence length 2K with about 20% less training compute, and an operator about 100x faster than optimized attention at sequence length 64K.
  • RetNet (Retentive Networks): Sun et al. (2023, arXiv:2307.08621) propose retention, a mechanism that admits a parallel form (similar to attention) and a recurrent form (similar to an RNN), enabling O(1) inference per step with parallel training.[43]
  • DeltaNet: Schlag, Irie, and Schmidhuber (2021) and later Yang et al. (2024) developed DeltaNet, a linear-attention variant with an explicit delta-rule update that improves recall over plain linear attention.[44]
  • Hybrid architectures: rather than fully replacing attention, several systems interleave attention layers with state-space or linear-recurrence layers. Examples include Jamba, StripedHyena, and Samba. These hybrids attempt to combine the precise recall of attention with the efficiency of sub-quadratic mechanisms.
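The dual parallel/recurrent view mentioned for retention can be checked numerically. The sketch below is a stripped-down single-head version with a scalar decay `gamma`; it deliberately omits the normalization, positional rotation, and multi-scale decay of the published architecture, so it should be read as an illustration of the algebra rather than as RetNet itself.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Training-style form: one masked, decayed n x n score matrix."""
    n = Q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)   # causal decay mask
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Inference-style form: a fixed-size state updated once per token."""
    S = np.zeros((K.shape[1], V.shape[1]))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)      # state size is independent of sequence length
        out.append(q @ S)
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(7, 4)) for _ in range(3))
print(np.allclose(retention_parallel(Q, K, V, 0.9), retention_recurrent(Q, K, V, 0.9)))
```

Both functions compute the same sum of decayed key-value products; the recurrent form simply accumulates it in a d x d state, which is what gives constant cost per generated token.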

Recent benchmarks suggest a nuanced picture: attention excels at precise recall from context (the "needle in a haystack" task and related associative recall), while SSMs and linear recurrences excel at compression and efficiency over long sequences.[45] Hybrid architectures are an attempt to combine the strengths of both.

Comparison of attention variants

The following table summarizes major attention variants, their key characteristics, and the models that adopt them.

| Variant | Year | Authors | Key idea | Complexity | Notable models |
|---|---|---|---|---|---|
| Additive (Bahdanau) | 2014 | Bahdanau, Cho, Bengio | Learned alignment via feedforward net | O(n m) | Early NMT |
| Multiplicative (Luong) | 2015 | Luong, Pham, Manning | Dot-product / general scoring | O(n m) | Early NMT |
| Scaled dot-product | 2017 | Vaswani et al. | Q, K, V with sqrt(d_k) scaling | O(n^2 d) | All Transformers |
| Multi-head attention | 2017 | Vaswani et al. | h parallel heads in subspaces | O(n^2 d) | All Transformers |
| Sparse (Longformer) | 2020 | Beltagy et al. | Local + global + dilated | O(n w) | Longformer |
| Sparse (BigBird) | 2020 | Zaheer et al. | Local + global + random | O(n (w+r+g)) | BigBird |
| Linear attention | 2020 | Katharopoulos et al. | Kernel feature map | O(n d^2) | Linear Transformer |
| Performer | 2020 | Choromanski et al. | Random features (FAVOR+) | O(n d^2) | Performer |
| Multi-query (MQA) | 2019 | Shazeer | Single shared KV head | O(n^2 d) | PaLM, Falcon |
| Grouped-query (GQA) | 2023 | Ainslie et al. | g shared KV groups | O(n^2 d) | Llama 2, Llama 3, Mistral |
| FlashAttention | 2022 | Dao et al. | IO-aware tiling, exact, O(n) memory | O(n^2 d) compute | Widely adopted |
| FlashAttention-2 | 2023 | Dao | Better parallelism, warp partitioning | O(n^2 d) compute | Widely adopted |
| FlashAttention-3 | 2024 | Dao, Shah | Hopper async, FP8, warp specialization | O(n^2 d) compute | H100 deployments |
| FlashAttention-4 | 2026 | Zadouri, Shah, Dao et al. | Blackwell pipelining | O(n^2 d) compute | B200 deployments |
| Sliding window | 2023 | Mistral AI | Local window + rolling KV cache | O(n w) | Mistral 7B, Gemma 3 |
| Multi-head latent (MLA) | 2024 | DeepSeek | Latent compression of KV cache | O(n^2 d) | DeepSeek-V2, V3, R1 |
| Differential attention | 2024 | Ye et al. | Difference of two softmax maps | O(n^2 d) | Diff Transformer |
| Native sparse (NSA) | 2025 | DeepSeek | Trained sparse: compress + select + slide | O(n (n/r + k + w)) | DeepSeek (research) |
| Ring Attention | 2024 | Liu, Zaharia, Abbeel | Distributed ring sequence parallel | O(n^2 d / p) per device | Long-context training |
| Selective SSM (Mamba) | 2023 | Gu, Dao | Input-dependent SSM | O(n d) | Mamba, Jamba |

Attention in computer vision

Vision Transformer (ViT)

The Vision Transformer (ViT, Dosovitskiy et al., 2020, ICLR 2021, arXiv:2010.11929) demonstrated that a pure Transformer applied directly to sequences of image patches can match or exceed state-of-the-art convolutional neural networks (CNNs) on image classification.[46] ViT splits an image into fixed-size patches (typically 16x16 pixels), flattens each patch into a vector, projects it to the model dimension, prepends a learnable [CLS] token, and adds positional embeddings, then processes the resulting sequence with a standard Transformer encoder using multi-head self-attention.
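The input pipeline described above (patch splitting, linear projection, class token, positional embeddings) can be sketched in NumPy. Random arrays stand in for learned parameters, and the names `patchify`, `vit_tokens`, `W_embed`, `cls_token`, and `pos_embed` are illustrative assumptions rather than any library's API.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an [H, W, C] image into flattened non-overlapping patches: [N, patch*patch*C]."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                  # [rows, cols, patch, patch, C]
    return x.reshape(-1, patch * patch * C)

def vit_tokens(image, W_embed, cls_token, pos_embed, patch=16):
    """Build the token sequence fed to the Transformer encoder."""
    patches = patchify(image, patch) @ W_embed      # project each patch to model dim d
    tokens = np.concatenate([cls_token[None, :], patches], axis=0)   # prepend [CLS]
    return tokens + pos_embed                       # add one position vector per token

rng = np.random.default_rng(0)
d = 32
image = rng.normal(size=(224, 224, 3))
W_embed = rng.normal(size=(16 * 16 * 3, d))         # stand-in for the learned projection
cls_token = rng.normal(size=d)
pos_embed = rng.normal(size=(197, d))               # 196 patches + 1 class token

print(vit_tokens(image, W_embed, cls_token, pos_embed).shape)
```

A 224 x 224 image with 16 x 16 patches yields 14 x 14 = 196 patch tokens, so self-attention in the encoder operates on a sequence of 197 tokens including the class token.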

Self-attention in ViT allows every patch to attend to every other patch, capturing global relationships across the entire image from the very first layer, in contrast to CNNs, which build global understanding only gradually through stacked local convolutions. ViT has since spawned many variants, including DeiT (Touvron et al., 2021), Swin Transformer (Liu et al., 2021), which uses shifted-window attention for efficiency, and BEiT (Bao et al., 2021).

Attention in diffusion models

Modern text-to-image diffusion models like Stable Diffusion, DALL-E, and Imagen rely heavily on attention. Self-attention operates within the image latent representations at multiple spatial resolutions, preserving global geometric coherence. Cross-attention connects the text prompt (providing keys and values from a text encoder such as CLIP or T5) to the image latent features (providing queries), controlling which regions of the image correspond to which words.[47] Researchers have leveraged cross-attention maps for prompt-to-prompt editing, attention-based layout control, and interpretability analysis.
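The direction of information flow in cross-attention, with queries from image latents and keys and values from text embeddings, is easy to see in a single-head NumPy sketch. Dimensions and weights here are arbitrary assumptions; the returned weight matrix is the kind of per-word attention map that editing and layout-control methods inspect.

```python
import numpy as np

def cross_attention(latents, text, Wq, Wk, Wv):
    """latents [n_img, d_img] supply queries; text [n_txt, d_txt] supplies keys and values."""
    Q, K, V = latents @ Wq, text @ Wk, text @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])          # [n_img, n_txt]
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # per image position: distribution over words
    return weights @ V, weights

rng = np.random.default_rng(0)
latents = rng.normal(size=(64, 32))                 # e.g. an 8 x 8 latent grid, flattened
text = rng.normal(size=(5, 48))                     # five prompt-token embeddings
Wq = rng.normal(size=(32, 16))
Wk = rng.normal(size=(48, 16))
Wv = rng.normal(size=(48, 32))

out, attn = cross_attention(latents, text, Wq, Wk, Wv)
print(out.shape, attn.shape)
```

Reshaping one column of `attn` back to the 8 x 8 grid gives a spatial map showing which latent positions attend most strongly to the corresponding prompt token.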

Applications of attention

What is attention used for?

Attention mechanisms have been adopted across virtually every domain of machine learning:

  • Natural language processing: attention is the backbone of BERT, the GPT series, T5, Llama, Claude, and DeepSeek models, enabling translation, summarization, question answering, and code generation.
  • Computer vision: Vision Transformers and their variants use self-attention for image classification, object detection, and segmentation.
  • Speech and audio: models such as Whisper (OpenAI, 2022) use cross-attention between audio features and text tokens for speech recognition.
  • Multimodal learning: cross-attention connects different modalities in models such as Flamingo (DeepMind, 2022), Stable Diffusion, and video understanding systems.
  • Protein structure prediction: AlphaFold 2 (DeepMind, 2021) uses a specialized attention-based module, the Evoformer, which applies row-wise and column-wise attention to multiple-sequence-alignment features and triangle attention to residue-pair features.[48]
  • Reinforcement learning: Decision Transformer (Chen et al., 2021) frames RL as a sequence-modeling problem, applying self-attention to sequences of states, actions, and rewards.[49]

Explain like I'm 5 (ELI5)

Imagine you are in a classroom and the teacher asks a question. You look around the room for clues. Some classmates are whispering the answer, some are drawing in their notebooks, and some are looking out the window. Attention is like choosing to listen more closely to the classmates who seem to know the answer and ignoring the ones looking out the window. In machine learning, attention lets the computer do something similar: it decides which pieces of information are most helpful for the task at hand and pays more attention to those, while downplaying the rest.

References

  1. Wikipedia. "Attention (machine learning)." https://en.wikipedia.org/wiki/Attention_(machine_learning)
  2. Lilian Weng. "Attention? Attention!" *Lil'Log*, June 2018. https://lilianweng.github.io/posts/2018-06-24-attention/
  3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." *NeurIPS 2017*. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
  4. Bahdanau, D., Cho, K., and Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." *ICLR 2015*. arXiv:1409.0473. https://arxiv.org/abs/1409.0473
  5. Cherry, E. C. (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears." *Journal of the Acoustical Society of America*, 25(5), 975-979. https://doi.org/10.1121/1.1907229
  6. Treisman, A. M., and Gelade, G. (1980). "A Feature-Integration Theory of Attention." *Cognitive Psychology*, 12(1), 97-136. https://doi.org/10.1016/0010-0285(80)90005-5
  7. Corbetta, M., and Shulman, G. L. (2002). "Control of Goal-Directed and Stimulus-Driven Attention in the Brain." *Nature Reviews Neuroscience*, 3, 201-215. https://doi.org/10.1038/nrn755
  8. Itti, L., Koch, C., and Niebur, E. (1998). "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(11), 1254-1259. https://doi.org/10.1109/34.730558
  9. Xu, K., Ba, J., Kiros, R., et al. (2015). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." *ICML 2015*. arXiv:1502.03044. https://arxiv.org/abs/1502.03044
  10. Sutskever, I., Vinyals, O., and Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks." *NeurIPS 2014*. arXiv:1409.3215. https://arxiv.org/abs/1409.3215
  11. Luong, M.-T., Pham, H., and Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." *EMNLP 2015*. arXiv:1508.04025. https://arxiv.org/abs/1508.04025
  12. Google Scholar citation count for "Attention Is All You Need" (Vaswani et al., 2017). https://scholar.google.com/scholar?cluster=2960712678066186980
  13. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *NAACL 2019*. arXiv:1810.04805. https://arxiv.org/abs/1810.04805
  14. Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150. https://arxiv.org/abs/1911.02150
  15. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." *EMNLP 2023*. arXiv:2305.13245. https://arxiv.org/abs/2305.13245
  16. DeepSeek-AI. (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434. https://arxiv.org/abs/2405.04434
  17. Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., and Wei, F. (2024). "Differential Transformer." arXiv:2410.05258 (ICLR 2025 Oral). https://arxiv.org/abs/2410.05258
  18. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." *NeurIPS 2022*. arXiv:2205.14135. https://arxiv.org/abs/2205.14135
  19. Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv:2307.08691. https://arxiv.org/abs/2307.08691
  20. Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. (2024). "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision." *NeurIPS 2024*. arXiv:2407.08608. https://arxiv.org/abs/2407.08608
  21. Zadouri, T., Shah, J., Hohnerbach, M., Liu, T., Thakkar, V., and Dao, T. (2026). "FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling." *MLSys 2026*. https://tridao.me/blog/2025/flash4/
  22. Beltagy, I., Peters, M. E., and Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150. https://arxiv.org/abs/2004.05150
  23. Zaheer, M., Guruganesh, G., Dubey, K. A., et al. (2020). "Big Bird: Transformers for Longer Sequences." *NeurIPS 2020*. arXiv:2007.14062. https://arxiv.org/abs/2007.14062
  24. Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). "Mistral 7B." arXiv:2310.06825. https://arxiv.org/abs/2310.06825
  25. Google DeepMind. (2025). "Gemma 3 Technical Report." arXiv:2503.19786. https://arxiv.org/abs/2503.19786
  26. Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." *ICML 2020*. arXiv:2006.16236. https://arxiv.org/abs/2006.16236
  27. Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). "Linformer: Self-Attention with Linear Complexity." arXiv:2006.04768. https://arxiv.org/abs/2006.04768
  28. Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2020). "Rethinking Attention with Performers." *ICLR 2021*. arXiv:2009.14794. https://arxiv.org/abs/2009.14794
  29. Kitaev, N., Kaiser, L., and Levskaya, A. (2020). "Reformer: The Efficient Transformer." *ICLR 2020*. arXiv:2001.04451. https://arxiv.org/abs/2001.04451
  30. Liu, H., Zaharia, M., and Abbeel, P. (2023). "Ring Attention with Blockwise Transformers for Near-Infinite Context." *ICLR 2024*. arXiv:2310.01889. https://arxiv.org/abs/2310.01889
  31. DeepSeek-AI. (2025). "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention." arXiv:2502.11089 (ACL 2025 Best Paper). https://arxiv.org/abs/2502.11089
  32. Su, J., Lu, Y., Pan, S., Wen, B., and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. https://arxiv.org/abs/2104.09864
  33. Press, O., Smith, N. A., and Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." *ICLR 2022*. arXiv:2108.12409. https://arxiv.org/abs/2108.12409
  34. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *SOSP 2023*. arXiv:2309.06180. https://arxiv.org/abs/2309.06180
  35. Micikevicius, P., Narang, S., Alben, J., et al. (2018). "Mixed Precision Training." *ICLR 2018*. arXiv:1710.03740. https://arxiv.org/abs/1710.03740
  36. Tillet, P., Kung, H. T., and Cox, D. (2019). "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." *MAPL 2019*. https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
  37. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172. https://arxiv.org/abs/2307.03172
  38. Jain, S., and Wallace, B. C. (2019). "Attention is not Explanation." *NAACL 2019*. arXiv:1902.10186. https://arxiv.org/abs/1902.10186
  39. Wiegreffe, S., and Pinter, Y. (2019). "Attention is not not Explanation." *EMNLP 2019*. arXiv:1908.04626. https://arxiv.org/abs/1908.04626
  40. Gu, A., and Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. https://arxiv.org/abs/2312.00752
  41. Peng, B., Alcaide, E., Anthony, Q., et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." *EMNLP 2023 Findings*. arXiv:2305.13048. https://arxiv.org/abs/2305.13048
  42. Poli, M., Massaroli, S., Nguyen, E., et al. (2023). "Hyena Hierarchy: Towards Larger Convolutional Language Models." *ICML 2023*. arXiv:2302.10866. https://arxiv.org/abs/2302.10866
  43. Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." arXiv:2307.08621. https://arxiv.org/abs/2307.08621
  44. Yang, S., Wang, B., Shen, Y., Panda, R., and Kim, Y. (2024). "Gated Linear Attention Transformers with Hardware-Efficient Training." *ICML 2024*. arXiv:2312.06635. https://arxiv.org/abs/2312.06635
  45. Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Zou, J., Rudra, A., and Re, C. (2024). "Zoology: Measuring and Improving Recall in Efficient Language Models." *ICLR 2024*. arXiv:2312.04927. https://arxiv.org/abs/2312.04927
  46. Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." *ICLR 2021*. arXiv:2010.11929. https://arxiv.org/abs/2010.11929
  47. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." *CVPR 2022*. arXiv:2112.10752. https://arxiv.org/abs/2112.10752
  48. Jumper, J., Evans, R., Pritzel, A., et al. (2021). "Highly Accurate Protein Structure Prediction with AlphaFold." *Nature*, 596, 583-589. https://doi.org/10.1038/s41586-021-03819-3
  49. Chen, L., Lu, K., Rajeswaran, A., et al. (2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling." *NeurIPS 2021*. arXiv:2106.01345. https://arxiv.org/abs/2106.01345
  50. Pearson, H., and Ledford, H. (2025). "Exclusive: the most-cited papers of the twenty-first century." *Nature*, 640, 588-592, ranking "Attention Is All You Need" seventh across five major citation databases. https://www.nature.com/articles/d41586-025-01125-y
