# Attention

> Source: https://aiwiki.ai/wiki/attention
> Updated: 2026-06-20
> Categories: Deep Learning, Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

> This article gives a high-level overview of attention as a family of mechanisms in machine learning. For deeper treatments, see the dedicated pages on [self attention](/wiki/self_attention), [multi-head self-attention](/wiki/multi-head_self-attention), [cross attention](/wiki/cross_attention), [bahdanau attention](/wiki/bahdanau_attention), [attention is all you need](/wiki/attention_is_all_you_need), [flash attention](/wiki/flash_attention), [grouped query attention](/wiki/grouped_query_attention), [multi-head latent attention](/wiki/multi-head_latent_attention), and [paged attention](/wiki/paged_attention).

## What is the attention mechanism?

The **attention mechanism** is a neural-network operation that lets a model focus on the most relevant parts of its input by computing a weighted sum of values, where the weight on each value is set by how well its key matches a query.[^3] First introduced for [neural machine translation](/wiki/machine_translation) in 2014 and generalized in the 2017 [transformer](/wiki/transformer) paper "[Attention Is All You Need](/wiki/attention_is_all_you_need)," attention is the core building block of nearly every modern AI system, from [large language models](/wiki/large_language_model) to image generators and protein-folding models.[^3][^4] The 2017 paper that established it has been cited more than 173,000 times and was ranked by Nature in 2025 as the seventh most-cited research paper of the 21st century.[^12][^50]

**Attention** is a family of techniques in [machine learning](/wiki/machine_learning) that allow a model to focus on specific parts of an input while making predictions.[^1] Rather than compressing an entire input into a single fixed-size representation, attention mechanisms let the model dynamically weigh the importance of different input elements at each step of computation. This selective focus mirrors, in a loose sense, how biological attention works: irrelevant information is suppressed while relevant information is amplified.[^2]

Mathematically, attention can be viewed as **weighted aggregation**: given a query that describes what is being looked for and a set of (key, value) pairs that describe what is available, the mechanism produces a weighted sum of the values, with weights derived from how well each key matches the query.[^3] This simple operation, a soft and differentiable lookup, has proven extraordinarily expressive when stacked into deep networks.

Originally developed in 2014 for [neural machine translation](/wiki/machine_translation), attention has become the foundational building block of modern deep learning.[^4] It is the core operation inside the [transformer](/wiki/transformer) architecture introduced in 2017, which underpins [large language models](/wiki/large_language_model) such as the [gpt](/wiki/gpt) series, [llama](/wiki/llama), [claude](/wiki/claude), and [deepseek](/wiki/deepseek) models, as well as Vision Transformers, [diffusion models](/wiki/diffusion_model) for image and video generation, and protein-structure systems such as [alphafold](/wiki/alphafold). Understanding attention is essential for understanding modern [artificial intelligence](/wiki/artificial_intelligence).

## Cognitive science origins

Although the term "attention" in machine learning is a metaphor rather than a literal model of brain function, the concept is deeply rooted in **cognitive psychology** and **neuroscience**. Selective attention, the ability to focus mental resources on a subset of available stimuli, has been studied empirically since at least the 1950s. The classic "cocktail party effect" described by Colin Cherry (1953) demonstrated that listeners can selectively attend to one conversation in a crowded room while suppressing others, providing one of the earliest experimental frameworks for attention research.[^5]

A particularly influential theoretical framework was **feature integration theory**, introduced by Anne Treisman and Garry Gelade in their 1980 paper "A Feature-Integration Theory of Attention" in *Cognitive Psychology*.[^6] Treisman and Gelade proposed that visual processing proceeds in two stages: a parallel, pre-attentive stage in which simple features (color, orientation, motion) are detected automatically across the visual field, followed by an attentive stage in which focal attention "binds" these features into coherent object representations. Without attention, the theory predicts that features can become incorrectly conjoined, producing **illusory conjunctions**, for example, perceiving a red O and a green X as a green O and a red X. Their experimental findings supported the role of focused attention as a binding mechanism for object perception.

In neuroscience, attention is associated with networks involving the **prefrontal cortex** and the **parietal cortex**, particularly the dorsal attention network and the ventral attention network identified by Maurizio Corbetta and Gordon Shulman (2002).[^7] **Top-down attention** is directed by goals and expectations, while **bottom-up attention** is captured by salient stimuli. Computational models of biological attention, such as the saliency maps of Itti, Koch, and Niebur (1998)[^8], predate machine learning attention and inspired some early work on visual attention in deep networks.

It is important to note that machine-learning attention does not closely model these biological mechanisms. The shared name is largely metaphorical: in both cases, a limited resource is allocated selectively over inputs, but the underlying mathematics and biology differ substantially. Still, the cognitive-science framing has motivated several design choices, including the idea that attention should be **soft** (continuous, differentiable) rather than **hard** (a one-of-many selection that would not admit gradient-based learning).[^9]

## History in neural networks

### Early sequence-to-sequence models

Before attention was introduced, [sequence-to-sequence](/wiki/sequence-to-sequence_task) (seq2seq) models for tasks like machine translation relied on an encoder-decoder framework built from [recurrent neural networks](/wiki/recurrent_neural_network) (RNNs). The encoder processed the source sentence token by token and compressed it into a single fixed-length context vector, which the decoder then used to generate the target sentence. Sutskever, Vinyals, and Le (2014) demonstrated that this approach could achieve strong results with [LSTM](/wiki/long_short-term_memory_lstm) networks,[^10] but the fixed-length bottleneck caused performance to degrade on longer sentences because a single vector could not adequately capture all the information in a long input sequence.

### When was attention invented? Bahdanau attention (2014, additive)

The first widely recognized attention mechanism for neural networks was proposed by Dzmitry Bahdanau, KyungHyun Cho, and [Yoshua Bengio](/wiki/yoshua_bengio) in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate," initially posted to arXiv on September 1, 2014 (arXiv:1409.0473) and published at [ICLR](/wiki/iclr) 2015.[^4] See also [bahdanau](/wiki/bahdanau) and [bahdanau attention](/wiki/bahdanau_attention) for the dedicated page. The key insight was to replace the fixed-length context vector with a dynamic one: instead of forcing the encoder to compress the entire source sentence into a single vector, the decoder could look back at all encoder hidden states and select the most relevant ones at each generation step. The authors diagnosed the core problem directly, writing that "the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture," and proposed instead "allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word."[^4]

Bahdanau attention works as follows. For each decoder time step t, the mechanism computes an alignment score e_{t,j} between the previous decoder hidden state s_{t-1} and each encoder hidden state h_j using a learned feedforward network:

```
e_{t,j} = v^T tanh(W_s s_{t-1} + W_h h_j)
```

These scores are normalized through a [softmax](/wiki/softmax) function to produce attention weights alpha_{t,j}. The context vector c_t is then the weighted sum of encoder hidden states:

```
c_t = sum_j alpha_{t,j} h_j
```

Because the alignment scores are computed using an additive combination passed through a neural network, this variant is often called **additive attention**.[^4] The approach yielded translation quality comparable to the state-of-the-art phrase-based system on English-to-French translation, and crucially, it did not suffer the same degradation on long sentences that earlier encoder-decoder models exhibited. Bahdanau et al. also showed qualitatively that the alignment weights recovered linguistically reasonable word alignments, anticipating the use of attention as an interpretability tool.
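The two equations above can be sketched for a single decoder step in NumPy. This is a minimal illustration rather than the authors' implementation: the function name `additive_attention`, the attention dimension `d_a`, and the random parameters are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """One decoder step of additive attention (hypothetical helper).

    s_prev: (d_s,)    previous decoder state s_{t-1}
    H:      (n, d_h)  encoder hidden states h_1..h_n, one per row
    W_s:    (d_a, d_s), W_h: (d_a, d_h), v: (d_a,)  learned parameters
    """
    # e_{t,j} = v^T tanh(W_s s_{t-1} + W_h h_j), computed for all j at once
    scores = np.tanh(W_s @ s_prev + H @ W_h.T) @ v   # shape (n,)
    alpha = softmax(scores)                           # attention weights alpha_{t,j}
    context = alpha @ H                               # c_t = sum_j alpha_{t,j} h_j
    return context, alpha

# Illustrative sizes and random parameters (assumptions, not from the paper)
rng = np.random.default_rng(0)
n, d_s, d_h, d_a = 5, 4, 6, 8
context, alpha = additive_attention(
    rng.normal(size=d_s),
    rng.normal(size=(n, d_h)),
    rng.normal(size=(d_a, d_s)),
    rng.normal(size=(d_a, d_h)),
    rng.normal(size=d_a),
)
```

The weights `alpha` form a probability distribution over the `n` source positions, and `context` has the same dimensionality as one encoder state.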

### Luong attention (2015, multiplicative variants)

In 2015, Minh-Thang Luong, Hieu Pham, and Christopher Manning published "Effective Approaches to Attention-based Neural Machine Translation" (arXiv:1508.04025, EMNLP 2015), which introduced several refinements and alternatives to Bahdanau attention.[^11] Luong et al. proposed two broad classes of attention:

- **Global attention**, which attends to all source positions (similar to Bahdanau attention but architecturally simpler).
- **Local attention**, which attends only to a small window of source positions around an aligned position p_t, reducing computational cost.

Luong attention also introduced multiple scoring functions for computing alignment:

| Scoring function | Formula | Notes |
|---|---|---|
| Dot product | score(s_t, h_j) = s_t^T h_j | Simplest; no extra parameters |
| General | score(s_t, h_j) = s_t^T W_a h_j | Learned weight matrix W_a |
| Concat (additive) | score(s_t, h_j) = v^T tanh(W [s_t; h_j]) | Similar to Bahdanau |

A key implementation difference is that Luong attention uses the current decoder hidden state s_t to compute alignment scores, whereas Bahdanau attention uses the previous state s_{t-1}. Because the dot product and general scoring functions rely on matrix multiplication rather than a feedforward network, Luong attention is sometimes called **multiplicative attention** and tends to be computationally faster.[^11]
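The three scoring functions in the table can be written side by side. The sketch below assumes the decoder state and encoder states share a dimension `d` (required for the dot product) and uses hypothetical function names and random parameters.

```python
import numpy as np

def score_dot(s_t, H):
    """Dot product: s_t^T h_j for every encoder state h_j (rows of H)."""
    return H @ s_t

def score_general(s_t, H, W_a):
    """General: s_t^T W_a h_j, with a learned (d, d) matrix W_a."""
    return H @ (W_a.T @ s_t)

def score_concat(s_t, H, W, v):
    """Concat: v^T tanh(W [s_t; h_j]), with W of shape (d_a, 2d)."""
    n = H.shape[0]
    stacked = np.concatenate([np.broadcast_to(s_t, (n, s_t.size)), H], axis=1)
    return np.tanh(stacked @ W.T) @ v

# Illustrative sizes (assumptions)
rng = np.random.default_rng(0)
n, d, d_a = 5, 4, 8
s_t, H = rng.normal(size=d), rng.normal(size=(n, d))
W_a = rng.normal(size=(d, d))

dot = score_dot(s_t, H)
general = score_general(s_t, H, W_a)
general_identity = score_general(s_t, H, np.eye(d))  # W_a = I recovers the dot product
concat = score_concat(s_t, H, rng.normal(size=(d_a, 2 * d)), rng.normal(size=d_a))
```

The dot and general variants reduce to matrix products with no nonlinearity, which is the reason they are usually cheaper than the concat form.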

### Vaswani et al. 2017: Transformer and self-attention

The attention mechanism reached its most influential form in the 2017 paper "[attention is all you need](/wiki/attention_is_all_you_need)" by Ashish Vaswani, Noam Shazeer, Niki Parmar, [Jakob Uszkoreit](/wiki/jakob_uszkoreit), Llion Jones, Aidan Gomez, Lukasz Kaiser, and [Illia Polosukhin](/wiki/illia_polosukhin) (arXiv:1706.03762, submitted June 12, 2017, NeurIPS 2017).[^3] See also [vaswani](/wiki/vaswani) for the lead author. The paper introduced the [transformer](/wiki/transformer) architecture, which dispenses with recurrence and convolutions entirely and relies solely on attention mechanisms. In the abstract, the authors state: "We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely."[^3] As of 2025, the paper had been cited more than 173,000 times, and a Nature analysis spanning five major citation databases ranked it the seventh most-cited research paper of the 21st century, making it one of the most cited papers in the history of computer science.[^12][^50]

The Transformer achieved 28.4 [BLEU](/wiki/bleu_bilingual_evaluation_understudy) on the WMT 2014 English-to-German translation task, improving over the previous best (including ensembles) by more than 2 BLEU points. On English-to-French, it set a new single-model state-of-the-art of 41.8 BLEU after training for only 3.5 days on eight P100 GPUs, a small fraction of the training cost of the best models in the literature.[^3]

Critically, Vaswani et al. demonstrated that **self-attention layers**, when stacked, can replace recurrence and convolution as the primary mechanism for sequence modeling. This unlocked unprecedented parallelism during training (because all positions can be processed simultaneously, unlike RNNs) and led to the cascade of model-scale advances that followed, culminating in modern large language models.

## Mathematical formulation

### Queries, keys, and values

The Transformer formalized attention using the vocabulary of **queries**, **keys**, and **values** (Q, K, V).[^3] The analogy is drawn from information retrieval: a query represents what the model is looking for, keys describe the items available to attend to, and values hold the content that will be retrieved. In self-attention, all three are derived from the same input sequence through learned linear projections:

```
Q = X W_Q,    K = X W_K,    V = X W_V
```

where X is the input matrix (each row is a token embedding) and W_Q, W_K, W_V are learned weight matrices. In cross-attention, Q is derived from one sequence and K, V from another.

### Scaled dot-product attention

The core computation in the Transformer is **scaled dot-product attention**:[^3]

```
Attention(Q, K, V) = softmax( Q K^T / sqrt(d_k) ) V
```

where d_k is the dimensionality of the key vectors. The formula works in three stages:

1. **Compute similarity scores.** The dot product Q K^T produces a matrix of raw attention scores, where each entry measures the similarity between a query and a key.
2. **Scale.** The scores are divided by sqrt(d_k).
3. **Softmax and aggregate.** The scaled scores pass through a softmax function to produce a probability distribution (the attention weights), which is then used to take a weighted sum of the value vectors.
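The three stages above map directly onto a few lines of NumPy. The following is a minimal single-head sketch without batching or masking; the sizes and random projection matrices are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T                      # 1. similarity scores, shape (n_q, n_k)
    scores = scores / np.sqrt(d_k)        # 2. scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)    # 3a. each row becomes a distribution
    return weights @ V, weights           # 3b. weighted sum of the values

# Self-attention: Q, K, V are all projections of the same input X
rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8                 # illustrative sizes
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
```

Each row of `weights` sums to 1, and row `i` of `out` is the corresponding mixture of value vectors for token `i`.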

**Why scale by sqrt(d_k)?** Vaswani et al. explicitly motivate the scaling factor in Section 3.2.1 of the 2017 paper.[^3] When d_k is large, dot products between queries and keys tend to grow in magnitude. If the individual elements of Q and K are independent random variables with mean 0 and variance 1, then their dot product has mean 0 and variance d_k. Large-magnitude dot products push the softmax into regions where it has extremely small gradients, slowing or stalling learning. Dividing by sqrt(d_k) normalizes the variance of the dot products back to 1, keeping the softmax in a region with healthier gradients. This scaling is a load-bearing detail of the original Transformer formulation.
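The variance argument can be checked empirically. The sketch below draws query and key components as independent standard normals, which is one distribution satisfying the mean-0, variance-1 assumption; the sample count and `d_k = 256` are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 256
num_samples = 10_000

# Independent components with mean 0 and variance 1
q = rng.normal(size=(num_samples, d_k))
k = rng.normal(size=(num_samples, d_k))

raw = np.sum(q * k, axis=1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)       # scaled as in the Transformer

raw_var = raw.var()               # expected to be close to d_k
scaled_var = scaled.var()         # expected to be close to 1
```

With these settings `raw_var` comes out near `d_k` and `scaled_var` near 1, up to sampling noise, matching the claim that the scaling restores unit variance.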

### Multi-head attention

Rather than performing a single attention computation with full-dimensional queries, keys, and values, the Transformer uses **[multi head attention](/wiki/multi_head_attention)** (often written multi-head attention or MHA).[^3] The idea is to run h attention "heads" in parallel, each operating on a different learned linear projection of Q, K, and V into lower-dimensional subspaces:

```
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
```

Each head can learn to attend to different types of relationships. For instance, one head might focus on syntactic dependencies (e.g., subject-verb agreement) while another captures semantic similarity. In the original Transformer (d_model = 512, h = 8), each head operates on d_k = d_v = 64 dimensional projections. The concatenated outputs are projected back to d_model dimensions through a final weight matrix W_O.

Multi-head attention adds essentially no computational overhead compared to single-head attention with the full dimensionality, because the per-head dimensionality is reduced proportionally. The benefit is representational: multiple heads allow the model to jointly attend to information from different representation subspaces at different positions.
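A compact way to sketch multi-head self-attention is to stack the per-head projections W_i^Q, W_i^K, W_i^V column-wise into single d_model x d_model matrices and split the result into heads. That stacking is an implementation choice assumed here, not something the formulas above prescribe; the sizes follow the original Transformer (d_model = 512, h = 8), while the sequence length and random weights are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Self-attention with h heads; all weight matrices are (d_model, d_model)."""
    n, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):
        # (n, d_model) -> (h, n, d_k); head i takes columns i*d_k:(i+1)*d_k
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)        # (h, n, n)
    heads = softmax(scores, axis=-1) @ V                    # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # Concat(head_1..head_h)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d_model, h = 6, 512, 8
X = rng.normal(size=(n, d_model))
# Scaled random weights stand in for learned parameters
W_Q, W_K, W_V, W_O = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
```

Because each head works on `d_model // h = 64` dimensions, the total projection cost matches a single 512-dimensional head, which is the "no extra overhead" point made above.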

## Types of attention

### How does self-attention differ from cross-attention?

A central distinction concerns where the queries, keys, and values come from. See [self attention](/wiki/self_attention) for the dedicated article.

- **[self attention](/wiki/self_attention)** (also called intra-attention) applies the attention mechanism within a single sequence. Each token generates a query, a key, and a value; every token's query is compared against the key of every token in the sequence (including its own), and the resulting attention weights determine how much each token's value contributes to the output representation at that position.[^3] Self-attention is what powers the encoder and decoder stacks of modern Transformers.
- **[cross attention](/wiki/cross_attention)** (also called encoder-decoder attention) is a variant where the queries come from one sequence and the keys and values come from a different sequence. In the original Transformer, the decoder generates queries from its own hidden states, while the keys and values come from the encoder's output, allowing each position in the decoder to attend to all positions in the encoder.[^3] Cross-attention is the standard mechanism for bridging modalities (e.g., text-to-image diffusion models, text-conditioned speech synthesis) and for retrieval-augmented systems.

In short, self-attention relates a sequence to itself, while cross-attention relates one sequence (the queries) to another (the keys and values). Self-attention has a critical advantage over recurrent layers: it connects every pair of positions in a sequence with a constant number of operations (O(1) path length), whereas an RNN requires O(n) sequential steps to propagate information from one end of the sequence to the other.[^3] This short path makes self-attention far more effective at capturing long-range dependencies, and the absence of sequential steps makes it more parallelizable during training.

### Masked / causal attention (autoregressive)

For autoregressive language modeling, predicting the next token given the previous tokens, the model must not be allowed to "see the future." This is enforced by **masked self-attention**, also called **causal attention**.[^3] In implementation, the attention score matrix is augmented with a triangular mask: entries above the main diagonal (corresponding to attending to future positions) are set to negative infinity before the softmax, so they receive zero attention weight.

```
Mask_{i,j} = 0      if j <= i
           = -inf   if j  > i
```

This simple mask is what makes models like [gpt](/wiki/gpt) and [llama](/wiki/llama) autoregressive: at training time the model sees the entire sequence at once but is structurally prevented from leaking information backward from later positions. At inference time, tokens are generated one at a time, and each new token attends over all preceding tokens via the [kv cache](/wiki/kv_cache) (see below).
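The mask can be built and applied in a few lines. In this sketch the queries, keys, and values are all set to the raw input `X` to keep the example short, which is a simplification; a real layer would apply learned projections first.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; -inf entries receive exactly zero weight."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def causal_mask(n):
    """Mask[i, j] = 0 if j <= i, -inf if j > i (a future position)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.where(j <= i, 0.0, -np.inf)

def causal_self_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(n)   # add mask before softmax
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                      # illustrative input
out, weights = causal_self_attention(X, X, X)

# Changing the last token must not affect any earlier output position
X_changed = X.copy()
X_changed[-1] += 1.0
out_changed, _ = causal_self_attention(X_changed, X_changed, X_changed)
```

The first row of `weights` puts all its mass on position 0, and the outputs for positions 0 to 3 are identical in `out` and `out_changed`, which is the no-peeking property the mask enforces.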

### Bidirectional attention (BERT)

Encoder-only models such as [bert](/wiki/bert) (Devlin et al., 2018) use **bidirectional attention**: every token attends to every other token in the sequence, with no causal mask.[^13] To train such models without trivial copying, BERT replaces selected tokens with a special [MASK] symbol and trains the model to predict the original token from context: the **masked language modeling** objective. Bidirectional attention is well-suited to representation learning and discriminative tasks (classification, named-entity recognition, span extraction) but is not directly suited to open-ended text generation, which is why the [gpt](/wiki/gpt) family (causal) is the dominant architecture for generative language models.

The encoder layers of encoder-decoder Transformers (e.g., the original Transformer for translation, [t5](/wiki/t5)) also use bidirectional self-attention; the decoder uses causal self-attention together with cross-attention back to the encoder.

## Variants

### Multi-head attention (MHA, 2017)

Standard multi-head attention as introduced by Vaswani et al. (2017) gives each of the h heads its own query, key, and value projections.[^3] The KV cache during autoregressive inference therefore stores h sets of key and value vectors per token. This is memory-intensive at large model scale; the variants below trade off some quality for substantial KV-cache savings.

### Multi-query attention (MQA, Shazeer 2019)

Multi-Query Attention was proposed by Noam Shazeer in his 2019 paper "Fast Transformer Decoding: One Write-Head Is All You Need" (arXiv:1911.02150).[^14] See also [multi query attention](/wiki/multi_query_attention). The central observation is that during autoregressive inference, the primary performance bottleneck on modern accelerators is the memory bandwidth required to load the key-value cache, not the arithmetic computation itself.

MQA addresses this by having all query heads **share a single set of key and value projections**. Instead of h independent key and value heads (as in MHA), there is just one key head and one value head. Each query head still has its own projection, so the model retains h different query perspectives, but the KV cache is reduced by a factor of h. In practice, MQA speeds up inference decoding substantially with only a small quality degradation. It was adopted in [palm](/wiki/palm) (Google, 2022) and [falcon](/wiki/falcon) (TII, 2023).

### Grouped-query attention (GQA, Ainslie 2023)

[grouped query attention](/wiki/grouped_query_attention) (GQA), introduced by Ainslie et al. (2023, arXiv:2305.13245, EMNLP 2023), is a compromise between MHA and MQA.[^15] Instead of sharing a single KV head across all query heads (MQA) or having unique KV heads for every query head (MHA), GQA divides the query heads into **g groups**, where each group shares one set of key and value projections.

| Variant | Query heads | KV heads | KV cache size (elements per layer, n = cached tokens) |
|---|---|---|---|
| Multi-Head Attention (MHA) | h | h | h * d_k * 2 * n |
| Grouped-Query Attention (GQA) | h | g | g * d_k * 2 * n |
| Multi-Query Attention (MQA) | h | 1 | 1 * d_k * 2 * n |

**GQA generalizes both extremes**: when g = h, GQA reduces to MHA; when g = 1, GQA reduces to MQA. By choosing an intermediate g, GQA achieves most of the inference speed benefits of MQA while maintaining quality closer to MHA. Meta adopted GQA with 8 KV groups in [Llama 2](/wiki/llama_2) 70B (2023), and it has since become the default attention variant in most production large language models, including [llama](/wiki/llama) 3, Mistral, and many others.[^15]
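The sharing pattern and the cache sizes in the table can be sketched as follows. Repeating each KV head for its group of query heads is one straightforward way to express GQA and is assumed here for clarity; the head counts and sizes are illustrative, and `kv_cache_elements` is a hypothetical helper that mirrors the table's formula for one layer.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (h, n, d_k); K, V: (g, n, d_k), where g divides h.

    g == h is ordinary MHA; g == 1 is MQA.
    """
    h, g = Q.shape[0], K.shape[0]
    if h % g != 0:
        raise ValueError("number of query heads must be a multiple of KV heads")
    # Each KV head serves h // g consecutive query heads
    K = np.repeat(K, h // g, axis=0)
    V = np.repeat(V, h // g, axis=0)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def kv_cache_elements(kv_heads, d_k, n):
    """Cached elements per layer: kv_heads * d_k * 2 (keys and values) * n tokens."""
    return kv_heads * d_k * 2 * n

rng = np.random.default_rng(0)
h, g, n, d_k = 8, 2, 6, 16                      # illustrative sizes
Q = rng.normal(size=(h, n, d_k))
K = rng.normal(size=(g, n, d_k))
V = rng.normal(size=(g, n, d_k))
out = grouped_query_attention(Q, K, V)          # shape (h, n, d_k)

saving = kv_cache_elements(h, d_k, n) // kv_cache_elements(g, d_k, n)  # h // g
```

With 8 query heads and 2 KV heads, the cache shrinks by a factor of `h // g = 4` while every query head still produces its own output.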

### Multi-head Latent Attention (MLA, DeepSeek 2024)

[mla](/wiki/mla) (Multi-head Latent Attention, also covered at [multi-head latent attention](/wiki/multi-head_latent_attention)) was introduced in [DeepSeek-V2](/wiki/deepseek) (arXiv:2405.04434, May 2024).[^16] MLA takes a fundamentally different approach to KV cache reduction: rather than reducing the number of KV heads (as in MQA and GQA), MLA **compresses** the key and value representations into a low-dimensional latent vector before caching. At inference time, the compressed representation is projected back to produce unique keys and values for each head.

Given an input token x_n, MLA first compresses it into a latent representation:

```
c^{KV}_n = W^{DKV} x_n
```

where W^{DKV} is a down-projection matrix mapping the model dimension d to a much smaller latent dimension d_c. This compact vector is stored in the KV cache instead of the full key and value vectors. When attention is computed, separate up-projection matrices W^{UK} and W^{UV} reconstruct unique keys and values for each head.

A key challenge is compatibility with [rotary position embedding](/wiki/rotary_position_embedding) (RoPE). Standard RoPE entangles positional information with content, which would prevent the "absorption trick" that lets MLA fold the up-projection matrices into the query projection and avoid actually decompressing the KV cache during inference. DeepSeek solved this with **decoupled RoPE**: separate query and key vectors are introduced specifically for positional encoding, keeping the main latent keys isolated from rotation matrices.[^16]

In DeepSeek-V2, MLA cut the KV cache by 93.3% and raised maximum generation throughput to 5.76 times, both measured against the earlier DeepSeek 67B model, while matching (and sometimes exceeding) the quality of standard MHA. DeepSeek-V3 uses d_h = 128, H = 128, and d_c = 512, so the cached latent is 32 times smaller than the H x d_h = 16,384 dimensions of one layer's full per-token keys. MLA was used in [DeepSeek-V3](/wiki/deepseek_3_0) and [DeepSeek-R1](/wiki/deepseek_r1); subsequent research (TransMLA, 2025) has explored enabling MLA in any Transformer-based LLM.
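The absorption trick mentioned above rests on associativity of matrix multiplication: scoring a query against up-projected keys gives the same result as scoring an up-projection-absorbed query against the cached latents. The sketch below shows this for one head's key path only. It ignores RoPE, the value path, and the softmax, and all sizes are toy values rather than DeepSeek's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_c, d_h, n = 64, 8, 16, 5           # toy sizes (assumptions)

W_DKV = rng.normal(size=(d_c, d))       # down-projection to the latent
W_UK = rng.normal(size=(d_h, d_c))      # key up-projection for one head
X = rng.normal(size=(n, d))             # n previously seen tokens
q = rng.normal(size=d_h)                # this head's query for the current token

# Only the latent vectors c^{KV}_n = W^{DKV} x_n are stored in the cache
C = X @ W_DKV.T                         # shape (n, d_c)

# Naive path: decompress every cached key, then score
K = C @ W_UK.T                          # shape (n, d_h)
scores_naive = K @ q

# Absorbed path: fold W^{UK} into the query and score against the latents
q_absorbed = W_UK.T @ q                 # shape (d_c,)
scores_absorbed = C @ q_absorbed
```

Both paths yield the same scores, but the absorbed path never materializes the full keys, so per-token work scales with `d_c` instead of `d_h` times the number of heads.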

### Differential attention (2024)

Differential attention, introduced by Ye et al. at Microsoft Research and Tsinghua University in their 2024 paper "[Differential Transformer](/wiki/differential_transformer)" (arXiv:2410.05258, ICLR 2025 Oral), rethinks how attention scores are computed.[^17] The mechanism partitions the query and key projections into two groups and computes two independent softmax distributions:

```
DiffAttn(X) = ( softmax(Q_1 K_1^T / sqrt(d)) - lambda * softmax(Q_2 K_2^T / sqrt(d)) ) V
```

The subtraction acts as **noise cancellation**: many tokens in standard attention receive small but non-negligible weight, diluting the signal. Differential attention subtracts these common noise patterns, causing attention to concentrate on genuinely relevant tokens. Experiments across model sizes from 830M to 13.1B parameters showed consistent improvements: a 6.8B Diff Transformer matched the validation loss of an 11B standard Transformer.[^17]
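The formula can be sketched directly. Here `lam` is passed in as a fixed scalar and the two query/key groups are supplied as separate arrays; both are simplifying assumptions for illustration, and the sketch omits any per-head normalization or multi-head structure.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Difference of two softmax attention maps applied to V."""
    d = Q1.shape[-1]
    a1 = softmax(Q1 @ K1.T / np.sqrt(d))
    a2 = softmax(Q2 @ K2.T / np.sqrt(d))
    weights = a1 - lam * a2              # may contain negative entries
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_v = 6, 8, 8                      # illustrative sizes
Q1, K1, Q2, K2 = (rng.normal(size=(n, d)) for _ in range(4))
V = rng.normal(size=(n, d_v))
lam = 0.5                                # hypothetical fixed value

out, weights = diff_attention(Q1, K1, Q2, K2, V, lam)
```

Unlike ordinary attention weights, each row of `weights` sums to `1 - lam` and can be negative for some positions; setting `lam = 0` recovers standard softmax attention on the first group.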

## Efficiency improvements

### Flash Attention v1/v2/v3 (Dao 2022-2024)

[flash attention](/wiki/flash_attention), introduced by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré in their 2022 paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (arXiv:2205.14135), addresses the quadratic memory cost of attention by rethinking how the computation interacts with GPU hardware, without approximating the math.[^18]

The key insight is that standard attention implementations are bottlenecked not by arithmetic operations but by memory transfers between GPU high-bandwidth memory (HBM) and the on-chip SRAM. Standard implementations materialize the full n x n attention matrix in HBM, which requires O(n^2) memory reads and writes. FlashAttention avoids this by **tiling**: it splits Q, K, and V into blocks, loads each block from HBM into SRAM, computes the attention for that block in fast on-chip memory, and writes only the final output back to HBM. A carefully designed online softmax normalization algorithm allows blocks to be processed incrementally without ever needing the full attention matrix in memory.

The result is **exact** attention (not an approximation) that uses O(n) memory instead of O(n^2) and achieves wall-clock speedups of 2 to 4 times over standard implementations.[^18]
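The online softmax idea can be illustrated on the CPU. The sketch below walks over key/value blocks while keeping a running row maximum and normalizer, and it reproduces standard attention without ever forming the n x n matrix. It is only a NumPy illustration of the arithmetic: the actual kernels also tile the queries and run inside GPU SRAM, and the block size here is arbitrary.

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Exact attention computed one key/value block at a time."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[1]))          # unnormalized output accumulator
    row_max = np.full(n, -np.inf)            # running max of scores per query
    row_sum = np.zeros(n)                    # running softmax denominator

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)                    # scores for this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        correction = np.exp(row_max - new_max)       # rescale earlier blocks
        P = np.exp(S - new_max[:, None])
        row_sum = row_sum * correction + P.sum(axis=1)
        out = out * correction[:, None] + P @ Vb
        row_max = new_max

    return out / row_sum[:, None]

def standard_attention(Q, K, V):
    """Reference that materializes the full attention matrix."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(10, 8)) for _ in range(3))   # illustrative sizes
tiled = tiled_attention(Q, K, V, block=4)
reference = standard_attention(Q, K, V)
```

`tiled` and `reference` agree to floating-point precision, which is the sense in which the tiled computation is exact and not an approximation.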

- **FlashAttention-2** (Tri Dao, 2023, arXiv:2307.08691) reduced non-matmul FLOPs by restructuring the algorithm to spend a higher fraction of time on matrix multiplications, improved parallelism across thread blocks, and refined warp partitioning. These changes yielded roughly a 2x speedup over FlashAttention v1, reaching 50 to 73% of theoretical maximum FLOPs/s on NVIDIA A100 GPUs.[^19]
- **FlashAttention-3** (Tri Dao and Jay Shah, 2024, NeurIPS 2024) targets NVIDIA Hopper GPUs (H100). It exploits asynchronous execution of Tensor Cores and the Tensor Memory Accelerator via warp specialization, interleaved block-wise matmul and softmax operations, and FP8 low-precision computation with block quantization. It achieves up to 840 TFLOPs/s in BF16 on H100 (about 85% utilization), roughly 1.5 to 2x faster than FlashAttention-2.[^20] See [flash attention 3](/wiki/flash_attention_3) for the dedicated v3 article.
- **FlashAttention-4** (Zadouri, Shah, Hohnerbach, Liu, Thakkar, Dao, 2026, MLSys 2026) extends the line to NVIDIA Blackwell GPUs (B200). It introduces ping-pong scheduling, software exponential emulation using polynomial approximation on FMA units, and conditional online softmax rescaling. Written in CuTe-DSL, it reaches up to 1,605 TFLOPs/s in BF16 on B200 (about 71% utilization), 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton.[^21]

### Sparse and sliding-window attention (Longformer, Mistral)

[sparse attention](/wiki/sparse_attention) approaches restrict each token's attention to a subset of positions rather than the full sequence.

- **Longformer** (Beltagy, Peters, and Cohan, 2020, arXiv:2004.05150) combines local sliding-window attention with global attention on a small number of designated tokens (e.g., the [CLS] token for classification). It also introduced dilated sliding-window attention. Longformer scales linearly with sequence length and was pretrained for up to 4,096 tokens.[^22]
- **BigBird** (Zaheer et al., 2020, NeurIPS) extends Longformer by adding random attention. The authors proved theoretically that BigBird's sparse pattern is a universal approximator of sequence functions and Turing complete.[^23]
- **Mistral 7B** (Mistral AI, 2023) uses sliding window attention with a window size of 4,096 tokens; stacking 32 layers gives an effective receptive field of about 131,000 tokens. A rolling-buffer KV cache capped at the window size halves cache memory at a sequence length of 8,192 tokens, and the saving grows with sequence length (8x at 32K tokens).[^24]
- **Hybrid local-global designs**: [Gemma](/wiki/gemma) 2 (Google, 2024) uses a 1:1 ratio of local and global attention layers with a 4,096-token window. [Gemma](/wiki/gemma) 3 (Google, 2025) shifted to a 5:1 ratio with a 1,024-token window, reducing attention compute by roughly 5x and KV cache memory from about 60% to 15% of total memory while still supporting 128K context lengths via RoPE frequency rescaling on the global layers.[^25]
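A causal sliding-window pattern like the ones in the list above can be expressed as a mask added to the attention scores. The exact window convention varies between models; this sketch assumes each token sees itself plus the `window - 1` tokens before it, and the function name is hypothetical.

```python
import numpy as np

def sliding_window_mask(n, window):
    """0 where query i may attend to key j, -inf elsewhere.

    Allowed positions are i - window < j <= i (causal, local).
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    allowed = (j <= i) & (j > i - window)
    return np.where(allowed, 0.0, -np.inf)

mask = sliding_window_mask(n=6, window=3)
visible_per_row = (mask == 0).sum(axis=1)   # number of keys each query can see

# Information can hop one window per layer, so the reach grows with depth.
# Using the Mistral 7B figures quoted above: window 4,096 and 32 layers.
receptive_field = 4096 * 32
```

Each row of the mask allows at most `window` keys, so attention cost per token stays constant as the sequence grows, while stacking layers extends the effective reach to `window * layers` positions.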

### Linear attention (Performer, Linformer)

Katharopoulos et al. (2020) proposed **linear attention** in "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" (ICML 2020).[^26] The core idea replaces the softmax with a decomposable kernel function. Using feature maps phi:

```
LinearAttention(Q, K, V) = ( phi(Q) ( phi(K)^T V ) ) / ( phi(Q) sum(phi(K)^T) )
```

By exploiting the associative property of matrix multiplication, the computation avoids materializing the n x n attention matrix. The product phi(K)^T V produces a d x d matrix (independent of n), reducing complexity from O(n^2 d) to O(n d^2). Katharopoulos et al. used phi(x) = elu(x) + 1 and showed a direct connection between Transformers and RNNs that enables efficient autoregressive generation.
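The reordering can be verified numerically. The sketch below implements the non-causal form with phi(x) = elu(x) + 1 and compares it with a reference that explicitly builds the n x n kernel matrix; the sizes are illustrative and the function names are assumptions.

```python
import numpy as np

def phi(x):
    """elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise (always positive)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(Q, K, V):
    """O(n d^2) form: never builds an n x n matrix."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                    # (d, d_v) summary, independent of n
    Z = Kf.sum(axis=0)               # (d,) normalizer summary
    return (Qf @ KV) / (Qf @ Z)[:, None]

def quadratic_reference(Q, K, V):
    """O(n^2 d) form: same kernel, explicit n x n attention matrix."""
    A = phi(Q) @ phi(K).T
    return (A / A.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d, d_v = 12, 4, 5                 # illustrative sizes
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))
fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
```

The two results match, which shows that the speedup comes purely from changing the order of multiplication. Note that this kernelized attention is a different function from softmax attention, not a faster way to compute it.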

Related approaches include:

- **Linformer** (Wang et al., 2020, arXiv:2006.04768) projects the keys and values along the sequence-length dimension down to a fixed length, achieving O(n) complexity but at the cost of fixing a maximum sequence length.[^27]
- **Performer** (Choromanski et al., 2020, ICLR 2021, arXiv:2009.14794) uses random feature maps (FAVOR+) to approximate softmax attention with provable accuracy bounds at O(n) cost.[^28]
- **Reformer** (Kitaev, Kaiser, and Levskaya, 2020, arXiv:2001.04451) uses locality-sensitive hashing to attend only to nearby items in hash space.[^29]

### Ring attention (Liu 2023)

**Ring Attention** (Liu, Zaharia, and Abbeel, ICLR 2024, arXiv:2310.01889) is a distributed sequence-parallelism technique that enables processing of extremely long sequences by splitting them across devices arranged in a ring topology.[^30] Each device computes blockwise attention between its local query block and a visiting KV block, while simultaneously sending that KV block to the next device in the ring and receiving the next KV block from the previous device. Provided each block's computation takes at least as long as its transfer, communication is fully overlapped with computation and adds no overhead.

Ring Attention enables training and inference on sequences up to p times longer than what a single device can handle, where p is the number of devices. On 32 A100 GPUs, a 7B model can process over 1 million tokens; on TPUv4-1024, a 3B model can train with 16 million tokens.[^30]
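The data movement can be simulated in a single process. In the sketch below each "device" is just a list index holding one query block and one visiting KV block, and the ring step is a list rotation. It covers the non-causal case only and models no real communication or overlap; all names and sizes are assumptions for illustration.

```python
import numpy as np

def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Simulate p devices, each owning one query block, passing KV blocks in a ring."""
    p = len(Q_blocks)
    d = Q_blocks[0].shape[1]
    d_v = V_blocks[0].shape[1]
    # Per-device online-softmax state
    acc = [np.zeros((q.shape[0], d_v)) for q in Q_blocks]
    row_max = [np.full(q.shape[0], -np.inf) for q in Q_blocks]
    row_sum = [np.zeros(q.shape[0]) for q in Q_blocks]
    kv = list(zip(K_blocks, V_blocks))       # kv[i]: block currently on device i

    for _ in range(p):
        for dev in range(p):
            Kb, Vb = kv[dev]
            S = Q_blocks[dev] @ Kb.T / np.sqrt(d)
            new_max = np.maximum(row_max[dev], S.max(axis=1))
            scale = np.exp(row_max[dev] - new_max)
            P = np.exp(S - new_max[:, None])
            row_sum[dev] = row_sum[dev] * scale + P.sum(axis=1)
            acc[dev] = acc[dev] * scale[:, None] + P @ Vb
            row_max[dev] = new_max
        # Ring step: device i hands its KV block to device i + 1
        kv = kv[-1:] + kv[:-1]

    return np.concatenate([a / s[:, None] for a, s in zip(acc, row_sum)])

rng = np.random.default_rng(0)
n, d, p = 12, 8, 4                            # 4 simulated devices, 3 tokens each
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
ring_out = ring_attention(np.split(Q, p), np.split(K, p), np.split(V, p))

S = Q @ K.T / np.sqrt(d)
W = np.exp(S - S.max(axis=1, keepdims=True))
full_out = (W / W.sum(axis=1, keepdims=True)) @ V
```

After `p` ring steps every query block has seen every KV block exactly once, so `ring_out` equals full attention while each simulated device only ever holds one KV block at a time.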

### Native Sparse Attention (2025)

**Native Sparse Attention (NSA)**, introduced by DeepSeek in February 2025 (arXiv:2502.11089, Best Paper at ACL 2025), is a hardware-aligned sparse attention mechanism designed to be natively trainable end-to-end.[^31] NSA processes inputs through three parallel attention branches combined via learned gating: compressed attention (coarse-grained blocks), selected attention (top-n important blocks at full precision), and sliding-window attention (local recent tokens). On 64K sequences, NSA achieves 9.0x forward speedup, 6.0x backward speedup, and 11.6x decoding speedup while matching or exceeding full-attention quality.

## Position encoding pairing

Attention is intrinsically **permutation-equivariant**: scaled dot-product attention treats its inputs as an unordered set, so positional information must be injected externally for sequence modeling. The choice of position encoding has become a major design lever in modern Transformers, and several encodings are tightly coupled to specific attention variants.

- **Sinusoidal absolute position embeddings**, used in the original Transformer, add a deterministic sinusoidal vector to each token embedding before the first attention layer.[^3]
- **Learned absolute position embeddings**, used by [bert](/wiki/bert) and [gpt](/wiki/gpt)-2, replace the sinusoid with a learned vector per position.[^13]
- **[rotary position embedding](/wiki/rotary_position_embedding)** (RoPE), introduced by Su et al. (2021, arXiv:2104.09864), rotates the query and key vectors by a position-dependent angle inside each attention head.[^32] RoPE has become the default in [llama](/wiki/llama), [mistral](/wiki/mistral), [deepseek](/wiki/deepseek), [qwen](/wiki/qwen), and most modern LLMs. The fact that RoPE acts directly on Q and K means it composes naturally with attention variants such as GQA, MLA (with decoupled RoPE), and sliding-window attention.
- **[alibi](/wiki/alibi)** (Attention with Linear Biases, Press et al., 2022, arXiv:2108.12409) adds a fixed linear penalty to the attention scores based on the distance between query and key positions.[^33] ALiBi enables length extrapolation: models trained at one context length can be evaluated at much longer lengths without re-training.

These position encodings interact with attention in different ways: RoPE rotates Q and K, ALiBi biases the score matrix, and absolute encodings simply add to the token representation. The interaction is non-trivial; for example, MLA's compatibility with RoPE required the decoupled-RoPE design described above.
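
As a rough illustration of the two score-level mechanisms, the sketch below rotates adjacent feature pairs for RoPE and builds a causal ALiBi bias matrix. The base of 10000, the adjacent-pair layout, and the single hand-picked slope are assumptions for the example; real models use per-head slopes and may lay out the rotated pairs differently.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate adjacent feature pairs of x (n, d) by position-dependent angles."""
    n, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)         # (d/2,)
    angles = positions[:, None] * inv_freq[None, :]       # (n, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def alibi_bias(n, slope):
    """Causal ALiBi bias: -slope * (i - j) for keys j <= i, -inf for future keys."""
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    return np.where(j <= i, -slope * (i - j), -np.inf)

rng = np.random.default_rng(0)
q, k = rng.standard_normal((1, 8)), rng.standard_normal((1, 8))
# Same relative offset (3) at two different absolute positions.
score_near = rope(q, np.array([5.0])) @ rope(k, np.array([2.0])).T
score_far = rope(q, np.array([105.0])) @ rope(k, np.array([102.0])).T
bias = alibi_bias(4, slope=0.5)
```

The two RoPE scores agree because the rotated dot product depends only on the offset between query and key positions, while the ALiBi bias is added to the score matrix before the softmax.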

## KV cache (inference optimization)

During autoregressive generation (producing one token at a time), a [language model](/wiki/language_model) must compute attention over all previously generated tokens. Without optimization, this means recomputing the key and value projections for every past token at every generation step, leading to redundant computation that grows quadratically with sequence length.

The **[kv cache](/wiki/kv_cache)** solves this by storing the key and value vectors from all previous time steps. At each new generation step, only the key and value for the new token need to be computed and appended to the cache. The query for the new token then attends over all cached keys and values. This reduces the per-step projection cost from O(n d^2) to O(d^2), though the attention computation itself still requires O(n d) per step.
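
A minimal single-head sketch of this idea follows, assuming NumPy and randomly initialized projection matrices; the `KVCache` class is a hypothetical illustration, not the interface of any particular library.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class KVCache:
    """Append-only store of key and value vectors for one attention head."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def step(self, x, Wq, Wk, Wv):
        # Project only the new token; cached keys and values are reused.
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])
        scores = self.keys @ q / np.sqrt(len(q))   # one score per cached token
        return softmax(scores) @ self.values

def causal_attention(X, Wq, Wk, Wv):
    """Reference: recompute all projections and apply a causal mask."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[1])
    s = np.where(np.tril(np.ones_like(s, dtype=bool)), s, -np.inf)
    return softmax(s) @ V

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
cache = KVCache(d)
incremental = np.stack([cache.step(x, Wq, Wk, Wv) for x in X])
reference = causal_attention(X, Wq, Wk, Wv)
```

The incremental outputs match the masked full computation, which is why caching changes cost but not results.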

The main challenge is memory: the KV cache grows linearly with sequence length, number of layers, number of KV heads, head dimension, and batch size. For a large model such as Llama 2 70B with a context window of 4,096 tokens, the KV cache can consume tens of gigabytes of GPU memory once many sequences are batched together. Several strategies address this:

| Strategy | Mechanism | Typical reduction |
|---|---|---|
| MQA / GQA | Reduce number of KV heads | KV cache reduced by factor of h (MQA) or h/g (GQA) |
| MLA | Compress KV into low-rank latent vector | 93.3% cache reduction (DeepSeek-V2) |
| KV cache quantization | Store cached K/V in FP8 or INT4 | 2-4x memory reduction |
| [paged attention](/wiki/paged_attention) (PagedAttention) | Virtual-memory-style non-contiguous cache blocks | Waste reduced from 60-80% to under 4% |
| Sliding-window caches | Limit cache to fixed window size | Bounded memory regardless of sequence length |
| Token eviction / compression | Selectively remove or merge less important cached tokens | Variable, task-dependent |
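
The head-count and precision rows of the table follow from simple arithmetic. The sketch below assumes a Llama-2-70B-like shape (80 layers, 64 query heads, head dimension 128, FP16 storage); these numbers are illustrative assumptions, and the helper name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values (the leading factor 2) for a batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

GIB = 1024 ** 3
# One KV head per query head (multi-head attention).
mha = kv_cache_bytes(80, 64, 128, seq_len=4096, batch_size=1)
# Eight shared KV heads (grouped-query attention).
gqa = kv_cache_bytes(80, 8, 128, seq_len=4096, batch_size=1)
# A single shared KV head (multi-query attention).
mqa = kv_cache_bytes(80, 1, 128, seq_len=4096, batch_size=1)
# Same GQA shape with 4-bit cache entries (half a byte each).
gqa_int4 = kv_cache_bytes(80, 8, 128, 4096, 1, bytes_per_elem=0.5)

print(mha / GIB, gqa / GIB, mqa / GIB, gqa_int4 / GIB)
```

Under these assumptions the per-sequence cache is 10 GiB with full multi-head KV, 1.25 GiB with eight KV heads, and proportionally less with a single KV head or lower-precision storage; multiplying by the batch size gives the serving footprint.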

**PagedAttention** (Kwon et al., 2023, SOSP 2023), used in the [vllm](/wiki/vllm) serving framework, deserves special mention.[^34] It borrows ideas from operating-system virtual memory to manage cache memory in non-contiguous blocks, reducing fragmentation. Standard implementations waste 60 to 80% of KV-cache memory; PagedAttention reduces waste to under 4% and improves serving throughput by 2 to 4x. **[radix attention](/wiki/radix_attention)** (Zheng et al., 2024) extends this by sharing prefix KV blocks across requests in a radix tree, accelerating multi-turn conversation and structured generation.
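
The block-table idea can be sketched with a toy allocator in plain Python. This is an illustration of the virtual-memory analogy only; the class and method names are hypothetical and do not correspond to vLLM's internal API.

```python
class PagedKVAllocator:
    """Toy allocator mapping each sequence's logical blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids
        self.block_tables = {}                # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:   # last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop(0))
        self.lengths[seq_id] = length + 1

    def locate(self, seq_id, pos):
        """Translate a logical token position to (physical block, offset)."""
        return self.block_tables[seq_id][pos // self.block_size], pos % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]

    def wasted_slots(self):
        """Reserved but unused slots: at most block_size - 1 per sequence."""
        return sum(len(t) * self.block_size - self.lengths[s]
                   for s, t in self.block_tables.items())

alloc = PagedKVAllocator(num_blocks=8, block_size=4)
for step in range(5):           # two sequences grow in an interleaved fashion
    alloc.append_token("a")
    if step < 3:
        alloc.append_token("b")
```

Because blocks are handed out on demand, sequence "a" ends up in physical blocks 0 and 2 while "b" sits in block 1 between them, and the only unused memory is the tail of each sequence's last block.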

## Implementation tricks

Practical attention implementations in modern training and inference stacks rely on a handful of complementary techniques:

- **Mixed-precision training**: matrix multiplications are computed in BF16 or FP16 while the softmax and gradient accumulation use FP32, balancing numerical stability with throughput.[^35] On H100 and B200, FP8 attention (used by FlashAttention-3 and -4) further increases throughput, with careful scaling and incoherent processing to bound numerical error.
- **Triton kernels**: many attention kernels (including Triton implementations of FlashAttention and DeepSeek's NSA) are written in **OpenAI Triton**, a Python-like DSL that targets GPUs and abstracts away much of the CUDA boilerplate.[^36] Triton is widely used for custom attention kernels in research and is increasingly used in production.
- **CUTLASS and CuTe-DSL**: NVIDIA's CUTLASS library and its Python-based CuTe-DSL provide high-performance GEMM building blocks that underpin FlashAttention-3 and -4 on Hopper and Blackwell GPUs.
- **Paged attention (vLLM)**: as described above, PagedAttention enables high-throughput serving by managing the KV cache as virtual-memory pages, making it feasible to serve many concurrent requests with shared prefixes.[^34]
- **Continuous batching**: serving frameworks such as [vllm](/wiki/vllm) and TensorRT-LLM use continuous batching (also called in-flight batching), where new requests join an ongoing batch as previous ones finish, dramatically increasing GPU utilization for autoregressive workloads.
- **Speculative decoding**: speculative-decoding and lookahead techniques generate multiple candidate tokens with a small draft model and verify them with a larger target model in a single forward pass, increasing tokens-per-second without changing the underlying attention algorithm.
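
The reason accumulation is kept in FP32 under mixed precision can be seen in a small NumPy experiment. This is a generic illustration of FP16 rounding rather than a model of any specific kernel.

```python
import numpy as np

ones = np.ones(4096, dtype=np.float16)

# Accumulate in FP16: once the running sum reaches 2048, neighbouring
# representable values are 2 apart, so adding 1 no longer changes the sum.
acc16 = np.float16(0)
for x in ones:
    acc16 = np.float16(acc16 + x)

# Accumulate the same FP16 inputs in an FP32 accumulator.
acc32 = np.float32(0)
for x in ones:
    acc32 += np.float32(x)

print(float(acc16), float(acc32))
```

The FP16 accumulator stalls at 2048 while the FP32 accumulator reaches the correct total of 4096, which is the kind of error that FP32 accumulation is meant to avoid.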

## Limitations

### O(n^2) memory and compute

Standard self-attention has time complexity O(n^2 d) and, when the n x n attention matrix is materialized, memory complexity O(n^2), where n is sequence length and d is model dimension.[^3] The quadratic dependence on n is the fundamental bottleneck for very long contexts. Although FlashAttention reduces the **memory** cost to O(n) by never materializing the full n x n attention matrix, the **compute** cost remains quadratic for exact attention. This is why sparse, linear, and state-space alternatives remain active research areas.
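
To make the quadratic growth concrete, the sketch below estimates the size of one materialized score matrix and the approximate FLOP count for a single head. The FP16 element size and the head dimension of 128 are assumptions chosen for illustration.

```python
def attention_matrix_bytes(n, bytes_per_elem=2):
    """Memory to materialize one n x n score matrix (one head, one sequence)."""
    return n * n * bytes_per_elem

def attention_flops(n, d):
    """Approximate FLOPs for Q K^T plus the weighted sum over V: about 4 n^2 d."""
    return 4 * n * n * d

for n in (4_096, 32_768, 131_072):
    gib = attention_matrix_bytes(n) / 1024 ** 3
    print(f"n={n}: {gib} GiB per head, {attention_flops(n, 128):.3e} FLOPs")
```

Each doubling of n quadruples both figures. Tiled kernels such as FlashAttention avoid storing the matrix but still perform the same number of FLOPs.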

### Long-context challenges

Empirically, long-context Transformers face several distinct failure modes:

- **Lost in the middle**: Liu et al. (2023, arXiv:2307.03172) showed that even capable LLMs are markedly less accurate at retrieving information from the middle of a long context compared to the beginning or end, producing a U-shaped accuracy curve.[^37]
- **Position-encoding extrapolation**: many position encodings struggle to generalize beyond the training context length. Techniques like RoPE frequency scaling (NTK-aware scaling, YaRN), ALiBi, and position interpolation address this with varying success.
- **Softmax attention dilution**: as context length grows, individual attention weights become smaller, making it harder to pick out a few important tokens. Differential attention[^17] and learned sparse attention[^31] are partial remedies.
- **Throughput and latency**: even with FlashAttention, prefill and decode latency grow with context length, motivating the variants surveyed above.
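
The dilution effect in the list above can be quantified with an idealized example: one relevant key whose score exceeds that of n - 1 equally scored distractors by a fixed margin. The setup and function name are illustrative assumptions.

```python
import math

def peak_attention_weight(n, margin):
    """Softmax weight of one key that beats n - 1 equal distractors by `margin`."""
    return math.exp(margin) / (math.exp(margin) + (n - 1))

for n in (1_000, 10_000, 100_000):
    print(n, round(peak_attention_weight(n, margin=5.0), 4))
```

With a margin of 5, the relevant key receives roughly 13% of the attention mass at 1,000 tokens but only about 0.15% at 100,000 tokens, so the score margin has to grow with log n to keep the weight constant.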

## Visualization and interpretation

### Are attention weights an explanation?

One practical advantage of attention mechanisms is that attention weights can be inspected to gain insight into what the model is focusing on. Attention maps are typically visualized as heatmaps, where brighter entries indicate stronger attention between two positions. Bahdanau et al. (2014) and Vaswani et al. (2017) used such visualizations to argue that the attention mechanism recovers linguistically interpretable patterns (e.g., word alignments in translation, head-dependent relations in parsing).[^4][^3]

However, researchers have cautioned against over-interpreting attention weights. **Jain and Wallace (2019)** showed that attention weights often do not correlate well with other measures of feature importance and that **alternative attention distributions can produce identical predictions**, challenging the view that attention is itself an explanation.[^38] Subsequent work (Wiegreffe and Pinter, 2019, EMNLP) clarified the conditions under which attention can or cannot be interpreted as explanation.[^39] Attention weights indicate how information flows through the network but do not necessarily indicate which inputs are causally important for the output; more rigorous interpretability methods, such as probing classifiers and gradient-based attribution, are typically needed to draw reliable conclusions.
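
One simple way that different attention distributions can yield the same output is purely linear-algebraic: when there are more tokens than value dimensions, some changes to the weights cancel out in the weighted sum. The NumPy sketch below illustrates this for a single attention output; it is not the adversarial search procedure used by Jain and Wallace.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3                                  # more tokens than value dimensions
V = rng.standard_normal((n, d))
weights = np.full(n, 1.0 / n)                # original attention distribution

# A direction delta leaves the output and the total mass unchanged when
# V^T delta = 0 and sum(delta) = 0, i.e. delta is in the null space of A.
A = np.vstack([V.T, np.ones((1, n))])        # shape (d + 1, n)
_, _, vt = np.linalg.svd(A)
delta = vt[-1]                               # null-space direction, since n > d + 1

# Step size chosen so that every weight stays strictly positive.
step = 0.9 * weights.min() / np.abs(delta).max()
alt_weights = weights + step * delta

original_output = weights @ V
alternative_output = alt_weights @ V
```

The two weight vectors are both valid probability distributions and differ from each other, yet they produce the same output vector, so the weights alone do not pin down which tokens mattered.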

## Alternatives to attention

A line of research aims to replace attention entirely with mechanisms that have linear or sub-quadratic complexity while retaining the parallelism and expressivity of Transformers.

- **State-space models (Mamba)**: [mamba](/wiki/mamba) (Gu and Dao, 2023, arXiv:2312.00752) makes the parameters of a structured state-space model **input-dependent** (selective), enabling dynamic information routing.[^40] Gu and Dao report linear-time complexity O(n), 5x higher inference throughput than Transformers, and scaling to million-length sequences; Mamba-3B outperforms Transformers of the same size and matches Transformers of twice its size on language-modeling benchmarks. Recent hybrid architectures such as [jamba](/wiki/jamba) (AI21 Labs, 2024) interleave Mamba, attention, and Mixture-of-Experts layers.
- **Linear RNNs (RWKV)**: [rwkv](/wiki/rwkv) (Peng et al., 2023, arXiv:2305.13048) is an RNN-style architecture with linear complexity in sequence length that nonetheless can be trained in parallel like a Transformer.[^41] RWKV combines the expressivity of attention with the inference efficiency of RNNs; the project has released open-weight models from 100M to 14B parameters.
- **Hyena (Poli 2023)**: [hyena](/wiki/hyena) (Poli et al., 2023, ICML 2023, arXiv:2302.10866) replaces attention with a recurrence built from long convolutions and data-controlled gating.[^42] Hyena achieves subquadratic complexity and is competitive with attention on language modeling at sequence lengths up to 64K, where it offers 100x speedup over standard attention.
- **RetNet (Retentive Networks)**: Sun et al. (2023, arXiv:2307.08621) propose retention, a mechanism that admits a parallel form (similar to attention) and a recurrent form (similar to an RNN), enabling O(1) inference per step with parallel training.[^43]
- **DeltaNet**: Schlag, Irie, and Schmidhuber (2021) and later Yang et al. (2024) developed DeltaNet, a linear-attention variant with an explicit delta-rule update that improves recall over plain linear attention.[^44]
- **Hybrid architectures**: rather than fully replacing attention, several systems interleave attention layers with state-space or linear-recurrence layers. Examples include [jamba](/wiki/jamba), Striped Hyena, and Samba. These hybrids attempt to combine the precise recall of attention with the efficiency of sub-quadratic mechanisms.

Recent benchmarks suggest a nuanced picture: attention excels at **precise recall** from context (the "needle in a haystack" task and related associative recall), while SSMs and linear recurrences excel at **compression and efficiency** over long sequences.[^45] Hybrid architectures are an attempt to combine the strengths of both.
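
The dual parallel and recurrent forms mentioned for RetNet above can be sketched in NumPy. This is a simplified single-head version with a scalar decay gamma; it omits the position rotation, normalization, and multi-scale heads of the full architecture, and the function names are illustrative.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V with causal decay D[i, j] = gamma^(i - j)."""
    n = Q.shape[0]
    i, j = np.arange(n)[:, None], np.arange(n)[None, :]
    D = np.where(i >= j, gamma ** np.maximum(i - j, 0), 0.0)
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: a fixed-size (d, d_v) state updated once per token."""
    state = np.zeros((K.shape[1], V.shape[1]))
    outputs = []
    for q, k, v in zip(Q, K, V):
        state = gamma * state + np.outer(k, v)   # cost independent of position
        outputs.append(q @ state)
    return np.stack(outputs)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((7, 4)) for _ in range(3))
parallel_out = retention_parallel(Q, K, V, gamma=0.9)
recurrent_out = retention_recurrent(Q, K, V, gamma=0.9)
```

Both forms compute the same outputs. The parallel form suits training on whole sequences, while the recurrent form keeps a constant-size state at inference time, which is also why such models must compress history rather than look it up exactly.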

## Comparison of attention variants

The following table summarizes major attention variants, their key characteristics, and the models that adopt them.

| Variant | Year | Authors | Key idea | Complexity | Notable models |
|---|---|---|---|---|---|
| Additive (Bahdanau) | 2014 | Bahdanau, Cho, Bengio | Learned alignment via feedforward net | O(n m) | Early NMT |
| Multiplicative (Luong) | 2015 | Luong, Pham, Manning | Dot-product / general scoring | O(n m) | Early NMT |
| Scaled dot-product | 2017 | Vaswani et al. | Q, K, V with 1/sqrt(d_k) scaling | O(n^2 d) | All Transformers |
| Multi-head attention | 2017 | Vaswani et al. | h parallel heads in subspaces | O(n^2 d) | All Transformers |
| Sparse (Longformer) | 2020 | Beltagy et al. | Local + global + dilated | O(n w) | Longformer |
| Sparse (BigBird) | 2020 | Zaheer et al. | Local + global + random | O(n (w+r+g)) | BigBird |
| Linear attention | 2020 | Katharopoulos et al. | Kernel feature map | O(n d^2) | Linear Transformer |
| Performer | 2020 | Choromanski et al. | Random features (FAVOR+) | O(n d^2) | Performer |
| Multi-query (MQA) | 2019 | Shazeer | Single shared KV head | O(n^2 d) | PaLM, Falcon |
| Grouped-query (GQA) | 2023 | Ainslie et al. | g shared KV groups | O(n^2 d) | Llama 2, Llama 3, Mistral |
| FlashAttention | 2022 | Dao et al. | IO-aware tiling, exact, O(n) memory | O(n^2 d) compute | Widely adopted |
| FlashAttention-2 | 2023 | Dao | Better parallelism, warp partitioning | O(n^2 d) compute | Widely adopted |
| FlashAttention-3 | 2024 | Shah et al. | Hopper async, FP8, warp specialization | O(n^2 d) compute | H100 deployments |
| FlashAttention-4 | 2026 | Zadouri, Shah, Dao et al. | Blackwell pipelining | O(n^2 d) compute | B200 deployments |
| Sliding window | 2023 | Mistral AI | Local window + rolling KV cache | O(n w) | Mistral 7B, Gemma 3 |
| Multi-head latent (MLA) | 2024 | DeepSeek | Latent compression of KV cache | O(n^2 d) | DeepSeek-V2, V3, R1 |
| Differential attention | 2024 | Ye et al. | Difference of two softmax maps | O(n^2 d) | Diff Transformer |
| Native sparse (NSA) | 2025 | DeepSeek | Trained sparse: compress + select + slide | O(n (n/r + k + w)) | DeepSeek (research) |
| Ring Attention | 2023 | Liu, Zaharia, Abbeel | Distributed ring sequence parallel | O(n^2 d / p) per device | Long-context training |
| Selective SSM (Mamba) | 2023 | Gu, Dao | Input-dependent SSM | O(n d) | Mamba, Jamba |

## Attention in computer vision

### Vision Transformer (ViT)

The [Vision Transformer](/wiki/vision_transformer) (ViT, Dosovitskiy et al., 2020, ICLR 2021, arXiv:2010.11929) demonstrated that a pure Transformer applied directly to sequences of image patches can match or exceed state-of-the-art [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) on image classification.[^46] ViT splits an image into fixed-size patches (typically 16x16 pixels), flattens each patch into a vector, projects it to the model dimension, prepends a learnable [CLS] token, and adds positional embeddings, then processes the resulting sequence with a standard Transformer encoder using multi-head self-attention.
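
The input pipeline described above can be sketched in NumPy. The model width of 64 and the random matrices standing in for learned parameters are assumptions made for the example; only the patch size and the overall sequence of steps come from the description.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape                       # H and W must be multiples of patch
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
d_model = 64                                    # hypothetical model width

patches = patchify(image, patch=16)             # 14 x 14 = 196 patches of 768 values
W_embed = rng.standard_normal((patches.shape[1], d_model)) * 0.02  # stands in for a learned projection
cls_token = np.zeros((1, d_model))              # learnable in a real model
pos_embed = rng.standard_normal((patches.shape[0] + 1, d_model)) * 0.02

# Sequence fed to the Transformer encoder: [CLS] + projected patches, plus positions.
tokens = np.concatenate([cls_token, patches @ W_embed]) + pos_embed
```

A 224x224 image with 16x16 patches therefore becomes a sequence of 197 tokens, over which self-attention operates exactly as it would over word tokens.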

Self-attention in ViT allows every patch to attend to every other patch, capturing global relationships across the entire image from the very first layer, in contrast to CNNs which build global understanding only gradually through stacked local convolutions. ViT has since spawned many variants, including DeiT (Touvron et al., 2021), [Swin Transformer](/wiki/swin_transformer) (Liu et al., 2021) which uses shifted-window attention for efficiency, and BEiT (Bao et al., 2021).

### Attention in diffusion models

Modern text-to-image [diffusion models](/wiki/diffusion_model) like [Stable Diffusion](/wiki/stable_diffusion), [DALL-E](/wiki/dall-e), and [Imagen](/wiki/imagen) rely heavily on attention. **Self-attention** operates within the image latent representations at multiple spatial resolutions, preserving global geometric coherence. **Cross-attention** connects the text prompt (providing keys and values from a text encoder such as [CLIP](/wiki/clip) or [T5](/wiki/t5)) to the image latent features (providing queries), controlling which regions of the image correspond to which words.[^47] Researchers have leveraged cross-attention maps for prompt-to-prompt editing, attention-based layout control, and interpretability analysis.
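
The direction of information flow in cross-attention can be sketched as follows. All sizes are hypothetical, the weights are random rather than learned, and the 8x8 latent grid is an assumption chosen to keep the example small.

```python
import numpy as np

def cross_attention(latents, text, Wq, Wk, Wv):
    """Queries come from image latents; keys and values come from text tokens."""
    Q, K, V = latents @ Wq, text @ Wk, text @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[1])              # (n_pixels, n_tokens)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_pixels, n_tokens, d_img, d_txt, d = 64, 7, 32, 48, 16  # hypothetical sizes
latents = rng.standard_normal((n_pixels, d_img))          # flattened 8x8 latent grid
text = rng.standard_normal((n_tokens, d_txt))             # text-encoder outputs
Wq = rng.standard_normal((d_img, d))
Wk = rng.standard_normal((d_txt, d))
Wv = rng.standard_normal((d_txt, d))

out, attn = cross_attention(latents, text, Wq, Wk, Wv)
token_map = attn[:, 3].reshape(8, 8)    # spatial map of attention paid to token 3
```

Each row of `attn` is a distribution over prompt tokens for one latent position, and reshaping a column back to the spatial grid gives the kind of per-word attention map used in editing and layout-control methods.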

## Applications of attention

### What is attention used for?

Attention mechanisms have been adopted across virtually every domain of machine learning:

- **[Natural language processing](/wiki/natural_language_processing)**: attention is the backbone of [bert](/wiki/bert), the [gpt](/wiki/gpt) series, [t5](/wiki/t5), [llama](/wiki/llama), [claude](/wiki/claude), and [deepseek](/wiki/deepseek) models, enabling translation, summarization, question answering, and code generation.
- **[Computer vision](/wiki/computer_vision)**: Vision Transformers and their variants use self-attention for image classification, object detection, and segmentation.
- **Speech and audio**: models such as [Whisper](/wiki/whisper) ([OpenAI](/wiki/openai), 2022) use cross-attention between audio features and text tokens for speech recognition.
- **Multimodal learning**: cross-attention connects different modalities in models such as Flamingo (DeepMind, 2022), Stable Diffusion, and video understanding systems.
- **Protein structure prediction**: [alphafold](/wiki/alphafold) 2 (DeepMind, 2021) uses a specialized attention-based module, the Evoformer, that applies attention over multiple-sequence-alignment and residue-pair representations.[^48]
- **[Reinforcement learning](/wiki/reinforcement_learning)**: Decision Transformer (Chen et al., 2021) frames RL as a sequence-modeling problem, applying self-attention to sequences of states, actions, and rewards.[^49]

## Explain like I'm 5 (ELI5)

Imagine you are in a classroom and the teacher asks a question. You look around the room for clues. Some classmates are whispering the answer, some are drawing in their notebooks, and some are looking out the window. Attention is like choosing to listen more closely to the classmates who seem to know the answer and ignoring the ones looking out the window. In machine learning, attention lets the computer do something similar: it decides which pieces of information are most helpful for the task at hand and pays more attention to those, while downplaying the rest.

## See also

- [self attention](/wiki/self_attention): the detailed treatment of self-attention as an operation.
- [multi-head self-attention](/wiki/multi-head_self-attention): multi-head self-attention specifically.
- [cross attention](/wiki/cross_attention): attention between two different sequences.
- [bahdanau attention](/wiki/bahdanau_attention): the 2014 paper that started it all.
- [attention is all you need](/wiki/attention_is_all_you_need) / [attention is all you need transformer](/wiki/attention_is_all_you_need_transformer): the 2017 Transformer paper.
- [multi query attention](/wiki/multi_query_attention): MQA (Shazeer 2019).
- [grouped query attention](/wiki/grouped_query_attention): GQA (Ainslie 2023), the modern default.
- [mla](/wiki/mla) / [multi-head latent attention](/wiki/multi-head_latent_attention): MLA (DeepSeek 2024).
- [flash attention](/wiki/flash_attention) / [flash attention 3](/wiki/flash_attention_3): IO-aware exact attention.
- [paged attention](/wiki/paged_attention) / [radix attention](/wiki/radix_attention): KV-cache management for serving.
- [sparse attention](/wiki/sparse_attention): sparse and sliding-window variants.
- [rotary position embedding](/wiki/rotary_position_embedding) / [alibi](/wiki/alibi): position encodings paired with attention.
- [kv cache](/wiki/kv_cache): autoregressive inference optimization.
- [transformer](/wiki/transformer): the architecture built on attention.
- [bert](/wiki/bert), [gpt](/wiki/gpt): bidirectional vs causal exemplars.
- [mamba](/wiki/mamba), [rwkv](/wiki/rwkv), [hyena](/wiki/hyena): alternatives to attention.
- [vaswani](/wiki/vaswani), [bahdanau](/wiki/bahdanau): researchers behind the key papers.

## References

[^1]: Wikipedia. "Attention (machine learning)." https://en.wikipedia.org/wiki/Attention_(machine_learning)
[^2]: Lilian Weng. "Attention? Attention!" *Lil'Log*, June 2018. https://lilianweng.github.io/posts/2018-06-24-attention/
[^3]: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." *NeurIPS 2017*. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
[^4]: Bahdanau, D., Cho, K., and Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." *ICLR 2015*. arXiv:1409.0473. https://arxiv.org/abs/1409.0473
[^5]: Cherry, E. C. (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears." *Journal of the Acoustical Society of America*, 25(5), 975-979. https://doi.org/10.1121/1.1907229
[^6]: Treisman, A. M., and Gelade, G. (1980). "A Feature-Integration Theory of Attention." *Cognitive Psychology*, 12(1), 97-136. https://doi.org/10.1016/0010-0285(80)90005-5
[^7]: Corbetta, M., and Shulman, G. L. (2002). "Control of Goal-Directed and Stimulus-Driven Attention in the Brain." *Nature Reviews Neuroscience*, 3, 201-215. https://doi.org/10.1038/nrn755
[^8]: Itti, L., Koch, C., and Niebur, E. (1998). "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis." *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 20(11), 1254-1259. https://doi.org/10.1109/34.730558
[^9]: Xu, K., Ba, J., Kiros, R., et al. (2015). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention." *ICML 2015*. arXiv:1502.03044. https://arxiv.org/abs/1502.03044
[^10]: Sutskever, I., Vinyals, O., and Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks." *NeurIPS 2014*. arXiv:1409.3215. https://arxiv.org/abs/1409.3215
[^11]: Luong, M.-T., Pham, H., and Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." *EMNLP 2015*. arXiv:1508.04025. https://arxiv.org/abs/1508.04025
[^12]: Google Scholar citation count for "Attention Is All You Need" (Vaswani et al., 2017). https://scholar.google.com/scholar?cluster=2960712678066186980
[^13]: Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." *NAACL 2019*. arXiv:1810.04805. https://arxiv.org/abs/1810.04805
[^14]: Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150. https://arxiv.org/abs/1911.02150
[^15]: Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." *EMNLP 2023*. arXiv:2305.13245. https://arxiv.org/abs/2305.13245
[^16]: DeepSeek-AI. (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434. https://arxiv.org/abs/2405.04434
[^17]: Ye, T., Dong, L., Xia, Y., Sun, Y., Zhu, Y., Huang, G., and Wei, F. (2024). "Differential Transformer." arXiv:2410.05258 (ICLR 2025 Oral). https://arxiv.org/abs/2410.05258
[^18]: Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." *NeurIPS 2022*. arXiv:2205.14135. https://arxiv.org/abs/2205.14135
[^19]: Dao, T. (2023). "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning." arXiv:2307.08691. https://arxiv.org/abs/2307.08691
[^20]: Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. (2024). "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision." *NeurIPS 2024*. arXiv:2407.08608. https://arxiv.org/abs/2407.08608
[^21]: Zadouri, T., Shah, J., Hohnerbach, M., Liu, T., Thakkar, V., and Dao, T. (2026). "FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling." *MLSys 2026*. https://tridao.me/blog/2025/flash4/
[^22]: Beltagy, I., Peters, M. E., and Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150. https://arxiv.org/abs/2004.05150
[^23]: Zaheer, M., Guruganesh, G., Dubey, K. A., et al. (2020). "Big Bird: Transformers for Longer Sequences." *NeurIPS 2020*. arXiv:2007.14062. https://arxiv.org/abs/2007.14062
[^24]: Jiang, A. Q., Sablayrolles, A., Mensch, A., et al. (2023). "Mistral 7B." arXiv:2310.06825. https://arxiv.org/abs/2310.06825
[^25]: Google DeepMind. (2025). "Gemma 3 Technical Report." arXiv:2503.19786. https://arxiv.org/abs/2503.19786
[^26]: Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." *ICML 2020*. arXiv:2006.16236. https://arxiv.org/abs/2006.16236
[^27]: Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). "Linformer: Self-Attention with Linear Complexity." arXiv:2006.04768. https://arxiv.org/abs/2006.04768
[^28]: Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2020). "Rethinking Attention with Performers." *ICLR 2021*. arXiv:2009.14794. https://arxiv.org/abs/2009.14794
[^29]: Kitaev, N., Kaiser, L., and Levskaya, A. (2020). "Reformer: The Efficient Transformer." *ICLR 2020*. arXiv:2001.04451. https://arxiv.org/abs/2001.04451
[^30]: Liu, H., Zaharia, M., and Abbeel, P. (2023). "Ring Attention with Blockwise Transformers for Near-Infinite Context." *ICLR 2024*. arXiv:2310.01889. https://arxiv.org/abs/2310.01889
[^31]: DeepSeek-AI. (2025). "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention." arXiv:2502.11089 (ACL 2025 Best Paper). https://arxiv.org/abs/2502.11089
[^32]: Su, J., Lu, Y., Pan, S., Wen, B., and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. https://arxiv.org/abs/2104.09864
[^33]: Press, O., Smith, N. A., and Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." *ICLR 2022*. arXiv:2108.12409. https://arxiv.org/abs/2108.12409
[^34]: Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." *SOSP 2023*. arXiv:2309.06180. https://arxiv.org/abs/2309.06180
[^35]: Micikevicius, P., Narang, S., Alben, J., et al. (2018). "Mixed Precision Training." *ICLR 2018*. arXiv:1710.03740. https://arxiv.org/abs/1710.03740
[^36]: Tillet, P., Kung, H. T., and Cox, D. (2019). "Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations." *MAPL 2019*. https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
[^37]: Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. (2023). "Lost in the Middle: How Language Models Use Long Contexts." arXiv:2307.03172. https://arxiv.org/abs/2307.03172
[^38]: Jain, S., and Wallace, B. C. (2019). "Attention is not Explanation." *NAACL 2019*. arXiv:1902.10186. https://arxiv.org/abs/1902.10186
[^39]: Wiegreffe, S., and Pinter, Y. (2019). "Attention is not not Explanation." *EMNLP 2019*. arXiv:1908.04626. https://arxiv.org/abs/1908.04626
[^40]: Gu, A., and Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. https://arxiv.org/abs/2312.00752
[^41]: Peng, B., Alcaide, E., Anthony, Q., et al. (2023). "RWKV: Reinventing RNNs for the Transformer Era." *EMNLP 2023 Findings*. arXiv:2305.13048. https://arxiv.org/abs/2305.13048
[^42]: Poli, M., Massaroli, S., Nguyen, E., et al. (2023). "Hyena Hierarchy: Towards Larger Convolutional Language Models." *ICML 2023*. arXiv:2302.10866. https://arxiv.org/abs/2302.10866
[^43]: Sun, Y., Dong, L., Huang, S., Ma, S., Xia, Y., Xue, J., Wang, J., and Wei, F. (2023). "Retentive Network: A Successor to Transformer for Large Language Models." arXiv:2307.08621. https://arxiv.org/abs/2307.08621
[^44]: Yang, S., Wang, B., Shen, Y., Panda, R., and Kim, Y. (2024). "Gated Linear Attention Transformers with Hardware-Efficient Training." *ICML 2024*. arXiv:2312.06635. https://arxiv.org/abs/2312.06635
[^45]: Arora, S., Eyuboglu, S., Timalsina, A., Johnson, I., Poli, M., Zou, J., Rudra, A., and Ré, C. (2024). "Zoology: Measuring and Improving Recall in Efficient Language Models." *ICLR 2024*. arXiv:2312.04927. https://arxiv.org/abs/2312.04927
[^46]: Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." *ICLR 2021*. arXiv:2010.11929. https://arxiv.org/abs/2010.11929
[^47]: Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." *CVPR 2022*. arXiv:2112.10752. https://arxiv.org/abs/2112.10752
[^48]: Jumper, J., Evans, R., Pritzel, A., et al. (2021). "Highly Accurate Protein Structure Prediction with AlphaFold." *Nature*, 596, 583-589. https://doi.org/10.1038/s41586-021-03819-3
[^49]: Chen, L., Lu, K., Rajeswaran, A., et al. (2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling." *NeurIPS 2021*. arXiv:2106.01345. https://arxiv.org/abs/2106.01345
[^50]: Pearson, H., and Ledford, H. (2025). "Exclusive: the most-cited papers of the twenty-first century." *Nature*, 640, 588-592, ranking "Attention Is All You Need" seventh across five major citation databases. https://www.nature.com/articles/d41586-025-01125-y

