See also: Machine learning terms
Attention is a family of techniques in machine learning that allow a model to focus on specific parts of an input while making predictions. Rather than compressing an entire input into a single fixed-size representation, attention mechanisms let the model dynamically weigh the importance of different input elements at each step of computation. This selective focus mirrors, in a loose sense, how biological attention works: irrelevant information is suppressed while relevant information is amplified.
Originally developed for neural machine translation, attention has become a foundational building block across nearly every area of deep learning. It is the core operation inside the Transformer architecture, which underpins large language models such as GPT-4, LLaMA, and Claude, as well as vision models, speech systems, and generative image models. Understanding attention is essential for understanding modern artificial intelligence.
Before attention was introduced, sequence-to-sequence (seq2seq) models for tasks like machine translation relied on an encoder-decoder framework built from recurrent neural networks (RNNs). The encoder processed the source sentence token by token and compressed it into a single fixed-length context vector, which the decoder then used to generate the target sentence. Sutskever, Vinyals, and Le (2014) demonstrated that this approach could achieve strong results with LSTM networks, but the fixed-length bottleneck caused performance to degrade on longer sentences because a single vector could not adequately capture all the information in a long input sequence.
The first widely recognized attention mechanism for neural networks was proposed by Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio in their 2014 paper "Neural Machine Translation by Jointly Learning to Align and Translate," published at ICLR 2015. The key insight was to replace the fixed-length context vector with a dynamic one: instead of forcing the encoder to compress the entire source sentence into a single vector, the decoder could look back at all encoder hidden states and select the most relevant ones at each generation step.
Bahdanau attention works as follows. For each decoder time step t, the mechanism computes an alignment score e_{tj} between the previous decoder hidden state s_{t-1} and each encoder hidden state h_j using a learned feedforward network:
e_{tj} = v^T tanh(W_s * s_{t-1} + W_h * h_j)
These scores are normalized through a softmax function to produce attention weights alpha_{tj}. The context vector c_t is then the weighted sum of encoder hidden states:
c_t = sum_j(alpha_{tj} * h_j)
Because the alignment scores are computed using an additive combination passed through a neural network, this variant is often called additive attention. The approach yielded translation quality comparable to the state-of-the-art phrase-based system on English-to-French translation, and crucially, it did not suffer the same degradation on long sentences that earlier encoder-decoder models exhibited.
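The following NumPy sketch illustrates one Bahdanau-style decoding step. The dimensions and random parameters are purely illustrative stand-ins for trained weights; the names W_s, W_h, and v follow the formula above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def additive_attention(s_prev, H, W_s, W_h, v):
    """Bahdanau-style additive attention for one decoder step.

    s_prev: previous decoder hidden state, shape (d_dec,)
    H:      encoder hidden states, shape (n, d_enc)
    W_s, W_h, v: parameters of the learned alignment network
    Returns the context vector c_t and the attention weights alpha.
    """
    # e_{tj} = v^T tanh(W_s s_{t-1} + W_h h_j), computed for all j at once
    scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v  # shape (n,)
    alpha = softmax(scores)                           # attention weights
    c_t = alpha @ H                                   # weighted sum of encoder states
    return c_t, alpha

# Toy example with random parameters (illustrative sizes only)
rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 5, 8, 8, 16
H = rng.standard_normal((n, d_enc))
s_prev = rng.standard_normal(d_dec)
W_s = rng.standard_normal((d_att, d_dec))
W_h = rng.standard_normal((d_att, d_enc))
v = rng.standard_normal(d_att)
c_t, alpha = additive_attention(s_prev, H, W_s, W_h, v)
print(alpha.round(3), c_t.shape)  # weights sum to 1; c_t has shape (d_enc,)
```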
In 2015, Minh-Thang Luong, Hieu Pham, and Christopher Manning published "Effective Approaches to Attention-based Neural Machine Translation," which introduced several refinements and alternatives to Bahdanau attention. Luong et al. proposed two broad classes of attention: global attention, which considers all source positions at every decoding step, and local attention, which first predicts an aligned source position and then attends only to a small window around it.
Luong attention also introduced multiple scoring functions for computing alignment:
| Scoring function | Formula | Notes |
|---|---|---|
| Dot product | score(s_t, h_j) = s_t^T h_j | Simplest; no extra parameters |
| General | score(s_t, h_j) = s_t^T W_a h_j | Learned weight matrix W_a |
| Concat (additive) | score(s_t, h_j) = v^T tanh(W[s_t; h_j]) | Similar to Bahdanau |
A key implementation difference is that Luong attention uses the current decoder hidden state s_t to compute alignment scores, whereas Bahdanau attention uses the previous state s_{t-1}. Because the dot product and general scoring functions rely on matrix multiplication rather than a feedforward network, Luong attention is sometimes called multiplicative attention and tends to be computationally faster.
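A minimal sketch of the two multiplicative scoring functions, assuming a single decoder step and illustrative shapes; the resulting scores are softmaxed and used to average the encoder states exactly as in Bahdanau attention.

```python
import numpy as np

def luong_scores(s_t, H, W_a=None, mode="dot"):
    """Luong-style multiplicative alignment scores for one decoder step.

    s_t: current decoder hidden state, shape (d,)
    H:   encoder hidden states, shape (n, d)
    W_a: learned (d, d) matrix for the "general" variant
    """
    if mode == "dot":      # score(s_t, h_j) = s_t^T h_j
        return H @ s_t
    if mode == "general":  # score(s_t, h_j) = s_t^T W_a h_j
        return H @ (W_a.T @ s_t)
    raise ValueError(f"unknown mode: {mode}")
```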
The attention mechanism reached its most influential form in the 2017 paper "Attention Is All You Need" by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Lukasz Kaiser, and Illia Polosukhin. Published at NeurIPS 2017, this paper introduced the Transformer architecture, which dispenses with recurrence and convolutions entirely and relies solely on attention mechanisms. As of 2025, the paper has been cited more than 173,000 times, making it one of the most cited papers in the history of computer science.
The Transformer achieved 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the previous best by more than 2 BLEU points. On English-to-French, it set a new single-model record of 41.8 BLEU after training for only 3.5 days on eight GPUs.
The Transformer formalized attention using the vocabulary of queries, keys, and values (Q, K, V). The analogy is drawn from information retrieval: a query represents what the model is looking for, keys represent the items available to attend to, and values hold the content that will be retrieved. In self-attention, all three are derived from the same input sequence through learned linear projections:
Q = X * W_Q, K = X * W_K, V = X * W_V
where X is the input matrix (each row is a token embedding) and W_Q, W_K, W_V are learned weight matrices.
The core computation in the Transformer is scaled dot-product attention:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where d_k is the dimensionality of the key vectors.
The formula works in three stages: first, the matrix product Q * K^T computes a similarity score between every query and every key; second, the scores are divided by sqrt(d_k) and passed through a softmax, yielding attention weights that sum to 1 for each query; third, those weights form a weighted sum over the value vectors, producing the output for each position.
Why scale by sqrt(d_k)? When d_k is large, the dot products between queries and keys tend to grow in magnitude. If the individual elements of Q and K are independent random variables with mean 0 and variance 1, then their dot product has mean 0 and variance d_k. Large-magnitude dot products push the softmax function into regions where it has extremely small gradients, slowing or stalling learning. Dividing by sqrt(d_k) normalizes the variance of the dot products back to 1, keeping the softmax in a region with healthier gradients.
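The computation fits in a few lines of NumPy. The sketch below is illustrative rather than optimized; the optional mask argument is an addition (not part of the original formulation) that later sections of this article can use for causal or windowed attention. The short experiment at the end checks the variance argument for the sqrt(d_k) factor with random Q and K.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask: optional boolean (n_q, n_k) array; False entries are blocked.
    """
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # stage 1: scaled similarities
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # hide disallowed positions
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # stage 2: row-wise softmax
    return weights @ V                           # stage 3: weighted sum of values

# Why sqrt(d_k): dot products of unit-variance random vectors have variance d_k
rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 10000, 512))
print(np.var((q * k).sum(-1)))                  # ~512 without scaling
print(np.var((q * k).sum(-1) / np.sqrt(512)))   # ~1 with scaling
```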
Self-attention (also called intra-attention) is the application of the attention mechanism within a single sequence. Each token generates a query, a key, and a value. Every token's query is compared against every other token's key, and the resulting attention weights determine how much each token's value contributes to the output representation at that position.
Self-attention has a critical advantage over recurrent layers: it connects every pair of positions in a sequence with a constant number of operations (O(1) path length), whereas an RNN requires O(n) sequential steps to propagate information from one end of the sequence to the other. This makes self-attention far more effective at capturing long-range dependencies and also far more parallelizable during training.
Rather than performing a single attention computation with full-dimensional queries, keys, and values, the Transformer uses multi-head attention. The idea is to run h attention "heads" in parallel, each operating on a different learned linear projection of Q, K, and V into lower-dimensional subspaces:
head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
Each head can learn to attend to different types of relationships. For instance, one head might focus on syntactic dependencies (e.g., subject-verb agreement) while another captures semantic similarity. In the original Transformer (d_model = 512, h = 8), each head operates on d_k = d_v = 64 dimensional projections. The concatenated outputs are projected back to d_model dimensions through a final weight matrix W_O.
Multi-head attention adds no computational overhead compared to single-head attention with the full dimensionality, because the per-head dimensionality is reduced proportionally. The benefit is representational: multiple heads allow the model to jointly attend to information from different representation subspaces at different positions.
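A sketch of multi-head self-attention, building on the scaled_dot_product_attention function above. It slices the full projected Q, K, V into h chunks rather than keeping per-head weight matrices, which is equivalent to h separate lower-dimensional projections; batching is omitted for clarity.

```python
import numpy as np  # uses scaled_dot_product_attention defined above

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention without batching (shapes illustrative).

    X: (n, d_model); all weight matrices: (d_model, d_model).
    Slicing the projected Q, K, V into h chunks is equivalent to using
    h separate per-head projection matrices.
    """
    d_k = X.shape[1] // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = [scaled_dot_product_attention(Q[:, i*d_k:(i+1)*d_k],
                                          K[:, i*d_k:(i+1)*d_k],
                                          V[:, i*d_k:(i+1)*d_k])
             for i in range(h)]                   # h parallel attention heads
    return np.concatenate(heads, axis=-1) @ W_O  # concat, then final projection
```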
Cross-attention (also called encoder-decoder attention) is a variant where the queries come from one sequence and the keys and values come from a different sequence. In the original Transformer, the decoder generates queries from its own hidden states, while the keys and values come from the encoder's output. This allows each position in the decoder to attend to all positions in the encoder, enabling the model to align source and target representations.
Cross-attention is the mechanism that bridges two different modalities or sequences. In machine translation, when the decoder generates the French word "chat," cross-attention allows it to focus heavily on the encoder representation of the English word "cat." The same principle applies beyond text: in text-to-image diffusion models like Stable Diffusion, cross-attention connects the text prompt (providing keys and values from a text encoder) with the image latent representation (providing queries), enabling the generated image to reflect the textual description.
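Reusing the same function, cross-attention is simply attention where Q comes from one sequence and K, V from another. In this illustrative snippet the learned projections are omitted and the shapes are arbitrary.

```python
# Cross-attention: queries from one sequence, keys/values from another
# (learned projections omitted; reuses scaled_dot_product_attention above).
rng = np.random.default_rng(0)
text = rng.standard_normal((7, 64))      # e.g., 7 prompt-token embeddings
latents = rng.standard_normal((16, 64))  # e.g., 16 image-latent positions
out = scaled_dot_product_attention(latents, text, text)
print(out.shape)  # (16, 64): each latent position mixes text values
```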
Standard self-attention has time and memory complexity of O(n^2 * d), where n is the sequence length and d is the model dimension. The quadratic dependence on n arises because every token must compute attention scores against every other token, producing an n x n attention matrix. For short sequences (a few hundred tokens), this is manageable. But for long documents, high-resolution images, or genomic sequences that may span tens of thousands to millions of positions, the O(n^2) cost becomes a serious bottleneck.
This quadratic cost has motivated a large body of research into efficient attention variants. Approaches fall into several categories:
| Strategy | Examples | Complexity | Trade-off |
|---|---|---|---|
| Sparse attention | Longformer, BigBird, Sparse Transformer | O(n * sqrt(n)) or O(n * w) | Restricts attention to subsets of tokens |
| Sliding window | Mistral, Longformer local | O(n * w) | Only nearby tokens attend to each other |
| Linear attention | Linear Transformer, RWKV | O(n * d^2) | Approximates softmax via kernel features |
| Low-rank approximation | Linformer, Performer | O(n * k * d) | Projects keys/values to lower dimension |
| State space models | Mamba, S4, S5 | O(n * d) | Replaces attention with recurrence |
| IO-aware optimization | FlashAttention | O(n^2 * d) (exact) | Same result, fewer memory transfers |
| Distributed sequence parallelism | Ring Attention | O(n^2 * d / p) per device | Splits across p devices along sequence dim |
FlashAttention, introduced by Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Re in their 2022 paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness," addresses the quadratic memory cost of attention not by approximating the computation but by rethinking how it interacts with GPU hardware.
The key insight is that standard attention implementations are bottlenecked not by arithmetic operations but by memory transfers between GPU high-bandwidth memory (HBM) and the on-chip SRAM (the GPU's fast but small cache). Standard implementations materialize the full n x n attention matrix in HBM, which requires O(n^2) memory reads and writes.
FlashAttention avoids materializing the full attention matrix by using a tiling strategy: it splits Q, K, and V into blocks, loads each block from HBM into SRAM, computes the attention for that block in fast on-chip memory, and writes only the final output back to HBM. A carefully designed online softmax normalization algorithm allows blocks to be processed incrementally without ever needing the full attention matrix in memory.
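The following NumPy sketch shows the online-softmax math, not the CUDA kernel: it streams over KV blocks while maintaining only a running row-max, a running normalizer, and a partial output, and it reproduces the direct computation exactly.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Tiled attention with online softmax (the math behind FlashAttention).

    K and V are processed in blocks; only a running row-max m, running
    normalizer l, and partial output O are kept, so the full n x n score
    matrix is never materialized.
    """
    d = Q.shape[-1]
    O = np.zeros((Q.shape[0], V.shape[-1]))
    m = np.full(Q.shape[0], -np.inf)
    l = np.zeros(Q.shape[0])
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j+block].T / np.sqrt(d)   # scores for this KV block only
        m_new = np.maximum(m, S.max(axis=1))
        corr = np.exp(m - m_new)              # rescales earlier partial sums
        P = np.exp(S - m_new[:, None])
        O = O * corr[:, None] + P @ V[j:j+block]
        l = l * corr + P.sum(axis=1)
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 256, 32))
print(np.allclose(blockwise_attention(Q, K, V),
                  scaled_dot_product_attention(Q, K, V)))  # True: exact
```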
The result is exact attention (not an approximation) that uses O(n) memory instead of O(n^2) and achieves wall-clock speedups of 2 to 4 times over standard implementations. FlashAttention enabled a 15% end-to-end speedup on BERT-large training and a 3x speedup on GPT-2 with sequence length 1024.
FlashAttention-2 (Tri Dao, 2023) introduced further optimizations: it reduces the number of non-matmul FLOPs, parallelizes attention across the sequence-length dimension in addition to the batch and head dimensions, and partitions work between the warps of each thread block to cut shared-memory reads and writes.
These changes yielded roughly a 2x speedup over the original FlashAttention, reaching 50 to 73% of the theoretical maximum FLOPs/s on NVIDIA A100 GPUs.
FlashAttention-3 (Tri Dao and Jay Shah, 2024) targets NVIDIA Hopper architecture GPUs (H100). It exploits three hardware-specific features: the asynchrony of the Tensor Cores and the Tensor Memory Accelerator (TMA), which lets warp-specialized kernels overlap computation with data movement; interleaving of block-wise matmul and softmax operations, so that the non-matmul softmax work hides behind matrix multiplies; and hardware support for FP8 low precision, combined with block quantization and incoherent processing to control numerical error.
FlashAttention-3 achieves up to 840 TFLOPs/s in BF16 on H100 (85% utilization), roughly 1.5 to 2x faster than FlashAttention-2. In FP8 mode, it reaches 1.3 PFLOPs/s while producing 2.6x lower numerical error than a baseline FP8 attention implementation. The paper was published at NeurIPS 2024.
FlashAttention-4 (Zadouri, Shah, Hohnerbach, Liu, Thakkar, and Dao, 2026) extends the FlashAttention line to NVIDIA Blackwell GPUs (B200) and is published at MLSys 2026. The central challenge it addresses is asymmetric hardware scaling: from H100 to B200, BF16 tensor core throughput grows from 1 to 2.25 PFLOPs/s, while special function units for computing exponentials and shared memory bandwidth remain static. FlashAttention-4 overcomes this bottleneck through several innovations, among them a software implementation of the exponential function that relieves pressure on the special function units and a ping-pong scheduling scheme that keeps the tensor cores occupied (see the version comparison table below).
Written in CuTe-DSL (CUTLASS's Python kernel DSL), FlashAttention-4 achieves up to 1,605 TFLOPs/s on B200 in BF16 (71% utilization), which is 1.3x faster than cuDNN 9.13 and 2.7x faster than Triton.
| Version | Year | Target GPU | Peak throughput | Utilization | Key innovation |
|---|---|---|---|---|---|
| FlashAttention | 2022 | A100 | N/A | ~35% | IO-aware tiling |
| FlashAttention-2 | 2023 | A100 | N/A | 50-73% | Better parallelism, warp partitioning |
| FlashAttention-3 | 2024 | H100 | 840 TFLOPs/s (BF16) | ~85% | Async execution, FP8, warp specialization |
| FlashAttention-4 | 2026 | B200 | 1,605 TFLOPs/s (BF16) | ~71% | Software exponential, ping-pong scheduling |
Multi-Query Attention was proposed by Noam Shazeer in his 2019 paper "Fast Transformer Decoding: One Write-Head Is All You Need." The central observation is that during autoregressive inference, the primary performance bottleneck on modern accelerators is the memory bandwidth required to load the key-value (KV) cache, not the arithmetic computation itself.
MQA addresses this by having all query heads share a single set of key and value projections. Instead of h independent key and value heads (as in standard multi-head attention), there is just one key head and one value head. Each query head still has its own projection, so the model retains h different query perspectives, but the KV cache is reduced by a factor of h.
In practice, MQA speeds up inference decoding substantially with only a small quality degradation. It was adopted in models such as PaLM (Google, 2022) and Falcon (TII, 2023).
Grouped-Query Attention, introduced by Ainslie et al. (2023), is a compromise between standard multi-head attention (MHA) and multi-query attention (MQA). Instead of sharing a single KV head across all query heads (MQA) or having unique KV heads for every query head (MHA), GQA divides the query heads into g groups, where each group shares one set of key and value projections.
| Variant | Query heads | KV heads | KV cache size |
|---|---|---|---|
| Multi-Head Attention (MHA) | h | h | h * d_k * 2 * n |
| Multi-Query Attention (MQA) | h | 1 | 1 * d_k * 2 * n |
| Grouped-Query Attention (GQA) | h | g | g * d_k * 2 * n |
When g = h, GQA reduces to MHA. When g = 1, GQA reduces to MQA. By choosing an intermediate g, GQA achieves most of the inference speed benefits of MQA while maintaining quality closer to MHA. Meta adopted GQA with 8 KV groups in Llama 2 70B (2023), and it has since become the default attention variant in most production large language models.
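A sketch of GQA under assumed per-head input shapes, reusing the scaled_dot_product_attention function from earlier; setting g = h or g = 1 recovers MHA and MQA, respectively.

```python
import numpy as np  # reuses scaled_dot_product_attention from above

def grouped_query_attention(Q, K, V, g):
    """GQA with h query heads sharing g KV heads (shapes illustrative).

    Q: (n, h, d_k); K, V: (n, g, d_k). g = h gives MHA, g = 1 gives MQA.
    """
    n, h, d_k = Q.shape
    group_size = h // g
    heads = [scaled_dot_product_attention(Q[:, i],
                                          K[:, i // group_size],
                                          V[:, i // group_size])
             for i in range(h)]            # each head reads its group's KV
    return np.concatenate(heads, axis=-1)  # (n, h * d_k), before output projection
```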
Multi-Head Latent Attention, introduced in DeepSeek-V2 (2024), takes a fundamentally different approach to KV cache reduction. Rather than reducing the number of KV heads (as in MQA and GQA), MLA compresses the key and value representations into a low-dimensional latent vector before caching. At inference time, the compressed representation is projected back to produce unique keys and values for each head.
The mechanism works as follows. Given an input token x_n, MLA first compresses it into a latent representation:
c^{KV}_n = W^{DKV} * x_n
where W^{DKV} is a down-projection matrix that maps the model dimension d to a much smaller latent dimension d_c. This compact vector c^{KV} is stored in the KV cache instead of the full key and value vectors. When attention is computed, separate up-projection matrices W^{UK} and W^{UV} reconstruct unique keys and values for each head:
K = W^{UK} * C^{KV}, V = W^{UV} * C^{KV}
A key challenge is compatibility with rotary positional embeddings (RoPE). Standard RoPE entangles positional information with content information, which would prevent the "absorption trick" that lets MLA fold the up-projection matrices into the query projection and avoid actually decompressing the KV cache during inference. DeepSeek solved this with decoupled RoPE: separate query and key vectors are introduced specifically for positional encoding, keeping the main latent keys isolated from rotation matrices.
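The compression path can be sketched as follows. Names such as W_DKV mirror the formulas above, but the dimensions are illustrative, and both decoupled RoPE and the inference-time absorption trick are omitted.

```python
import numpy as np

def mla_compress_and_expand(x, W_DKV, W_UK, W_UV):
    """Multi-head latent attention, KV-compression side only (a sketch).

    x: (n, d_model) token representations.
    W_DKV: (d_c, d_model) down-projection; only C (n, d_c) is cached.
    W_UK, W_UV: (h * d_k, d_c) up-projections applied at attention time.
    """
    C = x @ W_DKV.T  # latent cache: n * d_c values vs. n * 2 * h * d_k
    K = C @ W_UK.T   # (n, h * d_k): unique per-head keys, reshaped by caller
    V = C @ W_UV.T   # (n, h * d_k): unique per-head values
    return C, K, V
```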
In DeepSeek-V2, MLA achieved a 93.3% reduction in KV cache size compared to standard MHA while maintaining (and sometimes exceeding) model quality, and it increased maximum generation throughput by 5.76 times. DeepSeek-V3 uses d_h = 128, H = 128, and d_c = 512, giving a compression ratio of 32. MLA was also used in DeepSeek-V3 and DeepSeek-R1, and subsequent research (TransMLA, 2025) has explored enabling MLA in any Transformer-based LLM.
Differential attention, introduced by Ye et al. at Microsoft Research and Tsinghua University in their 2024 paper "Differential Transformer" (ICLR 2025 Oral), rethinks how attention scores are computed. The core idea is to calculate attention as the difference between two separate softmax attention maps, which cancels out shared noise and promotes sparser, more focused attention patterns.
The mechanism partitions the query and key projections into two groups and computes two independent softmax distributions:
DiffAttn(X) = (softmax(Q_1 * K_1^T / sqrt(d)) - lambda * softmax(Q_2 * K_2^T / sqrt(d))) * V
where [Q_1; Q_2] = X * W^Q and [K_1; K_2] = X * W^K. The scalar lambda is a learnable parameter re-parameterized as lambda = exp(lambda_q1 * lambda_k1) - exp(lambda_q2 * lambda_k2) + lambda_init, with lambda_init set by layer depth (about 0.2 at the first layer, increasing toward 0.8 in deeper layers).
The subtraction acts as noise cancellation. In standard attention, many tokens receive small but non-negligible attention weight, diluting the signal. Differential attention subtracts these common noise patterns, causing attention to concentrate on genuinely relevant tokens. The authors demonstrated that at a 64K context length, Diff Transformer allocated 0.27 to 0.40 normalized attention to answer-relevant spans in retrieval tasks, compared to just 0.03 for a standard Transformer.
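A single-head sketch of the published formula, with lambda passed as a plain scalar rather than the learned re-parameterization, and shapes kept illustrative.

```python
import numpy as np

def diff_attention(X, W_Q, W_K, W_V, lam):
    """Differential attention, single head (a sketch of the formula above).

    The projected queries and keys are split in half into (Q1, Q2) and
    (K1, K2); the output is the difference of two softmax attention maps
    applied to the same values.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d = Q.shape[-1] // 2
    Q1, Q2 = Q[:, :d], Q[:, d:]
    K1, K2 = K[:, :d], K[:, d:]
    def sm(S):  # row-wise softmax
        e = np.exp(S - S.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)
    A1 = sm(Q1 @ K1.T / np.sqrt(d))
    A2 = sm(Q2 @ K2.T / np.sqrt(d))
    return (A1 - lam * A2) @ V  # shared noise in A1 and A2 cancels out
```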
Experimental results across model sizes from 830M to 13.1B parameters showed consistent improvements. A 6.8B Diff Transformer matched the validation loss of an 11B standard Transformer, requiring only 62.2% of the parameters. On downstream tasks, the mechanism improved hallucination detection accuracy (53% vs. 44% on XSum summarization), in-context learning performance (21.6% gain on 150-class classification), and key information retrieval from long contexts.
Native Sparse Attention (NSA), introduced by DeepSeek in February 2025, is a hardware-aligned sparse attention mechanism designed to be natively trainable end-to-end. Unlike earlier sparse attention methods that used fixed or heuristic patterns, NSA learns which tokens to attend to during pretraining and is specifically designed to exploit modern GPU memory hierarchies. The paper won a Best Paper award at ACL 2025.
NSA processes input sequences through three parallel attention branches that are combined via learned gating scores: a compression branch that aggregates consecutive tokens into coarse-grained block summaries, a selection branch that retains only the most relevant fine-grained token blocks, and a sliding window branch that covers local context.
The kernel design achieves near-optimal arithmetic intensity through group-centric query loading, shared KV fetching across heads, and Triton-based grid scheduling. On 64K-length sequences, NSA achieves 9.0x forward speedup, 6.0x backward speedup, and 11.6x decoding speedup over full attention. Despite this efficiency, NSA matched or exceeded full attention quality on general benchmarks (average score 0.456 vs. 0.443), long-context tasks (LongBench 0.469 vs. 0.437), and chain-of-thought reasoning.
Sliding window attention restricts each token's attention to a fixed local window of w neighboring tokens rather than the full sequence. This reduces the per-layer complexity from O(n^2) to O(n * w), where w is typically much smaller than n.
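The window itself is just a banded causal mask. The sketch below builds it in NumPy, in a form usable as the mask argument of the scaled_dot_product_attention sketch earlier in this article.

```python
import numpy as np

def sliding_window_mask(n, w):
    """Boolean mask where token i may attend to tokens (i - w + 1) .. i.

    Combines causality with a local window of size w.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - w)

print(sliding_window_mask(5, 2).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [0 1 1 0 0]
#  [0 0 1 1 0]
#  [0 0 0 1 1]]
```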
Mistral 7B (Mistral AI, 2023) popularized sliding window attention with a window size of 4,096 tokens. A crucial insight is that stacking multiple layers of sliding window attention extends the effective receptive field: at layer k, a token can indirectly attend to information from up to k * w positions away. With 32 layers and w = 4,096, Mistral's theoretical attention span reaches approximately 131,000 tokens. Mistral also uses a rolling buffer KV cache limited to w entries, which halves cache memory requirements for sequence lengths of 8,192 compared to a full cache.
More recent models have adopted hybrid architectures that interleave sliding window layers with full (global) attention layers. Gemma 2 (Google, 2024) used a 1:1 ratio of local and global attention layers with a 4,096-token window. Gemma 3 (Google, 2025) shifted to a 5:1 ratio (five local layers for every one global layer) and reduced the window to just 1,024 tokens. This design slashes attention computation by roughly 5x and trims KV cache memory from about 60% of total memory to approximately 15%, while still supporting 128K context lengths through RoPE frequency rescaling on the global layers.
Ring Attention (Liu, Zaharia, and Abbeel, ICLR 2024) is a distributed sequence parallelism technique that enables processing of extremely long sequences by splitting them across multiple devices arranged in a ring topology. Rather than requiring each device to hold the full KV pairs for the entire sequence, Ring Attention distributes the sequence into blocks, with each device responsible for one block's queries.
The mechanism works by overlapping communication and computation. Each device computes blockwise attention between its local query block and a visiting KV block, while simultaneously sending that KV block to the next device in the ring and receiving the next KV block from the previous device. Because block computation takes longer than block transfers, the communication is fully hidden, adding no overhead compared to standard attention.
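The following single-process simulation sketches the ring schedule; the overlap of communication and computation is not modeled. Each simulated device updates its running online-softmax state (the same update as in the blockwise_attention sketch above) against the visiting KV block, after which the blocks rotate one position.

```python
import numpy as np  # reuses scaled_dot_product_attention from above

def ring_attention(Q_blocks, K_blocks, V_blocks):
    """Single-process simulation of Ring Attention across p 'devices'."""
    p = len(Q_blocks)
    d = Q_blocks[0].shape[-1]
    states = [[np.zeros_like(Qp, dtype=float),       # partial output O
               np.full(Qp.shape[0], -np.inf),        # running row-max m
               np.zeros(Qp.shape[0])]                # running normalizer l
              for Qp in Q_blocks]
    kv = list(zip(K_blocks, V_blocks))
    for _ in range(p):                    # p rotations visit every KV block
        for dev in range(p):
            Kb, Vb = kv[dev]
            O, m, l = states[dev]
            S = Q_blocks[dev] @ Kb.T / np.sqrt(d)
            m_new = np.maximum(m, S.max(axis=1))
            corr = np.exp(m - m_new)
            P = np.exp(S - m_new[:, None])
            states[dev] = [O * corr[:, None] + P @ Vb,
                           m_new,
                           l * corr + P.sum(axis=1)]
        kv = kv[1:] + kv[:1]              # rotate KV blocks around the ring
    return np.concatenate([O / l[:, None] for O, m, l in states])

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 128, 16))
Qb, Kb, Vb = (np.split(a, 4) for a in (Q, K, V))  # 4 simulated devices
print(np.allclose(ring_attention(Qb, Kb, Vb),
                  scaled_dot_product_attention(Q, K, V)))  # True
```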
Ring Attention enables training and inference on sequences that are up to p times longer than what a single device can handle, where p is the number of devices. On 32 A100 GPUs, a 7B model can process over 1 million tokens (a 32x improvement), and on TPUv4-1024, a 3B model can train with 16 million tokens (a 512x increase over prior methods).
The Longformer (Beltagy, Peters, and Cohan, 2020) combines local sliding window attention with global attention on a small number of designated tokens. The local window handles nearby context efficiently, while the global tokens (for example, the [CLS] token in classification tasks) can attend to and be attended by all positions in the sequence. The paper also introduced dilated sliding window attention, which increases the receptive field by attending to every k-th token within the window rather than consecutive tokens. Longformer scales linearly with sequence length and was pretrained to handle up to 4,096 tokens.
BigBird (Zaheer et al., 2020, NeurIPS) extends the Longformer approach by adding random attention to the combination of local and global attention. Each token attends to r randomly selected tokens in addition to its local window and the global tokens. The authors proved theoretically that BigBird's sparse attention pattern is a universal approximator of sequence functions and is Turing complete, preserving the theoretical expressiveness of full attention. BigBird handles sequences up to 8x longer than what standard Transformers can process on the same hardware.
Katharopoulos et al. (2020) proposed linear attention in their paper "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." The core idea is to replace the softmax in standard attention with a decomposable kernel function. Standard attention computes:
Attention(Q, K, V) = softmax(Q * K^T) * V
Linear attention rewrites this using feature maps phi:
LinearAttention(Q, K, V) = (phi(Q) * (phi(K)^T * V)) / (phi(Q) * sum(phi(K)^T))
By applying the associative property of matrix multiplication, the computation avoids materializing the n x n attention matrix entirely. The term phi(K)^T * V produces a d x d matrix (independent of sequence length n), reducing the complexity from O(n^2 * d) to O(n * d^2). For typical model dimensions where d is much smaller than n, this is a significant improvement.
Katharopoulos et al. used phi(x) = elu(x) + 1 as the feature map. An added benefit is that this formulation admits a recurrent implementation, drawing a direct connection between Transformers and RNNs and enabling efficient autoregressive generation.
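A sketch of linear attention with that feature map; note that the n x n attention matrix never appears, only the d x d term phi(K)^T V.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Linear attention with the elu(x) + 1 feature map (a sketch).

    Associativity lets us precompute phi(K)^T V, a (d, d_v) matrix
    independent of sequence length, giving O(n * d^2) total cost.
    """
    def phi(x):  # elu(x) + 1, elementwise and strictly positive
        return np.where(x > 0, x + 1.0, np.exp(x))
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                # (d, d_v), independent of n
    Z = Qf @ Kf.sum(axis=0)      # (n,) normalizer: phi(q_i) . sum_j phi(k_j)
    return (Qf @ KV) / Z[:, None]
```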
Structured state space models (SSMs) represent a fundamentally different approach to sequence modeling that avoids the attention mechanism altogether. Mamba (Gu and Dao, 2023) is the most prominent example. While traditional SSMs use fixed parameters, Mamba makes the state transition matrices input-dependent ("selective"), allowing the model to dynamically decide what information to retain and what to discard, analogous to the gating mechanisms in LSTMs.
Mamba achieves linear-time complexity O(n) in sequence length, enjoys 5x higher inference throughput than Transformers, and scales to million-length sequences. The Mamba-3B model matches or outperforms Transformers of twice its size on language modeling benchmarks. However, recent research suggests that SSMs and attention have complementary strengths: attention excels at precise recall from context ("needle in a haystack" tasks), while SSMs excel at compression and efficiency for long sequences. Hybrid architectures like Jamba (AI21 Labs, 2024) combine both, merging Transformer, Mamba, and Mixture-of-Experts layers to achieve performance comparable to Llama 2 70B with 2 to 7x longer context windows and 3x higher throughput.
During autoregressive generation (producing one token at a time), a language model must compute attention over all previously generated tokens. Without optimization, this means recomputing the key and value projections for every past token at every generation step, leading to redundant computation that grows quadratically with sequence length.
The KV cache solves this by storing the key and value vectors from all previous time steps. At each new generation step, only the key and value for the new token need to be computed and appended to the cache. The query for the new token then attends over all cached keys and values. This eliminates the redundant projections (only the new token is projected, rather than all n past tokens), though the attention computation itself still requires O(n * d) work per step.
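A single decoding step with a cache might look like the following single-head sketch; weights, dimensions, and the decode_step name are illustrative.

```python
import numpy as np

def decode_step(x_new, W_K, W_V, W_Q, cache_K, cache_V):
    """One autoregressive step with a KV cache (single head, a sketch).

    Only the new token's projections are computed; past keys and values
    come from the cache, which grows by one entry per generated token.
    """
    k = x_new @ W_K                    # (d,) projections for the new token only
    v = x_new @ W_V
    q = x_new @ W_Q
    cache_K = np.vstack([cache_K, k])  # append to the cache
    cache_V = np.vstack([cache_V, v])
    scores = cache_K @ q / np.sqrt(len(q))
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ cache_V, cache_K, cache_V

d = 16
rng = np.random.default_rng(0)
W_K, W_V, W_Q = rng.standard_normal((3, d, d))
cache_K, cache_V = np.zeros((0, d)), np.zeros((0, d))
for x in rng.standard_normal((5, d)):  # five decode steps
    out, cache_K, cache_V = decode_step(x, W_K, W_V, W_Q, cache_K, cache_V)
print(cache_K.shape)  # (5, 16): one cached key per generated token
```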
The main challenge is memory: the KV cache grows linearly with sequence length, model width, and batch size. For a large model like Llama 2 70B with a context window of 4,096 tokens, the KV cache alone can consume tens of gigabytes of GPU memory. Several strategies address this:
| Strategy | Mechanism | Typical reduction |
|---|---|---|
| Multi-query / grouped-query attention | Reduce number of KV heads | KV cache reduced by factor of h (MQA) or h/g (GQA) |
| Multi-head latent attention (MLA) | Compress KV into low-rank latent vector | 93.3% cache reduction (DeepSeek-V2) |
| KV cache quantization | Store cached keys/values in FP8 or INT4 | 2-4x memory reduction |
| PagedAttention | Virtual-memory-style non-contiguous cache blocks | Waste reduced from 60-80% to under 4% |
| Sliding window caches | Limit cache to fixed window size | Bounded memory regardless of sequence length |
| Token eviction and compression | Selectively remove or merge less important cached tokens | Variable, task-dependent |
PagedAttention (Kwon et al., 2023), used in the vLLM serving framework, deserves special mention. It borrows ideas from operating system virtual memory to manage cache memory in non-contiguous blocks, reducing fragmentation. Standard implementations waste 60 to 80% of KV cache memory due to fragmentation; PagedAttention reduces waste to under 4% and improves serving throughput by 2 to 4x.
The Vision Transformer (ViT), introduced by Dosovitskiy et al. in their 2020 paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (ICLR 2021), demonstrated that a pure Transformer applied directly to sequences of image patches can match or exceed the performance of state-of-the-art convolutional neural networks (CNNs) on image classification.
ViT works by splitting an image into fixed-size patches (typically 16x16 pixels), flattening each patch into a vector, projecting it to the model dimension through a linear embedding, prepending a learnable [CLS] token, and adding positional embeddings. The resulting sequence of patch embeddings is then processed by a standard Transformer encoder with multi-head self-attention.
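The patch-extraction step is simple array reshaping. The sketch below produces the 196-token sequence for a 224 x 224 RGB image with 16 x 16 patches; the learned linear embedding, [CLS] token, and positional embeddings would follow.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns (num_patches, patch * patch * C): the token sequence a ViT
    embeds linearly before prepending [CLS] and adding positions.
    """
    H, W, C = image.shape
    gh, gw = H // patch, W // patch
    x = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

tokens = patchify(np.zeros((224, 224, 3)))
print(tokens.shape)  # (196, 768): 14 x 14 patches of 16 x 16 x 3 values
```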
Self-attention in ViT allows every patch to attend to every other patch, enabling the model to capture global relationships across the entire image from the very first layer. In contrast, CNNs build global understanding only gradually through stacked local convolution layers. ViT attention maps can be visualized as heatmaps showing which image regions each patch attends to most strongly, revealing that the model often learns to focus on semantically meaningful regions (e.g., the face of an animal, the outline of an object).
ViT has since spawned many variants, including DeiT (Data-efficient Image Transformers, Touvron et al., 2021), the Swin Transformer (Liu et al., 2021), which uses shifted window attention for efficiency, and BEiT (Bao et al., 2021).
Modern text-to-image diffusion models like Stable Diffusion, DALL-E, and Imagen rely heavily on attention mechanisms within their denoising U-Net or Transformer backbone.
Two types of attention are used:
Self-attention operates within the image latent representations at multiple spatial resolutions (e.g., 64x64, 32x32, 16x16). It allows the model to capture global spatial coherence, ensuring that distant parts of the image remain consistent (e.g., the lighting direction is the same on both sides of a scene). Self-attention in diffusion models primarily preserves geometric and structural details.
Cross-attention connects the text prompt to the image generation process. Text embeddings (produced by a text encoder such as CLIP or T5) provide the keys and values, while the image latent features provide the queries. This mechanism controls which regions of the image correspond to which words in the prompt. For example, cross-attention ensures that the concept "red" is spatially aligned with "apple" rather than with "table" when generating an image from the prompt "a red apple on a table."
Researchers have leveraged attention maps in diffusion models for various applications, including prompt-to-prompt image editing (manipulating cross-attention maps to edit specific objects), attention-based layout control, and interpretability analysis.
The following table summarizes major attention variants, their key characteristics, and the models that adopt them.
| Variant | Year | Authors | Key idea | Complexity | Notable models |
|---|---|---|---|---|---|
| Additive (Bahdanau) | 2014 | Bahdanau, Cho, Bengio | Learned alignment via feedforward network | O(n * m) | Early NMT systems |
| Multiplicative (Luong) | 2015 | Luong, Pham, Manning | Dot-product and general scoring functions | O(n * m) | Early NMT systems |
| Scaled dot-product | 2017 | Vaswani et al. | Q, K, V formulation with sqrt(d_k) scaling | O(n^2 * d) | Transformer, BERT, GPT |
| Multi-head attention | 2017 | Vaswani et al. | Parallel attention in h subspaces | O(n^2 * d) | All Transformers |
| Sparse (Longformer) | 2020 | Beltagy et al. | Local + global + dilated windows | O(n * w) | Longformer |
| Sparse (BigBird) | 2020 | Zaheer et al. | Local + global + random | O(n * (w + r + g)) | BigBird |
| Linear attention | 2020 | Katharopoulos et al. | Kernel feature map, avoid n x n matrix | O(n * d^2) | Linear Transformer |
| Multi-query (MQA) | 2019 | Shazeer | Single shared KV head | O(n^2 * d) | PaLM, Falcon |
| Grouped-query (GQA) | 2023 | Ainslie et al. | g shared KV groups | O(n^2 * d) | Llama 2, Mistral |
| FlashAttention | 2022 | Dao et al. | IO-aware tiling, exact, O(n) memory | O(n^2 * d) compute, O(n) memory | Widely adopted |
| FlashAttention-2 | 2023 | Dao | Better parallelism and warp partitioning | O(n^2 * d) compute, O(n) memory | Widely adopted |
| FlashAttention-3 | 2024 | Dao, Shah | Hopper async, FP8, warp specialization | O(n^2 * d) compute, O(n) memory | H100 deployments |
| FlashAttention-4 | 2026 | Zadouri, Shah, Dao et al. | Blackwell pipelining, software exponential | O(n^2 * d) compute, O(n) memory | B200 deployments |
| Sliding window | 2023 | Mistral AI | Fixed local window with rolling KV cache | O(n * w) | Mistral 7B, Gemma 3 |
| Multi-head latent (MLA) | 2024 | DeepSeek | Latent compression of KV cache | O(n^2 * d) | DeepSeek-V2, V3, R1 |
| Differential attention | 2024 | Ye et al. (Microsoft) | Difference of two softmax maps | O(n^2 * d) | Diff Transformer |
| Native sparse (NSA) | 2025 | DeepSeek | Trained sparse: compress + select + slide | O(n * (n/r + k + w)) | DeepSeek (research) |
| Ring Attention | 2024 | Liu, Zaharia, Abbeel | Distributed blockwise ring topology | O(n^2 * d / p) per device | Long-context training |
| Selective SSM (Mamba) | 2023 | Gu, Dao | Input-dependent state space model | O(n * d) | Mamba, Jamba |
Attention mechanisms have been adopted across virtually every domain of machine learning: natural language processing, computer vision, speech recognition and synthesis, text-to-image generation, reinforcement learning, recommender systems, and computational biology, where AlphaFold applies attention to protein structure prediction.
One practical advantage of attention mechanisms is that attention weights can be inspected to gain insight into what the model is focusing on. Attention maps can be visualized as heatmaps, where brighter entries indicate stronger attention between two positions.
However, researchers have cautioned against over-interpreting attention weights. Jain and Wallace (2019) showed that attention weights often do not correlate well with other measures of feature importance and that alternative attention distributions can produce identical predictions. Attention weights indicate how information flows through the network but do not necessarily indicate which inputs are causally important for the output. More rigorous interpretability methods, such as probing classifiers and gradient-based attribution, are typically needed to draw reliable conclusions.
Imagine you are in a classroom and the teacher asks a question. You look around the room for clues. Some classmates are whispering the answer, some are drawing on their notebooks, and some are looking out the window. Attention is like choosing to listen more closely to the classmates who seem to know the answer and ignoring the ones looking out the window. In machine learning, attention lets the computer do something similar: it decides which pieces of information are most helpful for the task at hand and pays more attention to those, while downplaying the rest.