Self-attention is a core mechanism in modern deep learning that allows a neural network to relate different positions within a single input sequence to compute a richer representation of that sequence. Unlike traditional attention mechanisms where queries attend to a separate context (as in sequence-to-sequence models), self-attention derives its queries, keys, and values entirely from the same input. Introduced as "intra-attention" in earlier work and formalized as a central building block in the transformer architecture by Vaswani et al. (2017), self-attention has become the dominant mechanism powering large language models, vision transformers, and a wide range of other architectures across natural language processing, computer vision, and beyond [1].
At its core, self-attention answers a simple question: for each element in a sequence, which other elements in the same sequence are most relevant to it? Consider the sentence "The cat sat on the mat because it was tired." To understand what "it" refers to, a model needs to relate the pronoun back to "cat" earlier in the sentence. Self-attention enables this by computing a weighted sum over all positions in the sequence, where the weights reflect the relevance of each position to the current one.
The term "self" distinguishes this mechanism from cross-attention, where queries come from one sequence and keys/values come from a different sequence (for example, an encoder-decoder setup). In self-attention, all three components originate from the same input, making it a form of intra-sequence reasoning. The mechanism was originally referred to as "intra-attention" in earlier work on summarization and reading comprehension, and Vaswani et al. (2017) elevated it to the centerpiece of the transformer architecture by showing that an entire sequence model could be built from self-attention layers without any recurrence or convolution [1].
A helpful mental model is that of a soft, learnable lookup. Each token issues a query that asks "what kind of information do I need from the rest of the sequence?" Every other token offers a key that advertises "this is the kind of information I carry," and a value that contains the actual information to be passed along. The dot product between the query and each key produces a relevance score, and the value vectors are then averaged according to these scores after a softmax normalization. The whole operation is differentiable, so the projections that produce queries, keys, and values are learned end to end with the rest of the network.
Given an input sequence of n tokens, each represented as a d-dimensional vector, the input can be organized as a matrix X of shape (n, d). Self-attention transforms this input through three learned linear projections to produce queries, keys, and values:

Q = X W_Q,    K = X W_K,    V = X W_V
Here, W_Q, W_K, and W_V are learned weight matrices. W_Q and W_K have shape (d, d_k), and W_V has shape (d, d_v), where d_k is the key/query dimension and d_v is the value dimension. In the original transformer, d_k = d_v = d_model / h, where h is the number of attention heads.
The attention output is then computed using scaled dot-product attention:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Breaking this down step by step:

1. Compute the raw scores Q K^T, an (n, n) matrix whose entry (i, j) is the dot product between query i and key j.
2. Scale the scores by 1/sqrt(d_k) to keep their variance under control (see below).
3. Apply softmax along each row, turning the scores into attention weights that are non-negative and sum to 1.
4. Multiply the weight matrix by V, so each output position is a weighted average of all value vectors.
The result is a new sequence of n vectors, each of which incorporates information from all other positions in the input, weighted by learned relevance.
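To make the computation concrete, here is a minimal NumPy sketch of a single self-attention head. The shapes, random weights, and function names are illustrative assumptions, not a reference implementation:

```python
# A minimal sketch of single-head self-attention (illustrative shapes
# and random stand-in weights, not learned parameters).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (n, d) token matrix; W_Q/W_K: (d, d_k); W_V: (d, d_v)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # learned projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) relevance scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # (n, d_v) contextual output

rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8
X = rng.standard_normal((n, d))
out = self_attention(X,
                     rng.standard_normal((d, d_k)),
                     rng.standard_normal((d, d_k)),
                     rng.standard_normal((d, d_k)))
print(out.shape)  # (5, 8)
```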
The scaling factor 1/sqrt(d_k) is one of the most discussed details of the transformer paper. Vaswani et al. argued that as d_k grows, the dot product q · k becomes a sum of d_k random products. If the components of q and k are independent random variables with mean 0 and variance 1, then the dot product has mean 0 and variance d_k. Without scaling, the magnitudes of the scores grow with sqrt(d_k), and large scores push softmax into a regime where one entry dominates and the gradient with respect to all other entries is essentially zero [1]. Dividing by sqrt(d_k) keeps the variance of the pre-softmax scores roughly constant regardless of head dimension, which keeps gradients usable and lets the model train stably even when d_k is large.
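The variance argument is easy to verify numerically. The sketch below (sample sizes are arbitrary) draws unit-variance queries and keys and compares the variance of raw and scaled dot products:

```python
# Numerical check of the scaling argument: with unit-variance q and k,
# the raw dot product has variance ~ d_k; dividing by sqrt(d_k)
# brings it back to ~1 regardless of head dimension.
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.standard_normal((100_000, d_k))
    k = rng.standard_normal((100_000, d_k))
    raw = (q * k).sum(axis=1)
    print(d_k, raw.var().round(1), (raw / np.sqrt(d_k)).var().round(2))
# variance of the raw score grows linearly with d_k; scaled stays ~1.0
```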
This seemingly minor detail matters in practice. The earlier additive attention of Bahdanau et al. (2014) avoided the issue by using a small feed-forward network to score query-key pairs, which is computationally heavier but does not suffer from the same scaling problem [2]. Scaled dot-product attention combines the speed of dot products with the numerical stability of additive attention, which is one reason it became the default in modern architectures.
The dominant cost of self-attention is the matrix multiplication Q K^T, which produces the (n, n) attention score matrix. For a sequence of length n with head dimension d_k, this operation requires O(n^2 d_k) floating-point operations. The subsequent multiplication with V adds another O(n^2 d_v). Overall, self-attention has:

- Time complexity O(n^2 d), dominated by the two matrix products involving the (n, n) score matrix.
- Memory complexity O(n^2) for the attention score matrix, on top of O(n d) for the projections.
The quadratic scaling in sequence length n is the primary bottleneck of self-attention. Doubling the sequence length quadruples both computation and memory. For a 2,048-token sequence, the attention matrix has about 4 million entries; for a 128K-token sequence, it balloons to over 16 billion entries. In a multi-head setting with h heads, the total cost is multiplied by h, although in practice each head has dimension d_model / h so the per-layer parameter count and FLOP count remain comparable to a single full-dimensional head.
This quadratic cost has driven a rich body of research into efficient alternatives, covered in the efficiency section below.
Self-attention solved several fundamental problems that plagued earlier sequence modeling approaches.
Recurrent neural networks (RNNs) process sequences step by step, meaning information from early tokens must pass through many sequential operations to influence later tokens. In practice, this makes it difficult for RNNs to learn dependencies spanning dozens or hundreds of tokens, even with gating mechanisms like LSTM. Self-attention computes direct pairwise interactions between all positions in a single operation, so the path length between any two tokens is O(1) rather than O(n) [1].
Because self-attention computes all pairwise interactions simultaneously, the entire operation can be parallelized efficiently on modern GPUs. RNNs, by contrast, require sequential computation: the hidden state at time step t depends on the hidden state at time step t-1. This sequential bottleneck made RNN training much slower on parallel hardware. Convolutional sequence models such as ConvS2S avoid the bottleneck but reach distant tokens only through stacks of layers, which still bounds path length by the receptive field.
The attention weight matrix provides a form of soft alignment that can be inspected. While attention weights should not be treated as definitive explanations of model behavior (see the interpretability section below), they do offer a useful diagnostic tool for understanding which tokens the model considers relevant to each other.
The following table compares self-attention to previous sequence modeling approaches, using the analysis from Vaswani et al. (2017):
| Property | Self-attention | RNN | CNN |
|---|---|---|---|
| Maximum path length | O(1) | O(n) | O(log_k n) |
| Computation per layer | O(n^2 d) | O(n d^2) | O(k n d^2) |
| Parallelizable | Yes | No | Yes |
| Long-range dependencies | Direct | Difficult | Moderate |
| Sequential operations | O(1) | O(n) | O(1) |
The transformer architecture employs two distinct forms of attention: self-attention and cross-attention. Understanding the difference is essential for grasping how encoder-decoder models work.
In self-attention, queries, keys, and values all come from the same sequence. Both the encoder and decoder in the original transformer use self-attention layers. The encoder's self-attention lets each token in the input attend to all other input tokens. The decoder's self-attention (which is masked; see below) lets each token in the output attend to previously generated output tokens.
In cross-attention (also called encoder-decoder attention), queries come from the decoder while keys and values come from the encoder's output. This is the mechanism by which the decoder "reads" the encoded input. For example, in machine translation, cross-attention allows each word being generated in the target language to attend to all words in the source language sentence.
| Aspect | Self-attention | Cross-attention |
|---|---|---|
| Source of Q | Same sequence | Decoder |
| Source of K, V | Same sequence | Encoder output |
| Purpose | Model intra-sequence relationships | Model inter-sequence relationships |
| Found in | Encoder layers, decoder layers | Decoder layers only (in encoder-decoder models) |
| Example use | Contextual word representation | Aligning translation output to input |
Decoder-only models such as GPT and LLaMA use only self-attention with causal masking, since they do not have a separate encoder. Encoder-decoder models like T5 and the original transformer use both self-attention and cross-attention. Cross-attention is also central to many multi-modal models, where a text decoder cross-attends to image features produced by a vision encoder.
In autoregressive language modeling, a model generates tokens one at a time, left to right. During training, the model processes an entire sequence at once for efficiency, but each position should only be able to attend to positions at or before itself. Allowing a position to attend to future tokens would let the model "cheat" by looking at the answer it is supposed to predict.
Masked self-attention (also called causal attention) enforces this constraint by applying a mask to the attention score matrix before the softmax step. Specifically, all entries above the diagonal of the (n, n) score matrix are set to negative infinity, so that after softmax they become zero. This means position i can only attend to positions 1 through i.
The masked attention formula is:
Attention(Q, K, V) = softmax(mask(Q K^T / sqrt(d_k))) V
where mask(.) sets future positions to negative infinity.
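A sketch of how the mask is typically realized in code, with illustrative shapes: entries above the diagonal are set to -inf so the softmax assigns them zero weight:

```python
# Causal (masked) self-attention sketch: positions above the diagonal
# of the score matrix are set to -inf before the softmax.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)          # future -> -inf
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((4, 8))
out = causal_attention(Q, K, V)
# row i of the post-softmax weight matrix is zero for all columns j > i
```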
Causal masking is used in all decoder-only models, including the GPT family, LLaMA, Mistral, and other autoregressive LLMs. During inference, when tokens are generated one at a time, the causal property is naturally satisfied since future tokens do not yet exist. The mask is primarily needed during training when full sequences are processed in parallel, and at prefill time when the prompt is processed in one pass.
A practical consequence of causal masking is that the attention score matrix is lower triangular, which in principle halves the attention FLOPs, since only the lower triangle of Q K^T ever contributes to the output. Most production kernels, including FlashAttention, exploit this triangular structure to skip the upper triangle entirely, which roughly doubles speed in the long-context regime.
A single self-attention operation can only capture one type of relationship at a time. Multi-head attention, introduced alongside self-attention in the original transformer paper, addresses this by running multiple self-attention operations in parallel, each with its own learned projections [1].
Given h attention heads, the input X is projected into h separate sets of queries, keys, and values using different weight matrices. Each head computes attention independently:
head_i = Attention(X W_Q^i, X W_K^i, X W_V^i)
The outputs of all heads are concatenated and projected through a final linear layer:
MultiHead(X) = Concat(head_1, ..., head_h) W_O
where W_O is a learned output projection matrix.
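The following sketch shows one common way to implement multi-head attention by reshaping full-width projections into h heads; the dimensions and random weights are assumptions for illustration:

```python
# Multi-head attention sketch: h independent heads, concatenated and
# mixed by W_O. Per-head dimension follows d_k = d_model / h.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_k = d_model // h
    # project, then split the feature dim into h heads: (h, n, d_k)
    def split(M):
        return (X @ M).reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_Q), split(W_K), split(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O                                # output projection

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4
X = rng.standard_normal((n, d_model))
W = [rng.standard_normal((d_model, d_model)) for _ in range(4)]
print(multi_head(X, *W, h=h).shape)  # (6, 32)
```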
Typically, the per-head dimension is d_k = d_model / h, so the total computation is roughly equivalent to a single head with full dimensionality. The original transformer used h = 8 heads with d_model = 512, giving d_k = 64 per head [1]. Modern large language models use larger numbers of heads to scale up. GPT-3 uses 96 heads with d_model = 12,288 in its largest variant. LLaMA 2 70B uses 64 heads, and PaLM uses 48. Common values for h in production models are 8, 12, 16, 32, 64, and 96.
Different heads tend to learn different types of relationships. Empirical analysis has shown that some heads specialize in syntactic relationships (subject-verb agreement, dependency parsing), while others focus on positional patterns (attending to adjacent tokens) or semantic relationships [10]. This diversity of learned patterns is one reason multi-head attention is so effective. The intuition Vaswani et al. offer is that a single head averages over all relations and loses information; multiple heads let the network keep separate "channels" for different kinds of attention patterns and combine them at the output projection.
Multi-head attention is expensive at inference time, not because of FLOPs but because of memory bandwidth. Every step of autoregressive decoding has to load the entire key and value cache for every head from GPU memory. As context grows, this KV cache becomes the dominant cost, and several variants have emerged to shrink it without sacrificing too much quality.
| Variant | Year | K/V sharing | KV cache size (relative to MHA) | Adopted by |
|---|---|---|---|---|
| Multi-Head Attention (MHA) | 2017 | Every head has its own K and V | 1x | Original transformer, GPT-2, GPT-3 |
| Multi-Query Attention (MQA) | 2019 | Single K and V shared by all query heads | 1/h | PaLM, Falcon, StarCoder |
| Grouped-Query Attention (GQA) | 2023 | Groups of query heads share one K and V | g/h (g = number of KV groups) | LLaMA 2 70B, LLaMA 3, Mistral, Mixtral |
| Multi-head Latent Attention (MLA) | 2024 | Low-rank joint compression of K and V into a latent vector | About 1/15 (a 93.3% reduction) in DeepSeek-V2 | DeepSeek-V2, DeepSeek-V3 |
Multi-Query Attention was proposed by Noam Shazeer (2019) to attack the memory bandwidth problem in autoregressive decoding. By sharing a single key and value head across all query heads, MQA shrinks the KV cache by a factor of h, with a roughly proportional speedup in memory-bound decoding. The price is some quality loss and training instability, which led several teams to look for a middle ground [6].
Grouped-Query Attention (GQA), introduced by Ainslie et al. (2023), generalizes MQA by partitioning query heads into g groups, with each group sharing one key and value head. With g = 1, GQA reduces to MQA; with g = h, it reduces to MHA. Ainslie et al. also showed that an existing MHA checkpoint could be uptrained into GQA using only about 5% of the original pretraining compute, making the variant practical to retrofit. GQA is now standard in LLaMA 2 70B, LLaMA 3, Mistral, Mixtral, Qwen, and many other recent models [7].
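A small sketch of the GQA idea, with an assumed configuration of 8 query heads and 2 KV heads: only the KV heads are cached, and each is broadcast to the query heads in its group:

```python
# Grouped-Query Attention sketch: h query heads share a smaller set of
# KV heads by repeating each KV head across its group. Shapes and the
# head counts are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, h, kv_heads = 8, 16, 8, 2            # 4 query heads per KV head
Q = rng.standard_normal((h, n, d_k))         # one Q per query head
K = rng.standard_normal((kv_heads, n, d_k))  # only kv_heads K/V are cached
V = rng.standard_normal((kv_heads, n, d_k))

# broadcast each KV head to the query heads in its group
group = h // kv_heads
K_full = np.repeat(K, group, axis=0)         # (h, n, d_k)
V_full = np.repeat(V, group, axis=0)

out = softmax(Q @ K_full.transpose(0, 2, 1) / np.sqrt(d_k)) @ V_full
print(out.shape)  # (8, 8, 16); the KV cache is kv_heads/h = 1/4 of MHA's
```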
Multi-head Latent Attention (MLA), introduced in DeepSeek-V2 (2024), takes a different angle by jointly compressing keys and values into a low-rank latent representation. Only the latent vector needs to be cached, not the full per-head K and V. DeepSeek reports that MLA reduces the KV cache by 93.3% compared with MHA while matching or beating its quality, and it boosts maximum generation throughput by about 5.76x. MLA also drives the architecture of DeepSeek-V3 and R1 [8].
One important property of self-attention is that it is permutation-equivariant: permuting the input tokens permutes the output vectors in exactly the same way, so the mechanism has no inherent notion of position or sequence order. To work with sequences whose order matters, transformers must add some form of positional encoding to the input or to the attention computation itself.
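This property is easy to demonstrate: in the sketch below (random stand-in weights, no positional encoding), permuting the input rows permutes the output rows identically:

```python
# Permutation equivariance check: attn(X)[perm] == attn(X[perm]).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(X, W):
    Q, K, V = X @ W[0], X @ W[1], X @ W[2]
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
W = [rng.standard_normal((16, 16)) for _ in range(3)]
perm = rng.permutation(5)
print(np.allclose(attn(X, W)[perm], attn(X[perm], W)))  # True
```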
Five broad families of positional encoding have been used in practice:
| Scheme | Year and paper | Where the signal is added | Used by |
|---|---|---|---|
| Sinusoidal absolute | Vaswani et al. 2017 | Added to token embeddings before layer 1 | Original transformer |
| Learned absolute | Devlin et al. 2018 (BERT) | Learned position embedding added to token embedding | BERT, GPT-2, RoBERTa |
| Relative position bias | Shaw et al. 2018; T5 (Raffel 2019) | Added to attention scores inside each layer | T5, DeBERTa |
| RoPE (rotary) | Su et al. 2021 | Rotates Q and K vectors based on absolute position before the dot product | LLaMA, GPT-NeoX, PaLM, Mistral, Qwen, DeepSeek |
| ALiBi | Press et al. 2021 | Adds a fixed linear bias to attention scores based on token distance | MPT, BLOOM, Baichuan |
Sinusoidal encodings, the original choice in Vaswani et al. (2017), use sines and cosines of different frequencies so that any fixed offset between two positions can be expressed as a linear function of the two encodings. This is parameter-free and was hoped to extrapolate to longer sequences than seen in training, although in practice this generalization is limited [1].
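A short sketch of the sinusoidal construction, following the published formula PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), with an assumed even d:

```python
# Sinusoidal positional encodings as in the original transformer:
# paired sine/cosine channels at geometrically spaced frequencies.
import numpy as np

def sinusoidal_pe(n, d):
    pos = np.arange(n)[:, None]                 # (n, 1) positions
    exponents = np.arange(0, d, 2) / d          # 2i/d for each pair
    angles = pos / np.power(10000.0, exponents) # (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)                # even channels: sine
    pe[:, 1::2] = np.cos(angles)                # odd channels: cosine
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)  # (128, 64); added to token embeddings before layer 1
```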
Learned absolute positions, used in BERT and GPT-2, treat position embeddings as trainable parameters stored in a table with one row per position up to the maximum context length, analogous to the token embedding table. They tend to outperform sinusoidal encodings inside the trained range but cannot be applied to positions beyond the training context length.
Relative position representations, introduced by Shaw et al. (2018), encode the offset i - j directly in the attention computation rather than encoding absolute position in the input. T5 follows a closely related approach with a learned scalar bias per relative distance bucket. Relative encodings improve translation quality and can generalize a bit beyond the training context [9].
Rotary Position Embedding (RoPE), introduced by Su et al. (2021) in the RoFormer paper, has become the dominant choice in modern large language models. Instead of adding a position vector to the input, RoPE rotates each query and key vector by an angle that depends on its absolute position. The dot product between a rotated query at position i and a rotated key at position j depends only on the offset i - j, so the result behaves like a relative position encoding while being implemented through pure rotations. RoPE is used in LLaMA, GPT-NeoX, PaLM, Mistral, Qwen, DeepSeek, and most other recent decoder-only LLMs. Variants such as YaRN, NTK-aware scaling, and position interpolation extend RoPE to context lengths far beyond what was seen during pretraining [4].
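A simplified sketch of the RoPE rotation, pairing adjacent feature dimensions with the standard base of 10000 (the pairing convention varies across implementations):

```python
# RoPE sketch: each (even, odd) pair of query/key features is rotated
# by an angle proportional to the token's position, so dot products
# between rotated vectors depend only on relative offsets.
import numpy as np

def rope(x, base=10000.0):
    """x: (n, d) with d even; returns x with positions rotated in."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                 # (n, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) pair frequencies
    theta = pos * freqs                         # (n, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]             # paired features
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 8, 16))
# the score between rotated q at position i and rotated k at position j
# depends only on the offset i - j, which is what makes RoPE relative
scores = rope(q) @ rope(k).T
```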
ALiBi (Attention with Linear Biases), introduced by Press et al. (2021), drops position embeddings entirely and instead adds a fixed, head-specific linear penalty to the attention score, equal to a slope times the query-key distance. This biases attention toward nearby tokens and lets the model extrapolate to inputs longer than the training context. Press et al. showed that a 1.3B model trained on length 1,024 with ALiBi extrapolates to length 2,048 with no quality loss while training 11% faster than a sinusoidal baseline. ALiBi powers MPT, BLOOM, and several other models [5].
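A sketch of the ALiBi bias construction, using the paper's geometric slope sequence 2^(-8i/h) for head i; shapes are illustrative:

```python
# ALiBi sketch: a fixed, head-specific linear penalty on attention
# scores, proportional to the query-key distance.
import numpy as np

def alibi_bias(n, h):
    slopes = 2.0 ** (-8.0 * np.arange(1, h + 1) / h)      # (h,)
    dist = np.arange(n)[None, :] - np.arange(n)[:, None]  # j - i
    dist = np.minimum(dist, 0)  # causal: penalize only past distances
    return slopes[:, None, None] * dist[None, :, :]       # (h, n, n), <= 0

bias = alibi_bias(n=6, h=4)
# per head: scores = Q @ K.T / sqrt(d_k) + bias[head]; more distant
# keys receive a larger negative bias, favoring nearby tokens
```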
The O(n^2) complexity of self-attention has driven a rich body of work aimed at reducing computation and memory usage, especially for long sequences. The methods break roughly into three categories: hardware-aware exact computation, sparse and structured patterns, and linear-complexity approximations.
| Method | Year | Approach | Time complexity | Memory complexity | Exact or approximate |
|---|---|---|---|---|---|
| FlashAttention | 2022 | IO-aware tiling and kernel fusion | O(n^2 d) | O(n) | Exact |
| FlashAttention-2 | 2023 | Better parallelism, fewer non-matmul FLOPs | O(n^2 d) | O(n) | Exact |
| FlashAttention-3 | 2024 | Hopper asynchrony, FP8 | O(n^2 d) | O(n) | Exact (FP16) or approximate (FP8) |
| Sparse attention (Child 2019) | 2019 | Strided and fixed sparse patterns | O(n sqrt(n)) | O(n sqrt(n)) | Approximate |
| Longformer (Beltagy 2020) | 2020 | Sliding window plus global tokens | O(n) | O(n) | Approximate |
| BigBird (Zaheer 2020) | 2020 | Random plus window plus global | O(n) | O(n) | Approximate |
| Reformer (Kitaev 2020) | 2020 | LSH-based bucketing | O(n log n) | O(n log n) | Approximate |
| Linformer (Wang 2020) | 2020 | Project K and V to fixed length k | O(n) | O(n) | Approximate |
| Performer (Choromanski 2020) | 2020 | Random feature kernel approximation (FAVOR+) | O(n) | O(n) | Approximate |
| Linear Transformer (Katharopoulos 2020) | 2020 | Replace softmax with kernel feature map phi | O(n) | O(n) | Approximate |
| Ring Attention | 2023 | Distribute attention across devices | O(n^2 / devices) | O(n / devices) | Exact |
FlashAttention, introduced by Dao et al. (2022), takes a hardware-aware approach. Rather than changing the mathematical computation, it reorganizes how attention is computed to minimize expensive memory transfers between GPU HBM and on-chip SRAM. Using tiling and kernel fusion, FlashAttention computes exact attention while reducing memory usage from O(n^2) to O(n). The original paper reports a 15% end-to-end wall-clock speedup on BERT-large at sequence length 512, a 3x speedup on GPT-2 at length 1K, and a 2.4x speedup on Long Range Arena at lengths 1K to 4K [3].
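The key mathematical ingredient is an online softmax: attention can be accumulated over key/value blocks with a running max and running normalizer, never materializing the (n, n) matrix. The NumPy sketch below shows only this recurrence, not the HBM/SRAM traffic of the real fused kernel; the block size and shapes are assumptions:

```python
# Online-softmax recurrence behind FlashAttention: process K/V in
# blocks, tracking a running max m and running denominator s per query
# row, and rescaling the partial output when the max changes.
import numpy as np

def blocked_attention(Q, K, V, block=4):
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[1]))
    m = np.full(n, -np.inf)        # running max per query row
    s = np.zeros(n)                # running softmax denominator
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d_k)          # (n, block)
        m_new = np.maximum(m, scores.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale old state
        p = np.exp(scores - m_new[:, None])
        s = s * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / s[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 16))
sc = Q @ K.T / np.sqrt(Q.shape[-1])
w = np.exp(sc - sc.max(axis=1, keepdims=True))
ref = (w / w.sum(axis=1, keepdims=True)) @ V      # naive full attention
print(np.allclose(blocked_attention(Q, K, V), ref))  # True
```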
FlashAttention-2 (Dao 2023) restructured the algorithm to parallelize over the sequence length dimension as well as batch and heads, and reduced the number of non-matmul FLOPs that the previous version spent on rescaling. It reaches 50% to 73% of theoretical peak FLOPs on A100 GPUs, roughly twice the throughput of the original FlashAttention. FlashAttention-3 (Shah, Dao et al. 2024) targets the Hopper architecture and uses warp-specialized scheduling, the Tensor Memory Accelerator, and FP8 to push throughput to 1.2 petaflops per second on H100, about 1.5x to 2x faster than FlashAttention-2 on the same hardware [11].
FlashAttention has become the default attention kernel in most production LLM training and inference stacks, including PyTorch's scaled_dot_product_attention, vLLM, TensorRT-LLM, and the Hugging Face transformers library.
Sparse attention methods restrict each token to attending to only a subset of other tokens, reducing complexity from O(n^2) to O(n sqrt(n)) or O(n log n) or even O(n).
The Sparse Transformer of Child et al. (2019) was an early example, introducing two patterns called strided and fixed attention. Strided attention has each position attend to a local window plus positions at a regular stride, equivalent to attending along its row and column in a virtual 2D grid laid over the sequence, while fixed attention routes information through a small set of designated summary positions. Either pattern reduces complexity to O(n sqrt(n)) and lets the model handle sequences in the tens of thousands of tokens. Sparse Transformer set state-of-the-art density modeling results on Enwik8, CIFAR-10, and ImageNet-64 at the time [19].
Longformer (Beltagy et al. 2020) combines a sliding window of local attention with a small number of global tokens that attend to all positions and are attended to by all positions [20]. This works well for document-level tasks like summarization and question answering. BigBird (Zaheer et al. 2020) extends the idea by adding random attention edges to local and global patterns and proves that the resulting sparse attention is a universal approximator of sequence functions [21].
Reformer (Kitaev et al. 2020) replaces dot-product attention with locality-sensitive hashing, bucketing tokens whose query and key vectors are likely to have a high dot product into the same group and computing attention only within each bucket. This reduces complexity to O(n log n) and lets Reformer fit sequences of length 64K on a single GPU. Reformer also uses reversible layers to avoid storing intermediate activations [22].
These methods trade some representational capacity for linear or near-linear scaling, which makes them suitable for very long documents. Their adoption in production language models has been mixed. Mistral and other recent decoder-only LLMs use a sliding window pattern (similar in spirit to Longformer's local attention) as a way to bound the KV cache while keeping quality close to full attention.
Linear attention methods replace the softmax with a kernel function that allows the attention computation to be decomposed into a form with O(n) complexity. Katharopoulos et al. (2020) showed that by using a feature map phi instead of softmax, attention can be rewritten (up to a row-wise normalizing denominator) as [25]:
Attention(Q, K, V) = phi(Q) (phi(K)^T V)
By computing phi(K)^T V first (which has shape d_k by d_v), the computation avoids forming the n by n attention matrix. The same paper points out that this lets autoregressive transformers be implemented as recurrent networks, with a fixed-size matrix-valued state that updates one step at a time, and reports up to 4,000x speedups for very long autoregressive prediction.
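A sketch of the decomposition, using the paper's feature map phi(x) = elu(x) + 1 and including the normalizing denominator that the displayed formula omits:

```python
# Linear attention sketch (Katharopoulos et al.): computing phi(K)^T V
# first keeps the cost at O(n d_k d_v) and never forms the (n, n) matrix.
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, positive

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)                 # (n, d_k) feature maps
    KV = Kf.T @ V                           # (d_k, d_v) summary
    Z = Kf.sum(axis=0)                      # (d_k,) normalizer terms
    return (Qf @ KV) / (Qf @ Z)[:, None]    # rows are weighted averages

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 16))
print(linear_attention(Q, K, V).shape)  # (8, 16)
```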
Linformer (Wang et al. 2020) takes a different route, projecting the n by d_k key and value matrices down to a fixed length k along the sequence dimension. The resulting attention runs in O(n k), which is linear in n when k is held constant. The justification is empirical: the singular value spectrum of the attention matrix in trained transformers decays quickly, so the matrix is well approximated by a low-rank projection [23].
Performer (Choromanski et al. 2020) approximates the softmax kernel using positive orthogonal random features, an algorithm called FAVOR+. The approximation is unbiased and comes with provable accuracy guarantees, letting Performer estimate full softmax attention in O(n) time and memory [24].
Linear attention methods have generally not matched the quality of standard softmax attention for large-scale language modeling, which has limited their direct adoption in frontier LLMs. Newer hybrid designs and ideas like RWKV, RetNet, and Mamba (see below) build on the same intuition that constant-state recurrence can replace explicit pairwise attention for many sequence tasks.
Ring Attention (Liu et al. 2023) is not an algorithmic change to attention but a way of distributing the computation across many devices. Each device holds a slice of the sequence and the keys and values rotate around a ring, so each device sees every other device's K and V exactly once. The mathematics is exact, the memory per device is O(n / devices), and total throughput grows roughly linearly with the device count. Ring attention combined with FlashAttention is one of the standard approaches to training million-token context models [26].
During autoregressive generation, every new token requires computing attention over the entire prefix. Recomputing keys and values for the prefix at each step would be O(n^2) per step. The standard remedy is the KV cache: keys and values for each generated token are computed once and stored, so each new step only computes the new token's Q and reads the cached K and V.
The KV cache turns generation into an O(n) per-step operation in compute but an O(n) per-step operation in memory bandwidth, since the entire cache must be streamed from HBM at every step. For long contexts and large models, this memory bandwidth becomes the binding constraint on inference throughput. MQA, GQA, MLA, and sliding-window attention are all in part attempts to shrink the KV cache so that more requests can fit on a single GPU and so that decoding is bound by compute rather than bandwidth.
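A minimal sketch of KV-cached decoding for a single head, with random stand-in weights and embeddings: each step computes projections only for the newest token and attends over the growing cache:

```python
# KV-cache sketch: K and V for each token are computed once and stored,
# so each decoding step is O(n) in compute rather than O(n^2).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, d_k = 16, 16
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))
K_cache, V_cache = [], []

for step in range(5):                  # pretend tokens arrive one by one
    x = rng.standard_normal(d)         # embedding of the newest token
    q = x @ W_Q
    K_cache.append(x @ W_K)            # computed once, then reused
    V_cache.append(x @ W_V)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d_k)      # attend over all cached keys
    out = softmax(scores) @ V          # causal by construction
print(out.shape)  # (16,)
```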
Production inference servers such as vLLM and TensorRT-LLM use a paged KV cache, similar in spirit to virtual memory, that lets cached blocks for different requests share GPU memory more efficiently and supports continuous batching across many concurrent users.
Self-attention is the primary mechanism through which transformers build contextual representations. Each transformer layer consists of two sublayers: a multi-head self-attention sublayer followed by a position-wise feed-forward network. Both sublayers are wrapped in residual connections and layer normalization [1].
In the encoder, self-attention is bidirectional: each token can attend to all other tokens. This produces richly contextual representations useful for tasks like classification, named entity recognition, and extractive question answering. BERT and its variants use encoder-only architectures with bidirectional self-attention.
In the decoder, self-attention is causal (masked), restricting each token to attending only to earlier positions. This is the architecture used by GPT, LLaMA, and virtually all modern autoregressive language models.
Self-attention is also the mechanism that enables in-context learning, the ability of large language models to adapt to new tasks based on examples provided in the prompt. Olsson et al. (2022) at Anthropic argued that a specific kind of attention head, the induction head, is responsible for much of the in-context learning ability of small attention-only models. An induction head learns to attend back to a previous occurrence of the current token and copy what came after it, which is a primitive form of pattern matching that supports many in-context learning behaviors [12].
While self-attention was popularized through language tasks, it has been successfully applied across many domains, including computer vision, where the Vision Transformer (ViT) applies self-attention to sequences of image patches [13]; speech and audio modeling; biological sequence modeling, such as protein structure prediction; and reinforcement learning and recommendation systems that treat trajectories or interaction histories as sequences.
Attention weights are easy to visualize, which led early work to treat them as natural explanations for model behavior. The picture is more nuanced.
Jain and Wallace (2019) ran a controlled study on text classification models and concluded that attention is not explanation. They found that learned attention weights are often weakly correlated with gradient-based feature importance and that very different attention distributions can produce essentially the same predictions. The implication is that any single attention map is one of many possible plausible explanations, not the explanation [14].
Wiegreffe and Pinter (2019) responded with "Attention is not not Explanation," pointing out that the experiments in Jain and Wallace allow attention to be replaced freely without retraining, which is too permissive a test. They proposed alternative tests under which attention can serve as a useful explanation for at least some models and tasks. The takeaway from this exchange is that attention weights are useful but should not be over-interpreted, and that any claim of explanation should be supported by behavioral tests, not raw weights alone.
Abnar and Zuidema (2020) addressed the problem that information mixes across layers and proposed two methods, attention rollout and attention flow, that estimate how much each input token contributes to a given output by recursively combining attention weights across layers. These methods correlate better with gradient-based importance than raw attention does and are still used as a quick sanity check on what a transformer is looking at [15].
A more recent line of work, mechanistic interpretability, treats attention heads as small computational circuits to be reverse-engineered. The Transformer Circuits Thread led by Anthropic has developed a mathematical framework in which each attention head is decomposed into a QK circuit (which positions to attend to) and an OV circuit (what information to write into the residual stream). Within this framework, Olsson et al. (2022) identified induction heads in small attention-only models, in which a head in one layer detects that the current token has appeared before, and a head in the next layer copies the token that followed it. Induction heads emerge during training in a sharp phase transition that closely tracks the appearance of in-context learning, suggesting that this circuit is one of the basic building blocks of LLM behavior [12, 16].
The quadratic O(n^2) compute and memory cost in sequence length is the most discussed limitation of self-attention. It bounds practical context lengths and forces long-context models to lean on FlashAttention, GQA or MLA, sliding window or sparse patterns, ring attention, and KV-cache compression to keep training and inference tractable. Even with these techniques, attention remains a major share of the compute budget when training or serving frontier models at million-token context lengths.
A second limitation is that self-attention is permutation-equivariant by construction, so positional information has to be added separately. Different positional encodings have different inductive biases and different abilities to extrapolate beyond the training context. Length extrapolation remains an active research area, with techniques like position interpolation, NTK-aware RoPE scaling, YaRN, and ALiBi all attempting to push the context window beyond what the model originally saw.
A third limitation is interpretability. As discussed above, attention weights look interpretable but do not always explain model decisions. Mechanistic interpretability has made progress on small models but scaling these methods to frontier-size systems is hard.
A fourth limitation, which is more architectural than computational, is that self-attention is uniform: every token attends to every other token through the same kind of dot-product comparison. This treats all tokens the same regardless of role, which can be wasteful when only a small subset of tokens are relevant at each step.
Self-attention is the foundation of all modern LLMs in production today, but it is no longer the only option. Two related lines of work try to remove or reduce reliance on quadratic attention.
State space models such as Mamba (Gu and Dao, 2023) replace attention with a selective recurrent layer whose parameters depend on the input. Mamba scales linearly in sequence length and runs about 5x faster at inference than a transformer of the same size. Mamba-3B matches the quality of transformers twice its size on language modeling, audio, and genomics benchmarks [17]. RWKV and RetNet are similar in spirit, recasting attention as a constant-state recurrence.
Hybrid architectures combine attention layers with state space layers to capture the strengths of both. AI21's Jamba (2024) is a hybrid Transformer-Mamba mixture-of-experts model that interleaves Mamba and attention layers (one attention block out of every eight, in the original ratio) and supports context lengths up to 256K tokens on a single 80GB GPU. Jamba 1.5 scales the design to 398B total parameters with 94B active and reports up to 3x throughput improvement over LLaMA 2 70B and Mixtral [18].
Despite these alternatives, the major frontier model releases of 2024 and 2025 (GPT-4o, Claude 3 and 4, Gemini 1.5 and 2, LLaMA 3 and 4, Qwen, DeepSeek-V3 and R1) remain transformer-based. Self-attention with FlashAttention kernels, GQA or MLA for KV-cache reduction, RoPE for positions, and increasingly sliding-window or hybrid attention patterns is the dominant recipe. Whether attention will keep its central role or be supplanted by a recurrent or hybrid mechanism is one of the more interesting open architectural questions in the field.
Attention as a learnable alignment between a query and a set of values appeared in machine translation with Bahdanau et al. (2014), who used it to let a decoder soft-align over an encoder's hidden states [2]. Luong et al. (2015) introduced multiplicative dot-product attention. The first explicit uses of self-attention (sometimes called intra-attention) appeared in the context of natural language inference and abstractive summarization in 2016 and 2017. Vaswani et al. (2017) gave self-attention its modern definition as the load-bearing component of the transformer, replacing recurrence and convolution entirely [1].
Since then, self-attention has been the central object of study in modern deep learning. The term "attention" today usually refers, by default, to scaled dot-product self-attention as defined in the transformer paper, with multi-head, causal masking, and FlashAttention as the standard implementation choices.
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems (NeurIPS) 30. arXiv:1706.03762.
[2] Bahdanau, D., Cho, K., and Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473.
[3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arXiv:2205.14135.
[4] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. Published in Neurocomputing 568 (2024).
[5] Press, O., Smith, N. A., and Lewis, M. (2021). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR 2022. arXiv:2108.12409.
[6] Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150.
[7] Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. arXiv:2305.13245.
[8] DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434.
[9] Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). "Self-Attention with Relative Position Representations." NAACL 2018. arXiv:1803.02155.
[10] Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." BlackboxNLP 2019. arXiv:1906.04341.
[11] Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. (2024). "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision." NeurIPS 2024. arXiv:2407.08608.
[12] Olsson, C., Elhage, N., Nanda, N., Joseph, N., et al. (2022). "In-context Learning and Induction Heads." Transformer Circuits Thread, Anthropic.
[13] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. arXiv:2010.11929.
[14] Jain, S. and Wallace, B. C. (2019). "Attention is not Explanation." NAACL 2019. arXiv:1902.10186. See also Wiegreffe, S. and Pinter, Y. (2019). "Attention is not not Explanation." EMNLP 2019. arXiv:1908.04626.
[15] Abnar, S. and Zuidema, W. (2020). "Quantifying Attention Flow in Transformers." ACL 2020. arXiv:2005.00928.
[16] Elhage, N., Nanda, N., Olsson, C., et al. (2021). "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, Anthropic.
[17] Gu, A. and Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.
[18] Lieber, O., Lenz, B., Bata, H., et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887.
[19] Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509.
[20] Beltagy, I., Peters, M. E., and Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150.
[21] Zaheer, M., Guruganesh, G., Dubey, A., et al. (2020). "Big Bird: Transformers for Longer Sequences." NeurIPS 2020. arXiv:2007.14062.
[22] Kitaev, N., Kaiser, L., and Levskaya, A. (2020). "Reformer: The Efficient Transformer." ICLR 2020. arXiv:2001.04451.
[23] Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). "Linformer: Self-Attention with Linear Complexity." arXiv:2006.04768.
[24] Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2020). "Rethinking Attention with Performers." ICLR 2021. arXiv:2009.14794.
[25] Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." ICML 2020. arXiv:2006.16236.
[26] Liu, H., Zaharia, M., and Abbeel, P. (2023). "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv:2310.01889.