Self-attention is a core mechanism in modern deep learning that allows a neural network to relate different positions within a single input sequence to compute a richer representation of that sequence. Unlike traditional attention mechanisms where queries attend to a separate context (as in sequence-to-sequence models), self-attention derives its queries, keys, and values entirely from the same input. Introduced as "intra-attention" in earlier work and formalized as a central building block in the transformer architecture by Vaswani et al. (2017), self-attention has become the dominant mechanism powering large language models, vision transformers, and a wide range of other architectures across natural language processing, computer vision, and beyond [1].
At its core, self-attention answers a simple question: for each element in a sequence, which other elements in the same sequence are most relevant to it? Consider the sentence "The cat sat on the mat because it was tired." To understand what "it" refers to, a model needs to relate the pronoun back to "cat" earlier in the sentence. Self-attention enables this by computing a weighted sum over all positions in the sequence, where the weights reflect the relevance of each position to the current one.
The term "self" distinguishes this mechanism from cross-attention, where queries come from one sequence and keys/values come from a different sequence (for example, an encoder-decoder setup). In self-attention, all three components originate from the same input, making it a form of intra-sequence reasoning. The mechanism was originally referred to as "intra-attention" in earlier work on summarization and reading comprehension, and Vaswani et al. (2017) elevated it to the centerpiece of the transformer architecture by showing that an entire sequence model could be built from self-attention layers without any recurrence or convolution [1].
A helpful mental model is that of a soft, learnable lookup. Each token issues a query that asks "what kind of information do I need from the rest of the sequence?" Every other token offers a key that advertises "this is the kind of information I carry," and a value that contains the actual information to be passed along. The dot product between the query and each key produces a relevance score, and the value vectors are then averaged according to these scores after a softmax normalization. The whole operation is differentiable, so the projections that produce queries, keys, and values are learned end to end with the rest of the network.
Given an input sequence of n tokens, each represented as a d-dimensional vector, the input can be organized as a matrix X of shape (n, d). Self-attention transforms this input through three learned linear projections to produce queries, keys, and values:

Q = X W_Q,    K = X W_K,    V = X W_V
Here, W_Q, W_K, and W_V are learned weight matrices. W_Q and W_K have shape (d, d_k), and W_V has shape (d, d_v), where d_k is the key/query dimension and d_v is the value dimension. In the original transformer, d_k = d_v = d_model / h, where h is the number of attention heads.
The attention output is then computed using scaled dot-product attention:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Breaking this down step by step:

1. Compute the raw scores Q K^T, an (n, n) matrix whose entry (i, j) is the dot product between query i and key j.
2. Scale the scores by 1/sqrt(d_k) to keep their variance under control (see below).
3. Apply softmax along each row, turning the scores into attention weights that are non-negative and sum to 1.
4. Multiply the weight matrix by V, so each output position is a weighted average of all value vectors.
The result is a new sequence of n vectors, each of which incorporates information from all other positions in the input, weighted by learned relevance.
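To make the computation concrete, here is a minimal NumPy sketch of a single self-attention head. The shapes, random weights, and function names are illustrative assumptions, not a reference implementation:

```python
# A minimal sketch of single-head self-attention (illustrative shapes
# and random stand-in weights, not learned parameters).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """X: (n, d) token matrix; W_Q/W_K: (d, d_k); W_V: (d, d_v)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # learned projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) relevance scores
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # (n, d_v) contextual output

rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8
X = rng.standard_normal((n, d))
out = self_attention(X,
                     rng.standard_normal((d, d_k)),
                     rng.standard_normal((d, d_k)),
                     rng.standard_normal((d, d_k)))
print(out.shape)  # (5, 8)
```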
The scaling factor 1/sqrt(d_k) is one of the most discussed details of the transformer paper. Vaswani et al. argued that as d_k grows, the dot product q · k becomes a sum of d_k random products. If the components of q and k are independent random variables with mean 0 and variance 1, then the dot product has mean 0 and variance d_k. Without scaling, the magnitudes of the scores grow with sqrt(d_k), and large scores push softmax into a regime where one entry dominates and the gradient with respect to all other entries is essentially zero [1]. Dividing by sqrt(d_k) keeps the variance of the pre-softmax scores roughly constant regardless of head dimension, which keeps gradients usable and lets the model train stably even when d_k is large.
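The variance argument is easy to verify numerically. The sketch below (sample sizes are arbitrary) draws unit-variance queries and keys and compares the variance of raw and scaled dot products:

```python
# Numerical check of the scaling argument: with unit-variance q and k,
# the raw dot product has variance ~ d_k; dividing by sqrt(d_k)
# brings it back to ~1 regardless of head dimension.
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    q = rng.standard_normal((100_000, d_k))
    k = rng.standard_normal((100_000, d_k))
    raw = (q * k).sum(axis=1)
    print(d_k, raw.var().round(1), (raw / np.sqrt(d_k)).var().round(2))
# variance of the raw score grows linearly with d_k; scaled stays ~1.0
```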
This seemingly minor detail matters in practice. The earlier additive attention of Bahdanau et al. (2014) avoided the issue by using a small feed-forward network to score query-key pairs, which is computationally heavier but does not suffer from the same scaling problem [2]. Scaled dot-product attention combines the speed of dot products with the numerical stability of additive attention, which is one reason it became the default in modern architectures.
The dominant cost of self-attention is the matrix multiplication Q K^T, which produces the (n, n) attention score matrix. For a sequence of length n with head dimension d_k, this operation requires O(n^2 d_k) floating-point operations. The subsequent multiplication with V adds another O(n^2 d_v). Overall, self-attention has:

- Time complexity O(n^2 d), dominated by the two matrix products involving the (n, n) score matrix.
- Memory complexity O(n^2) for the attention score matrix, on top of O(n d) for the projections.
The quadratic scaling in sequence length n is the primary bottleneck of self-attention. Doubling the sequence length quadruples both computation and memory. For a 2,048-token sequence, the attention matrix has about 4 million entries; for a 128K-token sequence, it balloons to over 16 billion entries. In a multi-head setting with h heads, the total cost is multiplied by h, although in practice each head has dimension d_model / h so the per-layer parameter count and FLOP count remain comparable to a single full-dimensional head.
This quadratic cost has driven a rich body of research into efficient alternatives, covered in the efficiency section below.
Self-attention solved several fundamental problems that plagued earlier sequence modeling approaches.
Recurrent neural networks (RNNs) process sequences step by step, meaning information from early tokens must pass through many sequential operations to influence later tokens. In practice, this makes it difficult for RNNs to learn dependencies spanning dozens or hundreds of tokens, even with gating mechanisms like LSTM. Self-attention computes direct pairwise interactions between all positions in a single operation, so the path length between any two tokens is O(1) rather than O(n) [1].
Because self-attention computes all pairwise interactions simultaneously, the entire operation can be parallelized efficiently on modern GPUs. RNNs, by contrast, require sequential computation: the hidden state at time step t depends on the hidden state at time step t-1. This sequential bottleneck made RNN training much slower on parallel hardware. Convolutional sequence models such as ConvS2S avoid the bottleneck but reach distant tokens only through stacks of layers, which still bounds path length by the receptive field.
The attention weight matrix provides a form of soft alignment that can be inspected. While attention weights should not be treated as definitive explanations of model behavior (see the interpretability section below), they do offer a useful diagnostic tool for understanding which tokens the model considers relevant to each other.
The following table compares self-attention to previous sequence modeling approaches, using the analysis from Vaswani et al. (2017):
| Property | Self-attention | RNN | CNN |
|---|---|---|---|
| Maximum path length | O(1) | O(n) | O(log_k n) |
| Computation per layer | O(n^2 d) | O(n d^2) | O(k n d^2) |
| Parallelizable | Yes | No | Yes |
| Long-range dependencies | Direct | Difficult | Moderate |
| Sequential operations | O(1) | O(n) | O(1) |
The transformer architecture employs two distinct forms of attention: self-attention and cross-attention. Understanding the difference is essential for grasping how encoder-decoder models work.
In self-attention, queries, keys, and values all come from the same sequence. Both the encoder and decoder in the original transformer use self-attention layers. The encoder's self-attention lets each token in the input attend to all other input tokens. The decoder's self-attention (which is masked; see below) lets each token in the output attend to previously generated output tokens.
In cross-attention (also called encoder-decoder attention), queries come from the decoder while keys and values come from the encoder's output. This is the mechanism by which the decoder "reads" the encoded input. For example, in machine translation, cross-attention allows each word being generated in the target language to attend to all words in the source language sentence.
| Aspect | Self-attention | Cross-attention |
|---|---|---|
| Source of Q | Same sequence | Decoder |
| Source of K, V | Same sequence | Encoder output |
| Purpose | Model intra-sequence relationships | Model inter-sequence relationships |
| Found in | Encoder layers, decoder layers | Decoder layers only (in encoder-decoder models) |
| Example use | Contextual word representation | Aligning translation output to input |
Decoder-only models such as GPT and LLaMA use only self-attention with causal masking, since they do not have a separate encoder. Encoder-decoder models like T5 and the original transformer use both self-attention and cross-attention. Cross-attention is also central to many multi-modal models, where a text decoder cross-attends to image features produced by a vision encoder.
In autoregressive language modeling, a model generates tokens one at a time, left to right. During training, the model processes an entire sequence at once for efficiency, but each position should only be able to attend to positions at or before itself. Allowing a position to attend to future tokens would let the model "cheat" by looking at the answer it is supposed to predict.
Masked self-attention (also called causal attention) enforces this constraint by applying a mask to the attention score matrix before the softmax step. Specifically, all entries above the diagonal of the (n, n) score matrix are set to negative infinity, so that after softmax they become zero. This means position i can only attend to positions 1 through i.
The masked attention formula is:
Attention(Q, K, V) = softmax(mask(Q K^T / sqrt(d_k))) V
where mask(.) sets future positions to negative infinity.
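A sketch of how the mask is typically realized in code, with illustrative shapes: entries above the diagonal are set to -inf so the softmax assigns them zero weight:

```python
# Causal (masked) self-attention sketch: positions above the diagonal
# of the score matrix are set to -inf before the softmax.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above diagonal
    scores = np.where(mask, -np.inf, scores)          # future -> -inf
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((4, 8))
out = causal_attention(Q, K, V)
# row i of the post-softmax weight matrix is zero for all columns j > i
```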
Causal masking is used in all decoder-only models, including the GPT family, LLaMA, Mistral, and other autoregressive LLMs. During inference, when tokens are generated one at a time, the causal property is naturally satisfied since future tokens do not yet exist. The mask is primarily needed during training when full sequences are processed in parallel, and at prefill time when the prompt is processed in one pass.
A practical consequence of causal masking is that the attention score matrix is lower triangular, which in principle halves the attention FLOPs, since only the lower triangle of Q K^T ever contributes to the output. Most production kernels, including FlashAttention, exploit this triangular structure to skip the upper triangle entirely, which roughly doubles speed in the long-context regime.
A single self-attention operation can only capture one type of relationship at a time. Multi-head attention, introduced alongside self-attention in the original transformer paper, addresses this by running multiple self-attention operations in parallel, each with its own learned projections [1].
Given h attention heads, the input X is projected into h separate sets of queries, keys, and values using different weight matrices. Each head computes attention independently:
head_i = Attention(X W_Q^i, X W_K^i, X W_V^i)
The outputs of all heads are concatenated and projected through a final linear layer:
MultiHead(X) = Concat(head_1, ..., head_h) W_O
where W_O is a learned output projection matrix.
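The following sketch shows one common way to implement multi-head attention by reshaping full-width projections into h heads; the dimensions and random weights are assumptions for illustration:

```python
# Multi-head attention sketch: h independent heads, concatenated and
# mixed by W_O. Per-head dimension follows d_k = d_model / h.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head(X, W_Q, W_K, W_V, W_O, h):
    n, d_model = X.shape
    d_k = d_model // h
    # project, then split the feature dim into h heads: (h, n, d_k)
    def split(M):
        return (X @ M).reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(W_Q), split(W_K), split(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_O                                # output projection

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4
X = rng.standard_normal((n, d_model))
W = [rng.standard_normal((d_model, d_model)) for _ in range(4)]
print(multi_head(X, *W, h=h).shape)  # (6, 32)
```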
Typically, the per-head dimension is d_k = d_model / h, so the total computation is roughly equivalent to a single head with full dimensionality. The original transformer used h = 8 heads with d_model = 512, giving d_k = 64 per head [1]. Modern large language models use larger numbers of heads to scale up. GPT-3 uses 96 heads with d_model = 12,288 in its largest variant. LLaMA 2 70B uses 64 heads, and PaLM uses 48. Common values for h in production models are 8, 12, 16, 32, 64, and 96.
Different heads tend to learn different types of relationships. Empirical analysis has shown that some heads specialize in syntactic relationships (subject-verb agreement, dependency parsing), while others focus on positional patterns (attending to adjacent tokens) or semantic relationships [10]. This diversity of learned patterns is one reason multi-head attention is so effective. The intuition Vaswani et al. offer is that a single head averages over all relations and loses information; multiple heads let the network keep separate "channels" for different kinds of attention patterns and combine them at the output projection.
Multi-head attention is expensive at inference time, not because of FLOPs but because of memory bandwidth. Every step of autoregressive decoding has to load the entire key and value cache for every head from GPU memory. As context grows, this KV cache becomes the dominant cost, and several variants have emerged to shrink it without sacrificing too much quality.
| Variant | Year | K/V sharing | KV cache size (relative to MHA) | Adopted by |
|---|---|---|---|---|
| Multi-Head Attention (MHA) | 2017 | Every head has its own K and V | 1x | Original transformer, GPT-2, GPT-3 |
| Multi-Query Attention (MQA) | 2019 | Single K and V shared by all query heads | 1/h | PaLM, Falcon, StarCoder |
| Grouped-Query Attention (GQA) | 2023 | Groups of query heads share one K and V | g/h (g = number of KV groups) | LLaMA 2 70B, LLaMA 3, Mistral, Mixtral |
| Multi-head Latent Attention (MLA) | 2024 | Low-rank joint compression of K and V into a latent vector | About 1/15 (a 93.3% reduction) in DeepSeek-V2 | DeepSeek-V2, DeepSeek-V3 |
Multi-Query Attention was proposed by Noam Shazeer (2019) to attack the memory bandwidth problem in autoregressive decoding. By sharing a single key and value head across all query heads, MQA shrinks the KV cache by a factor of h, with a roughly proportional speedup in memory-bound decoding. The price is some quality loss and training instability, which led several teams to look for a middle ground [6].
Grouped-Query Attention (GQA), introduced by Ainslie et al. (2023), generalizes MQA by partitioning query heads into g groups, with each group sharing one key and value head. With g = 1, GQA reduces to MQA; with g = h, it reduces to MHA. Ainslie et al. also showed that an existing MHA checkpoint could be uptrained into GQA using only about 5% of the original pretraining compute, making the variant practical to retrofit. GQA is now standard in LLaMA 2 70B, LLaMA 3, Mistral, Mixtral, Qwen, and many other recent models [7].
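A small sketch of the GQA idea, with an assumed configuration of 8 query heads and 2 KV heads: only the KV heads are cached, and each is broadcast to the query heads in its group:

```python
# Grouped-Query Attention sketch: h query heads share a smaller set of
# KV heads by repeating each KV head across its group. Shapes and the
# head counts are illustrative assumptions.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, h, kv_heads = 8, 16, 8, 2            # 4 query heads per KV head
Q = rng.standard_normal((h, n, d_k))         # one Q per query head
K = rng.standard_normal((kv_heads, n, d_k))  # only kv_heads K/V are cached
V = rng.standard_normal((kv_heads, n, d_k))

# broadcast each KV head to the query heads in its group
group = h // kv_heads
K_full = np.repeat(K, group, axis=0)         # (h, n, d_k)
V_full = np.repeat(V, group, axis=0)

out = softmax(Q @ K_full.transpose(0, 2, 1) / np.sqrt(d_k)) @ V_full
print(out.shape)  # (8, 8, 16); the KV cache is kv_heads/h = 1/4 of MHA's
```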
Multi-head Latent Attention (MLA), introduced in DeepSeek-V2 (2024), takes a different angle by jointly compressing keys and values into a low-rank latent representation. Only the latent vector needs to be cached, not the full per-head K and V. DeepSeek reports that MLA reduces the KV cache by 93.3% compared with MHA while matching or beating its quality, and it boosts maximum generation throughput by about 5.76x. MLA also drives the architecture of DeepSeek-V3 and R1 [8].
One important property of self-attention is that it is permutation-equivariant: permuting the input tokens permutes the output vectors in exactly the same way, so the mechanism has no inherent notion of position or sequence order. To work with sequences whose order matters, transformers must add some form of positional encoding to the input or to the attention computation itself.
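This property is easy to demonstrate: in the sketch below (random stand-in weights, no positional encoding), permuting the input rows permutes the output rows identically:

```python
# Permutation equivariance check: attn(X)[perm] == attn(X[perm]).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(X, W):
    Q, K, V = X @ W[0], X @ W[1], X @ W[2]
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
W = [rng.standard_normal((16, 16)) for _ in range(3)]
perm = rng.permutation(5)
print(np.allclose(attn(X, W)[perm], attn(X[perm], W)))  # True
```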
Five broad families of positional encoding have been used in practice:
| Scheme | Year and paper | Where the signal is added | Used by |
|---|---|---|---|
| Sinusoidal absolute | Vaswani et al. 2017 | Added to token embeddings before layer 1 | Original transformer |
| Learned absolute | Devlin et al. 2018 (BERT) | Learned position embedding added to token embedding | BERT, GPT-2, RoBERTa |
| Relative position bias | Shaw et al. 2018; T5 (Raffel 2019) | Added to attention scores inside each layer | T5, DeBERTa |
| RoPE (rotary) | Su et al. 2021 | Rotates Q and K vectors based on absolute position before the dot product | LLaMA, GPT-NeoX, PaLM, Mistral, Qwen, DeepSeek |
| ALiBi | Press et al. 2021 | Adds a fixed linear bias to attention scores based on token distance | MPT, BLOOM, Baichuan |
Sinusoidal encodings, the original choice in Vaswani et al. (2017), use sines and cosines of different frequencies so that any fixed offset between two positions can be expressed as a linear function of the two encodings. This is parameter-free and was hoped to extrapolate to longer sequences than seen in training, although in practice this generalization is limited [1].
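A short sketch of the sinusoidal construction, following the published formula PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)), with an assumed even d:

```python
# Sinusoidal positional encodings as in the original transformer:
# paired sine/cosine channels at geometrically spaced frequencies.
import numpy as np

def sinusoidal_pe(n, d):
    pos = np.arange(n)[:, None]                 # (n, 1) positions
    exponents = np.arange(0, d, 2) / d          # 2i/d for each pair
    angles = pos / np.power(10000.0, exponents) # (n, d/2)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)                # even channels: sine
    pe[:, 1::2] = np.cos(angles)                # odd channels: cosine
    return pe

pe = sinusoidal_pe(128, 64)
print(pe.shape)  # (128, 64); added to token embeddings before layer 1
```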
Learned absolute positions, used in BERT and GPT-2, treat position embeddings as trainable parameters stored in a table with one row per position up to the maximum context length, analogous to the token embedding table. They tend to outperform sinusoidal encodings inside the trained range but cannot be applied to positions beyond the training context length.
Relative position representations, introduced by Shaw et al. (2018), encode the offset i - j directly in the attention computation rather than encoding absolute position in the input. T5 follows a closely related approach with a learned scalar bias per relative distance bucket. Relative encodings improve translation quality and can generalize a bit beyond the training context [9].
Rotary Position Embedding (RoPE), introduced by Su et al. (2021) in the RoFormer paper, has become the dominant choice in modern large language models. Instead of adding a position vector to the input, RoPE rotates each query and key vector by an angle that depends on its absolute position. The dot product between a rotated query at position i and a rotated key at position j depends only on the offset i - j, so the result behaves like a relative position encoding while being implemented through pure rotations. RoPE is used in LLaMA, GPT-NeoX, PaLM, Mistral, Qwen, DeepSeek, and most other recent decoder-only LLMs. Variants such as YaRN, NTK-aware scaling, and position interpolation extend RoPE to context lengths far beyond what was seen during pretraining [4].
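A simplified sketch of the RoPE rotation, pairing adjacent feature dimensions with the standard base of 10000 (the pairing convention varies across implementations):

```python
# RoPE sketch: each (even, odd) pair of query/key features is rotated
# by an angle proportional to the token's position, so dot products
# between rotated vectors depend only on relative offsets.
import numpy as np

def rope(x, base=10000.0):
    """x: (n, d) with d even; returns x with positions rotated in."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                 # (n, 1)
    freqs = base ** (-np.arange(0, d, 2) / d)   # (d/2,) pair frequencies
    theta = pos * freqs                         # (n, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]             # paired features
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal((2, 8, 16))
# the score between rotated q at position i and rotated k at position j
# depends only on the offset i - j, which is what makes RoPE relative
scores = rope(q) @ rope(k).T
```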
ALiBi (Attention with Linear Biases), introduced by Press et al. (2021), drops position embeddings entirely and instead adds a fixed, head-specific linear penalty to the attention score, equal to a slope times the query-key distance. This biases attention toward nearby tokens and lets the model extrapolate to inputs longer than the training context. Press et al. showed that a 1.3B model trained on length 1,024 with ALiBi extrapolates to length 2,048 with no quality loss while training 11% faster than a sinusoidal baseline. ALiBi powers MPT, BLOOM, and several other models [5].
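A sketch of the ALiBi bias construction, using the paper's geometric slope sequence 2^(-8i/h) for head i; shapes are illustrative:

```python
# ALiBi sketch: a fixed, head-specific linear penalty on attention
# scores, proportional to the query-key distance.
import numpy as np

def alibi_bias(n, h):
    slopes = 2.0 ** (-8.0 * np.arange(1, h + 1) / h)      # (h,)
    dist = np.arange(n)[None, :] - np.arange(n)[:, None]  # j - i
    dist = np.minimum(dist, 0)  # causal: penalize only past distances
    return slopes[:, None, None] * dist[None, :, :]       # (h, n, n), <= 0

bias = alibi_bias(n=6, h=4)
# per head: scores = Q @ K.T / sqrt(d_k) + bias[head]; more distant
# keys receive a larger negative bias, favoring nearby tokens
```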
The O(n^2) complexity of self-attention has driven a rich body of work aimed at reducing computation and memory usage, especially for long sequences. The methods break roughly into three categories: hardware-aware exact computation, sparse and structured patterns, and linear-complexity approximations.
| Method | Year | Approach | Time complexity | Memory complexity | Exact or approximate |
|---|---|---|---|---|---|
| FlashAttention | 2022 | IO-aware tiling and kernel fusion | O(n^2 d) | O(n) | Exact |
| FlashAttention-2 | 2023 | Better parallelism, fewer non-matmul FLOPs | O(n^2 d) | O(n) | Exact |
| FlashAttention-3 | 2024 | Hopper asynchrony, FP8 | O(n^2 d) | O(n) | Exact (FP16) or approximate (FP8) |
| Sparse attention (Child 2019) | 2019 | Strided and fixed sparse patterns | O(n sqrt(n)) | O(n sqrt(n)) | Approximate |
| Longformer (Beltagy 2020) | 2020 | Sliding window plus global tokens | O(n) | O(n) | Approximate |
| BigBird (Zaheer 2020) | 2020 | Random plus window plus global | O(n) | O(n) | Approximate |
| Reformer (Kitaev 2020) | 2020 | LSH-based bucketing | O(n log n) | O(n log n) | Approximate |
| Linformer (Wang 2020) | 2020 | Project K and V to fixed length k | O(n) | O(n) | Approximate |
| Performer (Choromanski 2020) | 2020 | Random feature kernel approximation (FAVOR+) | O(n) | O(n) | Approximate |
| Linear Transformer (Katharopoulos 2020) | 2020 | Replace softmax with kernel feature map phi | O(n) | O(n) | Approximate |
| Ring Attention | 2023 | Distribute attention across devices | O(n^2 / devices) | O(n / devices) | Exact |
FlashAttention, introduced by Dao et al. (2022), takes a hardware-aware approach. Rather than changing the mathematical computation, it reorganizes how attention is computed to minimize expensive memory transfers between GPU HBM and on-chip SRAM. Using tiling and kernel fusion, FlashAttention computes exact attention while reducing memory usage from O(n^2) to O(n). The original paper reports a 15% end-to-end wall-clock speedup on BERT-large at sequence length 512, a 3x speedup on GPT-2 at length 1K, and a 2.4x speedup on Long Range Arena at lengths 1K to 4K [3].
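The key mathematical ingredient is an online softmax: attention can be accumulated over key/value blocks with a running max and running normalizer, never materializing the (n, n) matrix. The NumPy sketch below shows only this recurrence, not the HBM/SRAM traffic of the real fused kernel; the block size and shapes are assumptions:

```python
# Online-softmax recurrence behind FlashAttention: process K/V in
# blocks, tracking a running max m and running denominator s per query
# row, and rescaling the partial output when the max changes.
import numpy as np

def blocked_attention(Q, K, V, block=4):
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[1]))
    m = np.full(n, -np.inf)        # running max per query row
    s = np.zeros(n)                # running softmax denominator
    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d_k)          # (n, block)
        m_new = np.maximum(m, scores.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale old state
        p = np.exp(scores - m_new[:, None])
        s = s * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / s[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 16))
sc = Q @ K.T / np.sqrt(Q.shape[-1])
w = np.exp(sc - sc.max(axis=1, keepdims=True))
ref = (w / w.sum(axis=1, keepdims=True)) @ V      # naive full attention
print(np.allclose(blocked_attention(Q, K, V), ref))  # True
```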
FlashAttention-2 (Dao 2023) restructured the algorithm to parallelize over the sequence length dimension as well as batch and heads, and reduced the number of non-matmul FLOPs that the previous version spent on rescaling. It reaches 50% to 73% of theoretical peak FLOPs on A100 GPUs, roughly twice the throughput of the original FlashAttention. FlashAttention-3 (Shah, Dao et al. 2024) targets the Hopper architecture and uses warp-specialized scheduling, the Tensor Memory Accelerator, and FP8 to push throughput to 1.2 petaflops per second on H100, about 1.5x to 2x faster than FlashAttention-2 on the same hardware [11].
FlashAttention has become the default attention kernel in most production LLM training and inference stacks, including PyTorch's scaled_dot_product_attention, vLLM, TensorRT-LLM, and the Hugging Face transformers library.
Sparse attention methods restrict each token to attending to only a subset of other tokens, reducing complexity from O(n^2) to O(n sqrt(n)) or O(n log n) or even O(n).
The Sparse Transformer of Child et al. (2019) was an early example, introducing two patterns called strided and fixed attention. Strided attention has each position attend to a local window plus positions at a regular stride, equivalent to attending along its row and column in a virtual 2D grid laid over the sequence, while fixed attention routes information through a small set of designated summary positions. Either pattern reduces complexity to O(n sqrt(n)) and lets the model handle sequences in the tens of thousands of tokens. Sparse Transformer set state-of-the-art density modeling results on Enwik8, CIFAR-10, and ImageNet-64 at the time [19].
Longformer (Beltagy et al. 2020) combines a sliding window of local attention with a small number of global tokens that attend to all positions and are attended to by all positions [20]. This works well for document-level tasks like summarization and question answering. BigBird (Zaheer et al. 2020) extends the idea by adding random attention edges to local and global patterns and proves that the resulting sparse attention is a universal approximator of sequence functions [21].
Reformer (Kitaev et al. 2020) replaces dot-product attention with locality-sensitive hashing, bucketing tokens whose query and key vectors are likely to have a high dot product into the same group and computing attention only within each bucket. This reduces complexity to O(n log n) and lets Reformer fit sequences of length 64K on a single GPU. Reformer also uses reversible layers to avoid storing intermediate activations [22].
These methods trade some representational capacity for linear or near-linear scaling, which makes them suitable for very long documents. Their adoption in production language models has been mixed. Mistral and other recent decoder-only LLMs use a sliding window pattern (similar in spirit to Longformer's local attention) as a way to bound the KV cache while keeping quality close to full attention.
Linear attention methods replace the softmax with a kernel function that allows the attention computation to be decomposed into a form with O(n) complexity. Katharopoulos et al. (2020) showed that by using a feature map phi instead of softmax, attention can be rewritten (up to a row-wise normalizing denominator) as [25]:
Attention(Q, K, V) = phi(Q) (phi(K)^T V)
By computing phi(K)^T V first (which has shape d_k by d_v), the computation avoids forming the n by n attention matrix. The same paper points out that this lets autoregressive transformers be implemented as recurrent networks, with a fixed-size matrix-valued state that updates one step at a time, and reports up to 4,000x speedups for very long autoregressive prediction.
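A sketch of the decomposition, using the paper's feature map phi(x) = elu(x) + 1 and including the normalizing denominator that the displayed formula omits:

```python
# Linear attention sketch (Katharopoulos et al.): computing phi(K)^T V
# first keeps the cost at O(n d_k d_v) and never forms the (n, n) matrix.
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, positive

def linear_attention(Q, K, V):
    Qf, Kf = phi(Q), phi(K)                 # (n, d_k) feature maps
    KV = Kf.T @ V                           # (d_k, d_v) summary
    Z = Kf.sum(axis=0)                      # (d_k,) normalizer terms
    return (Qf @ KV) / (Qf @ Z)[:, None]    # rows are weighted averages

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 8, 16))
print(linear_attention(Q, K, V).shape)  # (8, 16)
```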
Linformer (Wang et al. 2020) takes a different route, projecting the n by d_k key and value matrices down to a fixed length k along the sequence dimension. The resulting attention runs in O(n k), which is linear in n when k is held constant. The justification is empirical: the singular value spectrum of the attention matrix in trained transformers decays quickly, so the matrix is well approximated by a low-rank projection [23].
Performer (Choromanski et al. 2020) approximates the softmax kernel using positive orthogonal random features, an algorithm called FAVOR+. The approximation is unbiased and comes with provable accuracy guarantees, letting Performer estimate full softmax attention in O(n) time and memory [24].
Linear attention methods have generally not matched the quality of standard softmax attention for large-scale language modeling, which has limited their direct adoption in frontier LLMs. Newer hybrid designs and ideas like RWKV, RetNet, and Mamba (see below) build on the same intuition that constant-state recurrence can replace explicit pairwise attention for many sequence tasks.
Ring Attention (Liu et al. 2023) is not an algorithmic change to attention but a way of distributing the computation across many devices. Each device holds a slice of the sequence and the keys and values rotate around a ring, so each device sees every other device's K and V exactly once. The mathematics is exact, the memory per device is O(n / devices), and total throughput grows roughly linearly with the device count. Ring attention combined with FlashAttention is one of the standard approaches to training million-token context models [26].
During autoregressive generation, every new token requires computing attention over the entire prefix. Recomputing keys and values for the prefix at each step would be O(n^2) per step. The standard remedy is the KV cache: keys and values for each generated token are computed once and stored, so each new step only computes the new token's Q and reads the cached K and V.
The KV cache turns generation into an O(n) per-step operation in compute but an O(n) per-step operation in memory bandwidth, since the entire cache must be streamed from HBM at every step. For long contexts and large models, this memory bandwidth becomes the binding constraint on inference throughput. MQA, GQA, MLA, and sliding-window attention are all in part attempts to shrink the KV cache so that more requests can fit on a single GPU and so that decoding is bound by compute rather than bandwidth.
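A minimal sketch of KV-cached decoding for a single head, with random stand-in weights and embeddings: each step computes projections only for the newest token and attends over the growing cache:

```python
# KV-cache sketch: K and V for each token are computed once and stored,
# so each decoding step is O(n) in compute rather than O(n^2).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, d_k = 16, 16
W_Q, W_K, W_V = (rng.standard_normal((d, d_k)) for _ in range(3))
K_cache, V_cache = [], []

for step in range(5):                  # pretend tokens arrive one by one
    x = rng.standard_normal(d)         # embedding of the newest token
    q = x @ W_Q
    K_cache.append(x @ W_K)            # computed once, then reused
    V_cache.append(x @ W_V)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d_k)      # attend over all cached keys
    out = softmax(scores) @ V          # causal by construction
print(out.shape)  # (16,)
```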
Production inference servers such as vLLM and TensorRT-LLM use a paged KV cache, similar in spirit to virtual memory, that lets cached blocks for different requests share GPU memory more efficiently and supports continuous batching across many concurrent users.
Self-attention is the primary mechanism through which transformers build contextual representations. Each transformer layer consists of two sublayers: a multi-head self-attention sublayer followed by a position-wise feed-forward network. Both sublayers are wrapped in residual connections and layer normalization [1].
In the encoder, self-attention is bidirectional: each token can attend to all other tokens. This produces richly contextual representations useful for tasks like classification, named entity recognition, and extractive question answering. BERT and its variants use encoder-only architectures with bidirectional self-attention.
In the decoder, self-attention is causal (masked), restricting each token to attending only to earlier positions. This is the architecture used by GPT, LLaMA, and virtually all modern autoregressive language models.
Self-attention is also the mechanism that enables in-context learning, the ability of large language models to adapt to new tasks based on examples provided in the prompt. Olsson et al. (2022) at Anthropic argued that a specific kind of attention head, the induction head, is responsible for much of the in-context learning ability of small attention-only models. An induction head learns to attend back to a previous occurrence of the current token and copy what came after it, which is a primitive form of pattern matching that supports many in-context learning behaviors [12].
While self-attention was popularized through language tasks, it has been successfully applied across many domains, including computer vision, where the Vision Transformer (ViT) applies self-attention to sequences of image patches [13]; speech and audio modeling; biological sequence modeling, such as protein structure prediction; and reinforcement learning and recommendation systems that treat trajectories or interaction histories as sequences.
Attention weights are easy to visualize, which led early work to treat them as natural explanations for model behavior. The picture is more nuanced.
Jain and Wallace (2019) ran a controlled study on text classification models and concluded that attention is not explanation. They found that learned attention weights are often weakly correlated with gradient-based feature importance and that very different attention distributions can produce essentially the same predictions. The implication is that any single attention map is one of many possible plausible explanations, not the explanation [14].
Wiegreffe and Pinter (2019) responded with "Attention is not not Explanation," pointing out that the experiments in Jain and Wallace allow attention to be replaced freely without retraining, which is too permissive a test. They proposed alternative tests under which attention can serve as a useful explanation for at least some models and tasks. The takeaway from this exchange is that attention weights are useful but should not be over-interpreted, and that any claim of explanation should be supported by behavioral tests, not raw weights alone.
Abnar and Zuidema (2020) addressed the problem that information mixes across layers and proposed two methods, attention rollout and attention flow, that estimate how much each input token contributes to a given output by recursively combining attention weights across layers. These methods correlate better with gradient-based importance than raw attention does and are still used as a quick sanity check on what a transformer is looking at [15].
A more recent line of work, mechanistic interpretability, treats attention heads as small computational circuits to be reverse-engineered. The Transformer Circuits Thread led by Anthropic has developed a mathematical framework in which each attention head is decomposed into a QK circuit (which positions to attend to) and an OV circuit (what information to write into the residual stream). Within this framework, Olsson et al. (2022) identified induction heads in small attention-only models, in which a head in one layer detects that the current token has appeared before, and a head in the next layer copies the token that followed it. Induction heads emerge during training in a sharp phase transition that closely tracks the appearance of in-context learning, suggesting that this circuit is one of the basic building blocks of LLM behavior [12, 16].
The quadratic O(n^2) compute and memory cost in sequence length is the most discussed limitation of self-attention. It bounds practical context lengths and forces long-context models to lean on FlashAttention, GQA or MLA, sliding window or sparse patterns, ring attention, and KV-cache compression to keep training and inference tractable. Even with these techniques, attention remains a major share of the compute budget when training or serving frontier models at million-token context lengths.
A second limitation is that self-attention is permutation-equivariant by construction, so positional information has to be added separately. Different positional encodings have different inductive biases and different abilities to extrapolate beyond the training context. Length extrapolation remains an active research area, with techniques like position interpolation, NTK-aware RoPE scaling, YaRN, and ALiBi all attempting to push the context window beyond what the model originally saw.
A third limitation is interpretability. As discussed above, attention weights look interpretable but do not always explain model decisions. Mechanistic interpretability has made progress on small models but scaling these methods to frontier-size systems is hard.
A fourth limitation, which is more architectural than computational, is that self-attention is uniform: every token attends to every other token through the same kind of dot-product comparison. This treats all tokens the same regardless of role, which can be wasteful when only a small subset of tokens are relevant at each step.
Self-attention is the foundation of all modern LLMs in production today, but it is no longer the only option. Two related lines of work try to remove or reduce reliance on quadratic attention.
State space models such as Mamba (Gu and Dao, 2023) replace attention with a selective recurrent layer whose parameters depend on the input. Mamba scales linearly in sequence length and runs about 5x faster at inference than a transformer of the same size. Mamba-3B matches the quality of transformers twice its size on language modeling, audio, and genomics benchmarks [17]. RWKV and RetNet are similar in spirit, recasting attention as a constant-state recurrence.
Hybrid architectures combine attention layers with state space layers to capture the strengths of both. AI21's Jamba (2024) is a hybrid Transformer-Mamba mixture-of-experts model that interleaves Mamba and attention layers (one attention block out of every eight, in the original ratio) and supports context lengths up to 256K tokens on a single 80GB GPU. Jamba 1.5 scales the design to 398B total parameters with 94B active and reports up to 3x throughput improvement over LLaMA 2 70B and Mixtral [18].
Despite these alternatives, the major frontier model releases of 2024 and 2025 (GPT-4o, Claude 3 and 4, Gemini 1.5 and 2, LLaMA 3 and 4, Qwen, DeepSeek-V3 and R1) remain transformer-based. Self-attention with FlashAttention kernels, GQA or MLA for KV-cache reduction, RoPE for positions, and increasingly sliding-window or hybrid attention patterns is the dominant recipe. Whether attention will keep its central role or be supplanted by a recurrent or hybrid mechanism is one of the more interesting open architectural questions in the field.
Attention as a learnable alignment between a query and a set of values appeared in machine translation with Bahdanau et al. (2014), who used it to let a decoder soft-align over an encoder's hidden states [2]. Luong et al. (2015) introduced multiplicative dot-product attention. The first explicit uses of self-attention (sometimes called intra-attention) appeared in the context of natural language inference and abstractive summarization in 2016 and 2017. Vaswani et al. (2017) gave self-attention its modern definition as the load-bearing component of the transformer, replacing recurrence and convolution entirely [1].
Since then, self-attention has been the central object of study in modern deep learning. The term "attention" today usually refers, by default, to scaled dot-product self-attention as defined in the transformer paper, with multi-head, causal masking, and FlashAttention as the standard implementation choices.
[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems (NeurIPS) 30. arXiv:1706.03762.
[2] Bahdanau, D., Cho, K., and Bengio, Y. (2014). "Neural Machine Translation by Jointly Learning to Align and Translate." arXiv:1409.0473.
[3] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." NeurIPS 2022. arXiv:2205.14135.
[4] Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., and Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864. Published in Neurocomputing 568 (2024).
[5] Press, O., Smith, N. A., and Lewis, M. (2021). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." ICLR 2022. arXiv:2108.12409.
[6] Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150.
[7] Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." EMNLP 2023. arXiv:2305.13245.
[8] DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434.
[9] Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). "Self-Attention with Relative Position Representations." NAACL 2018. arXiv:1803.02155.
[10] Clark, K., Khandelwal, U., Levy, O., and Manning, C. D. (2019). "What Does BERT Look At? An Analysis of BERT's Attention." BlackboxNLP 2019. arXiv:1906.04341.
[11] Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. (2024). "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision." NeurIPS 2024. arXiv:2407.08608.
[12] Olsson, C., Elhage, N., Nanda, N., Joseph, N., et al. (2022). "In-context Learning and Induction Heads." Transformer Circuits Thread, Anthropic.
[13] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. (2020). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale." ICLR 2021. arXiv:2010.11929.
[14] Jain, S. and Wallace, B. C. (2019). "Attention is not Explanation." NAACL 2019. arXiv:1902.10186. See also Wiegreffe, S. and Pinter, Y. (2019). "Attention is not not Explanation." EMNLP 2019. arXiv:1908.04626.
[15] Abnar, S. and Zuidema, W. (2020). "Quantifying Attention Flow in Transformers." ACL 2020. arXiv:2005.00928.
[16] Elhage, N., Nanda, N., Olsson, C., et al. (2021). "A Mathematical Framework for Transformer Circuits." Transformer Circuits Thread, Anthropic.
[17] Gu, A. and Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752.
[18] Lieber, O., Lenz, B., Bata, H., et al. (2024). "Jamba: A Hybrid Transformer-Mamba Language Model." arXiv:2403.19887.
[19] Child, R., Gray, S., Radford, A., and Sutskever, I. (2019). "Generating Long Sequences with Sparse Transformers." arXiv:1904.10509.
[20] Beltagy, I., Peters, M. E., and Cohan, A. (2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150.
[21] Zaheer, M., Guruganesh, G., Dubey, A., et al. (2020). "Big Bird: Transformers for Longer Sequences." NeurIPS 2020. arXiv:2007.14062.
[22] Kitaev, N., Kaiser, L., and Levskaya, A. (2020). "Reformer: The Efficient Transformer." ICLR 2020. arXiv:2001.04451.
[23] Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. (2020). "Linformer: Self-Attention with Linear Complexity." arXiv:2006.04768.
[24] Choromanski, K., Likhosherstov, V., Dohan, D., et al. (2020). "Rethinking Attention with Performers." ICLR 2021. arXiv:2009.14794.
[25] Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. (2020). "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." ICML 2020. arXiv:2006.16236.
[26] Liu, H., Zaharia, M., and Abbeel, P. (2023). "Ring Attention with Blockwise Transformers for Near-Infinite Context." arXiv:2310.01889.