Self-attention is a core mechanism in modern deep learning that allows a neural network to relate different positions within a single input sequence to compute a richer representation of that sequence. Unlike traditional attention mechanisms where queries attend to a separate context (as in sequence-to-sequence models), self-attention derives its queries, keys, and values entirely from the same input. Introduced as "intra-attention" in earlier work and formalized as a central building block in the transformer architecture by Vaswani et al. (2017), self-attention has become the dominant mechanism powering large language models, vision transformers, and a wide range of other architectures across natural language processing, computer vision, and beyond [1].
At its core, self-attention answers a simple question: for each element in a sequence, which other elements in the same sequence are most relevant to it? Consider the sentence "The cat sat on the mat because it was tired." To understand what "it" refers to, a model needs to relate the pronoun back to "cat" earlier in the sentence. Self-attention enables this by computing a weighted sum over all positions in the sequence, where the weights reflect the relevance of each position to the current one.
The term "self" distinguishes this mechanism from cross-attention, where queries come from one sequence and keys/values come from a different sequence (for example, an encoder-decoder setup). In self-attention, all three components originate from the same input, making it a form of intra-sequence reasoning.
Given an input sequence of n tokens, each represented as a d-dimensional vector, the input can be organized as a matrix X of shape (n, d). Self-attention transforms this input through three learned linear projections to produce queries, keys, and values:

Q = X W_Q,    K = X W_K,    V = X W_V
Here, W_Q, W_K, and W_V are learned weight matrices. W_Q and W_K have shape (d, d_k), and W_V has shape (d, d_v), where d_k is the key/query dimension and d_v is the value dimension.
The attention output is then computed using scaled dot-product attention:
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
Breaking this down step by step:

1. Compute Q K^T, an (n, n) matrix of dot products in which entry (i, j) scores how relevant position j is to position i.
2. Divide by sqrt(d_k). Without this scaling, the dot products grow with the dimension, pushing the softmax into regions with vanishingly small gradients.
3. Apply softmax row-wise, converting each row of scores into a probability distribution over positions.
4. Multiply by V, producing for each position a weighted average of all value vectors.
The result is a new sequence of n vectors, each of which incorporates information from all other positions in the input, weighted by learned relevance.
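The computation can be sketched in a few lines of NumPy. This is a minimal illustration of the formula above, not an optimized implementation; the toy dimensions (4 tokens, d = 8) are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over positions
    return weights @ V                              # (n, d_v) weighted sum of values

# toy example: 4 tokens, d = d_k = d_v = 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of the value vectors, with the combination weights determined by query-key similarity.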
Self-attention solved several fundamental problems that plagued earlier sequence modeling approaches.
Recurrent neural networks (RNNs) process sequences step by step, meaning information from early tokens must pass through many sequential operations to influence later tokens. In practice, this makes it difficult for RNNs to learn dependencies spanning dozens or hundreds of tokens, even with gating mechanisms such as those in LSTMs. Self-attention computes direct pairwise interactions between all positions in a single operation, so the path length between any two tokens is O(1) rather than O(n) [1].
Because self-attention computes all pairwise interactions simultaneously, the entire operation can be parallelized efficiently on modern GPUs. RNNs, by contrast, require sequential computation: the hidden state at time step t depends on the hidden state at time step t-1. This sequential bottleneck made RNN training much slower on parallel hardware.
The attention weight matrix provides a form of soft alignment that can be inspected. While attention weights should not be treated as definitive explanations of model behavior, they do offer a useful diagnostic tool for understanding which tokens the model considers relevant to each other.
The following table compares self-attention to previous sequence modeling approaches:
| Property | Self-Attention | RNN | CNN |
|---|---|---|---|
| Maximum path length | O(1) | O(n) | O(log n) |
| Computation per layer | O(n^2 * d) | O(n * d^2) | O(k * n * d^2) |
| Parallelizable | Yes | No | Yes |
| Long-range dependencies | Direct | Difficult | Moderate |
| Sequential operations | O(1) | O(n) | O(1) |
The transformer architecture employs two distinct forms of attention: self-attention and cross-attention. Understanding the difference is essential for grasping how encoder-decoder models work.
In self-attention, queries, keys, and values all come from the same sequence. Both the encoder and decoder in the original transformer use self-attention layers. The encoder's self-attention lets each token in the input attend to all other input tokens. The decoder's self-attention (which is masked; see below) lets each token in the output attend to previously generated output tokens.
In cross-attention (also called encoder-decoder attention), queries come from the decoder while keys and values come from the encoder's output. This is the mechanism by which the decoder "reads" the encoded input. For example, in machine translation, cross-attention allows each word being generated in the target language to attend to all words in the source language sentence.
| Aspect | Self-Attention | Cross-Attention |
|---|---|---|
| Source of Q | Same sequence | Decoder |
| Source of K, V | Same sequence | Encoder output |
| Purpose | Model intra-sequence relationships | Model inter-sequence relationships |
| Found in | Encoder layers, decoder layers | Decoder layers only (in encoder-decoder models) |
| Example use | Contextual word representation | Aligning translation output to input |
Decoder-only models such as GPT and LLaMA use only self-attention (with causal masking), since they do not have a separate encoder. Encoder-decoder models like T5 and the original transformer use both self-attention and cross-attention.
In autoregressive language modeling, a model generates tokens one at a time, left to right. During training, the model processes an entire sequence at once for efficiency, but each position should only be able to attend to positions at or before itself. Allowing a position to attend to future tokens would let the model "cheat" by looking at the answer it is supposed to predict.
Masked self-attention (also called causal self-attention) enforces this constraint by applying a mask to the attention score matrix before the softmax step. Specifically, all entries above the diagonal of the (n, n) score matrix are set to negative infinity, so that after softmax they become zero. This means position i can only attend to positions 1 through i.
The masked attention formula is:
Attention(Q, K, V) = softmax(mask(Q K^T / sqrt(d_k))) V
where mask(.) sets future positions to negative infinity.
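The mask is straightforward to implement: build a boolean upper-triangular matrix and overwrite the corresponding scores with negative infinity before the softmax. A minimal NumPy sketch (toy shapes, not from any particular model):

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Masked (causal) attention: position i attends only to positions <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(mask, -np.inf, scores)          # future positions -> -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # -inf entries become exactly 0
    return weights @ V

rng = np.random.default_rng(1)
Q = K = V = rng.normal(size=(5, 4))
out = causal_self_attention(Q, K, V)
# the first position can only attend to itself, so its output equals V[0]
print(np.allclose(out[0], V[0]))  # True
```

After the softmax, every weight above the diagonal is exactly zero, so no information flows backward from future tokens.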
Causal masking is used in all decoder-only models, including the GPT family, LLaMA, Mistral, and other autoregressive large language models. During inference, when tokens are generated one at a time, the causal property is naturally satisfied since future tokens do not yet exist. The mask is primarily needed during training when full sequences are processed in parallel.
A single self-attention operation can only capture one type of relationship at a time. Multi-head attention, introduced alongside self-attention in the original transformer paper, addresses this by running multiple self-attention operations in parallel, each with its own learned projections [1].
Given h attention heads, the input X is projected into h separate sets of queries, keys, and values using different weight matrices. Each head computes attention independently:
head_i = Attention(X W_Q^i, X W_K^i, X W_V^i)
The outputs of all heads are concatenated and projected through a final linear layer:
MultiHead(X) = Concat(head_1, ..., head_h) W_O
where W_O is a learned output projection matrix.
Typically, the per-head dimension is d_k = d_model / h, so the total computation is roughly equivalent to a single head with full dimensionality. The original transformer used h = 8 heads with d_model = 512, giving d_k = 64 per head [1].
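The split-compute-concatenate pattern can be sketched as follows. This sketch folds all heads' projections into single (d_model, d_model) weight matrices and reshapes, which is how the computation is typically batched; the dimensions are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """h parallel attention heads, concatenated and projected through W_O."""
    n, d_model = X.shape
    d_k = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                    # (n, d_model) each
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)  # -> (h, n, d_k)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)     # (h, n, n)
    heads = softmax(scores) @ Vh                           # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1..head_h)
    return concat @ W_O

rng = np.random.default_rng(2)
d_model, h, n = 16, 4, 6
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (6, 16)
```

Because each head works in a d_k = d_model / h subspace, the total cost is close to that of one full-width head.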
Different heads tend to learn different types of relationships. Empirical analysis has shown that some heads specialize in syntactic relationships (subject-verb agreement, dependency parsing), while others focus on positional patterns (attending to adjacent tokens) or semantic relationships. This diversity of learned patterns is one reason multi-head attention is so effective.
Several important variations have emerged to reduce the computational cost of multi-head attention during inference:
| Variant | Description | Key Benefit |
|---|---|---|
| Multi-Head Attention (MHA) | Each head has its own Q, K, V projections | Maximum expressiveness |
| Multi-Query Attention (MQA) | All heads share a single K and V, separate Q per head | Greatly reduced KV cache size |
| Grouped-Query Attention (GQA) | Groups of heads share K and V | Balance between MHA and MQA |
MQA was proposed by Shazeer (2019) and significantly reduces the memory needed to store key-value pairs during autoregressive generation. GQA, introduced by Ainslie et al. (2023), provides a middle ground that retains most of MHA's quality while achieving most of MQA's efficiency gains. GQA is used in LLaMA 2 (70B), Mistral, and many other recent models [2].
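The sharing scheme can be sketched in NumPy: g key/value heads are reused by groups of h / g query heads. The head counts and dimensions below are illustrative, not taken from any particular model.

```python
import numpy as np

def grouped_query_attention(Qh, Kg, Vg):
    """GQA sketch: h query heads share g < h key/value heads.
    Qh: (h, n, d_k); Kg, Vg: (g, n, d_k).
    With g = 1 this reduces to MQA; with g = h it is standard MHA."""
    h, g = Qh.shape[0], Kg.shape[0]
    assert h % g == 0
    # each group of h // g query heads reuses the same K/V head
    K = np.repeat(Kg, h // g, axis=0)  # (h, n, d_k)
    V = np.repeat(Vg, h // g, axis=0)
    d_k = Qh.shape[-1]
    scores = Qh @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V  # (h, n, d_k)

rng = np.random.default_rng(3)
h, g, n, d_k = 8, 2, 5, 4
out = grouped_query_attention(rng.normal(size=(h, n, d_k)),
                              rng.normal(size=(g, n, d_k)),
                              rng.normal(size=(g, n, d_k)))
print(out.shape)  # (8, 5, 4)
```

During autoregressive decoding, only the g key/value heads need to be cached, shrinking the KV cache by a factor of h / g relative to MHA.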
The dominant computational cost of self-attention is the matrix multiplication Q K^T, which produces the (n, n) attention score matrix. For a sequence of length n with head dimension d_k, this operation requires O(n^2 * d_k) floating-point operations. The subsequent multiplication with V adds another O(n^2 * d_v). Overall, self-attention has:

- Time complexity: O(n^2 * d)
- Memory complexity: O(n^2) for the attention matrix, plus O(n * d) for the inputs, projections, and outputs
The quadratic scaling in sequence length n is the primary bottleneck of self-attention. Doubling the sequence length quadruples both computation and memory. For a 2048-token sequence, the attention matrix has about 4 million entries; for a 128K-token sequence, it balloons to over 16 billion entries.
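The growth is easy to check numerically. The byte figures below assume fp16 scores for a single head with no batching, purely for illustration:

```python
# size of the (n, n) attention score matrix as sequence length grows
for n in (2_048, 8_192, 32_768, 131_072):
    entries = n * n
    mib = entries * 2 / 2**20  # 2 bytes per fp16 entry
    print(f"n={n:>7}: {entries:>17,} entries, {mib:>10,.0f} MiB (fp16, one head)")
```

At n = 2048 the matrix is a modest 8 MiB, but at n = 131072 it would occupy 32 GiB per head, which is why long-context models never materialize it directly.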
This quadratic cost has motivated extensive research into more efficient alternatives, which are described in the next section.
The O(n^2) complexity of self-attention has driven a rich body of work aimed at reducing computation and memory usage, especially for long sequences.
FlashAttention, introduced by Dao et al. (2022), takes a hardware-aware approach. Rather than changing the mathematical computation, it reorganizes how attention is computed to minimize expensive memory transfers between GPU HBM and on-chip SRAM. Using tiling and kernel fusion, FlashAttention computes exact attention while reducing memory usage from O(n^2) to O(n) and achieving 2-7x wall-clock speedups. FlashAttention-2 (2023) further improved GPU utilization to 50-73% of the theoretical maximum on A100 GPUs [3].
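The following is not FlashAttention itself (which is a fused GPU kernel), but a NumPy sketch of the online-softmax trick at its core, shown for a single query vector: keys are processed in blocks while a running maximum and normalizer are maintained, so the full score matrix is never materialized.

```python
import numpy as np

def streaming_attention_row(q, K, V, block=64):
    """Online-softmax attention for one query: exact result, O(block) score memory."""
    d_k = q.shape[-1]
    m = -np.inf                 # running max of scores seen so far
    denom = 0.0                 # running softmax normalizer
    acc = np.zeros_like(V[0])   # running weighted sum of values
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Kb @ q / np.sqrt(d_k)      # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)      # rescale previous accumulators to the new max
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ Vb
        m = m_new
    return acc / denom

rng = np.random.default_rng(4)
n, d = 200, 16
q, K, V = rng.normal(size=(d,)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
# agrees with the naive computation that materializes all n scores
s = K @ q / np.sqrt(d)
naive = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(streaming_attention_row(q, K, V), naive))  # True
```

FlashAttention applies this idea to tiles of queries and keys inside a single kernel, keeping the working set in on-chip SRAM.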
Sparse attention methods restrict each token to attending to only a subset of other tokens, reducing complexity from O(n^2) to O(n * sqrt(n)) or O(n * log(n)). Notable approaches include:

- Sparse Transformer (Child et al., 2019): fixed strided and local attention patterns.
- Longformer (Beltagy et al., 2020): sliding-window local attention plus a small number of global tokens.
- BigBird (Zaheer et al., 2020): a combination of local, global, and random attention.
- Reformer (Kitaev et al., 2020): locality-sensitive hashing to restrict attention to similar queries and keys.
These methods trade some representational capacity for linear or near-linear scaling, making them suitable for very long documents.
Linear attention methods replace the softmax with a kernel function that allows the attention computation to be decomposed into a form with O(n) complexity. Katharopoulos et al. (2020) showed that by using a feature map phi instead of softmax, attention can be rewritten as:
Attention(Q, K, V) = phi(Q) (phi(K)^T V)

(up to a row-wise normalization term, phi(Q) (phi(K)^T 1), which can likewise be computed without forming an n-by-n matrix)
By computing phi(K)^T V first (which is d_k by d_v), the computation avoids forming the n by n attention matrix. However, linear attention methods have generally not matched the quality of standard softmax attention for language modeling tasks, limiting their adoption in state-of-the-art models.
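A minimal sketch of this reordering, using the feature map phi(x) = elu(x) + 1 from Katharopoulos et al. (2020); the shapes here are illustrative:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized linear attention: compute phi(K)^T V first, never the (n, n) matrix."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, strictly positive
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                  # (d_k, d_v): O(n * d_k * d_v)
    Z = Qp @ Kp.sum(axis=0)        # (n,): per-query normalizer
    return (Qp @ KV) / Z[:, None]  # (n, d_v)

rng = np.random.default_rng(5)
out = linear_attention(rng.normal(size=(10, 4)),
                       rng.normal(size=(10, 4)),
                       rng.normal(size=(10, 6)))
print(out.shape)  # (10, 6)
```

Because the (d_k, d_v) product KV is independent of position, it can also be maintained as a running sum during autoregressive decoding, giving O(1) cost per generated token.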
Several additional strategies address attention efficiency:
| Method | Approach | Complexity |
|---|---|---|
| FlashAttention | IO-aware exact computation | O(n^2) compute, O(n) memory |
| Longformer | Local + global sparse patterns | O(n) |
| Performer | Random feature approximation | O(n) |
| Linear Transformer | Kernel-based linearization | O(n) |
| Reformer | LSH-based sparse attention | O(n log n) |
| Ring Attention | Distributed attention across devices | O(n^2 / devices) |
State space models such as Mamba represent an alternative architecture that avoids explicit pairwise attention entirely, using a recurrent formulation with O(n) complexity. Hybrid architectures that combine attention layers with state space model layers have also shown promising results.
Self-attention is the primary mechanism through which transformers build contextual representations. Each transformer layer consists of two sublayers: a multi-head self-attention sublayer followed by a position-wise feed-forward network. Both sublayers are wrapped in residual connections and layer normalization [1].
In the encoder, self-attention is bidirectional: each token can attend to all other tokens. This produces richly contextual representations useful for tasks like classification, named entity recognition, and extractive question answering. BERT and its variants use encoder-only architectures with bidirectional self-attention.
In the decoder, self-attention is causal (masked), restricting each token to attending only to earlier positions. This is the architecture used by GPT, LLaMA, and virtually all modern autoregressive language models.
Self-attention is also the mechanism that enables in-context learning, the ability of large language models to adapt to new tasks based on examples provided in the prompt. The attention mechanism can dynamically route information from demonstration examples to the positions where predictions are made, without any parameter updates.
While self-attention was popularized through language tasks, it has been successfully applied across many domains:

- Computer vision: Vision Transformers (ViT) split an image into patches and treat the patches as a token sequence.
- Speech and audio: transformer-based models are widely used for speech recognition and audio generation.
- Biology: AlphaFold 2 uses attention over residue pairs for protein structure prediction.
- Multimodal models: attention relates tokens drawn from text, images, and audio within a single model.
One important limitation of self-attention is that it is permutation-equivariant: the same output is produced regardless of the order of input tokens (up to the corresponding reordering of outputs). This means self-attention has no inherent notion of position or sequence order.
To address this, transformers add positional encoding to the input. The original transformer used fixed sinusoidal encodings, while modern models typically use learned or rotary positional encodings. Rotary Position Embedding (RoPE), introduced by Su et al. (2021), has become the dominant approach in recent large language models, encoding relative position information directly into the query and key vectors [5].
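As a concrete illustration, here is a sketch of the original fixed sinusoidal scheme (modern models typically use RoPE instead, which rotates query/key pairs rather than adding a vector to the input):

```python
import numpy as np

def sinusoidal_positional_encoding(n, d_model):
    """Fixed sinusoidal encodings from the original transformer:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(n)[:, None]                    # (n, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)  # (n, d_model / 2)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(128, 64)
print(pe.shape)   # (128, 64)
print(pe[0, :4])  # position 0: sin(0) = 0, cos(0) = 1 -> [0. 1. 0. 1.]
```

Adding these vectors to the token embeddings breaks permutation equivariance, giving the attention layers access to position information.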
The concept of attention in neural networks predates the transformer. Bahdanau et al. (2014) introduced additive attention for machine translation, allowing a decoder to selectively focus on different parts of the encoder's output. Luong et al. (2015) proposed multiplicative (dot-product) attention as a simpler alternative.
Self-attention specifically (where the sequence attends to itself rather than to a different sequence) was used in various forms before the transformer. Cheng et al. (2016) applied self-attention to reading comprehension, and Parikh et al. (2016) used it for natural language inference.
The transformer paper by Vaswani et al. (2017) was the first architecture to rely entirely on self-attention for both the encoder and decoder, dispensing with recurrence and convolutions altogether. The paper demonstrated that this attention-only architecture achieved state-of-the-art results on machine translation while being significantly faster to train [1]. This result catalyzed the modern era of transformer-based models.