See also: Attention, Transformer, Multi-Head Self-Attention
Self-attention, also known as intra-attention, is a mechanism in deep learning that allows each element of an input sequence to attend to every other element within the same sequence. Introduced as a core building block of the Transformer architecture in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al., self-attention replaced recurrent neural networks (RNNs) and convolutional approaches as the dominant method for modeling relationships in sequential data. It is the foundation of modern large language models such as GPT, BERT, and LLaMA, and has also been adopted in computer vision, speech recognition, and many other domains.
Unlike traditional attention, where a model attends from one sequence to another (for example, from a decoder to an encoder), self-attention operates within a single sequence. Every token computes a weighted combination of all tokens in the same sequence, enabling the model to capture both local and long-range dependencies in a single operation.
The self-attention mechanism takes a sequence of input vectors (for instance, word embeddings) and transforms each one into three separate vectors using learned weight matrices:

- Query (Q): what the current token is looking for in the other tokens.
- Key (K): what each token offers to be matched against.
- Value (V): the information a token contributes once it is attended to.
All three matrices (W_Q, W_K, W_V) are learned parameters of the neural network, updated during training through backpropagation. For a given input matrix X of shape (n x d_model), where n is the sequence length, the projections are:
Q = X * W_Q
K = X * W_K
V = X * W_V
where W_Q, W_K, and W_V each have shape (d_model x d_k), and d_k is the dimensionality of the query and key vectors.
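The following NumPy sketch illustrates these projections. The sequence length, dimensions, and random weights are illustrative placeholders, not trained parameters:

```python
import numpy as np

# Illustrative sizes: n tokens, model dimension 512, query/key dimension 64.
n, d_model, d_k = 5, 512, 64

rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))      # input embeddings, one row per token

# In a real model these are learned parameters; here they are random placeholders.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q    # queries, shape (n, d_k)
K = X @ W_K    # keys,    shape (n, d_k)
V = X @ W_V    # values,  shape (n, d_k)
```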
Once the query, key, and value matrices are computed, the self-attention output is calculated using the scaled dot-product attention formula:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
This formula proceeds in several steps:
| Step | Operation | Purpose |
|---|---|---|
| 1. Dot product | Q * K^T | Compute a raw similarity score between every pair of tokens. The result is an (n x n) attention matrix. |
| 2. Scaling | Divide by sqrt(d_k) | Prevent the dot products from growing too large in magnitude, which would push the softmax into regions with extremely small gradients. |
| 3. Softmax | softmax over each row | Normalize the scores into a probability distribution so that the weights for each query sum to 1. |
| 4. Weighted sum | Multiply by V | Produce the output by taking a weighted combination of the value vectors, where higher-scoring tokens contribute more. |
The scaling factor of 1/sqrt(d_k) is not a minor implementation detail. The dot product between a query and key vector is a sum of d_k independent terms. As d_k grows, the variance of this sum grows proportionally, meaning the dot product values can become very large or very small. Large values push the softmax function into saturated regions where the gradients become vanishingly small, making gradient descent training difficult. Dividing by sqrt(d_k) brings the variance back to approximately 1, keeping the softmax in a well-behaved numerical range.
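Putting the four steps together, a minimal NumPy implementation of scaled dot-product self-attention might look as follows (a sketch for illustration, not an optimized implementation):

```python
import numpy as np

def self_attention(Q, K, V):
    """Scaled dot-product self-attention following the formula above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # steps 1-2: pairwise scores, scaled
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability; softmax unchanged
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # step 3: each row sums to 1
    return weights @ V                                        # step 4: weighted sum of value vectors
```

Applied to the Q, K, and V matrices from the projection sketch above, the output has shape (n, d_k): one updated vector per token.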
A common analogy compares self-attention to a library search. The query is the question a reader brings; the key is the label on each book's spine; and the value is the content inside the book. For each question (query), the mechanism checks how well it matches each book label (key), and then returns a blend of the book contents (values), weighted by how relevant each label was to the question.
In bidirectional self-attention, every token can attend to every other token in the sequence, including tokens that appear both before and after it. This approach is used in encoder-based models such as BERT and RoBERTa. Since the model can look at the full context in both directions, bidirectional self-attention is well suited for tasks that require understanding the entire input, such as text classification, named entity recognition, and question answering.
In BERT's masked language modeling (MLM) objective, the model predicts randomly masked tokens by leveraging context from both preceding and following tokens. This is possible precisely because bidirectional self-attention imposes no restrictions on which positions can interact.
Causal self-attention, also called masked self-attention, restricts each token to attend only to itself and to tokens that appear earlier in the sequence. Future positions are masked out by setting their attention scores to negative infinity before the softmax step, ensuring they receive zero weight. This creates a triangular attention pattern where each token at position i can only see positions 1 through i.
Causal self-attention is essential for autoregressive models such as GPT and LLaMA, where the model generates text one token at a time from left to right. During training, causal masking ensures that the model cannot "cheat" by looking at tokens it has not yet predicted. During inference, hidden states computed for earlier tokens remain valid and do not need to be recalculated, enabling efficient sequential generation through a key-value cache.
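A minimal sketch of the masking step, assuming the same Q, K, and V matrices as above (SciPy's softmax is used for brevity):

```python
import numpy as np
from scipy.special import softmax

def causal_self_attention(Q, K, V):
    """Masked (causal) self-attention: position i attends only to positions <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # future positions get negative infinity
    weights = softmax(scores, axis=-1)                   # -inf entries receive exactly zero weight
    return weights @ V
```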
Self-attention should be distinguished from cross-attention. In self-attention, the query, key, and value matrices all derive from the same input sequence. In cross-attention, the query comes from one sequence (typically the decoder), while the key and value come from a different sequence (typically the encoder output). Cross-attention appears in encoder-decoder architectures used for tasks like machine translation, where the decoder must attend to the source sentence while generating the target sentence.
| Feature | Self-Attention | Cross-Attention |
|---|---|---|
| Source of Q, K, V | All from the same sequence | Q from one sequence; K, V from another |
| Typical use | Encoder layers, decoder layers (masked) | Decoder layers connecting to encoder |
| Purpose | Capture relationships within a sequence | Transfer information between two sequences |
| Example models | BERT, GPT | Encoder-decoder Transformer, T5 |
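The difference is easiest to see in code. In the sketch below (illustrative shapes and weights), only the source of the key and value projections changes:

```python
import numpy as np
from scipy.special import softmax

def cross_attention(X_dec, X_enc, W_Q, W_K, W_V):
    """Queries come from the decoder sequence; keys and values from the encoder output."""
    Q = X_dec @ W_Q        # (n_dec, d_k)
    K = X_enc @ W_K        # (n_enc, d_k)
    V = X_enc @ W_V        # (n_enc, d_k)
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (n_dec, n_enc) attention matrix
    return weights @ V                                    # one output vector per decoder position
```

Replacing X_enc with X_dec in all three projections recovers self-attention.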
A key mathematical property of self-attention is that it is permutation invariant (or, more precisely, permutation equivariant). The attention computation depends only on the content of the token vectors, not on their positions in the sequence: if the input tokens are shuffled, the output vectors are shuffled in exactly the same way, because the mechanism has no built-in notion of order. This means that, without additional information, self-attention cannot distinguish "The cat sat on the mat" from "Mat the on sat cat the."
To solve this, positional encoding is added to the input embeddings before they enter the self-attention layers. The original Transformer used fixed sinusoidal positional encodings, where each position is represented by a unique pattern of sine and cosine values at different frequencies. Other approaches include:

- Learned absolute positional embeddings, used in models such as BERT and GPT.
- Relative positional encodings, which encode the distance between tokens rather than their absolute positions.
- Rotary position embeddings (RoPE), used in models such as LLaMA, which rotate the query and key vectors by position-dependent angles.
- ALiBi, which adds a bias to the attention scores proportional to the distance between tokens.
These techniques break the permutation invariance and allow the model to leverage word order.
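A sketch of the original fixed sinusoidal encoding, assuming an even d_model; in practice the result is simply added to the token embeddings before the first self-attention layer:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Fixed sinusoidal encodings as in the original Transformer (assumes even d_model)."""
    positions = np.arange(n_positions)[:, None]               # (n, 1)
    dims = np.arange(0, d_model, 2)[None, :]                  # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)    # one frequency per dimension pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices: sine
    pe[:, 1::2] = np.cos(angles)   # odd indices: cosine
    return pe

# X = X + sinusoidal_positional_encoding(X.shape[0], X.shape[1])
```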
Multi-head self-attention extends the basic self-attention mechanism by running multiple attention computations ("heads") in parallel, each with its own set of learned Q, K, and V projection matrices. The outputs of all heads are concatenated and projected through a final linear layer.
MultiHead(X) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(X * W_Qi, X * W_Ki, X * W_Vi)
Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. For example, one head might learn to attend to syntactic relationships (subject-verb agreement), while another head focuses on semantic similarity. The original Transformer used h = 8 attention heads with d_k = d_v = 64 for a total model dimension of d_model = 512.
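A compact NumPy sketch of multi-head self-attention with the dimensions quoted above; the per-head weight matrices are random placeholders for illustration:

```python
import numpy as np
from scipy.special import softmax

def attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_self_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, one per head; W_O projects back to d_model."""
    outputs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(outputs, axis=-1) @ W_O   # concatenate heads, then final projection

# Illustrative setup matching the original Transformer: d_model = 512, h = 8, d_k = 64.
rng = np.random.default_rng(0)
n, d_model, h, d_k = 10, 512, 8, 64
X = rng.normal(size=(n, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_self_attention(X, heads, W_O)      # shape (n, d_model)
```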
The most significant practical limitation of self-attention is its quadratic complexity. For a sequence of length n with representation dimension d, the attention matrix Q * K^T has dimensions (n x n), requiring O(n^2 * d) time and O(n^2) memory. This means that doubling the sequence length quadruples the computation and memory requirements.
The following table, adapted from the original Transformer paper, compares self-attention to recurrence and convolution across three criteria:
| Layer Type | Complexity per Layer | Sequential Operations | Maximum Path Length |
|---|---|---|---|
| Self-Attention | O(n^2 * d) | O(1) | O(1) |
| Recurrent | O(n * d^2) | O(n) | O(n) |
| Convolutional | O(k * n * d^2) | O(1) | O(log_k(n)) |
Self-attention excels at maximum path length (any two tokens can interact directly in a single layer, yielding O(1) path length) and parallelism (all positions are computed simultaneously, yielding O(1) sequential operations). However, its quadratic scaling with sequence length makes it expensive for very long sequences. Recurrent layers have lower per-layer complexity when n > d (that is, when sequences are longer than the representation dimension), but they require O(n) sequential steps, preventing parallelization. Convolutional layers (with kernel size k) fall between the two.
The quadratic cost of standard self-attention has motivated extensive research into more efficient alternatives:
FlashAttention (Dao et al., 2022) is an IO-aware algorithm that computes exact standard attention but reorganizes the computation to minimize memory reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM. By processing the attention matrix in tiles and never materializing the full (n x n) matrix in HBM, FlashAttention achieves 2 to 4 times wall-clock speedup over standard implementations and reduces memory usage from O(n^2) to O(n). Importantly, FlashAttention does not approximate the attention computation; it produces mathematically identical results.
Sparse attention methods restrict each token to attend only to a subset of positions rather than all n positions. Patterns include local windows (each token attends only to nearby neighbors), strided patterns, and combinations of both. BigBird and Longformer use sliding-window attention augmented with a small number of global tokens, achieving O(n) complexity while maintaining strong performance on long-document tasks.
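The pattern can be illustrated with a simple allowed-attention mask. Note that this sketch materializes the full (n x n) mask only for clarity; efficient implementations compute only the allowed blocks, which is where the O(n) saving comes from.

```python
import numpy as np

def sliding_window_mask(n, window, num_global=0):
    """Boolean mask: True where attention is allowed (local window plus optional global tokens)."""
    idx = np.arange(n)
    allowed = np.abs(idx[:, None] - idx[None, :]) <= window   # local neighborhood
    if num_global > 0:
        allowed[:, :num_global] = True   # every token may attend to the global tokens
        allowed[:num_global, :] = True   # global tokens attend to every position
    return allowed
```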
Linear attention methods replace the softmax-based attention kernel with a decomposable kernel function, enabling the computation to be restructured so that it runs in O(n * d^2) time rather than O(n^2 * d). Performers (Choromanski et al., 2021) approximate the softmax kernel using random feature maps. Other approaches, like RWKV and Mamba, reformulate attention as a recurrence that processes tokens one at a time in O(1) per step, combining the parallelism benefits of Transformers during training with the constant-memory inference of RNNs.
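The key algebraic trick is reassociating the matrix products: with a feature map phi applied to queries and keys, (phi(Q) phi(K)^T) V can be computed as phi(Q) (phi(K)^T V), so the (n x n) matrix is never formed. The sketch below uses a simple positive feature map (elu(x) + 1) purely for illustration; it approximates softmax attention and differs from the Performer's random-feature construction:

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized (linear) attention sketch; O(n * d^2) instead of O(n^2 * d)."""
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1: a simple positive feature map
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                      # (d_k, d_v): summed key-value outer products
    Z = Qp @ Kp.sum(axis=0)            # per-query normalizer, shape (n,)
    return (Qp @ KV) / Z[:, None]      # never forms the (n x n) attention matrix
```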
| Variant | Complexity | Exact? | Key Idea |
|---|---|---|---|
| Standard Self-Attention | O(n^2 * d) | Yes | Full pairwise attention |
| FlashAttention | O(n^2 * d) (same FLOP, less IO) | Yes | Tiled computation, avoids materializing full matrix |
| Sparse Attention (Longformer, BigBird) | O(n * d) | No (approximation) | Attend only to local windows and global tokens |
| Linear Attention (Performers) | O(n * d^2) | No (approximation) | Kernel trick to decompose softmax |
| State-space models (Mamba) | O(n * d) | No (different mechanism) | Selective recurrence replacing attention |
Self-attention was originally developed for natural language processing, but it has been successfully adapted for computer vision. The Vision Transformer (ViT), introduced by Dosovitskiy et al. in 2020, applies the standard Transformer encoder to images by splitting each image into a grid of fixed-size patches (typically 16x16 pixels), flattening each patch into a vector, and treating these patch vectors as a sequence of tokens.
Self-attention in ViT allows every image patch to attend to every other patch, giving the model a global receptive field from the first layer. This contrasts with convolutional neural networks (CNNs), where each layer sees only a local neighborhood and global context emerges only through stacking many layers. Subsequent models like DeiT, Swin Transformer, and BEiT have refined this approach by incorporating hierarchical structures, local attention windows, and improved training strategies.
The quadratic complexity of self-attention with respect to the number of patches limits direct application to very high-resolution images, motivating architectures like Swin Transformer that use shifted windows to restrict attention to local regions while still allowing cross-region information flow.
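A sketch of the patch-tokenization step, with illustrative shapes (a 224x224 RGB image, 16x16 patches, and a hypothetical embedding dimension of 768):

```python
import numpy as np

def image_to_patch_tokens(image, patch_size, W_embed):
    """Split an (H, W, C) image into non-overlapping patches and project each
    flattened patch to the model dimension, yielding a (num_patches, d_model) sequence."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)            # split rows and columns into patches
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches @ W_embed                                    # each row is one token

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
W_embed = rng.normal(size=(16 * 16 * 3, 768))                   # learned in a real model
tokens = image_to_patch_tokens(image, 16, W_embed)              # shape (196, 768)
```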
Self-attention, convolution, and recurrence represent three fundamentally different approaches to processing sequential or spatial data. Each has distinct strengths:
| Property | Self-Attention | Convolution | Recurrence (RNN/LSTM) |
|---|---|---|---|
| Receptive field | Global (all positions) | Local (kernel size) | Sequential accumulation |
| Weight sharing | Dynamic (content-dependent) | Fixed (same kernel everywhere) | Fixed (same weights each step) |
| Parallelism | Fully parallel | Fully parallel | Sequential |
| Long-range dependencies | Direct (O(1) path length) | Indirect (requires stacking layers) | Indirect (O(n) path length) |
| Inductive bias | Minimal (learns all patterns from data) | Translation invariance, locality | Sequential ordering |
| Data efficiency | Needs more data (weaker inductive bias) | More data-efficient for spatial tasks | More data-efficient for short sequences |
Cordonnier et al. (2020) showed that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, meaning self-attention can learn to replicate any convolution. In practice, Transformer-based models tend to use a mixture of local and global attention patterns, with lower layers often learning convolution-like local attention and higher layers capturing long-range global relationships.
Self-attention is used across a wide range of tasks and domains:

- Natural language processing: language modeling and text generation (GPT, LLaMA), machine translation, text classification, named entity recognition, and question answering.
- Computer vision: image classification and related tasks via the Vision Transformer and its successors.
- Speech: speech recognition and other audio-sequence modeling.
- Other settings where relationships within a sequence or set must be modeled directly.
Imagine you are in a classroom and the teacher asks everyone to write down which classmate is most helpful for answering a question. Every student looks around the room, thinks about what each other student knows, and writes down a score for how helpful each person would be. Then each student collects a little bit of knowledge from everyone else, but takes more from the students they scored highest.
That is what self-attention does. Each word in a sentence looks at every other word, figures out which words are most important for understanding its own meaning, and then mixes together information from all the other words (paying more attention to the important ones). After this process, each word's representation contains information about the whole sentence, not just about itself.
| Year | Development | Significance |
|---|---|---|
| 2014 | Bahdanau attention | First attention mechanism for neural machine translation; attended from decoder to encoder (not self-attention). |
| 2016 | Intra-attention in RNNs | Self-attention integrated into RNN-based models to capture within-sequence dependencies. |
| 2017 | "Attention Is All You Need" (Vaswani et al.) | Introduced the Transformer architecture, using self-attention as the sole sequence-modeling mechanism and eliminating recurrence entirely. |
| 2018 | BERT (Devlin et al.) | Demonstrated bidirectional self-attention for pre-training, achieving breakthroughs on many NLP benchmarks. |
| 2018 | GPT (Radford et al.) | Showed that causal (masked) self-attention with autoregressive pre-training produces powerful language models. |
| 2020 | Vision Transformer (ViT) | Applied self-attention to image patches, proving that Transformers can match or surpass CNNs on vision tasks. |
| 2022 | FlashAttention (Dao et al.) | Made self-attention 2 to 4 times faster with no approximation, enabling training on much longer sequences. |
| 2023-2024 | State-space models (Mamba, RWKV) | Proposed alternatives to self-attention with linear complexity, sparking debate about whether attention is still "all you need." |