Cross-attention is a variant of the attention mechanism used in transformer architectures where the queries are derived from one sequence and the keys and values are derived from a different sequence. Unlike self-attention, which captures relationships within a single input, cross-attention enables a model to integrate and align information across two distinct sources. In its modern query-key-value form it was introduced as part of the original transformer architecture in the "Attention Is All You Need" paper by Vaswani et al. (2017), and it has since become a foundational building block in natural language processing, computer vision, speech recognition, and multimodal AI systems.
At its most basic level, cross-attention is a function that takes a query from one sequence and computes compatibility scores against keys from a second sequence, then uses those scores to produce a weighted combination of the second sequence's values. The "cross" in cross-attention refers to the fact that information flows between two different representations rather than circulating within a single one.
In a typical encoder-decoder setup, the decoder must decide which parts of the encoder's output are most relevant at each step of generation. Cross-attention provides exactly this capability. Each position in the decoder generates a query vector that is compared against all key vectors produced by the encoder. The resulting attention weights determine how much each encoder position contributes to the decoder's current output.
This mechanism is sometimes called "encoder-decoder attention" in the literature, though the term "cross-attention" has become the more general and widely used label, especially as the technique has expanded beyond traditional encoder-decoder models into diffusion models, multimodal systems, and retrieval-augmented architectures.
Cross-attention follows the same scaled dot-product attention formula used throughout transformer models, with the critical difference being the source of the query, key, and value matrices.
Given two sequences, A (the source of the queries) and B (the source of the keys and values), the query, key, and value matrices are computed as:
Q = H_A * W^Q
K = H_B * W^K
V = H_B * W^V
where H_A is the hidden representation of Sequence A, H_B is the hidden representation of Sequence B, and W^Q, W^K, W^V are learned projection matrices.
The attention output is then calculated using the scaled dot-product formula:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Here, d_k is the dimensionality of the key vectors, and the division by sqrt(d_k) prevents the dot products from growing too large in magnitude, which would push the softmax function into regions with extremely small gradients.
The softmax operation normalizes the scores across all key positions, producing a probability distribution that represents how much attention each query position pays to each key position. The result is a weighted sum of the value vectors, where positions with higher attention scores contribute more to the output.
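As a concrete sketch, the computation above can be written in a few lines of NumPy. The shapes and random inputs are illustrative, and a real implementation would learn the projection matrices during training:

```python
import numpy as np

def cross_attention(H_A, H_B, W_Q, W_K, W_V):
    """Single-head cross-attention: queries from H_A, keys/values from H_B."""
    Q = H_A @ W_Q                      # (n_q, d_k)
    K = H_B @ W_K                      # (n_kv, d_k)
    V = H_B @ W_V                      # (n_kv, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (n_q, n_kv)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                 # (n_q, d_v): weighted sum of B's values

rng = np.random.default_rng(0)
H_A = rng.normal(size=(5, 16))         # sequence A: 5 tokens (queries)
H_B = rng.normal(size=(9, 16))         # sequence B: 9 tokens (keys/values)
W_Q, W_K, W_V = (rng.normal(size=(16, 8)) for _ in range(3))
out = cross_attention(H_A, H_B, W_Q, W_K, W_V)
print(out.shape)                       # (5, 8): one output per query position
```

Note that the output length follows the query sequence (5 positions), while the number of positions attended over follows the key/value sequence (9).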
In practice, cross-attention is almost always implemented with multiple heads, following the multi-head attention formulation from Vaswani et al. (2017):
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W^O
where head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
In the original transformer, the model uses h=8 attention heads with d_model=512, giving each head d_k = d_v = d_model/h = 64 dimensions. The per-head projection matrices W_i^Q, W_i^K, and W_i^V are each in R^(512x64), and the output projection W^O in R^(512x512) combines the concatenated head outputs back into the model dimension.
Multiple heads allow the model to attend to information from different representational subspaces simultaneously. One head might focus on syntactic alignment, another on semantic similarity, and yet another on positional proximity. This parallel processing of different attention patterns is what makes multi-head cross-attention so effective in practice.
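The multi-head formulation can be sketched as a loop over per-head projections followed by concatenation and an output projection. The sizes below match the original transformer's h=8, d_model=512, d_k=64; the random initialization is purely illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multihead_cross_attention(H_A, H_B, params):
    """Each head projects Q from H_A and K, V from H_B, attends, and the
    concatenated head outputs are mapped back to d_model by W_O."""
    heads = []
    for W_Q, W_K, W_V in params["heads"]:
        Q, K, V = H_A @ W_Q, H_B @ W_K, H_B @ W_V
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ params["W_O"]

d_model, h, d_k = 512, 8, 64           # sizes from the original transformer
rng = np.random.default_rng(0)
params = {
    "heads": [tuple(rng.normal(scale=0.02, size=(d_model, d_k)) for _ in range(3))
              for _ in range(h)],
    "W_O": rng.normal(scale=0.02, size=(h * d_k, d_model)),
}
H_A = rng.normal(size=(10, d_model))   # decoder states (queries)
H_B = rng.normal(size=(20, d_model))   # encoder states (keys/values)
out = multihead_cross_attention(H_A, H_B, params)
print(out.shape)                       # (10, 512)
```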
The distinction between self-attention and cross-attention is straightforward but important. In self-attention, a single sequence serves as the source for all three components (queries, keys, and values). In cross-attention, the queries come from one sequence and the keys and values come from another.
| Feature | Self-Attention | Cross-Attention |
|---|---|---|
| Query source | Same sequence as K and V | Different sequence from K and V |
| Key/Value source | Same sequence as Q | Different sequence from Q |
| Purpose | Capture relationships within one sequence | Align and integrate two different sequences |
| Location in original transformer | Encoder layers; decoder masked self-attention | Decoder encoder-decoder attention sublayer |
| Typical use | Contextual representation of a single input | Conditioning output generation on an external source |
| Computational complexity | O(n^2 * d) where n is sequence length | O(n_q * n_kv * d) where n_q and n_kv may differ |
| Example application | BERT encoding a sentence | Machine translation decoder attending to encoder |
| Number of input sequences | One | Two |
Self-attention allows each token in a sequence to gather context from every other token in the same sequence. Cross-attention, by contrast, allows tokens in one sequence to gather information from tokens in a completely different sequence. Both operations share the same underlying mathematical framework; only the data sources differ.
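Because the two operations share one formula, the difference can be shown by calling the same function with different inputs. Projections are omitted for brevity and the sizes are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q_src, kv_src):
    """Scaled dot-product attention with no learned projections, so the
    only thing that varies between the two calls below is the data source."""
    scores = q_src @ kv_src.T / np.sqrt(q_src.shape[-1])
    return softmax(scores) @ kv_src

rng = np.random.default_rng(0)
x = rng.normal(size=(7, 16))     # one sequence, 7 tokens
y = rng.normal(size=(12, 16))    # a second sequence, 12 tokens

self_out = attention(x, x)       # self-attention: Q, K, V all from x
cross_out = attention(x, y)      # cross-attention: Q from x, K/V from y
print(self_out.shape, cross_out.shape)   # (7, 16) (7, 16)
```

In both cases the output has one row per query position; only the set of positions being attended over differs.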
The transformer architecture introduced by Vaswani et al. at NIPS 2017 (the conference now known as NeurIPS) contains three distinct types of attention:
Encoder self-attention: Each position in the encoder attends to all positions in the previous encoder layer. The queries, keys, and values all come from the encoder's own representations.
Decoder masked self-attention: Each position in the decoder attends to all positions up to and including the current position. A causal mask prevents attending to future tokens, preserving the autoregressive property needed for generation.
Encoder-decoder attention (cross-attention): The queries come from the previous decoder layer, while the keys and values come from the output of the encoder stack. This allows every position in the decoder to attend over all positions in the input sequence.
The original transformer uses N=6 identical layers in both the encoder and the decoder. Each encoder layer has two sublayers: multi-head self-attention and a position-wise feed-forward network. Each decoder layer has three sublayers: masked multi-head self-attention, multi-head cross-attention over the encoder output, and a position-wise feed-forward network. Residual connections and layer normalization wrap each sublayer.
The cross-attention sublayer in the decoder is what allows the model to "look at" the input sequence while generating the output. Without it, the decoder would have no way to condition its predictions on the source information encoded by the encoder. This design mimics earlier encoder-decoder attention mechanisms used in sequence-to-sequence models with recurrent neural networks, but replaces recurrence with parallelizable attention operations.
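The three-sublayer decoder structure described above can be sketched as follows. This toy version uses single-head, unprojected attention and illustrative sizes; the sublayer ordering and the residual-plus-normalization wrapping follow the description above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(q_src, kv_src, mask=None):
    """Unprojected single-head attention; positions where mask is True are blocked."""
    scores = q_src @ kv_src.T / np.sqrt(q_src.shape[-1])
    if mask is not None:
        scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ kv_src

def layer_norm(x):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def decoder_layer(x, enc_out, W1, W2):
    """Masked self-attention -> cross-attention over enc_out -> FFN,
    each sublayer wrapped in a residual connection and layer norm."""
    n = x.shape[0]
    causal = np.triu(np.ones((n, n), dtype=bool), k=1)  # block future positions
    x = layer_norm(x + attn(x, x, mask=causal))         # masked self-attention
    x = layer_norm(x + attn(x, enc_out))                # cross-attention
    return layer_norm(x + np.maximum(x @ W1, 0) @ W2)   # position-wise FFN (ReLU)

rng = np.random.default_rng(0)
d = 32
dec_states = rng.normal(size=(7, d))    # decoder hidden states (queries)
enc_out = rng.normal(size=(15, d))      # encoder stack output (keys/values)
W1, W2 = rng.normal(size=(d, 64)) * 0.1, rng.normal(size=(64, d)) * 0.1
out = decoder_layer(dec_states, enc_out, W1, W2)
print(out.shape)                        # (7, 32)
```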
The original and most intuitive application of cross-attention is in machine translation. The encoder processes the source language sentence, and the decoder generates the target language sentence one token at a time. At each decoding step, cross-attention enables the decoder to identify which source tokens are most relevant for predicting the next target token.
For example, when translating "The cat sat on the mat" from English to French, the decoder might strongly attend to "cat" when generating "chat" and to "mat" when generating "tapis." The attention weights learned through cross-attention effectively create a soft alignment between source and target positions, replacing the hard alignment models used in earlier statistical machine translation systems.
Beyond translation, cross-attention plays a central role in other sequence-to-sequence tasks including text summarization, question answering, and document-grounded generation, where the model must condition its output on an input passage.
One of the most prominent modern uses of cross-attention is in text-to-image diffusion models. In these systems, cross-attention is the primary mechanism through which text descriptions guide the image generation process.
Stable Diffusion (based on the Latent Diffusion Model by Rombach et al., 2022) uses cross-attention layers within its U-Net denoising network to condition image generation on text prompts. The text prompt is first encoded using a frozen CLIP ViT-L/14 text encoder, producing a sequence of token embeddings. Inside the U-Net, the latent image features form the queries (Q), while the CLIP text token embeddings produce the keys (K) and values (V). At each denoising step, cross-attention allows the model to determine which spatial regions of the image should correspond to which words in the prompt.
Research has shown that cross-attention in Stable Diffusion operates in two distinct phases during inference. In the initial "semantics-planning" stage, the model relies heavily on cross-attention to lay out the spatial arrangement of objects described in the text. In the subsequent "fidelity-improving" stage, cross-attention outputs converge to a fixed point and the model focuses on refining visual details. This insight has been exploited in prompt-to-prompt editing techniques, where manipulating cross-attention maps allows targeted changes to generated images without altering the overall composition.
Imagen (Saharia et al., 2022, Google Research) takes a different approach to text encoding but relies on the same cross-attention conditioning mechanism. Instead of CLIP, Imagen uses the encoder from a frozen T5-XXL large language model, a text-only model not originally trained on image-text pairs. The text embeddings are injected into the U-Net via cross-attention, implemented by concatenating the text embedding to the key-value pairs of each attention layer. Google's research found that scaling the language model size improved image quality and text-image alignment more than scaling the diffusion model itself, highlighting how critical the cross-attention interface between text and image is for generation quality.
DALL-E 2 (Ramesh et al., 2022, OpenAI), also known as unCLIP, uses a two-stage architecture consisting of a diffusion prior that generates CLIP image embeddings from text, followed by a decoder that generates images from those embeddings. In the decoder, cross-attention layers take their queries from the previous layer's outputs and their keys and values from the CLIP text embeddings, producing text-conditioned context vectors for each subsequent layer.
Cross-attention is a core architectural element in many vision-language models that need to fuse visual and textual information.
Flamingo (Alayrac et al., 2022, DeepMind) introduced gated cross-attention dense layers as its primary mechanism for injecting visual information into a frozen large language model. In Flamingo's architecture, new cross-attention layers are interleaved between the existing frozen LLM layers. The visual features (from a Vision Encoder processed through a Perceiver Resampler) serve as keys and values, while the language model's hidden states serve as queries. A learned gating mechanism, using a tanh-based scalar gate initialized at zero, gradually introduces visual conditioning without destabilizing the pretrained language model. This design allows Flamingo to perform few-shot learning on vision-language tasks by providing interleaved image-text examples as prompts.
LLaVA (Liu et al., 2023) takes a different approach. Rather than using cross-attention layers, LLaVA projects visual tokens from a CLIP vision encoder through a simple linear (later MLP) projection and concatenates them with text tokens as input to the LLM decoder. This projection-based fusion avoids the need for additional cross-attention parameters, though it requires the language model's self-attention to handle the combined visual-textual sequence. The LLaVA approach has become increasingly popular as an alternative to explicit cross-attention for vision-language integration.
Other multimodal architectures that employ cross-attention include Google's PaLI, Microsoft's Florence, and various medical imaging models that cross-attend between radiology images and clinical text reports.
Encoder-decoder automatic speech recognition (ASR) systems rely on cross-attention to align acoustic features with text transcriptions.
Whisper (Radford et al., 2023, OpenAI) is a prominent example. Whisper processes audio by converting 30-second chunks into log-Mel spectrograms, which are then encoded by a transformer encoder. The decoder uses cross-attention to attend over the encoder's output while autoregressively generating text tokens. This cross-attention mechanism creates a dynamic alignment between acoustic frames and the text being transcribed, allowing the model to handle variable-length audio inputs and produce accurate transcriptions across multiple languages.
The cross-attention weights in Whisper have proven useful beyond basic transcription. Researchers have leveraged the implicit time alignment captured in the cross-attention maps for tasks like word-level timestamping and streaming ASR. The Simul-Whisper system, for instance, uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding for chunk-based streaming speech recognition.
Earlier encoder-decoder ASR systems, such as those based on Listen, Attend and Spell (LAS) by Chan et al. (2016), also used attention-based alignment between acoustic encoders and text decoders, establishing the pattern that Whisper and other modern systems follow.
The Perceiver (Jaegle et al., 2021, DeepMind) represents a creative rethinking of how cross-attention can be used to handle inputs of arbitrary size and modality. Traditional transformers apply self-attention to the full input, which is computationally prohibitive for high-dimensional data like images (a 224x224 image has over 50,000 pixels).
The Perceiver solves this problem by using cross-attention to project a high-dimensional input byte array into a fixed-size latent array. The architecture has two core components:
Cross-attention module: Maps from a large input byte array (M elements) and a small latent array (N elements, where N is much smaller than M) to a new latent array. The queries come from the latent array and the keys and values come from the input.
Latent transformer tower: A stack of self-attention blocks that operates entirely in the latent space.
The computational complexity of the cross-attention step is O(M * N), which is linear in the input size M (since N is fixed). The latent self-attention has complexity O(N^2), which is independent of the input size. This gives the full architecture a complexity of O(M * N + L * N^2), where L is the number of latent self-attention layers, effectively decoupling network depth from input size.
In practice, the Perceiver uses a latent array of around 512 indices with 1024 channels for ImageNet classification. The architecture iteratively alternates between cross-attention (to re-read the input) and latent self-attention (to process the latent representations), with multiple rounds of cross-attention yielding better performance at the cost of increased computation.
The Perceiver IO extension (Jaegle et al., 2021) further generalizes this approach by adding an output cross-attention step, where task-specific output queries cross-attend to the latent array to produce structured outputs of any desired shape.
This architecture has influenced many subsequent designs. The Perceiver Resampler used in Flamingo, for example, uses the same principle of cross-attending from a fixed set of learned latent queries to a variable-length set of visual features, producing a fixed-size set of visual tokens regardless of the input resolution.
Cross-attention is a key mechanism in several retrieval-augmented generation (RAG) architectures, where a model must incorporate external retrieved documents into its generation process.
Fusion-in-Decoder (Izacard and Grave, 2021) is a retrieval-augmented model built on T5 that uses cross-attention as its core fusion mechanism. In FiD, each retrieved passage is encoded independently by the T5 encoder. The encoder outputs from all retrieved passages are then concatenated and passed to the T5 decoder, which attends over all encoded passages jointly through cross-attention. This design allows the decoder to identify and synthesize relevant information scattered across multiple retrieved documents. The cross-attention weights in FiD effectively serve as a learned relevance scoring mechanism, with the model learning to focus on the most informative passages during generation.
The Retrieval-Enhanced Transformer (RETRO) by Borgeaud et al. (2022, DeepMind) introduces a chunked cross-attention mechanism for integrating retrieved text into language model predictions. RETRO splits the input into chunks and retrieves nearest neighbors for each chunk from a database of trillions of tokens. The retrieved neighbors are processed by an encoder, and the resulting representations are integrated into the main model through a specialized chunked cross-attention layer. This approach has time complexity linear in the amount of retrieved data and enabled RETRO to achieve performance comparable to GPT-3 with roughly 25 times fewer parameters.
The success of these architectures demonstrates that cross-attention provides a natural and effective interface for conditioning a generative model on externally retrieved information, whether that information comes from a document corpus, a knowledge base, or any other structured data source.
The computational cost of cross-attention is determined by the sizes of the two sequences involved. If the query sequence has n_q tokens and the key/value sequence has n_kv tokens, the attention computation requires O(n_q * n_kv * d) operations, where d is the embedding dimension.
In the standard encoder-decoder transformer for machine translation, both sequences are typically of similar length, making cross-attention comparable in cost to self-attention. However, in modern applications the two sequences can differ dramatically in size. In diffusion models, the spatial feature map may contain thousands of positions while the text prompt has only dozens of tokens. In the Perceiver, the input may contain tens of thousands of elements while the latent array has only a few hundred.
As model resolutions and sequence lengths have grown, cross-attention has become a notable computational bottleneck in some applications. In text-to-image diffusion models, cross-attention can account for a substantial fraction of the total inference latency, particularly at high resolutions where the spatial feature maps are large.
Several approaches have been developed to reduce the cost of cross-attention:
| Technique | Description | Typical Savings |
|---|---|---|
| Sparse cross-attention | Limits attention computation to selected subsets of tokens based on fixed patterns, routing, or clustering | 40-60% memory reduction |
| Layer-sparse cross-attention (LSA) | Applies cross-attention in only a subset of decoder layers rather than all of them | Reduces decoder inference cost significantly |
| Multi-query attention | Shares key and value projections across all attention heads while keeping separate query projections | Up to 75% memory reduction |
| Cache sharing | Reuses computed key-value caches across similar cross-attention layers | Reduces redundant computation |
| Flash attention | Memory-efficient exact attention using tiling and recomputation | 2-4x speedup with no approximation |
| Cross-attention pruning | Identifies and removes redundant cross-attention layers after convergence | Reduces inference cost in diffusion models |
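One technique from the table, multi-query attention, is simple to sketch: each head keeps its own query projection while all heads share a single key and value projection, shrinking the K/V memory footprint by roughly the head count. Shapes below are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_query_cross_attention(H_A, H_B, W_Qs, W_K, W_V):
    """Multi-query variant (sketch): K and V are computed once and shared
    across heads; only the query projection is per-head."""
    K, V = H_B @ W_K, H_B @ W_V          # one K/V pair for all heads
    heads = []
    for W_Q in W_Qs:                     # per-head queries
        A = softmax((H_A @ W_Q) @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1)

rng = np.random.default_rng(0)
d_model, d_k, h = 64, 16, 4
H_A = rng.normal(size=(5, d_model))      # query sequence
H_B = rng.normal(size=(11, d_model))     # key/value sequence
W_Qs = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
out = multi_query_cross_attention(H_A, H_B, W_Qs, W_K, W_V)
print(out.shape)                         # (5, 64)
```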
Research on Stable Diffusion has shown that cross-attention outputs converge to a fixed point after only a few inference steps. This observation has led to methods that skip cross-attention computation in later denoising steps without measurable loss in image quality, providing significant inference speedups.
Beyond the standard multi-head formulation, several specialized variants of cross-attention have been developed for specific applications.
Gated cross-attention adds a learned gating mechanism that controls how much cross-attended information is mixed into the main representation. This is particularly useful when adding cross-attention to a pretrained model, as it allows the model to start from a state where the cross-attention has no effect (gate initialized to zero) and gradually learn to incorporate the new information source.
Flamingo's gated cross-attention is the best-known example. Each cross-attention layer's output is multiplied by tanh(alpha), where alpha is a learned scalar initialized to zero. This ensures that the pretrained language model's behavior is preserved at initialization, with visual information gradually introduced during training.
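A minimal sketch of this gating, assuming a single head and illustrative shapes. The key property is that with the gate at its zero initialization, the layer is an identity on the language model's states:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_cross_attention(text_h, visual_kv, W_Q, W_K, W_V, alpha):
    """Flamingo-style gating (sketch): the cross-attended update is scaled
    by tanh(alpha) before the residual addition. With alpha = 0 at init,
    the pretrained LM's behavior is unchanged at the start of training."""
    Q, K, V = text_h @ W_Q, visual_kv @ W_K, visual_kv @ W_V
    update = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    return text_h + np.tanh(alpha) * update   # residual with learned gate

rng = np.random.default_rng(0)
d = 32
text_h = rng.normal(size=(6, d))      # LM hidden states (queries)
visual = rng.normal(size=(12, d))     # visual features (keys/values)
W = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]
out0 = gated_cross_attention(text_h, visual, *W, alpha=0.0)
print(np.allclose(out0, text_h))      # True: zero gate leaves the states intact
```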
Chunked cross-attention, used in the RETRO architecture, splits the input sequence into fixed-size chunks and applies cross-attention independently within each chunk to its corresponding retrieved neighbors. This reduces the computational cost and allows the model to handle very long sequences by localizing the cross-attention computation.
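The chunking idea can be sketched as follows. This toy version ignores RETRO's causal chunk shifting and uses unprojected single-head attention; the point is only that each chunk attends to its own retrieval, so cost grows linearly with sequence length:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def chunked_cross_attention(x, neighbors, chunk=4):
    """Each fixed-size chunk of x attends only to its own retrieved
    neighbor tokens, never to retrievals for other chunks."""
    d = x.shape[-1]
    out = np.empty_like(x)
    for i in range(0, x.shape[0], chunk):
        q = x[i:i + chunk]                 # one chunk of the input
        kv = neighbors[i // chunk]         # its retrieved neighbor tokens
        out[i:i + chunk] = softmax(q @ kv.T / np.sqrt(d)) @ kv
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 16))              # 3 chunks of 4 tokens each
neighbors = [rng.normal(size=(10, 16)) for _ in range(3)]  # per-chunk retrievals
out = chunked_cross_attention(x, neighbors)
print(out.shape)                           # (12, 16)
```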
Learned-query cross-attention uses a fixed set of learned query vectors instead of deriving queries from a decoder's hidden states. The Perceiver's latent array and the Q-Former in BLIP-2 (Li et al., 2023) both use this approach. The learned queries cross-attend to the input, effectively learning to extract the most relevant information regardless of the specific input content. This design decouples the output size from the input size and provides a flexible information bottleneck.
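A toy illustration of the fixed-size bottleneck: whatever the input length, a fixed set of learned queries produces an output of constant shape. Projections are omitted and the sizes are illustrative; in a real model the query vectors are trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, n_latents = 32, 8
learned_queries = rng.normal(size=(n_latents, d))  # trained parameters in practice

outs = []
for n_input in (100, 1000):          # input length varies...
    x = rng.normal(size=(n_input, d))
    outs.append(softmax(learned_queries @ x.T / np.sqrt(d)) @ x)
print([o.shape for o in outs])       # ...but the output stays (8, 32)
```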
Standard cross-attention is unidirectional: information flows from the key/value sequence to the query sequence. Bidirectional cross-attention, as used in BiXT (2024), allows information to flow in both directions simultaneously. The input tokens attend to the latent variables, and the latent variables attend to the input tokens in a single step, addressing the iterative bottleneck in Perceiver-like architectures.
Some recent architectures decouple cross-attention into separate components for different types of conditioning. In text-to-image models, for instance, separate cross-attention heads may be dedicated to different aspects of the text prompt (e.g., subject, style, spatial layout), allowing more fine-grained control over the generation process.
Cross-attention is not the only way to combine information from two sequences. Several alternative approaches exist, each with different tradeoffs:
| Mechanism | How It Works | Pros | Cons |
|---|---|---|---|
| Cross-attention | Queries from one sequence attend to keys/values from another | Fine-grained token-level alignment; learnable relevance | O(n_q * n_kv) complexity; adds parameters |
| Concatenation + self-attention | Combine both sequences into one and apply self-attention | Simple; no extra parameters | Quadratic in combined length; no explicit source distinction |
| Linear projection | Project one modality into another's embedding space | Very simple; minimal extra parameters | No dynamic alignment; coarse fusion |
| FiLM conditioning | Scale and shift features using learned affine transforms | Lightweight; good for global conditioning | No spatial or token-level alignment |
| Additive fusion | Element-wise addition of aligned representations | Computationally cheap | Requires pre-aligned representations |
The choice among these mechanisms depends on the task requirements. Cross-attention excels when fine-grained, position-specific alignment between two sequences is needed, which is why it dominates in translation, diffusion models, and retrieval-augmented generation. Simpler approaches like projection or concatenation may suffice when the alignment is less critical or when computational budgets are tight.
The concept of attending from one sequence to another predates the transformer. Bahdanau attention (Bahdanau et al., 2015) introduced the idea of learning soft alignments between encoder and decoder states in RNN-based sequence-to-sequence models. The Luong attention mechanism (Luong et al., 2015) proposed simplified scoring functions for the same purpose. These early attention mechanisms are, in essence, forms of cross-attention applied to recurrent hidden states rather than transformer representations.
The transformer's contribution was to reformulate this cross-sequence attention using the query-key-value framework, combine it with multi-head parallelism, and embed it within a fully attention-based architecture that eliminated recurrence entirely. This made cross-attention both more expressive (through multiple heads) and more efficient (through parallelization).
Since 2017, cross-attention has expanded far beyond its original role in encoder-decoder text models. Its adoption in diffusion models (2021 onward), multimodal models (2022 onward), and retrieval-augmented systems demonstrates its versatility as a general-purpose mechanism for conditioning one representation on another.