Cross-attention is a variant of the attention mechanism used in transformer architectures and other neural networks where the queries are derived from one sequence or representation while the keys and values are derived from a different sequence or representation. Unlike self-attention, which captures relationships within a single input, cross-attention enables a model to integrate and align information across two distinct sources. The mechanism originated as encoder-decoder attention in the work of Bahdanau, Cho, and Bengio (2014) on neural machine translation [2], was generalized into the modern query-key-value form by Vaswani et al. (2017) in the original transformer architecture [1], and has since become a foundational building block in natural language processing, computer vision, speech recognition, diffusion models, retrieval-augmented generation, and a broad family of multimodal AI systems.
Whereas self-attention dominates decoder-only large language models such as GPT and Llama, cross-attention is the standard mechanism whenever a model must condition its output on an external source of information that lives in a different representational space. Examples include text-conditioned image generation in Stable Diffusion, grounding language model outputs on a stack of retrieved documents in Fusion-in-Decoder, and injecting visual features into a frozen language model in Flamingo. The fact that the same scaled dot-product operation can serve so many different roles, simply by changing the source of the queries versus the keys and values, is part of what makes the attention framework so general.
At its most basic level, cross-attention is a function that takes a query from one sequence and computes compatibility scores against keys from a second sequence, then uses those scores to produce a weighted combination of the second sequence's values. The "cross" in cross-attention refers to the fact that information flows between two different representations rather than circulating within a single one. The first sequence supplies the perspective from which the model is asking a question; the second sequence supplies the data being queried.
In a typical encoder-decoder setup, the decoder must decide which parts of the encoder's output are most relevant at each step of generation. Cross-attention provides exactly this capability. Each position in the decoder generates a query vector that is compared against all key vectors produced by the encoder. The resulting attention weights determine how much each encoder position contributes to the decoder's current output. The same idea generalizes far beyond encoder-decoder text models. The query side may be a small set of learned latent vectors, a sequence of image patches, or even a single readout token. The key-value side may be a text passage, a stack of retrieved documents, an audio spectrogram, or a high-resolution image. Cross-attention is essentially a soft, differentiable lookup that selects information from one representation based on a request issued by another.
This mechanism is sometimes called "encoder-decoder attention" in the literature, though the term "cross-attention" has become the more general and widely used label, especially as the technique has expanded beyond traditional encoder-decoder models into diffusion models, multimodal systems, and retrieval-augmented architectures. The two terms are often used interchangeably, though "encoder-decoder attention" specifically refers to the case where the keys and values come from a transformer encoder stack.
Cross-attention follows the same scaled dot-product attention formula used throughout transformer models, with the critical difference being the source of the query, key, and value matrices.
Given two sequences, Sequence A (the query side, with hidden representation H_A) and Sequence B (the key-value side, with hidden representation H_B), the query, key, and value matrices are computed as:
Q = H_A * W^Q
K = H_B * W^K
V = H_B * W^V
where H_A is the hidden representation of Sequence A with shape (n_q, d), H_B is the hidden representation of Sequence B with shape (n_kv, d), and W^Q, W^K, W^V are learned projection matrices. The query length n_q and the key/value length n_kv may be very different. In machine translation they are usually similar; in text-to-image diffusion they often differ by orders of magnitude, with thousands of spatial query positions attending to a few dozen text tokens.
The attention output is then calculated using the scaled dot-product formula:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
Here, d_k is the dimensionality of the key vectors, and the division by sqrt(d_k) prevents the dot products from growing too large in magnitude, which would push the softmax into regions with extremely small gradients [1]. The score matrix Q K^T has shape (n_q, n_kv) rather than the (n, n) matrix produced by self-attention, reflecting the fact that the two sides are decoupled in length.
The softmax operation normalizes the scores across all key positions, producing a probability distribution that represents how much attention each query position pays to each key position. The result is a weighted sum of the value vectors, where positions with higher attention scores contribute more to the output. Because the projections W^Q, W^K, and W^V are learned, the network can choose different subspaces for asking questions versus offering answers, and the same H_B can be projected into many different K and V views simply by training another set of projection matrices.
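As a concrete sketch, the computation above can be written out directly in a few lines of PyTorch. The tensor names and sizes below are purely illustrative, and the projection matrices are random stand-ins rather than learned weights:

import torch
import torch.nn.functional as F

d, n_q, n_kv = 64, 5, 12                    # model width, query length, key/value length
h_a = torch.randn(n_q, d)                   # Sequence A: supplies the queries
h_b = torch.randn(n_kv, d)                  # Sequence B: supplies the keys and values
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))   # stand-ins for learned projections

q = h_a @ w_q                               # (n_q, d)
k = h_b @ w_k                               # (n_kv, d)
v = h_b @ w_v                               # (n_kv, d)

scores = q @ k.T / d ** 0.5                 # (n_q, n_kv) compatibility matrix
weights = F.softmax(scores, dim=-1)         # each query row sums to 1 over key positions
output = weights @ v                        # (n_q, d) weighted combination of values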
In practice, cross-attention is almost always implemented with multi-head attention, following the formulation from Vaswani et al. (2017) [1]:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W^O
where head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
In the original transformer, the model uses h=8 attention heads with d_model=512, giving each head d_k = d_v = d_model/h = 64 dimensions. The per-head projection matrices W_i^Q, W_i^K, and W_i^V are each in R^(512x64), and the output projection W^O in R^(512x512) combines the concatenated head outputs back into the model dimension.
Multiple heads allow the model to attend to information from different representational subspaces simultaneously. In a translation model, one head might learn syntactic alignment between source and target words, another might capture semantic similarity across the two languages, and yet another might attend to positional proximity. Studies of trained translation models have repeatedly recovered familiar linguistic patterns inside individual cross-attention heads, including subject-verb agreement, coreference resolution, and word-order rearrangement. This parallel processing of different attention patterns is part of why multi-head cross-attention works so well in practice.
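A minimal sketch of the multi-head version follows: it reshapes already-projected Q, K, and V into h heads, applies scaled dot-product attention per head, and concatenates the results (the final W^O projection is omitted). Shapes follow the original transformer's h=8, d_model=512 configuration, but the tensors themselves are random placeholders:

import torch
import torch.nn.functional as F

d_model, h = 512, 8                         # as in the original transformer
d_k = d_model // h                          # 64 dimensions per head
n_q, n_kv = 10, 30

q = torch.randn(n_q, d_model)               # projected queries from Sequence A
k = torch.randn(n_kv, d_model)              # projected keys from Sequence B
v = torch.randn(n_kv, d_model)              # projected values from Sequence B

def split_heads(x):
    # (length, d_model) -> (h, length, d_k)
    return x.view(x.shape[0], h, d_k).transpose(0, 1)

qh, kh, vh = map(split_heads, (q, k, v))
scores = qh @ kh.transpose(-2, -1) / d_k ** 0.5    # (h, n_q, n_kv): one score matrix per head
heads = F.softmax(scores, dim=-1) @ vh             # (h, n_q, d_k)
concat = heads.transpose(0, 1).reshape(n_q, d_model)   # concatenated heads; W^O would follow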
Cross-attention rarely uses a causal mask, because the keys and values typically come from a representation that has already been computed in full before generation begins. In an encoder-decoder transformer, the entire encoder stack runs to produce H_B before the first decoder step, so there is no "future" to hide on the key side. The decoder still applies a causal mask in its self-attention sublayer, but the cross-attention sublayer is bidirectional with respect to the encoder.
Masking can still appear inside cross-attention for other reasons. Padding masks zero out attention to padding tokens added for batching. Cross-attention masks in Flamingo restrict each text token to attend only to images that appear earlier in the interleaved sequence, preserving the autoregressive property at the document level [3]. Some retrieval-augmented systems use block-diagonal masks that prevent attention from leaking across boundaries between independently encoded passages.
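A padding mask is typically applied by setting the masked positions of the score matrix to negative infinity before the softmax, so that the corresponding keys receive zero weight. The sketch below assumes the last two key/value positions are padding added for batching:

import torch
import torch.nn.functional as F

n_q, n_kv, d = 4, 6, 32
q, k, v = torch.randn(n_q, d), torch.randn(n_kv, d), torch.randn(n_kv, d)

key_is_padding = torch.tensor([False, False, False, False, True, True])

scores = q @ k.T / d ** 0.5                            # (n_q, n_kv)
scores = scores.masked_fill(key_is_padding, float("-inf"))   # padded keys get zero attention
output = F.softmax(scores, dim=-1) @ v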
The distinction between self-attention and cross-attention is straightforward but important. In self-attention, a single sequence serves as the source for all three components (queries, keys, and values). In cross-attention, the queries come from one sequence and the keys and values come from another. Both operations share the same underlying scaled dot-product framework; only the data sources differ.
| Feature | Self-attention | Cross-attention |
|---|---|---|
| Query source | Same sequence as K and V | Different sequence from K and V |
| Key/value source | Same sequence as Q | Different sequence from Q |
| Purpose | Capture relationships within one sequence | Align and integrate two different sequences |
| Location in original transformer | Encoder layers; decoder masked self-attention | Decoder encoder-decoder attention sublayer |
| Typical use | Contextual representation of a single input | Conditioning output generation on an external source |
| Sequence length symmetry | Always square: n_q = n_kv = n | Asymmetric: n_q and n_kv may differ |
| Score matrix shape | (n, n) | (n_q, n_kv) |
| Computational complexity | O(n^2 d) where n is sequence length | O(n_q n_kv d) where n_q and n_kv may differ |
| Causal masking | Optional (used in autoregressive decoders) | Rare (keys are usually fully observed) |
| Example application | BERT encoding a sentence | Machine translation decoder attending to encoder |
| Number of input sequences | One | Two |
| Modern decoder-only LLMs | Yes (sole attention type) | Usually absent |
Self-attention allows each token in a sequence to gather context from every other token in the same sequence. Cross-attention, by contrast, allows tokens in one sequence to gather information from tokens in a completely different sequence. The decision of which to use is essentially a question about the data. If you have one input and want richer features for it, you want self-attention. If you have two inputs and want one to be informed by the other, you want cross-attention.
It is worth noting that self-attention can be viewed as a degenerate case of cross-attention where Sequence A and Sequence B are identical. In implementations such as PyTorch's nn.MultiheadAttention, whose forward pass takes (query, key, value), the same module performs self-attention when query == key == value and cross-attention when they differ. This unification is a clean reflection of the fact that both operations live on the same mathematical foundation.
The transformer architecture introduced by Vaswani et al. in 2017 at NeurIPS contains three distinct types of attention [1]:
Encoder self-attention: Each position in the encoder attends to all positions in the previous encoder layer. The queries, keys, and values all come from the encoder's own representations.
Decoder masked self-attention: Each position in the decoder attends to all positions up to and including the current position. A causal mask prevents attending to future tokens, preserving the autoregressive property needed for generation.
Encoder-decoder attention (cross-attention): The queries come from the previous decoder layer, while the keys and values come from the output of the encoder stack. This allows every position in the decoder to attend over all positions in the input sequence.
The original transformer uses N=6 identical layers in both the encoder and the decoder. Each encoder layer has two sublayers: multi-head self-attention and a position-wise feed-forward network. Each decoder layer has three sublayers: masked multi-head self-attention, multi-head cross-attention over the encoder output, and a position-wise feed-forward network. Residual connections and layer normalization wrap each sublayer.
The cross-attention sublayer in the decoder is what allows the model to "look at" the input sequence while generating the output. Without it, the decoder would have no way to condition its predictions on the source information encoded by the encoder. This design mimics earlier encoder-decoder attention mechanisms used in sequence-to-sequence models with recurrent neural networks, but replaces recurrence with parallelizable attention operations [1][2].
A single decoder block, in pseudocode, looks roughly like this:
h = LayerNorm(x + MaskedSelfAttention(x))
h = LayerNorm(h + CrossAttention(query=h, key=enc_out, value=enc_out))
h = LayerNorm(h + FFN(h))
The cross-attention sublayer takes the running decoder representation h as the query side and the precomputed encoder output enc_out as the key-value side. Because enc_out is fixed for the duration of generation, its K and V projections can be cached after the first decoding step and reused at every subsequent step. This is one of the most important practical optimizations in encoder-decoder inference: the cross-attention KV cache for the source sequence is computed once and reused for the entire output, while only the decoder's self-attention KV cache grows token by token.
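The caching pattern can be sketched as follows, using single-head attention and random tensors in place of a trained model; k_cache and v_cache are computed once from enc_out and reused at every step, while only the query side changes:

import torch
import torch.nn.functional as F

d = 64
enc_out = torch.randn(37, d)                # encoder output, fixed for the whole generation
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

k_cache = enc_out @ w_k                     # computed once, before decoding starts
v_cache = enc_out @ w_v

dec_state = torch.randn(1, d)               # stand-in for the decoder's current hidden state
for step in range(5):
    q = dec_state @ w_q                     # only the query side is recomputed each step
    weights = F.softmax(q @ k_cache.T / d ** 0.5, dim=-1)
    context = weights @ v_cache             # (1, d) information read from the source
    dec_state = context                     # placeholder for the rest of the decoder block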
The direct ancestor of cross-attention is the soft attention mechanism introduced by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio in their 2014 paper Neural Machine Translation by Jointly Learning to Align and Translate [2]. Earlier sequence-to-sequence models compressed an entire source sentence into a single fixed-length context vector, which became a serious bottleneck for long inputs. Bahdanau and colleagues argued that forcing all source information through one vector squashed the signal and caused performance to degrade rapidly as sentences got longer.
Their solution was to encode the input as a sequence of annotation vectors using a bidirectional recurrent neural network, then to let the decoder choose, at each step, which of those annotations to focus on. The decoder produced a query (the previous decoder hidden state), scored each encoder annotation against it using a small feed-forward network, normalized the scores with a softmax, and combined the annotations into a context vector that fed into the next decoding step. This was, in retrospect, cross-attention applied to RNN hidden states using additive scoring rather than dot products. The mechanism is still known today as Bahdanau attention or additive attention.
Thang Luong, Hieu Pham, and Christopher D. Manning published a follow-up in 2015 titled Effective Approaches to Attention-based Neural Machine Translation [4]. They introduced two scoring functions (dot-product and general) as alternatives to Bahdanau's additive scoring, and distinguished between global attention, which attended to every source position, and local attention, which attended to a small window around a predicted alignment position. The dot-product score is the direct predecessor of the scaled dot-product attention used in transformers; the transformer's later innovation was to add the 1/sqrt(d_k) scaling factor that prevents the softmax from saturating in high dimensions [1].
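The difference between the two scoring styles is easy to see side by side. In the sketch below, s is a single decoder state, h holds the encoder annotations, and the parameters are random stand-ins for learned weights:

import torch
import torch.nn.functional as F

d, n_src = 128, 9
s = torch.randn(d)                          # decoder hidden state (the query)
h = torch.randn(n_src, d)                   # encoder annotations (keys and values)

# Bahdanau / additive: a small feed-forward network scores each (query, annotation) pair
W1, W2, v_a = torch.randn(d, d), torch.randn(d, d), torch.randn(d)
additive_scores = torch.tanh(s @ W1 + h @ W2) @ v_a        # (n_src,)

# Luong / dot-product: a single inner product per annotation (the transformer adds 1/sqrt(d_k))
dot_scores = h @ s                                          # (n_src,)

context = F.softmax(additive_scores, dim=-1) @ h            # weighted sum of annotations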
Two other 2015 papers extended the same ideas. End-to-End Memory Networks by Sukhbaatar, Szlam, Weston, and Fergus framed attention as soft addressing over an external memory matrix, with a query computing weights over memory keys and retrieving a combination of memory values [5]. The paper demonstrated multiple computational "hops" of attention per output, foreshadowing the multi-layer attention stacks of later transformers. Show, Attend and Tell by Kelvin Xu and colleagues extended attention to a multimodal setting [6]. The model produced captions one word at a time using an LSTM decoder that attended to spatial features from a convolutional neural network. Visualizing the attention maps showed the model fixating on the relevant region of the image as it generated each word. This was, in effect, the first widely used cross-modal cross-attention.
The transformer paper by Vaswani et al. (2017) reformulated cross-sequence attention using the unified query-key-value framework, replaced additive scoring with scaled dot-product scoring, and added multi-head parallelism [1]. The encoder-decoder attention sublayer in the transformer decoder is the canonical modern cross-attention. Eliminating recurrence meant cross-attention no longer needed to interleave with sequential RNN steps, making it both more expressive (through multiple heads) and dramatically more efficient (through parallelization on GPU and TPU hardware).
The original and most intuitive application of cross-attention is in machine translation. The encoder processes the source language sentence, and the decoder generates the target language sentence one token at a time. At each decoding step, cross-attention enables the decoder to identify which source tokens are most relevant for predicting the next target token.
For example, when translating "The cat sat on the mat" from English to French, the decoder might strongly attend to "cat" when generating "chat" and to "mat" when generating "tapis." The attention weights learned through cross-attention effectively create a soft alignment between source and target positions, replacing the hard alignment models used in earlier statistical machine translation systems. Researchers have used these alignments to build interpretability visualizations, to extract word-level alignments for downstream tools such as bilingual dictionaries, and to debug failure cases in production translation systems.
Beyond translation, cross-attention plays a central role in other sequence-to-sequence tasks including text summarization, question answering, and document-grounded generation, where the model must condition its output on an input passage. Encoder-decoder models such as T5 and BART use cross-attention as the primary channel through which the input shapes the output. The success of these models on summarization, paraphrasing, and grammatical error correction owes a great deal to the ability of cross-attention to flexibly select source information.
One of the most prominent modern uses of cross-attention is in text-to-image diffusion models. In these systems, cross-attention is the primary mechanism through which text descriptions guide the image generation process.
Stable Diffusion, based on the Latent Diffusion Model by Rombach et al. (2022), uses cross-attention layers within its U-Net denoising network to condition image generation on text prompts [7]. The text prompt is first encoded using a frozen CLIP ViT-L/14 text encoder. Inside the U-Net, the latent image features form the queries (Q), while the CLIP text token embeddings produce the keys (K) and values (V). At each denoising step, cross-attention allows the model to determine which spatial regions of the image should correspond to which words in the prompt. Stable Diffusion applies cross-attention at multiple spatial resolutions (64x64, 32x32, 16x16 latent feature maps), allowing different levels of textual conditioning at various granularities.
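The shape bookkeeping is the interesting part: spatial positions become the query tokens, and prompt tokens supply the keys and values. The sketch below uses illustrative dimensions (a 32x32 latent map and 77 text tokens) and random single-head projections rather than Stable Diffusion's actual weights or head structure:

import torch
import torch.nn.functional as F

b, c, height, width = 1, 320, 32, 32        # latent feature map inside the U-Net
n_text, d_text = 77, 768                    # CLIP text token embeddings
latents = torch.randn(b, c, height, width)
text_emb = torch.randn(b, n_text, d_text)

w_q = torch.randn(c, c)                     # illustrative projections
w_k = torch.randn(d_text, c)
w_v = torch.randn(d_text, c)

x = latents.flatten(2).transpose(1, 2)      # (1, 1024, 320): one query per spatial position
q = x @ w_q
k = text_emb @ w_k                          # (1, 77, 320): one key/value per prompt token
v = text_emb @ w_v

attn = F.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)    # (1, 1024, 77)
out = (attn @ v).transpose(1, 2).reshape(b, c, height, width)  # back to a spatial map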
Research has shown that cross-attention in Stable Diffusion operates in two distinct phases during inference. In the initial "semantics-planning" stage, the model relies heavily on cross-attention to lay out the spatial arrangement of objects described in the text. In the subsequent "fidelity-improving" stage, cross-attention outputs converge to a fixed point and the model focuses on refining visual details. This insight has been exploited in prompt-to-prompt editing techniques, where manipulating cross-attention maps allows targeted changes to generated images without altering the overall composition.
Imagen (Saharia et al., 2022, Google Research) takes a different approach to text encoding but relies on the same cross-attention conditioning mechanism. Instead of CLIP, Imagen uses the encoder from a frozen T5-XXL large language model, a text-only model not originally trained on image-text pairs. The text embeddings are injected into the U-Net via cross-attention, implemented by concatenating the text embedding to the key-value pairs of each attention layer. Google's research found that scaling the language model size improved image quality and text-image alignment more than scaling the diffusion model itself, highlighting how critical the cross-attention interface between text and image is for generation quality [8].
DALL-E 2 (Ramesh et al., 2022, OpenAI), also known as unCLIP, uses a two-stage architecture consisting of a diffusion prior that generates CLIP image embeddings from text, followed by a decoder that generates images from those embeddings [9]. The decoder uses cross-attention layers between its encoder and decoder blocks, with the cross-attention taking CLIP text embeddings and previous layer outputs to produce context vectors for each subsequent layer.
More recent text-to-image systems have explored alternatives. The Diffusion Transformer (DiT) by Peebles and Xie (2023) replaced the U-Net backbone with a transformer operating on latent patches and compared several conditioning mechanisms: in-context conditioning, cross-attention, and adaptive layer norm (adaLN and its adaLN-Zero variant) [10]. The paper found that adaLN-Zero produced lower FID than cross-attention while being more compute-efficient, which is why newer DiT-based systems often use adaLN-Zero for class or scalar conditioning. Cross-attention remains the standard for text conditioning, where the signal is a sequence of token embeddings rather than a single class label. Stable Diffusion 3 and FLUX.1 (2024) use a Multimodal Diffusion Transformer (MMDiT) architecture in which text and image tokens are processed jointly with separate Q, K, V projections per modality, structurally equivalent to interleaved bidirectional cross-attention.
Cross-attention is a core architectural element in many vision-language models that need to fuse visual and textual information. The choice between cross-attention and simpler concatenation-based fusion is one of the central design decisions in this space.
Flamingo (Alayrac et al., 2022, DeepMind) introduced gated cross-attention dense layers as its primary mechanism for injecting visual information into a frozen large language model [3]. In Flamingo's architecture, new cross-attention layers (called gated xattn-dense blocks) are interleaved between the existing frozen LLM layers. The visual features (from a Vision Encoder processed through a Perceiver Resampler) serve as keys and values, while the language model's hidden states serve as queries. A learned gating mechanism, using a tanh-based scalar gate initialized at zero, gradually introduces visual conditioning without destabilizing the pretrained language model. This design allows Flamingo to perform few-shot learning on vision-language tasks by providing interleaved image-text examples as prompts.
BLIP-2 (Li et al., 2023) introduced the Q-Former, a lightweight querying transformer that bridges a frozen image encoder to a frozen language model [11]. The Q-Former uses a small set of learnable query tokens that pass through self-attention layers (where they exchange information among themselves) and cross-attention layers (where they extract information from the visual encoder's features). The output is a fixed-size set of query tokens that can be projected into the language model's input space. This approach achieves competitive results with far fewer trainable parameters than approaches that train end-to-end multimodal models.
LLaVA (Liu et al., 2023) takes a different approach. Rather than using cross-attention layers, LLaVA projects visual tokens from a CLIP vision encoder through a simple linear (later MLP) projection and concatenates them with text tokens as input to the LLM decoder. This projection-based fusion avoids the need for additional cross-attention parameters, though it requires the language model's self-attention to handle the combined visual-textual sequence. The LLaVA approach has become increasingly popular as an alternative to explicit cross-attention for vision-language integration, partly because it simplifies training and partly because it lets the model use the same self-attention layers it already trained for text.
Qwen-VL (Bai et al., 2023) sits between these two approaches. It uses a single-layer cross-attention adapter with trainable query vectors and visual encoder features as keys and values, condensing the variable-length sequence of visual features down to a fixed length of 256. 2D absolute positional encodings are added to the cross-attention query-key pairs to preserve spatial information about the original image during compression. The compressed visual sequence is then concatenated with text tokens and processed by the Qwen language model.
The IDEFICS family from Hugging Face illustrates how the architectural pendulum has swung. The first IDEFICS, an open reproduction of Flamingo, used gated cross-attention layers. IDEFICS-2, released in 2024, departed from this design and adopted a fully autoregressive approach with a Perceiver pooling step followed by an MLP projection and concatenation with text tokens, similar to LLaVA. The trade-off is that cross-attention provides cleaner separation between modalities and more parameter-efficient conditioning, while concatenation-based approaches require fewer architectural changes and reuse the existing self-attention infrastructure.
The table below summarizes how some prominent multimodal architectures handle cross-modal fusion.
| Model | Year | Vision-language fusion | Cross-attention used? |
|---|---|---|---|
| Show, Attend and Tell | 2015 | LSTM decoder attends to CNN feature map | Yes (additive attention over spatial features) |
| Flamingo | 2022 | Gated xattn-dense layers interleaved into frozen LLM | Yes (gated cross-attention) |
| Perceiver | 2021 | Cross-attention from latent array to input bytes | Yes (defining feature) |
| Perceiver IO | 2021 | Input cross-attention plus output cross-attention | Yes (input and output) |
| BLIP-2 | 2023 | Q-Former with learned queries cross-attending to ViT features | Yes (in Q-Former) |
| Qwen-VL | 2023 | Single-layer cross-attention adapter compresses ViT features to 256 tokens | Yes (in adapter) |
| LLaVA | 2023 | MLP projection of ViT features concatenated with text | No (concatenation) |
| IDEFICS | 2023 | Open reproduction of Flamingo with gated xattn-dense | Yes (gated) |
| IDEFICS-2 | 2024 | Perceiver pooling plus MLP projection plus concatenation | No (concatenation) |
| Florence-2 | 2023 | Sequence-to-sequence with vision encoder and language decoder | Yes (encoder-decoder cross-attention) |
| Stable Diffusion | 2022 | U-Net cross-attends to CLIP text embeddings | Yes (text conditioning) |
| Imagen | 2022 | U-Net cross-attends to T5-XXL text embeddings | Yes (text conditioning) |
| DALL-E 2 | 2022 | Decoder cross-attends to CLIP text and image embeddings | Yes |
Encoder-decoder automatic speech recognition (ASR) systems rely on cross-attention to align acoustic features with text transcriptions. The decoder generates one text token at a time while attending over the encoder's representation of the audio.
Whisper (Radford et al., 2022, OpenAI) is a prominent example. Whisper processes audio by converting 30-second chunks into log-Mel spectrograms, which are then encoded by a transformer encoder. The decoder uses cross-attention to attend over the encoder's output while autoregressively generating text tokens [12]. This cross-attention mechanism creates a dynamic alignment between acoustic frames and the text being transcribed, allowing the model to handle variable-length audio inputs and produce accurate transcriptions across multiple languages.
The cross-attention weights in Whisper have proven useful beyond basic transcription. Researchers have leveraged the implicit time alignment captured in the cross-attention maps for tasks like word-level timestamping and streaming ASR. The Simul-Whisper system, for instance, uses the time alignment embedded in Whisper's cross-attention to guide auto-regressive decoding for chunk-based streaming speech recognition.
Earlier encoder-decoder ASR systems, such as those based on Listen, Attend and Spell (LAS) by Chan et al. (2016), also used attention-based alignment between acoustic encoders and text decoders, establishing the pattern that Whisper and other modern systems follow.
Cross-attention is the standard mechanism in generative question answering systems where a model must produce an answer conditioned on one or more documents. In an encoder-decoder model, the relevant documents are encoded into key-value memory, and the decoder generates the answer while attending over those representations. The cross-attention weights effectively serve as a learned relevance scorer, surfacing the parts of the input that are most useful for the current generation step.
This pattern is the foundation of most extractive and abstractive QA systems built on top of encoder-decoder transformers. Long-document QA systems extend the same mechanism with retrieval, sparse attention, or hierarchical encoders to keep cross-attention costs manageable when the input becomes very long.
The Perceiver (Jaegle et al., 2021, DeepMind) represents a creative rethinking of how cross-attention can be used to handle inputs of arbitrary size and modality. Traditional transformers apply self-attention to the full input, which is computationally prohibitive for high-dimensional data such as images. A 224x224 image has over 50,000 pixels, so naive self-attention would require an attention matrix with 2.5 billion entries.
The Perceiver solves this problem by using cross-attention to project a high-dimensional input byte array into a fixed-size latent array. The architecture has two core components [13]:
Cross-attention module: Maps from a large input byte array (M elements) and a small latent array (N elements, where N is much smaller than M) to a new latent array. The queries come from the latent array and the keys and values come from the input.
Latent transformer tower: A stack of self-attention blocks that operates entirely in the latent space.
The computational complexity of the cross-attention step is O(M * N), which is linear in the input size M (since N is fixed). The latent self-attention has complexity O(N^2), which is independent of the input size. This gives the full architecture a complexity of O(M * N + L * N^2), where L is the number of latent self-attention layers, effectively decoupling network depth from input size.
In practice, the Perceiver uses a latent array of around 512 indices with 1024 channels for ImageNet classification. The architecture iteratively alternates between cross-attention (to re-read the input) and latent self-attention (to process the latent representations), with multiple rounds of cross-attention yielding better performance at the cost of increased computation. Perceiver achieves performance comparable to ResNet-50 and ViT on ImageNet without using 2D convolutions, by directly attending to all 50,000 pixels through cross-attention into the latent bottleneck [13].
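The input cross-attention step can be sketched in a few lines; the key point is that the score matrix has shape (N, M), linear in the input size, rather than (M, M). Sizes below are illustrative and the latent array is random rather than learned:

import torch
import torch.nn.functional as F

M, N, d = 50_176, 512, 256                  # input elements (224x224 pixels), latents, width
inputs = torch.randn(M, d)                  # flattened input byte array
latents = torch.randn(N, d)                 # stand-in for the learned latent array
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))

q = latents @ w_q                           # queries come from the small latent array
k, v = inputs @ w_k, inputs @ w_v           # keys and values come from the large input

scores = q @ k.T / d ** 0.5                 # (N, M): cost is linear in the input size M
latents = F.softmax(scores, dim=-1) @ v     # updated latents; the self-attention tower follows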
The Perceiver IO extension (Jaegle et al., 2021) further generalizes this approach by adding an output cross-attention step, where task-specific output queries cross-attend to the latent array to produce structured outputs of any desired shape. This makes the Perceiver IO usable as a generic backbone for any task whose inputs and outputs can be expressed as arrays of vectors, including classification, segmentation, optical flow, and language modeling.
This architecture has influenced many subsequent designs. The Perceiver Resampler used in Flamingo, for example, uses the same principle of cross-attending from a fixed set of learned latent queries to a variable-length set of visual features, producing a fixed-size set of visual tokens regardless of the input resolution. The Q-Former in BLIP-2 follows the same pattern. Both can be viewed as direct descendants of the Perceiver's input cross-attention.
Cross-attention is a key mechanism in several retrieval-augmented generation (RAG) architectures, where a model must incorporate external retrieved documents into its generation process.
The RAG framework introduced by Lewis et al. (2020) combined a dense passage retriever with a pretrained sequence-to-sequence model (BART) [14]. Given an input query, the retriever returned the top-K relevant Wikipedia passages from a dense vector index. The seq2seq model then conditioned on the retrieved passages along with the query to generate an answer. Cross-attention is the channel through which the retrieved evidence flows into the decoder. Two RAG variants were explored: RAG-Sequence assumes a single retrieved document is responsible for the entire output, while RAG-Token allows each generated token to attend to a different document, both implemented through marginalization over retrieved documents in the decoder.
RAG set the state of the art on three open-domain question answering benchmarks at the time of publication and helped establish the broader paradigm of retrieval-augmented generation that now dominates production deployments of large language models for factual tasks.
Fusion-in-Decoder (Izacard and Grave, 2021) is a retrieval-augmented model built on T5 that uses cross-attention as its core fusion mechanism [15]. In FiD, each retrieved passage is encoded independently by the T5 encoder. The encoder outputs from all retrieved passages are then concatenated and passed to the T5 decoder, which attends over all encoded passages jointly through cross-attention. This design allows the decoder to identify and synthesize relevant information scattered across multiple retrieved documents.
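Structurally, the FiD pattern amounts to the following sketch (single-head, with the separate K and V projections omitted for brevity): each passage is encoded on its own, and the decoder attends over the concatenation of all encoder outputs:

import torch
import torch.nn.functional as F

d, n_passages, passage_len = 512, 4, 200

# each (question, passage) pair is encoded independently -- this step parallelizes trivially
encoded = [torch.randn(passage_len, d) for _ in range(n_passages)]

# the encoder outputs are concatenated along the sequence axis ...
enc_out = torch.cat(encoded, dim=0)         # (800, d)

# ... and the decoder cross-attends over all passages jointly
q = torch.randn(1, d)                       # current decoder position
weights = F.softmax(q @ enc_out.T / d ** 0.5, dim=-1)   # (1, 800) relevance over all tokens
context = weights @ enc_out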
The cross-attention weights in FiD effectively serve as a learned relevance scoring mechanism, with the model learning to focus on the most informative passages during generation. Researchers have used these attention scores to compute passage-level importance for downstream tasks such as supporting evidence selection. FiD also scales gracefully: encoding passages independently is parallelizable, and decoder cross-attention concentrates almost all of the cross-passage interaction in one place. The FiDO follow-up further reduced inference cost by applying cross-attention only every few decoder layers, a variant known as layer-sparse attention.
The Retrieval-Enhanced Transformer (RETRO) by Borgeaud et al. (2022, DeepMind) introduces a chunked cross-attention mechanism for integrating retrieved text into language model predictions [16]. RETRO splits the input into chunks and retrieves nearest neighbors for each chunk from a database of trillions of tokens. The retrieved neighbors are processed by an encoder, and the resulting representations are integrated into the main model through a specialized chunked cross-attention layer that interleaves with the model's regular self-attention. This approach has time complexity linear in the amount of retrieved data and enabled a 7.5 billion parameter RETRO to match the performance of GPT-3 (175B) on the Pile, despite using roughly 25 times fewer parameters.
The success of these architectures demonstrates that cross-attention provides a natural and effective interface for conditioning a generative model on externally retrieved information, whether that information comes from a document corpus, a knowledge base, or any other structured data source. Production RAG pipelines today often place retrieval and conditioning at the prompt level (concatenating retrieved chunks into the LLM's input window), but the architectural variants based on cross-attention remain influential and frequently outperform prompt-only approaches when sufficient training data is available.
The computational cost of cross-attention is determined by the sizes of the two sequences involved. If the query sequence has n_q tokens and the key/value sequence has n_kv tokens, the attention computation requires O(n_q n_kv d) operations, where d is the embedding dimension.
In the standard encoder-decoder transformer for machine translation, both sequences are typically of similar length, making cross-attention comparable in cost to self-attention. However, in modern applications the two sequences can differ dramatically in size. In diffusion models, the spatial feature map may contain thousands of positions while the text prompt has only dozens of tokens. In the Perceiver, the input may contain tens of thousands of elements while the latent array has only a few hundred. In retrieval-augmented systems such as FiD, the concatenated encoder output can be many thousands of tokens long while the decoder generates only a few hundred.
A useful efficiency property of cross-attention in encoder-decoder models is that the keys and values from the encoder side can be precomputed and cached. The encoder runs once on the source sequence to produce H_B, then the K = H_B W^K and V = H_B W^V projections are computed once and reused at every decoding step. Only the query side, derived from the growing decoder representation, needs to be recomputed at each step. This is one of the largest practical wins of encoder-decoder architectures over alternative designs. By contrast, decoder self-attention requires growing the K and V cache by one token at every generation step, although that cache can also be reused across steps.
As model resolutions and sequence lengths have grown, cross-attention has become a notable computational bottleneck in some applications. In text-to-image diffusion models, cross-attention can account for a substantial fraction of the total inference latency, particularly at high resolutions where the spatial feature maps are large. In retrieval-augmented systems, the concatenated encoder output can easily exceed tens of thousands of tokens. Cross-attention through such long key-value sequences is expensive in both compute and memory.
Several approaches have been developed to reduce the cost of cross-attention.
| Technique | Description | Typical savings |
|---|---|---|
| Sparse cross-attention | Limits attention computation to selected subsets of tokens based on fixed patterns, routing, or clustering | 40-60% memory reduction |
| Layer-sparse cross-attention (LSA) | Applies cross-attention in only a subset of decoder layers rather than all of them, as in FiDO | Reduces decoder inference cost significantly |
| Multi-query attention (MQA) | Shares a single key and value projection across all attention heads while keeping separate query projections | KV cache memory reduced roughly in proportion to the number of heads |
| Grouped-query attention (GQA) | Compromise between full multi-head and MQA, sharing K and V across small groups of heads | Significant memory savings with minimal quality loss |
| Cache sharing | Reuses computed key-value caches across similar cross-attention layers | Reduces redundant computation |
| FlashAttention | Memory-efficient exact attention using tiling and recomputation | 2-4x speedup with no approximation |
| Cross-attention pruning | Identifies and removes redundant cross-attention layers after convergence | Reduces inference cost in diffusion models |
| KV cache reuse | Precomputes encoder K and V once and reuses for all decoder steps | Removes per-step encoder cost |
Research on Stable Diffusion has shown that cross-attention outputs converge to a fixed point after only a few inference steps. This observation has led to methods that skip cross-attention computation in later denoising steps without measurable loss in image quality, providing significant inference speedups. FlashAttention, originally developed for self-attention, also accelerates cross-attention by avoiding materialization of the full (n_q, n_kv) score matrix in high-bandwidth memory.
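In PyTorch 2.x, the fused kernel is available as torch.nn.functional.scaled_dot_product_attention, which dispatches to FlashAttention-style implementations when possible and accepts query and key/value tensors of different lengths, so the same call covers cross-attention. A minimal sketch with illustrative shapes:

import torch
import torch.nn.functional as F

b, h, d_k = 2, 8, 64
n_q, n_kv = 4096, 77                        # e.g. spatial queries attending to prompt tokens

q = torch.randn(b, h, n_q, d_k)
k = torch.randn(b, h, n_kv, d_k)
v = torch.randn(b, h, n_kv, d_k)

# the fused kernel avoids materializing the full (n_q, n_kv) score matrix in HBM
out = F.scaled_dot_product_attention(q, k, v)   # (b, h, n_q, d_k)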
Beyond the standard multi-head formulation, several specialized variants of cross-attention have been developed for specific applications.
Gated cross-attention adds a learned gating mechanism that controls how much cross-attended information is mixed into the main representation. This is particularly useful when adding cross-attention to a pretrained model, as it allows the model to start from a state where the cross-attention has no effect (gate initialized to zero) and gradually learn to incorporate the new information source. Flamingo's gated cross-attention is the best-known example. Each layer's output is multiplied by tanh(alpha), where alpha is a learned scalar initialized to zero [3]. This ensures the pretrained language model's behavior is preserved at initialization, with visual information gradually introduced during training. The same trick has been used in many subsequent vision-language models that attach new modalities to a frozen backbone.
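A minimal sketch of the gating pattern, written as a standalone module rather than Flamingo's actual block (which also gates a feed-forward sublayer), looks like this:

import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.alpha = nn.Parameter(torch.zeros(1))   # learned scalar gate, initialized at zero

    def forward(self, text_hidden, visual_tokens):
        attended, _ = self.cross_attn(query=text_hidden,
                                      key=visual_tokens, value=visual_tokens)
        # tanh(0) = 0, so the frozen language model's behavior is unchanged at initialization
        return text_hidden + torch.tanh(self.alpha) * attended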
Used in the RETRO architecture, chunked cross-attention splits the input sequence into fixed-size chunks and applies cross-attention independently within each chunk to its corresponding retrieved neighbors [16]. This reduces computational cost and allows the model to handle very long sequences by localizing cross-attention. It also makes the dependency structure between input chunks and retrieved evidence explicit, which supports analysis of which retrieved passages influenced which parts of the output.
Instead of deriving queries from a decoder's hidden states, some architectures use a fixed set of learned query vectors. The Perceiver's latent array, the Perceiver Resampler in Flamingo, and the Q-Former in BLIP-2 all use this approach [11][13]. The learned queries cross-attend to the input, effectively learning to extract the most relevant information regardless of input content. This decouples output size from input size and provides a flexible information bottleneck that can compress arbitrarily long inputs into a fixed number of tokens.
Standard cross-attention is unidirectional: information flows from keys/values to queries. Bidirectional cross-attention, as used in BiXT (2024), allows information to flow in both directions simultaneously, addressing the iterative bottleneck in Perceiver-like architectures. The same structural pattern appears in MMDiT in Stable Diffusion 3 and FLUX.1, where text and image tokens attend to each other through paired projections. Decoupled cross-attention, used in the IP-Adapter for Stable Diffusion, splits cross-attention into separate components for different conditioning sources (for example, image embeddings alongside text), enabling image-prompted generation without retraining the base model.
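A decoupled cross-attention layer in the IP-Adapter style can be sketched as two attention reads over separate conditioning sources whose outputs are summed; the example below simplifies by using one tensor for both keys and values and random data throughout:

import torch
import torch.nn.functional as F

d, n_q, n_text, n_img = 320, 1024, 77, 4
q = torch.randn(n_q, d)                     # spatial queries from the denoising network
text_kv = torch.randn(n_text, d)            # projected text keys/values (original branch)
image_kv = torch.randn(n_img, d)            # projected image-prompt keys/values (added branch)
scale = 0.8                                 # strength of the image prompt

def attend(q, kv):
    return F.softmax(q @ kv.T / d ** 0.5, dim=-1) @ kv

out = attend(q, text_kv) + scale * attend(q, image_kv)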
Cross-attention is not the only way to combine information from two sequences. Several alternative approaches exist, each with different tradeoffs.
| Mechanism | How it works | Pros | Cons |
|---|---|---|---|
| Cross-attention | Queries from one sequence attend to keys/values from another | Fine-grained token-level alignment; learnable relevance | O(n_q n_kv) complexity; adds parameters |
| Concatenation plus self-attention | Combine both sequences into one and apply self-attention | Simple; no extra parameters; reuses existing layers | Quadratic in combined length; no explicit source distinction |
| Linear projection | Project one modality into another's embedding space | Very simple; minimal extra parameters | No dynamic alignment; coarse fusion |
| FiLM conditioning | Scale and shift features using learned affine transforms derived from the conditioning input | Lightweight; good for global conditioning | No spatial or token-level alignment |
| Additive fusion | Element-wise addition of aligned representations | Computationally cheap | Requires pre-aligned representations |
| adaLN / adaLN-Zero | Adaptive layer norm scaled and shifted by conditioning | Cheap; works well for class-label conditioning | Single conditioning vector; not great for sequential conditions |
| Hypernetwork conditioning | Conditioning input generates the weights of another network | Maximally flexible | Computationally heavy; hard to train |
The choice among these mechanisms depends on the task requirements. Cross-attention excels when fine-grained, position-specific alignment between two sequences is needed, which is why it dominates in translation, diffusion models, and retrieval-augmented generation. Simpler approaches like projection or concatenation may suffice when the alignment is less critical or when computational budgets are tight. The current vision-language ecosystem is split between cross-attention-based designs (Flamingo, BLIP-2, Qwen-VL) and concatenation-based designs (LLaVA, IDEFICS-2), with no clear consensus on which approach is best for all use cases.
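To make the contrast concrete, a FiLM-style conditioning layer reduces to a per-channel scale and shift predicted from the conditioning input, with no token-level alignment at all; a rough sketch:

import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, d_cond, d_feat):
        super().__init__()
        self.to_gamma_beta = nn.Linear(d_cond, 2 * d_feat)

    def forward(self, features, cond):
        # features: (n, d_feat); cond: (d_cond,) conditioning vector
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return gamma * features + beta      # every position receives the same modulation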
The most popular modern large language models, including GPT-4, Claude, Llama, Gemini, and DeepSeek, are decoder-only models that contain no cross-attention layers. They use only causal self-attention. This is a striking departure from the original transformer, which placed cross-attention at the heart of its design.
Decoder-only architectures are simpler and treat all input as a single sequence, which makes pretraining objectives such as next-token prediction uniform. They scale predictably under standard scaling laws and reuse the same self-attention layers for both encoding and generation, simplifying training and serving infrastructure. Conditioning is handled by placing the conditioning input in the prompt rather than in a separate encoder, and the model's self-attention discovers the relevant relationships.
Encoder-decoder models with cross-attention still perform well on tasks where the input and output are structurally distinct, such as machine translation, summarization, and grammatical error correction. T5, BART, mT5, and Whisper are widely used encoder-decoder models that retain cross-attention. The current dominance of decoder-only architectures reflects practical engineering choices and scaling considerations more than a verdict that cross-attention is fundamentally inferior. In multimodal models, cross-attention is reappearing in new forms such as MMDiT and decoupled cross-attention adapters.
Most deep learning frameworks expose cross-attention through the same multi-head attention module they use for self-attention. In PyTorch, the standard interface is torch.nn.MultiheadAttention(embed_dim, num_heads), with forward(query, key, value). When all three are the same tensor, the module computes self-attention. When query differs from key and value, it computes cross-attention. A typical decoder block looks like this:
# h: incoming decoder hidden states; enc_out: precomputed encoder output
# self-attention sublayer (causal)
self_out, _ = self.self_attn(query=h, key=h, value=h, attn_mask=causal_mask)
h = self.norm1(h + self_out)
# cross-attention sublayer
cross_out, _ = self.cross_attn(query=h, key=enc_out, value=enc_out, key_padding_mask=enc_mask)
h = self.norm2(h + cross_out)
# feed-forward sublayer
h = self.norm3(h + self.ffn(h))
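A self-contained version of the same interface shows that the query and key/value lengths are free to differ; the shapes here are arbitrary:

import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

dec_h = torch.randn(1, 10, 512)             # 10 decoder positions supply the queries
enc_out = torch.randn(1, 37, 512)           # 37 encoder positions supply the keys and values

out, weights = attn(query=dec_h, key=enc_out, value=enc_out)
print(out.shape, weights.shape)             # torch.Size([1, 10, 512]) torch.Size([1, 10, 37])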
The Hugging Face Transformers library provides ready-made implementations of cross-attention inside its encoder-decoder model classes such as T5ForConditionalGeneration, BartForConditionalGeneration, and WhisperForConditionalGeneration. The cross-attention KV cache is exposed through the past_key_values argument and is automatically populated and reused during generation. For diffusion models, frameworks such as Diffusers expose cross-attention through CrossAttention (or the equivalent AttnProcessor), with the conditioning input passed as encoder_hidden_states.
Because cross-attention produces an explicit (n_q, n_kv) matrix of weights, it is one of the more interpretable components of modern transformer architectures. In machine translation, cross-attention maps approximate word alignments and have been used to extract bilingual dictionaries and to debug translation errors. In Whisper, the time alignment in cross-attention enables word-level timestamps and has been exploited by streaming inference systems. In Stable Diffusion, cross-attention maps have been visualized as heatmaps over the image to show where each prompt word focuses, and this information is the basis for techniques such as prompt-to-prompt editing, attention slicing, and attention-based segmentation.
Attention weights are not a complete explanation of model behavior. Two heads can attend to the same positions and produce very different outputs because the values they read are projected differently, and several papers have shown that attention is not always faithful as an explanation. Still, cross-attention maps are far more interpretable than the dense weight matrices of feed-forward layers, and they remain a useful tool for research and debugging.
The concept of attending from one sequence to another predates the transformer. Bahdanau attention introduced the idea of learning soft alignments between encoder and decoder states in RNN-based sequence-to-sequence models in 2014 [2]. The Luong attention mechanism proposed simplified scoring functions the following year [4], end-to-end memory networks framed the same operation as soft addressing over an external memory [5], and Show, Attend and Tell extended attention to images [6]. These early mechanisms are, in essence, forms of cross-attention applied to recurrent or convolutional representations rather than transformer representations.
The transformer's contribution was to reformulate cross-sequence attention using the query-key-value framework, combine it with multi-head parallelism, and embed it within a fully attention-based architecture that eliminated recurrence entirely [1]. This made cross-attention both more expressive and more efficient. The trajectory since then has been one of generalization across modalities. The encoder-decoder pattern of NMT extended to summarization, speech recognition, and image captioning, then jumped modalities to power text-conditioned diffusion models, vision-language models, and retrieval-augmented systems.
The most recent chapter is somewhat surprising. Decoder-only large language models, the dominant paradigm of modern AI, do not use cross-attention. They handle conditioning entirely through prompt concatenation and self-attention. At the same time, cross-attention remains essential in diffusion models, efficient multimodal adapters, retrieval-augmented architectures, and encoder-decoder systems for translation and ASR. The mechanism is no longer the headline feature of any single architecture, but it has settled into the role of a flexible building block designers reach for whenever they need to bring two representations into contact.