Multi-Query Attention (MQA)
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 2,703 words
Multi-Query Attention (MQA) is a variant of the multi-head attention mechanism used in [[transformer]] neural networks in which all query heads share a single key head and a single value head, rather than each query head having its own dedicated key and value projections. The technique was introduced by [[noam_shazeer]] in 2019 to address the memory-bandwidth bottleneck of autoregressive Transformer decoding, and it produces a dramatically smaller key/value cache during inference while typically incurring only a small loss in model quality.[^1]
MQA was widely adopted in large language models released between 2022 and 2023, including PaLM, Falcon, and StarCoder, where its ability to shrink the KV cache enabled larger batch sizes and faster generation. Since 2023, however, MQA has been largely superseded in new models by Grouped-Query Attention (GQA), a generalisation introduced by Ainslie et al. that recovers most of MQA's speed gains while staying closer in quality to standard multi-head [[attention]].[^2]
In the original Transformer of Vaswani et al. (2017), a multi-head [[attention]] layer with h heads computes, for each head, a separate linear projection of the input into query, key and value vectors, then performs scaled dot-product [[attention]] independently per head. Concretely, the learned projection tensors P_q and P_k have shape [h, d, k] and P_v has shape [h, d, v], so each head owns its own [d, k] (or [d, v]) slice and a Transformer layer maintains h distinct sets of keys and values.[^1]
This design is fast to train, because all positions in a sequence can be processed in parallel. But Shazeer (2019) observed that the same design becomes a serious bottleneck during incremental inference, in which an autoregressive decoder generates one token at a time. At each decoding step, the model must reload from memory the cached keys and values of every previously generated position, for every layer and every head. This collection of cached tensors is generally called the [[kv_cache]]. Shazeer's performance analysis showed that, for a baseline Transformer decoder, the ratio of memory access to arithmetic operations during incremental decoding scales as Theta(n/d + 1/b), where n is the current sequence length, d is the model dimension and b is the batch size. When n approaches d or the batch size approaches one, this ratio is close to one, and a modern GPU or TPU, which can perform far more arithmetic operations than memory accesses per second, stalls waiting for memory rather than performing useful computation.[^1]
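The incremental decoding step described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's code: a single sequence (no batch dimension), one layer, standard multi-head projections, and a hypothetical function name `decode_step`. It shows that every step reads the entire cached K and V tensors, whose size grows with the number of positions generated so far.

```python
import numpy as np

def decode_step(x, cache_K, cache_V, P_q, P_k, P_v, P_o):
    """One multi-head decoding step for a single new token (illustrative sketch).

    x:        [d]        input vector of the new token
    cache_K:  [h, t, k]  keys of the t previous positions
    cache_V:  [h, t, v]  values of the t previous positions
    P_q, P_k: [h, d, k]; P_v, P_o: [h, d, v]
    Returns the output vector [d] and the extended caches.
    """
    q = np.einsum("d,hdk->hk", x, P_q)
    new_k = np.einsum("d,hdk->hk", x, P_k)
    new_v = np.einsum("d,hdv->hv", x, P_v)

    # Append this position's key and value to the per-head caches.
    cache_K = np.concatenate([cache_K, new_k[:, None, :]], axis=1)  # [h, t+1, k]
    cache_V = np.concatenate([cache_V, new_v[:, None, :]], axis=1)  # [h, t+1, v]

    # The whole cache is read here, once per step, for every head.
    logits = np.einsum("hk,htk->ht", q, cache_K) / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    o = np.einsum("ht,htv->hv", weights, cache_V)
    y = np.einsum("hv,hdv->d", o, P_o)
    return y, cache_K, cache_V
```

Starting from empty caches of shape [h, 0, k] and [h, 0, v], calling `decode_step` once per generated token grows both caches by one position per call.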
The two terms in this ratio suggest two different remedies. The 1/b term can be reduced by increasing the batch size, memory permitting. The n/d term is harder, because the cached K and V tensors have aggregate size proportional to b * h * n * k (where k is the per-head key dimension) and must be loaded in full at each decoding step. Earlier work attacked this by limiting the sequence length, attending only to a local window, or otherwise reducing the number of positions attended to.[^1] MQA takes a complementary path: rather than shrinking the sequence axis, it shrinks the heads axis of K and V.
MQA modifies multi-head attention by removing the heads dimension h from the key and value projection tensors while keeping it on the queries and on the output projection. In the original paper Shazeer writes that "Multi-query attention is identical except that the different heads share a single set of keys and values."[^1]
In terms of tensor shapes:
- Multi-head attention uses P_q, P_k, P_v, P_o with shape [h, d, k] (or [h, d, v]).
- Multi-query attention keeps P_q of shape [h, d, k] and P_o of shape [h, d, v], but reduces P_k to shape [d, k] and P_v to shape [d, v], so there is exactly one key projection and one value projection shared across all h query heads.[^1]
Equivalently, all h query heads attend to the same key matrix K of shape [m, k] (or [b, m, k] for a batch of sequences) and the same value matrix V of shape [m, v], rather than h distinct copies. The TensorFlow einsum code shown in the paper differs from the multi-head version only in that the letter h is dropped from the equations wherever it indexed the heads dimension of K, V, P_k or P_v.[^1]
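The shape difference can be made concrete with a NumPy sketch written in the same einsum style. It is not the paper's TensorFlow code: the function names are hypothetical, there is no batch dimension or attention mask, and the 1/sqrt(k) scaling is an assumption taken from standard scaled dot-product attention. The only difference between the two functions is that h disappears from every einsum operand that involves K, V, P_k or P_v.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, M, P_q, P_k, P_v, P_o):
    """X: [n, d] query inputs, M: [m, d] memory.
    P_q, P_k: [h, d, k]; P_v, P_o: [h, d, v]."""
    Q = np.einsum("nd,hdk->hnk", X, P_q)
    K = np.einsum("md,hdk->hmk", M, P_k)   # one key matrix per head
    V = np.einsum("md,hdv->hmv", M, P_v)   # one value matrix per head
    logits = np.einsum("hnk,hmk->hnm", Q, K) / np.sqrt(Q.shape[-1])
    weights = softmax(logits, axis=-1)
    O = np.einsum("hnm,hmv->hnv", weights, V)
    return np.einsum("hnv,hdv->nd", O, P_o)

def multi_query_attention(X, M, P_q, P_k, P_v, P_o):
    """Same as above, except P_k: [d, k] and P_v: [d, v] (no heads axis)."""
    Q = np.einsum("nd,hdk->hnk", X, P_q)
    K = np.einsum("md,dk->mk", M, P_k)     # single shared key matrix
    V = np.einsum("md,dv->mv", M, P_v)     # single shared value matrix
    logits = np.einsum("hnk,mk->hnm", Q, K) / np.sqrt(Q.shape[-1])
    weights = softmax(logits, axis=-1)
    O = np.einsum("hnm,mv->hnv", weights, V)
    return np.einsum("hnv,hdv->nd", O, P_o)
```

As a sanity check on the sketch, copying one [d, k] key projection and one [d, v] value projection into every head of the multi-head version gives the same output as the multi-query version.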
A useful way to view MQA is as an extreme point in a more general design space. If we let g denote the number of key/value heads, then:
- g = h recovers standard multi-head attention.
- g = 1 is multi-query attention.
- 1 < g < h is the intermediate Grouped-Query Attention case introduced later by Ainslie et al.[^2]
This continuum is exactly the framing used by the GQA paper, which describes MQA as "Multi-query attention (MQA), which only uses a single key-value head" and presents GQA as "a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads."[^2]
The principal motivation for MQA is the size of the key/value cache used during autoregressive decoding. With standard multi-head attention, the per-layer KV cache for a batch has size proportional to b * h * n * k (for K) plus b * h * n * v (for V). MQA replaces the per-head copies with a single shared copy, reducing the cache to b * n * k + b * n * v, i.e. a factor of h smaller along the heads dimension.
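The cache-size arithmetic can be checked with a few lines of Python. The helper name `kv_cache_elements` and the batch size and sequence length below are illustrative assumptions; the head count and per-head dimensions match the h = 8, d_k = d_v = 128 configuration mentioned elsewhere in this article. The function counts cached scalars, not bytes.

```python
def kv_cache_elements(b, n, kv_heads, k, v, layers=1):
    """Number of cached scalars: keys plus values for every layer,
    sequence in the batch, position, and key/value head."""
    return layers * b * n * kv_heads * (k + v)

h = 8
# Multi-head attention: one K and one V per query head.
mha_cache = kv_cache_elements(b=32, n=1024, kv_heads=h, k=128, v=128)
# Multi-query attention: a single shared K and V.
mqa_cache = kv_cache_elements(b=32, n=1024, kv_heads=1, k=128, v=128)

reduction = mha_cache // mqa_cache  # equals h
print(mha_cache, mqa_cache, reduction)
```

Multiplying the element count by the bytes per element (for example 2 for a 16-bit format) gives the memory footprint under the same assumptions.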
Shazeer's complexity analysis quantifies the impact during incremental generation. For MQA, the ratio of memory access to arithmetic operations becomes Theta(1/d + n/(d*h) + 1/b), where the n/d term of the multi-head case has been reduced by a factor of h. The paper notes: "We have reduced the offensive n/d by a factor of h. Theoretically, given large batch size b, this should dramatically improve performance of incremental generation."[^1]
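To get a feel for the two asymptotic expressions, the sketch below evaluates their leading terms for a few batch sizes. The function names are hypothetical, the constants hidden by the Theta notation are ignored, and the chosen values of n, d and h are illustrative, so only the relative trend is meaningful.

```python
def mha_memory_to_compute(n, d, b):
    """Leading terms of the multi-head ratio: n/d + 1/b."""
    return n / d + 1 / b

def mqa_memory_to_compute(n, d, b, h):
    """Leading terms of the multi-query ratio: 1/d + n/(d*h) + 1/b."""
    return 1 / d + n / (d * h) + 1 / b

n, d, h = 1024, 1024, 8
for b in (1, 8, 128):
    mha = mha_memory_to_compute(n, d, b)
    mqa = mqa_memory_to_compute(n, d, b, h)
    print(f"b={b:4d}  multi-head={mha:.4f}  multi-query={mqa:.4f}")
```

At b = 1 the 1/b term dominates both expressions, which is why the quoted passage conditions the improvement on a large batch size.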
The empirical speed numbers from the original 2019 experiments are striking. On WMT 2014 English-German translation with a 6-layer, 211-million-parameter encoder/decoder Transformer (d_model = 1024, d_ff = 4096, h = 8, d_k = d_v = 128), Shazeer reports per-token training and decoding times on a TPUv2.[^1]
With greedy decoding, the decoder step is roughly twelve times faster in the MQA variant, while training time per token is essentially identical (13.0 vs. 13.2 microseconds).[^1] For beam-4 search the encoder/decoder times go from 2.0 plus 203 microseconds (multi-head) to 1.6 plus 32 microseconds (multi-query), a smaller but still roughly six-fold decoder speed-up.[^1]
Later analyses outside the original paper have framed the savings in terms of batch capacity rather than wall-clock time. Practitioners have noted that the KV cache shrinks by a factor equal to the number of attention heads, freeing memory for substantially larger inference batch sizes; one industry write-up reports batch-size increases on the order of 16x for Falcon-40B and around 70x for Falcon-7B when MQA replaces dense multi-head attention.[^3]
The headline claim of Shazeer (2019) is that MQA achieves these speed-ups while incurring "only minor quality degradation from the baseline."[^1] The detailed numbers support that framing but also show a real, if small, gap.
On WMT 2014 English-German translation, the paper reports:
- Multi-head baseline (h = 8, d_k = d_v = 128, d_ff = 4096): dev log-perplexity 1.424, dev BLEU 26.7, test BLEU 27.7 (greedy) / 28.4 (beam 4).
- Multi-query (h = 8, d_k = d_v = 128, d_ff = 5440, widened so total parameters match the baseline): dev log-perplexity 1.439, dev BLEU 26.5, test BLEU 27.5 (greedy) / 28.5 (beam 4).[^1]
The paper also evaluates 192-million-parameter decoder-only Transformers on the Billion-Word Language Modeling Benchmark.
In both settings the multi-query model is "slightly worse" on per-token perplexity but, as Shazeer points out, "significantly better than any of the alternatives involving decreasing h, d_k and d_v" by the same amount. In other words, sharing keys and values across heads is a much more graceful way to shrink the KV cache than simply using fewer or narrower heads.[^1]
Later work assessed this trade-off less optimistically once MQA was applied at large scale. The GQA paper of Ainslie et al. (2023) states bluntly in its introduction that "multi-query attention (MQA) can lead to quality degradation and training instability, and it may not be feasible to train separate models optimized for quality and inference."[^2] Appendix A of that paper notes specific instabilities, writing that "multi-query attention can lead to training instability during fine-tuning, in particular combined with long input tasks" and that pre-training with MQA "suffered from frequent loss spikes", whereas uptrained GQA models "appear to be stable".[^2]
The GQA paper also reports quantitative quality differences on uptrained T5-XXL across summarisation and question-answering benchmarks. On representative tasks the ROUGE-1 scores are: CNN/DailyMail 47.2 for the multi-head XXL baseline, 46.6 for the MQA variant and 47.1 for GQA with eight groups; arXiv summarisation 43.8 for multi-head, 43.0 for MQA and 43.5 for GQA-8. Average inference time per sample on the same setup is 1.51 seconds for multi-head, 0.24 seconds for MQA and 0.28 seconds for GQA-8.[^2] These numbers make MQA's central trade-off explicit: roughly a factor of six speed-up against multi-head attention, at the cost of a small but consistent quality regression that GQA is able to mostly close.
Although MQA was published in 2019, large-scale adoption in publicly described language models came in 2022 and 2023.
Google's 540-billion-parameter PaLM, described by Chowdhery et al. (2022), is one of the earliest large LLMs to use MQA. The GQA paper explicitly cites PaLM as a prior user, writing that "while some language models already use multi-query attention, such as PaLM ... many do not, including publicly available language models such as T5 ... and LLaMA."[^2] Secondary technical write-ups summarise PaLM's use of MQA as projecting keys and values to shape [1, h] while keeping queries at [k, h] (in PaLM's own notation, where k is the number of attention heads and h is the per-head size, the reverse of the letters used elsewhere in this article). The stated rationale is that this has a neutral effect on model quality and training speed but yields "a significant cost savings at autoregressive decoding time", because standard multi-head attention has poor efficiency on accelerator hardware during autoregressive decoding.[^4]
The Falcon family of open-weight LLMs (7B, 40B and later 180B) released by the UAE's Technology Innovation Institute in 2023 was explicitly designed for efficient inference and used MQA as one of its architectural choices. The official Falcon-40B model card on Hugging Face lists the attention mechanism as "multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022)", and adds that "for multiquery, we are using an internal variant which uses independent key and values per tensor parallel degree."[^5] Falcon's design pairs MQA with [[flash_attention]] to push inference latency and memory usage down at large model scale.
The StarCoder code LLM from the BigCode project, described by Li et al. (2023), is another high-profile MQA user. The paper abstract describes "15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention."[^6] The Hugging Face model card states that StarCoder is "a GPT-2 model with multi-query attention and Fill-in-the-Middle objective" with a context window of 8192 tokens, trained on roughly one trillion tokens in bfloat16.[^7] Here MQA's role is, in part, to make the model's relatively long 8K context cheap to serve at high batch sizes for code-completion workloads.
Across these three deployments the common theme is that MQA was selected specifically for inference-time efficiency in regimes where the KV cache, not raw FLOPs, dominates serving cost.
Multi-query attention sits at one end of a spectrum that runs from one shared key/value head (MQA) to one per query head (standard multi-head attention). Ainslie et al. (2023) introduced Grouped-Query Attention to occupy the middle of this spectrum. In GQA, query heads are partitioned into G groups (the quantity written g earlier in this article), with each group sharing a single key and value head; so G = 1 recovers MQA and G = h recovers multi-head attention.[^2]
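The grouping can be sketched in NumPy as follows. This is an illustrative sketch rather than the GQA paper's implementation: the function name is hypothetical, there is no batch dimension or mask, consecutive query heads are assumed to form a group, and the number of key/value heads g (G in the paper's notation) is read from the leading axis of P_k.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(X, M, P_q, P_k, P_v, P_o):
    """X: [n, d] query inputs, M: [m, d] memory.
    P_q: [h, d, k], P_o: [h, d, v]   (h query heads)
    P_k: [g, d, k], P_v: [g, d, v]   (g key/value heads, g divides h)."""
    h, _, k = P_q.shape
    g = P_k.shape[0]
    if h % g != 0:
        raise ValueError("number of query heads must be divisible by g")
    r = h // g                      # query heads per key/value head
    n = X.shape[0]

    # Query head i belongs to group i // r (consecutive heads share K and V).
    Q = np.einsum("nd,hdk->hnk", X, P_q).reshape(g, r, n, k)
    K = np.einsum("md,gdk->gmk", M, P_k)
    V = np.einsum("md,gdv->gmv", M, P_v)

    logits = np.einsum("grnk,gmk->grnm", Q, K) / np.sqrt(k)
    weights = softmax(logits, axis=-1)
    O = np.einsum("grnm,gmv->grnv", weights, V).reshape(h, n, -1)
    return np.einsum("hnv,hdv->nd", O, P_o)
```

Passing P_k and P_v with a leading axis of 1 gives multi-query behaviour, and a leading axis of h gives ordinary multi-head attention.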
The GQA paper makes two main contributions. First, it proposes an "uptraining" recipe: an existing multi-head checkpoint can be converted to MQA or GQA by mean-pooling existing key and value heads within each group and then continuing pre-training for a small fraction (around 5%) of the original compute. Second, it shows that GQA with a moderate number of groups (eight, in their main experiments on T5-XXL) "achieves quality close to multi-head attention with comparable speed to MQA."[^2] As noted above, the ROUGE and inference-time numbers bear this out across CNN/DailyMail, arXiv, PubMed, MediaSum, MultiNews, WMT and question-answering tasks.[^2]
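The checkpoint-conversion step of the uptraining recipe can be sketched as a mean over each group of heads. The helper name `mean_pool_kv_heads`, the [h, d, k] weight layout and the assumption that consecutive heads form a group are illustrative choices; the subsequent continued pre-training is not shown.

```python
import numpy as np

def mean_pool_kv_heads(P, g):
    """Average a per-head projection tensor into g shared heads.

    P: [h, d, k] key (or value) projection of a multi-head checkpoint.
    Returns a [g, d, k] tensor in which each head is the mean of
    h // g consecutive original heads.
    """
    h = P.shape[0]
    if h % g != 0:
        raise ValueError("h must be divisible by g")
    return P.reshape(g, h // g, *P.shape[1:]).mean(axis=1)

# Illustrative use: 8 original key heads pooled to 1 (MQA) or 4 (GQA).
P_k = np.arange(8 * 2 * 3, dtype=float).reshape(8, 2, 3)
P_k_mqa = mean_pool_kv_heads(P_k, 1)   # shape [1, 2, 3]
P_k_gqa = mean_pool_kv_heads(P_k, 4)   # shape [4, 2, 3]
```

With g equal to the original head count the function returns the projections unchanged, which corresponds to leaving the checkpoint as multi-head attention.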
The practical implication is that GQA dominates MQA in most settings where slightly lower quality is undesirable, since it retains most of MQA's inference speed-up at a much smaller quality cost, although its KV cache is G times larger than MQA's. As the GQA paper puts it, "GQA achieves significant additional quality gains, achieving performance close to MHA-XXL with speed close to MQA."[^2]
A pleasing side benefit is that GQA aligns naturally with tensor-parallel inference: setting G equal to the tensor-parallel degree puts exactly one KV head per GPU and avoids cross-device duplication of the KV cache.
By the mid-2020s, MQA in its strict g = 1 form had been largely supplanted by GQA in new frontier-class LLMs, though it remains common in code, infilling and other workloads where aggressive inference throughput justifies the quality cost. Meta's Llama 2, for instance, adopts grouped-query attention rather than pure MQA for its larger 34B and 70B variants, typically with an 8-group configuration in which num_key_value_heads equals the tensor-parallel degree. In implementations that expose these settings, setting num_key_value_heads to 1 recovers MQA and setting it to num_attention_heads recovers multi-head attention.[^8]
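The relationship between the two settings can be written as a small helper. The function `attention_kind` is hypothetical and not part of any library; only the parameter names num_attention_heads and num_key_value_heads come from the text above, and the head count of 64 in the examples is an illustrative value.

```python
def attention_kind(num_attention_heads, num_key_value_heads):
    """Classify an attention configuration by its number of key/value heads."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly among key/value heads")
    if num_key_value_heads == num_attention_heads:
        return "multi-head"
    if num_key_value_heads == 1:
        return "multi-query"
    return "grouped-query"

print(attention_kind(64, 64))  # one key/value head per query head
print(attention_kind(64, 8))   # 8 groups of 8 query heads
print(attention_kind(64, 1))   # a single shared key/value head
```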
Several practical lessons from the MQA experience continue to influence current attention research:
- The n/d + 1/b ratio remains a useful mental model when reasoning about whether a decoder is compute-bound or memory-bound, and many subsequent attention variants (GQA, multi-head latent attention, paged attention, and others) target the same bottleneck from different angles.[^1]
MQA itself, then, is best understood as the first practical realisation of the broader idea that the key/value side of attention can be made far skinnier than the query side without breaking the model. That idea, more than the specific g = 1 choice, is its lasting contribution.