# Multi-Query Attention (MQA)

> Source: https://aiwiki.ai/wiki/mqa
> Updated: 2026-06-07
> Categories: Model Architecture, Transformer Models
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Multi-Query Attention (MQA)** is a variant of the multi-head attention mechanism used in [transformer](/wiki/transformer) neural networks in which all query heads share a single key head and a single value head, rather than each query head having its own dedicated key and value projections. The technique was introduced by [Noam Shazeer](/wiki/noam_shazeer) in 2019 to address the memory-bandwidth bottleneck of autoregressive Transformer decoding. It produces a much smaller key/value cache during inference, by a factor equal to the number of heads, while typically incurring only a small loss in model quality.[^1]

MQA was widely adopted in large language models released between 2022 and 2023, including PaLM, Falcon, and StarCoder, where its ability to shrink the KV cache enabled larger batch sizes and faster generation. Since 2023, however, MQA has been largely superseded in new models by Grouped-Query Attention (GQA), a generalisation introduced by Ainslie et al. that recovers most of MQA's speed gains while staying closer in quality to standard multi-head [attention](/wiki/attention).[^2]

## Background: multi-head attention and the KV cache bottleneck

In the original Transformer of Vaswani et al. (2017), a multi-head [attention](/wiki/attention) layer with `h` heads computes, for each head, a separate linear projection of the input into query, key and value vectors, then performs scaled dot-product [attention](/wiki/attention) independently per head. Concretely, the layer holds learned projection tensors `P_q` and `P_k` of shape `[h, d, k]` and `P_v` of shape `[h, d, v]`, with one `[d, k]` or `[d, v]` slice per head, so that a Transformer layer maintains `h` distinct sets of keys and values.[^1]

This design is fast to train, because all positions in a sequence can be processed in parallel. But Shazeer (2019) observed that the same design becomes a serious bottleneck during **incremental inference**, in which an autoregressive decoder generates one token at a time. At each decoding step, the model must reload from memory the cached keys and values of every previously generated position, for every layer and every head. This collection of cached tensors is generally called the [KV cache](/wiki/kv_cache). Shazeer's performance analysis showed that, for a baseline Transformer decoder, the ratio of memory access to arithmetic operations during incremental decoding scales as `Theta(n/d + 1/b)`, where `n` is the current sequence length, `d` is the model dimension and `b` is the batch size. When `n` approaches `d` or the batch size is close to one, this ratio approaches one, and modern GPUs and TPUs stall waiting for memory rather than performing useful computation.[^1]
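The incremental step described above can be illustrated with a minimal NumPy sketch. It is not the paper's TensorFlow code: the function name `mha_decode_step`, the toy sizes, the single-sequence (unbatched) setting and the explicit `1/sqrt(k)` scaling are all assumptions made for illustration. The point is only that the cache carries a heads axis and grows by one position per generated token.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mha_decode_step(x, prev_K, prev_V, P_q, P_k, P_v, P_o):
    """One incremental decoding step of multi-head self-attention (illustrative).

    x:      [d]        input vector for the newly generated position
    prev_K: [h, n, k]  cached keys for the n earlier positions
    prev_V: [h, n, v]  cached values for the n earlier positions
    """
    q = np.einsum("d,hdk->hk", x, P_q)
    new_k = np.einsum("d,hdk->hk", x, P_k)
    new_v = np.einsum("d,hdv->hv", x, P_v)
    # Append the new position to every head's cache.
    K = np.concatenate([prev_K, new_k[:, None, :]], axis=1)
    V = np.concatenate([prev_V, new_v[:, None, :]], axis=1)
    # The whole cache (all heads, all positions) is read to produce one token.
    logits = np.einsum("hk,hnk->hn", q, K) / np.sqrt(q.shape[-1])
    weights = softmax(logits, axis=-1)
    o = np.einsum("hn,hnv->hv", weights, V)
    y = np.einsum("hv,hdv->d", o, P_o)
    return y, K, V

rng = np.random.default_rng(0)
h, d, k, v = 4, 16, 8, 8  # toy sizes, not taken from the paper
P_q, P_k = rng.normal(size=(h, d, k)), rng.normal(size=(h, d, k))
P_v, P_o = rng.normal(size=(h, d, v)), rng.normal(size=(h, d, v))

K_cache = np.zeros((h, 0, k))  # empty cache before the first token
V_cache = np.zeros((h, 0, v))
for _ in range(3):
    x = rng.normal(size=d)
    y, K_cache, V_cache = mha_decode_step(x, K_cache, V_cache, P_q, P_k, P_v, P_o)

print(K_cache.shape, V_cache.shape)  # (4, 3, 8) (4, 3, 8)
```

Each call does only a small amount of arithmetic for the new token but touches every cached element, which is the memory-bound behaviour the analysis above describes.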

The two terms in this ratio suggest two different remedies. The `1/b` term can be reduced by increasing the batch size, memory permitting. The `n/d` term is harder, because the cached `K` and `V` tensors have aggregate size proportional to `b * h * n * k` (where `k` is the per-head key dimension) and must be loaded in full at each decoding step. Earlier work attacked this by limiting the sequence length, attending only to a local window, or otherwise reducing the number of positions attended to.[^1] MQA takes a complementary path: rather than shrinking the sequence axis, it shrinks the heads axis of `K` and `V`.

## The MQA mechanism: one set of keys and values for all query heads

MQA modifies multi-head attention by removing the heads dimension `h` from the key and value projection tensors while keeping it on the queries and on the output projection. In the original paper Shazeer writes that "Multi-query attention is identical except that the different heads share a single set of keys and values."[^1]

In terms of tensor shapes:

- Standard multi-head attention uses `P_q` and `P_k` of shape `[h, d, k]`, and `P_v` and `P_o` of shape `[h, d, v]`.
- MQA uses `P_q` of shape `[h, d, k]` and `P_o` of shape `[h, d, v]`, but reduces `P_k` to shape `[d, k]` and `P_v` to shape `[d, v]`, so there is exactly one key projection and one value projection shared across all `h` query heads.[^1]

Equivalently, all `h` query heads attend to the **same** key matrix `K` of shape `[m, k]` (or `[b, m, k]` for a batch of sequences) and the **same** value matrix `V` of shape `[m, v]`, rather than `h` distinct copies. The TensorFlow `einsum` code shown in the paper differs from the multi-head version only in that the letter `h` is dropped from the equations wherever it indexed the heads dimension of `K`, `V`, `P_k` or `P_v`.[^1]
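The "drop the letter `h`" description can be made concrete with a NumPy transcription of the einsum pattern. This is a sketch under assumptions rather than the paper's code: it handles one query vector `x` attending over a memory `M` of `m` positions, uses NumPy instead of TensorFlow, and applies an explicit `1/sqrt(k)` scaling; the function names and toy sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(x, M, P_q, P_k, P_v, P_o):
    """x: [d], M: [m, d], P_q and P_k: [h, d, k], P_v and P_o: [h, d, v]."""
    q = np.einsum("d,hdk->hk", x, P_q)
    K = np.einsum("md,hdk->hmk", M, P_k)   # one key matrix per head
    V = np.einsum("md,hdv->hmv", M, P_v)   # one value matrix per head
    logits = np.einsum("hk,hmk->hm", q, K) / np.sqrt(q.shape[-1])
    weights = softmax(logits, axis=-1)
    o = np.einsum("hm,hmv->hv", weights, V)
    return np.einsum("hv,hdv->d", o, P_o)

def multiquery_attention(x, M, P_q, P_k, P_v, P_o):
    """Same as above, but P_k: [d, k] and P_v: [d, v] have no heads axis."""
    q = np.einsum("d,hdk->hk", x, P_q)
    K = np.einsum("md,dk->mk", M, P_k)     # single shared key matrix
    V = np.einsum("md,dv->mv", M, P_v)     # single shared value matrix
    logits = np.einsum("hk,mk->hm", q, K) / np.sqrt(q.shape[-1])
    weights = softmax(logits, axis=-1)
    o = np.einsum("hm,mv->hv", weights, V)
    return np.einsum("hv,hdv->d", o, P_o)

rng = np.random.default_rng(0)
h, d, k, v, m = 4, 16, 8, 8, 5  # toy sizes
x, M = rng.normal(size=d), rng.normal(size=(m, d))
P_q, P_o = rng.normal(size=(h, d, k)), rng.normal(size=(h, d, v))
P_k, P_v = rng.normal(size=(d, k)), rng.normal(size=(d, v))

y_mqa = multiquery_attention(x, M, P_q, P_k, P_v, P_o)
# Multi-head attention whose h key/value projections are all identical
# computes the same function, while storing h copies of K and V.
y_mha = multihead_attention(
    x, M, P_q,
    np.broadcast_to(P_k, (h, d, k)),
    np.broadcast_to(P_v, (h, d, v)),
    P_o,
)
print(np.allclose(y_mqa, y_mha))
```

The comparison at the end is a sanity check on the sketch: MQA is the special case of multi-head attention in which every head uses the same key and value projection, so only one copy needs to be cached.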

A useful way to view MQA is as an extreme point in a more general design space. If we let `g` denote the number of key/value heads, then:

- `g = h` recovers standard multi-head attention.
- `g = 1` is multi-query attention.
- `1 < g < h` is the intermediate Grouped-Query Attention case introduced later by Ainslie et al.[^2]

This continuum is exactly the framing used by the GQA paper, which describes MQA as "Multi-query attention (MQA), which only uses a single key-value head" and presents GQA as "a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads."[^2]
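As a small illustration of this design space, the sketch below maps each query head to the key/value head it reads from. The helper name `kv_head_for_query_head`, the contiguous grouping and the requirement that `g` divide `h` are assumptions chosen for simplicity; real implementations may lay heads out differently.

```python
def kv_head_for_query_head(q_head: int, h: int, g: int) -> int:
    """Index of the key/value head used by query head `q_head`.

    h: number of query heads, g: number of key/value heads.
    Assumes contiguous groups of h // g query heads per key/value head.
    """
    if not 1 <= g <= h or h % g != 0:
        raise ValueError("g must be between 1 and h and divide h")
    return q_head // (h // g)

h = 8
mha = [kv_head_for_query_head(i, h, g=8) for i in range(h)]
gqa = [kv_head_for_query_head(i, h, g=2) for i in range(h)]
mqa = [kv_head_for_query_head(i, h, g=1) for i in range(h)]

print(mha)  # [0, 1, 2, 3, 4, 5, 6, 7]  one KV head per query head
print(gqa)  # [0, 0, 0, 0, 1, 1, 1, 1]  two groups of four
print(mqa)  # [0, 0, 0, 0, 0, 0, 0, 0]  every query head shares KV head 0
```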

## Memory and compute savings

The principal motivation for MQA is the size of the key/value cache used during autoregressive decoding. With standard multi-head attention, the per-layer KV cache for a batch has size proportional to `b * h * n * k` (for `K`) plus `b * h * n * v` (for `V`). MQA replaces the per-head copies with a single shared copy, reducing the cache to `b * n * k + b * n * v`, i.e. a factor of `h` smaller along the heads dimension.
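The cache-size formula can be checked with a few lines of arithmetic. In this sketch the helper `kv_cache_bytes`, the batch size, the sequence length and the 2-byte element size are hypothetical values chosen for illustration; only `h = 8`, `k = v = 128` and the 6-layer depth echo the translation model discussed in this article.

```python
def kv_cache_bytes(b, n, kv_heads, k, v, layers, bytes_per_elem=2):
    """Total KV-cache size: keys (b*kv_heads*n*k) plus values (b*kv_heads*n*v) per layer."""
    return layers * b * n * kv_heads * (k + v) * bytes_per_elem

b, n = 32, 1024              # hypothetical serving batch and sequence length
h, k, v, layers = 8, 128, 128, 6

mha_bytes = kv_cache_bytes(b, n, kv_heads=h, k=k, v=v, layers=layers)
mqa_bytes = kv_cache_bytes(b, n, kv_heads=1, k=k, v=v, layers=layers)

print(f"multi-head:  {mha_bytes / 2**20:.0f} MiB")
print(f"multi-query: {mqa_bytes / 2**20:.0f} MiB")
print(f"reduction:   {mha_bytes // mqa_bytes}x")
```

Whatever values are chosen for `b` and `n`, the ratio between the two results is exactly `h`, since the heads axis is the only factor that changes.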

Shazeer's complexity analysis quantifies the impact during incremental generation. For MQA, the ratio of memory access to arithmetic operations becomes `Theta(1/d + n/(d*h) + 1/b)`, where the `n/d` term of the multi-head case has been reduced by a factor of `h`. The paper notes: "We have reduced the offensive n/d by a factor of h. Theoretically, given large batch size b, this should dramatically improve performance of incremental generation."[^1]
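To get a feel for the two expressions, the sketch below evaluates their leading terms for a few settings. The `Theta` notation hides constant factors, so these numbers indicate trends rather than measured ratios; the function names and the example values of `n`, `d`, `h` and `b` are illustrative assumptions.

```python
def mha_memory_to_compute(n, d, b):
    """Leading terms of Theta(n/d + 1/b) for multi-head incremental decoding."""
    return n / d + 1 / b

def mqa_memory_to_compute(n, d, h, b):
    """Leading terms of Theta(1/d + n/(d*h) + 1/b) for multi-query decoding."""
    return 1 / d + n / (d * h) + 1 / b

d, h = 1024, 8
for n, b in [(1024, 1), (1024, 128), (128, 128)]:
    mha = mha_memory_to_compute(n, d, b)
    mqa = mqa_memory_to_compute(n, d, h, b)
    print(f"n={n:5d} b={b:4d}  multi-head={mha:.3f}  multi-query={mqa:.3f}")
```

With `b = 1` both ratios stay near or above one, because the `1/b` term dominates regardless of the attention variant. This is why the quoted sentence is conditioned on a large batch size: MQA removes the sequence-length term as the bottleneck, after which batching can amortise the remaining weight loads.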

The empirical speed numbers from the original 2019 experiments are striking. On WMT 2014 English-German translation with a 6-layer, 211-million-parameter encoder/decoder Transformer (`d_model = 1024`, `d_ff = 4096`, `h = 8`, `d_k = d_v = 128`), Shazeer reports per-token decoding times on a TPUv2:

- Baseline multi-head: encoder 1.7 microseconds/token, decoder 46 microseconds/token.
- Multi-query: encoder 1.5 microseconds/token, decoder 3.8 microseconds/token.

The greedy decoder step is therefore roughly **twelve times faster** in the MQA variant under those conditions, while training time per token is essentially identical (13.0 vs. 13.2 microseconds).[^1] For beam-4 search the encoder/decoder times go from 2.0 plus 203 microseconds (multi-head) to 1.6 plus 32 microseconds (multi-query), a smaller but still roughly six-fold decoder speed-up.[^1]

Later analyses outside the original paper have framed the savings in terms of batch capacity rather than wall-clock time. Practitioners have noted that the KV cache shrinks by a factor equal to the number of attention heads, freeing memory for substantially larger inference batch sizes; one industry write-up reports batch-size increases on the order of 16x for Falcon-40B and around 70x for Falcon-7B when MQA replaces dense multi-head attention.[^3]

## Quality trade-offs

The headline claim of Shazeer (2019) is that MQA achieves these speed-ups while incurring "only minor quality degradation from the baseline."[^1] The detailed numbers support that framing but also show a real, if small, gap.

On WMT 2014 English-German translation, the paper reports:

- Multi-head (`h = 8`, `d_k = d_v = 128`, `d_ff = 4096`): dev log-perplexity 1.424, dev BLEU 26.7, test BLEU 27.7 (greedy) / 28.4 (beam 4).
- Multi-query (`h = 8`, `d_k = d_v = 128`, `d_ff = 5440`, widened so total parameters match the baseline): dev log-perplexity 1.439, dev BLEU 26.5, test BLEU 27.5 (greedy) / **28.5** (beam 4).[^1]

On the Billion-Word Language Modeling Benchmark with 192-million-parameter decoder-only Transformers:

- Multi-head: dev perplexity 29.9.
- Multi-query: dev perplexity 30.2.[^1]

In both settings the multi-query model is "slightly worse" on per-token perplexity but, as Shazeer points out, "significantly better than any of the alternatives involving decreasing `h`, `d_k` and `d_v`" by the same amount. In other words, sharing keys and values across heads is a much more graceful way to shrink the KV cache than simply using fewer or narrower heads.[^1]

Later work assessed this trade-off less optimistically once MQA was applied at large scale. The GQA paper of Ainslie et al. (2023) states bluntly in its introduction that "multi-query attention (MQA) can lead to quality degradation and training instability, and it may not be feasible to train separate models optimized for quality and inference."[^2] Appendix A of that paper notes specific instabilities, writing that "multi-query attention can lead to training instability during fine-tuning, in particular combined with long input tasks" and that pre-training with MQA "suffered from frequent loss spikes", whereas uptrained GQA models "appear to be stable".[^2]

The GQA paper also reports quantitative quality differences on uptrained T5-XXL across summarisation, translation and question-answering benchmarks. Averaged over all reported tasks, the scores are 47.2 for the multi-head XXL baseline, 46.6 for the MQA variant and 47.1 for GQA with eight groups; on CNN/DailyMail summarisation the ROUGE-1 scores are 43.8 for multi-head, 43.0 for MQA and 43.5 for GQA-8. Average inference time per sample on the same setup is 1.51 seconds for multi-head, 0.24 seconds for MQA and 0.28 seconds for GQA-8.[^2] These numbers make MQA's central trade-off explicit: roughly a factor of six speed-up against multi-head attention, at the cost of a small but consistent quality regression that GQA is able to mostly close.

## Models that adopted MQA

Although MQA was published in 2019, large-scale adoption in publicly described language models came in 2022 and 2023.

### PaLM (Google, 2022)

Google's 540-billion-parameter PaLM, described by Chowdhery et al. (2022), is one of the earliest large language models to use MQA. The GQA paper explicitly cites PaLM as a prior user, writing that "while some language models already use multi-query attention, such as PaLM ... many do not, including publicly available language models such as T5 ... and LLaMA."[^2] Secondary technical write-ups summarise PaLM's use of MQA as projecting keys and values to shape `[1, h]` while keeping queries at `[k, h]`. Note that in PaLM's notation, unlike the rest of this article, `k` is the number of heads and `h` is the per-head dimension. The stated rationale is that this has a neutral effect on model quality and training speed but yields "a significant cost savings at autoregressive decoding time", because standard multi-head attention has poor efficiency on accelerator hardware during autoregressive decoding.[^4]

### Falcon (TII, 2023)

The Falcon family of open-weight LLMs (7B, 40B and later 180B) released by the UAE's Technology Innovation Institute in 2023 was explicitly designed for efficient inference and used MQA as one of its architectural choices. The official Falcon-40B model card on Hugging Face lists the attention mechanism as "multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022)", and adds that "for multiquery, we are using an internal variant which uses independent key and values per tensor parallel degree."[^5] Falcon's design pairs MQA with [FlashAttention](/wiki/flash_attention) to push inference latency and memory usage down at large model scale.

### StarCoder (BigCode, 2023)

The StarCoder code LLM from the BigCode project, described by Li et al. (2023), is another high-profile MQA user. The paper abstract describes "15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention."[^6] The Hugging Face model card states that StarCoder is "a GPT-2 model with multi-query attention and Fill-in-the-Middle objective" with a context window of 8192 tokens, trained on roughly one trillion tokens in bfloat16.[^7] Here MQA's role is, in part, to make the model's relatively long 8K context cheap to serve at high batch sizes for code-completion workloads.

Across these three deployments the common theme is that MQA was selected specifically for **inference-time** efficiency in regimes where the KV cache, not raw FLOPs, dominates serving cost.

## Successor: Grouped-Query Attention (GQA)

Multi-query attention sits at one end of a spectrum that runs from one shared key/value head (MQA) to one per query head (standard multi-head attention). Ainslie et al. (2023) introduced Grouped-Query Attention to occupy the middle of this spectrum. In GQA, query heads are partitioned into `G` groups, with each group sharing a single key and value head; so `G = 1` recovers MQA and `G = h` recovers multi-head attention.[^2]

The GQA paper makes two main contributions. First, it proposes an "uptraining" recipe: an existing multi-head checkpoint can be converted to MQA or GQA by mean-pooling existing key and value heads within each group and then continuing pre-training for a small fraction (around 5%) of the original compute. Second, it shows that GQA with a moderate number of groups (eight, in their main experiments on T5-XXL) "achieves quality close to multi-head attention with comparable speed to MQA."[^2] As noted above, the ROUGE and inference-time numbers bear this out across CNN/DailyMail, arXiv, PubMed, MediaSum, MultiNews, WMT and question-answering tasks.[^2]
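The checkpoint-conversion half of the uptraining recipe can be sketched in NumPy as follows. The function name `mean_pool_kv_heads`, the `[h, d, k]` weight layout (borrowed from the notation used earlier in this article) and the contiguous grouping are assumptions; the subsequent continued pre-training step is not shown.

```python
import numpy as np

def mean_pool_kv_heads(P, g):
    """Convert a per-head projection of shape [h, d, k] into g pooled heads.

    Heads are split into g contiguous groups of h // g, and each group is
    replaced by the mean of its members, giving shape [g, d, k].
    """
    h = P.shape[0]
    if not 1 <= g <= h or h % g != 0:
        raise ValueError("g must be between 1 and h and divide h")
    grouped = P.reshape(g, h // g, *P.shape[1:])
    return grouped.mean(axis=1)

rng = np.random.default_rng(0)
h, d, k = 8, 16, 4                        # toy sizes
P_k = rng.normal(size=(h, d, k))          # stand-in for a multi-head key projection

P_k_gqa = mean_pool_kv_heads(P_k, g=2)    # GQA with two groups
P_k_mqa = mean_pool_kv_heads(P_k, g=1)    # MQA: a single pooled head

print(P_k_gqa.shape, P_k_mqa.shape)       # (2, 16, 4) (1, 16, 4)
```

Applying the same pooling to the value projection yields the converted model's starting weights. With `g = h` the function returns the original projection unchanged, which matches the view of multi-head attention as one end of the spectrum.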

The practical implication is that GQA dominates MQA in most settings where slightly lower quality is undesirable, since it preserves most of MQA's inference speed-up and much of its KV-cache reduction at a much smaller quality cost. As the GQA authors put it, "GQA achieves significant additional quality gains, achieving performance close to MHA-XXL with speed close to MQA."[^2]

A pleasing side benefit is that GQA aligns naturally with tensor-parallel inference: setting `G` equal to the tensor-parallel degree puts exactly one KV head per GPU and avoids cross-device duplication of the KV cache.
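A simplified model of this placement argument is sketched below. It assumes query heads are sharded contiguously across devices and that each device must hold every key/value head its query heads read from; the helper `kv_heads_per_device` and these sharding rules are illustrative assumptions, not a description of any particular serving framework.

```python
def kv_heads_per_device(g: int, tp: int) -> list:
    """For each of `tp` devices, list the key/value head indices it must store.

    g:  number of key/value heads
    tp: tensor-parallel degree (number of devices)
    """
    if g >= tp:
        if g % tp != 0:
            raise ValueError("tp must divide g")
        per_device = g // tp
        return [list(range(dev * per_device, (dev + 1) * per_device))
                for dev in range(tp)]
    # Fewer KV heads than devices: each head is replicated on tp // g devices.
    if tp % g != 0:
        raise ValueError("g must divide tp")
    replicas = tp // g
    return [[dev // replicas] for dev in range(tp)]

tp = 8
print(kv_heads_per_device(g=8, tp=tp))  # one distinct KV head per device
print(kv_heads_per_device(g=1, tp=tp))  # MQA: head 0 is duplicated on all 8 devices
```

Under these assumptions, pure MQA on eight devices stores eight identical copies of its single key/value head, so the per-device cache is no smaller than with `G = 8`, while the latter gives each device its own distinct head.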

## Current status

By the mid-2020s, MQA in its strict `g = 1` form has been largely supplanted by GQA in new frontier-class LLMs, though it remains common in code, infilling and other workloads where aggressive inference throughput justifies the quality cost. Meta's Llama 2, for instance, adopts grouped-query attention rather than pure MQA for its larger 34B and 70B variants, typically with an 8-group configuration, a count that also matches a common 8-way tensor-parallel degree. In the Hugging Face implementation the choice is exposed as the `num_key_value_heads` setting: a value of 1 recovers MQA and a value equal to `num_attention_heads` recovers multi-head attention.[^8]

Several practical lessons from the MQA experience continue to influence current attention research:

- **KV-cache size, not FLOPs, is often the binding inference constraint.** The Shazeer (2019) analysis of the `n/d + 1/b` ratio remains a useful mental model when reasoning about whether a decoder is compute-bound or memory-bound, and many subsequent attention variants (GQA, multi-head latent attention, paged attention, and others) target the same bottleneck from different angles.[^1]
- **Aggressive sharing is risky to combine with long context and fine-tuning.** The training instabilities reported by the GQA authors, including loss spikes during pre-training and instability during long-input fine-tuning, are part of why few large-scale 2024-era models continue to use pure MQA.[^2]
- **Uptraining is cheap.** The ability to convert an existing multi-head checkpoint into an MQA or GQA model using about 5% of the original compute means that the choice between these attention variants need not be locked in at the start of training, an insight first quantified in the GQA paper but rooted in MQA's parameter compatibility with multi-head attention.[^2]

MQA itself, then, is best understood as the first practical realisation of the broader idea that the key/value side of attention can be made far skinnier than the query side without breaking the model. That idea, more than the specific `g = 1` choice, is its lasting contribution.

## See also

- [Transformer](/wiki/transformer)
- [Attention](/wiki/attention)
- [KV cache](/wiki/kv_cache)
- [Grouped-Query Attention](/wiki/grouped_query_attention)
- [FlashAttention](/wiki/flash_attention)
- [PaLM](/wiki/palm)
- [Falcon](/wiki/falcon)
- [StarCoder](/wiki/starcoder)
- [Noam Shazeer](/wiki/noam_shazeer)

## References

[^1]: Shazeer, Noam. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150, 6 November 2019. https://arxiv.org/abs/1911.02150 (Accessed 2026-05-19). Full text including Tables 1, 2 and 3 with WMT 2014 EN-DE and Billion-Word benchmark results: https://arxiv.org/pdf/1911.02150v1 (Accessed 2026-05-19).

[^2]: Ainslie, Joshua; Lee-Thorp, James; de Jong, Michiel; Zemlyanskiy, Yury; Lebrón, Federico; and Sanghai, Sumit. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv:2305.13245, May 2023. https://arxiv.org/abs/2305.13245 (Accessed 2026-05-19). HTML version with full tables and appendix discussion of training instability: https://arxiv.org/html/2305.13245v3 (Accessed 2026-05-19).

[^3]: Fireworks AI. "Multi-Query Attention is All You Need." Fireworks AI blog, 2023. https://fireworks.ai/blog/multi-query-attention-is-all-you-need (Accessed 2026-05-19).

[^4]: Wolfe, Cameron R. "PaLM: Efficiently Training Massive Language Models." Deep (Learning) Focus, Substack. https://cameronrwolfe.substack.com/p/palm-efficiently-training-massive (Accessed 2026-05-19).

[^5]: Technology Innovation Institute. "tiiuae/falcon-40b" model card. Hugging Face. https://huggingface.co/tiiuae/falcon-40b (Accessed 2026-05-19).

[^6]: Li, Raymond; et al. "StarCoder: may the source be with you!" arXiv:2305.06161, May 2023. https://arxiv.org/abs/2305.06161 (Accessed 2026-05-19).

[^7]: BigCode. "bigcode/starcoder" model card. Hugging Face. https://huggingface.co/bigcode/starcoder (Accessed 2026-05-19).

[^8]: Hugging Face Transformers documentation. "Llama 2." https://huggingface.co/docs/transformers/en/model_doc/llama2 (Accessed 2026-05-19).

