Multi-Query Attention (MQA)

Updated Jul 7, 2026

Multi-Query Attention (MQA) is a variant of the multi-head attention mechanism used in transformer neural networks in which all query heads share a single key head and a single value head, rather than each query head having its own dedicated key and value projections. Noam Shazeer introduced it in 2019 to address the memory-bandwidth bottleneck of autoregressive Transformer decoding: MQA shrinks the KV cache that must be read at each decoding step by a factor equal to the number of attention heads. In Shazeer's original experiments this cut the decoder step from 46 to 3.8 microseconds per token (roughly twelve times faster) while incurring what the paper calls "only minor quality degradation from the baseline."^[1]

MQA was widely adopted in large language models released between 2022 and 2023, including PaLM, Falcon, and StarCoder, where its ability to shrink the KV cache enabled larger batch sizes and faster generation. Since 2023, however, MQA has been largely superseded in new models by Grouped-Query Attention (GQA), a generalisation introduced by Ainslie et al. that recovers most of MQA's speed gains while staying closer in quality to standard multi-head attention, and by Multi-head Latent Attention (MLA), which DeepSeek introduced in 2024 to compress the keys and values into a shared low-rank latent vector.^[2]^[9]

How do MHA, MQA, GQA, and MLA compare?

Multi-query attention is one point on a spectrum of attention designs that trade key/value sharing against model quality. The table below summarises the four main variants used in modern large language models, from the original dense multi-head attention (MHA) through the intermediate GQA to the latent compression of MLA. In every case except MHA the goal is the same: shrink the per-token KV cache that dominates the cost of autoregressive decoding, while giving up as little quality as possible.^[1]^[2]^[9]

Attention variant	Key/value heads for `h` query heads	KV cache size	Quality vs MHA	Example models
Multi-head attention (MHA)	`h` (one per query head)	baseline, largest	baseline	Original Transformer (2017), T5, LLaMA 1, Gemma 7B
Multi-query attention (MQA)	1 shared head	`1/h` of MHA	slightly lower; can be unstable at scale	PaLM, Falcon, StarCoder, Gemma 2B
Grouped-query attention (GQA)	`g` groups, `1 < g < h`	`g/h` of MHA (e.g. 1/8 with 8 groups and 64 query heads)	close to MHA	Llama 2 70B, Llama 3, Mistral 7B
Multi-head latent attention (MLA)	compressed shared low-rank latent	far smaller than MHA; a 93.3% cut vs DeepSeek's GQA-based 67B model	reported better than MHA	DeepSeek-V2, DeepSeek-V3

What problem does MQA solve? The KV cache bottleneck

In the original Transformer of Vaswani et al. (2017), a multi-head attention layer with h heads computes, for each head, a separate linear projection of the input into query, key and value vectors, then performs scaled dot-product attention independently per head. Concretely, in Shazeer's notation the layer holds learned projection tensors P_q and P_k of shape [h, d, k] and P_v of shape [h, d, v], where d is the model dimension and k and v are the per-head key and value dimensions. Each head owns one [d, k] or [d, v] slice of these tensors, so a Transformer layer maintains h distinct sets of keys and values.^[1]

This design is fast to train, because all positions in a sequence can be processed in parallel. But Shazeer (2019) observed that the same design becomes a serious bottleneck during incremental inference, in which an autoregressive decoder generates one token at a time. At each decoding step, the model must reload from memory the cached keys and values of every previously generated position, for every layer and every head. This collection of cached tensors is generally called the KV cache. Shazeer's performance analysis showed that, for a baseline Transformer decoder, the ratio of memory access to arithmetic operations during incremental decoding scales as Theta(n/d + 1/b), where n is the current sequence length, d is the model dimension and b is the batch size. When n approaches d or the batch size is small, this ratio approaches one, and a modern GPU or TPU stalls waiting for memory rather than performing useful computation.^[1]

The two terms in this ratio suggest two different remedies. The 1/b term can be reduced by increasing the batch size, memory permitting. The n/d term is harder, because the cached K and V tensors have aggregate size proportional to b * h * n * k (where k is the per-head key dimension) and must be loaded in full at each decoding step. Earlier work attacked this by limiting the sequence length, attending only to a local window, or otherwise reducing the number of positions attended to.^[1] MQA takes a complementary path: rather than shrinking the sequence axis, it shrinks the heads axis of K and V.

How does Multi-Query Attention work?

MQA modifies multi-head attention by removing the heads dimension h from the key and value projection tensors while keeping it on the queries and on the output projection. In the original paper Shazeer writes that "Multi-query attention is identical except that the different heads share a single set of keys and values."^[1]

In terms of tensor shapes:

Standard multi-head attention uses P_q, P_k, P_v, P_o with shape [h, d, k] (or [h, d, v]).
MQA uses P_q of shape [h, d, k] and P_o of shape [h, d, v], but reduces P_k to shape [d, k] and P_v to shape [d, v], so there is exactly one key projection and one value projection shared across all h query heads.^[1]

Equivalently, all h query heads attend to the same key matrix K of shape [m, k] (or [b, m, k] for a batch of sequences) and the same value matrix V of shape [m, v], rather than h distinct copies. The TensorFlow einsum code shown in the paper differs from the multi-head version only in that the letter h is dropped from the equations wherever it indexed the heads dimension of K, V, P_k or P_v.^[1]
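The shape change can be made concrete with a small NumPy sketch of one decoding step. It follows the einsum style described above but is not the paper's TensorFlow code: the function names, the toy sizes and the omission of the 1/sqrt(k) scaling are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(x, M, P_q, P_k, P_v, P_o):
    """x: [d], M: [m, d], P_q/P_k: [h, d, k], P_v/P_o: [h, d, v]."""
    q = np.einsum("d,hdk->hk", x, P_q)
    K = np.einsum("md,hdk->hmk", M, P_k)   # one key matrix per head
    V = np.einsum("md,hdv->hmv", M, P_v)   # one value matrix per head
    weights = softmax(np.einsum("hk,hmk->hm", q, K))
    o = np.einsum("hm,hmv->hv", weights, V)
    return np.einsum("hv,hdv->d", o, P_o), K, V

def multiquery_attention(x, M, P_q, P_k, P_v, P_o):
    """Same as above, but P_k: [d, k] and P_v: [d, v] have no heads axis."""
    q = np.einsum("d,hdk->hk", x, P_q)
    K = np.einsum("md,dk->mk", M, P_k)     # single shared key matrix
    V = np.einsum("md,dv->mv", M, P_v)     # single shared value matrix
    weights = softmax(np.einsum("hk,mk->hm", q, K))
    o = np.einsum("hm,mv->hv", weights, V)
    return np.einsum("hv,hdv->d", o, P_o), K, V

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
h, d, k, v, m = 4, 16, 8, 8, 5
x = rng.standard_normal(d)
M = rng.standard_normal((m, d))
P_q = rng.standard_normal((h, d, k))
P_o = rng.standard_normal((h, d, v))
P_k_shared = rng.standard_normal((d, k))
P_v_shared = rng.standard_normal((d, v))

y_mqa, K_mqa, V_mqa = multiquery_attention(x, M, P_q, P_k_shared, P_v_shared, P_o)

# Copying the shared projections to every head makes the multi-head code
# compute the same output, but with an h-times larger K and V to cache.
P_k_tiled = np.broadcast_to(P_k_shared, (h, d, k))
P_v_tiled = np.broadcast_to(P_v_shared, (h, d, v))
y_mha, K_mha, V_mha = multihead_attention(x, M, P_q, P_k_tiled, P_v_tiled, P_o)
```

In this sketch K_mqa has shape [m, k] while K_mha has shape [h, m, k], which is the factor-of-h difference in what a decoder would need to cache and reload.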

A useful way to view MQA is as an extreme point in a more general design space. If we let g denote the number of key/value heads, then:

g = h recovers standard multi-head attention.
g = 1 is multi-query attention.
1 < g < h is the intermediate Grouped-Query Attention case introduced later by Ainslie et al.^[2]

This continuum is exactly the framing used by the GQA paper, which describes MQA as "Multi-query attention (MQA), which only uses a single key-value head" and presents GQA as "a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads."^[2]
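One way to picture the continuum is as a mapping from each query head to the key/value head it reads. The sketch below assumes query heads are grouped contiguously, which is a common convention but not something the quoted papers prescribe; the function name is hypothetical.

```python
def kv_head_index(query_head: int, num_query_heads: int, num_kv_heads: int) -> int:
    """Return the key/value head used by a query head (contiguous grouping assumed)."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size

h = 8
mha_map = [kv_head_index(i, h, 8) for i in range(h)]  # g = h: one KV head per query head
gqa_map = [kv_head_index(i, h, 2) for i in range(h)]  # 1 < g < h: two groups of four
mqa_map = [kv_head_index(i, h, 1) for i in range(h)]  # g = 1: every query head shares head 0
```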

How much memory and speed does MQA save?

The principal motivation for MQA is the size of the key/value cache used during autoregressive decoding. With standard multi-head attention, the per-layer KV cache for a batch has size proportional to b * h * n * k (for K) plus b * h * n * v (for V). MQA replaces the per-head copies with a single shared copy, reducing the cache to b * n * k + b * n * v, i.e. a factor of h smaller along the heads dimension.
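As a rough illustration of the formula above, the following sketch counts KV-cache bytes for a hypothetical decoder. The configuration (32 layers, 32 heads, head dimension 128, 16-bit values) is an assumption picked for round numbers, not any model discussed in this article, and the helper assumes the key and value dimensions are equal.

```python
def kv_cache_bytes(batch, seq_len, num_kv_heads, head_dim, num_layers, bytes_per_value=2):
    """Total KV-cache size in bytes, assuming k == v == head_dim."""
    keys_and_values = 2  # one cached tensor for K and one for V
    return (keys_and_values * batch * seq_len * num_kv_heads
            * head_dim * num_layers * bytes_per_value)

# Hypothetical serving configuration.
cfg = dict(batch=8, seq_len=2048, head_dim=128, num_layers=32)

mha_bytes = kv_cache_bytes(num_kv_heads=32, **cfg)  # one KV head per query head
mqa_bytes = kv_cache_bytes(num_kv_heads=1, **cfg)   # single shared KV head

print(f"MHA cache: {mha_bytes / 2**30:.2f} GiB")
print(f"MQA cache: {mqa_bytes / 2**20:.0f} MiB")
```

Under these assumptions the multi-head cache is 8 GiB and the multi-query cache is 256 MiB, a factor of 32 that equals the number of heads.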

Shazeer's complexity analysis quantifies the impact during incremental generation. For MQA, the ratio of memory access to arithmetic operations becomes Theta(1/d + n/(d*h) + 1/b), where the n/d term of the multi-head case has been reduced by a factor of h. The paper notes: "We have reduced the offensive n/d by a factor of h. Theoretically, given large batch size b, this should dramatically improve performance of incremental generation."^[1]
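The two asymptotic ratios can be compared numerically. Because Theta notation drops constant factors, the values below only indicate trends and are not predicted memory-to-compute ratios for any real hardware; the function names and sample sizes are illustrative assumptions.

```python
def mha_ratio(n, d, b):
    """Order-of-growth memory/compute ratio for multi-head decoding: n/d + 1/b."""
    return n / d + 1 / b

def mqa_ratio(n, d, h, b):
    """Order-of-growth ratio for multi-query decoding: 1/d + n/(d*h) + 1/b."""
    return 1 / d + n / (d * h) + 1 / b

n, d, h = 1024, 1024, 8  # sequence length equal to model dimension, 8 heads
for b in (1, 128):
    print(f"b={b:>3}: MHA {mha_ratio(n, d, b):.4f}  MQA {mqa_ratio(n, d, h, b):.4f}")
```

With b = 1 both ratios are dominated by the 1/b term, so sharing keys and values helps little. With b = 128 the multi-head ratio stays near 1 because of n/d, while the multi-query ratio falls to roughly 0.13, which is why the quoted passage conditions the improvement on a large batch size.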

The empirical speed numbers from the original 2019 experiments are striking. On WMT 2014 English-German translation with a 6-layer, 211-million-parameter encoder/decoder Transformer (d_model = 1024, d_ff = 4096, h = 8, d_k = d_v = 128), Shazeer reports per-token decoding times on a TPUv2:

Baseline multi-head: encoder 1.7 microseconds/token, decoder 46 microseconds/token.
Multi-query: encoder 1.5 microseconds/token, decoder 3.8 microseconds/token.

The decoder step is therefore roughly twelve times faster in the MQA variant under those conditions, while training time per token is essentially identical (13.2 microseconds for multi-head vs. 13.0 for multi-query).^[1] For beam-4 search the encoder/decoder times go from 2.0 plus 203 microseconds (multi-head) to 1.6 plus 32 microseconds (multi-query), a smaller but still roughly six-fold decoder speed-up.^[1]

Later analyses outside the original paper have framed the savings in terms of batch capacity rather than wall-clock time. Practitioners have noted that the KV cache shrinks by a factor equal to the number of attention heads, freeing memory for substantially larger inference batch sizes; one industry write-up reports batch-size increases on the order of 16x for Falcon-40B and around 70x for Falcon-7B when MQA replaces dense multi-head attention.^[3]

What is the quality cost of MQA?

The headline claim of Shazeer (2019) is that MQA achieves these speed-ups while incurring "only minor quality degradation from the baseline."^[1] The detailed numbers support that framing but also show a real, if small, gap.

On WMT 2014 English-German translation, the paper reports:

Multi-head (h = 8, d_k = d_v = 128, d_ff = 4096): dev log-perplexity 1.424, dev BLEU 26.7, test BLEU 27.7 (greedy) / 28.4 (beam 4).
Multi-query (h = 8, d_k = d_v = 128, d_ff = 5440, widened so total parameters match the baseline): dev log-perplexity 1.439, dev BLEU 26.5, test BLEU 27.5 (greedy) / 28.5 (beam 4).^[1]

On the Billion-Word Language Modeling Benchmark with 192-million-parameter decoder-only Transformers:

Multi-head: dev perplexity 29.9.
Multi-query: dev perplexity 30.2.^[1]

In both settings the multi-query model is "slightly worse" on per-token perplexity but, as Shazeer points out, "significantly better than any of the alternatives involving decreasing h, d_k and d_v" by the same amount. In other words, sharing keys and values across heads is a much more graceful way to shrink the KV cache than simply using fewer or narrower heads.^[1]

Later work surveyed this trade-off less optimistically once MQA was applied at large scale. The GQA paper of Ainslie et al. (2023) states bluntly in its introduction that "multi-query attention (MQA) can lead to quality degradation and training instability, and it may not be feasible to train separate models optimized for quality and inference."^[2] Appendix A of that paper notes specific instabilities, writing that "multi-query attention can lead to training instability during fine-tuning, in particular combined with long input tasks" and that pre-training with MQA "suffered from frequent loss spikes", whereas uptrained GQA models "appear to be stable" by contrast.^[2]

The GQA paper also reports quantitative quality differences on uptrained T5-XXL across summarisation and question-answering benchmarks. On representative tasks the ROUGE-1 scores are: CNN/DailyMail 47.2 for the multi-head XXL baseline, 46.6 for the MQA variant and 47.1 for GQA with eight groups; arXiv summarisation 43.8 for multi-head, 43.0 for MQA and 43.5 for GQA-8. Average inference time per sample on the same setup is 1.51 seconds for multi-head, 0.24 seconds for MQA and 0.28 seconds for GQA-8.^[2] These numbers make MQA's central trade-off explicit: roughly a factor of six speed-up against multi-head attention, at the cost of a small but consistent quality regression that GQA is able to mostly close.
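The trade-off can be tabulated directly from the figures quoted above. The sketch only re-derives ratios and differences from those numbers; the dictionary layout and key names are illustrative.

```python
# Figures as quoted in the paragraph above (T5-XXL, GQA paper).
results = {
    "MHA-XXL":   {"time_s": 1.51, "cnn_r1": 47.2, "arxiv_r1": 43.8},
    "MQA-XXL":   {"time_s": 0.24, "cnn_r1": 46.6, "arxiv_r1": 43.0},
    "GQA-8-XXL": {"time_s": 0.28, "cnn_r1": 47.1, "arxiv_r1": 43.5},
}

baseline = results["MHA-XXL"]
summary = {}
for name, r in results.items():
    summary[name] = {
        # How many times faster than the multi-head baseline.
        "speedup": round(baseline["time_s"] / r["time_s"], 2),
        # ROUGE-1 change relative to the multi-head baseline.
        "cnn_r1_delta": round(r["cnn_r1"] - baseline["cnn_r1"], 1),
        "arxiv_r1_delta": round(r["arxiv_r1"] - baseline["arxiv_r1"], 1),
    }

for name, s in summary.items():
    print(name, s)
```

On these numbers MQA is about 6.3 times faster than the baseline and loses 0.6 to 0.8 ROUGE-1 points, while GQA-8 is about 5.4 times faster and loses 0.1 to 0.3 points.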

Which models use Multi-Query Attention?

Although MQA was published in 2019, large-scale adoption in publicly described language models came in 2022 and 2023.

PaLM (Google, 2022)

Google's 540-billion-parameter PaLM, described by Chowdhery et al. (2022), is one of the earliest large LLMs to use MQA. The GQA paper explicitly cites PaLM as a prior user, writing that "while some language models already use multi-query attention, such as PaLM ... many do not, including publicly available language models such as T5 ... and LLaMA."^[2] Secondary technical write-ups summarise PaLM's use of MQA as projecting keys and values to shape [1, h] while keeping queries at [k, h]. Note that PaLM's notation differs from the rest of this article: there k is the number of heads and h is the attention head size. The stated rationale is that this has a neutral effect on model quality and training speed but yields "a significant cost savings at autoregressive decoding time", because standard multi-head attention has poor efficiency on accelerator hardware during autoregressive decoding.^[4]

Falcon (TII, 2023)

The Falcon family of open-weight LLMs (7B, 40B and later 180B) released by the UAE's Technology Innovation Institute in 2023 was explicitly designed for efficient inference and used MQA as one of its architectural choices. The official Falcon-40B model card on Hugging Face lists the attention mechanism as "multiquery (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022)", and adds that "for multiquery, we are using an internal variant which uses independent key and values per tensor parallel degree."^[5] Falcon's design pairs MQA with FlashAttention to push inference latency and memory usage down at large model scale.

StarCoder (BigCode, 2023)

The StarCoder code LLM from the BigCode project, described by Li et al. (2023), is another high-profile MQA user. The paper abstract describes "15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention."^[6] The Hugging Face model card states that StarCoder is "a GPT-2 model with multi-query attention and Fill-in-the-Middle objective" with a context window of 8192 tokens, trained on roughly one trillion tokens in bfloat16.^[7] Here MQA's role is, in part, to make the model's relatively long 8K context cheap to serve at high batch sizes for code-completion workloads.

Gemma 2B (Google, 2024)

Google's first Gemma release, in 2024, split its architecture by scale, and is a notable late example of pure MQA being preferred at small model size even after GQA had become the default for larger models. The Gemma technical report states that "the 7B model uses multi-head attention while the 2B checkpoints use multi-query attention", a split made "based on ablations" showing that the respective attention variants improved performance at each scale. In configuration terms the 2B model uses 8 query heads with a single key/value head (num_kv_heads = 1), the textbook MQA setup, while the 7B model keeps 16 key/value heads for full multi-head attention; both share a head dimension of 256.^[10]
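The head counts quoted above are enough to tell the variants apart mechanically. The sketch below uses a hypothetical AttentionConfig dataclass, not the actual Gemma configuration class, and fills it with the figures from the paragraph above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AttentionConfig:  # hypothetical container, for illustration only
    name: str
    num_heads: int
    num_kv_heads: int
    head_dim: int

def attention_variant(cfg: AttentionConfig) -> str:
    """Classify a configuration by its number of key/value heads."""
    if cfg.num_kv_heads == cfg.num_heads:
        return "MHA"
    if cfg.num_kv_heads == 1:
        return "MQA"
    return "GQA"

def kv_values_per_token_per_layer(cfg: AttentionConfig) -> int:
    """Cached numbers per token per layer: keys plus values."""
    return 2 * cfg.num_kv_heads * cfg.head_dim

gemma_2b = AttentionConfig("Gemma 2B", num_heads=8, num_kv_heads=1, head_dim=256)
gemma_7b = AttentionConfig("Gemma 7B", num_heads=16, num_kv_heads=16, head_dim=256)

for cfg in (gemma_2b, gemma_7b):
    print(cfg.name, attention_variant(cfg), kv_values_per_token_per_layer(cfg))
```

With these figures the 2B model caches 512 values per token per layer and the 7B model caches 8,192, a 16-fold difference that comes entirely from the number of key/value heads.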

Across these deployments the common theme is that MQA was selected specifically for inference-time efficiency in regimes where the KV cache, not raw FLOPs, dominates serving cost.

How does MQA differ from Grouped-Query Attention (GQA)?

Multi-query attention sits at one end of a spectrum that runs from one shared key/value head (MQA) to one per query head (standard multi-head attention). Ainslie et al. (2023) introduced Grouped-Query Attention to occupy the middle of this spectrum. In GQA, query heads are partitioned into G groups, with each group sharing a single key and value head; so G = 1 recovers MQA and G = h recovers multi-head attention.^[2]

The GQA paper makes two main contributions. First, it proposes an "uptraining" recipe: an existing multi-head checkpoint can be converted to MQA or GQA by mean-pooling existing key and value heads within each group and then continuing pre-training for a small fraction (around 5%) of the original compute. Second, it shows that GQA with a moderate number of groups (eight, in their main experiments on T5-XXL) "achieves quality close to multi-head attention with comparable speed to MQA."^[2] As noted above, the ROUGE and inference-time numbers bear this out across CNN/DailyMail, arXiv, PubMed, MediaSum, MultiNews, WMT and question-answering tasks.^[2]
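The checkpoint-conversion half of the uptraining recipe can be sketched in a few lines of NumPy. This shows only the mean-pooling step on a random stand-in for a key projection tensor; the contiguous grouping, the tensor layout [h, d, k] and the function name are assumptions, and the continued pre-training that follows is not shown.

```python
import numpy as np

def mean_pool_kv_heads(P, num_groups):
    """Average a [h, d, k] projection tensor within contiguous groups -> [g, d, k]."""
    h, d, k = P.shape
    if h % num_groups != 0:
        raise ValueError("number of heads must be divisible by num_groups")
    return P.reshape(num_groups, h // num_groups, d, k).mean(axis=1)

rng = np.random.default_rng(0)
P_k = rng.standard_normal((8, 16, 4))    # stand-in for a multi-head key projection

P_k_gqa = mean_pool_kv_heads(P_k, 2)     # two groups of four heads: shape [2, 16, 4]
P_k_mqa = mean_pool_kv_heads(P_k, 1)[0]  # one group of all heads: shape [16, 4]
```

The same function would be applied to the value projection. Query and output projections keep their heads axis and are left unchanged.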

The practical implication is that GQA is preferable to MQA in most settings where a quality regression is undesirable. Its KV cache is G times larger than MQA's but still h/G times smaller than that of multi-head attention, and it retains most of MQA's speed-up at a much smaller quality cost. As the GQA paper puts it, "GQA achieves significant additional quality gains, achieving performance close to MHA-XXL with speed close to MQA."^[2]

A pleasing side benefit is that GQA aligns naturally with tensor-parallel inference: setting G equal to the tensor-parallel degree puts exactly one KV head per GPU and avoids cross-device duplication of the KV cache.
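The tensor-parallel argument can be checked with a toy sharding plan. The sketch assumes query heads are split contiguously across devices and that each device stores every key/value head its query heads need; real inference frameworks may shard differently, and both function names are hypothetical.

```python
def shard_heads(num_query_heads, num_kv_heads, tp_degree):
    """Return, per device, the query heads it owns and the KV heads it must store."""
    if num_query_heads % tp_degree != 0 or num_query_heads % num_kv_heads != 0:
        raise ValueError("head counts must divide evenly")
    q_per_device = num_query_heads // tp_degree
    group_size = num_query_heads // num_kv_heads
    plan = []
    for device in range(tp_degree):
        q_heads = list(range(device * q_per_device, (device + 1) * q_per_device))
        kv_heads = sorted({q // group_size for q in q_heads})
        plan.append((q_heads, kv_heads))
    return plan

def stored_kv_heads(plan):
    """Total KV heads held across all devices, counting replicated copies."""
    return sum(len(kv_heads) for _, kv_heads in plan)

gqa_plan = shard_heads(num_query_heads=64, num_kv_heads=8, tp_degree=8)
mqa_plan = shard_heads(num_query_heads=64, num_kv_heads=1, tp_degree=8)

print(stored_kv_heads(gqa_plan))  # 8 stored for 8 distinct heads: no duplication
print(stored_kv_heads(mqa_plan))  # 8 stored for 1 distinct head: replicated on every device
```

Under these assumptions pure MQA on eight devices stores the single key/value head eight times, so its per-device cache is no smaller than that of GQA with eight groups.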

How does MQA relate to Multi-head Latent Attention (MLA)?

Multi-head Latent Attention (MLA) is a later evolution of the same KV-cache-reduction idea that motivated MQA, introduced by DeepSeek in the DeepSeek-V2 technical report in May 2024. Where MQA shrinks the KV cache by sharing one key head and one value head across all query heads, MLA instead applies a low-rank joint compression that projects the keys and values into a single shared latent vector, which is what gets cached; the full-size keys and values are reconstructed on the fly during attention. The DeepSeek-V2 paper states that "MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector", and that, "equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, but requires a significantly smaller amount of KV cache."^[9]
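The low-rank idea can be sketched with random matrices. This is a deliberately simplified illustration of caching a latent vector and expanding it into per-head keys and values; it omits the position-embedding handling and other details of the DeepSeek-V2 design, and all sizes and names (W_down, W_up_k, W_up_v) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads, head_dim, latent_dim = 64, 4, 16, 8  # toy sizes

W_down = rng.standard_normal((d_model, latent_dim))               # joint KV compression
W_up_k = rng.standard_normal((latent_dim, num_heads, head_dim))   # latent -> per-head keys
W_up_v = rng.standard_normal((latent_dim, num_heads, head_dim))   # latent -> per-head values

def compress(x):
    """x: [n, d_model] -> latent: [n, latent_dim]. Only this would be cached."""
    return x @ W_down

def expand(latent):
    """Rebuild per-head keys and values of shape [n, num_heads, head_dim]."""
    K = np.einsum("nc,chk->nhk", latent, W_up_k)
    V = np.einsum("nc,chk->nhk", latent, W_up_v)
    return K, V

x = rng.standard_normal((10, d_model))  # hidden states for 10 cached positions
latent_cache = compress(x)
K, V = expand(latent_cache)

mha_values_per_token = 2 * num_heads * head_dim  # full keys and values
mla_values_per_token = latent_dim                # shared latent vector only
```

In this toy setup the cache holds 8 numbers per token instead of 128, while every head still receives its own keys and values after expansion, which is the qualitative difference from MQA's single shared head.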

The reported gains are large. Compared with the earlier DeepSeek 67B model, DeepSeek-V2 (236 billion total parameters, 21 billion activated per token) "reduces the KV cache by 93.3%" and boosts maximum generation throughput to 5.76 times, while also cutting training cost by 42.5%.^[9] Unlike MQA and GQA, which trade away some quality relative to full multi-head attention, DeepSeek reports MLA slightly improving on it, which is why MLA, rather than MQA, was carried forward into DeepSeek's later frontier models. MLA is best understood as the same lesson MQA first taught, that the key/value side of attention can be made far skinnier than the query side, pushed from head-sharing all the way to explicit latent compression.

Is Multi-Query Attention still used?

By the mid-2020s, MQA in its strict g = 1 form has been largely supplanted by GQA in new frontier-class LLMs, though it remains common in code, infilling and other workloads where aggressive inference throughput justifies the quality cost, and it still appears at small model scale (for example Gemma 2B in 2024).^[10] Meta's Llama 2, for instance, adopts grouped-query attention rather than pure MQA for its larger 34B and 70B variants. In the Hugging Face implementation the number of key/value heads is set by num_key_value_heads: a value of 1 recovers MQA, a value equal to num_attention_heads recovers multi-head attention, and anything in between is GQA. The 70B model uses 8 key/value heads, which also matches an 8-way tensor-parallel split.^[8]

Several practical lessons from the MQA experience continue to influence current attention research:

KV-cache size, not FLOPs, is often the binding inference constraint. The Shazeer (2019) analysis of the n/d + 1/b ratio remains a useful mental model when reasoning about whether a decoder is compute-bound or memory-bound, and many subsequent attention variants (GQA, multi-head latent attention, paged attention, and others) target the same bottleneck from different angles.^[1]
Aggressive sharing is risky to combine with long context and fine-tuning. The training instabilities reported by the GQA authors, including loss spikes during pre-training and instability during long-input fine-tuning, are part of why few large-scale 2024-era models continue to use pure MQA.^[2]
Uptraining is cheap. The ability to convert an existing multi-head checkpoint into an MQA or GQA model using about 5% of the original compute means that the choice between these attention variants need not be locked in at the start of training, an insight first quantified in the GQA paper but rooted in MQA's parameter compatibility with multi-head attention.^[2]

MQA itself, then, is best understood as the first practical realisation of the broader idea that the key/value side of attention can be made far skinnier than the query side without breaking the model. That idea, more than the specific g = 1 choice, is its lasting contribution, carried forward by GQA and MLA alike.

References

1. Shazeer, Noam. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv:1911.02150, 6 November 2019. https://arxiv.org/abs/1911.02150 (Accessed 2026-05-19). Full text including Tables 1, 2 and 3 with WMT 2014 EN-DE and Billion-Word benchmark results: https://arxiv.org/pdf/1911.02150v1 (Accessed 2026-05-19).
2. Ainslie, Joshua; Lee-Thorp, James; de Jong, Michiel; Zemlyanskiy, Yury; Lebrón, Federico; and Sanghai, Sumit. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv:2305.13245, May 2023. https://arxiv.org/abs/2305.13245 (Accessed 2026-05-19). HTML version with full tables and appendix discussion of training instability: https://arxiv.org/html/2305.13245v3 (Accessed 2026-05-19).
3. Fireworks AI. "Multi-Query Attention is All You Need." Fireworks AI blog, 2023. https://fireworks.ai/blog/multi-query-attention-is-all-you-need (Accessed 2026-05-19).
4. Wolfe, Cameron R. "PaLM: Efficiently Training Massive Language Models." Deep (Learning) Focus, Substack. https://cameronrwolfe.substack.com/p/palm-efficiently-training-massive (Accessed 2026-05-19).
5. Technology Innovation Institute. "tiiuae/falcon-40b" model card. Hugging Face. https://huggingface.co/tiiuae/falcon-40b (Accessed 2026-05-19).
6. Li, Raymond; et al. "StarCoder: may the source be with you!" arXiv:2305.06161, May 2023. https://arxiv.org/abs/2305.06161 (Accessed 2026-05-19).
7. BigCode. "bigcode/starcoder" model card. Hugging Face. https://huggingface.co/bigcode/starcoder (Accessed 2026-05-19).
8. Hugging Face Transformers documentation. "Llama 2." https://huggingface.co/docs/transformers/en/model_doc/llama2 (Accessed 2026-05-19).
9. DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, May 2024. https://arxiv.org/abs/2405.04434 (Accessed 2026-07-08). HTML version with the Multi-head Latent Attention section and KV-cache comparison table: https://arxiv.org/html/2405.04434v5 (Accessed 2026-07-08).
10. Gemma Team, Google DeepMind. "Gemma: Open Models Based on Gemini Research and Technology." arXiv:2403.08295, 2024. https://arxiv.org/abs/2403.08295 (Accessed 2026-07-08). HTML version with the architecture table listing attention type, heads and KV heads per model size: https://arxiv.org/html/2403.08295v1 (Accessed 2026-07-08).
