Grouped-Query Attention (GQA)
Last reviewed
May 19, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,236 words
Grouped-Query Attention (GQA) is a variant of the attention mechanism used in Transformer language models. It shrinks the key/value (KV) cache, and the memory bandwidth needed to read it during autoregressive decoding, by sharing key and value projections across groups of query heads. GQA was introduced by Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai in a 2023 paper titled "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," which appeared at EMNLP 2023.[^1][^2] The technique is positioned as a tunable middle ground between the original multi-head attention (MHA) of "Attention Is All You Need" and Noam Shazeer's multi-query attention (MQA) from 2019. It has since become the default attention configuration in many widely deployed large language models, including Llama 2 70B, all sizes of Llama 3, Mistral 7B, and several models in the Qwen, Gemma, and Falcon families.[^1][^3][^4][^5][^6]
In the standard Transformer decoder introduced by Vaswani et al. (2017), each attention layer computes queries, keys, and values from the input hidden states using independent linear projections, then splits the result into $h$ parallel "heads," each of which performs scaled dot-product attention.[^7] During autoregressive generation, the keys and values for previously generated tokens are recomputed at every step unless they are cached. The standard optimization, known as the KV cache, stores these tensors so that each new token only needs to compute attention against the cached keys and values rather than reprocessing the full sequence.[^8]
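As a minimal sketch of the idea, assuming a single attention head and plain Python lists in place of tensors, the cache can be modeled as two growing lists. The names `attend`, `KVCache`, and `decode_step` are hypothetical and chosen for illustration only.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(dim)]

class KVCache:
    """Hypothetical single-head cache: one key and one value per past token."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def decode_step(cache, query, key, value):
    """Process one new token: store its key/value, then attend over the cache."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)

cache = KVCache()
first = decode_step(cache, [1.0, 0.0], [1.0, 0.0], [1.0, 2.0])
second = decode_step(cache, [0.0, 1.0], [0.0, 1.0], [3.0, 4.0])
```

Each step computes a key and value only for the new token; earlier ones are read back from the cache, which is why the cache's size determines how much memory has to be read per step.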
The memory required by the KV cache grows linearly with the number of layers, the number of attention heads, the head dimension, the batch size, and the sequence length. As models scaled to tens of billions of parameters and context windows expanded to thousands or hundreds of thousands of tokens, the KV cache became a dominant cost in inference. The bytes that must be read from high-bandwidth memory at each decoding step are proportional to the KV cache size, and modern accelerators reach the limit of their memory bandwidth long before they exhaust their compute capacity during decoding, making the load of the cached keys and values the principal latency bottleneck.[^9]
Shazeer introduced multi-query attention in a 2019 paper titled "Fast Transformer Decoding: One Write-Head is All You Need" (arXiv:1911.02150).[^9] In MQA, all $h$ query heads share a single key head and a single value head, rather than each query head having its own. Because the cache for a layer scales with the number of key/value heads, MQA reduces the per-layer KV cache by a factor of $h$. Shazeer's experiments showed that this dramatically accelerates decoding on hardware whose latency is dominated by memory bandwidth, while incurring only a small drop in model quality.[^9]
The catch is that the quality drop is not uniformly small. Subsequent work observed that MQA could cause training instability and degrade quality on long-generation tasks such as summarization, in addition to creating practical issues when combined with tensor parallelism: because there is only one key and one value head, the same K and V projections must be duplicated on every tensor-parallel shard rather than being split across them.[^1][^3]
In MHA, the number of key/value heads equals the number of query heads. This maximizes representational capacity per layer but also maximizes KV cache size. For a model with $L$ layers, $h$ heads, head dimension $d_h$, sequence length $s$, and batch size $b$, the KV cache holds $2 \cdot L \cdot b \cdot h \cdot s \cdot d_h$ elements (the factor of 2 accounts for keys and values), and its size in bytes is that count multiplied by the per-element byte cost (typically 2 bytes for bfloat16 or float16).[^3] For a 70-billion-parameter model with 64 heads and a 4K context, this can reach the order of gigabytes per inference request, which limits the batch size achievable on a given accelerator and makes long-context decoding bandwidth-bound.[^3]
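The cache-size expression can be checked with a few lines of arithmetic. The sketch below uses a hypothetical helper, `kv_cache_bytes`, and an illustrative configuration: the 64 heads and 4K context come from the text, while the 80 layers and head dimension of 128 are assumptions made only for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_element=2):
    """KV cache size in bytes: 2 (keys and values) * L * b * heads * s * d_h * bytes."""
    elements = 2 * num_layers * batch_size * num_kv_heads * seq_len * head_dim
    return elements * bytes_per_element

# Assumed values: 80 layers and head_dim=128 are illustrative, not from the text.
mha_bytes = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128, seq_len=4096)
print(mha_bytes / 2**30)  # prints 10.0 (GiB for one sequence at 2 bytes per element)
```

Under these assumptions a single 4,096-token sequence needs about 10 GiB of cache, and the total grows in direct proportion to batch size and context length.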
GQA generalizes MQA by introducing a tunable number of key/value heads $G$, with $1 \le G \le h$. The $h$ query heads are partitioned into $G$ equal groups; all query heads in the same group share a single key head and a single value head. As Ainslie et al. describe it, GQA is "a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads."[^1]
Two limit cases are useful to keep in mind: with $G = 1$, GQA reduces to MQA, and with $G = h$, it is equivalent to MHA.[^1]
The KV cache size for GQA-$G$ scales with $G$ rather than with $h$, so going from MHA to GQA-$G$ reduces the per-layer KV cache by a factor of $h / G$.[^1][^3] For example, the 70B-class models that use 64 query heads with 8 KV heads (GQA-8) reduce their KV cache by a factor of 8 compared to a 64-head MHA model with the same head dimension and layer count.[^3][^4]
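Written out with the same symbols ($L$ layers, batch size $b$, sequence length $s$, head dimension $d_h$), the reduction is a simple ratio, since only the head count changes between the two configurations. The numbers are those of the 64-query-head, 8-KV-head example:

$$
\frac{\text{KV}_{\text{MHA}}}{\text{KV}_{\text{GQA-}G}} = \frac{2 \cdot L \cdot b \cdot h \cdot s \cdot d_h}{2 \cdot L \cdot b \cdot G \cdot s \cdot d_h} = \frac{h}{G}, \qquad \frac{64}{8} = 8.
$$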
GQA preserves an important property of MHA: every query head still computes its own attention pattern over the cached keys, so the model retains $h$ distinct attention distributions even though it stores only $G$ key/value tensors.[^3] In implementation, this is typically done by repeating each shared key and value tensor $h/G$ times along the head axis (or, equivalently, broadcasting it) so that the attention computation is performed exactly as in MHA.[^3]
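The following is a minimal sketch of one GQA decoding step, assuming plain Python lists in place of tensors and a contiguous assignment of query heads to groups. The function name `gqa_decode_step` and the toy values are hypothetical. Real implementations operate on batched tensors, but the indexing logic is the same.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gqa_decode_step(queries, key_cache, value_cache):
    """One decoding step of grouped-query attention.

    queries:     h query vectors for the new token, one per query head.
    key_cache:   G lists of cached key vectors, one list per KV head.
    value_cache: G lists of cached value vectors, one list per KV head.
    Returns h output vectors, one per query head.
    """
    num_q_heads, num_kv_heads = len(queries), len(key_cache)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("h must be divisible by G")
    group_size = num_q_heads // num_kv_heads

    # "Repeat" each KV head h/G times along the head axis. These entries are
    # references to the same lists, so nothing is copied (the broadcast view).
    keys = [key_cache[i // group_size] for i in range(num_q_heads)]
    values = [value_cache[i // group_size] for i in range(num_q_heads)]

    # From here on, the computation is ordinary per-head attention, as in MHA.
    outputs = []
    for q, ks, vs in zip(queries, keys, values):
        scale = 1.0 / math.sqrt(len(q))
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in ks])
        outputs.append([sum(w * v[j] for w, v in zip(weights, vs))
                        for j in range(len(vs[0]))])
    return outputs

# h = 4 query heads share G = 2 KV heads; two cached tokens, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
key_cache = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
value_cache = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
outputs = gqa_decode_step(queries, key_cache, value_cache)
```

Heads 0 and 1 read the same cached keys and values yet produce different outputs because their queries differ, which is the property described above: $h$ attention distributions over $G$ stored KV heads.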
GQA also interacts more cleanly with tensor parallelism than MQA does. If the number of KV groups equals the number of tensor-parallel shards (or a multiple of it), each shard naturally owns a distinct set of key/value heads, avoiding the cross-device duplication required by MQA.[^3]
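To illustrate the point, the sketch below assigns KV heads to tensor-parallel shards under a simple contiguous layout. The function `kv_heads_for_shards` and the layout itself are assumptions made for this example and do not describe any particular framework's partitioning scheme.

```python
def kv_heads_for_shards(num_kv_heads, tp_degree):
    """Return, for each tensor-parallel shard, the KV head indices it holds."""
    if num_kv_heads >= tp_degree:
        if num_kv_heads % tp_degree != 0:
            raise ValueError("num_kv_heads must be a multiple of tp_degree")
        per_shard = num_kv_heads // tp_degree
        # Every shard owns a distinct slice of KV heads: no duplication.
        return [list(range(r * per_shard, (r + 1) * per_shard))
                for r in range(tp_degree)]
    if tp_degree % num_kv_heads != 0:
        raise ValueError("tp_degree must be a multiple of num_kv_heads")
    replicas = tp_degree // num_kv_heads
    # Fewer KV heads than shards: each head is duplicated on `replicas` shards.
    return [[r // replicas] for r in range(tp_degree)]

gqa_layout = kv_heads_for_shards(num_kv_heads=8, tp_degree=8)
mqa_layout = kv_heads_for_shards(num_kv_heads=1, tp_degree=8)
```

With 8 KV heads on 8 shards, each shard holds one distinct head. With the single KV head of MQA, all 8 shards hold a copy of head 0.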
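A central contribution of Ainslie et al. is the "uptraining" recipe, a method for taking a pre-trained MHA checkpoint and converting it to MQA or GQA without paying the full cost of pretraining a new model from scratch.[^1] The recipe has two steps: first, the key and value projection matrices of the heads in each group are mean-pooled into a single key projection and a single value projection; second, the converted checkpoint is pretrained further for a small proportion $\alpha$ of the original training steps.[^1]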
The paper reports that $\alpha = 0.05$, i.e., 5 percent of the original pretraining compute, is sufficient to recover most of the quality of the original MHA model when converting to GQA.[^1][^10] Their ablation showed that mean-pooling outperformed alternatives such as selecting a single representative head per group or random initialization of the new K/V projections, consistent with the intuition that mean-pooling preserves more of the information present in the original heads.[^1][^10] Increasing the proportion of uptraining beyond 5 percent yielded diminishing returns.[^10]
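The mean-pooling step of the conversion can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each head's projection is a small nested list, heads are grouped contiguously, and the helper name `mean_pool_kv_heads` is hypothetical. The same function would be applied once to the key projections and once to the value projections.

```python
def mean_pool_kv_heads(head_weights, num_groups):
    """Average per-head projection matrices within each group.

    head_weights: list of h matrices (nested lists of equal shape), one per
                  original key (or value) head.
    num_groups:   target number of KV heads G; must divide h.
    Returns G matrices, each the element-wise mean of one group of heads.
    """
    num_heads = len(head_weights)
    if num_heads % num_groups != 0:
        raise ValueError("number of heads must be divisible by num_groups")
    group_size = num_heads // num_groups
    pooled = []
    for g in range(num_groups):
        group = head_weights[g * group_size:(g + 1) * group_size]
        rows, cols = len(group[0]), len(group[0][0])
        pooled.append([[sum(w[r][c] for w in group) / group_size
                        for c in range(cols)] for r in range(rows)])
    return pooled

# Four toy heads, each a 1x2 projection, pooled into two KV heads.
heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
pooled = mean_pool_kv_heads(heads, num_groups=2)
print(pooled)  # [[[2.0, 3.0]], [[6.0, 7.0]]]
```

Setting `num_groups=1` gives the MQA conversion (one averaged head), and `num_groups` equal to the head count leaves the checkpoint unchanged. The pooled weights are only an initialization; the uptraining phase is what lets the model adapt to them.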
This uptraining method was important because, at the time of publication, most large pre-trained checkpoints used MHA. The recipe gave practitioners a path to retrofit them for efficient inference without retraining from scratch, which is prohibitively expensive at scale.[^1]
The original GQA paper evaluates the technique on the T5.1.1 architecture, using public Large and XXL checkpoints from the T5 family.[^1][^10] The evaluation suite includes summarization datasets (CNN/Daily Mail, arXiv, PubMed, MediaSum, and Multi-News), the WMT 2014 English-to-German translation benchmark, and the TriviaQA question-answering dataset.[^10]
The headline result, reported in Table 1 of the paper, is that an MHA T5-XXL model uptrained to GQA-8 attains average dev-set performance very close to the original MHA-XXL baseline while running at inference speed close to the MQA variant.[^10] In the reported per-sample inference times on TPUv4 and average scores, the MHA-XXL baseline is the slowest, MQA-XXL is the fastest but shows a small quality drop relative to MHA-XXL, and GQA-8-XXL runs almost as fast as MQA-XXL while recovering essentially all of the quality of MHA-XXL.[^10] The MQA-XXL configuration reduces sample generation time to less than one third of the MHA-XXL baseline, illustrating how heavily KV cache loading dominates decoding latency at scale.[^10]
The ablation on group count shows the expected smooth interpolation: as $G$ increases from 1 (MQA) toward $h$ (MHA), quality rises while inference speed decreases. Eight groups was identified as a favorable trade-off in the paper's setting, producing quality close to MHA at speed close to MQA.[^1][^10]
The speedup from GQA at training time is more modest than at inference time, because training is typically compute-bound rather than memory-bound. The principal benefits of GQA show up in autoregressive decoding, batched serving, and long-context inference, where loading the KV cache from memory is the dominant cost.[^1][^9]
GQA was adopted very quickly by the open-weights LLM community after the 2023 paper, and it is now the standard attention configuration in many production-scale models.
Meta's Llama 2 family, released in 2023, used GQA for the 70B-parameter model. The 7B and 13B Llama 2 models retained standard MHA, but the 70B model adopted GQA with 8 key/value heads (i.e., 64 query heads sharing 8 KV heads, eight query heads per KV head), citing the need to make inference of such a large model practical.[^11]
Llama 3, released by Meta in 2024, applied GQA across the entire family rather than only at the largest scale. Both the 8B and 70B variants use GQA, a choice the model card describes as motivated by inference efficiency.[^5][^12] The Llama 3 8B and 70B base models share a vocabulary of 128K tokens and were trained on sequences of 8,192 tokens; both employ GQA throughout.[^12]
The Mistral 7B model, released by Mistral AI in late 2023, combined GQA with sliding-window attention. The Mistral 7B paper states that the model "leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA)" to handle long sequences efficiently.[^6] Mistral 7B uses 32 query heads and 8 KV heads, giving a query-to-KV ratio of 4.[^3][^13]
Subsequent model families from multiple organizations have also adopted GQA. Qwen2 documentation describes GQA as the standard attention mechanism, with num_key_value_heads exposed as an explicit hyperparameter that selects MHA, MQA, or GQA depending on its value.[^14] Sebastian Raschka's 2025 architecture survey identifies GQA as a shared design choice across Llama, Qwen, Gemma, Mistral, and GPT-OSS, alongside techniques such as rotary position embeddings and SwiGLU activations.[^4] Among open-weights model lines, GQA is paired with other efficiency techniques (sliding-window attention, sparse mixtures of experts, and so on) rather than being chosen instead of them, which reflects its role as a baseline component of contemporary decoder Transformers.[^4]
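The way a single KV-head count selects among the three variants can be sketched as follows. The helper `attention_variant` is hypothetical, and the parameter name `num_attention_heads` is an assumption for this example; only `num_key_value_heads` is named in the text.

```python
def attention_variant(num_attention_heads, num_key_value_heads):
    """Classify an attention configuration by its KV-head count."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must be divisible by KV heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"  # one KV head per query head
    if num_key_value_heads == 1:
        return "MQA"  # a single KV head shared by all query heads
    return "GQA"      # an intermediate number of KV heads

print(attention_variant(32, 8))  # GQA (the Mistral 7B head counts given above)
```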
For closed-weights models such as the GPT-4 family, the exact attention configuration has not been publicly confirmed in technical documentation. Public discussion has speculated about GQA-like configurations in such systems, but these claims are not verified by official sources, so this article does not treat them as confirmed. What is verifiable is the broader trend in open-weights models, which is that nearly every recently released decoder Transformer above a few billion parameters uses some form of GQA, MQA, or a more aggressive KV-cache compression scheme.
The motivation for GQA is most apparent when one considers the cost structure of LLM serving:
These properties are why GQA is often described as one of the "free lunches" of modern LLM architectures: it can be applied at design time or via uptraining and yields large inference-time gains for only a small quality cost, which the paper's empirical results consistently show to be smaller than the gap from MHA to MQA.[^1][^10]
Several follow-up techniques build on GQA or address its limitations:
A notable alternative to GQA is Multi-head Latent Attention (MLA), introduced by DeepSeek-AI in the DeepSeek-V2 paper (arXiv:2405.04434).[^19] MLA does not reduce the number of key/value heads. Instead, it applies a low-rank joint compression of the keys and values into a learned latent vector that is stored in the cache; the per-head keys and values are recomputed from this latent representation at attention time.[^19][^20]
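The compress-then-expand idea can be illustrated with a heavily simplified sketch for a single token. It omits parts of the actual DeepSeek-V2 design, such as its handling of rotary position embeddings. All function names, dimensions, and the tiny hand-written matrices are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def compress(hidden, w_down):
    """Down-project the hidden state to the latent vector stored in the cache."""
    return matvec(w_down, hidden)

def expand(latent, w_up_k, w_up_v):
    """Recompute the concatenated per-head keys and values from the latent."""
    return matvec(w_up_k, latent), matvec(w_up_v, latent)

# Toy sizes: hidden dimension 4, latent dimension 2, and h * d_h = 4.
hidden = [1.0, 2.0, 3.0, 4.0]
w_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
w_up_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
w_up_v = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [1.0, 1.0]]

latent = compress(hidden, w_down)            # what gets cached per token
keys, values = expand(latent, w_up_k, w_up_v)
cached_per_token = len(latent)               # 2 elements
mha_per_token = len(keys) + len(values)      # 8 elements if K and V were cached
```

In this toy setting the cache holds 2 numbers per token instead of 8, while attention still sees keys and values for every head. The saving comes from the latent dimension rather than from the number of KV heads.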
The DeepSeek-V2 paper reports that DeepSeek-V2, which uses MLA, reduces the KV cache by 93.3 percent relative to the earlier DeepSeek 67B model, and that MLA can match or exceed standard MHA in quality, in contrast to the modest but real quality loss observed for GQA and MQA compared to MHA.[^19] In the paper's ablations, GQA configurations were reported to underperform MHA in modeling quality, whereas MLA was reported to outperform MHA in their setting.[^19] This finding is part of why DeepSeek's subsequent models, including DeepSeek V3, continued to use MLA rather than GQA.[^4][^19]
GQA and MLA represent two complementary strategies for shrinking the KV cache: GQA reduces how many key/value pairs are stored by sharing heads, while MLA reduces what is stored per pair by compressing it. GQA is simpler to implement and compatible with arbitrary existing MHA checkpoints via uptraining, while MLA can reach much larger cache reductions but is more involved to implement and was demonstrated primarily on models pretrained from scratch with the technique in mind.[^4][^19]
GQA does involve a real, if usually small, quality cost relative to full MHA, particularly when $G$ is pushed toward 1.[^1][^4][^10] The optimal choice of $G$ depends on the model size, context length, hardware target, and the cost of additional pretraining or uptraining; the eight-group setting popularized by the original paper and adopted by Llama is a useful default but not necessarily optimal in every regime.[^10][^17]
GQA also does not, by itself, eliminate the long-context KV-cache bottleneck: as context lengths grow into the hundreds of thousands of tokens, even the reduced cache becomes large. Production deployments typically combine GQA with paged attention or compressed KV-cache management, sliding-window attention, and sometimes more aggressive compression techniques like MLA or learned KV quantization in order to make extremely long contexts feasible.[^4][^15]
Finally, although uptraining is far cheaper than pretraining, it is not free. Converting a large MHA checkpoint to GQA still requires nontrivial compute (on the order of a few percent of the original pretraining), and the conversion can interact in subtle ways with later fine-tuning recipes, so practitioners sometimes prefer to pretrain GQA models from scratch when starting a new model family.[^1][^4]
Grouped-Query Attention is a simple but high-impact architectural change to the Transformer attention block: instead of either one KV head per query head (MHA) or a single shared KV head (MQA), GQA uses an intermediate number of KV heads, with each KV head shared across a group of query heads. The technique reduces KV cache size and memory-bandwidth demand in proportion to the grouping factor $h / G$, which translates directly into faster autoregressive decoding, higher serving throughput, larger feasible context windows, and the ability to serve more concurrent requests on the same hardware. The accompanying uptraining recipe lets practitioners convert existing MHA checkpoints to GQA with roughly 5 percent of the original pretraining compute. Within roughly two years of its introduction at EMNLP 2023, GQA had become the default attention configuration in most production-scale open-weights decoder LLMs, and it remains a standard building block alongside competing or complementary techniques such as sliding-window attention and Multi-head Latent Attention.[^1][^4][^10]