Grouped-query attention (GQA) is an attention mechanism for transformer models that reduces memory and computational costs during inference by having multiple query heads share a single set of key and value heads. Introduced by Ainslie et al. at Google in 2023, GQA is a generalization that sits between standard multi-head attention (MHA) and multi-query attention (MQA). In MHA, each query head has its own dedicated key and value heads. In MQA, all query heads share a single key-value head. GQA divides the query heads into groups, with each group sharing one key-value head, offering a tunable trade-off between the quality of MHA and the speed of MQA.
GQA has been widely adopted since its introduction. Meta used it in LLaMA 2 (July 2023) and retained it in LLaMA 3 (April 2024). Mistral AI adopted it for Mistral 7B (September 2023). Google used it in Gemma (February 2024). By 2025, GQA has become the standard attention configuration in most large-scale large language models.
The original transformer (Vaswani et al., 2017) introduced multi-head attention, where the input is projected into multiple sets of queries (Q), keys (K), and values (V). Each "head" performs attention independently, and the results are concatenated and projected back. If a model has h attention heads with a head dimension of d_k, then each layer maintains h separate Q, K, and V projections.
During training, this is efficient because all tokens in a batch are processed simultaneously. During autoregressive inference (text generation), however, the model generates one token at a time. For each new token, the model must:

- compute the query, key, and value projections for that token;
- attend over the keys and values of every previous token in the sequence;
- to avoid recomputing them at each step, store the keys and values of all previous tokens in a cache (the KV cache).
The KV cache grows linearly with sequence length and must be stored in GPU memory. For a model with L layers, h heads, head dimension d_k, sequence length S, and batch size B, the KV cache requires:

2 * B * S * L * h * d_k * (bytes per element)
The factor of 2 accounts for both keys and values. For large models with long contexts, this cache can consume tens of gigabytes of GPU memory. The memory bandwidth required to read this cache at each generation step also becomes a bottleneck, because the actual computation (a matrix-vector product) is small relative to the amount of data that must be moved from memory.
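As a sanity check, the cache size can be computed directly. The model dimensions below are illustrative values chosen for the example, not any particular model's configuration:

```python
def kv_cache_bytes(n_layers, n_heads, d_k, seq_len, batch, bytes_per_elem=2):
    """KV cache size: 2 (K and V) x layers x heads x head_dim x seq_len x batch."""
    return 2 * n_layers * n_heads * d_k * seq_len * batch * bytes_per_elem

# Illustrative MHA model: 32 layers, 32 heads, head_dim 128,
# 4096-token sequences, batch of 8, BF16 (2 bytes per element)
cache = kv_cache_bytes(n_layers=32, n_heads=32, d_k=128, seq_len=4096, batch=8)
print(f"{cache / 2**30:.0f} GiB")  # 16 GiB
```

Doubling the sequence length or the batch size doubles the cache, which is why long-context serving is dominated by this term.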
This memory bandwidth bottleneck motivated the development of methods to reduce KV cache size without significantly degrading model quality.
Multi-query attention was proposed by Shazeer in 2019 in the paper "Fast Transformer Decoding: One Write-Head is All You Need." The idea is simple: instead of having separate K and V projections for each attention head, use a single shared K projection and a single shared V projection across all heads. Each head still has its own Q projection, so the queries differ across heads, but they all attend over the same set of keys and values.
This reduces the KV cache size by a factor of h (the number of heads). For a model with 32 heads, MQA reduces the KV cache to 1/32 of the original size. The memory bandwidth savings translate directly into faster autoregressive decoding, since less data needs to be read from memory at each step.
However, MQA can degrade model quality. Sharing a single key-value head forces all query heads to work with the same representation, which limits the model's ability to attend to different aspects of the input simultaneously. Empirically, Ainslie et al. found that converting an MHA model to MQA and then uptraining recovered most but not all of the original quality.
GQA provides a middle ground. Instead of sharing keys and values across all heads (MQA) or not sharing at all (MHA), GQA divides the h query heads into g groups of equal size. Each group shares a single key head and a single value head. The number of KV heads is therefore g, and each KV head serves h/g query heads.
The key observation is that GQA forms a spectrum:

- g = h recovers standard MHA (every query head has its own KV head);
- g = 1 recovers MQA (all query heads share a single KV head);
- 1 < g < h gives intermediate configurations that trade quality against speed.
Consider a transformer layer with h = 32 query heads and a GQA configuration using g = 8 KV head groups. The computation proceeds as follows:
Project the input. The input token representation is projected into 32 query vectors, 8 key vectors, and 8 value vectors. The query projection uses a weight matrix of shape (d_model, h * d_k) = (d_model, 32 * d_k). The key and value projections use weight matrices of shape (d_model, g * d_k) = (d_model, 8 * d_k).
Assign groups. Query heads 0-3 are assigned to KV head 0, query heads 4-7 to KV head 1, and so on. Each group of 4 query heads shares one KV head.
Compute attention per head. For each of the 32 query heads, attention scores are computed using the query vector from that head and the key vector from the corresponding KV group. The attention output is computed as the weighted sum of the value vector from the corresponding KV group.
Concatenate and project. The 32 per-head attention outputs are concatenated and passed through an output projection, exactly as in standard MHA.
The only structural difference from MHA is that the K and V projections are smaller (producing g vectors instead of h), and each K/V vector is shared (broadcast) across multiple query heads during the attention computation.
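The projection shapes and group assignment from the walkthrough can be sketched as follows; `d_model`, the batch size, and the sequence length are arbitrary values chosen for illustration:

```python
import torch

d_model, h, g, d_k = 512, 32, 8, 64
x = torch.randn(2, 10, d_model)           # (batch, seq_len, d_model)

w_q = torch.randn(d_model, h * d_k)       # query projection: all 32 heads
w_k = torch.randn(d_model, g * d_k)       # key projection: only 8 KV heads
w_v = torch.randn(d_model, g * d_k)       # value projection: only 8 KV heads

q = (x @ w_q).view(2, 10, h, d_k)         # (2, 10, 32, 64)
k = (x @ w_k).view(2, 10, g, d_k)         # (2, 10, 8, 64)
v = (x @ w_v).view(2, 10, g, d_k)         # (2, 10, 8, 64)

# Contiguous grouping: query head i uses KV head i // (h // g),
# so heads 0-3 map to KV head 0, heads 4-7 to KV head 1, and so on.
kv_head_of = [i // (h // g) for i in range(h)]
```

Only the K and V projection matrices shrink; the query and output projections are identical to MHA.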
The KV cache size for GQA is:

2 * B * S * L * g * d_k * (bytes per element)
Compared to MHA (where g = h), this reduces the cache by a factor of h/g. For the example above with h = 32 and g = 8, the KV cache is reduced by a factor of 4.
| Property | MHA | MQA | GQA |
|---|---|---|---|
| Number of query heads | h | h | h |
| Number of KV heads | h | 1 | g (where 1 < g < h) |
| KV cache size (per layer) | 2 * h * d_k * S | 2 * d_k * S | 2 * g * d_k * S |
| KV cache reduction factor | 1x (baseline) | h x | h/g x |
| Model quality | Best | Noticeable degradation | Near-MHA quality |
| Inference speed | Slowest | Fastest | Near-MQA speed |
| K/V projection parameters | 2 * d_model * h * d_k | 2 * d_model * d_k | 2 * d_model * g * d_k |
| Example (h=64, g=8) | 64 KV heads, 1x cache | 1 KV head, 64x cache reduction | 8 KV heads, 8x cache reduction |
A practical question Ainslie et al. addressed is how to convert an existing MHA model to GQA without training from scratch. Their paper proposes an "uptraining" procedure:
Start with a pre-trained MHA checkpoint. Take a model that was fully trained with standard multi-head attention.
Convert the KV projections. To go from h KV heads to g KV heads, the key and value projection weights from the original heads within each group are averaged (mean-pooled) to produce the initial weights for the shared KV head.
Continue training. Fine-tune ("uptrain") the converted model for a small fraction of the original pre-training compute, typically around 5% of the original training steps, using the same pre-training objective and data.
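The mean-pooling step can be sketched as a simple tensor reshape. The shapes and variable names here are illustrative, not the paper's code:

```python
import torch

h, g, d_model, d_k = 32, 8, 512, 64
w_k_mha = torch.randn(h, d_model, d_k)    # one key projection matrix per head

# Group contiguous heads (g groups of h // g heads each) and average each
# group's projection matrices to initialize the shared KV head's weights.
w_k_gqa = w_k_mha.view(g, h // g, d_model, d_k).mean(dim=1)
```

The value projections are pooled the same way; the query and output projections are copied over unchanged.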
This approach is much cheaper than training a GQA model from scratch, and Ainslie et al. showed that it recovers most of the quality lost during the conversion. Their experiments on T5-XXL (a model with approximately 13 billion parameters) demonstrated that GQA-8 (8 KV head groups) uptrained with 5% of the original compute achieved quality close to the original MHA model while matching the inference speed of MQA.
The paper evaluated uptrained models on summarization (CNN/Daily Mail, arXiv, PubMed, MediaSum, MultiNews), translation (WMT), and question-answering tasks. GQA-8 consistently outperformed MQA across all evaluated tasks while maintaining comparable inference speed, demonstrating that the intermediate configuration (8 KV groups rather than 1) recovers most of the quality lost under MQA with minimal speed penalty.
GQA's adoption has been rapid and widespread. The following table lists major models that use GQA along with their specific configurations:
| Model | Organization | Parameters | Query heads (h) | KV heads (g) | Group size (h/g) | Head dim (d_k) | Release date |
|---|---|---|---|---|---|---|---|
| LLaMA 2 70B | Meta | 70B | 64 | 8 | 8 | 128 | July 2023 |
| LLaMA 2 34B | Meta | 34B | 48 | 8 | 6 | 128 | July 2023 |
| Mistral 7B | Mistral AI | 7.3B | 32 | 8 | 4 | 128 | September 2023 |
| Mixtral 8x7B | Mistral AI | 46.7B (sparse) | 32 | 8 | 4 | 128 | December 2023 |
| Gemma 7B | Google | 8.5B | 16 | 16 (MHA) | 1 | 256 | February 2024 |
| Gemma 2B | Google | 2.5B | 8 | 1 (MQA) | 8 | 256 | February 2024 |
| LLaMA 3 8B | Meta | 8B | 32 | 8 | 4 | 128 | April 2024 |
| LLaMA 3 70B | Meta | 70.6B | 64 | 8 | 8 | 128 | April 2024 |
| LLaMA 3.1 405B | Meta | 405B | 128 | 8 | 16 | 128 | July 2024 |
| Mistral Large | Mistral AI | 123B | 96 | 8 | 12 | 128 | February 2024 |
The most common configuration uses 8 KV heads, regardless of the total number of query heads. This means the group size (number of query heads sharing each KV head) varies across model sizes: LLaMA 3 8B uses groups of 4, LLaMA 3 70B uses groups of 8, and LLaMA 3.1 405B uses groups of 16. The trend toward larger group sizes in bigger models suggests that larger models are more robust to KV sharing, likely because they have more redundancy across heads.
Notably, the Gemma family demonstrates the full spectrum: Gemma 2B uses MQA (g = 1), while Gemma 7B uses full MHA (g = h = 16). This design choice reflects the finding that smaller models are more sensitive to KV sharing and may need the full flexibility of MHA.
The most direct benefit of GQA is reduced KV cache memory consumption. Consider LLaMA 3 70B with 80 layers, 64 query heads, 8 KV heads, head dimension 128, batch size 8, and sequence length 4,096 in BF16 precision:

- With full MHA (64 KV heads): 2 * 80 * 64 * 128 * 4096 * 8 * 2 bytes = 80 GB
- With GQA (8 KV heads): 2 * 80 * 8 * 128 * 4096 * 8 * 2 bytes = 10 GB
The 8x reduction (from 80 GB to 10 GB) is the difference between fitting the model's KV cache on a single GPU and needing multiple GPUs just for the cache. As sequence lengths grow (8K, 32K, 128K tokens), these savings become even more important.
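Plugging the model dimensions into the cache-size formula confirms these figures (the MHA row is a hypothetical variant in which all 64 query heads keep their own KV head):

```python
def kv_cache_gib(n_layers, n_kv_heads, d_k, seq_len, batch, bytes_per_elem=2):
    """KV cache in GiB: 2 (K and V) x layers x KV heads x head_dim x seq x batch."""
    return 2 * n_layers * n_kv_heads * d_k * seq_len * batch * bytes_per_elem / 2**30

# LLaMA 3 70B: 80 layers, head_dim 128, batch 8, 4096-token sequences, BF16
mha_cache = kv_cache_gib(80, 64, 128, 4096, 8)  # hypothetical MHA: 64 KV heads
gqa_cache = kv_cache_gib(80, 8, 128, 4096, 8)   # actual config: 8 KV heads
print(mha_cache, gqa_cache)  # 80.0 10.0
```

Note that only the number of KV heads enters the formula; the 64 query heads add no cache cost.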
During autoregressive generation, the dominant cost at each step is reading the KV cache from GPU memory (a memory-bound operation) rather than the matrix multiplication itself (a compute-bound operation). By reducing the KV cache size, GQA reduces the amount of data that must be read at each generation step, directly increasing the number of tokens the model can generate per second.
The throughput improvement scales with the reduction in KV heads. Ainslie et al. reported that GQA-8 achieved inference speeds within 5-10% of MQA, while MQA itself was substantially faster than MHA. The exact speedup depends on the hardware, batch size, and sequence length, but GQA typically provides a 2-4x throughput improvement over MHA for autoregressive generation with long sequences.
Reduced KV cache memory also allows serving larger batches. If the KV cache is 8x smaller, you can fit 8x more sequences in the same GPU memory, which improves throughput for serving workloads where many users are generating text simultaneously. This is a major practical benefit for LLM serving infrastructure.
Implementing GQA is straightforward. The key changes relative to MHA are:
Reduced K/V projection dimensions. The K and V weight matrices have shape (d_model, g * d_k) instead of (d_model, h * d_k).
Broadcasting during attention. When computing attention for a group of query heads, the single shared K and V vectors for that group are broadcast (repeated) to match the number of query heads. In PyTorch, this is typically implemented using expand() or repeat_interleave() on the key and value tensors.
KV cache management. The KV cache stores only g key and g value vectors per layer per token, rather than h.
The rest of the attention computation (softmax, weighted sum, output projection) remains unchanged. GQA is compatible with FlashAttention and other optimized attention kernels, which typically support arbitrary numbers of KV heads.
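A minimal sketch of the smaller cache, assuming preallocated tensors; the sizes and the `append_kv` helper are illustrative, not from any library:

```python
import torch

batch, max_seq, n_kv_heads, d_k = 2, 128, 8, 64

# The cache holds only n_kv_heads (g) heads per token, not the full h query heads.
k_cache = torch.zeros(batch, max_seq, n_kv_heads, d_k)
v_cache = torch.zeros(batch, max_seq, n_kv_heads, d_k)

def append_kv(pos, k_new, v_new):
    """Write the new token's g key and value vectors at position pos."""
    k_cache[:, pos] = k_new
    v_cache[:, pos] = v_new

append_kv(0, torch.ones(batch, n_kv_heads, d_k), torch.ones(batch, n_kv_heads, d_k))
```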
A minimal PyTorch implementation of the core GQA attention would look like:
```python
import math

import torch
import torch.nn.functional as F

# q: (batch, seq_len, n_heads, head_dim)
# k: (batch, seq_len, n_kv_heads, head_dim)
# v: (batch, seq_len, n_kv_heads, head_dim)

# Repeat KV heads to match query heads
n_rep = n_heads // n_kv_heads
k = k.repeat_interleave(n_rep, dim=2)
v = v.repeat_interleave(n_rep, dim=2)

# Move heads before the sequence dimension so the matmuls batch over heads:
# (batch, n_heads, seq_len, head_dim)
q, k, v = q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)

# Standard scaled dot-product attention
scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)
scores = F.softmax(scores, dim=-1)
output = torch.matmul(scores, v)  # (batch, n_heads, seq_len, head_dim)
```
The repeat_interleave operation is the only additional step compared to standard MHA. In optimized implementations, this broadcasting is fused into the attention kernel to avoid materializing the expanded tensors.
GQA is one of several techniques for reducing KV cache memory. Other approaches include:
KV cache quantization reduces the precision of cached keys and values from FP16/BF16 to INT8 or INT4. This is orthogonal to GQA and can be combined with it for further savings. For example, GQA with 8 KV heads and INT4 quantization would reduce the cache by 8x (from GQA) * 4x (from quantization) = 32x compared to MHA with FP16.
Sliding window attention, used in Mistral 7B alongside GQA, limits the attention span to a fixed window of recent tokens. Keys and values outside the window are evicted from the cache, bounding its maximum size regardless of sequence length.
Multi-head latent attention (MLA), introduced by DeepSeek in DeepSeek-V2 (2024), compresses the KV cache by projecting keys and values into a lower-dimensional latent space. This achieves even greater compression than GQA but requires a different architectural design.
Token eviction and token merging techniques selectively remove or combine tokens from the KV cache based on their importance, allowing the cache to remain a fixed size even as the sequence grows.
From a theoretical standpoint, GQA works because attention heads in transformer models exhibit significant redundancy. Research has shown that many heads in a trained MHA model learn similar attention patterns, and some heads can be pruned entirely with minimal quality loss. GQA exploits this redundancy by forcing groups of heads to share key-value representations, which effectively regularizes the model and reduces redundant computation.
The mean-pooling initialization used in the uptraining procedure leverages this redundancy directly: by averaging the KV weights within a group, the shared head starts with a representation that captures the common patterns across the original heads in that group.
Training overhead for conversion. While uptraining is cheaper than training from scratch, the 5% additional compute is not negligible for very large models. Training a GQA model from scratch avoids this cost but requires committing to the GQA configuration before seeing any results.
Fixed group assignment. In standard GQA, the assignment of query heads to KV groups is fixed (usually contiguous heads are grouped together). Some research has explored learned or dynamic groupings, where the model selects which KV head to use based on the input, but this adds complexity.
Task sensitivity. While GQA performs well on average across tasks, the quality impact varies. Tasks requiring highly diverse attention patterns (where different heads need genuinely different key-value representations) may be more affected by KV sharing.
Not always the optimal point. For very small models, full MHA may be preferable because the memory savings are less important and the quality degradation is more noticeable. For very large models, more aggressive sharing (lower g values) may be acceptable. The optimal number of KV groups depends on the model size, serving constraints, and quality requirements.
Research continues to explore improvements to KV sharing strategies. Sparse Query Attention (SQA), proposed in 2025, extends the idea further by reducing the number of query heads as well, not just key-value heads, for even greater efficiency. Dynamic GQA, where the grouping adapts based on the input or layer, is another active area. The combination of GQA with other efficiency techniques (quantization, sliding windows, speculative decoding) is also an important area of systems-level optimization for LLM serving.