Grouped-Query Attention
Last reviewed
May 7, 2026
Sources
9 citations
Review status
Source-backed
Revision
v5 · 4,757 words
Grouped-Query Attention (GQA) is an attention mechanism for transformer language models that reduces the memory and bandwidth cost of autoregressive inference while preserving output quality close to standard multi-head attention. Introduced by Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai of Google Research in May 2023, GQA has been widely adopted in production-scale large language models including LLaMA 2, LLaMA 3, Mistral 7B, and Qwen.
The core idea is to partition the query heads into a fixed number of groups, where all query heads within the same group share a single pair of key and value projection matrices. This interpolates between the two extremes of the attention design space -- full multi-head attention, which maintains independent K/V heads for every query head, and multi-query attention, which collapses all K/V heads into one. By sitting between these extremes, GQA achieves a large reduction in the size of the KV cache with only a small degradation in model quality.
During the autoregressive decoding phase of transformer inference, a model generates one token at a time. At each step, the model must compute attention over every token in the context window. To avoid recomputing the key and value projections of all previous tokens at every step, practical inference systems maintain a KV cache -- a stored buffer of the K and V tensors for each layer and each prior token.
The memory cost of the KV cache grows as:
cache_size = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_element
For a large model serving long contexts at scale, this can easily reach tens of gigabytes. For example, Llama 2 13B at a 4096-token context with batch size 8 requires roughly 25 GB for the KV cache alone. Beyond raw memory, the decoding step is memory-bandwidth-bound: for each new token, the GPU must read the entire KV cache from DRAM just to compute a single attention query. The ratio of computation to memory access is extremely low, so inference throughput is limited by how fast the hardware can transfer data rather than by floating-point throughput.
Reducing the number of K/V heads directly reduces both the memory footprint and the bandwidth requirement, making the decoding step faster without changing the arithmetic of the attention computation itself.
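The cache-size formula above can be checked with a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` is hypothetical, and the Llama 2 13B shape used here (40 layers, 40 K/V heads, head dimension 128, 2-byte elements) is an assumption based on the published model configuration rather than something stated in this article.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_element=2):
    """Size of the KV cache in bytes; the leading 2 counts K and V."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Assumed Llama 2 13B shape: 40 layers, 40 K/V heads (MHA), head_dim 128, fp16.
size = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128,
                      seq_len=4096, batch_size=8)
print(f"{size / 2**30:.1f} GiB")  # 25.0 GiB
```

Because `num_kv_heads` enters the product linearly, replacing 40 K/V heads with 8 would shrink the result by a factor of 5 under the same assumptions.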
The original transformer architecture described by Vaswani et al. (2017) uses Multi-Head Attention (MHA). In MHA, an input hidden state of dimension $d_{\text{model}}$ is linearly projected into $h$ separate query, key, and value vectors, each of dimension $d_k = d_{\text{model}} / h$. Attention scores are computed independently in each of the $h$ heads, and the results are concatenated and projected back to $d_{\text{model}}$.
MHA is expressive because each head attends to different parts of the context using its own key and value projections. The KV cache size scales linearly with $h$: for a model with $h$ heads, $L$ layers, head dimension $d_k$, sequence length $T$, and batch size $B$, the cache requires $2 \cdot h \cdot d_k \cdot L \cdot T \cdot B$ elements (times bytes per element for a concrete figure).
In 2019, Noam Shazeer proposed Multi-Query Attention (MQA) in the paper "Fast Transformer Decoding: One Write-Head is All You Need" (arXiv:1911.02150). MQA retains $h$ independent query projections but uses only a single shared key head and a single shared value head across all query heads. At decoding time, all query heads attend to the same K and V vectors.
The impact on memory is dramatic. Compared to MHA, MQA reduces the number of K/V elements by a factor of $h$ -- for a model with 32 attention heads, MQA reduces the KV cache size by roughly 97%. Decoding throughput improves accordingly because far less data must be read from DRAM per step, although the end-to-end gain is smaller than a factor of $h$ because the model weights still have to be read at every step.
However, MQA comes at a quality cost. Because all query heads share the same K/V representations, the model has less representational capacity. Shazeer's paper notes only "minor quality degradation" in some settings, but subsequent work found the gap to be more significant for larger models trained on longer contexts. Models trained with MQA from scratch can recover much of the quality through scale and training time, but converting a pre-trained MHA checkpoint to MQA is difficult without substantial additional training.
The paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (arXiv:2305.13245) by Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai of Google Research was submitted to arXiv on May 22, 2023 and accepted at EMNLP 2023 (the final version appeared December 23, 2023).
The paper makes two related contributions. First, it describes a practical uptraining procedure for converting existing MHA checkpoints to either MQA or GQA format without requiring full retraining. Second, it introduces GQA itself as a principled generalization that performs better than converted MQA at a modest increase in KV cache cost.
Converting a pre-trained MHA model requires transforming the K/V projection matrices from $h$ independent heads down to $G$ grouped heads (where $G < h$). The paper finds that mean-pooling the projection matrices within each group gives better results than selecting a single head or random initialization. Specifically, all the original K heads assigned to a given group are averaged element-wise to produce the single K head for that group, and the same is done for V heads.
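The mean-pooling step can be sketched with NumPy. This is an illustration under stated assumptions, not the paper's code: the function name `mean_pool_kv_heads` is hypothetical, the per-head projection matrices are assumed to be stacked in an array of shape `(num_heads, d_model, head_dim)`, and consecutive heads are assumed to belong to the same group.

```python
import numpy as np


def mean_pool_kv_heads(w, num_groups):
    """Average per-head projection matrices within each group.

    w: array of shape (num_heads, d_model, head_dim), one matrix per head.
    Returns an array of shape (num_groups, d_model, head_dim).
    """
    num_heads, d_model, head_dim = w.shape
    if num_heads % num_groups != 0:
        raise ValueError("num_heads must be divisible by num_groups")
    heads_per_group = num_heads // num_groups
    # Assumption: consecutive heads form a group.
    grouped = w.reshape(num_groups, heads_per_group, d_model, head_dim)
    return grouped.mean(axis=1)


rng = np.random.default_rng(0)
w_k = rng.standard_normal((8, 16, 4))   # toy sizes: h=8, d_model=16, d_k=4
w_k_gqa = mean_pool_kv_heads(w_k, num_groups=2)
print(w_k_gqa.shape)  # (2, 16, 4)
```

The same function would be applied to the V projections. With `num_groups=1` it produces the MQA initialization, and with `num_groups` equal to the number of heads it returns the input unchanged.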
After this structural conversion, the model is further pre-trained ("uptrained") on a small fraction of the original training data. The paper uses $\alpha = 0.05$ (5% of the original pre-training compute), which required approximately 600 TPUv3 chip-days for T5-XXL. At this fraction, quality had already plateaued for GQA, while MQA continued to benefit from additional uptraining but never fully closed the gap to GQA.
GQA divides the $h$ query heads into $G$ equally sized groups. All query heads within a group share a single key projection and a single value projection. The notation GQA-$G$ denotes grouped-query attention with $G$ groups.
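The two limiting cases recover existing methods: GQA-1, with a single group and therefore a single K/V head, is equivalent to MQA, while GQA-$h$, with one group per query head, is equivalent to MHA.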
For an intermediate value of $G$, GQA has $G$ K/V heads instead of 1 (as in MQA) or $h$ (as in MHA). The KV cache is reduced by a factor of $h / G$ relative to MHA.
At inference time, for each group $g$ ($g = 1, \ldots, G$), the query heads in that group attend to the shared K/V pair for that group. If there are $h/G$ query heads per group, the attention scores for those heads are computed as:
Attention(Q_i, K_g, V_g) for each query head i in group g
where $K_g$ and $V_g$ are the shared key and value matrices for group $g$. The outputs of all $h$ heads are then concatenated and linearly projected as in standard MHA.
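The grouped computation can be written compactly by letting the K/V heads broadcast over the query heads of their group. The NumPy sketch below is a simplified illustration: it is non-causal, handles a single sequence, and omits the input and output projections. The function names are hypothetical, and consecutive query heads are assumed to share a group.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gqa_attention(q, k, v):
    """Non-causal grouped-query attention for one sequence.

    q: (num_heads, seq_len, head_dim)
    k, v: (num_kv_heads, seq_len, head_dim)
    """
    num_heads, seq_len, head_dim = q.shape
    num_kv_heads = k.shape[0]
    heads_per_group = num_heads // num_kv_heads
    # View the query heads as (group, head-in-group, position, dim).
    q = q.reshape(num_kv_heads, heads_per_group, seq_len, head_dim)
    # K/V gain a singleton axis that broadcasts over the heads in each group.
    scores = q @ k[:, None].transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    out = softmax(scores) @ v[:, None]
    return out.reshape(num_heads, seq_len, head_dim)


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 4))   # 8 query heads
k = rng.standard_normal((2, 5, 4))   # 2 K/V heads, so 4 query heads per group
v = rng.standard_normal((2, 5, 4))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 5, 4)
```

No expanded copy of K or V is created here; broadcasting reuses each K/V head for every query head in its group, which mirrors what fused GQA kernels do.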
The computation cost is nearly identical to MHA during both training (where the full sequence is processed in parallel) and prefill (the first forward pass over the prompt). The benefit of GQA appears primarily at the decode step, where the number of K/V entries that must be loaded from the KV cache is reduced by a factor of $h / G$.
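For a model with $h$ query heads, $G$ K/V groups, $d_k$ head dimension, $L$ layers, sequence length $T$, batch size $B$, and 2 bytes per element (float16), the KV cache sizes are as follows. The leading 2 counts K and V, and the trailing 2 is the number of bytes per element.

| Variant | K/V heads | KV cache size (bytes) |
|---|---|---|
| MHA | $h$ | $2 \cdot h \cdot d_k \cdot L \cdot T \cdot B \cdot 2$ |
| GQA-$G$ | $G$ | $2 \cdot G \cdot d_k \cdot L \cdot T \cdot B \cdot 2$ |
| MQA | 1 | $2 \cdot 1 \cdot d_k \cdot L \cdot T \cdot B \cdot 2$ |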
The reduction factor of GQA over MHA is $h / G$. For Llama 3.1 8B with $h = 32$ and $G = 8$, the KV cache is 4x smaller than it would be with MHA.
The GQA paper evaluates models on a suite of sequence-to-sequence tasks using T5-based architectures. A key result is that GQA-8-XXL (a T5-XXL model with 8 K/V groups) achieves quality close to MHA-XXL while running at near-MQA speed. Selected results from Table 1 of the paper include:
| Model | Average score | Inference time (s/sample) |
|---|---|---|
| MHA-Large | 46.0 | 0.37 |
| MHA-XXL | 47.2 | 1.51 |
| MQA-XXL | 46.6 | 0.24 |
| GQA-8-XXL | 47.1 | 0.28 |
The result illustrates the key tradeoff: MHA-XXL has the best quality but is about 6x slower than MQA-XXL (1.51 s vs. 0.24 s per sample). GQA-8-XXL comes within 0.1 points of MHA-XXL on the average score while being only slightly slower than MQA-XXL (0.28 s vs. 0.24 s). The average covers the paper's summarization datasets (scored with ROUGE-1) together with English-German translation (WMT 2014) and question answering (TriviaQA), and the same pattern holds on the individual tasks: GQA stays close to MHA quality while running significantly faster.
| Property | MHA | GQA-$G$ | MQA |
|---|---|---|---|
| K/V heads | $h$ | $G$ (where $1 < G < h$) | 1 |
| KV cache size | baseline | $G/h$ of MHA | $1/h$ of MHA |
| Decode bandwidth | highest | $G/h$ of MHA | lowest |
| Model quality | best | close to MHA | degraded |
| Representational capacity | highest | intermediate | lowest |
| Introduced | Vaswani et al. (2017) | Ainslie et al. (2023) | Shazeer (2019) |
For models with 32-64 query heads and 8 K/V heads (the common production configuration), GQA reduces the KV cache to 25% or 12.5% of the MHA baseline. This is a substantial improvement at a relatively modest cost in model quality.
Figure 6 of the GQA paper shows the effect on inference time of varying $G$ from 1 (MQA) to $h$ (MHA): going from 1 to 8 groups adds only modest overhead, while larger group counts become progressively more expensive. Together with the quality results, this points to diminishing returns: most of the quality gap between MQA and MHA is recovered with a small number of groups (around 4-8), while further increases in $G$ add little quality but proportionally more memory and bandwidth cost.
In practice, the community has converged on $G = 8$ as a strong default for models in the 7B-70B parameter range. This gives 4x-8x reduction in KV cache size relative to MHA depending on the total number of query heads, while keeping quality nearly indistinguishable from MHA in most evaluations.
Choosing $G$ is a design decision that must balance several considerations, including model quality, KV cache memory and bandwidth, and compatibility with tensor parallelism.
During training, GQA has the same computational cost as MHA (when processing the full sequence in parallel). The K/V projection matrices are smaller by a factor of $h/G$, but this is a minor fraction of total model parameters for typical head dimensions. The uptraining procedure imposes additional compute costs when converting from MHA checkpoints, but training a model with GQA from scratch incurs no overhead compared to MHA training.
At inference time, the benefit is most pronounced during the decode phase and grows with sequence length and batch size. For short sequences or batch size 1, the speedup is less dramatic. For long-context generation (sequences of 16k-128k tokens) at batch sizes of 16 or more, GQA delivers throughput improvements of 2x-5x over MHA for the same hardware.
A useful way to think about the inference arithmetic: during the decode phase, for each new token, the GPU must load from the KV cache all stored K/V tensors for the current context. With MHA (h=32, head_dim=128), a single layer at sequence length 8192 requires loading 32 * 128 * 8192 * 2 (K and V) * 2 (bytes, bfloat16) = 134 MB per sequence per layer per decode step. With GQA (8 K/V heads instead of 32), the same operation requires about 33.5 MB -- a 4x reduction in bandwidth demand per layer. The attention portion of the decode step speeds up roughly in proportion; the end-to-end gain is smaller because the model weights must still be read at every step.
On modern NVIDIA A100 and H100 GPUs, the memory bandwidth is on the order of 2-3.35 TB/s. A large model with 32 layers might load several gigabytes from the KV cache per decode step at long sequences, easily saturating the available bandwidth and making the token generation rate memory-bound. GQA directly addresses this by reducing what must be loaded.
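The per-layer figures above translate into a rough lower bound on attention time per decode step. The sketch below is a back-of-the-envelope estimate under explicit assumptions: batch size 1, a 32-layer model, 2 TB/s of memory bandwidth (the low end of the range quoted above), and no accounting for weight reads or compute. The function name is hypothetical.

```python
def kv_bytes_per_layer(num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    """Bytes of K and V read for one layer in one decode step (batch size 1)."""
    return 2 * num_kv_heads * head_dim * seq_len * bytes_per_element


NUM_LAYERS = 32      # assumed layer count
BANDWIDTH = 2.0e12   # assumed 2 TB/s of memory bandwidth

for name, kv_heads in [("MHA", 32), ("GQA-8", 8)]:
    per_layer = kv_bytes_per_layer(kv_heads, head_dim=128, seq_len=8192)
    total = per_layer * NUM_LAYERS
    # Time to stream the cache once: a lower bound on attention time per step.
    print(f"{name}: {per_layer / 1e6:.1f} MB/layer, "
          f"{total / 1e9:.2f} GB/step, >= {total / BANDWIDTH * 1e3:.2f} ms")
```

Under these assumptions the cache traffic falls from roughly 4.3 GB to roughly 1.1 GB per step, and the bandwidth-imposed floor on attention time falls by the same factor of 4.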
The fundamental insight of GQA is that the mapping from K/V heads to quality is highly non-linear. Going from $h$ K/V heads (MHA) to $h/2$ K/V heads halves the KV cache with very little quality loss. Going all the way to 1 K/V head (MQA) saves the most memory but at a more noticeable quality cost. The intermediate regime of 4-16 groups is where GQA provides the best quality-per-memory-byte tradeoff.
Researchers have also explored non-uniform grouping strategies, where different layers of the transformer use different values of $G$. The intuition is that early layers (which tend to do positional and syntactic encoding) may benefit from more diverse attention heads than later layers (which do higher-level semantic reasoning). Work published in 2024 showed that activation-informed grouping -- where the group assignments are determined by clustering K/V head activations rather than by simple positional assignment -- can yield accuracy gains of up to 7.5% on challenging reasoning tasks for the same total KV cache budget. This line of work suggests that $G = 8$ is a well-performing heuristic rather than a theoretical optimum, and that the quality-efficiency frontier of GQA can be pushed further with careful design.
LLaMA 2, released by Meta in July 2023, was one of the first widely distributed open-weight models to use GQA. Importantly, only the 70B variant of Llama 2 uses GQA -- the 7B and 13B variants use standard MHA. The Llama 2 70B uses 64 query heads and 8 K/V heads (GQA-8), reducing the KV cache by a factor of 8 compared to a hypothetical MHA 70B model. Meta cited inference efficiency at 70B scale as the reason for the asymmetric choice.
The adoption of GQA in Llama 2 coincided closely with the publication of the GQA paper (both appeared in mid-2023), and the Llama 2 technical report explicitly credited the GQA paper as the source of the technique.
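LLaMA 3, released by Meta starting in April 2024, extends GQA to all model sizes, not just the largest. The Llama 3.1 family (released July 2024) uses GQA across all variants:

| Model | Query heads | K/V heads | KV cache reduction vs. MHA |
|---|---|---|---|
| Llama 3.1 8B | 32 | 8 | 4x |
| Llama 3.1 70B | 64 | 8 | 8x |
| Llama 3.1 405B | 128 | 8 | 16x |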
The decision to apply GQA to the 8B model reflects the broader industry consensus that GQA's quality-efficiency tradeoff is favorable even at smaller scales, particularly given the widespread deployment of 7B-8B models on consumer hardware where memory is tightly constrained.
Mistral 7B, released by Mistral AI in September 2023 and described in a paper the following month (arXiv:2310.06825, Jiang et al.), uses GQA alongside sliding window attention (SWA). The model has 32 query heads and 8 K/V heads, the same ratio as Llama 3.1 8B. Mistral argued that GQA was essential for making the 7B model practical to deploy: the 4x reduction in KV cache size allows the model to handle longer contexts and larger batch sizes on GPU hardware with limited VRAM.
Mistral subsequently applied GQA to its entire model family, including Mixtral (the mixture-of-experts variant) and later Mistral Medium and Mistral Large. The pairing of GQA with sliding window attention in the original Mistral 7B was noteworthy because both techniques address the same bottleneck (KV cache memory and bandwidth) from different angles.
Qwen, Alibaba's open-weight model family, adopted GQA starting with the Qwen2 series (released mid-2024). The Qwen2 technical report explicitly notes that "Qwen2 adopts Grouped Query Attention (GQA) instead of conventional multi-head attention (MHA)" to optimize KV cache usage and enhance inference throughput. The Qwen2.5-14B model uses 40 query heads and 8 K/V heads. Qwen3, released in 2025, continues this pattern across all sizes.
GQA has become the de facto standard for efficient attention in open-weight decoder-only transformers. Additional notable adopters include:
Multi-head Latent Attention (MLA), introduced in the DeepSeek-V2 technical report (May 2024) and used in DeepSeek-V3 (December 2024), takes a fundamentally different approach to reducing the KV cache bottleneck.
Instead of reducing the number of K/V heads (as GQA does), MLA compresses the K/V representations into a low-rank latent vector using a learned down-projection. At decode time, the full K/V heads are reconstructed on demand from this latent via a learned up-projection. What is stored in the KV cache is the compressed latent, not the full K/V tensors.
Specifically, for each token position, MLA stores a single compressed vector of dimension $c_{KV}$ (much smaller than the full $h \cdot d_k$ of MHA K or V). The K and V heads are then computed from this latent as needed. This means the KV cache contains a dense, low-rank representation of all heads rather than the smaller set of full-dimension K/V heads that GQA stores.
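The compress-then-reconstruct idea can be illustrated with a toy NumPy example. This is a deliberately simplified sketch, not DeepSeek's implementation: all sizes and variable names are assumptions chosen for illustration, the weights are random, and details of the real design (such as how positional information is handled and the absorbed formulation) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads, head_dim, c_kv = 64, 8, 8, 16   # toy sizes (assumed)

w_down = rng.standard_normal((d_model, c_kv))               # shared down-projection
w_up_k = rng.standard_normal((c_kv, num_heads * head_dim))  # up-projection for K
w_up_v = rng.standard_normal((c_kv, num_heads * head_dim))  # up-projection for V

hidden = rng.standard_normal((10, d_model))   # 10 cached token positions

# Only the latent is stored in the cache: c_kv values per token.
latent_cache = hidden @ w_down                # shape (10, c_kv)

# At decode time, per-head K and V are rebuilt from the latent.
k = (latent_cache @ w_up_k).reshape(10, num_heads, head_dim)
v = (latent_cache @ w_up_v).reshape(10, num_heads, head_dim)

mha_elems_per_token = 2 * num_heads * head_dim   # K and V for every head
print(mha_elems_per_token / c_kv)                # 8.0
```

In this toy setting the cache holds 16 values per token instead of 128, while every head still receives its own K and V after the up-projection. GQA with the same cache budget would instead keep one full-dimension K/V head shared by all eight query heads.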
| Property | GQA | MLA (DeepSeek) |
|---|---|---|
| Cache compression method | Reduce number of K/V heads | Low-rank projection of K/V |
| Theoretical compression ratio vs. MHA | $h/G$ (e.g., 8x for 64 heads, 8 groups) | ~60x for DeepSeek-V3 |
| Compression vs. GQA (8 K/V heads) | baseline | roughly 4x-7.5x further reduction (60x vs. 8-16x) |
| Reconstruction overhead | None (K/V used directly) | Up-projection at each decode step |
| Compatibility with FlashAttention | Full support (standard formulation) | Requires absorption trick or custom kernels |
| Quality vs. MHA | Slightly below MHA | At or slightly above MHA |
| Introduced | Ainslie et al. (2023) | DeepSeek-V2 technical report (2024) |
| Widely adopted | Yes (Llama, Mistral, Qwen, etc.) | DeepSeek models; not yet widely adopted |
MLA achieves dramatically higher compression ratios -- roughly 60x smaller KV cache than MHA for DeepSeek-V3, compared to roughly 8-16x for typical GQA configurations. This allows DeepSeek models to serve longer contexts and larger batches at the same memory budget. MLA also appears to preserve quality better: evaluations suggest MLA quality matches or slightly exceeds MHA, while GQA quality is slightly below MHA.
However, MLA introduces complexity. The reconstruction of K/V heads from latents at each decode step requires matrix multiplications that GQA does not. A key optimization used in DeepSeek -- the "absorbed" formulation, where the up-projection matrices are merged into the query and output projections -- makes MLA compatible with efficient attention kernels, but this requires careful implementation. GQA is simpler to implement correctly: the K/V heads can be expanded (repeated) to match the number of query heads, at which point a standard scaled dot-product attention kernel handles the rest.
As of 2025, GQA remains the dominant approach in the open-source ecosystem due to its simplicity, Flash Attention compatibility, and strong quality-efficiency tradeoff. MLA is used in the DeepSeek model family and has attracted research interest, but has not yet seen wide adoption outside DeepSeek.
PyTorch added native GQA support to torch.nn.functional.scaled_dot_product_attention via an enable_gqa=True flag. With this flag enabled, the function accepts K and V tensors with a different (smaller) number of heads than Q, and handles the internal broadcasting automatically.
A common pattern for implementing GQA in PyTorch is to expand the K/V heads to match the number of Q heads before passing to the attention kernel:
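```python
import torch
import torch.nn.functional as F

batch, seq, num_heads, num_kv_heads, head_dim = 2, 16, 32, 8, 64  # example sizes
q = torch.randn(batch, num_heads, seq, head_dim)     # SDPA expects heads on dim 1
k = torch.randn(batch, num_kv_heads, seq, head_dim)
v = torch.randn(batch, num_kv_heads, seq, head_dim)
repeat_factor = num_heads // num_kv_heads
# Repeat each K/V head so that consecutive query heads share it
k_expanded = k.repeat_interleave(repeat_factor, dim=1)
v_expanded = v.repeat_interleave(repeat_factor, dim=1)
out = F.scaled_dot_product_attention(q, k_expanded, v_expanded, is_causal=True)
```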
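Note that repeat_interleave materializes the expanded K and V tensors, so this "expand and reuse" approach costs extra memory and bandwidth for the attention call itself, even though the KV cache still stores only num_kv_heads heads. Kernels with native GQA support (such as FlashAttention-2, or scaled_dot_product_attention with enable_gqa=True) avoid this overhead by reading each K/V head once for all query heads in its group.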
FlashAttention (Dao et al.) supports GQA natively starting with FlashAttention-2 (released mid-2023). The FlashAttention-2 kernels track the query head count and the K/V head count separately, so K and V can be passed with fewer heads than Q as long as the number of query heads is divisible by the number of K/V heads. Internally, the kernel tiles the computation so that each K/V head is reused by the corresponding query heads in its group, without materializing the expanded K/V tensors in memory. This is important for efficiency: expanding K/V before calling attention would multiply the K/V memory traffic by the group ratio.
FlashAttention-3 (2024) and FlashInfer (an alternative fused attention library) both carry this GQA-aware kernel support forward with additional optimizations for the Hopper (H100) GPU architecture.
Hugging Face Transformers implements GQA in its model configurations via the num_key_value_heads parameter. Models that set num_key_value_heads < num_attention_heads use GQA; setting num_key_value_heads = 1 uses MQA; and num_key_value_heads = num_attention_heads uses standard MHA. This unified parameter makes it easy to experiment with different group counts without changing model code.
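The mapping from head counts to attention variant can be expressed as a small helper. This is an illustrative sketch: `attention_variant` is a hypothetical function, and the dictionary stands in for a model configuration that exposes the two fields named above; loading a real configuration is not shown.

```python
def attention_variant(num_attention_heads, num_key_value_heads):
    """Classify an attention configuration by its head counts."""
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("query heads must divide evenly into K/V heads")
    if num_key_value_heads == num_attention_heads:
        return "MHA"
    if num_key_value_heads == 1:
        return "MQA"
    return f"GQA-{num_key_value_heads}"


# Stand-in for a model config with 32 query heads and 8 K/V heads.
config = {"num_attention_heads": 32, "num_key_value_heads": 8}
print(attention_variant(**config))  # GQA-8
```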
NVIDIA TensorRT-LLM supports GQA through its num_gqa_groups parameter in transformer layer configurations, using FlashAttention kernels for longer sequences. The vLLM serving framework similarly supports GQA for all major model families, with paged KV cache management that is aware of the reduced number of K/V heads.
GQA is best understood as one point in a two-dimensional design space defined by (1) the number of query heads and (2) the number of K/V heads. The diagonal where both quantities are equal corresponds to MHA. The line where the number of K/V heads is 1 corresponds to MQA. GQA fills the space between these extremes.
Other attention variants occupy different parts of the design space or introduce additional dimensions:
Sliding Window Attention (SWA), used in Mistral 7B, limits each query to attending only to a fixed-size local window of past tokens rather than the entire context. SWA reduces the KV cache size by capping the number of stored positions rather than by reducing the number of K/V heads. GQA and SWA are orthogonal and can be combined, as Mistral 7B demonstrates.
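Because the two techniques act on different factors of the cache size, their savings multiply. The sketch below makes this concrete; the function name is hypothetical, and the 4096-token window and 16k-token context are assumptions chosen for illustration.

```python
def cached_kv_elements(num_kv_heads, head_dim, seq_len, window=None):
    """K/V elements cached per layer for one sequence."""
    # A sliding window caps the number of stored positions.
    positions = seq_len if window is None else min(seq_len, window)
    return 2 * num_kv_heads * head_dim * positions


seq_len = 16384
mha_full = cached_kv_elements(32, 128, seq_len)               # 32 K/V heads, full context
gqa_full = cached_kv_elements(8, 128, seq_len)                # GQA only
gqa_swa = cached_kv_elements(8, 128, seq_len, window=4096)    # GQA plus SWA

print(mha_full // gqa_full)  # 4
print(mha_full // gqa_swa)   # 16
```

GQA contributes a fixed factor of $h / G$ regardless of context length, while the saving from the window grows as the context exceeds the window size.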
Sparse Attention mechanisms, such as those used in Longformer and BigBird, selectively compute attention over a sparse subset of token pairs. This reduces the attention computation (which is quadratic in sequence length for dense attention) but does not directly address the K/V head count dimension that GQA targets. Sparse attention is more relevant for very long contexts during training; GQA primarily targets inference.
Tensor Parallelism Compatibility: One practical consideration in choosing the number of K/V heads is compatibility with tensor parallelism during distributed inference. When a model is sharded across $P$ GPUs (a common technique for serving large models), it is typically required that the number of K/V heads be divisible by $P$. With $G = 8$ K/V heads and tensor parallelism across 8 GPUs, each GPU gets exactly 1 K/V head -- a valid configuration. With 4 GPUs, each GPU gets 2 K/V heads. The choice of $G = 8$ is partly motivated by its divisibility by common parallelism degrees (1, 2, 4, 8).
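The divisibility requirement is easy to check programmatically. The helper below is a hypothetical sketch that assumes a plain even split of K/V heads across GPUs, with no replication of heads.

```python
def kv_heads_per_gpu(num_kv_heads, tp_degree):
    """K/V heads held by each GPU under an even tensor-parallel split."""
    if num_kv_heads % tp_degree != 0:
        raise ValueError(
            f"{num_kv_heads} K/V heads cannot be split evenly "
            f"across {tp_degree} GPUs")
    return num_kv_heads // tp_degree


for p in (1, 2, 4, 8):
    print(p, kv_heads_per_gpu(8, p))
```

With 8 K/V heads every degree in the loop is valid, whereas a call such as `kv_heads_per_gpu(4, 8)` raises an error under this even-split assumption.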
Head Dimension Scaling: The GQA paper fixes the head dimension $d_k$ and varies only the number of K/V heads. An alternative approach is to keep the total K/V capacity (measured in parameters) constant but redistribute it by using fewer heads with larger head dimensions, or more heads with smaller dimensions. This is not strictly GQA but touches on similar tradeoffs. In practice, most GQA deployments keep the head dimension fixed at 128 and vary only $G$.
Quality gap at small group counts. While GQA with 8 K/V heads closely matches MHA for large models, the quality gap becomes more noticeable with aggressive compression (fewer groups). Models converted from MHA via uptraining, rather than trained with GQA from scratch, may show slightly larger gaps because the uptraining budget is finite.
Not effective for all tasks. The quality advantage of MHA over GQA tends to be most visible on tasks that benefit from fine-grained, head-diverse attention patterns -- such as tasks requiring precise entity tracking over long contexts or tasks requiring fine-grained cross-lingual alignment in translation. For most standard NLP benchmarks, the difference is small, which is why GQA's adoption has been so rapid: the benchmarks used to evaluate models do not strongly distinguish MHA from GQA.
Prefill is not accelerated. The KV cache reduction benefits apply only to the decode phase. During prefill (processing a long prompt), GQA saves memory but does not reduce the number of attention FLOPs, since the full sequence is processed in parallel. This means GQA does not improve latency for the first-token time-to-response metric -- only for the subsequent token generation rate.
Does not address KV cache eviction. GQA reduces the rate at which the KV cache fills up, but does not address the problem of very long contexts eventually filling the cache entirely. Complementary techniques such as Sparse Attention, sliding window attention (as in Mistral 7B), and KV cache eviction policies (which drop cache entries for less-attended tokens) are needed to handle arbitrarily long contexts.
Interaction with speculative decoding. Speculative decoding methods that use a small draft model and a large target model must maintain a KV cache for each model and keep the two consistent as draft tokens are accepted or rejected. When the draft and target models have different GQA configurations (different numbers of K/V heads), the caches have different shapes and cache management becomes more complex.
Tensor parallelism constraints. As noted above, the number of K/V heads is normally required to be divisible by the tensor parallelism degree. Under a plain even split this creates a minimum: a model deployed on 8 GPUs with tensor parallelism needs at least 8 K/V heads. Very aggressive compression (e.g., 4 K/V heads) therefore does not map cleanly onto 8-way tensor parallelism. Implementations can work around this by replicating K/V heads across GPUs, but replication gives back part of the memory saving, which limits the benefit of the most aggressive GQA configurations in large-scale deployments.
Not a solution for KV cache quantization. GQA reduces the number of cache entries but does not address the precision of each entry. KV cache quantization (storing K/V in INT8 or INT4 rather than FP16) is a separate and complementary technique. Combining GQA with KV cache quantization can yield further memory savings, but the two techniques interact: quantization errors in the KV cache may be more visible when K/V heads are shared across multiple query heads (as in GQA) than when each query head has its own K/V (as in MHA).