Grouped-Query Attention

Updated Jul 11, 2026

Grouped-query attention (GQA) is an attention mechanism for transformer language models that partitions the query heads into a small number of groups, where every query head in a group shares one key projection and one value projection. It was introduced by Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai of Google Research in a paper submitted to arXiv on May 22, 2023, and it sits between multi-head attention (a separate key/value head for every query head) and multi-query attention (a single shared key/value head). ^[1] GQA cuts the size of the key/value (KV) cache by a factor of $h/G$ (the number of query heads divided by the number of groups), which speeds up autoregressive inference while keeping output quality close to standard multi-head attention. The paper's stated result is that "uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA," and GQA is now the default attention design in production-scale models including Llama 2, Llama 3, Mistral 7B, and Qwen. ^[1]

The core idea interpolates between the two extremes of the attention design space: full multi-head attention, which maintains independent K/V heads for every query head, and multi-query attention, which collapses all K/V heads into one. By sitting between these extremes, GQA achieves a large reduction in the size of the KV cache with only a small degradation in model quality. ^[1]

What problem does GQA solve?

The KV Cache Bottleneck

During the autoregressive decoding phase of transformer inference, a model generates one token at a time. At each step, the model must compute attention over every token in the context window. To avoid recomputing the key and value projections of all previous tokens at every step, practical inference systems maintain a KV cache, a stored buffer of the K and V tensors for each layer and each prior token. ^[2]

The memory cost of the KV cache grows as:

$$\text{cache\_size} = 2 \cdot \text{num\_layers} \cdot \text{num\_kv\_heads} \cdot \text{head\_dim} \cdot \text{seq\_len} \cdot \text{batch\_size} \cdot \text{bytes\_per\_element}$$

For a large model serving long contexts at scale, this can easily reach tens of gigabytes. For example, Llama 2 13B (40 layers, 40 K/V heads, head dimension 128) at a 4096-token context with batch size 8 in 16-bit precision requires roughly 25 GiB (about 27 GB) for the KV cache alone. Beyond raw memory, the decoding step is memory-bandwidth-bound: for each new token, the GPU must read the entire KV cache from DRAM just to compute attention for a single query position. The ratio of computation to memory access is extremely low, so inference throughput is limited by how fast the hardware can transfer data rather than by floating-point throughput. Noam Shazeer framed this directly in the multi-query attention paper, noting that incremental inference is "slow due to the memory-bandwidth cost of repeatedly loading the large keys and values tensors." ^[2]
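The cache-size formula can be checked with a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` is illustrative, and the Llama 2 13B shape (40 layers, 40 K/V heads, head dimension 128, 2 bytes per element) is an assumption taken from the published model configuration rather than from this article's sources.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_element=2):
    """Bytes needed to cache K and V for every layer, token, and sequence."""
    # The leading 2 accounts for storing both the key and the value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Assumed Llama 2 13B shape: 40 layers, 40 K/V heads (MHA), head_dim 128.
size = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128,
                      seq_len=4096, batch_size=8)
print(f"{size / 2**30:.1f} GiB")  # 25.0 GiB
```

Because the result is linear in `num_kv_heads`, replacing 40 K/V heads with 8 in the same call would shrink the figure by a factor of 5.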

Reducing the number of K/V heads directly reduces both the memory footprint and the bandwidth requirement, making the decoding step faster without changing the arithmetic of the attention computation itself. ^[2]

Multi-Head Attention

The original transformer architecture described by Vaswani et al. (2017) uses Multi-Head Attention (MHA). ^[3] In MHA, an input hidden state of dimension $d_{\text{model}}$ is linearly projected into $h$ separate query, key, and value vectors, each of dimension $d_k = d_{\text{model}} / h$. Attention scores are computed independently in each of the $h$ heads, and the results are concatenated and projected back to $d_{\text{model}}$. ^[3]

MHA is expressive because each head attends to different parts of the context using its own key and value projections. The KV cache size scales linearly with $h$: for a model with $h$ heads, $L$ layers, head dimension $d_k$, sequence length $T$, and batch size $B$, the cache requires $2 \cdot h \cdot d_k \cdot L \cdot T \cdot B$ elements (times bytes per element for a concrete figure).

Multi-Query Attention

In 2019, Noam Shazeer proposed Multi-Query Attention (MQA) in the paper "Fast Transformer Decoding: One Write-Head is All You Need" (arXiv:1911.02150). MQA retains $h$ independent query projections but uses only a single shared key head and a single shared value head across all query heads. At decoding time, all query heads attend to the same K and V vectors. ^[2]

The impact on memory is dramatic. Compared to MHA, MQA reduces the number of K/V elements by a factor of $h$: for a model with 32 attention heads, MQA reduces the KV cache size by roughly 97%. Decoding throughput improves accordingly because far less data must be read from DRAM per step, although the gain is not strictly proportional, since model weights must still be loaded at every step. ^[2]

However, MQA comes at a quality cost. Because all query heads share the same K/V representations, the model has less representational capacity. The 2019 paper reports that the resulting models "can be much faster to decode, with only minor quality degradation from the baseline," but subsequent work found the gap to be more significant for larger models trained on longer contexts. ^[2]^[1] Models trained with MQA from scratch can recover much of the quality through scale and training time, but converting a pre-trained MHA checkpoint to MQA is difficult without substantial additional training. ^[1]

What does the GQA paper (May 2023) propose?

The paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints" (arXiv:2305.13245) by Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai of Google Research was submitted to arXiv on May 22, 2023 and accepted at EMNLP 2023 (the final version appeared December 23, 2023). ^[1]

The paper makes two related contributions, stated in its abstract as: "(1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads." ^[1] The first contribution is a practical uptraining procedure for converting existing MHA checkpoints to either MQA or GQA format without requiring full retraining. The second introduces GQA itself as a principled generalization that performs better than converted MQA at a modest increase in KV cache cost. ^[1]

Uptraining Procedure

Converting a pre-trained MHA model requires transforming the K/V projection matrices from $h$ independent heads down to $G$ grouped heads (where $G < h$). The paper finds that mean-pooling the projection matrices within each group gives better results than selecting a single head or random initialization. In the authors' words, "the projection matrices for key and value heads are mean pooled into single projection matrices": all the original K heads assigned to a given group are averaged element-wise to produce the single K head for that group, and the same is done for V heads. ^[1]
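The mean-pooling step can be sketched in NumPy. This is an illustration under stated assumptions, not the paper's code: the function name `mean_pool_kv_heads` is hypothetical, the projection weight is assumed to have shape `(d_model, num_heads * head_dim)` with heads laid out contiguously, and adjacent heads are assumed to form a group.

```python
import numpy as np


def mean_pool_kv_heads(weight, num_heads, num_groups):
    """Average per-head K (or V) projection matrices within each group.

    weight: array of shape (d_model, num_heads * head_dim).
    Returns an array of shape (d_model, num_groups * head_dim).
    """
    d_model, total = weight.shape
    head_dim = total // num_heads
    heads_per_group = num_heads // num_groups
    # Split the columns into (group, head within group, head_dim),
    # then average over the heads that belong to the same group.
    grouped = weight.reshape(d_model, num_groups, heads_per_group, head_dim)
    return grouped.mean(axis=2).reshape(d_model, num_groups * head_dim)


rng = np.random.default_rng(0)
w_k = rng.standard_normal((64, 8 * 16))  # 8 heads of dimension 16
w_k_gqa = mean_pool_kv_heads(w_k, num_heads=8, num_groups=2)
print(w_k_gqa.shape)  # (64, 32)
```

The same function would be applied to the value projection. With `num_groups=1` it produces the single pooled head used for MQA conversion.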

After this structural conversion, the model is further pre-trained ("uptrained") on a small fraction of the original training data. The paper uses $\alpha = 0.05$ (5% of the original pre-training compute), and reports that for this fraction, "training took approximately 600 TPUv3 chip-days" for T5-XXL. At this fraction, quality had already plateaued for GQA, while MQA continued to benefit from additional uptraining but never fully closed the gap to GQA. ^[1]

How does GQA work?

Query Head Grouping

GQA divides the $h$ query heads into $G$ equally sized groups. All query heads within a group share a single key projection and a single value projection. The notation GQA-$G$ denotes grouped-query attention with $G$ groups. ^[1]

The two limiting cases recover existing methods:

GQA-1: All query heads share one K/V pair, equivalent to MQA.
GQA-$h$: Each query head has its own K/V pair, equivalent to standard MHA.

For an intermediate value of $G$, GQA has $G$ K/V heads instead of 1 (as in MQA) or $h$ (as in MHA). The KV cache is reduced by a factor of $h / G$ relative to MHA. ^[1]
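The grouping and its two limiting cases can be made concrete with a small index mapping. This sketch assumes 0-indexed heads and contiguous groups (the convention used by common implementations); the helper name `kv_head_for_query_head` is illustrative.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the K/V head shared by a given query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    heads_per_group = num_query_heads // num_kv_heads
    return query_head // heads_per_group


mapping = [kv_head_for_query_head(i, 8, 2) for i in range(8)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With `num_kv_heads=1` every query head maps to K/V head 0 (MQA), and with `num_kv_heads` equal to `num_query_heads` the mapping is the identity (MHA).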

Attention Computation

At inference time, for each group $g$ ($g = 1, \ldots, G$), the query heads in that group attend to the shared K/V pair for that group. With $h/G$ query heads per group, the output of each of those heads is computed as:

$$\mathrm{Attention}(Q_i, K_g, V_g) = \mathrm{softmax}\!\left(\frac{Q_i K_g^\top}{\sqrt{d_k}}\right) V_g \quad \text{for each query head } i \text{ in group } g$$

where $K_g$ and $V_g$ are the shared key and value matrices for group $g$. The outputs of all $h$ heads are then concatenated and linearly projected as in standard MHA. ^[1]
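The per-group computation can be written without copying K or V by reshaping the query heads so that each group lines up with its shared K/V head. The following NumPy sketch makes several simplifying assumptions: a single sequence, no causal mask, contiguous groups, and no output projection. The function name is illustrative.

```python
import numpy as np


def grouped_query_attention(q, k, v):
    """q: (h, T, d); k, v: (G, T, d). Returns per-head outputs of shape (h, T, d)."""
    h, T, d = q.shape
    G = k.shape[0]
    per_group = h // G
    # View the query heads as (G, per_group, T, d): group g holds heads
    # g * per_group ... (g + 1) * per_group - 1.
    qg = q.reshape(G, per_group, T, d)
    # Every query head in group g is scored against the same K head g.
    scores = np.einsum("gptd,gsd->gpts", qg, k) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = np.einsum("gpts,gsd->gptd", weights, v)
    return out.reshape(h, T, d)


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))  # 8 query heads
k = rng.standard_normal((2, 5, 16))  # 2 shared K heads
v = rng.standard_normal((2, 5, 16))  # 2 shared V heads
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

Passing K and V with as many heads as Q reduces this to ordinary multi-head attention, and passing a single K/V head gives MQA.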

The computation cost is nearly identical to MHA during both training (where the full sequence is processed in parallel) and prefill (the first forward pass over the prompt). The benefit of GQA appears primarily at the decode step, where the number of K/V entries that must be loaded from the KV cache is reduced by a factor of $h / G$. ^[1]

KV Cache Reduction

For a model with $h$ query heads, $G$ K/V groups, $d_k$ head dimension, $L$ layers, sequence length $T$, batch size $B$, and 2 bytes per element (float16):

MHA: $2 \cdot h \cdot d_k \cdot L \cdot T \cdot B \cdot 2$ bytes
GQA-$G$: $2 \cdot G \cdot d_k \cdot L \cdot T \cdot B \cdot 2$ bytes
MQA: $2 \cdot 1 \cdot d_k \cdot L \cdot T \cdot B \cdot 2$ bytes

The reduction factor of GQA over MHA is $h / G$. For Llama 3.1 8B with $h = 32$ and $G = 8$, the KV cache is 4x smaller than it would be with MHA. ^[5]
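Written out, every factor except the head count cancels in the ratio of the MHA and GQA-$G$ cache sizes, which is where the $h/G$ figure comes from. Using the Llama 3.1 8B values quoted above as the worked case:

$$\frac{2 \cdot h \cdot d_k \cdot L \cdot T \cdot B \cdot 2}{2 \cdot G \cdot d_k \cdot L \cdot T \cdot B \cdot 2} = \frac{h}{G} = \frac{32}{8} = 4$$

The same cancellation gives a factor of $h$ for MQA, since $G = 1$.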

How does GQA differ from MHA and MQA?

Quality and Speed Tradeoffs

The GQA paper evaluates models on a suite of sequence-to-sequence tasks using T5-based architectures. A key result is that GQA-8-XXL (a T5-XXL model with 8 K/V groups) achieves quality close to MHA-XXL while running at near-MQA speed. The paper summarizes this as: "GQA achieves significant additional quality gains, achieving performance close to MHA-XXL with speed close to MQA." ^[1] Selected results from Table 1 of the paper include:

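| Model | Average score | Inference time (s/sample) |
| --- | --- | --- |
| MHA-Large | 46.0 | 0.37 |
| MHA-XXL | 47.2 | 1.51 |
| MQA-XXL | 46.6 | 0.24 |
| GQA-8-XXL | 47.1 | 0.28 |

The average is taken over all evaluated datasets, which mix metrics (ROUGE-1 for summarization, BLEU for translation, F1 for question answering), so it is not a pure ROUGE-1 figure. The result illustrates the key tradeoff: MHA-XXL has the best quality but is about 6x slower than MQA-XXL. GQA-8-XXL comes within 0.1 points of MHA-XXL on the average score while being only slightly slower than MQA-XXL. On English-German translation (WMT 2014) and question answering (TriviaQA), the same pattern holds: GQA stays close to MHA quality while running significantly faster. ^[1]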

Architectural Comparison Table

| Property | MHA | GQA-$G$ | MQA |
| --- | --- | --- | --- |
| K/V heads | $h$ | $G$ (where $1 < G < h$) | 1 |
| KV cache size | baseline | $G/h$ of MHA | $1/h$ of MHA |
| Decode bandwidth | highest | $G/h$ of MHA | lowest |
| Model quality | best | close to MHA | degraded |
| Representational capacity | highest | intermediate | lowest |
| Introduced | Vaswani et al. (2017) | Ainslie et al. (2023) | Shazeer (2019) |

For models with 32 or 64 query heads and 8 K/V heads (the common production configuration), GQA reduces the KV cache to 25% or 12.5% of the MHA baseline, respectively. This is a substantial saving at a relatively modest cost in model quality.

Performance Tradeoffs

How many groups should GQA use?

Figure 6 of the GQA paper plots inference time per sample for GQA-XXL as $G$ varies from 1 (MQA) to $h$ (MHA). Going from 1 to 8 groups adds only modest inference overhead, and the cost of each additional group rises beyond that point. Together with the quality results above, the picture is one of diminishing returns: most of the quality gap between MQA and MHA is recovered with a small number of groups (around 4-8), while further increases in $G$ add little quality but proportionally more memory and bandwidth cost. The authors state plainly that "we selected 8 groups as a favorable middle ground." ^[1]

In practice, the community has converged on $G = 8$ as a strong default for models in the 7B-70B parameter range. This gives 4x-8x reduction in KV cache size relative to MHA depending on the total number of query heads, while keeping quality nearly indistinguishable from MHA in most evaluations.

Choosing $G$ is a design decision that must balance several considerations:

Model size: Larger models with more parameters per layer are generally more tolerant of aggressive K/V compression, because the feedforward layers and other components provide abundant representational capacity that compensates for reduced attention diversity.
Context length: For very long contexts (32k-128k tokens), even a modest per-layer KV cache size multiplies into large totals, making lower values of $G$ more attractive.
Target hardware: Models designed for GPU servers can afford more K/V heads than models targeting edge devices, mobile inference, or single-consumer GPUs where memory is tightly constrained.
Task diversity: Models used for diverse tasks (coding, reasoning, instruction following, multilingual) benefit from higher attention diversity and may prefer $G = 8$ over $G = 4$ even when memory allows either.

Training vs. Inference Overhead

During training, GQA has the same computational cost as MHA (when processing the full sequence in parallel). The K/V projection matrices are smaller by a factor of $h/G$, but this is a minor fraction of total model parameters for typical head dimensions. The uptraining procedure imposes additional compute costs when converting from MHA checkpoints, but training a model with GQA from scratch incurs no overhead compared to MHA training. ^[1]

At inference time, the benefit is most pronounced during the decode phase and grows with sequence length and batch size. For short sequences or batch size 1, the speedup is less dramatic. For long-context generation (sequences of 16k-128k tokens) at batch sizes of 16 or more, GQA can deliver throughput improvements on the order of 2x-5x over MHA on the same hardware, depending on the model and serving stack.

A useful way to think about the inference arithmetic: during the decode phase, for each new token, the GPU must load from the KV cache all stored K/V tensors for the current context. With MHA (h=32, head_dim=128), a single layer at sequence length 8192 requires loading 32 * 128 * 8192 * 2 (K and V) * 2 (bytes, bfloat16) ≈ 134 MB per sequence, per layer, per decode step. With GQA (8 K/V heads instead of 32), the same operation requires about 34 MB, a 4x reduction in bandwidth demand per layer, and the speedup of the attention step is roughly proportional.

On modern NVIDIA A100 and H100 GPUs, the memory bandwidth is on the order of 2-3.35 TB/s. A large model with 32 layers might load several gigabytes from the KV cache per decode step at long sequences, easily saturating the available bandwidth and making the token generation rate memory-bound. GQA directly addresses this by reducing what must be loaded.
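The bandwidth argument can be turned into a rough lower bound on decode latency. This sketch assumes a hypothetical 32-layer model with head dimension 128, a single sequence of 8192 tokens, 2 bytes per element, and a flat 2 TB/s of memory bandwidth. It counts only KV cache reads and ignores weight loading, so the numbers are illustrative rather than measured.

```python
def kv_bytes_per_decode_step(num_layers, num_kv_heads, head_dim, seq_len,
                             bytes_per_element=2):
    """Bytes of K and V read to generate one token for one sequence."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element


def min_kv_read_ms(num_bytes, bandwidth_bytes_per_s):
    """Lower bound on the time to stream the cache once, ignoring other traffic."""
    return 1000 * num_bytes / bandwidth_bytes_per_s


results = {}
for name, kv_heads in [("MHA", 32), ("GQA-8", 8)]:
    per_step = kv_bytes_per_decode_step(num_layers=32, num_kv_heads=kv_heads,
                                        head_dim=128, seq_len=8192)
    results[name] = (per_step, min_kv_read_ms(per_step, 2e12))
    print(name, f"{per_step / 1e9:.2f} GB", f"{results[name][1]:.2f} ms at 2 TB/s")
# MHA 4.29 GB 2.15 ms at 2 TB/s
# GQA-8 1.07 GB 0.54 ms at 2 TB/s
```

Batching multiplies the bytes read per step by the batch size, which is why the gap between the two configurations matters more as batch size and context length grow.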

Memory vs. Quality Frontier

The fundamental insight of GQA is that the mapping from K/V heads to quality is highly non-linear. Going from $h$ K/V heads (MHA) to $h/2$ K/V heads recovers half the memory savings with very little quality loss. Going all the way to 1 K/V head (MQA) saves the most memory but at a more noticeable quality cost. The intermediate regime of 4-16 groups is where GQA provides the best quality-per-memory-byte tradeoff. ^[1]

Researchers have also explored non-uniform grouping strategies, where different layers of the transformer use different values of $G$. The intuition is that early layers (which tend to do positional and syntactic encoding) may benefit from more diverse attention heads than later layers (which do higher-level semantic reasoning). Work published in 2024 showed that activation-informed grouping, where the group assignments are determined by clustering K/V head activations rather than by simple positional assignment, can yield accuracy gains of up to 7.5% on challenging reasoning tasks for the same total KV cache budget. This line of work suggests that $G = 8$ is a well-performing heuristic rather than a theoretical optimum, and that the quality-efficiency frontier of GQA can be pushed further with careful design.

Which models use GQA?

Llama 2

Llama 2, released by Meta in July 2023, was one of the first widely distributed open-weight models to use GQA. Importantly, among the released sizes only the 70B variant of Llama 2 uses GQA; the 7B and 13B variants use standard MHA. Llama 2 70B uses 64 query heads and 8 K/V heads (GQA-8), reducing the KV cache by a factor of 8 compared to a hypothetical MHA 70B model. Meta reported that this configuration cost "under one percent on most benchmarks" relative to MHA, and cited inference efficiency at 70B scale as the reason for the asymmetric choice. ^[4]

The adoption of GQA in Llama 2 coincided closely with the publication of the GQA paper (both appeared in mid-2023), and the Llama 2 technical report explicitly credited the GQA paper as the source of the technique. ^[4]

Llama 3

Llama 3, released by Meta starting in April 2024, extends GQA to all model sizes, not just the largest. The Llama 3 report states that the family "uses grouped query attention (GQA) with 8 key-value heads to improve inference speed and to reduce the size of key-value caches during decoding." ^[5] The Llama 3.1 family (released July 2024) uses GQA across all variants, with 8 K/V heads in every size per Table 3 of the report: ^[5]

| Model | Query heads | K/V heads | KV cache reduction vs. MHA |
| --- | --- | --- | --- |
| Llama 3.1 8B | 32 | 8 | 4x |
| Llama 3.1 70B | 64 | 8 | 8x |
| Llama 3.1 405B | 128 | 8 | 16x |

All three sizes are GQA-8 (8 K/V groups); the differing reduction factors come from the differing query-head counts. The decision to apply GQA to the 8B model reflects the broader industry consensus that GQA's quality-efficiency tradeoff is favorable even at smaller scales, particularly given the widespread deployment of 7B-8B models on consumer hardware where memory is tightly constrained.

Mistral 7B

Mistral 7B, released by Mistral AI in September 2023 and described in a paper posted in October 2023 (arXiv:2310.06825, Jiang et al.), uses GQA alongside sliding window attention (SWA). The paper states that Mistral 7B "leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost." ^[6] The model has 32 query heads and 8 K/V heads, the same ratio as Llama 3.1 8B. The 4x reduction in KV cache size allows the model to handle longer contexts and larger batch sizes on GPU hardware with limited VRAM. ^[6]

Mistral subsequently applied GQA to its entire model family, including Mixtral (the mixture-of-experts variant) and later Mistral Medium and Mistral Large. The pairing of GQA with sliding window attention in the original Mistral 7B was noteworthy because both techniques address the same bottleneck (KV cache memory and bandwidth) from different angles. ^[6]

Qwen

Qwen, Alibaba's open-weight model family, adopted GQA starting with the Qwen2 series (released mid-2024). The Qwen2 technical report explicitly notes that "Qwen2 adopts Grouped Query Attention (GQA) instead of conventional multi-head attention (MHA)" to optimize KV cache usage and enhance inference throughput. The Qwen2.5-14B model uses 40 query heads and 8 K/V heads. Qwen3, released in 2025, continues this pattern across all sizes. ^[7]

Other Models

GQA has become the de facto standard for efficient attention in open-weight decoder-only transformers. Additional notable adopters include:

Gemma 2 (Google, 2024): Uses GQA in the 9B and 27B variants.
Phi-3 and Phi-4 (Microsoft, 2024-2025): Adopt GQA for memory efficiency at the 3B-14B scale.
Command R+ (Cohere, 2024): Uses GQA for long-context inference.
Yi (01.AI, 2024): Uses GQA in the 34B and 200K context models.

How does GQA compare to Multi-Head Latent Attention?

Multi-head Latent Attention (MLA), introduced in the DeepSeek-V2 technical report (May 2024) and used in DeepSeek-V3 (December 2024), takes a fundamentally different approach to reducing the KV cache bottleneck. ^[8]

How MLA Works

Instead of reducing the number of K/V heads (as GQA does), MLA compresses the K/V representations into a low-rank latent vector using a learned down-projection. At decode time, the full K/V heads are reconstructed on demand from this latent via a learned up-projection. What is stored in the KV cache is the compressed latent, not the full K/V tensors. ^[8]

Specifically, for each token position, MLA stores a single compressed vector of dimension $c_{KV}$ (much smaller than the full $h \cdot d_k$ of MHA K or V). The K and V heads are then computed from this latent as needed. This means the KV cache contains a dense, low-rank representation shared by all heads, rather than a reduced set of full-size K/V heads as in GQA. ^[8]
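The down-projection and up-projection can be sketched in NumPy to show what is cached and what is rebuilt. This is a simplified illustration, not the DeepSeek implementation: the sizes are arbitrary, the weights are random, and details of the real design such as positional-encoding handling and the absorbed formulation are omitted. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, num_heads, head_dim, latent_dim = 256, 8, 32, 64  # illustrative sizes only

# One shared down-projection and separate up-projections for K and V.
w_down = rng.standard_normal((d_model, latent_dim)) / np.sqrt(d_model)
w_up_k = rng.standard_normal((latent_dim, num_heads * head_dim)) / np.sqrt(latent_dim)
w_up_v = rng.standard_normal((latent_dim, num_heads * head_dim)) / np.sqrt(latent_dim)

hidden = rng.standard_normal((10, d_model))  # hidden states of 10 cached tokens

# Only this latent is stored in the cache: one vector per token.
latent_cache = hidden @ w_down  # (10, latent_dim)

# Full per-head K and V are rebuilt from the latent when attention needs them.
k = (latent_cache @ w_up_k).reshape(10, num_heads, head_dim)
v = (latent_cache @ w_up_v).reshape(10, num_heads, head_dim)

mha_elems = 2 * num_heads * head_dim  # cached elements per token per layer under MHA
print(mha_elems / latent_dim)  # 8.0
```

In this toy configuration the latent cache is 8x smaller than an MHA cache, and the ratio is set by the choice of `latent_dim` rather than by a head count.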

Comparison Table

| Property | GQA | MLA (DeepSeek) |
| --- | --- | --- |
| Cache compression method | Reduce number of K/V heads | Low-rank projection of K/V |
| Theoretical compression ratio vs. MHA | $h/G$ (e.g., 8x for 64 heads, 8 groups) | ~60x for DeepSeek-V3 |
| Compression vs. GQA (8 K/V heads) | baseline | ~3.6x further reduction (the DeepSeek-V2 report equates the MLA cache to GQA with 2.25 groups) |
| Reconstruction overhead | None (K/V used directly) | Up-projection at each decode step |
| Compatibility with FlashAttention | Full support (standard formulation) | Requires absorption trick or custom kernels |
| Quality vs. MHA | Slightly below MHA | At or slightly above MHA |
| Introduced | Ainslie et al. (2023) | DeepSeek-V2 technical report (2024) |
| Widely adopted | Yes (Llama, Mistral, Qwen, etc.) | DeepSeek models; not yet widely adopted |

Tradeoffs

MLA achieves dramatically higher compression ratios, roughly 60x smaller KV cache than MHA for DeepSeek-V3, compared to roughly 4-16x for typical GQA configurations. This allows DeepSeek models to serve longer contexts and larger batches at the same memory budget. MLA also appears to preserve quality better: evaluations suggest MLA quality matches or slightly exceeds MHA, while GQA quality is slightly below MHA. ^[8]

However, MLA introduces complexity. The reconstruction of K/V heads from latents at each decode step requires matrix multiplications that GQA does not. A key optimization used in DeepSeek, the "absorbed" formulation, where the up-projection matrices are merged into the query and output projections, makes MLA compatible with efficient attention kernels, but this requires careful implementation. ^[8] GQA is simpler to implement correctly: the K/V heads can be expanded (repeated) to match the number of query heads, at which point a standard scaled dot-product attention kernel handles the rest.

As of 2025, GQA remains the dominant approach in the open-source ecosystem due to its simplicity, FlashAttention compatibility, and strong quality-efficiency tradeoff. MLA is used in the DeepSeek model family and has attracted research interest, but has not yet seen wide adoption outside DeepSeek.

How is GQA implemented?

PyTorch

PyTorch added native GQA support to torch.nn.functional.scaled_dot_product_attention via an enable_gqa=True flag. With this flag enabled, the function accepts K and V tensors with a different (smaller) number of heads than Q, and handles the internal broadcasting automatically.

A common pattern for implementing GQA in PyTorch is to expand the K/V heads to match the number of Q heads before passing to the attention kernel:

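import torch.nn.functional as F

# scaled_dot_product_attention expects the head axis before the sequence axis.
# q: (batch, num_heads, seq, head_dim)
# k, v: (batch, num_kv_heads, seq, head_dim)
repeat_factor = num_heads // num_kv_heads  # query heads per K/V head
k_expanded = k.repeat_interleave(repeat_factor, dim=1)  # expand K along the head axis
v_expanded = v.repeat_interleave(repeat_factor, dim=1)  # expand V along the head axis
out = F.scaled_dot_product_attention(q, k_expanded, v_expanded, is_causal=True)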

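This "expand and reuse" approach is simple, but repeat_interleave materializes the expanded K and V tensors for the duration of the call. The KV cache itself still stores only num_kv_heads heads, so the saving in cache memory is preserved. Avoiding the temporary copy as well requires a kernel that maps each query head to its shared K/V head internally (as FlashAttention does).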

FlashAttention

FlashAttention (Dao et al.) supports GQA natively starting with FlashAttention-2 (released mid-2023). The FlashAttention-2 attention functions accept K and V tensors with fewer heads than Q, provided the number of query heads is divisible by the number of K/V heads. Internally, the kernel tiles the computation so that each K/V head is reused by the corresponding query heads in its group, without materializing the expanded K/V tensors in memory. This is important for efficiency: expanding K/V before calling attention would multiply the memory traffic for K and V by the group ratio. ^[9]

FlashAttention-3 (2024) and FlashInfer (an alternative fused attention library) both carry this GQA-aware kernel support forward with additional optimizations for the Hopper (H100) GPU architecture.

Transformers Library

Hugging Face Transformers implements GQA in its model configurations via the num_key_value_heads parameter. Models that set num_key_value_heads < num_attention_heads use GQA; setting num_key_value_heads = 1 uses MQA; and num_key_value_heads = num_attention_heads uses standard MHA. This unified parameter makes it easy to experiment with different group counts without changing model code.

TensorRT-LLM and vLLM

NVIDIA TensorRT-LLM supports GQA by letting a model be configured with fewer K/V heads than query heads, using FlashAttention-style kernels for longer sequences (the num_gqa_groups parameter name is the one used by NVIDIA's Transformer Engine layers). The vLLM serving framework similarly supports GQA for all major model families, with paged KV cache management that is aware of the reduced number of K/V heads.

Relationship to the Broader Attention Design Space

GQA is best understood as one point in a two-dimensional design space defined by (1) the number of query heads and (2) the number of K/V heads. The diagonal where both quantities are equal corresponds to MHA. The line where the number of K/V heads is 1 corresponds to MQA. GQA fills the space between these extremes. ^[1]

Other attention variants occupy different parts of the design space or introduce additional dimensions:

Sliding Window Attention (SWA), used in Mistral 7B, limits each query to attending only to a fixed-size local window of past tokens rather than the entire context. SWA reduces the KV cache size by capping the number of stored positions rather than by reducing the number of K/V heads. GQA and SWA are orthogonal and can be combined, as Mistral 7B demonstrates. ^[6]

Sparse Attention mechanisms, such as those used in Longformer and BigBird, selectively compute attention over a sparse subset of token pairs. This reduces the attention computation (which is quadratic in sequence length for dense attention) but does not directly address the K/V head count dimension that GQA targets. Sparse attention is more relevant for very long contexts during training; GQA primarily targets inference.

Tensor Parallelism Compatibility: One practical consideration in choosing the number of K/V heads is compatibility with tensor parallelism during distributed inference. The GQA paper itself raises this point, noting that standard sharding replicates the single key and value head by the number of model partitions under MQA, and that GQA removes the waste from such partitioning. ^[1] When a model is sharded across $P$ GPUs (a common technique for serving large models), it is typically required that the number of K/V heads be divisible by $P$. With $G = 8$ K/V heads and tensor parallelism across 8 GPUs, each GPU gets exactly 1 K/V head, a valid configuration. With 4 GPUs, each GPU gets 2 K/V heads. The choice of $G = 8$ is partly motivated by its divisibility by common parallelism degrees (1, 2, 4, 8).

Head Dimension Scaling: The GQA paper fixes the head dimension $d_k$ and varies only the number of K/V heads. An alternative approach is to keep the total K/V capacity (measured in parameters) constant but redistribute it by using fewer heads with larger head dimensions, or more heads with smaller dimensions. This is not strictly GQA but touches on similar tradeoffs. In practice, most GQA deployments keep the head dimension fixed at 128 and vary only $G$. ^[1]
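The tensor-parallelism point above can be illustrated with a small helper that reports how K/V heads end up distributed across devices. This is a sketch of the general sharding logic only: the function name `kv_heads_per_device` is hypothetical, and real serving frameworks may impose different constraints.

```python
def kv_heads_per_device(num_kv_heads, tp_degree):
    """Return (K/V heads held by each device, devices holding a copy of each head)."""
    if num_kv_heads >= tp_degree:
        if num_kv_heads % tp_degree != 0:
            raise ValueError("num_kv_heads must be divisible by tp_degree")
        # Enough heads to shard: each head lives on exactly one device.
        return num_kv_heads // tp_degree, 1
    if tp_degree % num_kv_heads != 0:
        raise ValueError("tp_degree must be divisible by num_kv_heads")
    # Fewer heads than devices: each head is replicated across several devices.
    return 1, tp_degree // num_kv_heads


print(kv_heads_per_device(8, 8))  # (1, 1)
print(kv_heads_per_device(8, 4))  # (2, 1)
print(kv_heads_per_device(1, 8))  # (1, 8)
```

The last call is the MQA case described in the GQA paper, where the single K/V head is copied to every partition and the per-device cache saving is lost.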

Limitations

Quality gap at small group counts. While GQA with 8 K/V heads closely matches MHA for large models, the quality gap becomes more noticeable with aggressive compression (fewer groups). Models converted from MHA via uptraining, rather than trained with GQA from scratch, may show slightly larger gaps because the uptraining budget is finite. ^[1]

Not effective for all tasks. The quality advantage of MHA over GQA tends to be most visible on tasks that benefit from fine-grained, head-diverse attention patterns, such as tasks requiring precise entity tracking over long contexts or tasks requiring fine-grained cross-lingual alignment in translation. For most standard NLP benchmarks, the difference is small, which is why GQA's adoption has been so rapid: the benchmarks used to evaluate models do not strongly distinguish MHA from GQA.

Prefill is not accelerated. The KV cache reduction benefits apply mainly to the decode phase. During prefill (processing a long prompt), GQA saves memory but does not reduce the number of attention FLOPs, since the full sequence is processed in parallel. This means GQA does little to improve time to first token (TTFT); it mainly improves the subsequent token generation rate.

Does not address KV cache eviction. GQA reduces the rate at which the KV cache fills up, but does not address the problem of very long contexts eventually filling the cache entirely. Complementary techniques such as Sparse Attention, sliding window attention (as in Mistral 7B), and KV cache eviction policies (which drop cache entries for less-attended tokens) are needed to handle arbitrarily long contexts.

Interaction with speculative decoding. Speculative decoding pairs a small draft model with a large target model, and each maintains its own KV cache that must be kept in step as draft tokens are accepted or rejected. When the draft and target models have different GQA configurations (different numbers of K/V heads), the serving system has to manage two differently shaped caches, which makes cache management more complex.

Tensor parallelism constraints. As noted above, K/V heads are typically sharded evenly across tensor-parallel devices, so the number of K/V heads should be divisible by the tensor parallelism degree. When there are fewer K/V heads than devices, the heads must instead be replicated, as happens with the single MQA head in the GQA paper's discussion. A model with 4 K/V heads deployed with 8-way tensor parallelism therefore stores each K/V head on two devices, which gives back part of the memory saving and limits the benefit of the most aggressive GQA configurations in large-scale deployments.

Not a solution for KV cache quantization. GQA reduces the number of cache entries but does not address the precision of each entry. KV cache quantization (storing K/V in INT8 or INT4 rather than FP16) is a separate and complementary technique. Combining GQA with KV cache quantization can yield further memory savings, but the two techniques interact: quantization errors in the KV cache may be more visible when K/V heads are shared across multiple query heads (as in GQA) than when each query head has its own K/V (as in MHA).

References

1. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., and Sanghai, S. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. *Proceedings of EMNLP 2023*. arXiv:2305.13245.
2. Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.
3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention Is All You Need. *Advances in Neural Information Processing Systems 30*.
4. Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288.
5. Dubey, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
6. Jiang, A. Q., et al. (2023). Mistral 7B. arXiv:2310.06825.
7. Qwen Team (2024). Qwen2 Technical Report. arXiv:2407.10671.
8. DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model. arXiv:2405.04434.
9. Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. arXiv:2307.08691.
