See also: Attention (machine learning), Transformer, KV cache
Grouped-query attention (GQA) is a variant of the multi-head attention mechanism used in Transformer models. It was introduced by Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai in their 2023 paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints," published at EMNLP 2023. GQA sits between standard multi-head attention (MHA) and multi-query attention (MQA) on a spectrum of key-value sharing strategies. It divides the query heads into groups, where each group shares a single set of key and value projections. This design reduces the size of the key-value (KV) cache during autoregressive inference, lowering memory consumption and increasing throughput, while preserving model quality much better than the more aggressive MQA approach.
Since its publication, GQA has been adopted as the default attention mechanism in most major open-weight large language models, including Llama 2 (in its 70B variant) and Llama 3, Mistral 7B, Gemma 2, Qwen 2, and Falcon 40B/180B.
During autoregressive text generation, a language model produces one token at a time. At each step, the model computes attention over all previously generated tokens. To avoid redundantly recomputing key and value projections for past tokens, modern implementations store these projections in a KV cache. The cache grows linearly with sequence length, model dimension, number of layers, and batch size.
For large models deployed at long context lengths, the KV cache can consume tens of gigabytes of GPU memory. For example, a model with 80 layers, 64 attention heads, a head dimension of 128, and a sequence length of 8,192 tokens requires storing 80 x 64 x 128 x 8,192 x 2 (keys and values) x 2 bytes (FP16) of data, which comes to roughly 20 GB for a single sequence.
More importantly, autoregressive decoding at small batch sizes is memory-bandwidth bound rather than compute bound. At each decoding step, the model must load the full KV cache from GPU high-bandwidth memory (HBM) into on-chip SRAM to compute attention scores against the single new query. The speed of token generation is therefore limited by how quickly the GPU can read this data, not by how fast it can perform arithmetic. Reducing the KV cache size directly translates to faster inference because there is less data to transfer.
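To make the bandwidth argument concrete, the following back-of-envelope sketch estimates the floor on per-token decode time imposed by streaming the KV cache from HBM. The cache size (the ~20 GB example above) and the bandwidth figure are illustrative assumptions, not measurements from any specific GPU:

```python
# Back-of-envelope estimate: per-token decode time implied by KV cache reads.
# Both figures below are illustrative assumptions (cache from the example
# above; ~2 TB/s is a rough modern-HBM ballpark).
kv_cache_bytes = 20e9          # ~20 GB KV cache for one long sequence
hbm_bandwidth = 2e12           # ~2 TB/s of HBM read bandwidth

seconds_per_token = kv_cache_bytes / hbm_bandwidth
print(f"Lower bound from KV reads alone: {seconds_per_token * 1e3:.1f} ms/token")
# ~10 ms/token, i.e. at most ~100 tokens/s, before any arithmetic is counted.
# An 8x smaller cache (GQA-8) lowers this floor to ~1.25 ms/token.
```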
Standard multi-head attention, introduced by Vaswani et al. (2017) in the original Transformer paper "Attention Is All You Need," uses h independent attention heads. Each head i has its own learned projection matrices for queries, keys, and values:
Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V
head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
In MHA, each head maintains its own KV projections. The total KV cache size per layer is proportional to h (the number of heads) times d_k (the head dimension) times the sequence length.
Multi-query attention was proposed by Noam Shazeer in 2019 in the paper "Fast Transformer Decoding: One Write-Head Is All You Need." MQA keeps h independent query heads but collapses all keys and values into a single shared head. Every query head attends to the same set of keys and values:
K = X W^K, V = X W^V (shared across all heads)
Q_i = X W_i^Q (unique per head)
head_i = softmax(Q_i K^T / sqrt(d_k)) V
This reduces the KV cache by a factor of h compared to MHA. For a model with 64 heads, MQA stores only 1/64th of the key-value data. The result is a large speedup during inference. However, MQA comes with a quality trade-off: sharing a single KV representation across all heads limits the model's representational capacity, and empirical studies have shown measurable quality degradation compared to MHA on tasks like summarization, translation, and question answering.
GQA generalizes both MHA and MQA by introducing a parameter G (the number of KV head groups). The h query heads are partitioned into G equal-sized groups. Within each group, all query heads share a single key head and a single value head:
For a model with h = 32 query heads and G = 8 KV groups, each group contains 32 / 8 = 4 query heads that share one key-value pair. The KV cache is reduced by a factor of h / G = 4 compared to MHA.
Let h denote the number of query heads, G the number of KV groups, and d_k the head dimension. For each group g (where g = 1, ..., G):
K_g = X W_g^K, V_g = X W_g^V
For each query head i in group g:
Q_i = X W_i^Q
head_i = softmax(Q_i K_g^T / sqrt(d_k)) V_g
The outputs of all heads are concatenated and projected through a learned output matrix W^O:
GQA(X) = Concat(head_1, ..., head_h) W^O
The asymptotic computational complexity of GQA remains O(n^2 * d) during the full forward pass, the same as MHA and MQA. The benefit is in the reduced memory footprint and memory bandwidth requirements during inference.
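To make the grouping concrete, here is a minimal PyTorch sketch of the per-group equations above. The tensor sizes and the per-head weight layout are illustrative assumptions, and the final output projection W^O is omitted; production implementations fuse the projections into single matrices and add masking and KV caching:

```python
import math
import torch

# Illustrative sizes (not tied to any particular model)
batch, seq_len, d_model = 2, 16, 512
h, G, d_k = 8, 2, 64                  # query heads, KV groups, head dimension
heads_per_group = h // G

X = torch.randn(batch, seq_len, d_model)

# One W^Q per query head; one W^K and W^V per group (W^O omitted for brevity)
W_Q = torch.randn(h, d_model, d_k)
W_K = torch.randn(G, d_model, d_k)
W_V = torch.randn(G, d_model, d_k)

heads = []
for i in range(h):
    g = i // heads_per_group                      # the group head i belongs to
    Q_i = X @ W_Q[i]                              # (batch, seq_len, d_k)
    K_g, V_g = X @ W_K[g], X @ W_V[g]             # shared by all heads in group g
    scores = Q_i @ K_g.transpose(-2, -1) / math.sqrt(d_k)
    heads.append(torch.softmax(scores, dim=-1) @ V_g)

out = torch.cat(heads, dim=-1)                    # (batch, seq_len, h * d_k)
# Setting G = h recovers standard MHA; G = 1 recovers MQA.
```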
The KV cache size per layer can be expressed as:
KV cache per layer = 2 x n_kv_heads x d_k x seq_len x bytes_per_element
where the factor of 2 accounts for both keys and values. The following table shows how KV cache size scales under each attention variant for a model with h = 64 query heads, d_k = 128, sequence length 8,192, and FP16 storage (2 bytes per element):
| Attention variant | KV heads | KV cache per layer | Reduction vs. MHA |
|---|---|---|---|
| MHA (G = h = 64) | 64 | 64 x 128 x 8192 x 2 x 2 = 256 MB | 1x (baseline) |
| GQA-8 (G = 8) | 8 | 8 x 128 x 8192 x 2 x 2 = 32 MB | 8x |
| GQA-4 (G = 4) | 4 | 4 x 128 x 8192 x 2 x 2 = 16 MB | 16x |
| MQA (G = 1) | 1 | 1 x 128 x 8192 x 2 x 2 = 4 MB | 64x |
For a model with 80 layers (like Llama 2 70B), multiply these per-layer values by 80 to get the total KV cache size. An 8x reduction (GQA-8) would save roughly 18 GB of memory for a single sequence at this scale.
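The per-layer figures in the table above and the whole-model totals can be reproduced with a few lines of Python; the 80-layer total corresponds to the Llama 2 70B-scale example used in the text:

```python
def kv_cache_bytes(n_kv_heads, head_dim, seq_len, n_layers=1, bytes_per_elem=2):
    """KV cache size in bytes; the leading factor of 2 covers keys and values."""
    return 2 * n_kv_heads * head_dim * seq_len * bytes_per_elem * n_layers

cfg = dict(head_dim=128, seq_len=8192)            # values from the table above
for name, n_kv in [("MHA", 64), ("GQA-8", 8), ("GQA-4", 4), ("MQA", 1)]:
    per_layer = kv_cache_bytes(n_kv, **cfg)
    total = kv_cache_bytes(n_kv, **cfg, n_layers=80)   # e.g. an 80-layer model
    print(f"{name:6s} {per_layer / 2**20:6.0f} MB/layer  {total / 2**30:5.1f} GB total")
# MHA: 256 MB/layer and 20.0 GB total; GQA-8: 32 MB/layer and 2.5 GB total, etc.
```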
One of the paper's key contributions is a practical recipe for converting an existing MHA model into a GQA model without training from scratch. This procedure is called uptraining and requires only about 5% of the original pre-training compute (approximately 600 TPUv3 chip-days for a T5-XXL scale model).
The conversion from MHA to GQA involves restructuring the key and value projection matrices. Ainslie et al. evaluated three methods for initializing the GQA key-value heads from the original MHA heads:
Mean pooling (best): The key and value projection matrices of the original heads within each group are averaged. For a group containing heads {i_1, i_2, ..., i_k}, the group's key projection is W_g^K = (1/k) * sum(W_{i_j}^K). This preserves the most information from the pretrained checkpoint.
First head selection (middle): The projection matrices of the first head in each group are used, and the rest are discarded.
Random initialization (worst): New key and value projections are initialized randomly, discarding all pretrained KV information.
The paper found that mean pooling consistently outperformed the other two methods, achieving the best quality after uptraining. This makes sense because averaging the learned projections retains information from all original heads rather than discarding most of it.
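As a rough illustration, the following sketch applies mean pooling to a stack of per-head key projection matrices. The (n_heads, d_model, d_k) weight layout is an assumption made for clarity; real checkpoints store fused projection matrices that must first be split per head, and the same operation applies to the value projections:

```python
import torch

def mean_pool_kv(W_per_head: torch.Tensor, n_groups: int) -> torch.Tensor:
    """Average MHA key (or value) projections within each group.

    W_per_head: (n_heads, d_model, d_k) -- one projection matrix per MHA head.
    Returns:    (n_groups, d_model, d_k) -- one projection matrix per GQA group.
    """
    n_heads, d_model, d_k = W_per_head.shape
    assert n_heads % n_groups == 0, "query heads must split evenly into groups"
    heads_per_group = n_heads // n_groups
    # Group consecutive heads together and average their projections
    return W_per_head.reshape(n_groups, heads_per_group, d_model, d_k).mean(dim=1)

# e.g. converting 64 MHA key heads into 8 GQA key heads
W_K_mha = torch.randn(64, 512, 128)
W_K_gqa = mean_pool_kv(W_K_mha, n_groups=8)   # shape: (8, 512, 128)
```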
After checkpoint conversion, the model is trained for an additional alpha fraction of the original pre-training steps; the paper uses alpha = 0.05 (i.e., 5%) and finds this modest amount of uptraining sufficient for the converted models to approach the quality of the original MHA checkpoint.
The original GQA paper evaluated the approach on T5-XXL (a model with approximately 11 billion parameters) across summarization, translation, and question-answering benchmarks. The MQA and GQA variants were uptrained from the same MHA-XXL checkpoint.
| Model | Inference time (s) | CNN/DM (R1) | arXiv (R1) | PubMed (R1) | MediaSum (R1) | MultiNews (R1) | WMT (BLEU) | TriviaQA (F1) |
|---|---|---|---|---|---|---|---|---|
| MHA-Large | 0.37 | 46.0 | 42.9 | 44.6 | 46.2 | 35.5 | 46.6 | 78.2 |
| MHA-XXL | 1.51 | 47.2 | 43.8 | 45.6 | 47.5 | 36.4 | 46.9 | 81.9 |
| MQA-XXL | 0.24 | 46.6 | 43.0 | 45.0 | 46.9 | 36.1 | 46.5 | 81.3 |
| GQA-8-XXL | 0.28 | 47.1 | 43.5 | 45.4 | 47.7 | 36.3 | 47.2 | 81.6 |
Key observations:

- GQA-8-XXL matches MHA-XXL to within a few tenths of a point on every benchmark while running roughly five times faster at inference (0.28 s vs. 1.51 s).
- MQA-XXL is slightly faster still (0.24 s) but shows a larger, more consistent quality drop relative to MHA-XXL.
- Both uptrained XXL variants are faster than MHA-Large and match or exceed it on nearly every benchmark, suggesting that a larger model with fewer KV heads is a better trade-off than a smaller model with full multi-head attention.
The paper also measured how inference time changes with the number of groups G. Going from G = 1 (MQA) to G = 8 adds only modest overhead. Increasing G beyond 8 starts approaching MHA inference costs. G = 8 was identified as a practical sweet spot that balances quality and speed for models with 32 to 128 query heads.
The three attention variants form a spectrum of KV sharing strategies, trading off model expressiveness against inference efficiency:
| Property | MHA | GQA | MQA |
|---|---|---|---|
| KV heads | h (one per query head) | G (one per group) | 1 (shared) |
| KV cache size | Largest | Intermediate | Smallest |
| Inference speed | Slowest | Near MQA | Fastest |
| Model quality | Best | Near MHA | Slightly degraded |
| Representational capacity | Full | Near-full | Limited |
| Parameters (KV projections) | h x d_k x d_model x 2 | G x d_k x d_model x 2 | 1 x d_k x d_model x 2 |
| Uptraining from MHA? | N/A | Yes (5% compute) | Yes (5% compute) |
GQA occupies the best trade-off point for most practical applications: it achieves quality within a fraction of a point of MHA while running at nearly the same speed as MQA.
In practice, GQA is implemented by computing Q, K, and V projections with h query heads and G KV heads, then expanding the KV heads to match the number of query heads before computing attention. This expansion is performed by a function commonly called repeat_kv, which duplicates each KV head h/G times along the head dimension:
```python
import torch


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand KV heads to match query head count."""
    # x: (batch, seq_len, n_kv_heads, head_dim); each KV head is repeated
    # n_rep = h / G times so that attention sees one KV head per query head.
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x  # G == h (standard MHA); no expansion needed
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )
```
The key insight is that while the KV cache stores only G sets of key-value vectors (saving memory), the actual attention computation still operates with h effective KV heads by repeating each cached head. This means the forward pass arithmetic is identical to MHA; the savings come entirely from the reduced cache storage and the reduced amount of data that must be loaded from memory during decoding.
As of PyTorch 2.5+, the built-in torch.nn.functional.scaled_dot_product_attention function supports GQA directly through an enable_gqa=True parameter. This eliminates the need for explicit repeat_kv calls in user code and allows the kernel to handle the head expansion internally, potentially with optimized memory access patterns.
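For example (shapes below are illustrative), the grouped layout can be passed to the built-in kernel directly, with the query tensor carrying more heads than the key and value tensors:

```python
import torch
import torch.nn.functional as F

batch, seq_len, d_k = 2, 128, 64
n_q_heads, n_kv_heads = 32, 8                 # GQA with 4 query heads per group

q = torch.randn(batch, n_q_heads, seq_len, d_k)
k = torch.randn(batch, n_kv_heads, seq_len, d_k)
v = torch.randn(batch, n_kv_heads, seq_len, d_k)

# Requires PyTorch 2.5+; no explicit repeat_kv call is needed.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(out.shape)  # torch.Size([2, 32, 128, 64]) -- one output per query head
```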
FlashAttention 2 and 3 natively support GQA. When the number of KV heads differs from the number of query heads, FlashAttention automatically handles the head grouping without requiring the user to expand the KV tensors. This combination of GQA with FlashAttention provides both the memory savings from reduced KV heads and the IO-efficiency of tiled attention computation.
GQA has been widely adopted across major open-weight language models released since mid-2023, and is reported to be used in many proprietary models as well. The following table summarizes GQA configurations in notable open-weight models:
| Model | Release | Parameters | Query heads | KV heads (G) | Head dim | Queries per group |
|---|---|---|---|---|---|---|
| Llama 2 7B | Jul 2023 | 7B | 32 | 32 (MHA) | 128 | 1 |
| Llama 2 13B | Jul 2023 | 13B | 40 | 40 (MHA) | 128 | 1 |
| Llama 2 70B | Jul 2023 | 70B | 64 | 8 | 128 | 8 |
| Llama 3 8B | Apr 2024 | 8B | 32 | 8 | 128 | 4 |
| Llama 3 70B | Apr 2024 | 70B | 64 | 8 | 128 | 8 |
| Llama 3.1 405B | Jul 2024 | 405B | 128 | 8 | 128 | 16 |
| Mistral 7B | Sep 2023 | 7B | 32 | 8 | 128 | 4 |
| Mixtral 8x7B | Dec 2023 | 46.7B (MoE) | 32 | 8 | 128 | 4 |
| Gemma 2B | Feb 2024 | 2B | 8 | 1 (MQA) | 256 | 8 |
| Gemma 2 9B | Jun 2024 | 9B | 16 | 8 | 256 | 2 |
| Gemma 2 27B | Jun 2024 | 27B | 32 | 16 | 128 | 2 |
| Falcon 7B | May 2023 | 7B | 71 | 1 (MQA) | 64 | 71 |
| Falcon 40B | May 2023 | 40B | 128 | 8 | 64 | 16 |
| Falcon 180B | Sep 2023 | 180B | 232 | 8 | 64 | 29 |
| Qwen 2.5 7B | Sep 2024 | 7B | 28 | 4 | 128 | 7 |
| Qwen 2.5 14B | Sep 2024 | 14B | 40 | 8 | 128 | 5 |
| Qwen 2.5 72B | Sep 2024 | 72B | 64 | 8 | 128 | 8 |
A few patterns stand out from this table:

- G = 8 KV heads has become the most common configuration, appearing across model families and scales from 7B to 405B parameters.
- As models grow, designers tend to hold G fixed rather than scale it with the query head count, so the number of query heads per group rises (16 for Llama 3.1 405B, 29 for Falcon 180B).
- The smallest models sometimes go all the way to MQA (Gemma 2B, Falcon 7B), and Llama 2 7B and 13B retained full MHA, while the 70B variant and every later Llama generation use GQA.
Multi-head latent attention (MLA), introduced in DeepSeek-V2 (2024), takes a different approach to KV cache reduction. Instead of reducing the number of KV heads, MLA compresses the key and value representations into a low-dimensional latent vector before storing them in the cache. At inference time, the compressed representation is projected back to produce unique keys and values for each head.
MLA achieved a 93.3% reduction in KV cache size compared to standard MHA in DeepSeek-V2 while maintaining or even improving model quality. Unlike GQA, which slightly underperforms MHA, MLA has been reported to match or exceed MHA quality in ablation studies by the DeepSeek team. However, MLA adds computational overhead through the decompression step and is more complex to implement. GQA remains easier to implement, train, and integrate with existing attention kernels like FlashAttention.
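The caching idea behind MLA can be sketched in a few lines. This is a heavily simplified illustration that ignores DeepSeek's decoupled rotary position embeddings and other details; all names and sizes are assumptions:

```python
import torch

batch, seq_len, d_model = 2, 16, 512
h, d_k, d_latent = 8, 64, 128             # d_latent is much smaller than h * d_k

# Down-projection to a small shared latent; only this latent is cached.
W_down = torch.randn(d_model, d_latent)
# Per-head up-projections reconstruct full keys and values at attention time.
W_up_k = torch.randn(h, d_latent, d_k)
W_up_v = torch.randn(h, d_latent, d_k)

X = torch.randn(batch, seq_len, d_model)
latent = X @ W_down                        # (batch, seq_len, d_latent) -> KV cache

# At attention time, each head derives its own keys/values from the shared latent.
K = torch.einsum("bsl,hlk->bhsk", latent, W_up_k)   # (batch, h, seq_len, d_k)
V = torch.einsum("bsl,hlk->bhsk", latent, W_up_v)
# Cache cost per token: d_latent values, vs. 2 * h * d_k for standard MHA.
```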
| Property | GQA | MLA |
|---|---|---|
| KV cache reduction method | Fewer KV heads | Low-rank compression |
| KV cache reduction (typical) | 4x to 8x | Up to 15x |
| Quality vs. MHA | Slightly below | Comparable or better |
| Implementation complexity | Low (simple head sharing) | Higher (factorized projections) |
| Kernel support | Mature (FlashAttention, PyTorch) | Growing |
| Adopted by | Llama, Mistral, Gemma, Qwen, Falcon | DeepSeek-V2, DeepSeek-V3, DeepSeek-R1 |
Sliding window attention restricts each token to attend only to a fixed window of w neighboring tokens rather than the full context. Mistral 7B combines GQA (G = 8, i.e., 4 query heads per KV head) with sliding window attention (w = 4,096), stacking the two optimizations: a 4x reduction from GQA plus an additional 2x from the windowed cache at an 8K context length, for a combined reduction of roughly 8x in KV cache size compared to a full-MHA model.
GQA can be combined with KV cache quantization, which stores cached keys and values in lower numerical precision (e.g., FP8 or INT4 instead of FP16). These two techniques are orthogonal and multiplicative in their savings: GQA reduces the number of cached vectors, while quantization reduces the size of each vector.
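A quick illustration of how these savings compose; the specific factors are assumptions for a hypothetical configuration:

```python
# Hypothetical example of multiplicative savings from combining techniques.
gqa_reduction = 64 / 8        # 64 query heads cached as 8 KV heads -> 8x
quant_reduction = 16 / 4      # FP16 -> INT4 cache storage          -> 4x

combined = gqa_reduction * quant_reduction
print(f"Combined KV cache reduction: {combined:.0f}x")   # 32x vs. an FP16 MHA cache
```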
PagedAttention, used in serving frameworks like vLLM, manages KV cache memory in non-contiguous pages (similar to virtual memory in operating systems) to reduce memory fragmentation. GQA and PagedAttention work together naturally: GQA reduces the total cache size, and PagedAttention ensures that the allocated memory is used efficiently.
Selecting the right value of G involves balancing several considerations: model quality (more KV heads behave more like MHA), memory and bandwidth savings (fewer KV heads shrink the cache), the requirement that G divide the number of query heads evenly, and, in multi-GPU serving, the ability to shard the KV heads evenly across tensor-parallel devices.
In practice, G = 8 has become a de facto standard for models with 32 or more query heads. Some smaller models use G = 4, while larger models (like Llama 3.1 405B with 128 query heads) still use G = 8, resulting in 16 query heads per group.
GQA is not without drawbacks: quality typically remains slightly below full MHA, since query heads within a group no longer have independent key-value representations; converting an existing MHA checkpoint requires additional uptraining compute; and the number of query heads must be divisible by the number of groups, which constrains architecture and parallelism choices.
Imagine you are in a classroom and the teacher hands out answer sheets. In regular attention (MHA), every student gets their own personal copy of the answer sheet. That works well, but it uses a lot of paper. In multi-query attention (MQA), the teacher prints just one answer sheet and everyone has to share it. That saves a lot of paper, but some students might not find the answers they need because there is only one sheet.
Grouped-query attention is the middle ground. The teacher puts students into small groups of four or five. Each group gets its own answer sheet. This way, you still save a lot of paper (you need only a few sheets instead of one per student), but each group has an answer sheet that is more relevant to their questions. Everyone gets their work done almost as quickly as with one shared sheet, and the quality of answers is almost as good as when everyone had their own.