Multi-head Latent Attention (MLA) is an innovative attention mechanism for transformer models that achieves a 93.3% reduction in key-value cache size while maintaining or exceeding the performance of traditional multi-head attention.[1] Introduced by DeepSeek-AI in May 2024 as part of the DeepSeek-V2 model, MLA addresses the critical memory bottleneck that limits inference throughput and context length in large language models.
Unlike previous approaches such as Multi-Query Attention (MQA) or Grouped-Query Attention (GQA) that reduced or grouped attention heads, MLA uses low-rank joint compression of keys and values into a compact latent representation.[1] This enables models to support 128K token contexts with 5.76x greater inference throughput and 42.5% lower training costs compared to traditional approaches.[1]
The development of MLA emerged from DeepSeek-AI's effort to create economical large-scale language models. In autoregressive text generation, a model generates output one token at a time, with each newly generated token being added to the input sequence for the next step. Traditional multi-head attention mechanisms store full-dimensional key and value matrices for every attention head at every sequence position, creating enormous memory requirements that scale linearly with context length.[2]
In the original Transformer architecture, introduced in the paper "Attention Is All You Need", the model relies on a mechanism called Multi-head Attention (MHA).[3] For each token in an input sequence, the model computes three vectors: a Query (Q), a Key (K), and a Value (V). The attention score between any two tokens is calculated based on the similarity of their Q and K vectors, enabling the model to capture complex, long-range dependencies. The computational complexity of this operation is quadratic with respect to the sequence length, O(N² · d), where N is the sequence length and d is the model dimension.[4]
A naive implementation would require recomputing the K and V vectors for all previous tokens at every single generation step. To avoid this redundant computation, a technique called KV caching is used. The K and V vectors for all tokens in the context are computed once and stored in memory (the KV cache). For each new token, the model only needs to compute its own K and V vectors and append them to the cache.[4]
While this dramatically speeds up inference, it introduces a new bottleneck: memory. For a model with 128 heads and 128-dimensional heads, the system must cache 32,768 values per token per layer. For a 60-layer model processing an 8K context, this consumes over 30 gigabytes of GPU memory just for the key-value cache.[2] This memory bottleneck limits the batch sizes that can be processed simultaneously, directly constraining inference speed and making long-context applications economically infeasible. The size of the KV cache grows linearly with the sequence length and the number of attention heads, and for modern LLMs with billions of parameters, many attention heads, and very long context windows (for example 128,000 tokens or more), the KV cache can quickly exceed the capacity of a single GPU.[5]
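The cache-size figures above follow directly from the model dimensions. A minimal back-of-envelope sketch, using the DeepSeek-V2 shapes cited in the text (128 heads of dimension 128, 60 layers, bf16 values at 2 bytes each):

```python
# Back-of-envelope KV cache sizing for standard multi-head attention.
# Illustrative sketch; shapes follow the figures cited in the text.

def kv_cache_bytes(n_heads, head_dim, n_layers, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence."""
    per_token_per_layer = 2 * n_heads * head_dim  # K and V for every head
    return per_token_per_layer * n_layers * seq_len * bytes_per_elem

# 128 heads x 128 dims -> 32,768 cached values per token per layer
values_per_token_layer = 2 * 128 * 128

# 60 layers, 8K context, bf16 (2 bytes each): 30 GiB just for the cache
cache_gib = kv_cache_bytes(128, 128, 60, 8192) / 2**30
```

The linear growth in `seq_len` is exactly why long-context inference becomes memory-bound before it becomes compute-bound.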
Key researchers Huazuo Gao and Wangding Zeng are credited with the core architectural breakthroughs that enabled MLA's development.[1]
To address the KV cache problem, several variants of MHA were developed as performance-for-efficiency trade-offs. Multi-Query Attention (MQA) shares a single key and value head across all query heads,[6] while Grouped-Query Attention (GQA) shares keys and values within groups of query heads.[8]
However, both methods sacrifice some modeling capacity compared to full multi-head attention.[7] Multi-head Latent Attention (MLA) was developed to strike a better balance between memory efficiency and model fidelity. It compresses the key/value space into a latent vector rather than sharing actual keys, preserving distinct per-head representations and attention flexibility.[9][10]
The table below summarizes the differences between standard multi-head attention and its memory-saving variants:
| Attention Mechanism | Key/Value Sharing Strategy | KV Cache per Token | Relative Memory | Effect on Performance |
|---|---|---|---|---|
| Standard Multi-Head Attention (MHA) | Separate K and V for each head (no sharing) | 2 × n_h × d_h | 100% (baseline) | Baseline accuracy (full modeling capacity) |
| Multi-Query Attention (MQA) | One shared K and V across all heads | 2 × d_h | ~3% (if n_h=32) | Some quality drop (single KV limits attention diversity)[6] |
| Grouped-Query Attention (GQA) | K and V shared within groups of heads | 2 × n_g × d_h | ~12.5% (if 8 heads/group) | Slight quality drop (better than MQA, still below MHA)[8] |
| Multi-Head Latent Attention (MLA) | No direct sharing; uses learned low-rank latent vectors for K and V | (d_c + d^R_h) | ~6–10% of MHA | Minimal to no loss; can improve performance[4][7] |
Notes: n_h = number of heads; n_g = number of groups. Memory percentages are illustrative; actual savings depend on model settings (e.g. number of heads or latent dimension). The performance effect assumes comparable model size and training; MLA's learned projections give it greater expressiveness than fixed sharing in MQA/GQA, often closing the gap with or exceeding MHA.[10]
MLA fundamentally reimagines how transformer models cache and retrieve attention information. The core idea behind MLA is to use low-rank approximation to compress the information needed for the Key and Value vectors.[9] MLA introduces a shared down-projection matrix and separate up-projection matrices, where the compressed latent dimension is much smaller than the full K and V dimensions combined.[9] The mechanism operates through three key phases: the hidden state is first down-projected into a compact latent vector; only this latent vector is stored in the KV cache; and full per-head keys and values are reconstructed from the latent on the fly during attention computation.
This workflow effectively trades increased computational cost (more matrix multiplications for the down- and up-projections) for reduced memory access cost (loading a much smaller cache from memory).[5][4] On modern hardware like GPUs, which are often limited by memory bandwidth rather than raw compute power, this is a highly favorable trade-off that leads to significant gains in overall inference speed.[11]
Given an input hidden state h_t ∈ ℝ^d for a token at timestep t, where d is the model dimension, standard MHA computes the Query (q_t), Key (k_t), and Value (v_t) vectors using projection matrices W_Q, W_K, W_V ∈ ℝ^(n_h·d_h × d):
q_t = W_Q h_t
k_t = W_K h_t
v_t = W_V h_t
Here, n_h is the number of heads and d_h is the dimension per head. The resulting vectors are then reshaped into n_h separate heads, and the attention output is calculated using the scaled dot-product attention formula.[3]
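The projection and scaled dot-product steps above can be sketched for a single head in NumPy (toy dimensions, random weights; this is an illustration of the formulas, not any model's implementation):

```python
import numpy as np

# Minimal single-head scaled dot-product attention, matching the
# formulas above: q_t = W_Q h_t, k_t = W_K h_t, v_t = W_V h_t.

def scaled_dot_product_attention(Q, K, V):
    d_h = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_h)  # similarity of queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V

rng = np.random.default_rng(0)
N, d = 4, 8                          # sequence length, head dimension (toy)
h = rng.standard_normal((N, d))      # hidden states h_t for each token
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = h @ W_Q, h @ W_K, h @ W_V  # per-token Q/K/V projections
out = scaled_dot_product_attention(Q, K, V)  # shape (N, d)
```

In a multi-head setting the same computation runs in parallel over `n_h` such heads, and during generation `K` and `V` are the cached matrices that MLA compresses.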
MLA modifies the computation of the Key and Value vectors through low-rank compression. It replaces the standard W^K and W^V projection matrices with a low-rank factorization that produces a shared latent representation for keys and values.[4] MLA introduces a shared down-projection matrix W^{DKV} ∈ ℝ^(d_c × d) and separate up-projection matrices W^{UK}, W^{UV} ∈ ℝ^(n_h·d_h × d_c), where d_c is the compressed latent dimension and typically d_c ≪ n_h·d_h.
1. Latent KV Compression: The hidden state h_t is first projected into a shared, compressed latent vector c^{KV}_t ∈ ℝ^{d_c}:
c^{KV}_t = W^{DKV} h_t
This small vector c^{KV}_t is what gets stored in the KV cache. Importantly, the same compressed vector c^{KV}_t is used for all heads' keys and values, meaning only d_c values per token need to be stored in the KV cache (instead of storing K_t and V_t for every head).[1][7]
2. Attention Computation with Latent States: The Query vector q_t is computed as in standard MHA. The full Key and Value vectors are reconstructed from the latent vector c^{KV}_t on-the-fly:
q_t = W_Q h_t
k_t = W^{UK} c^{KV}_t
v_t = W^{UV} c^{KV}_t
The two up-projection matrices W^{UK}, W^{UV} ∈ ℝ^(n_h·d_h × d_c) then map this latent vector back to the high-dimensional key and value spaces, yielding the keys k_t = W^{UK} c^{KV}_t and values v_t = W^{UV} c^{KV}_t, which can be split into per-head vectors.[9] The attention mechanism then proceeds with these reconstructed q_t, k_t, and v_t vectors as in standard MHA.[1]
For the query projection, MLA can similarly apply a low-rank factorization (W^{DQ} and W^{UQ}) to compress queries during training, though query vectors are not cached during inference:[9]
c^Q_t = W^{DQ} h_t
q^C_t = W^{UQ} c^Q_t
where c^Q_t ∈ ℝ^{d'_c}, W^{DQ} ∈ ℝ^(d'_c × d), and W^{UQ} ∈ ℝ^(n_h·d_h × d'_c).
The joint compression of K and V into a single latent vector is a key design choice. It forces the model to learn a more general and compact representation from which both the token's identity (Key) and its content (Value) can be derived. This constraint can act as a form of regularization, which may contribute to MLA's ability to maintain or even improve performance over MHA.[1]
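The compression and reconstruction equations above can be sketched in NumPy. Dimensions are toy values and the matrices are random stand-ins for learned projections; the point is the shape arithmetic, with d_c much smaller than n_h·d_h:

```python
import numpy as np

# Sketch of MLA's low-rank joint KV compression, following the
# equations above. Toy dimensions; weights here are random stand-ins.

rng = np.random.default_rng(0)
d, n_h, d_h, d_c = 64, 8, 16, 8  # model dim, heads, head dim, latent dim

W_DKV = rng.standard_normal((d_c, d)) * 0.1         # shared down-projection
W_UK = rng.standard_normal((n_h * d_h, d_c)) * 0.1  # key up-projection
W_UV = rng.standard_normal((n_h * d_h, d_c)) * 0.1  # value up-projection

h_t = rng.standard_normal(d)             # hidden state for one token

c_kv = W_DKV @ h_t                       # latent: the ONLY thing cached
k_t = (W_UK @ c_kv).reshape(n_h, d_h)    # keys reconstructed per head
v_t = (W_UV @ c_kv).reshape(n_h, d_h)    # values reconstructed per head

cached = c_kv.size       # 8 values per token under MLA
full = 2 * n_h * d_h     # 256 values per token under standard MHA
```

Because each head has its own slice of `W_UK` and `W_UV`, the reconstructed keys and values differ per head even though every head reads the same cached latent.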
In DeepSeek-V2, hyperparameters include n_h = 128, d_h = 128, d_c = 512, d'_c = 1,536, and a decoupled per-head dimension d^R_h = 64.[1] The 16,384-dimensional key representation (n_h·d_h) and the equally sized value representation thus compress into a single shared 512-dimensional latent, a 32-fold reduction per matrix.[1] Each attention head gets its own unique up-projection, preserving the expressiveness of having distinct keys and values per head despite sharing the same compact latent representation. Additional RMSNorm layers and scaling factors are applied at bottlenecks in DeepSeek-V3.[12]
MLA is implemented using an improved version of FlashAttention-2 for faster computation.[13]
| Parameter | DeepSeek-V2 | DeepSeek-V3 |
|---|---|---|
| KV compression dimension | 512 | 512 |
| Query compression dimension | 1,536 | 1,536 |
| Number of attention heads | 128 | 128 |
| Head dimension | 128 | 128 |
| Standard cache size per token | 32,768 | 32,768 |
| MLA cache size per token | 576 | 576 |
| Cache reduction | 98.2% | 98.2% |
A critical technical challenge MLA solves is compatibility with Rotary Position Embeddings (RoPE), a widely-used positional encoding that applies rotation matrices to queries and keys.[1][14] Modern LLMs heavily rely on Rotary Positional Embeddings (RoPE) to encode the relative position of tokens. RoPE works by rotating the Query and Key vectors based on their absolute position in the sequence.[15]
However, applying RoPE in MLA requires special handling, because the rotation operations on keys and queries interfere with the low-rank decomposition.[4][7] Standard RoPE is incompatible with low-rank compression because the position-dependent rotations prevent weight matrices from being absorbed and merged during inference. A direct application of RoPE after the up-projection would require decompressing the entire KV cache at every generation step, defeating the purpose of the compression.[16]
To solve this, the DeepSeek-V2 paper introduced Decoupled RoPE.[15][4] This technique splits the Q and K heads into two parts: one that carries positional information (the "rope" part) and one that does not (the "nope" part).[7] In practice, an additional set of auxiliary query and key vectors is maintained solely for applying RoPE, while the primary key latent remains unrotated.[7]
The model maintains additional multi-head query vectors q^R and a shared key k^R specifically for positional information. These receive RoPE transformations while compressed components remain position-independent.[1] The solution uses additional multi-head queries q^R_{t,i} ∈ ℝ^{d^R_h} and a shared key k^R_t ∈ ℝ^{d^R_h}:
q^R_t = RoPE(W^{QR} c^Q_t)
k^R_t = RoPE(W^{KR} h_t)
where W^{QR} ∈ ℝ^(d^R_h·n_h × d'_c) and W^{KR} ∈ ℝ^(d^R_h × d). Final queries and keys are formed by concatenation:
q_{t,i} = [q^C_{t,i}; q^R_{t,i}]
k_{t,i} = [k^C_{t,i}; k^R_t]
Attention is then computed with a scaled denominator:
o_{t,i} = Σ_j Softmax_j((q_{t,i}^T k_{j,i})/√(d_h + d^R_h)) v^C_{j,i}
The total KV cache per token becomes (d_c + d^R_h)·l, including the decoupled key.[7] The shared RoPE key k^R_t is broadcast to all attention heads, adding only 64 dimensions to the cache per token.[1] This architectural decision allows weight matrices for the non-RoPE components to be pre-multiplied and absorbed during inference, eliminating the need to explicitly materialize full key and query matrices. The decoupled RoPE technique enables MLA to be compatible with positional encodings and long-context support.
The positional rotations are applied only to the designated "rope" part of the vectors. This allows the model to efficiently integrate positional information without interfering with the compression and decompression process, preserving the efficiency gains of MLA. The necessity of a specialized solution like Decoupled RoPE highlights an important trend in advanced model design: as individual components become more optimized, they lose their modularity. An efficient attention mechanism like MLA is not a simple "drop-in" replacement for MHA but an architectural shift that can have cascading effects on other parts of the model, such as positional encodings.
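The split-and-concatenate structure of Decoupled RoPE can be sketched as follows. The `rope` helper here is a generic rotary-embedding implementation and the dimensions are toy values, not DeepSeek's kernels:

```python
import numpy as np

# Sketch of decoupled RoPE: rotation is applied only to a small
# auxiliary "rope" part; the compressed "nope" part stays unrotated.

def rope(x, pos, base=10000.0):
    """Rotate paired halves of x by position-dependent angles."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angle = pos * freqs
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * np.cos(angle) - x2 * np.sin(angle),
                           x1 * np.sin(angle) + x2 * np.cos(angle)], axis=-1)

rng = np.random.default_rng(0)
d_h, d_h_rope = 16, 4   # content ("nope") dim, decoupled rope dim (toy)
pos = 5                 # token position in the sequence

q_c = rng.standard_normal(d_h)                   # content query (latent path)
k_c = rng.standard_normal(d_h)                   # content key (latent path)
q_r = rope(rng.standard_normal(d_h_rope), pos)   # rotated rope query
k_r = rope(rng.standard_normal(d_h_rope), pos)   # rotated shared rope key

q = np.concatenate([q_c, q_r])                   # q_{t,i} = [q^C; q^R]
k = np.concatenate([k_c, k_r])                   # k_{t,i} = [k^C; k^R]
score = q @ k / np.sqrt(d_h + d_h_rope)          # scaled denominator above
```

Only `k_r` (64 dimensions in DeepSeek-V2) joins the latent in the cache; the content key never receives a position-dependent rotation, which is what keeps the up-projection absorbable.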
During inference, MLA avoids expensive full reconstruction of keys and values at every step by computing attention in the latent space when possible. In particular, DeepSeek's implementation uses a "weight absorption" trick: the key up-projection matrix can be merged with the query projection for computing attention scores, so that the dot-product QK^T is effectively calculated in the compressed d_c-dimensional space.[10]
MLA enables significant runtime optimization through weight absorption: merged weight matrices are pre-computed once at model loading, so the non-RoPE projections need not be applied separately at every decoding step.[2] Concretely, for a query q_i and a cached latent c^{KV}_j from a past token j, the attention score q_i^T k_j can be written as q_i^T W^{UK} c^{KV}_j, which equals ((W^{UK})^T q_i)^T c^{KV}_j. The term (W^{UK})^T q_i is a transformed query of dimension d_c, so the dot-product involves two d_c-dimensional vectors and the full keys never need to be materialized.[10] This optimization means the softmax attention is computed over compressed representations.
Only after obtaining the attention weights does the model apply them to the full value vectors, which are reconstructed by V_t = c^{KV}_t W^{UV} as needed.[9] By minimizing how often the large up-projection is applied, MLA achieves faster inference than a naïve implementation. This reduces memory bandwidth by approximately 32-fold while requiring only a one-time pre-computation of merged weights at model loading.[2] In summary, the KV cache stores only compressed vectors, and the model performs most computations in the low-dimensional latent space, decompressing to full per-head values only at the final combination stage.
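The absorption identity is easy to verify numerically. A minimal sketch with toy shapes (random matrices standing in for learned weights):

```python
import numpy as np

# Numerical check of the weight-absorption identity from the text:
# q^T (W_UK c) == ((W_UK)^T q)^T c, so attention scores can be
# computed directly against the cached d_c-dimensional latents.

rng = np.random.default_rng(0)
d_h, d_c = 128, 16                         # head dim, latent dim (toy)

W_UK = rng.standard_normal((d_h, d_c))     # per-head key up-projection
q = rng.standard_normal(d_h)               # query for the current token
c_cache = rng.standard_normal((100, d_c))  # cached latents, 100 past tokens

# Naive path: decompress every cached key to d_h, then dot with q
scores_naive = (c_cache @ W_UK.T) @ q

# Absorbed path: transform q once into latent space, never decompress
q_absorbed = W_UK.T @ q                    # d_c-dimensional query
scores_fast = c_cache @ q_absorbed

assert np.allclose(scores_naive, scores_fast)
```

The naive path touches a `(100, d_h)` key matrix per head; the absorbed path reads only the shared `(100, d_c)` latent cache, which is where the memory-bandwidth saving comes from.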
MLA has been shown to greatly improve memory efficiency and inference speed for large language models. MLA achieves a remarkable transformation in attention mechanics that breaks the traditional performance-efficiency tradeoff. The DeepSeek-V2 MoE model (236B parameters) demonstrated that using MLA could shrink KV cache storage by 93.3% and achieve over 5× higher generation throughput, compared to a baseline with standard attention, while maintaining strong accuracy.[1] DeepSeek-V3, an upgraded model released in 2025, also adopted MLA along with other innovations to enable 128K token context lengths for long-text reasoning.[7]
Benchmark results demonstrate that MLA's performance impact is minimal or even positive:[1] it retained model quality better than GQA/MQA alternatives in ablation studies, and DeepSeek reported that adding MLA actually *improved* certain benchmarks relative to their previous model version.[4][10] Beyond DeepSeek's own models, external researchers have explored applying MLA to other LLMs.
Real-world production benchmarks from vLLM implementations show 3.4x throughput improvements on 8 H200 GPUs specifically attributable to MLA, with additional 40% gains from FP8 quantization optimizations.[17]
| Metric | DeepSeek 67B | DeepSeek-V2 | Improvement |
|---|---|---|---|
| MMLU score | 71.3 | 78.5 | +7.2 points |
| BBH score | 68.7 | 78.9 | +10.2 points |
| Generation throughput | ~8,700 tps | 50,000+ tps | 5.76x |
| Training GPU hours/trillion tokens | 300,600 | 172,800 | 42.5% reduction |
| Activated parameters | 67B | 21B | 68.7% reduction |
Beyond raw throughput, MLA fundamentally changes the computational profile of attention. Traditional attention is memory-bound—performance limited by memory bandwidth rather than compute capacity—with an arithmetic intensity around 1 FLOP per byte of memory accessed.[10]
MLA increases arithmetic intensity to approximately 235 FLOPs per byte at practical sequence lengths, transforming attention into a compute-bound operation that fully utilizes GPU tensor cores; this shift is clearly visible in roofline analysis.[10]
The training efficiency improvements prove equally substantial. DeepSeek-V2 required 172,800 GPU hours per trillion tokens trained, compared to 300,600 hours for DeepSeek 67B—a 42.5% reduction.[1] For the full training run of 8.1 trillion tokens, this translates to real cost savings of millions of dollars.
DeepSeek-V3, with 671B parameters, trained for just 2.788 million H800 GPU hours at an estimated cost of $5.576 million[12]—roughly 100-fold less than the billion-dollar budgets reported for comparable models from major tech companies.
The lineage from Multi-Head Attention to MLA spans seven years of transformer optimization research:[4]
| Mechanism | Cache size per token | Cache size formula | Quality | Typical models |
|---|---|---|---|---|
| MHA | 2 × n_h × d_h | 2n_h d_h l | Baseline | GPT-2, GPT-3 |
| MQA | 2 × d_h | 2d_h l | -5 to -10% | PaLM, Falcon |
| GQA | 2 × n_g × d_h | 2n_g d_h l | -1 to -2% | LLaMA-2, LLaMA-3, Mistral |
| MLA | (d_c + d_h^R) | (d_c + d^R_h)l ≈ 9/2 d_h l | +1 to +2% | DeepSeek-V2, DeepSeek-V3 |
Note: n_h = number of heads; n_g = number of groups; l = number of layers; d_h = head dimension; d_c = compressed latent dimension; d^R_h = decoupled RoPE dimension.
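The per-token formulas in the table can be evaluated directly with DeepSeek-V2's published dimensions (the GQA group count here is an illustrative value, since GQA models vary):

```python
# Cache-per-token formulas from the table above, evaluated with
# DeepSeek-V2's published dimensions.

n_h, d_h, d_c, d_h_r = 128, 128, 512, 64
n_g = 8  # illustrative GQA group count (models commonly use 8-24)

mha = 2 * n_h * d_h   # 32,768 values per token per layer
mqa = 2 * d_h         # 256
gqa = 2 * n_g * d_h   # 2,048
mla = d_c + d_h_r     # 576 (latent plus decoupled RoPE key)

reduction = 1 - mla / mha  # ~98.2% vs. MHA, matching the figures above
```

Multiplying any of these by the layer count l gives the per-token cache for a full model, as in the formula column.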
MLA achieves cache sizes equivalent to GQA with only 2.25 groups—far more aggressive compression than typically used (GQA models commonly use 8-24 groups). Yet remarkably, MLA demonstrates superior performance to even standard MHA on multiple benchmarks.[1]
Theoretical analysis proves that any GQA configuration can be represented as an MLA configuration with equivalent KV cache overhead, but the reverse is not true: MLA possesses strictly greater expressive power.[5] The key difference lies in the mechanism: GQA forces the heads within a group to share identical cached keys and values, whereas MLA reconstructs distinct per-head keys and values from the shared latent vector.
The computational trade-offs favor different mechanisms in different regimes. MLA requires approximately 4x more floating-point operations than MHA due to the compression and decompression matrix multiplications.[2] However, modern GPU inference is memory-bandwidth-limited, not compute-limited. MLA's reduced memory traffic more than compensates for increased operations at practical sequence lengths of 8K-128K tokens.
DeepSeek-AI has deployed MLA across its entire model family since introducing the mechanism:
- DeepSeek-V2 (May 2024)[1]
- DeepSeek-V3 (December 2024)[12]
- DeepSeek-R1 (2025)
- DeepSeek-V3.2-Exp (2025)
The TransMLA framework, published in January 2025 and accepted as a spotlight presentation at NeurIPS 2025, enables retrofitting GQA-based models to use MLA architecture without training from scratch.[5] For example, Meng et al. (2025) introduced TransMLA, a method to convert Grouped-Query Attention models (such as Llama-2) into MLA-based models. In their experiments, converting a 7B-parameter GQA model to MLA compressed ~93% of the KV cache and yielded a 10.6× inference speedup at an 8K context length, without significant output quality loss after a brief fine-tuning.[5] This highlights MLA's practical benefit for enabling longer contexts and faster generation in resource-constrained deployments.
After conversion, the TransMLA framework demonstrates the following results:[5]
| Metric | Original GQA | After MLA conversion |
|---|---|---|
| KV cache size | 100% | 7% |
| Inference throughput (8K context) | 1.0x | 10.6x |
| MMLU score | 45.3 | 45.1 |
| Fine-tuning tokens required | N/A | 6 billion |
| Fine-tuning GPU hours | N/A | 100-200 |
MLA's successful use in DeepSeek and follow-up works suggests that learned latent attention could become a common technique to scale sequence lengths in future large language models. By balancing memory savings with modeling flexibility, multi-head latent attention allows models to handle long sequences more efficiently than standard multi-head attention, without the severe accuracy trade-offs of earlier methods.
Production deployments of MLA-based models span multiple cloud platforms.[18]
The inference framework ecosystem has matured rapidly around MLA, with optimized support in serving engines such as vLLM.[17]
PyTorch implementations of MLA are available, such as in Sebastian Raschka's "LLMs-from-Scratch" repository, demonstrating memory savings (for example 75% vs. MHA) and speedups.[19]
These frameworks implement specialized operators including broadcasted batched matrix multiplication and weight absorption optimizations critical for realizing MLA's theoretical speedups.
Production deployment patterns for DeepSeek-V3 distinguish two phases of inference:[12]
- Prefill operations (processing input prompts)
- Decoding operations (generating output tokens)
Despite proven benefits, MLA faces implementation complexity that limits broader adoption.[2]
The decoupled RoPE mechanism adds mathematical and implementation complexity. Broadcasting operations for the shared RoPE key across all attention heads require efficient GPU kernels; without proper optimization, virtual replication can introduce performance overhead.[2]
Major AI companies remain invested in GQA-based architectures despite MLA's demonstrated advantages; OpenAI, Anthropic, Google, and Meta have made no public announcements regarding MLA adoption.[4]
The TransMLA framework reduces barriers for open-source models, but closed-source developers may wait for hardware vendors to provide native MLA acceleration before investing in migration.[5]
Quality concerns center on the effects of lossy low-rank compression on specific capabilities. While aggregate benchmark scores match or exceed MHA, detailed ablation studies directly comparing MLA and GQA at equivalent cache sizes remain limited.[4]
Some researchers note that training perplexity occasionally shows slight degradation with MLA, though this could indicate beneficial regularization rather than harmful information loss. The optimal compression ratio remains an open hyperparameter requiring empirical tuning per model architecture and scale. There is a theoretical risk that important, fine-grained information could be lost during the down-projection to the latent space. The performance of the model becomes dependent on the choice of the latent dimension d_c, which is a critical hyperparameter balancing compression ratio and information fidelity.[9][4]
MLA is an architectural feature that must be incorporated during a model's pre-training. Converting an existing model pre-trained with MHA or GQA to use MLA is a non-trivial task that can lead to a significant drop in performance without extensive fine-tuning.[5] This creates a significant "path dependency" and a barrier to adoption, as organizations with large investments in existing GQA-based models may be hesitant to retrain them from scratch. The development of post-training conversion methods like TransMLA is a direct response to this challenge.[5]
MLA's low-rank factorization approach builds on decades of research into matrix compression and efficient neural architectures. The key insight is that the key-value matrices in transformer attention, while high-dimensional, can be well-approximated through a compressed bottleneck without significant information loss.[1] This compression is applied jointly to keys and values, making MLA particularly effective for Mixture-of-Experts (MoE) models like DeepSeek-V2, which has 236B total parameters but activates only 21B per token.
The technique is rooted in matrix approximation methods like singular value decomposition (SVD), where a matrix M ∈ ℝ^(n×m) is approximated as M ≈ UV with U ∈ ℝ^(n×r), V ∈ ℝ^(r×m), and r ≪ n, m.[9]
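The SVD-based rank-r approximation described above can be demonstrated in a few lines of NumPy. This is an illustration of the general matrix-factorization idea, not of MLA's learned projections, which are trained end to end rather than computed by SVD:

```python
import numpy as np

# Rank-r approximation via truncated SVD: M ≈ U V with r << n, m.
# Illustrative only; MLA learns its factors, it does not run SVD.

rng = np.random.default_rng(0)
n, m, r = 32, 24, 4

# Construct a matrix that is exactly rank r, so the truncation is lossless
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))

U_full, s, Vt = np.linalg.svd(M, full_matrices=False)
U = U_full[:, :r] * s[:r]   # U in R^(n x r), columns scaled by singular values
V = Vt[:r, :]               # V in R^(r x m)
M_approx = U @ V            # rank-r reconstruction

err = np.linalg.norm(M - M_approx) / np.linalg.norm(M)  # ~0 here
```

For a matrix whose true rank exceeds r the truncation discards the smallest singular values, which is the information-loss trade-off the latent dimension d_c controls in MLA.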
Mathematical analysis proves that for any GQA configuration with n_g groups, there exists an MLA configuration with equivalent or smaller KV cache that can represent the same function class. Conversely, MLA can express attention patterns impossible for GQA to represent, establishing strict theoretical superiority in modeling capacity.[5]
The expressive power advantage stems from MLA's ability to reconstruct distinct keys and values for each attention head from the shared latent representation. While GQA forces heads to share identical KV matrices, MLA's per-head up-projection matrices W^{UK}_i and W^{UV}_i enable each head to extract different information from the compressed representation.[1]
This architectural choice preserves the diversity of attention patterns that makes multi-head attention effective while dramatically reducing the cached information. The compression bottleneck may provide regularization that prevents overfitting to spurious correlations in attention patterns.[1]
Theoretical proofs in TransMLA show MLA's superior expressive power over GQA under equivalent KV overhead, as any GQA layer can be expressed as MLA, but not vice versa.[5]
Hardware optimization represents a critical frontier for MLA adoption, and the DeepSeek technical reports explicitly request hardware features to better support the mechanism.[12] Alternative attention architectures also continue to emerge.[15]
Subsequent works include optimizations for economical inference and integrations with sparse attention.[20][21]
The TransMLA framework's published roadmap outlines further extensions and optimizations.[5]
MLA enables researchers and small organizations to train frontier-scale models within constrained budgets, breaking the monopoly on cutting-edge AI held by the wealthiest technology companies.[12] This accessibility may accelerate innovation by diversifying the pool of researchers working on fundamental questions in large language model development and deployment.
The term "latent attention" is used in different contexts, and it is important to distinguish MLA from other mechanisms that use similar terminology.
The Perceiver architecture, introduced by DeepMind, also uses a "latent array" and an attention bottleneck.[22] However, its purpose is fundamentally different. The Perceiver is designed to handle extremely high-dimensional and multi-modal inputs (such as images, audio, and point clouds) that are too large for a standard Transformer. It uses an asymmetric cross-attention mechanism where a small, fixed-size latent array iteratively "queries" the massive input data to distill it into a manageable representation.[22]
The bottleneck in the Perceiver serves to decouple the network's depth from the input's size, solving a problem at the input encoding stage. In contrast, MLA's latent vector is a mechanism for compressing the internal KV cache to improve inference efficiency for standard sequential data, addressing a bottleneck in the model's state management during generation.
Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning (PEFT) of large models. LoRA also uses low-rank matrix decomposition, but for a different purpose.[23] It freezes the large, pre-trained weight matrices of a model and injects small, trainable low-rank "adapter" matrices alongside them. During fine-tuning, only these small adapters are updated, dramatically reducing the number of trainable parameters.[24]
MLA, on the other hand, uses low-rank decomposition as a permanent, integral part of the model's core architecture to optimize inference, not as a temporary adapter for efficient training. MLA draws from low-rank adaptations like LoRA and efficient attention variants.[23]