FlashMLA
Last reviewed
Jun 9, 2026
Sources
10 citations
Review status
Source-backed
Revision
v8 · 2,228 words
FlashMLA is an open-source GPU kernel from DeepSeek that speeds up the decoding step of Multi-head Latent Attention (MLA), the attention variant DeepSeek uses to shrink the KV cache in its large language models. DeepSeek released it on 24 February 2025 as the first project of its "Open Source Week," a run of five daily code drops pulled straight from the company's production infrastructure. The kernel targets NVIDIA Hopper GPUs, handles variable-length sequences and a paged KV cache, and supports the BF16 and FP16 numeric formats. On an H800 SXM5 card DeepSeek reports up to 3000 GB/s of memory bandwidth in memory-bound settings and up to 580 TFLOPS in compute-bound settings. [1][2][3]
The project sits at the intersection of two trends in efficient inference. One is the attention redesign DeepSeek introduced with DeepSeek-V2 to cut the memory cost of long contexts. The other is the family of fused attention kernels, led by FlashAttention, that keep the GPU busy by avoiding round trips to off-chip high-bandwidth memory (HBM), which is far slower than on-chip memory. FlashMLA applies the second idea to the first, and it does so for the specific shape of attention that MLA produces during token-by-token generation. [1][4]
DeepSeek announced Open Source Week on 21 February 2025 and began publishing repositories on the 24th. The company framed the effort plainly. Its index repository describes the team as "a tiny team @deepseek-ai pushing our limits in AGI exploration" and says the releases are "humble building blocks of our online service: documented, deployed and battle-tested in production. No vaporware, just sincere code." Over the week DeepSeek shipped FlashMLA on day one, the DeepEP expert-parallel communication library on day two, the DeepGEMM FP8 matrix-multiply library on day three, a set of parallelism tools including DualPipe and EPLB on day four, and the Fire-Flyer File System on day five. [2][5]
The day-one announcement was specific about what FlashMLA does. DeepSeek called it "our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production." That last phrase matters. FlashMLA is not a research prototype. It is the code that serves DeepSeek's own chat and API traffic, which means it has to cope with real workloads where requests arrive at different lengths and batch sizes shift from moment to moment. [2]
MLA itself predates the kernel. DeepSeek introduced Multi-head Latent Attention in the DeepSeek-V2 technical report in May 2024 and carried it forward into DeepSeek-V3 and the R1 reasoning model. The motivation is the KV cache. During autoregressive generation a transformer stores the key and value vectors for every past token so it does not recompute them at each step. That cache grows with sequence length and batch size, and on long contexts it can dominate GPU memory, which caps how many requests a server can run at once. [4][6]
Standard multi-head attention gives every attention head its own keys and values, so the cache holds a full set of key and value vectors per head per token. Grouped-query attention reduces that by letting several query heads share one key-value head, which is the path used by models such as Llama 2 70B and many others. Multi-query attention pushes the same idea to a single shared key-value head. Both trade a little quality for a smaller cache. [4]
MLA takes a different route. Instead of cutting the number of key-value heads, it compresses the keys and values themselves. The DeepSeek-V2 paper describes a low-rank joint compression that projects the hidden state down into a single small latent vector. Only that latent vector goes into the cache. When the model needs the actual keys and values for an attention step, it reconstructs them from the latent vector through up-projection matrices. The cache therefore stores one compact representation per token rather than a full set of per-head key and value vectors. [4]
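The compress-then-reconstruct flow described above can be sketched in a few lines of plain Python. This is a toy illustration, not DeepSeek's implementation: the sizes are tiny, the weights are random rather than trained, and every name (`W_down`, `W_up_k`, `W_up_v`, `latent_cache`) is hypothetical.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Random weights standing in for trained projection matrices."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy sizes chosen only for illustration; real models are far larger.
D_MODEL, D_LATENT, N_HEADS, HEAD_DIM = 16, 4, 2, 8

rng = random.Random(0)
W_down = random_matrix(D_LATENT, D_MODEL, rng)             # hidden state -> latent
W_up_k = random_matrix(N_HEADS * HEAD_DIM, D_LATENT, rng)  # latent -> all heads' keys
W_up_v = random_matrix(N_HEADS * HEAD_DIM, D_LATENT, rng)  # latent -> all heads' values

latent_cache = []  # the only per-token state that is stored

def append_token(hidden_state):
    """Compress a token's hidden state and store just the latent vector."""
    latent_cache.append(matvec(W_down, hidden_state))

def reconstruct_keys_values():
    """Expand every cached latent back into full keys and values."""
    keys = [matvec(W_up_k, c) for c in latent_cache]
    values = [matvec(W_up_v, c) for c in latent_cache]
    return keys, values

for _ in range(3):
    append_token([rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)])

keys, values = reconstruct_keys_values()
cached = len(latent_cache) * D_LATENT
uncompressed = len(keys) * (len(keys[0]) + len(values[0]))
print(cached, uncompressed)  # 12 96
```

With these toy sizes the cache holds 12 numbers for three tokens, where storing the reconstructed keys and values directly would take 96.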
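Rotary position embedding (RoPE) does not fit this scheme cleanly. RoPE applies a position-dependent rotation to queries and keys. If that rotation were applied to keys reconstructed from the latent vector, the key up-projection could no longer be folded into the query projection at inference time, and the model would have to recompute keys for every cached token at each step. DeepSeek handled this with a decoupled RoPE design that carries the rotary signal on extra dedicated query and key dimensions, kept separate from the compressed part. The result keeps positional information without forcing the model to cache full keys and values. The only extra cached item is a small rotary key shared across heads. [4]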
The payoff is large. DeepSeek-V2 reports that it cuts the KV cache by 93.3 percent relative to DeepSeek 67B, the company's earlier dense model, which uses grouped-query attention, and that it lifts maximum generation throughput to 5.76 times that of the earlier model. Those numbers come from the DeepSeek-V2 paper and describe the model and its architecture, not the FlashMLA kernel by itself. [4]
| Attention variant | What gets cached | Relative KV cache | Used by |
|---|---|---|---|
| Multi-head attention (MHA) | Full keys and values per head | Largest | GPT-style models |
| Grouped-query attention (GQA) | Keys and values per head group | Reduced | Llama 2 70B, DeepSeek 67B, and many others |
| Multi-query attention (MQA) | One shared key-value head | Smallest | PaLM and others |
| Multi-head latent attention (MLA) | One compressed latent vector plus a small shared rotary key | Small, comparable to GQA with about 2.25 groups in DeepSeek-V2 | DeepSeek-V2, DeepSeek-V3, DeepSeek-R1 |
The table is a qualitative summary. Exact ratios depend on head counts, dimensions, and the compression rank, and they vary by model.
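To make the dependence on dimensions concrete, the following sketch counts cached elements per token per layer for each variant. The dimensions are assumptions chosen for illustration (128 heads of size 128, 8 GQA groups, a 512-wide latent, and a 64-wide decoupled rotary key), and the function name is hypothetical. The MLA count includes the rotary key because the decoupled RoPE design caches it alongside the latent vector.

```python
def kv_elements_per_token(variant, n_heads, head_dim,
                          n_kv_groups=1, latent_dim=0, rope_dim=0):
    """Cached elements per token per layer for one attention variant."""
    if variant == "MHA":
        return 2 * n_heads * head_dim      # keys and values for every head
    if variant == "GQA":
        return 2 * n_kv_groups * head_dim  # keys and values per head group
    if variant == "MQA":
        return 2 * head_dim                # one shared key-value head
    if variant == "MLA":
        return latent_dim + rope_dim       # latent vector plus rotary key
    raise ValueError(f"unknown variant: {variant}")

# Assumed dimensions, for illustration only.
dims = dict(n_heads=128, head_dim=128, n_kv_groups=8,
            latent_dim=512, rope_dim=64)

sizes = {v: kv_elements_per_token(v, **dims)
         for v in ("MHA", "GQA", "MQA", "MLA")}

for name, size in sizes.items():
    print(f"{name}: {size} elements, {size / sizes['MHA']:.2%} of MHA")
```

With these assumed sizes the MLA entry holds 576 elements against 32,768 for MHA, about 1.8 percent. The model-level 93.3 percent figure from the DeepSeek-V2 paper is a different comparison, against a different baseline model, so the two numbers should not be expected to match.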
The compression that makes MLA cheap on memory also changes the math the GPU has to run at decode time. The cached latent vector has to be expanded back into keys and values, and the attention computation works over tensor shapes that differ from plain multi-head or grouped-query attention. A generic attention kernel can run MLA, but it will not lay out the work in the way that keeps a Hopper GPU's tensor cores and memory pipeline near their limits. A purpose-built kernel can.
Decoding is also the hard case for utilization. During prefill, when the model reads a prompt, attention processes many tokens at once and the GPU has plenty of parallel work. During decode, the model generates one token per step per sequence, so each step has very little compute relative to the memory it must touch. That makes decode memory-bound, and it makes batching across many concurrent requests the main lever for throughput. FlashMLA is built for exactly this regime, which is why DeepSeek describes it as a decoding kernel for variable-length serving rather than a general attention library. [1][2]
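A rough way to see the gap between prefill and decode is to count floating-point operations per byte of key and value data read. The sketch below uses a deliberately simplified model of one head of plain attention: it ignores causal masking, softmax cost, and query and output traffic, and it assumes 2-byte elements. It is not a model of MLA or of FlashMLA itself.

```python
def attention_intensity(query_tokens, context_len, head_dim, bytes_per_elem=2):
    """Approximate FLOPs per byte of K/V traffic for one attention head."""
    # Q @ K^T and P @ V each cost about 2 * q * L * d operations.
    flops = 4 * query_tokens * context_len * head_dim
    # Keys and values are each read once: 2 * L * d elements in total.
    kv_bytes = 2 * context_len * head_dim * bytes_per_elem
    return flops / kv_bytes

decode = attention_intensity(query_tokens=1, context_len=4096, head_dim=128)
prefill = attention_intensity(query_tokens=4096, context_len=4096, head_dim=128)
print(decode, prefill)  # 1.0 4096.0
```

Under this model a decode step does about one operation per byte fetched, while prefill over the same context does thousands, which is why decode performance tracks memory bandwidth rather than peak compute.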
The public README states what the first release covers. FlashMLA supports the BF16 and FP16 formats and a paged KV cache with a block size of 64. It requires Hopper GPUs, CUDA 12.3 or newer, and PyTorch 2.0 or newer. The repository credits its lineage directly, saying FlashMLA "is inspired by FlashAttention 2&3 and cutlass projects." [1]
Those two influences map onto two parts of the design. From FlashAttention it borrows the fused, tiled approach to attention. Rather than writing the large intermediate attention-score matrix out to high-bandwidth memory and reading it back, the kernel streams blocks of keys and values through fast on-chip memory and computes the softmax in an online, running fashion. That keeps the slow memory traffic down, which is what matters most in the memory-bound decode setting. FlashAttention 3 in particular was written to use Hopper features, and FlashMLA targets the same hardware. [1][7]
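The online softmax idea can be shown for a single query in plain Python. This is a minimal sketch of the general technique, not FlashMLA's kernel: it processes keys and values one block at a time, keeps a running maximum and running sum, and rescales the partial output whenever the maximum changes. The function names are hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference: materialize all scores, then apply softmax."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

def tiled_attention(q, keys, values, block_size):
    """Stream K/V in blocks, never holding the full score vector."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this factor is exp(-inf) = 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            for i in range(len(acc)):
                acc[i] += w * v[i]
        running_max = new_max
    return [a / running_sum for a in acc]

rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(8)]
keys = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]
values = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(10)]

ref = naive_attention(q, keys, values)
out = tiled_attention(q, keys, values, block_size=4)
max_err = max(abs(a - b) for a, b in zip(ref, out))
print(max_err < 1e-9)
```

The two functions agree up to floating-point rounding, but the tiled version only ever holds one block of scores, which is what lets a fused kernel keep the working set in on-chip memory.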
From CUTLASS it inherits the building blocks for high-performance matrix multiply on NVIDIA tensor cores. CUTLASS is NVIDIA's template library of CUDA primitives for linear algebra, and it gives kernel authors a way to schedule tensor-core work, manage shared memory, and pipeline data movement at a level close to the metal. MLA decode is, underneath, a sequence of small matrix multiplies tied together by the attention pattern, so CUTLASS-style scheduling is a natural fit. [1]
Two serving features round out the design. The paged KV cache stores the cache in fixed-size blocks rather than one contiguous buffer per sequence, the same idea popularized by PagedAttention in vLLM. Paging reduces memory fragmentation and lets a server pack more sequences into the same memory, which raises the achievable batch size. The variable-length support means the kernel handles a batch where each sequence has a different length without padding everything to the longest one, so no compute is wasted on padding tokens. The usage example in the README reflects this, with a metadata step that plans the work across heads and sequence lengths before the per-layer attention calls. [1]
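The bookkeeping behind a paged cache can be sketched with a block table that maps each sequence's logical positions to physical blocks. This toy allocator is an illustration of the general idea, not FlashMLA's or vLLM's data structure. The class and method names are hypothetical, and only the block size of 64 comes from the README.

```python
BLOCK_SIZE = 64  # block size stated in the FlashMLA README

class PagedKVCache:
    """Toy block allocator; names and layout are hypothetical."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token and return where it lives."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return self.locate(seq_id, length)

    def locate(self, seq_id, position):
        """Map a logical token position to (physical block, offset)."""
        block = self.block_tables[seq_id][position // BLOCK_SIZE]
        return block, position % BLOCK_SIZE

cache = PagedKVCache(num_blocks=8)
for _ in range(64):
    cache.append_token("a")   # fills block 0
for _ in range(5):
    cache.append_token("b")   # takes block 1
for _ in range(6):
    cache.append_token("a")   # spills into block 2

print(cache.block_tables)      # {'a': [0, 2], 'b': [1]}
print(cache.locate("a", 69))   # (2, 5)
```

Sequence "a" ends up in two non-adjacent blocks, and each sequence keeps its own length, so a batch can mix a 70-token and a 5-token request without padding either one.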
The headline figures come from the FlashMLA README and are tied to a specific setup. On an H800 SXM5 GPU, using CUDA 12.6, the kernel reports up to 3000 GB/s in a memory-bound configuration and up to 580 TFLOPS in a compute-bound configuration. The H800 is the China-market variant of NVIDIA's H100, with the same Hopper architecture but reduced interconnect bandwidth, and its theoretical memory bandwidth is roughly 3.3 TB/s, so a measured 3000 GB/s puts the memory-bound case near the hardware ceiling. [1]
| Item | Value | Source |
|---|---|---|
| Release date | 24 February 2025 | DeepSeek open-infra-index |
| Hardware tested | H800 SXM5 | FlashMLA README |
| CUDA version | 12.6 | FlashMLA README |
| Memory-bound throughput | up to 3000 GB/s | FlashMLA README |
| Compute-bound throughput | up to 580 TFLOPS | FlashMLA README |
| Numeric formats | BF16, FP16 | FlashMLA README |
| Paged KV cache block size | 64 | FlashMLA README |
| Minimum CUDA / PyTorch | 12.3 / 2.0 | FlashMLA README |
These are the project's own benchmark numbers on its own hardware, not third-party measurements, and real throughput on a given workload depends on model dimensions, batch size, sequence length, and the GPU in use. [1][3]
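One way to read the two headline figures together is through a simple roofline model, in which attainable throughput is the smaller of the compute ceiling and the bandwidth ceiling multiplied by arithmetic intensity. The sketch below plugs the reported 580 TFLOPS and 3000 GB/s in as ceilings. That is an assumption made for illustration, since both are measured kernel results rather than hardware peaks, and the function name is hypothetical.

```python
def attainable_tflops(intensity_flop_per_byte, peak_tflops, bandwidth_gb_s):
    """Roofline estimate: min(compute ceiling, bandwidth * intensity)."""
    # GB/s * FLOP/byte = GFLOP/s, so divide by 1000 to get TFLOP/s.
    memory_limit_tflops = intensity_flop_per_byte * bandwidth_gb_s / 1000.0
    return min(peak_tflops, memory_limit_tflops)

PEAK_TFLOPS = 580.0      # reported compute-bound figure
BANDWIDTH_GB_S = 3000.0  # reported memory-bound figure
THEORETICAL_GB_S = 3300.0  # "roughly 3.3 TB/s" as quoted for the H800

# Intensity at which the two ceilings meet, in FLOPs per byte.
ridge = PEAK_TFLOPS * 1000.0 / BANDWIDTH_GB_S

low = attainable_tflops(1.0, PEAK_TFLOPS, BANDWIDTH_GB_S)
high = attainable_tflops(500.0, PEAK_TFLOPS, BANDWIDTH_GB_S)
utilization = BANDWIDTH_GB_S / THEORETICAL_GB_S

print(f"ridge: {ridge:.1f} FLOP/byte")            # ridge: 193.3 FLOP/byte
print(low, high)                                  # 3.0 580.0
print(f"bandwidth utilization: {utilization:.1%}")  # bandwidth utilization: 90.9%
```

Under these assumptions a workload needs roughly 190 operations per byte before compute becomes the limit, and anything well below that is governed by the bandwidth figure instead.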
FlashMLA arrived at a moment when the open-source serving community was racing to run DeepSeek-V3 and R1 efficiently, and MLA was the part that needed new code. The vLLM project, a widely used inference engine, added MLA support to serve these models and described the motivation in plain terms. Its blog notes that "DeepSeek models use Multi-head Latent Attention (MLA), which compresses the KV cache into a latent vector," and that this "significantly reduces memory usage and allows for larger batch sizes." The same effort drew on DeepSeek's Hopper-optimized kernels. SGLang, another high-throughput serving system, also added MLA handling for DeepSeek models. [8][9]
Because DeepSeek published FlashMLA under an open license with a clear interface, serving stacks and GPU programmers could read the production code rather than reverse-engineering it from a paper. That lowered the cost of supporting MLA broadly, and it fed back into a wider body of work on KV-cache reduction and attention kernels across the field. [1][8]
FlashMLA matters for a narrow reason that has broad effects. Inference cost for large models is dominated by how well a server keeps its GPUs full while holding the KV cache in memory, and MLA attacks the memory side hard. A kernel that runs MLA decode near hardware limits turns the architectural saving into a real serving saving, which is the difference between a clever idea on paper and lower cost per token in production. By open-sourcing the exact kernel it runs, DeepSeek let others reach a similar efficiency on Hopper hardware, which is part of why DeepSeek-V3 and R1 became practical to host outside DeepSeek itself. [1][8]
The release also fit a pattern in AI infrastructure where the hard-won engineering, not just the model weights, gets shared. FlashMLA, DeepEP, and DeepGEMM together exposed a stack tuned for training and serving sparse mixture-of-experts models on Hopper GPUs, and they gave smaller teams a concrete reference for techniques that had mostly lived inside a few large labs. [2][5]
FlashMLA is specialized, and that is its main constraint. It targets Hopper GPUs, so it does not directly help users on older NVIDIA architectures or on other vendors' accelerators. It is a decoding kernel for MLA, so it applies to models that use MLA, chiefly DeepSeek's own line, rather than to the much larger set of models built on multi-head or grouped-query attention. The first public release covered BF16 and FP16, which leaves lower-precision formats such as FP8 outside the initial scope even though DeepSeek uses FP8 elsewhere in its stack. [1]
The performance figures, while strong, are best read as a ceiling for a favorable configuration on one specific GPU rather than a guarantee for every workload. And like any kernel that lives close to the hardware, it carries tight coupling to specific CUDA versions and tensor-core features, which is the cost of running near the limits of the machine. [1][3]