FlashMLA


Updated Jul 23, 2026

FlashMLA is an open-source GPU kernel from DeepSeek that accelerates the decoding step of Multi-head Latent Attention (MLA), the attention variant DeepSeek uses to shrink the KV cache in its large language models. DeepSeek released it on 24 February 2025 as the first project of its "Open Source Week," a run of five daily code drops pulled straight from the company's production inference stack. The kernel targets NVIDIA Hopper GPUs (the H800 and H100), handles variable-length sequences and a paged KV cache, and supports the BF16 and FP16 numeric formats. On an H800 SXM5 card DeepSeek reports up to 3000 GB/s of memory bandwidth in memory-bound settings and up to 580 TFLOPS in compute-bound settings. ^[1]^[2]^[3]

The project sits at the intersection of two trends in efficient inference. One is the attention redesign DeepSeek introduced with DeepSeek-V2 to cut the memory cost of long contexts. The other is the family of fused attention kernels, led by FlashAttention, that keep the GPU busy by avoiding round trips to slow high-bandwidth memory. FlashMLA applies the second idea to the first, and it does so for the specific shape of attention that MLA produces during token-by-token generation. ^[1]^[4]

What is FlashMLA?

FlashMLA is a CUDA kernel that runs the decode-time attention math for MLA models near the limits of NVIDIA Hopper hardware. It is not a model and not a full framework; it is the low-level routine a serving stack calls to compute attention while generating tokens. DeepSeek describes it in the day-one announcement as "our efficient MLA decoding kernel for Hopper GPUs, optimized for variable-length sequences and now in production." That last phrase matters: DeepSeek presents FlashMLA not as a research prototype but as code already deployed in its own serving stack, which means it has to cope with real workloads where requests arrive at different lengths and batch sizes shift from moment to moment. ^[1]^[2]

The public README credits the kernel's lineage directly, stating that FlashMLA "is inspired by FlashAttention 2&3 and cutlass projects." In short, FlashMLA borrows FlashAttention's fused, tiled approach to attention and CUTLASS's tensor-core scheduling, then specializes both for the tensor shapes MLA creates at decode time. ^[1]

When was FlashMLA released, and what was Open Source Week?

DeepSeek announced Open Source Week on 21 February 2025 and began publishing repositories on the 24th. The company framed the effort plainly. Its index repository describes the team as "a tiny team @deepseek-ai pushing our limits in AGI exploration" and says the releases are "humble building blocks of our online service: documented, deployed and battle-tested in production. No vaporware, just sincere code." Over the week DeepSeek shipped FlashMLA on day one, the DeepEP expert-parallel communication library on day two, the DeepGEMM FP8 matrix-multiply library on day three, a set of parallelism tools including DualPipe and EPLB on day four, and the Fire-Flyer File System on day five. ^[2]^[5]

Day	Date (2025)	Project	What it does
1	24 Feb	FlashMLA	MLA decoding kernel for Hopper GPUs
2	25 Feb	DeepEP	Expert-parallel communication library for MoE
3	26 Feb	DeepGEMM	FP8 general matrix-multiply library
4	27 Feb	DualPipe / EPLB	Pipeline parallelism and expert load balancing
5	28 Feb	3FS (Fire-Flyer File System)	High-throughput distributed file system

What is MLA, and how does FlashMLA relate to it?

MLA itself predates the kernel. DeepSeek introduced Multi-head Latent Attention in the DeepSeek-V2 technical report in May 2024 and carried it forward into DeepSeek-V3 and the R1 reasoning model. The motivation is the KV cache. During autoregressive generation a transformer stores the key and value vectors for every past token so it does not recompute them at each step. That cache grows with sequence length and batch size, and on long contexts it can dominate GPU memory, which caps how many requests a server can run at once. ^[4]^[6]
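The growth described above can be made concrete with a little arithmetic. The sketch below assumes a standard attention layout (separate keys and values per head) and a hypothetical model with 32 layers, 32 key-value heads, and a head dimension of 128 in BF16; none of these numbers describe a DeepSeek model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Bytes needed to cache keys and values for standard attention."""
    # Keys plus values for one token in one layer.
    per_token_per_layer = 2 * num_kv_heads * head_dim * bytes_per_elem
    return per_token_per_layer * num_layers * seq_len * batch_size


# Hypothetical model: 32 layers, 32 KV heads, head_dim 128, BF16 (2 bytes).
short_single = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=1)
long_batched = kv_cache_bytes(32, 32, 128, seq_len=32768, batch_size=8)

print(short_single / 2**30, "GiB")  # one 4K-token request
print(long_batched / 2**30, "GiB")  # eight 32K-token requests
```

Because the cost is linear in both sequence length and batch size, the second case needs 64 times the memory of the first.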

The distinction is worth stating plainly: MLA is the attention architecture baked into the model weights, while FlashMLA is the kernel that executes that architecture efficiently at inference time. You can train and define a model with MLA and run it with a generic attention kernel; FlashMLA is the purpose-built routine that makes MLA decode fast on Hopper hardware. ^[1]^[4]

How does MLA differ from MHA and GQA?

Standard multi-head attention gives every attention head its own keys and values, so the cache holds a full set of key and value vectors per head per token. Grouped-query attention reduces that by letting several query heads share one key-value head, which is the path used by models such as Llama 2 70B and many others. Multi-query attention pushes the same idea to a single shared key-value head. Both trade a little quality for a smaller cache. ^[4]

MLA takes a different route. Instead of cutting the number of key-value heads, it compresses the keys and values themselves. The DeepSeek-V2 paper describes a low-rank joint compression that projects the hidden state down into a single small latent vector. Only that latent vector goes into the cache. When the model needs the actual keys and values for an attention step, it reconstructs them from the latent vector through up-projection matrices. The cache therefore stores one compact representation per token rather than a full set of per-head key and value vectors. ^[4]
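A minimal sketch of the caching pattern, with toy sizes and random weights, may help. All names here (`W_down`, `W_up_k`, `W_up_v`, `append_token`, `reconstruct`) are hypothetical, and the sketch leaves out the rotary dimensions, the query path, and any trained weights; it only shows that the cache holds the latent vector and that keys and values are rebuilt on demand.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_latent, n_heads, d_head = 16, 4, 2, 8  # toy sizes, not DeepSeek's

W_down = rand_matrix(d_latent, d_model, rng)           # hidden state -> latent
W_up_k = rand_matrix(n_heads * d_head, d_latent, rng)  # latent -> all heads' keys
W_up_v = rand_matrix(n_heads * d_head, d_latent, rng)  # latent -> all heads' values

latent_cache = []  # one small vector per past token


def append_token(hidden_state):
    """Compress the hidden state and store only the latent vector."""
    latent_cache.append(matvec(W_down, hidden_state))


def reconstruct(position):
    """Rebuild full keys and values for one cached token."""
    latent = latent_cache[position]
    return matvec(W_up_k, latent), matvec(W_up_v, latent)


for _ in range(3):
    append_token([rng.uniform(-1, 1) for _ in range(d_model)])

keys, values = reconstruct(1)
cached_per_token = len(latent_cache[0])
full_per_token = len(keys) + len(values)
print(cached_per_token, "cached vs", full_per_token, "reconstructed elements")
```

In this toy configuration the cache stores 4 numbers per token where per-head keys and values would take 32.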

Rotary position embedding does not fit this scheme cleanly. The rotation applied to each key depends on that token's position, so if it were applied to keys rebuilt from the latent vector, the key up-projection could no longer be folded into the query projection ahead of time, and the model would have to recompute full keys for every cached token at each step. DeepSeek handled this with a decoupled RoPE design that carries the rotary signal on extra dedicated query and key dimensions, kept separate from the compressed part. The result keeps positional information without forcing the model to cache full keys and values. ^[4]
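The position dependence can be seen in a two-dimensional toy version of rotary embedding. The sketch below is an illustration only: the `rotate` and `score` helpers and the angle step `theta` are hypothetical, and real RoPE rotates many dimension pairs at different frequencies. It shows that the attention score depends on the offset between the query and key positions, so the factor sitting between query and key changes from token to token and cannot be merged into one fixed weight matrix.

```python
import math


def rotate(vec2, position, theta=0.5):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec2
    return (x * c - y * s, x * s + y * c)


def score(q, k, q_pos, k_pos):
    """Dot product of the rotated query and rotated key."""
    qr, kr = rotate(q, q_pos), rotate(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]


q, k = (1.0, 2.0), (0.5, -1.0)
same_offset_a = score(q, k, q_pos=5, k_pos=3)    # offset 2
same_offset_b = score(q, k, q_pos=12, k_pos=10)  # offset 2 again
other_offset = score(q, k, q_pos=5, k_pos=4)     # offset 1

print(same_offset_a, same_offset_b, other_offset)
```

The first two scores agree because the offset is the same, while the third differs.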

The payoff is large. DeepSeek-V2 reports that, compared with DeepSeek 67B (a dense model that itself uses grouped-query attention), it cuts the KV cache by 93.3 percent, leaving roughly 6.7 percent of the original cache, and lifts maximum generation throughput to 5.76 times that of the earlier model. Those numbers come from the DeepSeek-V2 paper and compare whole models, so they describe the architecture rather than the FlashMLA kernel by itself. ^[4]

Attention variant	What gets cached	Relative KV cache	Used by
Multi-head attention (MHA)	Full keys and values per head	Largest	GPT-style models
Grouped-query attention (GQA)	Keys and values per head group	Reduced	Llama 2 70B, DeepSeek 67B, and many others
Multi-query attention (MQA)	One shared key-value head	Smallest	PaLM and others
Multi-head latent attention (MLA)	One compressed latent vector plus a small rotary key	Small (about that of GQA with 2.25 groups, per the DeepSeek-V2 paper)	DeepSeek-V2, DeepSeek-V3, DeepSeek-R1

The table is a qualitative summary. Exact ratios depend on head counts, dimensions, and the compression rank, and they vary by model.
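For a rough sense of scale, the per-token, per-layer cache sizes can be computed from the formulas in the DeepSeek-V2 paper. The sketch below plugs in that paper's attention dimensions (128 heads, head dimension 128, latent dimension 512, rotary dimension 64); the GQA group count of 8 is an assumption chosen for illustration, and the function name is hypothetical.

```python
def cached_elements_per_token(variant, n_heads=128, d_head=128,
                              n_groups=8, d_latent=512, d_rope=64):
    """Elements cached per token per layer for each attention variant."""
    if variant == "MHA":
        return 2 * n_heads * d_head   # keys + values for every head
    if variant == "GQA":
        return 2 * n_groups * d_head  # keys + values per head group
    if variant == "MQA":
        return 2 * d_head             # one shared key-value head
    if variant == "MLA":
        return d_latent + d_rope      # latent vector + decoupled rotary key
    raise ValueError(f"unknown variant: {variant}")


sizes = {v: cached_elements_per_token(v) for v in ("MHA", "GQA", "MQA", "MLA")}
for variant, count in sizes.items():
    print(f"{variant}: {count} elements ({count / sizes['MHA']:.2%} of MHA)")
```

Under these assumptions MLA caches 576 elements per token per layer, under 2 percent of the MHA figure. MQA caches fewer elements still, but the DeepSeek-V2 paper argues that MLA keeps model quality at or above MHA while MQA does not.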

Why does MLA need its own kernel?

The compression that makes MLA cheap on memory also changes the math the GPU has to run at decode time. The cached latent vector has to be expanded back into keys and values, and the attention computation works over tensor shapes that differ from plain multi-head or grouped-query attention. A generic attention kernel can run MLA, but it will not lay out the work in the way that keeps a Hopper GPU's tensor cores and memory pipeline near their limits. A purpose-built kernel can.

Decoding is also the hard case for utilization. During prefill, when the model reads a prompt, attention processes many tokens at once and the GPU has plenty of parallel work. During decode, the model generates one token per step per sequence, so each step has very little compute relative to the memory it must touch. That makes decode memory-bound, and it makes batching across many concurrent requests the main lever for throughput. FlashMLA is built for exactly this regime, which is why DeepSeek describes it as a decoding kernel for variable-length serving rather than a general attention library. ^[1]^[2]
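The gap between prefill and decode can be estimated with a simple arithmetic-intensity calculation. The sketch below is a deliberate simplification: it assumes standard attention in BF16, counts two multiply-add passes (scores and weighted sum) per query row, assumes each cached key and value element is read from memory once, and ignores query and output traffic and causal masking. The function name is hypothetical.

```python
def attention_intensity(query_rows_sharing_cache, bytes_per_elem=2):
    """Approximate FLOPs per byte of KV cache read.

    Per cached token and feature dimension, each query row costs
    2 FLOPs for the score and 2 FLOPs for the weighted sum, while
    the key and value elements are each read once.
    """
    flops = 4 * query_rows_sharing_cache
    bytes_read = 2 * bytes_per_elem
    return flops / bytes_read


decode = attention_intensity(1)      # one new token per sequence
prefill = attention_intensity(1024)  # a 1024-token prompt processed at once

print(decode, "FLOPs/byte in decode")
print(prefill, "FLOPs/byte in prefill")
```

With a single query row, the GPU does about one floating-point operation per byte it loads, which is far below what tensor cores can sustain, so memory traffic sets the pace.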

How is FlashMLA designed?

The public README states what the first release covers. FlashMLA supports the BF16 and FP16 formats and a paged KV cache with a block size of 64. The initial release required Hopper GPUs, CUDA 12.3 or newer, and PyTorch 2.0 or newer. ^[1]

The FlashAttention and CUTLASS influences map onto two parts of the design. From FlashAttention it borrows the fused, tiled approach to attention. Rather than writing the large intermediate attention-score matrix out to high-bandwidth memory and reading it back, the kernel streams blocks of keys and values through fast on-chip memory and computes the softmax in an online, running fashion. That keeps the slow memory traffic down, which is what matters most in the memory-bound decode setting. FlashAttention 3 in particular was written to use Hopper features, and FlashMLA targets the same hardware. ^[1]^[7]
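The online softmax idea can be sketched in plain Python for a single query vector. This is an illustration of the general technique from the FlashAttention line of work, not FlashMLA's actual code: the function names and block size are hypothetical, the usual scaling by the square root of the head dimension is omitted, and the real kernel runs on tensor cores with on-chip memory rather than Python lists.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Materialize every score, then apply softmax and the weighted sum."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def tiled_attention(q, keys, values, block_size=2):
    """Stream key/value blocks, keeping only a running max, sum, and output."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
q = [rng.uniform(-1, 1) for _ in range(4)]
keys = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(7)]
values = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(7)]

reference = naive_attention(q, keys, values)
streamed = tiled_attention(q, keys, values, block_size=2)
print(max(abs(a - b) for a, b in zip(reference, streamed)))
```

The streamed version never holds more than one block of scores at a time, yet it matches the naive result up to floating-point rounding.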

From CUTLASS it inherits the building blocks for high-performance matrix multiply on NVIDIA tensor cores. CUTLASS is NVIDIA's template library of CUDA primitives for linear algebra, and it gives kernel authors a way to schedule tensor-core work, manage shared memory, and pipeline data movement at a level close to the metal. MLA decode is, underneath, a sequence of small matrix multiplies tied together by the attention pattern, so CUTLASS-style scheduling is a natural fit. ^[1]

Two serving features round out the design. The paged KV cache stores the cache in fixed-size blocks rather than one contiguous buffer per sequence, the same idea popularized by PagedAttention in vLLM. Paging reduces memory fragmentation and lets a server pack more sequences into the same memory, which raises the achievable batch size. The variable-length support means the kernel handles a batch where each sequence has a different length without padding everything to the longest one, so no compute is wasted on padding tokens. The usage example in the README reflects this, with a metadata step that plans the work across heads and sequence lengths before the per-layer attention calls. ^[1]
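The bookkeeping behind a paged cache can be sketched with a block table per sequence. The class below is a hypothetical illustration, not the FlashMLA or vLLM interface; only the block size of 64 comes from the README. Each sequence gets a new physical block only when its current one fills up, and a token's logical index maps to a block and an offset inside it.

```python
BLOCK_SIZE = 64  # block size stated in the FlashMLA README


class PagedCache:
    """Toy block allocator: tracks where each token lives, not the data."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of cached tokens

    def append_token(self, seq_id):
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # current block is full (or none yet)
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1

    def locate(self, seq_id, token_index):
        """Map a logical token index to (physical block, offset)."""
        if token_index >= self.seq_lens[seq_id]:
            raise IndexError("token not cached yet")
        table = self.block_tables[seq_id]
        return table[token_index // BLOCK_SIZE], token_index % BLOCK_SIZE


cache = PagedCache(num_blocks=8)
for _ in range(64):
    cache.append_token("a")  # fills block 0
for _ in range(5):
    cache.append_token("b")  # a second, shorter sequence takes block 1
for _ in range(6):
    cache.append_token("a")  # sequence "a" continues in block 2

print(cache.block_tables)
print(cache.locate("a", 65), cache.locate("b", 4))
```

Sequence "a" ends up in blocks 0 and 2, which are not adjacent, and the two sequences have different lengths (70 and 5 tokens) without either being padded.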

How fast is FlashMLA?

The headline figures come from the FlashMLA README and are tied to a specific setup. On an H800 SXM5 GPU, using CUDA 12.6, the kernel reports up to 3000 GB/s in a memory-bound configuration and up to 580 TFLOPS in a compute-bound configuration. The H800 is the China-market variant of NVIDIA's H100, with the same Hopper architecture but reduced interconnect bandwidth, and its theoretical memory bandwidth is roughly 3.3 TB/s, so a measured 3000 GB/s puts the memory-bound case near the hardware ceiling. ^[1]
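The "near the hardware ceiling" claim is simple arithmetic on the figures above. The sketch below uses the reported 3000 GB/s and 580 TFLOPS together with the rough 3.3 TB/s theoretical bandwidth quoted in this section; the balance point it derives is an illustrative roofline-style ratio, not a number published by DeepSeek.

```python
measured_bandwidth = 3000e9  # bytes/s, reported memory-bound figure
peak_bandwidth = 3.3e12      # bytes/s, approximate H800 SXM5 theoretical peak
measured_compute = 580e12    # FLOP/s, reported compute-bound figure

# Share of theoretical memory bandwidth reached in the memory-bound case.
bandwidth_fraction = measured_bandwidth / peak_bandwidth

# FLOPs per byte at which the two reported limits meet: workloads below
# this ratio are limited by memory traffic, those above it by compute.
balance_point = measured_compute / measured_bandwidth

print(f"{bandwidth_fraction:.1%} of peak bandwidth")
print(f"{balance_point:.0f} FLOPs per byte at the balance point")
```

On these numbers the kernel reaches about 91 percent of the quoted peak bandwidth, and a workload needs on the order of 190 floating-point operations per byte loaded before compute rather than memory becomes the limit.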

Item	Value	Source
Release date	24 February 2025	DeepSeek open-infra-index
Hardware tested	H800 SXM5	FlashMLA README
CUDA version	12.6	FlashMLA README
Memory-bound throughput	up to 3000 GB/s	FlashMLA README
Compute-bound throughput	up to 580 TFLOPS	FlashMLA README
Numeric formats	BF16, FP16	FlashMLA README
Paged KV cache block size	64	FlashMLA README
Minimum CUDA / PyTorch	12.3 / 2.0	FlashMLA README

These are the project's own benchmark numbers on its own hardware, not third-party measurements, and real throughput on a given workload depends on model dimensions, batch size, sequence length, and the GPU in use. ^[1]^[3]

What changed after the initial release?

FlashMLA has continued to evolve past the February 2025 day-one drop. DeepSeek published a performance update and an architectural deep-dive blog on 22 April 2025, refining the kernel's scheduling. A larger change came on 29 September 2025, when DeepSeek added sparse attention kernels to FlashMLA to power DeepSeek Sparse Attention (DSA), the mechanism introduced with the DeepSeek-V3.2-Exp model. The sparse path loads only selected KV entries from high-bandwidth memory and fuses the gather, mask, and attention into a single kernel; DeepSeek reports up to 640 TFLOPS during prefill and 410 TFLOPS during decode for these sparse kernels on H800 SXM5. By that point the README described FlashMLA as a library of optimized attention kernels powering DeepSeek-V3 and DeepSeek-V3.2-Exp, with the newest builds targeting CUDA 12.8 and above. ^[10]^[11]
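The idea of loading only selected KV entries can be illustrated with a toy comparison. The sketch below is an assumption-laden illustration, not DeepSeek's kernel: the helper names are hypothetical, the list of selected indices is given by hand rather than produced by any selection mechanism, and scaling is omitted. It shows that gathering the selected rows and attending over them gives the same result as dense attention with every other position masked out, while touching fewer cache entries.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_weighted_sum(scores, values):
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) gives 0.0
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def sparse_attention(q, keys, values, selected):
    """Gather only the selected cache rows, then attend over them."""
    k_sel = [keys[i] for i in selected]
    v_sel = [values[i] for i in selected]
    return softmax_weighted_sum([dot(q, k) for k in k_sel], v_sel)


def masked_dense_attention(q, keys, values, selected):
    """Read every row, but mask unselected positions to minus infinity."""
    keep = set(selected)
    scores = [dot(q, k) if i in keep else float("-inf")
              for i, k in enumerate(keys)]
    return softmax_weighted_sum(scores, values)


q = [0.2, -0.4]
keys = [[1.0, 0.0], [0.5, 0.5], [-1.0, 2.0], [0.0, 1.0], [0.3, -0.7]]
values = [[1.0], [2.0], [3.0], [4.0], [5.0]]
selected = [0, 2, 4]  # hand-picked for illustration

sparse_out = sparse_attention(q, keys, values, selected)
masked_out = masked_dense_attention(q, keys, values, selected)
print(sparse_out, masked_out)
```

The sparse path reads three of the five cached rows here; the saving grows with context length when the number of selected entries stays fixed.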

This history is worth keeping straight when reading the repository today. The original Open Source Week release is the BF16/FP16 dense MLA decoding kernel described above; later builds added FP8 KV-cache support and the sparse decoding path, and the headline TFLOPS figures shifted upward as the kernel and its hardware targets changed. The 580 TFLOPS / 3000 GB/s numbers are the original day-one figures. ^[1]^[11]

How did the community adopt FlashMLA?

FlashMLA arrived at a moment when the open-source serving community was racing to run DeepSeek-V3 and R1 efficiently, and MLA was the part that needed new code. The vLLM project, a widely used inference engine, added MLA support to serve these models and described the motivation in plain terms. Its blog notes that "DeepSeek models use Multi-head Latent Attention (MLA), which compresses the KV cache into a latent vector," and that this "significantly reduces memory usage and allows for larger batch sizes." The same effort drew on DeepSeek's Hopper-optimized kernels. SGLang, another high-throughput serving system, also added MLA handling for DeepSeek models. ^[8]^[9]

Because DeepSeek published FlashMLA under an open license with a clear interface, serving stacks and GPU programmers could read the production code rather than reverse-engineering it from a paper. That lowered the cost of supporting MLA broadly, and it fed back into a wider body of work on KV-cache reduction and attention kernels across the field. ^[1]^[8]

Why does FlashMLA matter?

FlashMLA matters for a narrow reason that has broad effects. Inference cost for large models is dominated by how well a server keeps its GPUs full while holding the KV cache in memory, and MLA attacks the memory side hard. A kernel that runs MLA decode near hardware limits turns the architectural saving into a real serving saving, which is the difference between a clever idea on paper and lower cost per token in production. By open-sourcing a kernel it says it runs in production, DeepSeek let others reach a similar efficiency on Hopper hardware, which is part of why DeepSeek-V3 and R1 became practical to host outside DeepSeek itself. ^[1]^[8]

The release also fit a pattern in AI infrastructure where the hard-won engineering, not just the model weights, gets shared. FlashMLA, DeepEP, and DeepGEMM together exposed a stack tuned for training and serving sparse mixture-of-experts models on Hopper GPUs, and they gave smaller teams a concrete reference for techniques that had mostly lived inside a few large labs. ^[2]^[5]

What are FlashMLA's limitations?

FlashMLA is specialized, and that is its main constraint. It targets Hopper GPUs, so it does not directly help users on older NVIDIA architectures or on other vendors' accelerators. It is a decoding kernel for MLA, so it applies to models that use MLA, chiefly DeepSeek's own line, rather than to the much larger set of models built on multi-head or grouped-query attention. The first public release covered BF16 and FP16, which left lower-precision formats such as FP8 outside the initial scope even though DeepSeek uses FP8 elsewhere in its stack (later builds added FP8 KV-cache support). ^[1]

The performance figures, while strong, are best read as a ceiling for a favorable configuration on one specific GPU rather than a guarantee for every workload. And like any kernel that lives close to the hardware, it carries tight coupling to specific CUDA versions and tensor-core features, which is the cost of running near the limits of the machine. ^[1]^[3]

References

[1] DeepSeek. "FlashMLA." GitHub repository. github.com/...FlashMLA
[2] DeepSeek. "open-infra-index: Open Source Week." GitHub. github.com/...open-infra-index
[3] InfoQ. "DeepSeek Kicks Off Open Source Week with FlashMLA." March 2025. infoq.com/...deepseek-flashmla
[4] DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434, 2024. arxiv.org/...2405.04434
[5] DeepSeek. "Open Source Week recap." open-infra-index, 2025. github.com/...open-infra-index
[6] DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 2024. arxiv.org/...2412.19437
[7] Shah, J. et al. "FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision." 2024. github.com/...flash-attention
[8] vLLM Project. "vLLM and DeepSeek: cost-efficient AI serving." March 2025. blog.vllm.ai/...deepseek-v3
[9] NVIDIA. "CUTLASS: CUDA Templates for Linear Algebra Subroutines." GitHub. github.com/...cutlass
[10] SGLang. "SGLang serving framework." GitHub. github.com/...sglang
[11] vLLM Project. "DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action." 29 September 2025. blog.vllm.ai/...deepseek-v3-2
