DeepGEMM

Updated Jul 23, 2026

DeepGEMM is an open-source library from DeepSeek that provides fast FP8 general matrix multiplication (GEMM) kernels for NVIDIA Hopper GPUs. It was released on 26 February 2025 as the third project in DeepSeek's "Open Source Week," a five-day series in which the company published several pieces of the training and inference infrastructure behind DeepSeek-V3 and DeepSeek-R1 ^[1]^[2]^[4]. DeepSeek introduced it as "an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference," reaching "up to 1350+ FP8 TFLOPS on Hopper GPUs" in code the company called "as clean as a tutorial" ^[2]^[15]. The library is deliberately small, with a core kernel of roughly 300 lines of code, yet it reaches performance competitive with, and in many shapes ahead of, heavily tuned expert libraries ^[1]^[9].

DeepGEMM matters because matrix multiplication dominates the cost of both training and serving large language models, and doing that arithmetic in 8-bit floating point (FP8) roughly doubles throughput and halves memory traffic compared with 16-bit formats ^[8]. The hard part is keeping FP8 numerically stable. DeepGEMM packages the techniques DeepSeek used to train a frontier model in FP8 into a clean, reusable component ^[1]^[3]. Since its first release the project has grown beyond a single kernel: by 2026 it had become a broader tensor-core kernel library covering FP8, FP4, and BF16 GEMMs, fused mixture-of-experts kernels, and the newer Blackwell architecture, while keeping the same compact, JIT-compiled design ^[1]. The repository now describes itself as "a unified, high-performance tensor core kernel library that brings together the key computation primitives of modern large language models," and it had drawn more than 7,000 GitHub stars by mid-2026 ^[1].

Is DeepGEMM open source?

Yes. DeepGEMM is released under the permissive MIT license, and its full source lives in DeepSeek's public GitHub repository ^[1]. There is no install-time build step: all kernels are compiled at runtime through a lightweight just-in-time (JIT) module, so after cloning the repository a user compiles only the exact matrix shapes a workload needs ^[1]. Because the core dense kernel is only about 300 lines and avoids a heavy template framework, DeepSeek presents it as a teaching reference as much as a production tool; the company describes the code as "as clean as a tutorial" ^[1]^[15]. That combination of a readable core, an open license, and production use inside DeepSeek-V3 and R1 has made DeepGEMM one of the most studied open examples of hand-written FP8 kernels ^[1].

Background and release

DeepSeek-V3 was one of the first widely deployed large models trained natively in FP8. Its technical report describes a fine-grained quantization scheme in which activations and weights are scaled in small blocks or tiles rather than per tensor, so that a few large values do not force the rest of a tensor into a lossy range ^[3]. Activations are grouped and scaled on a 1x128 tile basis (one token across 128 channels) and weights on a 128x128 block basis (128 input channels by 128 output channels), with the E4M3 FP8 format used throughout and scaling factors computed online from the maximum absolute value of each tile or block ^[3]. Reproducing that scheme efficiently requires GEMM kernels that understand per-block scaling factors and that compensate for the limited accumulation precision of Hopper's FP8 Tensor Cores.

The full V3 model is large: 671 billion total parameters with 37 billion activated per token, trained on 14.8 trillion tokens ^[3]. DeepSeek reported the pre-training run at roughly 2.788 million H800 GPU hours, an unusually low figure for a model of that scale ^[3]. Efficient FP8 matrix multiplication is one reason that number is as small as it is, since GEMM is where most of the floating-point work happens.

Rather than keep those kernels internal, DeepSeek open-sourced them during Open Source Week, a five-day run from 24 to 28 February 2025. The series included FlashMLA on day one (an attention-decoding kernel for Hopper), DeepEP on day two (the first open-source expert-parallel communication library for mixture-of-experts models), DeepGEMM on day three, the DualPipe pipeline-parallelism algorithm and EPLB expert-parallel load balancer on day four, and the Fire-Flyer File System (3FS) on day five ^[2]^[10]. The open-infra-index that catalogs the week describes DeepGEMM as "an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference," with up to "1350+ FP8 TFLOPS on Hopper GPUs" ^[2]. The repository is released under the MIT license ^[1].

How does DeepGEMM keep FP8 accurate?

FP8 stores each number in eight bits, which gives very little room for both range and precision. DeepGEMM uses the E4M3 variant (four exponent bits, three mantissa bits) for the matrix inputs, matching DeepSeek-V3's training recipe ^[3]^[8]^[12]. Multiplying two FP8 matrices is fast on Hopper Tensor Cores, but two numerical problems get in the way. The first is dynamic range: a single tensor often contains values spanning many orders of magnitude, and one global scale cannot represent all of them well. DeepGEMM addresses this by consuming the fine-grained, per-tile and per-block scaling factors that DeepSeek's quantization recipe produces, so each small region of a matrix carries its own scale ^[1]^[3].
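The dynamic-range argument can be checked with a small simulation. The sketch below is not DeepGEMM code: `round_e4m3` is a hand-written approximation of E4M3 rounding (three mantissa bits, saturation at 448), and the 256-channel row with one outlier is made-up data chosen for illustration. It compares one scale for the whole row against one scale per 1x128 tile.

```python
import math

E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value (3 mantissa bits, saturating at 448)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)            # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)            # below 2**-6 the format is subnormal
    step = 2.0 ** (exp - 3)         # spacing of representable values here
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_group(values):
    """Scale one group so its max maps to 448, round to E4M3, then dequantize."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_e4m3(v / scale) * scale for v in values]


def max_rel_error(original, restored):
    return max(abs(o - r) / abs(o) for o, r in zip(original, restored))


# One activation row of 256 channels: small values plus a single outlier.
row = [0.01 * (1 + (i % 16) / 16) for i in range(256)]
row[5] = 1000.0

per_tensor = quantize_group(row)                                  # one scale for everything
per_tile = quantize_group(row[:128]) + quantize_group(row[128:])  # 1x128 tiles

# Compare on the second tile, which contains no outlier.
err_tensor = max_rel_error(row[128:], per_tensor[128:])
err_tile = max_rel_error(row[128:], per_tile[128:])
print(f"per-tensor scale: {err_tensor:.1%} max relative error")
print(f"1x128 tile scale: {err_tile:.1%} max relative error")
```

With a single scale, the outlier pushes the small values into the subnormal end of the format, where only a few distinct codes are available. With per-tile scales, the second tile uses the full E4M3 range.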

The second problem is accumulation precision. A GEMM sums many products into each output element, and on Hopper the FP8 Tensor Cores accumulate those partial products at reduced internal precision, which lets rounding error build up over a long reduction. DeepSeek measured the effective precision at roughly 14 bits rather than the full FP32 width, and later described the Tensor Core accumulators as FP22 registers (one sign, eight exponent, and 13 mantissa bits) ^[3]^[13]. DeepGEMM's answer is two-level accumulation, also called promotion. The Tensor Cores do the bulk of the multiply-accumulate work quickly, and at intervals the partial results are promoted to a higher-precision accumulator on the CUDA cores, where the running sum is kept in FP32 ^[1]^[3]. This recovers most of the accuracy lost to the narrow Tensor Core accumulator while keeping the Tensor Cores busy on the parts of the computation they handle best. The result is FP8 throughput with numerical behavior close enough to higher precision to train and serve a frontier model.
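A toy model makes the benefit of promotion visible. The sketch below is an illustration under stated assumptions, not a description of the hardware: `truncate` keeps 13 fraction bits after every addition, the promotion interval of 128 elements is assumed, and a Python float stands in for the FP32 accumulator. The inputs are seeded random values in [0, 1).

```python
import math
import random


def truncate(value: float, mantissa_bits: int = 13) -> float:
    """Keep only `mantissa_bits` fraction bits of a non-negative float (toy model)."""
    if value == 0.0:
        return 0.0
    m, e = math.frexp(value)                 # value = m * 2**e, 0.5 <= m < 1
    scale = 2.0 ** (mantissa_bits + 1)       # implicit leading bit + fraction
    return math.ldexp(math.floor(m * scale) / scale, e)


def dot_low_precision(a, b):
    """Accumulate every product in the narrow accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = truncate(acc + x * y)
    return acc


def dot_promoted(a, b, interval: int = 128):
    """Accumulate `interval` products narrowly, then add the partial sum in full precision."""
    total = 0.0                              # stands in for the FP32 accumulator
    for start in range(0, len(a), interval):
        total += dot_low_precision(a[start:start + interval], b[start:start + interval])
    return total


rng = random.Random(0)
K = 4096
a = [rng.random() for _ in range(K)]
b = [rng.random() for _ in range(K)]

exact = math.fsum(x * y for x, y in zip(a, b))
err_naive = abs(dot_low_precision(a, b) - exact)
err_promoted = abs(dot_promoted(a, b) - exact)
print(f"narrow accumulator only : abs error {err_naive:.4f}")
print(f"promotion every 128     : abs error {err_promoted:.4f}")
```

The narrow accumulator loses more per addition as the running sum grows, because its spacing between representable values grows with it. Restarting it every 128 elements keeps the running sum, and therefore the loss, small.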

How is DeepGEMM optimized for Hopper GPUs?

DeepGEMM was written for the NVIDIA Hopper architecture and leans on hardware features specific to it. The most important is the Tensor Memory Accelerator (TMA), a dedicated unit introduced with Hopper that moves large tiles of data between global and shared memory asynchronously, without tying up the regular compute threads ^[1]^[6]^[14]. DeepGEMM uses the TMA for loads, stores, and multicast broadcasts of input tiles, and it prefetches TMA descriptors so the address generation does not stall the pipeline ^[1]. The scaling-factor layouts the kernels expect are described as "TMA-aligned" for exactly this reason ^[1].

On top of the TMA, the library uses a persistent, warp-specialized kernel design. Different warps are assigned to distinct jobs: some handle the TMA data movement, some issue the matrix-multiply-accumulate (MMA) instructions on the Tensor Cores, and some run the CUDA-core promotion step. Keeping these stages overlapping means the memory pipeline and the compute pipeline run at the same time rather than waiting on each other ^[1]. Two smaller tricks round out the performance work. DeepGEMM supports unaligned block sizes (for example a block of 112 rather than a power of two), which improves streaming-multiprocessor utilization on matrix shapes that do not divide evenly into standard tiles ^[1]. It also interleaves FFMA (fused floating-point multiply-add) instructions at the SASS assembly level, a low-level scheduling tweak DeepSeek reported as worth more than 10 percent on some shapes ^[1]. Because the kernels are compiled at runtime, these choices can be specialized to the exact matrix dimensions a workload uses.
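The utilization gain from an unaligned block size is simple tile-count arithmetic. The sketch below assumes a GPU with 132 streaming multiprocessors and an illustrative problem with M=256 and N=7168. These numbers and the helper `tiles_launched` are chosen for the example and do not come from the text above.

```python
import math


def tiles_launched(m: int, n: int, block_m: int, block_n: int) -> int:
    """Number of output tiles, i.e. how many SMs get work in a single wave."""
    return math.ceil(m / block_m) * math.ceil(n / block_n)


NUM_SMS = 132            # assumed SM count for an H100/H800-class GPU
M, N = 256, 7168         # illustrative problem size

for block_n in (128, 112):
    tiles = tiles_launched(M, N, 128, block_n)
    busy = min(tiles, NUM_SMS)
    print(f"BLOCK_N={block_n}: {tiles} tiles, {busy}/{NUM_SMS} SMs busy "
          f"({busy / NUM_SMS:.0%})")
```

With a block width of 128 the problem splits into 112 tiles, leaving 20 SMs idle. A width of 112 yields 128 tiles and fills all but four.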

Just-in-time compilation is itself a design pillar. All kernels are compiled at runtime through a lightweight JIT module, so there is no heavy build step at installation and the library only ever compiles the shapes a workload needs ^[1]. A May 2025 update added an NVRTC path that DeepSeek reported as up to ten times faster to compile, and a later refactor introduced a low-overhead JIT C++ module ^[1]. This keeps the codebase small enough that the single core kernel reads as a teaching reference as well as a production tool.
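The cost profile of shape-specialized JIT compilation can be mimicked with a cache keyed on the matrix dimensions. The sketch below is hypothetical: `get_kernel` and `gemm` are invented names, the "compiler" is a 50 ms sleep, and the "kernel" is a pure-Python loop. Nothing here reflects DeepGEMM's real interface.

```python
import functools
import time


@functools.lru_cache(maxsize=None)
def get_kernel(m: int, n: int, k: int):
    """Hypothetical stand-in for JIT compilation: build a kernel specialized to one shape."""
    time.sleep(0.05)  # pretend this is the compiler running

    def kernel(a, b):
        # Shape constants are baked in, as a specialized kernel would have them.
        return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
                for i in range(m)]

    return kernel


def gemm(a, b):
    m, k, n = len(a), len(b), len(b[0])
    return get_kernel(m, n, k)(a, b)


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

start = time.perf_counter()
first = gemm(a, b)                 # compiles, then runs
cold = time.perf_counter() - start

start = time.perf_counter()
second = gemm(a, b)                # cache hit: runs the stored kernel
warm = time.perf_counter() - start

print(first, f"cold={cold * 1e3:.1f} ms", f"warm={warm * 1e3:.3f} ms")
```

Only the first call for a given (M, N, K) pays the compilation cost. Every later call with the same shape goes straight to the cached kernel.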

How does DeepGEMM support mixture-of-experts models?

Mixture-of-experts models do not run one big matrix multiply per layer. Instead each token is routed to a subset of experts, and every expert multiplies only the tokens sent to it, so a layer turns into many smaller GEMMs of varying sizes. DeepGEMM provides grouped GEMM kernels for exactly this pattern, in two layouts ^[1]^[9].

The contiguous layout is aimed at training and prefill. Tokens for the different experts are packed back to back along the M axis (the row dimension) while the N and K dimensions stay fixed, and the kernel walks through the groups in one launch ^[1]. The masked layout is aimed at decoding-time inference. The number of tokens per expert is not known until routing happens at runtime, which is awkward when the kernel launch is captured ahead of time in a CUDA graph. DeepGEMM handles this with a mask that tells the kernel how many rows in each group are valid, so it can skip the padding and compute only the real work while still fitting inside a static CUDA graph ^[1]^[9]. Grouped GEMM is the operation a mixture-of-experts layer needs, and having both layouts means the same library serves the dense parts of a model and its sparse expert layers.
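The two layouts can be contrasted on a tiny example. The sketch below uses nested Python lists instead of GPU tensors, and its function names are invented for illustration. The per-row expert index in the contiguous case is an assumption about how group membership might be conveyed, not a statement of the library's signature.

```python
def matmul(a, b):
    """Plain row-major matrix multiply on nested lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_contiguous(tokens, expert_of_row, weights):
    """Contiguous layout: all tokens stacked along M, with an expert id per row."""
    out = []
    for row, expert in zip(tokens, expert_of_row):
        out.extend(matmul([row], weights[expert]))
    return out


def grouped_masked(padded_tokens, valid_rows, weights):
    """Masked layout: a fixed [experts][max_m][k] buffer plus a per-expert row count."""
    out = []
    for expert, rows in enumerate(padded_tokens):
        m = valid_rows[expert]                 # rows beyond m are padding: skip them
        out.append(matmul(rows[:m], weights[expert]))
    return out


# Two experts with K=2, N=2. Expert 0 receives two tokens, expert 1 receives one.
weights = [
    [[1.0, 0.0], [0.0, 1.0]],                  # expert 0: identity
    [[2.0, 0.0], [0.0, 2.0]],                  # expert 1: doubles its input
]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # packed back to back along M
contiguous_out = grouped_contiguous(tokens, [0, 0, 1], weights)
print(contiguous_out)

PAD = [0.0, 0.0]
padded = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], PAD]]   # max_m = 2 for every expert
masked_out = grouped_masked(padded, [2, 1], weights)
print(masked_out)
```

The masked buffer has the same shape regardless of how routing turns out, which is what lets the launch be captured once. Only the row counts change between steps.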

How fast is DeepGEMM?

DeepSeek benchmarked the launch version on NVIDIA H800 GPUs using NVCC 12.8 across the matrix shapes that show up in DeepSeek-V3 and R1 ^[1]^[9]. The headline dense-GEMM number is up to 1,358 FP8 TFLOPS on a large 4096x7168x16384 multiplication ^[9]. Memory bandwidth peaks at about 2,668 GB/s, a figure that belongs to the small-M, memory-bound shapes rather than to that large compute-bound one, whose operand traffic at 1,358 TFLOPS works out to only a few hundred GB/s ^[9]. On a smaller but common shape, 128x7168x16384, it reaches about 645 TFLOPS ^[9]. Against a carefully tuned CUTLASS-based internal baseline, the dense kernels show speedups from roughly 1.4x up to about 2.7x, with the largest gains on the smaller and more irregular shapes where generic libraries leave performance unused ^[1]^[9]. The grouped MoE kernels, in both contiguous and masked layouts, run about 1.1x to 1.2x faster than the baseline ^[1]^[9].

Shape (M x N x K)	Kernel type	Performance
4096 x 7168 x 16384	Dense GEMM	~1,358 TFLOPS (peak compute)
128 x 7168 x 16384	Dense GEMM	~645 TFLOPS (~1.4x speedup)
Small-M, memory-bound shapes	Dense GEMM	up to ~2,668 GB/s (peak bandwidth)
Various small/irregular	Dense GEMM	up to ~2.7x speedup vs CUTLASS baseline
Grouped (contiguous, MoE)	Grouped GEMM	~1.1x to 1.2x speedup
Grouped (masked, MoE)	Grouped GEMM	~1.1x to 1.2x speedup

These figures come from DeepSeek's own benchmarks and depend heavily on matrix dimensions, so they are best read as representative rather than universal. They also moved over time: an update on 18 April 2025 reported DeepGEMM reaching up to 1,550 TFLOPS on H800 as the kernels were tuned further ^[1].
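How strongly the numbers depend on shape follows from a back-of-envelope traffic model. The sketch below is an estimate, not a measurement. It assumes one-byte FP8 inputs, two-byte outputs, and each operand moved exactly once, and it ignores scaling factors. Both helper names are invented for the example.

```python
def gemm_rates(m, n, k, seconds, in_bytes=1, out_bytes=2):
    """Return (TFLOPS, GB/s) for an m x n x k GEMM that took `seconds`."""
    flops = 2 * m * n * k                                   # one multiply + one add
    traffic = (m * k + k * n) * in_bytes + m * n * out_bytes
    return flops / seconds / 1e12, traffic / seconds / 1e9


def bandwidth_at(m, n, k, tflops, in_bytes=1, out_bytes=2):
    """GB/s implied by a reported TFLOPS figure, under the same traffic model."""
    seconds = 2 * m * n * k / (tflops * 1e12)
    return gemm_rates(m, n, k, seconds, in_bytes, out_bytes)[1]


for shape, tflops in [((4096, 7168, 16384), 1358), ((128, 7168, 16384), 645)]:
    print(shape, f"{tflops} TFLOPS -> ~{bandwidth_at(*shape, tflops):.0f} GB/s")
```

Under this model the large shape is compute-bound and needs only a few hundred GB/s. The M=128 shape has to stream the whole weight matrix for very little arithmetic, so it lands in the multi-terabyte-per-second range. That is why throughput and bandwidth peaks occur on different shapes.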

Property	Detail
Developer	DeepSeek
Released	26 February 2025 (Open Source Week, day 3)
Target hardware	NVIDIA Hopper (H100, H800); later NVIDIA Blackwell (SM100)
Precision	FP8 (E4M3) with fine-grained scaling and higher-precision second accumulation; later FP4 and BF16
Operations	Dense GEMM, grouped GEMM (contiguous and masked), later fused MoE
Compilation	Just-in-time, no install-time build
Core size	About 300 lines for the main kernel
Latest major update	16 April 2026 (Mega MoE, FP8xFP4 GEMM, FP4 indexer, PDL)
GitHub stars	More than 7,000 (mid-2026)
License	MIT

How does DeepGEMM relate to CUTLASS and DeepSeek's models?

DeepGEMM describes itself as drawing on ideas from NVIDIA's CUTLASS and its CuTe layout library, while deliberately avoiding a heavy dependency on them. The repository puts it plainly: it "leverages some concepts from CUTLASS and CuTe, but avoids heavy reliance on their templates or algebras" ^[1]. CUTLASS is a large, general, template-heavy framework that covers many data types and architectures; DeepGEMM trades that generality for a single readable kernel that does FP8 on Hopper and does it well ^[1]^[5]. The benchmark comparisons are made against an internal CUTLASS-based implementation, which is the natural baseline for this kind of work ^[9].

Inside DeepSeek's own stack, DeepGEMM is the FP8 matrix-multiply engine behind both training and inference of V3 and R1 ^[1]^[2]. The same fine-grained quantization and promotion techniques that the V3 technical report describes are what the public kernels implement, which is why DeepGEMM doubles as a concrete reference for how that training run was made to work in FP8 ^[3]. When DeepSeek shipped V3.2-Exp in September 2025 with its sparse "lightning indexer" attention, DeepGEMM gained scoring kernels for that indexer (weighted-ReLU multi-query-attention logits) on 28 September 2025, and serving frameworks such as vLLM and SGLang wired those kernels in for day-zero support ^[1]^[7]^[11]. DeepGEMM also sits alongside FlashMLA and DeepEP as part of a broader pattern in 2025 of Chinese laboratories releasing not just model weights but the low-level systems software needed to train and serve them efficiently.

How has DeepGEMM evolved since launch?

DeepGEMM did not stop at the launch feature set. Through 2025 the project added the faster NVRTC compilation path (7 May 2025), weight-gradient kernels for both dense and MoE backward passes (14 May 2025), and a full refactor that brought support for both Hopper (SM90) and the newer Blackwell (SM100) architecture (20 July 2025) ^[1]. Running on Blackwell requires CUDA 12.9 or later and uses a packed UE8M0 scaling format for the fine-grained scales, whereas SM90 builds work with CUDA 12.3 and up, though DeepSeek recommends 12.9 or higher even on Hopper for the best performance ^[1]. The Blackwell path also widened the supported memory layouts: where the SM90 kernels handle only the NT layout (row-major times column-major) that DeepSeek's models use, the SM100 kernels support all four combinations (NT, TN, NN, and TT) ^[1].

An update on 16 April 2026 added a fused MoE path with overlapped communication that DeepSeek calls Mega MoE, an FP8-times-FP4 mixed-precision GEMM, an FP4 indexer, and support for Programmatic Dependent Launch (PDL) ^[1]. Mega MoE fuses the whole expert path into a single kernel: expert-parallel dispatch, the first FP8-times-FP4 linear layer, the SwiGLU activation, the second FP8-times-FP4 linear layer, and the expert-parallel combine ^[1]. The FP4 work connects DeepGEMM to the microscaling formats and NVFP4 directions that hardware vendors are pushing for even lower-precision inference.
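The packed UE8M0 scaling format used on Blackwell stores each scale as a bare 8-bit exponent, so every scale is a power of two and four of them fit in one 32-bit word. The sketch below shows one plausible encoding under assumptions that are not taken from the repository: a bias of 127 as in the OCP microscaling E8M0 type, rounding each scale up to the next power of two, and little-endian byte order inside the word. All function names are illustrative.

```python
import math

UE8M0_BIAS = 127  # exponent bias assumed here, as in the OCP microscaling E8M0 type


def to_ue8m0(scale: float) -> int:
    """Round a positive scale up to a power of two and return its biased exponent byte."""
    m, e = math.frexp(scale)                 # scale = m * 2**e, 0.5 <= m < 1
    exponent = e - 1 if m == 0.5 else e      # smallest p with 2**p >= scale
    return max(0, min(254, exponent + UE8M0_BIAS))


def from_ue8m0(byte: int) -> float:
    return 2.0 ** (byte - UE8M0_BIAS)


def pack4(codes):
    """Pack four exponent bytes into one 32-bit word (little-endian order assumed)."""
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0xFF) << (8 * i)
    return word


def unpack4(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]


scales = [0.75, 1.0, 0.003, 20.0]
codes = [to_ue8m0(s) for s in scales]
word = pack4(codes)
print(codes, hex(word), [from_ue8m0(c) for c in unpack4(word)])
```

Rounding up rather than to nearest means the stored scale is never smaller than the true one, so values divided by it cannot exceed the range they were scaled for.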

Through these changes the description on the repository broadened. What began as "clean and efficient FP8 GEMM kernels with fine-grained scaling" is now presented as a unified, high-performance tensor-core kernel library that brings GEMMs in FP8, FP4, and BF16, along with fused MoE kernels, multi-query-attention scoring for the V3.2 lightning indexer, and a TensorFloat-32 HyperConnection (HC) pre-normalization GEMM, into a single CUDA codebase ^[1]. The design philosophy stayed constant: keep the core small, compile at runtime, and lean on a handful of well-understood hardware features rather than a large abstraction layer.

Who uses DeepGEMM, and what are its limitations?

After release DeepGEMM was studied closely by the open-source community as a compact, well-documented example of production FP8 kernels, and the repository accumulated thousands of GitHub stars within its first year ^[1]. Inference frameworks adopted it as a building block, most visibly in the day-zero support that vLLM and SGLang built for DeepSeek-V3.2's sparse attention using DeepGEMM's indexer kernels ^[1]^[11].

The main limitation is hardware scope. DeepGEMM is written for NVIDIA data-center GPUs and uses architecture-specific features such as the TMA, so its kernels do not transfer to other vendors' accelerators, and the FP8 path in particular assumes the Hopper or Blackwell Tensor Cores. The benchmark numbers are DeepSeek's own and are tied to specific matrix shapes, GPUs, and compiler versions, so they should be treated as indicative rather than guaranteed for an arbitrary workload. And because the library specializes aggressively to the shapes it sees, the first call for a new shape pays a JIT compilation cost before the fast kernel is available, although caching and the faster NVRTC path reduce that overhead in practice ^[1].

References

[1] DeepSeek. "DeepGEMM: a unified, high-performance tensor core kernel library." GitHub repository, 2025-2026. github.com/...DeepGEMM
[2] DeepSeek. "open-infra-index: Open Source Week." GitHub, 2025. github.com/...open-infra-index
[3] DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv preprint arXiv:2412.19437, 2024. arxiv.org/...2412.19437
[4] MarkTechPost. "DeepSeek AI Releases DeepGEMM: An FP8 GEMM Library that Supports both Dense and MoE GEMMs Powering V3/R1 Training and Inference." 25 February 2025. marktechpost.com/...g-v3-r1-training-and-inference
[5] NVIDIA. "CUTLASS: CUDA Templates for Linear Algebra Subroutines." GitHub repository. github.com/...cutlass
[6] NVIDIA. "NVIDIA Hopper Architecture." NVIDIA Data Center. nvidia.com/...hopper-architecture
[7] DeepSeek-AI. "DeepSeek-V3.2-Exp." GitHub repository, 2025. github.com/...DeepSeek-V3.2-Exp
[8] Micikevicius, Paulius, et al. "FP8 Formats for Deep Learning." arXiv preprint arXiv:2209.05433, 2022. arxiv.org/...2209.05433
[9] DigiAlps. "DeepSeek AI Drops DeepGEMM, An FP8 GEMM Library That Powers V3 and R1 AI Models." 2025. digialps.com/...ry-that-powers-v3-and-r1-ai-models
[10] DeepSeek. "DeepEP: an efficient expert-parallel communication library." GitHub, 2025. github.com/...DeepEP
[11] vLLM. "DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action." vLLM Blog, 29 September 2025. blog.vllm.ai/...deepseek-v3-2
[12] Zhang, Jinpeng. "DeepSeek Technical Analysis (5): FP8 Training." Medium, 2025. dataturbo.medium.com/...-fp8-training-ff34768727b8
[13] DeepSeek-AI. "Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures." arXiv preprint arXiv:2505.09343, 2025. arxiv.org/...2505.09343
[14] NVIDIA. "NVIDIA Hopper Tensor Memory Accelerator and FP8 Tensor Cores (Hopper Architecture Whitepaper)." NVIDIA, 2022. resources.nvidia.com/en-us-tensor-core
[15] DeepSeek. "Day 3 of #OpenSourceWeek: DeepGEMM." X (formerly Twitter), 26 February 2025. x.com/...1894553164235640933
