# DeepGEMM

> Source: https://aiwiki.ai/wiki/deepgemm
> Updated: 2026-06-09
> Categories: AI Infrastructure, Developer Tools, Open Source AI
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**DeepGEMM** is an open-source library from [DeepSeek](/wiki/deepseek) that provides fast FP8 general matrix multiplication (GEMM) kernels for NVIDIA Hopper GPUs. It was released on 26 February 2025 as the third project in DeepSeek's "Open Source Week," a five-day series in which the company published several pieces of the training and inference infrastructure behind [DeepSeek-V3](/wiki/deepseek_v3) and [DeepSeek-R1](/wiki/deepseek_r1) [1][2][4]. The library is deliberately small, with a core kernel of roughly 300 lines of code, yet it reaches performance competitive with, and in many shapes ahead of, heavily tuned expert libraries [1][9].

DeepGEMM matters because matrix multiplication dominates the cost of both training and serving [large language models](/wiki/large_language_model), and doing that arithmetic in 8-bit floating point (FP8) can roughly double peak Tensor Core throughput and halve memory traffic compared with 16-bit formats [8]. The hard part is keeping FP8 numerically stable. DeepGEMM packages the techniques DeepSeek used to train a frontier model in FP8 into a clean, reusable component [1][3]. Since its first release the project has grown beyond a single kernel: by 2026 it had become a broader tensor-core kernel library covering FP8, FP4, and BF16 GEMMs, fused mixture-of-experts kernels, and the newer Blackwell architecture, while keeping the same compact, JIT-compiled design [1].

## Background and release

DeepSeek-V3 was one of the first widely deployed large models trained natively in FP8. Its technical report describes a fine-grained quantization scheme in which activations and weights are scaled in small blocks or tiles rather than per tensor, so that a few large values do not force the rest of a tensor into a lossy range [3]. Activations are grouped and scaled on a 1x128 tile basis (one token across 128 channels) and weights on a 128x128 block basis (128 input channels by 128 output channels), with the E4M3 FP8 format used throughout and scaling factors computed online from the current maximum absolute value of each tile or block, rather than inferred from a history of earlier iterations [3]. Reproducing that scheme efficiently requires GEMM kernels that understand per-block scaling factors and that compensate for the limited accumulation precision of Hopper's FP8 Tensor Cores.
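To make the tile and block granularity concrete, the sketch below counts how many scaling factors each operand carries. The function names are hypothetical and the example sizes (4096 tokens, 7168 channels) are chosen only for illustration; the 1x128 and 128x128 granularities are the part taken from the report.

```python
import math

TILE = 128  # granularity from the DeepSeek-V3 recipe: 1x128 tiles, 128x128 blocks

def activation_scale_shape(m, k):
    """One scale per token per 128-channel tile: shape (M, ceil(K / 128))."""
    return (m, math.ceil(k / TILE))

def weight_scale_shape(n, k):
    """One scale per 128x128 block: shape (ceil(N / 128), ceil(K / 128))."""
    return (math.ceil(n / TILE), math.ceil(k / TILE))

# Illustrative sizes, not taken from any specific layer.
act = activation_scale_shape(4096, 7168)
wgt = weight_scale_shape(7168, 7168)
print(act, wgt)
```

The scale tensors are small next to the matrices they describe (one scale per 128 activation values, one per 16,384 weight values), which is why the scheme adds little memory traffic.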

The full V3 model is large: 671 billion total parameters with 37 billion activated per token, trained on 14.8 trillion tokens [3]. DeepSeek reported the full training run at roughly 2.788 million H800 GPU hours, of which about 2.664 million went to pre-training, an unusually low figure for a model of that scale [3]. Efficient FP8 matrix multiplication is one reason that number is as small as it is, since GEMM is where most of the floating-point work happens.

Rather than keep those kernels internal, DeepSeek open-sourced them during Open Source Week, a five-day run from 24 to 28 February 2025. The series included [FlashMLA](/wiki/flashmla) on day one (an attention-decoding kernel for Hopper), [DeepEP](/wiki/deepep) on day two (the first open-source expert-parallel communication library for [mixture-of-experts](/wiki/mixture_of_experts) models), DeepGEMM on day three, the [DualPipe](/wiki/dualpipe) pipeline-parallelism algorithm and EPLB expert-parallel load balancer on day four, and the Fire-Flyer File System (3FS) on day five [2]. The open-infra-index that catalogs the week describes DeepGEMM as "an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference," with up to "1350+ FP8 TFLOPS on Hopper GPUs" [2]. The repository is released under the MIT license [1].

## FP8 and the fine-grained scaling design

FP8 stores each number in eight bits, which gives very little room for both range and precision. DeepGEMM uses the E4M3 variant (four exponent bits, three mantissa bits) for the matrix inputs, matching DeepSeek-V3's training recipe [3][8]. Multiplying two FP8 matrices is fast on Hopper Tensor Cores, but two numerical problems get in the way. The first is dynamic range: a single tensor often contains values spanning many orders of magnitude, and one global scale cannot represent all of them well. DeepGEMM addresses this by consuming the fine-grained, per-tile and per-block scaling factors that DeepSeek's quantization recipe produces, so each small region of a matrix carries its own scale [1][3].
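The dynamic-range problem can be illustrated with a small simulation. The sketch below is not DeepGEMM code: `round_to_e4m3` is a simplified software model of E4M3 rounding (saturating at 448, the largest finite E4M3 value), and `fake_quantize` is a hypothetical helper that quantizes and immediately dequantizes a row with one scale per tile.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def round_to_e4m3(x):
    """Round a float to the nearest E4M3-representable value (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)            # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)            # clamp to the subnormal exponent
    step = 2.0 ** (exp - 3)         # three mantissa bits: 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)

def fake_quantize(values, tile):
    """Quantize then dequantize, with one scale per `tile` consecutive values."""
    out = []
    for start in range(0, len(values), tile):
        chunk = values[start:start + tile]
        amax = max(abs(v) for v in chunk)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.extend(round_to_e4m3(v / scale) * scale for v in chunk)
    return out

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# 128 small values, followed by a tile that contains one large outlier.
small = [0.01 * (i % 7 + 1) for i in range(128)]
row = small + [1000.0] + [0.5] * 127

per_tensor = fake_quantize(row, tile=len(row))   # one scale for everything
per_tile = fake_quantize(row, tile=128)          # one scale per 1x128 tile

err_tensor = max_abs_error(small, per_tensor[:128])
err_tile = max_abs_error(small, per_tile[:128])
print(f"per-tensor error {err_tensor:.5f}, per-tile error {err_tile:.5f}")
```

With a single scale, the outlier pushes the small values into the coarse subnormal end of the format; with per-tile scales the outlier only affects its own tile, and the first tile keeps its full resolution.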

The second problem is accumulation precision. A GEMM sums many products into each output element, and on Hopper the FP8 Tensor Cores accumulate those partial products at reduced internal precision (DeepSeek measured roughly 14 bits retained rather than the full FP32 width), which lets rounding error build up over a long reduction [3]. DeepGEMM's answer is two-level accumulation, also called promotion. The Tensor Cores do the bulk of the multiply-accumulate work quickly, and at intervals the partial results are promoted to a higher-precision accumulator on the CUDA cores, where the running sum is kept in FP32 [1][3]. A later DeepSeek paper describes the narrow Tensor Core accumulator itself as an FP22 register (one sign, eight exponent, and 13 mantissa bits), which is the limitation the promotion step works around [13]. Promotion recovers most of the accuracy lost to the narrow accumulator while keeping the Tensor Cores busy on the parts of the computation they handle best. The result is FP8 throughput with numerical behavior close enough to higher precision to train and serve a frontier model.
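The effect of promotion can be reproduced with a toy model. The sketch below assumes an accumulator that truncates every partial sum to 13 stored mantissa bits, and a promotion interval of 128 elements (the interval given in the DeepSeek-V3 report). Real Tensor Core behavior is more involved, and a Python float stands in for the FP32 accumulator, so this illustrates the idea rather than modeling the hardware.

```python
import math
import random

def truncate(x, mantissa_bits=13):
    """Keep the implicit leading bit plus `mantissa_bits` stored bits (toy model)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                      # x = m * 2**e, 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)
    return math.ldexp(math.trunc(m * scale) / scale, e)

def narrow_sum(products):
    """Accumulate every product in the narrow accumulator."""
    acc = 0.0
    for p in products:
        acc = truncate(acc + p)
    return acc

def promoted_sum(products, interval=128):
    """Accumulate `interval` products narrowly, then add the partial sum
    to a wide accumulator (a Python float stands in for FP32 here)."""
    total = 0.0
    for start in range(0, len(products), interval):
        total += narrow_sum(products[start:start + interval])
    return total

rng = random.Random(0)
products = [rng.uniform(0.5, 1.5) for _ in range(4096)]
exact = math.fsum(products)

narrow_err = abs(narrow_sum(products) - exact) / exact
promoted_err = abs(promoted_sum(products) - exact) / exact
print(f"narrow only: {narrow_err:.2%}, with promotion: {promoted_err:.4%}")
```

The narrow accumulator loses more with every addition as the running sum grows, because each truncation discards bits relative to the size of the sum. Resetting it every 128 elements keeps the partial sums small, so each truncation costs far less.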

## Hopper-specific optimizations

DeepGEMM was written for the NVIDIA Hopper architecture and leans on hardware features specific to it. The most important is the Tensor Memory Accelerator (TMA), a dedicated unit introduced with Hopper that moves large tiles of data between global and shared memory asynchronously, without tying up the regular compute threads [1][6]. DeepGEMM uses the TMA for loads, stores, and multicast broadcasts of input tiles, and it prefetches TMA descriptors so the address generation does not stall the pipeline [1]. The scaling-factor layouts the kernels expect are described as "TMA-aligned" for exactly this reason [1].

On top of the TMA, the library uses a persistent, warp-specialized kernel design. Different warps are assigned to distinct jobs: some handle the TMA data movement, some issue the matrix-multiply-accumulate (MMA) instructions on the Tensor Cores, and some run the CUDA-core promotion step. Keeping these stages overlapping means the memory pipeline and the compute pipeline run at the same time rather than waiting on each other [1]. Two smaller tricks round out the performance work. DeepGEMM supports unaligned block sizes (for example a block of 112 rather than a power of two), which improves streaming-multiprocessor utilization on matrix shapes that do not divide evenly into standard tiles [1]. It also interleaves FFMA (fused floating-point multiply-add) instructions at the SASS assembly level, a low-level scheduling tweak DeepSeek reported as worth more than 10 percent on some shapes [1]. Because the kernels are compiled at runtime, these choices can be specialized to the exact matrix dimensions a workload uses.
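The benefit of an unaligned block size comes down to tile-count arithmetic. The sketch below assumes 132 streaming multiprocessors (the count on H100 and H800 SXM parts) and an illustrative 256 x 7168 output; the helper names are hypothetical.

```python
import math

def num_tiles(m, n, block_m, block_n):
    """Output tiles a kernel must schedule for an M x N result."""
    return math.ceil(m / block_m) * math.ceil(n / block_n)

def utilization(tiles, num_sms):
    """Fraction of SM slots doing useful work across all waves."""
    waves = math.ceil(tiles / num_sms)
    return tiles / (waves * num_sms)

NUM_SMS = 132          # assumed SM count (H100/H800 SXM)
M, N = 256, 7168       # example output shape, chosen for illustration

aligned = num_tiles(M, N, block_m=128, block_n=128)
unaligned = num_tiles(M, N, block_m=128, block_n=112)
print(aligned, f"{utilization(aligned, NUM_SMS):.0%}")      # 112 85%
print(unaligned, f"{utilization(unaligned, NUM_SMS):.0%}")  # 128 97%
```

With 128-wide tiles the shape yields 112 tiles and leaves 20 SMs idle; a 112-wide tile yields 128 tiles in the same single wave.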

Just-in-time compilation is itself a design pillar. All kernels are compiled at runtime through a lightweight JIT module, so there is no heavy build step at installation and the library only ever compiles the shapes a workload needs [1]. A May 2025 update added an NVRTC path that DeepSeek reported as up to ten times faster to compile, and a later refactor introduced a low-overhead JIT C++ module [1]. This keeps the codebase small enough that the single core kernel reads as a teaching reference as well as a production tool.
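The shape-specialized, compile-on-first-use pattern can be sketched in a few lines. Everything here is hypothetical: `get_kernel` stands in for a JIT step and returns a plain Python closure, not a GPU kernel, and the cache is an ordinary `functools.lru_cache` rather than DeepGEMM's own mechanism.

```python
from functools import lru_cache

compile_log = []  # records each (pretend) compilation

@lru_cache(maxsize=None)
def get_kernel(n, k, block_n):
    """Hypothetical stand-in for a JIT step: build a function specialized
    to fixed N, K and tile width, and cache it by those parameters."""
    compile_log.append((n, k, block_n))
    num_n_tiles = -(-n // block_n)          # ceil division, fixed at "compile" time

    def kernel(m):
        # A real kernel would run the GEMM; this just reports its launch shape.
        return {"m": m, "n_tiles": num_n_tiles, "k": k}

    return kernel

first = get_kernel(7168, 16384, 128)    # compiles
second = get_kernel(7168, 16384, 128)   # served from the cache
third = get_kernel(2112, 7168, 112)     # new shape, compiles again
print(len(compile_log), first is second, third(64))
```

Because N, K, and the tile width are fixed when the kernel is built, values derived from them become constants of the specialized function, and only M is left as a runtime argument.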

## MoE grouped and masked GEMM support

Mixture-of-experts models do not run one big matrix multiply per layer. Instead each token is routed to a subset of experts, and every expert multiplies only the tokens sent to it, so a layer turns into many smaller GEMMs of varying sizes. DeepGEMM provides grouped GEMM kernels for exactly this pattern, in two layouts [1][9].

The contiguous layout is aimed at training and prefill. Tokens for the different experts are packed back to back along the M axis (the row dimension) while the N and K dimensions stay fixed, and the kernel walks through the groups in one launch [1]. The masked layout is aimed at decoding-time inference. The number of tokens per expert is not known until routing happens at runtime, which is awkward when the kernel launch is captured ahead of time in a CUDA graph. DeepGEMM handles this with a mask that tells the kernel how many rows in each group are valid, so it can skip the padding and compute only the real work while still fitting inside a static CUDA graph [1][9]. Grouped GEMM is the operation a mixture-of-experts layer needs, and having both layouts means the same library serves the dense parts of a model and its sparse expert layers.
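The two layouts can be described with a plain reference implementation. The sketch below is a semantic model only: the function names and argument conventions are hypothetical rather than DeepGEMM's API, it uses nested lists instead of FP8 tensors, and each expert's weight is stored as N rows of length K.

```python
def matvec(row, weight):
    """Multiply one token (length K) by an expert weight stored as N rows of length K."""
    return [sum(a * b for a, b in zip(row, w_row)) for w_row in weight]

def grouped_gemm_contiguous(x, group_of_row, weights):
    """Contiguous layout: all tokens packed along M, with an expert index per row."""
    return [matvec(row, weights[g]) for row, g in zip(x, group_of_row)]

def grouped_gemm_masked(x, masked_m, weights):
    """Masked layout: x[g] holds max_m rows per expert, of which only the
    first masked_m[g] are valid; padded rows are skipped and stay zero."""
    n = len(weights[0])
    out = [[[0.0] * n for _ in group] for group in x]
    for g, group in enumerate(x):
        for r in range(masked_m[g]):
            out[g][r] = matvec(group[r], weights[g])
    return out

# Two experts with K = N = 2: expert 0 is the identity, expert 1 doubles its input.
weights = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]

# Expert 0 receives two tokens, expert 1 receives one.
packed = grouped_gemm_contiguous([[1, 2], [3, 4], [5, 6]], [0, 0, 1], weights)

# Same tokens in a fixed-size buffer; [9, 9] is padding that must be ignored.
padded = grouped_gemm_masked([[[1, 2], [3, 4]], [[5, 6], [9, 9]]], [2, 1], weights)
print(packed, padded)
```

The masked variant always sees the same buffer shape, so only the contents of `masked_m` change from step to step. That property is what allows the launch to be captured once in a CUDA graph.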

## Performance

DeepSeek benchmarked the launch version on NVIDIA H800 GPUs using NVCC 12.8 across the matrix shapes that show up in DeepSeek-V3 and R1 [1][9]. The headline dense-GEMM number is up to 1,358 FP8 TFLOPS on a large 4096x7168x16384 multiplication [9]. That shape is compute-bound, so its memory traffic is modest at about 343 GB/s; the peak bandwidth figure of about 2,668 GB/s belongs to the small-M, memory-bound shape 64x7168x16384 [1]. On a smaller but common shape, 128x7168x16384, the kernel reaches about 645 TFLOPS [9]. Against a carefully tuned CUTLASS-based internal baseline, the dense kernels range from roughly parity on the largest shapes up to about 2.7x faster, with the largest gains on the smaller and more irregular shapes where generic libraries leave performance unused [1][9]. The grouped MoE kernels, in both contiguous and masked layouts, run about 1.1x to 1.2x faster than the baseline [1][9].

| Shape (M x N x K) | Kernel type | Performance |
| --- | --- | --- |
| 4096 x 7168 x 16384 | Dense GEMM | ~1,358 TFLOPS, ~343 GB/s |
| 128 x 7168 x 16384 | Dense GEMM | ~645 TFLOPS (~1.4x speedup) |
| Various small/irregular | Dense GEMM | up to ~2.7x speedup vs CUTLASS baseline |
| Grouped (contiguous, MoE) | Grouped GEMM | ~1.1x to 1.2x speedup |
| Grouped (masked, MoE) | Grouped GEMM | ~1.1x to 1.2x speedup |

These figures come from DeepSeek's own benchmarks and depend heavily on matrix dimensions, so they are best read as representative rather than universal. They also moved over time: an April 2025 update reported DeepGEMM reaching up to 1,550 TFLOPS on H800 as the kernels were tuned further [1].
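Throughput and bandwidth figures for a given shape are linked by simple arithmetic, which is a useful sanity check when reading benchmark tables. The sketch below assumes FP8 inputs (one byte each), a BF16 output (two bytes), each operand moved through memory once, and it ignores the small scaling-factor tensors; `gemm_stats` is a hypothetical helper.

```python
def gemm_stats(m, n, k, tflops, in_bytes=1, out_bytes=2):
    """Derive runtime and implied memory bandwidth from a reported TFLOPS figure.

    Assumes FP8 inputs (1 byte), BF16 output (2 bytes), each operand moved
    once, and ignores the small scaling-factor tensors.
    """
    flops = 2 * m * n * k                         # one multiply and one add per term
    seconds = flops / (tflops * 1e12)
    bytes_moved = (m * k + n * k) * in_bytes + m * n * out_bytes
    return {
        "microseconds": seconds * 1e6,
        "gb_per_s": bytes_moved / seconds / 1e9,
        "flops_per_byte": flops / bytes_moved,
    }

# The 128 x 7168 x 16384 shape at the reported ~645 TFLOPS.
stats = gemm_stats(128, 7168, 16384, tflops=645)
print({name: round(value, 1) for name, value in stats.items()})
```

With M = 128 the weight matrix dominates the traffic, so the kernel is limited mainly by memory bandwidth. Growing M raises the arithmetic done per byte moved and pushes a shape toward the compute-bound regime, which is why TFLOPS and GB/s peak on different shapes.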

| Property | Detail |
| --- | --- |
| Developer | DeepSeek |
| Released | 26 February 2025 (Open Source Week, day 3) |
| Target hardware | NVIDIA Hopper ([H100](/wiki/nvidia_h100), H800); later NVIDIA [Blackwell](/wiki/nvidia_blackwell) (SM100) |
| Precision | FP8 (E4M3) with fine-grained scaling and two-level (promoted) FP32 accumulation; later FP4 and BF16 |
| Operations | Dense GEMM, grouped GEMM (contiguous and masked), later fused MoE |
| Compilation | Just-in-time, no install-time build |
| Core size | About 300 lines for the main kernel |
| License | MIT |

## Relationship to CUTLASS and DeepSeek models

DeepGEMM describes itself as drawing on ideas from NVIDIA's [CUTLASS](/wiki/cutlass) and its CuTe layout library, while deliberately avoiding a heavy dependency on them. The repository puts it plainly: it "leverages some concepts from CUTLASS and CuTe, but avoids heavy reliance on their templates or algebras" [1]. CUTLASS is a large, general, template-heavy framework that covers many data types and architectures; DeepGEMM trades that generality for a single readable kernel that does FP8 on Hopper and does it well [1][5]. The benchmark comparisons are made against an internal CUTLASS-based implementation, which is the natural baseline for this kind of work [9].

Inside DeepSeek's own stack, DeepGEMM is the FP8 matrix-multiply engine behind both training and inference of V3 and R1 [1][2]. The same fine-grained quantization and promotion techniques that the V3 technical report describes are what the public kernels implement, which is why DeepGEMM doubles as a concrete reference for how that training run was made to work in FP8 [3]. When DeepSeek shipped V3.2-Exp in September 2025 with its sparse "lightning indexer" attention, DeepGEMM gained scoring kernels for that indexer (weighted-ReLU multi-query-attention logits), and serving frameworks such as [vLLM](/wiki/vllm) and [SGLang](/wiki/sglang) wired those kernels in for day-zero support [1]. DeepGEMM also sits alongside FlashMLA and DeepEP as part of a broader pattern in 2025 of Chinese laboratories releasing not just model weights but the low-level systems software needed to train and serve them efficiently.
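The weighted-ReLU multi-query-attention logits mentioned above have a compact definition: for query token t and key position s, the score is the sum over indexer heads of a per-head weight times the ReLU of the dot product between that head's query and the single shared key. The sketch below is a plain-Python reference under that reading, with hypothetical names, no causal masking, and no FP8.

```python
def relu(x):
    return x if x > 0 else 0.0

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def indexer_logits(q, k, w):
    """q[t][h] is the query vector of head h for token t, k[s] is the single
    shared key vector for position s, w[t][h] is a per-head weight.
    Returns logits[t][s] = sum_h w[t][h] * relu(q[t][h] . k[s])."""
    return [
        [sum(w_t[h] * relu(dot(q_t[h], k_s)) for h in range(len(q_t))) for k_s in k]
        for q_t, w_t in zip(q, w)
    ]

# One query token with two heads, three key positions, head dimension 2.
q = [[[1.0, 0.0], [0.0, -1.0]]]
k = [[2.0, 1.0], [-1.0, -3.0], [0.0, 0.0]]
w = [[0.5, 2.0]]

logits = indexer_logits(q, k, w)
top = max(range(len(k)), key=lambda s: logits[0][s])
print(logits, top)
```

Because all heads share one key per position, the inner products form a GEMM-shaped workload, which is why a matrix-multiply library is a natural home for this kernel. A sparse-attention layer would then keep only the highest-scoring positions for each query.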

## Later development

DeepGEMM did not stop at the launch feature set. Through 2025 and 2026 the project added weight-gradient kernels for both dense and MoE backward passes (May 2025), the faster NVRTC compilation path, and a full refactor that brought support for both Hopper (SM90) and the newer Blackwell (SM100) architecture in July 2025 [1]. Running on Blackwell requires CUDA 12.9 or later and uses a packed UE8M0 scaling format for the fine-grained scales, whereas SM90 builds work with CUDA 12.3 and up [1]. A 2026 update folded in a fused MoE path with overlapped communication that DeepSeek calls Mega MoE, an FP8-times-FP4 mixed-precision GEMM, an FP4 indexer, and support for Programmatic Dependent Launch (PDL) [1]. The FP4 work connects DeepGEMM to the [microscaling formats](/wiki/microscaling_formats) and [NVFP4](/wiki/nvfp4) directions that hardware vendors are pushing for even lower-precision inference.
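UE8M0 stores a scale as an unsigned 8-bit exponent with no mantissa, so every scale is a power of two. The sketch below illustrates the encoding under a few assumptions: a bias of 127 (as in the E8M0 scale type of the microscaling formats), rounding each scale up to the next power of two, and a packing order with the first code in the low byte. None of the helper names come from DeepGEMM.

```python
import math

BIAS = 127  # E8M0 exponent bias; code 255 is reserved for NaN

def encode_ue8m0(scale):
    """Round a positive scale up to a power of two and return its 8-bit code."""
    m, e = math.frexp(scale)                 # scale = m * 2**e, 0.5 <= m < 1
    exponent = e - 1 if m == 0.5 else e      # ceil(log2(scale))
    return max(0, min(254, exponent + BIAS))

def decode_ue8m0(code):
    return 2.0 ** (code - BIAS)

def pack4(codes):
    """Pack four 8-bit codes into one 32-bit word (first code in the low byte;
    the byte order is an assumption of this sketch)."""
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0xFF) << (8 * i)
    return word

def unpack4(word):
    return [(word >> (8 * i)) & 0xFF for i in range(4)]

scales = [1.0, 0.3, 0.0078125, 5.0]
codes = [encode_ue8m0(s) for s in scales]
word = pack4(codes)
decoded = [decode_ue8m0(c) for c in unpack4(word)]
print(codes, hex(word), decoded)
```

Rounding up rather than to nearest means that dividing a tile by its decoded scale cannot push the largest element past the FP8 range. Packing four codes per 32-bit word quarters the number of scale elements a kernel has to load.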

Through these changes the description on the repository broadened. What began as "clean and efficient FP8 GEMM kernels with fine-grained scaling" is now presented as a unified, high-performance tensor-core kernel library that brings GEMMs in FP8, FP4, and BF16, along with fused MoE kernels, into a single [CUDA](/wiki/cuda) codebase [1]. The design philosophy stayed constant: keep the core small, compile at runtime, and lean on a handful of well-understood hardware features rather than a large abstraction layer.

## Adoption and limitations

After release DeepGEMM was studied closely by the open-source community as a compact, well-documented example of production FP8 kernels, and the repository accumulated thousands of GitHub stars within its first year [1]. Inference frameworks adopted it as a building block, most visibly in the day-zero support that vLLM and SGLang built for DeepSeek-V3.2's sparse attention using DeepGEMM's indexer kernels [1].

The main limitation is hardware scope. DeepGEMM is written for NVIDIA data-center GPUs and uses architecture-specific features such as the TMA, so its kernels do not transfer to other vendors' accelerators, and the FP8 path in particular assumes the Hopper or Blackwell Tensor Cores. The benchmark numbers are DeepSeek's own and are tied to specific matrix shapes, GPUs, and compiler versions, so they should be treated as indicative rather than guaranteed for an arbitrary workload. And because the library specializes aggressively to the shapes it sees, the first call for a new shape pays a JIT compilation cost before the fast kernel is available, although caching and the faster NVRTC path reduce that overhead in practice [1].

## References

1. DeepSeek. "DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling." GitHub repository, 2025-2026. https://github.com/deepseek-ai/DeepGEMM
2. DeepSeek. "open-infra-index: Open Source Week." GitHub, 2025. https://github.com/deepseek-ai/open-infra-index
3. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv preprint arXiv:2412.19437, 2024. https://arxiv.org/abs/2412.19437
4. MarkTechPost. "DeepSeek AI Releases DeepGEMM: An FP8 GEMM Library that Supports both Dense and MoE GEMMs Powering V3/R1 Training and Inference." 25 February 2025. https://www.marktechpost.com/2025/02/25/deepseek-ai-releases-deepgemm-an-fp8-gemm-library-that-supports-both-dense-and-moe-gemms-powering-v3-r1-training-and-inference/
5. NVIDIA. "CUTLASS: CUDA Templates for Linear Algebra Subroutines." GitHub repository. https://github.com/NVIDIA/cutlass
6. NVIDIA. "NVIDIA Hopper Architecture." NVIDIA Data Center. https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/
7. DeepSeek-AI. "DeepSeek-V3.2-Exp." GitHub repository, 2025. https://github.com/deepseek-ai/DeepSeek-V3.2-Exp
8. Micikevicius, Paulius, et al. "FP8 Formats for Deep Learning." arXiv preprint arXiv:2209.05433, 2022. https://arxiv.org/abs/2209.05433
9. DigiAlps. "DeepSeek AI Drops DeepGEMM, An FP8 GEMM Library That Powers V3 and R1 AI Models." 2025. https://digialps.com/deepseek-ai-drops-deepgemm-an-fp8-gemm-library-that-powers-v3-and-r1-ai-models/
10. DeepSeek. "DeepEP: an efficient expert-parallel communication library." GitHub, 2025. https://github.com/deepseek-ai/DeepEP
11. vLLM. "DeepSeek-V3.2-Exp in vLLM: Fine-Grained Sparse Attention in Action." vLLM Blog, 29 September 2025. https://blog.vllm.ai/2025/09/29/deepseek-v3-2.html
12. Zhang, Jinpeng. "DeepSeek Technical Analysis (5): FP8 Training." Medium, 2025. https://dataturbo.medium.com/deepseek-technical-analysis-5-fp8-training-ff34768727b8
13. DeepSeek-AI. "Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures." arXiv preprint arXiv:2505.09343, 2025. https://arxiv.org/abs/2505.09343
14. NVIDIA. "NVIDIA Hopper Tensor Memory Accelerator and FP8 Tensor Cores (Hopper Architecture Whitepaper)." NVIDIA, 2022. https://resources.nvidia.com/en-us-tensor-core

