DeepGEMM
Last reviewed
Jun 1, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 915 words
DeepGEMM is an open-source library from DeepSeek that provides fast FP8 general matrix multiplication (GEMM) kernels for NVIDIA Hopper GPUs. It was released on 26 February 2025 as the third project in DeepSeek's "Open Source Week," a five-day series in which the company published several pieces of the training and inference infrastructure behind DeepSeek-V3 and DeepSeek-R1. The library is deliberately small, with a core kernel of roughly 300 lines of code, yet in DeepSeek's published benchmarks it is competitive with, and on many shapes ahead of, heavily tuned expert libraries.
DeepGEMM matters because matrix multiplication dominates the cost of both training and serving large language models. Doing that arithmetic in 8-bit floating point (FP8) can roughly double peak throughput and halve memory traffic compared with 16-bit formats, since each value occupies one byte instead of two. The hard part is keeping FP8 numerically stable. DeepGEMM packages the techniques DeepSeek used to train a frontier model in FP8 into a clean, reusable component.
DeepSeek-V3 was one of the first widely deployed large models trained natively in FP8. Its technical report describes a fine-grained quantization scheme in which activations and weights are scaled in small blocks or tiles rather than per tensor, so that a few large values do not force the rest of a tensor into a lossy range. Reproducing that scheme efficiently requires GEMM kernels that understand per-block scaling factors and that compensate for the limited accumulation precision of Hopper's FP8 Tensor Cores.
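The effect of per-block scaling can be sketched in a few lines of plain Python. This is an illustration, not DeepGEMM code: `round_to_e4m3` is a hypothetical helper that simulates rounding to the FP8 E4M3 format (largest finite value 448), and the block size of 128 and the synthetic data are assumptions chosen for the demonstration. The sketch quantizes a vector containing one large outlier twice, once with a single scale for the whole vector and once with one scale per block, and compares the reconstruction error.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Simulate round-to-nearest into FP8 E4M3, saturating at +/-448 (illustrative)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so the binade exponent is ex - 1.
    _, ex = math.frexp(a)
    e = max(ex - 1, -6)    # -6 is the smallest normal exponent; smaller values are subnormal
    step = 2.0 ** (e - 3)  # 3 mantissa bits give 8 steps per binade
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_dequantize(values, block_size):
    """Scale each block so its largest magnitude maps to 448, round, then rescale."""
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        out.extend(round_to_e4m3(v / scale) * scale for v in block)
    return out


def mean_abs_error(reference, approx):
    return sum(abs(r - a) for r, a in zip(reference, approx)) / len(reference)


# Synthetic data (assumption): 256 small values plus a single large outlier.
values = [0.001 * (1 + i % 7) for i in range(256)]
values[5] = 1000.0

per_tensor_err = mean_abs_error(values, quantize_dequantize(values, len(values)))
per_block_err = mean_abs_error(values, quantize_dequantize(values, 128))

print(f"per-tensor scale, mean abs error: {per_tensor_err:.6f}")
print(f"per-block scale,  mean abs error: {per_block_err:.6f}")
```

With one scale for the whole vector, the outlier pushes every small value toward the bottom of the FP8 range. With per-block scales, only the block that contains the outlier pays that cost, which is the motivation for the fine-grained scheme described above.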
Rather than keep those kernels internal, DeepSeek open-sourced them during Open Source Week, alongside FlashMLA (an attention-decoding kernel), DeepEP (an expert-parallel communication library for mixture-of-experts models), the DualPipe and EPLB parallelism tools, and the 3FS file system. DeepGEMM was the day-three release.
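DeepGEMM focuses narrowly on FP8 GEMM and does that one thing well. Its main design choices, summarized in the table below, are fine-grained scaling combined with a higher-precision second accumulation, support for dense and grouped GEMM (in contiguous and masked layouts), and just-in-time compilation with no install-time build.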
The project describes itself as drawing inspiration from NVIDIA's CUTLASS and CuTe abstractions, but it intentionally avoids a heavy dependency on them, keeping the single core kernel readable as a teaching reference as well as a production tool.
On NVIDIA H800 GPUs, DeepSeek reported that DeepGEMM reaches over 1,350 FP8 TFLOPS on suitable shapes, and that it matches or beats an internal CUTLASS-based expert baseline across a wide range of matrix sizes, with speedups reported up to roughly 2.7 times on some smaller or irregular shapes where generic libraries leave performance on the table. These figures come from DeepSeek's own benchmarks and depend heavily on matrix dimensions, so they are best read as representative rather than universal.
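Throughput figures like these follow from simple arithmetic: a GEMM of shape M x N x K performs 2·M·N·K floating-point operations, one multiply and one add per term. The helper names below are hypothetical and the shape is an arbitrary example, not one of DeepSeek's benchmark cases. The sketch converts a kernel time into TFLOPS and shows how long a kernel sustaining the reported 1,350 TFLOPS would take on that shape.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations in an (m x k) @ (k x n) product."""
    return 2 * m * n * k


def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved throughput in TFLOPS for a measured kernel time."""
    return gemm_flops(m, n, k) / seconds / 1e12


def seconds_at_rate(m: int, n: int, k: int, tflops: float) -> float:
    """Kernel time implied by a sustained throughput."""
    return gemm_flops(m, n, k) / (tflops * 1e12)


# Arbitrary example shape (assumption), evaluated at the reported peak rate.
m = n = k = 4096
t = seconds_at_rate(m, n, k, 1350.0)
print(f"{gemm_flops(m, n, k):,} FLOPs take {t * 1e6:.1f} microseconds at 1,350 TFLOPS")
```

Because the operation count grows with all three dimensions while fixed launch and pipeline-fill costs do not, small or irregular shapes tend to sit well below the peak, which is why the reported numbers vary so much by shape.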
| Property | Detail |
|---|---|
| Developer | DeepSeek |
| Released | 26 February 2025 (Open Source Week, day 3) |
| Target hardware | NVIDIA Hopper (H800, H100) |
| Precision | FP8 with fine-grained scaling, higher-precision second accumulation |
| Operations | Dense GEMM, grouped GEMM (contiguous and masked) |
| Compilation | Just-in-time, no install-time build |
| Core size | About 300 lines for the main kernel |
| License | MIT |
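
The "higher-precision second accumulation" entry in the table can be illustrated with a toy simulation. This is a sketch under stated assumptions, not DeepGEMM's kernel: `round_mantissa` is a hypothetical helper that mimics a narrow accumulator by rounding after every addition, the 13-bit width is an arbitrary choice for illustration rather than the actual width of Hopper's accumulator, and the promotion interval of 128 elements is likewise an assumption. The idea shown is that short partial sums are formed in the narrow accumulator and then added into a full-precision running total.

```python
import math


def round_mantissa(x: float, bits: int) -> float:
    """Round x to a float with a `bits`-bit mantissa (simulated narrow accumulator)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    scale = 1 << bits
    return math.ldexp(round(m * scale) / scale, e)


def dot_narrow(a, b, bits=13):
    """Dot product accumulated entirely in the narrow accumulator."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_mantissa(acc + x * y, bits)
    return acc


def dot_promoted(a, b, bits=13, interval=128):
    """Accumulate `interval` terms narrowly, then add each partial sum at full precision."""
    total = 0.0
    for start in range(0, len(a), interval):
        stop = start + interval
        total += dot_narrow(a[start:stop], b[start:stop], bits)
    return total


# Synthetic inputs (assumption): a long reduction of many small terms.
a = [0.1] * 4096
b = [1.0] * 4096
exact = math.fsum(x * y for x, y in zip(a, b))

err_narrow = abs(dot_narrow(a, b) - exact)
err_promoted = abs(dot_promoted(a, b) - exact)
print(f"narrow only: abs error {err_narrow:.4f}")
print(f"promoted:    abs error {err_promoted:.4f}")
```

In the narrow-only version the running sum grows until each new term is smaller than the accumulator's rounding step, so errors compound over the length of the reduction. Promoting partial sums at a fixed interval keeps every narrow accumulation short, which bounds that drift.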
DeepGEMM is one of the components that made DeepSeek-V3 notable for its low reported training cost, since efficient FP8 matrix multiplication is central to training a large model on a constrained GPU budget. After release it was studied closely by the open-source community as a compact, well-documented example of production FP8 kernels, and parts of its approach were referenced in later serving stacks and kernel libraries. Because it targets Hopper specifically, its kernels do not transfer directly to other vendors' accelerators, and follow-up work in the repository has extended it toward newer architectures and lower-precision formats such as those related to NVFP4.
DeepGEMM sits alongside FlashMLA and DeepEP as part of a broader pattern in 2025 of Chinese laboratories releasing not just model weights but the low-level systems software needed to train and serve them efficiently.