DeepEP
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,961 words
DeepEP is an open-source GPU communication library built by DeepSeek for Mixture-of-Experts (MoE) models. It provides fast all-to-all kernels for the two communication steps an MoE layer depends on, usually called dispatch and combine, and it covers both the high-throughput case used in training and prefill and the latency-sensitive case used in decoding. DeepSeek released it on 25 February 2025, the second day of its "Open Source Week," and described it as the first open-source communication library aimed specifically at expert parallelism for MoE training and inference. [1][2][3]
The library matters because MoE only saves compute if the tokens it routes can be moved across GPUs cheaply. DeepEP is the piece that makes that movement fast, and it was tuned to match the exact MoE design used in DeepSeek-V3. [1][4]
A dense transformer sends every token through the same feed-forward network. A Mixture-of-Experts layer instead holds many expert networks and routes each token to only a few of them, picked by a small gating function. That lets the total parameter count grow while the compute spent per token stays roughly flat, which is the main reason MoE has become common in large language models. [5]
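The routing step described above can be sketched in a few lines of Python. This is a generic softmax top-k gate, a common textbook form, and not DeepSeek's exact gating function; the names `softmax` and `top_k_route` and the example logits are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_ids, weights), with weights renormalized to sum to 1.
    """
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)
    chosen = ranked[:k]
    mass = sum(probs[e] for e in chosen)
    return chosen, [probs[e] / mass for e in chosen]

# One token, eight experts, two experts activated (illustrative numbers).
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
experts, weights = top_k_route(logits, k=2)
```

Only the two selected experts run for this token, which is why per-token compute stays roughly flat as more experts are added.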
The catch shows up once the experts are spread across many GPUs, a setup called expert parallelism. A given GPU holds only some of the experts, but the tokens it is processing may be routed to experts living anywhere in the cluster. So before the experts can run, every GPU has to ship each token to whichever GPU owns its chosen expert. That is an all-to-all exchange, and it happens twice per MoE layer.
The first exchange is dispatch. Each GPU sends its tokens out to the GPUs that own the selected experts, so that every expert receives the tokens routed to it. The experts then do their feed-forward computation locally. The second exchange is combine. The expert outputs travel back to the GPUs that originally held those tokens, where they are reassembled in the right order so the rest of the network can continue. [1]
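A toy, single-process simulation makes the two exchanges concrete. It is only a sketch: ranks are modelled as list indices, each token is a single float rather than a hidden vector, and the helper names (`dispatch`, `run_experts`, `combine`, `owner_of`) are hypothetical and do not correspond to DeepEP's API.

```python
def dispatch(tokens_per_rank, routes_per_rank, owner_of):
    """First all-to-all: deliver each token to the rank owning each chosen expert.

    tokens_per_rank[r][i] -> value of token i held on rank r
    routes_per_rank[r][i] -> list of (expert_id, weight) for that token
    owner_of[expert_id]   -> rank that hosts the expert
    """
    inbox = {rank: [] for rank in range(len(tokens_per_rank))}
    for src, tokens in enumerate(tokens_per_rank):
        for idx, value in enumerate(tokens):
            for expert, weight in routes_per_rank[src][idx]:
                inbox[owner_of[expert]].append((src, idx, expert, weight, value))
    return inbox

def run_experts(inbox, expert_fn):
    """Local expert computation on whatever each rank received."""
    return {rank: [(src, idx, weight, expert_fn(expert, value))
                   for src, idx, expert, weight, value in items]
            for rank, items in inbox.items()}

def combine(outputs, tokens_per_rank):
    """Second all-to-all: send results back to the source slot as a weighted sum."""
    result = [[0.0] * len(tokens) for tokens in tokens_per_rank]
    for items in outputs.values():
        for src, idx, weight, out in items:
            result[src][idx] += weight * out
    return result

# Two ranks, four experts: experts 0-1 live on rank 0, experts 2-3 on rank 1.
owner_of = {0: 0, 1: 0, 2: 1, 3: 1}
tokens_per_rank = [[1.0, 2.0], [3.0]]
routes_per_rank = [
    [[(0, 0.5), (2, 0.5)], [(3, 1.0)]],   # routes for rank 0's two tokens
    [[(1, 0.25), (2, 0.75)]],             # routes for rank 1's one token
]
expert_fn = lambda expert, x: (expert + 1) * x   # stand-in for a feed-forward expert

inbox = dispatch(tokens_per_rank, routes_per_rank, owner_of)
combined = combine(run_experts(inbox, expert_fn), tokens_per_rank)
```

Every entry appended to another rank's inbox, and every result written back to a different rank's slot, corresponds to data that would cross NVLink or the network in a real cluster.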
Because dispatch and combine sit on the critical path of every MoE layer, their cost is felt on every forward pass in training and on every step of inference. If the all-to-all is slow, the expensive GPUs spend their time waiting on the network instead of doing math, and the efficiency that MoE promised on paper quietly disappears. DeepEP exists to keep that exchange close to the limits of the hardware. [1][3]
DeepEP splits the problem into two kernel families because training and decoding stress the network in different ways. [1]
The first set is the normal, or high-throughput, kernels. These move large batches of tokens and care most about raw bandwidth, so they suit training and the prefill phase of inference, where a long prompt is processed in one shot. They are written for asymmetric-domain bandwidth forwarding, meaning they are designed to push data efficiently from the fast NVLink domain inside a node out to the slower RDMA domain between nodes. DeepSeek aligned these kernels with the group-limited gating algorithm described in the DeepSeek-V3 paper, which restricts how many separate machines a token's experts can land on and so keeps the cross-node traffic bounded. [1][4]
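The effect of group-limited gating can be illustrated with a small sketch. It follows the general shape described in the DeepSeek-V3 paper (rank the groups first, then pick experts only from the surviving groups), but the function name, the contiguous expert layout, and the scores are assumptions made for illustration, not DeepSeek's implementation.

```python
def group_limited_top_k(scores, experts_per_group, max_groups, k):
    """Choose k experts for one token from at most max_groups groups.

    scores[e] is the token's affinity for expert e. Experts are assumed to be
    laid out contiguously, experts_per_group at a time (one group per node).
    """
    num_groups = len(scores) // experts_per_group
    per_group = max(1, k // max_groups)  # how many top scores rank a group

    def group_score(g):
        members = scores[g * experts_per_group:(g + 1) * experts_per_group]
        return sum(sorted(members, reverse=True)[:per_group])

    # Step 1: keep only the best-scoring groups.
    kept = sorted(range(num_groups), key=group_score, reverse=True)[:max_groups]
    # Step 2: ordinary top-k, restricted to experts inside the kept groups.
    candidates = [e for g in kept
                  for e in range(g * experts_per_group, (g + 1) * experts_per_group)]
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:k]
    return sorted(kept), chosen

# Four groups of two experts; pick 4 experts from at most 2 groups.
scores = [0.9, 0.8, 0.1, 0.75, 0.7, 0.6, 0.65, 0.05]
kept_groups, chosen = group_limited_top_k(scores, experts_per_group=2, max_groups=2, k=4)
plain_top4 = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:4]
```

In this example an unrestricted top-4 would touch three groups, while the group-limited choice stays within two, which is the property that caps how many nodes a token has to be sent to.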
The second set is the low-latency kernels. Decoding generates one token at a time, so the batches are tiny and the thing that hurts is the fixed delay of each round trip rather than total bandwidth. These kernels use pure RDMA and are stripped down to make a single dispatch or combine as quick as possible. They are the kernels you reach for when serving an interactive model, where every millisecond of added latency is felt directly by the user. [1]
| Kernel set | Best for | Network path | Optimized for |
|---|---|---|---|
| Normal (high-throughput) | Training and inference prefill | NVLink intranode, RDMA internode | Bandwidth on large batches |
| Low-latency | Inference decoding | Pure RDMA | Latency on small batches |
A modern GPU cluster has two very different kinds of links, and DeepEP is written to respect both. Inside a single server, GPUs talk over NVLink, NVIDIA's high-bandwidth on-node interconnect. Between servers, they talk over an RDMA network, typically InfiniBand, which lets one machine read and write another machine's memory without going through the operating system on each hop. The normal kernels forward across both domains, using NVLink for the intranode leg and RDMA for the internode leg, while the low-latency kernels lean on RDMA for the cross-node transfers. [1]
DeepEP also supports FP8 for the dispatch step. Sending tokens in an 8-bit floating-point format rather than 16-bit roughly halves the bytes on the wire, which is a direct win for an operation whose cost is dominated by data movement. DeepSeek's reference setting dispatches in FP8 and combines in BF16, the same low-precision direction it pushed throughout the DeepSeek-V3 work. [1][4]
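The saving is easy to estimate per token. The sketch below uses the hidden size of 7168 from DeepSeek's reference setting; the optional scaling-factor overhead (one 4-byte scale per 128 elements) is an assumption added to show why the saving is "roughly" rather than exactly half, and the helper name `payload_bytes` is hypothetical.

```python
def payload_bytes(hidden_size, bytes_per_element, scale_block=None, scale_bytes=4):
    """Bytes sent for one token's hidden vector, plus optional per-block scales."""
    data = hidden_size * bytes_per_element
    scales = 0 if scale_block is None else (hidden_size // scale_block) * scale_bytes
    return data + scales

HIDDEN = 7168
bf16 = payload_bytes(HIDDEN, 2)                         # 16-bit elements
fp8_raw = payload_bytes(HIDDEN, 1)                      # 8-bit elements, no scales
fp8_scaled = payload_bytes(HIDDEN, 1, scale_block=128)  # assumed scale layout
ratio = fp8_scaled / bf16
```

Under these assumptions a token costs 14,336 bytes in BF16 and 7,392 bytes in FP8 with scales, a ratio of about 0.52.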
The third idea is overlap. Communication and computation can run at the same time if the software is arranged carefully, so that a GPU keeps computing while more tokens are still arriving. DeepEP's low-latency path does this with a hook-based mechanism that, by DeepSeek's account, does not consume any streaming multiprocessor resources. In other words the overlap is arranged without stealing GPU compute units away from the model itself, so hiding the communication does not slow down the math it is hiding behind. [1]
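The general pattern of a deferred-completion hook can be shown with a CPU-side analogy. This is not DeepEP code: a Python thread stands in for the network hardware, and `start_transfer` is a hypothetical helper. The point is only the shape of the control flow, where issuing the transfer returns immediately and the caller invokes the hook at the moment it actually needs the data.

```python
import threading
import time

def start_transfer(payload, delay_s=0.05):
    """Kick off a simulated transfer and return (buffer, hook).

    A background thread stands in for the network hardware, so the caller's
    thread stays free to compute. Calling hook() blocks until the data is in.
    """
    buffer = []

    def deliver():
        time.sleep(delay_s)      # pretend the bytes are in flight
        buffer.extend(payload)

    worker = threading.Thread(target=deliver)
    worker.start()
    return buffer, worker.join   # the hook is simply "wait for arrival"

# Issue the transfer, then do unrelated work while it is in flight.
recv_buffer, hook = start_transfer([1, 2, 3])
partial = sum(x * x for x in range(1000))   # overlaps with the transfer
hook()                                       # the tokens are needed only now
total = partial + sum(recv_buffer)
```

In the real library the transfer is carried out by the RDMA hardware rather than a thread, which is how DeepSeek can describe the overlap as costing no streaming multiprocessors.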
DeepEP is tuned for NVIDIA Hopper GPUs, the generation that includes the H800 and H100, and DeepSeek notes that other architectures may be supported later. Its published benchmarks were run on H800 GPUs, each connected to a ConnectX-7 InfiniBand card rated at 400 Gb/s, which works out to a peak of roughly 50 GB/s per card. [1]
For the normal kernels, DeepSeek reports intranode dispatch and combine reaching roughly 153 to 158 GB/s over NVLink, and internode operation sustaining around 43 to 47 GB/s of RDMA bandwidth, which is close to what a 400 Gb/s link can deliver in practice. For the low-latency kernels, dispatch lands at about 163 microseconds at an expert-parallel (EP) size of 8 and rises to roughly 194 microseconds at an EP size of 256, while combine runs from about 318 to 360 microseconds across the same range, with RDMA bandwidth in the 39 to 46 GB/s band. The low-latency figures follow a DeepSeek-V3/R1 production setting of 128 tokens per batch, a hidden size of 7168, top-8 experts, FP8 dispatch, and BF16 combine. [1]
| Operation | Measurement | Reported figure |
|---|---|---|
| Normal dispatch and combine, intranode | NVLink bandwidth | about 153 to 158 GB/s |
| Normal dispatch and combine, internode | RDMA bandwidth | about 43 to 47 GB/s |
| Low-latency dispatch | Latency, EP size 8 to 256 | about 163 to 194 microseconds |
| Low-latency combine | Latency, EP size 8 to 256 | about 318 to 360 microseconds |
| Low-latency kernels | RDMA bandwidth | about 39 to 46 GB/s |
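The relationship between the 400 Gb/s line rate and the reported RDMA bandwidth is simple unit arithmetic, sketched here. The helper names are illustrative, and the calculation ignores protocol overhead, so the percentages are a rough guide rather than a precise efficiency figure.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8

def utilization(measured_gbyte_per_s, line_rate_gbit_per_s):
    """Fraction of the nominal line rate that a measured bandwidth represents."""
    return measured_gbyte_per_s / gbit_to_gbyte(line_rate_gbit_per_s)

peak = gbit_to_gbyte(400)      # nominal peak of a 400 Gb/s card, in GB/s
low = utilization(43, 400)     # lower end of the reported internode range
high = utilization(47, 400)    # upper end of the reported internode range
```

On these numbers the normal internode kernels run at roughly 86 to 94 percent of the card's nominal peak.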
The figures above are DeepSeek's own measurements on its reference setup, so real numbers depend on the cluster, the network, and the model. Running DeepEP needs Hopper GPUs, Python 3.8 or newer, CUDA 12.3 or newer, and PyTorch 2.1 or newer, plus NVLink for the intranode case and an RDMA network for the internode case. It also depends on a modified build of NVSHMEM, NVIDIA's library for GPU-initiated communication, which DeepEP uses to drive the RDMA transfers. One performance trick in the normal kernels relies on an undocumented PTX instruction that DeepSeek says it has tested on Hopper but flags as outside the formally defined behavior, with a switch to turn it off. [1]
DeepEP is one of a set of low-level tools DeepSeek published during Open Source Week, and the pieces are meant to fit together. FlashMLA, released the day before, is a decoding kernel for the multi-head latent attention used in DeepSeek-V3. DeepGEMM, released the day after, is an FP8 matrix-multiply library that includes the grouped GEMMs an MoE layer runs once its tokens have been dispatched. Later days added DualPipe, a pipeline-parallel scheduling method, and EPLB, an expert-parallel load balancer that tries to keep the experts evenly busy so no single GPU becomes the bottleneck in the all-to-all. The final day added 3FS, a parallel file system. [2][6][7]
Seen together, the suite is roughly the data path of a DeepSeek-V3 MoE layer broken into reusable parts. DeepEP moves the tokens, DeepGEMM does the expert math, EPLB decides where the experts live, and FlashMLA handles attention. DeepSeek framed the whole release as opening up infrastructure it had already tested in production rather than as a research prototype. The lower-level groundwork all of this builds on is described under CUDA. [2][3]
The reason DeepEP drew attention is that it was the first open release of a communication library aimed specifically at MoE expert parallelism, an area that had mostly lived inside the private stacks of large labs. Efficient all-to-all is one of the harder, less visible parts of training and serving an MoE model, and DeepSeek shipped a version it said it used itself, with the benchmarks to back it. [3][8]
That had a practical effect on the wider community. Anyone building or serving large MoE models on Hopper hardware suddenly had a concrete, optimized reference instead of writing the kernels from scratch, and the library slotted naturally into stacks already running DeepSeek-R1-style architectures. Cloud and serving teams picked it up quickly; the LMSYS group, for instance, reported using DeepEP as part of a large-scale expert-parallel deployment of DeepSeek across many GPUs, and Microsoft published a guide on tuning DeepEP for its Azure HPC instances. It also fed the broader read on DeepSeek through early 2025, namely that the lab competed less by spending more on hardware and more by squeezing the hardware it had, with the all-to-all layer as one clear example. [3][4][9]
DeepEP is deliberately narrow, and the constraints follow from that. It targets Hopper GPUs, so it does not cover older NVIDIA generations or other vendors' accelerators out of the box. It assumes NVLink within a node and an RDMA network between nodes, which is the layout of a serious training cluster but not of a single workstation or a commodity setup. It depends on a patched NVSHMEM, adding a build step beyond a plain pip install. And one of its bandwidth optimizations leans on undocumented GPU behavior, which DeepSeek tested and made optional but which is not a guaranteed contract from the hardware. The kernels are also shaped around DeepSeek-V3's group-limited gating, so a very different routing scheme may not map onto them as cleanly. None of this is hidden. The library is built for a specific, demanding job and is honest about where that job ends. [1]