DualPipe
Last reviewed
Jun 8, 2026
Sources
6 citations
Review status
Source-backed
Revision
v1 · 1,644 words
DualPipe is a bidirectional pipeline parallelism scheduling algorithm developed by DeepSeek-AI to reduce the idle time, known as the pipeline "bubble," that arises when a large neural network is split across many devices. Its distinguishing idea is to feed micro-batches into the pipeline from both ends simultaneously and to overlap the forward and backward computation of one micro-batch with the communication of another, in particular the all-to-all dispatch and combine operations that a mixture of experts model needs when expert parallelism spans multiple nodes [1][2].
DualPipe was first described in the DeepSeek-V3 Technical Report, released on 26 December 2024 (arXiv:2412.19437) [1]. The standalone implementation was open-sourced on 27 February 2025, the fourth day of DeepSeek's "Open Source Week," alongside the EPLB expert-parallel load balancer and a set of profiling traces [2][3]. The repository is released under the MIT license, and its listed developers are Jiashi Li, Chengqi Deng, and Wenfeng Liang [2]. DualPipe is one of the engineering techniques credited with keeping the training cost of DeepSeek-V3 unusually low for a frontier-scale model [1][4].
Pipeline parallelism is a form of model parallelism in which the layers of a network are partitioned into consecutive groups called stages, and each stage is placed on a different device. A batch of data is split into smaller micro-batches that flow through the stages in sequence, so that while stage 2 processes one micro-batch, stage 1 can begin the next [5]. Because every stage must wait for input from the stage before it during the warm-up phase, and must wait for gradients to return during the cool-down phase, some devices sit idle for part of every step. This idle time is the pipeline bubble.
The GPipe schedule, introduced by Google in 2019, runs all forward passes for a batch, then all backward passes, with a flush in between [5]. It is simple but leaves a bubble whose fraction of total time is roughly (PP - 1) / M, where PP is the number of pipeline stages and M is the number of micro-batches [5]. The 1F1B ("one forward, one backward") schedule from PipeDream interleaves a backward pass after each forward pass once the pipeline is full; this does not change the asymptotic bubble fraction but lowers peak activation memory because activations are freed sooner [5]. Later work pushed the bubble smaller still: the Zero Bubble schedule from Sea AI Lab (Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin), presented at ICLR 2024, splits the backward pass into a part that computes the gradient with respect to the input (B) and a part that computes the gradient with respect to the weights (W), then reorders these finer pieces to fill more of the bubble [6].
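The (PP - 1) / M estimate can be checked with a few lines of arithmetic. The sketch below assumes every stage takes the same time per micro-batch, so a flush-based schedule spends M slots on useful work and PP - 1 slots idle on each device; the function names are illustrative and do not come from any library.

```python
def bubble_vs_compute(pp: int, m: int) -> float:
    """Idle time divided by useful compute time: (PP - 1) / M."""
    return (pp - 1) / m


def bubble_vs_step(pp: int, m: int) -> float:
    """Idle time divided by total step time: (PP - 1) / (M + PP - 1)."""
    return (pp - 1) / (m + pp - 1)


if __name__ == "__main__":
    pp = 16  # illustrative number of pipeline stages
    for m in (16, 64, 256):
        print(f"M={m:3d}  vs compute={bubble_vs_compute(pp, m):.3f}  "
              f"vs step={bubble_vs_step(pp, m):.3f}")
```

The two ratios converge as M grows, which is why increasing the micro-batch count is the usual first remedy for a large bubble.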
For very large mixture of experts models a second cost compounds the bubble. Routing tokens to experts that live on other devices requires an all-to-all exchange (a "dispatch" before the experts and a "combine" after), and when experts are spread across many physical machines this traffic must cross the slower inter-node network [1]. If this communication is not hidden behind computation, it becomes a large serial overhead. DualPipe was designed to attack both problems at once.
A conventional pipeline holds one copy of each stage and flows micro-batches in a single direction. DualPipe instead places two copies of the model's stages on the device grid and injects micro-batches from both ends of the pipeline at the same time, so the two streams travel in opposite directions and meet in the middle [1][2]. Because forward work from one direction and backward work from the other are scheduled together, more of the bubble is filled. This symmetry is what gives the algorithm its name.
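One way to picture the two-copy layout is as a mapping from pipeline ranks to stages. The sketch below assumes a mirrored placement in which rank r holds stage r of one copy and stage PP - 1 - r of the other, so the first and last stages share a device. The helper names are hypothetical and are not taken from the DualPipe repository.

```python
from typing import List, Tuple


def stages_on_rank(rank: int, pp: int) -> Tuple[int, int]:
    """Stages held by `rank`: (stage of the copy flowing 0 -> PP-1,
    stage of the mirrored copy flowing PP-1 -> 0). Assumed placement."""
    if not 0 <= rank < pp:
        raise ValueError("rank out of range")
    return rank, pp - 1 - rank


def forward_path(direction: int, pp: int) -> List[int]:
    """Ranks visited, in order, by the forward pass of a micro-batch.
    direction 0 enters at rank 0; direction 1 enters at rank PP-1."""
    ranks = list(range(pp))
    return ranks if direction == 0 else ranks[::-1]


if __name__ == "__main__":
    pp = 8
    for r in range(pp):
        print(f"rank {r}: stages {stages_on_rank(r, pp)}")
    print("direction 0 forward path:", forward_path(0, pp))
    print("direction 1 forward path:", forward_path(1, pp))
```

Under this placement every rank does forward work for one stream while it does backward work for the other, which is the property the schedule exploits.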
The core mechanism operates within a single pair of forward and backward chunks. DualPipe divides each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. For a backward chunk, the attention and MLP parts are further split into "backward for input" and "backward for weights," following the Zero Bubble decomposition [1]. These pieces are then manually rearranged so that a communication phase of one micro-batch runs concurrently with a computation phase of another. The DeepSeek-V3 report states that this arrangement allows both the all-to-all communication and the point-to-point pipeline communication to be fully hidden during execution [1].
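A toy timing model can illustrate why pairing communication with computation helps. The sketch below lists the components of one forward chunk and one backward chunk with made-up durations (every number is hypothetical) and compares running them back to back with an idealized schedule in which all communication is hidden behind computation. It is not the DualPipe scheduler, only a bound on what such overlap could save under these assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Phase:
    name: str
    kind: str   # "compute" or "comm"
    ms: float   # hypothetical duration in milliseconds


# Forward chunk: attention -> dispatch -> MLP -> combine.
FORWARD = [
    Phase("attention F", "compute", 3.0),
    Phase("dispatch F", "comm", 2.0),
    Phase("MLP F", "compute", 4.0),
    Phase("combine F", "comm", 2.0),
]

# Backward chunk runs in reverse; compute parts are split into
# backward-for-input (B) and backward-for-weights (W).
BACKWARD = [
    Phase("combine B", "comm", 2.0),
    Phase("MLP B", "compute", 4.0),
    Phase("MLP W", "compute", 4.0),
    Phase("dispatch B", "comm", 2.0),
    Phase("attention B", "compute", 3.0),
    Phase("attention W", "compute", 3.0),
]


def serial_ms(phases: List[Phase]) -> float:
    """Time if every phase waits for the previous one."""
    return sum(p.ms for p in phases)


def ideal_overlapped_ms(phases: List[Phase]) -> float:
    """Lower bound if compute and communication run fully in parallel."""
    compute = sum(p.ms for p in phases if p.kind == "compute")
    comm = sum(p.ms for p in phases if p.kind == "comm")
    return max(compute, comm)


if __name__ == "__main__":
    pair = FORWARD + BACKWARD
    print("serial:", serial_ms(pair), "ms")
    print("ideal overlap:", ideal_overlapped_ms(pair), "ms")
```

In this model the communication cost disappears whenever total compute time exceeds total communication time, which matches the "fully hidden" condition the report describes.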
To make the overlap real on hardware, DeepSeek manually controlled how the GPU's streaming multiprocessors (SMs) were divided between computation and communication. The report notes that for the cross-node all-to-all, 20 SMs, partitioned into 10 communication channels that handle dispatch and combine, are enough to saturate the bandwidth of both InfiniBand and NVLink [1]. Tokens are first sent over InfiniBand to the GPU with the same in-node index on each target node, then forwarded over the faster NVLink to the GPU hosting the target expert, so that the InfiniBand and NVLink transfers themselves overlap [1]. The custom dispatch and combine kernels were later released separately as the DeepEP library.
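The two-hop routing rule can be expressed as a small function. This is a minimal sketch that assumes GPUs are numbered globally and that each node holds 8 of them; the function name and the tuple format are hypothetical and are not part of DeepEP.

```python
from typing import List, Tuple

GPUS_PER_NODE = 8  # assumption: 8 GPUs per node


def two_hop_route(src: int, dst: int,
                  gpus_per_node: int = GPUS_PER_NODE) -> List[Tuple[str, int, int]]:
    """Return the hops (link, from_gpu, to_gpu) a token takes from src to dst.

    Cross-node traffic first goes over InfiniBand to the GPU on the target
    node that has the same in-node index as the sender, then over NVLink
    to the destination GPU.
    """
    src_node, src_local = divmod(src, gpus_per_node)
    dst_node, _ = divmod(dst, gpus_per_node)
    hops: List[Tuple[str, int, int]] = []
    current = src
    if src_node != dst_node:
        relay = dst_node * gpus_per_node + src_local
        hops.append(("IB", current, relay))
        current = relay
    if current != dst:
        hops.append(("NVLink", current, dst))
    return hops


if __name__ == "__main__":
    print(two_hop_route(3, 21))  # cross-node, different in-node index
    print(two_hop_route(3, 19))  # cross-node, same in-node index
    print(two_hop_route(3, 5))   # same node
```

Because each sender uses only its own in-node index for the InfiniBand hop, the cross-node traffic is spread over all of a node's network links rather than funnelled through one GPU.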
DeepSeek-V3 is a 671-billion-parameter mixture of experts model (37 billion parameters active per token) trained on 14.8 trillion tokens [1]. Its training used a 16-way pipeline parallelism configuration with 64-way expert parallelism spread across 8 nodes and ZeRO-1 data parallelism, running on a cluster of 2,048 NVIDIA H800 GPUs [1]. In this setting the all-to-all traffic of expert parallelism would otherwise dominate, and DualPipe's overlap was central to keeping the GPUs busy. DeepSeek reported that the near-complete overlap of computation and communication let the model scale up experts and total parameters "without almost any communication overhead" [1].
The headline result was cost. DeepSeek reported a total pre-training cost of 2.664 million H800 GPU hours, with a further 0.119 million hours for context-length extension and 0.005 million for post-training, for 2.788 million GPU hours in all, which at an assumed rental price of $2 per GPU hour came to roughly $5.576 million [1]. DeepSeek was careful to note that this figure covers only the official training runs and excludes prior research, ablations, and data costs [1]. DualPipe was reused for the subsequent DeepSeek-R1 reasoning model, which shares the V3 base architecture [2].
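The reported totals can be reproduced directly from the component figures. The snippet below uses only the numbers quoted above and `Decimal` to avoid floating-point rounding; the variable names are illustrative.

```python
from decimal import Decimal

# GPU hours in millions, as quoted from the DeepSeek-V3 report.
PRETRAINING = Decimal("2.664")
CONTEXT_EXTENSION = Decimal("0.119")
POST_TRAINING = Decimal("0.005")
PRICE_PER_GPU_HOUR = Decimal("2")  # assumed rental price in USD

total_gpu_hours_m = PRETRAINING + CONTEXT_EXTENSION + POST_TRAINING
total_cost_musd = total_gpu_hours_m * PRICE_PER_GPU_HOUR

if __name__ == "__main__":
    print(f"total GPU hours: {total_gpu_hours_m} million")
    print(f"total cost: ${total_cost_musd} million")
```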
DualPipe builds directly on the lineage of pipeline schedules while making a different tradeoff. The DeepSeek-V3 report compares it against 1F1B and against the memory-bounded Zero Bubble variant ZB1P [1]:
| Schedule | Pipeline bubble | Parameter copies | Activation memory |
|---|---|---|---|
| 1F1B | (PP - 1)(F + B) | 1x | PP |
| ZB1P | (PP - 1)(F + B - 2W) | 1x | PP |
| DualPipe | (PP/2 - 1)(F&B + B - 3W) | 2x | PP + 1 |
Here F is the time of a forward chunk, B the time of a full backward chunk, W the time of the weight-gradient ("backward for weights") computation, and F&B the time of a forward and backward pair that have been mutually overlapped [1]. The leading factor falls from (PP - 1) to (PP/2 - 1), so DualPipe roughly halves the number of idle slots relative to 1F1B and ZB1P; the net saving also depends on the per-slot term in parentheses, which differs between the schedules [1].
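The three bubble formulas in the table can be evaluated side by side. The sketch below plugs in arbitrary chunk times (F = 10, B = 20, W = 8, F&B = 26, all hypothetical and in arbitrary units) for a 16-stage pipeline; only the formulas themselves come from the table.

```python
def bubble_1f1b(pp: int, f: int, b: int) -> int:
    """(PP - 1)(F + B)"""
    return (pp - 1) * (f + b)


def bubble_zb1p(pp: int, f: int, b: int, w: int) -> int:
    """(PP - 1)(F + B - 2W)"""
    return (pp - 1) * (f + b - 2 * w)


def bubble_dualpipe(pp: int, fb: int, b: int, w: int) -> int:
    """(PP/2 - 1)(F&B + B - 3W); requires an even number of stages."""
    if pp % 2 != 0:
        raise ValueError("DualPipe needs an even number of pipeline stages")
    return (pp // 2 - 1) * (fb + b - 3 * w)


if __name__ == "__main__":
    PP, F, B, W, FB = 16, 10, 20, 8, 26  # hypothetical timings
    print("1F1B    :", bubble_1f1b(PP, F, B))
    print("ZB1P    :", bubble_zb1p(PP, F, B, W))
    print("DualPipe:", bubble_dualpipe(PP, FB, B, W))
```

With these particular inputs the DualPipe bubble is smaller than both alternatives, but the gap to ZB1P depends on how much B exceeds W and on how well the F&B pair overlaps.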
The report also contrasts DualPipe with Chimera, an earlier bidirectional pipeline scheme. Chimera requires the number of pipeline stages and micro-batches to satisfy stricter divisibility conditions, whereas DualPipe only requires that the pipeline stages and the number of micro-batches each be divisible by two [1]. DualPipe also avoids a drawback of some Zero Bubble configurations in which activation memory grows with the number of micro-batches; in DualPipe the activation cost does not scale with the micro-batch count [1]. The cost of these gains is the extra memory described below.
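The configuration constraint is simple enough to state as a predicate. In the sketch below, `dualpipe_compatible` encodes the divisible-by-two rule described above, while `stricter_compatible` is an assumed stand-in for a Chimera-style condition that additionally requires the micro-batch count to be a multiple of the stage count; both function names are hypothetical.

```python
def dualpipe_compatible(pp: int, micro_batches: int) -> bool:
    """Both the stage count and the micro-batch count must be even."""
    return pp > 0 and micro_batches > 0 and pp % 2 == 0 and micro_batches % 2 == 0


def stricter_compatible(pp: int, micro_batches: int) -> bool:
    """Assumed stricter rule: also require micro_batches % pp == 0."""
    return dualpipe_compatible(pp, micro_batches) and micro_batches % pp == 0


if __name__ == "__main__":
    for pp, m in [(16, 20), (16, 32), (16, 21)]:
        print(f"PP={pp} M={m}: dualpipe={dualpipe_compatible(pp, m)} "
              f"stricter={stricter_compatible(pp, m)}")
```

A configuration such as 16 stages with 20 micro-batches passes the first check but not the second, which is the kind of flexibility the looser rule buys.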
A later variant, DualPipeV, applies a "cut-in-half" construction credited to Sea AI Lab (the group behind Zero Bubble) to produce a more compact V-shaped schedule. According to the repository's comparison table, DualPipeV keeps the same bubble formula and the same per-device parameter and activation figures as DualPipe but runs the PP stages on only PP/2 devices, making the approach cheaper to deploy [2].
The principal cost of DualPipe is memory. Because it keeps two copies of the model's stages resident on the device grid to drive the bidirectional flow, it stores two copies of the model parameters and the associated optimizer state, and it increases peak activation memory by roughly 1/PP relative to a single-direction baseline [1][2]. For DeepSeek-V3 this overhead was acceptable: the report notes that the large expert-parallel size used in training keeps the duplicated parameters from significantly increasing memory consumption, and the 16-stage pipeline kept the per-device parameter share manageable [1]. The technique is therefore best suited to large mixture of experts models trained across many nodes, where the communication that DualPipe hides is large enough to justify the doubled parameter footprint.
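The 1/PP figure follows from comparing a peak activation footprint of PP + 1 units with a baseline of PP units. A short check, using exact fractions and a hypothetical helper name:

```python
from fractions import Fraction


def activation_overhead(pp: int) -> Fraction:
    """Relative increase when peak activation memory grows from PP to PP + 1 units."""
    if pp <= 0:
        raise ValueError("pp must be positive")
    return Fraction(pp + 1, pp) - 1


if __name__ == "__main__":
    for pp in (8, 16, 32):
        overhead = activation_overhead(pp)
        print(f"PP={pp}: +{overhead} ({float(overhead):.2%})")
```

For the 16-stage pipeline used in DeepSeek-V3 this works out to one sixteenth, so the activation penalty shrinks as the pipeline gets deeper while the parameter duplication stays fixed at 2x.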
A second consideration is engineering complexity. Achieving the overlap in practice depends on hand-tuned SM allocation, custom communication kernels, and a module interface (an overlapped forward-backward method) that the user must implement for the model's components, so the published schedule is a framework that must be adapted rather than a drop-in replacement [2]. The DualPipeV variant reduces the memory penalty, and the released profiling data and the companion DeepEP and EPLB tools were intended to help the wider community reproduce the gains [2][3].