DualPipe

Updated Jun 28, 2026

DualPipe is a bidirectional pipeline parallelism scheduling algorithm created by DeepSeek-AI to almost fully overlap computation with communication and shrink the idle time, called the pipeline "bubble," that wastes GPU cycles when a large model is split across many devices ^[1]^[2]. It feeds micro-batches into the pipeline from both ends at once and interleaves the forward and backward passes so that the all-to-all traffic of a mixture of experts model is hidden behind useful work, at the cost of keeping two copies of the model parameters in memory ^[1]^[2]. DualPipe was introduced in the DeepSeek-V3 Technical Report on 26 December 2024 (arXiv:2412.19437) and open-sourced on 27 February 2025 during DeepSeek's "Open Source Week" ^[1]^[2]^[3].

What is DualPipe?

DualPipe is a bidirectional pipeline parallelism scheduling algorithm developed by DeepSeek-AI to reduce the idle time, known as the pipeline "bubble," that arises when a large neural network is split across many devices. Its distinguishing idea is to feed micro-batches into the pipeline from both ends simultaneously and to overlap the forward and backward computation of each micro-batch with the communication that the computation depends on, in particular the all-to-all dispatch and combine operations used by a mixture of experts model and the cross-node transfers of expert parallelism ^[1]^[2].

DualPipe was first described in the DeepSeek-V3 Technical Report, released on 26 December 2024 (arXiv:2412.19437) ^[1]. The standalone implementation was open-sourced on 27 February 2025, the fourth day of DeepSeek's "Open Source Week," alongside the EPLB expert-parallel load balancer and a set of profiling traces ^[2]^[3]. The repository is released under the MIT license, and its listed developers are Jiashi Li, Chengqi Deng, and Wenfeng Liang ^[2]. DualPipe is one of the engineering techniques credited with keeping the training cost of DeepSeek-V3 unusually low for a frontier-scale model ^[1]^[4].

What is a pipeline bubble, and why does it matter?

Pipeline parallelism is a form of model parallelism in which the layers of a network are partitioned into consecutive groups called stages, and each stage is placed on a different device. A batch of data is split into smaller micro-batches that flow through the stages in sequence, so that while stage 2 processes one micro-batch, stage 1 can begin the next ^[5]. Because every stage must wait for input from the stage before it during the warm-up phase, and must wait for gradients to return during the cool-down phase, some devices sit idle for part of every step. This idle time is the pipeline bubble.

The GPipe schedule, introduced by Google in 2019, runs all forward passes for a batch, then all backward passes, and flushes the pipeline before the synchronous weight update ^[5]. It is simple but leaves a bubble of roughly (PP - 1) / M relative to the useful compute time, or (PP - 1) / (M + PP - 1) of the whole step, where PP is the number of pipeline stages and M is the number of micro-batches ^[5]. The 1F1B ("one forward, one backward") schedule from PipeDream interleaves a backward pass after each forward pass once the pipeline is full; this does not change the asymptotic bubble fraction but lowers peak activation memory because activations are freed sooner ^[5]. Later work pushed the bubble smaller still: the Zero Bubble schedule from Sea AI Lab (Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin), presented at ICLR 2024, splits the backward pass into a part that computes the gradient with respect to the input (B) and a part that computes the gradient with respect to the weights (W), then reorders these finer pieces to fill more of the bubble ^[6].
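To give a feel for the size of the GPipe bubble, the short sketch below evaluates the (PP - 1) / M estimate for a few micro-batch counts. It assumes every stage takes the same time per micro-batch, which real pipelines only approximate, and the function name gpipe_bubble is made up for this illustration.

```python
def gpipe_bubble(pp: int, m: int) -> tuple[float, float]:
    """Estimate the GPipe bubble for pp stages and m micro-batches.

    Assumes equal time per stage per micro-batch (an idealization).
    Returns (bubble relative to useful work, bubble as a fraction of the step).
    """
    if pp < 1 or m < 1:
        raise ValueError("pp and m must be positive")
    idle_slots = pp - 1                      # warm-up plus cool-down slots per device
    ratio_to_work = idle_slots / m           # the (PP - 1) / M estimate
    fraction_of_step = idle_slots / (m + idle_slots)
    return ratio_to_work, fraction_of_step


# More micro-batches amortize the same fixed number of idle slots.
for m in (4, 16, 64):
    ratio, fraction = gpipe_bubble(pp=16, m=m)
    print(f"M={m:3d}  bubble/work={ratio:.2f}  bubble/step={fraction:.2f}")
```

The two numbers diverge when M is small: with few micro-batches the bubble can exceed the useful work even though it can never exceed the whole step.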

For very large mixture of experts models a second cost compounds the bubble. Routing tokens to experts that live on other devices requires an all-to-all exchange (a "dispatch" before the experts and a "combine" after), and when experts are spread across many physical machines this traffic must cross the slower inter-node network ^[1]. The DeepSeek-V3 report notes that this cross-node expert parallelism produced "an inefficient computation-to-communication ratio of approximately 1:1" ^[1]. If this communication is not hidden behind computation, it becomes a large serial overhead. DualPipe was designed to attack both problems at once.

How does DualPipe work?

How does the bidirectional schedule reduce bubbles?

A conventional pipeline holds one copy of each stage and flows micro-batches in a single direction. DualPipe instead places two copies of the model's stages on the device grid and injects micro-batches from both ends of the pipeline at the same time, so the two streams travel in opposite directions and meet in the middle ^[1]^[2]. As the DeepSeek-V3 report puts it, DualPipe "employs a bidirectional pipeline scheduling, which feeds micro-batches from both ends of the pipeline simultaneously and a significant portion of communications can be fully overlapped" ^[1]. Because forward work from one direction and backward work from the other are scheduled together, more of the bubble is filled. This symmetry is what gives the algorithm its name.
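One way to picture the two copies is as a mapping from devices to stages. The sketch below assumes that device i holds stage i of the copy fed from the front and stage PP - 1 - i of the copy fed from the back; this pairing and the function name dualpipe_placement are illustrative assumptions, not something the text above specifies.

```python
def dualpipe_placement(pp: int) -> list[tuple[int, int]]:
    """Return, per device, (stage of the front-fed copy, stage of the back-fed copy).

    Assumed layout for illustration: device i pairs stage i with stage pp - 1 - i,
    so the two micro-batch streams enter at opposite ends of the device chain.
    """
    if pp < 2 or pp % 2 != 0:
        raise ValueError("pp must be an even number of at least 2")
    return [(rank, pp - 1 - rank) for rank in range(pp)]


for rank, (front_stage, back_stage) in enumerate(dualpipe_placement(8)):
    print(f"device {rank}: front-fed stage {front_stage}, back-fed stage {back_stage}")
```

Under this layout every stage index appears on exactly two devices, which is where the doubled parameter footprint comes from.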

How does DualPipe overlap computation and communication?

The core mechanism operates within a single pair of forward and backward chunks. As the DeepSeek-V3 report states, "The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks" ^[1]. DualPipe divides each chunk into four components, attention, all-to-all dispatch, MLP, and all-to-all combine, and for the backward direction the MLP and attention parts are further split into "backward for input" and "backward for weights," following the Zero Bubble decomposition ^[1]. These pieces are then manually rearranged so that a communication phase of one micro-batch runs concurrently with a computation phase of another. The report states that this arrangement allows both the all-to-all communication and the point-to-point pipeline communication to be fully hidden during execution ^[1].
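The benefit of pairing a forward chunk with a backward chunk can be illustrated with a toy timing model. In the sketch below each slot runs one computation piece of one micro-batch alongside one communication piece of the other, and a slot lasts as long as its slower member. The durations are invented, and the slot order is only one pairing that respects the data dependencies of each chunk, not the exact arrangement in the report.

```python
# Hypothetical durations in milliseconds; not measurements.
COMPUTE = {"attn_F": 1.0, "mlp_F": 1.0, "mlp_B": 2.0, "attn_B": 2.0}
COMM = {"dispatch_F": 1.0, "combine_F": 1.0, "dispatch_B": 1.0, "combine_B": 1.0}

# (compute piece, communication piece) run concurrently in each slot.
# Forward order attn -> dispatch -> mlp -> combine and backward order
# combine -> mlp -> dispatch -> attn are both preserved across slots.
SLOTS = [
    ("attn_F", "combine_B"),
    ("mlp_B", "dispatch_F"),
    ("mlp_F", "dispatch_B"),
    ("attn_B", "combine_F"),
]


def serial_time(compute: dict[str, float], comm: dict[str, float]) -> float:
    """Time if every piece runs one after another with no overlap."""
    return sum(compute.values()) + sum(comm.values())


def overlapped_time(compute: dict[str, float], comm: dict[str, float],
                    slots: list[tuple[str, str]]) -> float:
    """Time if each slot overlaps one compute piece with one comm piece."""
    return sum(max(compute[c], comm[k]) for c, k in slots)


print("serial:    ", serial_time(COMPUTE, COMM))
print("overlapped:", overlapped_time(COMPUTE, COMM, SLOTS))
```

In this model communication is fully hidden whenever each communication piece is no longer than the computation it is paired with, which is the condition the manual rearrangement aims for.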

To make the overlap real on hardware, DeepSeek manually controlled how the GPU's streaming multiprocessors (SMs) were divided between computation and communication. The report notes that for the cross-node all-to-all, "only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink," using roughly 10 communication channels to handle dispatch and combine ^[1]. Tokens are first sent over InfiniBand to the GPU with the same in-node index on each target node, then forwarded over the faster NVLink to the correct local expert, so that the InfiniBand and NVLink transfers themselves overlap ^[1]. The custom dispatch and combine kernels were later released separately as the DeepEP library.
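The two-hop routing rule described above can be sketched as a small function. The Gpu type and the route function are hypothetical names for this illustration; the real dispatch kernels operate on token buffers rather than on Python objects.

```python
from typing import NamedTuple


class Gpu(NamedTuple):
    node: int   # index of the physical machine
    local: int  # index of the GPU within that machine


def route(src: Gpu, dst: Gpu) -> list[tuple[str, Gpu, Gpu]]:
    """Return the hops a token takes from src to dst as (link, from, to).

    Rule sketched here: cross nodes over InfiniBand to the GPU with the same
    in-node index, then use NVLink inside the target node if needed.
    """
    hops = []
    current = src
    if current.node != dst.node:
        relay = Gpu(dst.node, current.local)
        hops.append(("IB", current, relay))
        current = relay
    if current.local != dst.local:
        hops.append(("NVLink", current, dst))
    return hops


for link, start, end in route(Gpu(0, 3), Gpu(2, 5)):
    print(f"{link}: node {start.node} gpu {start.local} -> node {end.node} gpu {end.local}")
```

Because a token crosses InfiniBand at most once under this rule, the slower inter-node link carries each token a single time and the remaining fan-out happens over NVLink.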

What role did DualPipe play in DeepSeek-V3?

DeepSeek-V3 is a 671-billion-parameter mixture of experts model (37 billion parameters active per token) trained on 14.8 trillion tokens ^[1]. Its training used a 16-way pipeline parallelism configuration with 64-way expert parallelism spread across 8 nodes and ZeRO-1 data parallelism, running on a cluster of 2,048 NVIDIA H800 GPUs ^[1]. In this setting the all-to-all traffic of expert parallelism would otherwise dominate, and DualPipe's overlap was central to keeping the GPUs busy. The report states that the design "ensures that, as the model further scales up, as long as we maintain a constant computation-to-communication ratio, we can still employ fine-grained experts across nodes while achieving a near-zero all-to-all communication overhead" ^[1].

The headline result was cost. DeepSeek reported a total pre-training cost of 2.664 million H800 GPU hours, with a further 0.119 million hours for context-length extension and 0.005 million for post-training, for 2.788 million GPU hours in all, which at an assumed rental price of $2 per GPU hour came to roughly $5.576 million ^[1]. DeepSeek was careful to note that this figure covers only the official training runs and excludes prior research, ablations, and data costs ^[1]. DualPipe was reused for the subsequent DeepSeek-R1 reasoning model, which shares the V3 base architecture ^[2].
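The reported figures are easy to re-derive. The snippet below sums the three phases and applies the assumed $2 rental price; it works in thousands of GPU hours so the arithmetic stays in exact integers, and the variable names are chosen for this illustration.

```python
# Reported H800 GPU hours, in thousands, per training phase.
GPU_HOURS_K = {
    "pre-training": 2664,
    "context extension": 119,
    "post-training": 5,
}
PRICE_USD_PER_GPU_HOUR = 2  # the rental price assumed in the report

total_k = sum(GPU_HOURS_K.values())
cost_usd = total_k * 1000 * PRICE_USD_PER_GPU_HOUR

print(f"total GPU hours: {total_k / 1000:.3f} million")
print(f"estimated cost:  ${cost_usd / 1e6:.3f} million")
```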

How does DualPipe compare to 1F1B and Zero Bubble?

DualPipe builds directly on the lineage of pipeline schedules while making a different tradeoff. The DeepSeek-V3 report compares it against 1F1B and against the memory-bounded Zero Bubble variant ZB1P ^[1]:

Schedule	Pipeline bubble	Parameter copies	Activation memory
1F1B	(PP - 1)(F + B)	1x	PP
ZB1P	(PP - 1)(F + B - 2W)	1x	PP
DualPipe	(PP/2 - 1)(F&B + B - 3W)	2x	PP + 1

Here F is the time of a forward chunk, B the time of a full backward chunk, W the time of the weight-gradient ("backward for weights") computation, and F&B the time of a forward and backward pair that has been mutually overlapped ^[1]. The leading factor falls from (PP - 1) to (PP/2 - 1), which by itself more than halves the bubble; the net reduction also depends on the second factor, and therefore on how much cheaper the overlapped pair F&B is than F and B run back to back ^[1]. The report summarizes the net effect as follows: "compared with ZB1P and 1F1B, DualPipe significantly reduces the pipeline bubbles while only increasing the peak activation memory by 1/PP times" ^[1].
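The three bubble formulas from the table can be compared numerically. The sketch below plugs in invented chunk times; the function names and the chosen values are assumptions for illustration, and the outcome depends heavily on how small F&B is relative to F + B.

```python
def bubble_1f1b(pp: int, f: float, b: float) -> float:
    """Bubble time for 1F1B: (PP - 1)(F + B)."""
    return (pp - 1) * (f + b)


def bubble_zb1p(pp: int, f: float, b: float, w: float) -> float:
    """Bubble time for ZB1P: (PP - 1)(F + B - 2W)."""
    return (pp - 1) * (f + b - 2 * w)


def bubble_dualpipe(pp: int, b: float, w: float, fb: float) -> float:
    """Bubble time for DualPipe: (PP/2 - 1)(F&B + B - 3W); pp must be even."""
    if pp % 2 != 0:
        raise ValueError("DualPipe needs an even number of pipeline stages")
    return (pp // 2 - 1) * (fb + b - 3 * w)


# Hypothetical chunk times (arbitrary units), with F&B cheaper than F + B.
PP, F, B, W, FB = 16, 2.0, 4.0, 1.5, 5.0

print("1F1B:    ", bubble_1f1b(PP, F, B))
print("ZB1P:    ", bubble_zb1p(PP, F, B, W))
print("DualPipe:", bubble_dualpipe(PP, B, W, FB))
```

Re-running the comparison with FB set to F + B shows how much of DualPipe's advantage over ZB1P comes from the overlap inside each pair rather than from the smaller leading factor.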

The report also contrasts DualPipe with Chimera, an earlier bidirectional pipeline scheme. Chimera requires the number of pipeline stages and micro-batches to satisfy stricter divisibility conditions, whereas DualPipe only requires that the pipeline stages and the number of micro-batches each be divisible by two ^[1]. DualPipe also avoids a drawback of some Zero Bubble configurations in which activation memory grows with the number of micro-batches; in DualPipe the activation cost does not scale with the micro-batch count ^[1]. The cost of these gains is the extra memory described below.

A later variant, DualPipeV, applies a "cut-in-half" construction credited to Sea AI Lab (the group behind Zero Bubble) to produce a more compact V-shaped schedule. In the repository's comparison, DualPipeV keeps the same bubble formula and the same per-device parameter and activation figures as DualPipe but runs on PP/2 devices instead of PP, making the approach cheaper to deploy ^[2].

What are the tradeoffs of DualPipe?

The principal cost of DualPipe is memory. Because it keeps two copies of the model's stages resident on the device grid to drive the bidirectional flow, each device stores two sets of stage parameters, and peak activation memory rises by roughly 1/PP (from PP to PP + 1 in the report's comparison) relative to 1F1B and ZB1P ^[1]^[2]. The DeepSeek-V3 report treats this overhead as acceptable: it notes that the second parameter copy does not significantly increase memory consumption because training used a large expert-parallel size ^[1]. The technique is therefore best suited to large mixture of experts models trained across many nodes, where the communication that DualPipe hides is large enough to justify the doubled parameter footprint.

A second consideration is engineering complexity. Achieving the overlap in practice depends on hand-tuned SM allocation, custom communication kernels, and a module interface (an overlapped forward-backward method) that the user must implement for the model's components, so the published schedule is a framework that must be adapted rather than a drop-in replacement ^[2]. The DualPipeV variant reduces the memory penalty, and the released profiling data and the companion DeepEP and EPLB tools were intended to help the wider community reproduce the gains ^[2]^[3].

References

1. DeepSeek-AI. "DeepSeek-V3 Technical Report." arXiv:2412.19437, 26 December 2024. https://arxiv.org/abs/2412.19437
2. DeepSeek-AI. "DualPipe: A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training." GitHub repository, 2025. https://github.com/deepseek-ai/DualPipe
3. MarkTechPost. "DeepSeek AI Releases DualPipe: A Bidirectional Pipeline Parallelism Algorithm for Computation-Communication Overlap in V3/R1 Training." 27 February 2025. https://www.marktechpost.com/2025/02/27/deepseek-ai-releases-dualpipe-a-bidirectional-pipeline-parallelism-algorithm-for-computation-communication-overlap-in-v3-r1-training/
4. Lyons, Austin. "Dispelling DeepSeek Myths, Studying V3." Chipstrat, 2025. https://www.chipstrat.com/p/dispelling-deepseek-myths-studying
5. Huang, Yanping, et al. "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." NeurIPS 2019. arXiv:1811.06965. https://arxiv.org/abs/1811.06965
6. Qi, Penghui, Xinyi Wan, Guangxing Huang, and Min Lin. "Zero Bubble Pipeline Parallelism." ICLR 2024. arXiv:2401.10241. https://arxiv.org/abs/2401.10241
