Pipeline parallelism (often abbreviated PP) is a distributed training strategy that partitions the layers of a deep neural network across multiple accelerator devices, with each device holding one contiguous block of layers called a stage. During training, mini-batches of data are split into smaller micro-batches that flow through the stages assembly-line style: while stage 2 runs the forward pass for micro-batch 1, stage 1 can already begin the forward pass for micro-batch 2. The technique is one of the three main parallelism strategies used to train neural networks too large to fit on a single GPU, alongside data parallelism and tensor parallelism.
Pipeline parallelism became critical in the era of giant transformer models. Google introduced the foundational system GPipe in 2018, Microsoft Research followed with PipeDream in 2019, and NVIDIA combined pipeline parallelism with tensor and data parallelism into the 3D parallelism strategy used to train trillion-parameter large language models on thousands of GPUs. Modern training stacks such as Megatron-LM, DeepSpeed, FairScale, PyTorch's native torch.distributed.pipelining package, and automated planners like Alpa and Galvatron all use it as a primitive building block.
The defining performance challenge of pipeline parallelism is the pipeline bubble, the idle time on each device when the pipeline is filling at the start of a step or draining at the end. Years of scheduling research, from naive GPipe to one-forward-one-backward (1F1B), interleaved 1F1B, and most recently zero-bubble schedules, have steadily reduced this overhead.
A single modern accelerator such as an NVIDIA H100 has roughly 80 GB of high-bandwidth memory, while a 175-billion-parameter model in 16-bit precision needs about 350 GB just for the weights, before adding optimizer state, gradients, and activations. Even with mixed precision, ZeRO sharding, and activation checkpointing, large language models with hundreds of billions of parameters cannot fit on one device. Three complementary parallelism dimensions are typically combined to spread the workload: data parallelism replicates the model and shards the data; tensor parallelism splits the weight matrices of a single layer across several devices; pipeline parallelism splits the model along its depth dimension, assigning a contiguous set of layers to each device.
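As a back-of-envelope check on these numbers, the sketch below tallies bytes per parameter under the common mixed-precision-Adam accounting (fp16 weights and gradients plus fp32 master weights and two fp32 Adam moments, about 16 bytes per parameter); the exact breakdown varies by framework and optimizer.

```python
# Rough training-memory arithmetic for a dense model under mixed
# precision with Adam. Activation memory is deliberately excluded.
def training_bytes_per_param() -> int:
    fp16_weights = 2          # bytes
    fp16_grads = 2
    fp32_master_weights = 4   # mixed-precision master copy
    fp32_momentum = 4         # Adam first moment
    fp32_variance = 4         # Adam second moment
    return fp16_weights + fp16_grads + fp32_master_weights + fp32_momentum + fp32_variance

params = 175e9
weights_only_gb = params * 2 / 1e9                          # ~350 GB, as quoted above
full_state_gb = params * training_bytes_per_param() / 1e9   # ~2.8 TB before activations

print(f"fp16 weights alone: {weights_only_gb:.0f} GB")
print(f"weights + grads + Adam state: {full_state_gb:.0f} GB")
```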
Pipeline parallelism is attractive because the only required communication is the activation tensor passed from one stage to the next on the forward pass, plus the gradient tensor sent back on the backward pass. These point-to-point messages are far smaller than the all-reduce traffic of data parallelism or the all-gather and reduce-scatter traffic of tensor parallelism. As a result, pipeline parallelism scales well across slower inter-node interconnects such as InfiniBand or Ethernet, while tensor parallelism is usually confined to the high-bandwidth NVLink domain inside a single node.
In the simplest form of pipeline parallelism, a model with L layers is partitioned into P contiguous stages, with each stage assigned to a different accelerator. Training proceeds as follows for one optimization step:

1. The mini-batch is split into M micro-batches.
2. Micro-batch 1 enters the first stage; as soon as that stage finishes, it sends its output activations to the next stage and immediately starts on micro-batch 2, and so on down the pipeline.
3. When a micro-batch reaches the last stage, the loss is computed and the backward pass sends gradient tensors back up through the stages in reverse order.
4. After all M micro-batches have completed both passes, the accumulated gradients are applied in a single optimizer step on every stage.
Because each stage must wait for input from its predecessor, devices sit idle during the warm-up and cool-down. This idle time is the pipeline bubble.
For a synchronous pipeline with no overlap between forward and backward, the fraction of time each device spends idle is approximately:
bubble_fraction = (P - 1) / (M + P - 1)
where P is the number of stages and M is the number of micro-batches per step. The bubble shrinks as M grows (when M >> P, the fraction collapses to ~(P - 1) / M) and grows linearly in P. The original GPipe paper observed that the overhead becomes negligible when M >= 4P, a heuristic still cited today. Raising M is not free, however: each in-flight micro-batch keeps its activations in memory until its backward pass completes, so activation memory scales linearly with M. This is the central tradeoff every subsequent pipeline schedule tries to relax.
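The closed form can be sanity-checked with a toy timeline model in which every stage takes exactly one time unit per micro-batch (a sketch under that idealized assumption; the helper names are illustrative):

```python
# Idealized fill-drain pipeline: stage s processes micro-batch m in
# time slot s + m, so the makespan per stage is (P - 1) + M slots,
# of which only M are busy. This reproduces the closed-form bubble.
def simulated_bubble_fraction(P: int, M: int) -> float:
    total_slots = (P - 1) + M   # fill + steady work + drain
    busy_slots = M              # each stage does M units of real work
    return (total_slots - busy_slots) / total_slots

def closed_form(P: int, M: int) -> float:
    return (P - 1) / (M + P - 1)

for P, M in [(4, 8), (4, 16), (8, 32), (8, 64)]:
    sim, cf = simulated_bubble_fraction(P, M), closed_form(P, M)
    assert abs(sim - cf) < 1e-12
    print(f"P={P:2d} M={M:3d}  bubble={cf:.1%}  (GPipe heuristic M >= 4P: {M >= 4 * P})")
```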
Unlike data and tensor parallelism, which require collective operations across the whole group, pipeline parallelism only needs point-to-point (P2P) sends and receives between adjacent stages: stage k sends its forward activation to stage k+1 and receives the gradient back on the backward pass. This makes pipeline parallelism the most communication-efficient of the three dimensions measured in bytes per FLOP, and lets it scale gracefully across slower inter-node interconnects such as InfiniBand HDR or 400 Gb Ethernet, whereas tensor parallelism collapses outside the NVLink domain. In NVIDIA's 3D parallelism arrangement, tensor parallelism is bound inside a single server (NVLink at roughly 900 GB/s on H100 and B200), pipeline parallelism stretches across servers within a rack or pod, and data parallelism spans the outermost cluster dimension.
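A minimal sketch of this P2P pattern using torch.distributed's blocking send/recv, assuming the process group is already initialized with one process per stage, that rank equals the stage index, and that tensor shapes are known in advance (all of which real frameworks handle for you):

```python
import torch
import torch.distributed as dist

def forward_send(activations: torch.Tensor, rank: int, world_size: int):
    # Stage k hands its activations to stage k+1.
    if rank < world_size - 1:
        dist.send(activations, dst=rank + 1)

def forward_recv(shape, rank: int, device) -> torch.Tensor:
    # Stage k receives activations from stage k-1.
    buf = torch.empty(shape, device=device)
    if rank > 0:
        dist.recv(buf, src=rank - 1)
    return buf

def backward_send(grad: torch.Tensor, rank: int):
    # Gradients flow upstream to the previous stage.
    if rank > 0:
        dist.send(grad, dst=rank - 1)

def backward_recv(shape, rank: int, world_size: int, device) -> torch.Tensor:
    buf = torch.empty(shape, device=device)
    if rank < world_size - 1:
        dist.recv(buf, src=rank + 1)
    return buf
```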
The choice of how to interleave forward and backward passes for the M micro-batches across the P stages is called the pipeline schedule. Schedules differ in their bubble overhead, their peak activation memory, and the precise weight-update semantics they offer.
The original schedule introduced by GPipe runs all M forward passes first on every stage, then all M backward passes in reverse, then a single optimizer step. It is sometimes called fill-drain or F-then-B scheduling. GPipe is fully synchronous, so convergence matches a single-device baseline. The downside is activation memory: every stage must keep activations for all M in-flight micro-batches simultaneously. To control this, GPipe pairs the schedule with activation recomputation (gradient checkpointing), discarding activations after the forward pass and recomputing them on demand at the cost of roughly 30% extra FLOPs.
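The recomputation half of the design can be sketched with PyTorch's built-in gradient checkpointing; the stage composition below is an illustrative stand-in, not GPipe's actual implementation:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(torch.nn.Module):
    """One pipeline stage whose internal activations are recomputed."""
    def __init__(self, layers: torch.nn.Sequential):
        super().__init__()
        self.layers = layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activations inside `layers` are discarded after the forward
        # pass and recomputed on demand during backward.
        return checkpoint(self.layers, x, use_reentrant=False)

stage = CheckpointedStage(torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
))
y = stage(torch.randn(8, 1024, requires_grad=True))
y.sum().backward()   # the stage's forward is re-run here, not read from memory
```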
Microsoft's PipeDream introduced the 1F1B schedule, where each stage strictly alternates between one forward pass and one backward pass once the pipeline reaches steady state. The number of in-flight forward activations on stage k (0-indexed) is bounded by P - k, so the deepest stage holds only one set of activations and the first stage at most P. This dramatically reduces activation memory compared to GPipe while preserving the same theoretical bubble fraction. Two main variants exist (a per-stage scheduling sketch follows the list):

- Asynchronous 1F1B (the original PipeDream): there is no end-of-step pipeline flush, so a micro-batch's backward pass may see newer weights than its forward pass did; PipeDream therefore stashes a weight copy per in-flight micro-batch to keep the two passes consistent.
- Synchronous 1F1B (popularized by DAPPLE and adopted by Megatron-LM): the pipeline is flushed before a single optimizer step at the end of each iteration, restoring exact single-device semantics at the cost of the usual (P - 1) / (M + P - 1) bubble.
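A per-stage 1F1B loop, sketched in plain Python with illustrative fwd/bwd callbacks standing in for the real compute and P2P transfers:

```python
# 1F1B schedule for stage k (0-indexed) of P, over M micro-batches.
# fwd(i) / bwd(i) are placeholders for this stage's forward/backward
# on micro-batch i, including the sends/recvs to adjacent stages.
def one_f_one_b(k: int, P: int, M: int, fwd, bwd):
    num_warmup = min(P - 1 - k, M)   # deeper stages warm up less
    # Warm-up: forwards only, while the pipeline fills.
    for i in range(num_warmup):
        fwd(i)
    # Steady state: strictly alternate one forward, one backward.
    # Peak in-flight activations here is num_warmup + 1 = P - k.
    for i in range(M - num_warmup):
        fwd(num_warmup + i)
        bwd(i)
    # Cool-down: drain the remaining backwards.
    for i in range(M - num_warmup, M):
        bwd(i)
```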
NVIDIA's 2021 Megatron-LM paper introduced interleaved 1F1B, which partitions each stage into multiple sub-stages called virtual pipeline chunks. With v chunks per device, the bubble fraction drops to roughly (P - 1) / (v * M + P - 1), at the cost of v times more cross-stage communication. In practice, interleaved 1F1B is the default schedule for training the largest LLMs, with NVIDIA reporting 10%+ throughput gains over standard 1F1B on GPT-3-scale models. Activation memory grows because more partial micro-batches are in flight.
The 2024 ICLR paper by Qi, Wan, Huang, and Lin introduced Zero Bubble Pipeline Parallelism, the first family of synchronous schedules to (nearly) eliminate the pipeline bubble. The key insight is that the backward pass for a transformer layer splits into two distinct computations: B, the input gradient that the previous stage needs to continue its own backward pass, and W, the weight gradient that only the local stage uses. By scheduling W computations into slots that would otherwise be idle, the pipeline becomes nearly bubble-free without changing the number of in-flight activations. The handcrafted ZB-H1 schedule cuts the bubble to roughly a third of 1F1B's within the same activation memory budget; ZB-H2 reaches an exact zero bubble at about double the activation memory, and the automatic ZB-1p/ZB-2p variants search for schedules under an explicit memory limit.
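The B/W split can be illustrated with a single linear layer y = x @ W; this is a hand-rolled sketch of the idea, not the paper's scheduler:

```python
import torch

x = torch.randn(32, 1024)        # activations saved from the forward pass
W = torch.randn(1024, 1024)      # this stage's weight matrix
grad_y = torch.randn(32, 1024)   # gradient arriving from the next stage

# B: dL/dx = dL/dy @ W^T -- the upstream stage is blocked on this,
# so compute and send it immediately.
grad_x = grad_y @ W.t()

# W: dL/dW = x^T @ dL/dy -- nobody downstream or upstream waits on
# this, so a zero-bubble schedule defers it into an idle slot.
def deferred_weight_grad() -> torch.Tensor:
    return x.t() @ grad_y

grad_W = deferred_weight_grad()  # executed later in the real schedule
```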
The SC '21 paper by Shigang Li and Torsten Hoefler introduced Chimera, which runs two pipelines simultaneously in opposite directions, one descending from stage 0 to P-1 and another ascending. Chimera reports up to 50% fewer bubbles than synchronous 1F1B at comparable memory and 1.16x-2.34x speedups over Megatron-LM on a 1.3B-parameter GPT-2 across 2,048 Piz Daint nodes. Later work such as BitPipe combines bidirectional pipelines with interleaved scheduling.
The table below summarizes the main schedule families.
| Schedule | Year & Paper | Bubble Fraction | Peak Activation Memory | Weight Update Semantics |
|---|---|---|---|---|
| GPipe (F-then-B) | Huang et al., 2019 (NeurIPS) | (P - 1) / (M + P - 1) | M micro-batches per stage | Synchronous |
| 1F1B asynchronous | Narayanan et al., 2019 (SOSP) | ~0 (no flush) | O(P) on first stage | Asynchronous, weight stashing |
| 1F1B synchronous (DAPPLE) | Fan et al., 2021 (PPoPP) | (P - 1) / (M + P - 1) | O(P - k) on stage k | Synchronous |
| PipeDream-2BW | Narayanan et al., 2021 (ICML) | very small | 2 weight copies per stage | Asynchronous, double-buffered |
| Interleaved 1F1B | Narayanan et al., 2021 (SC) | (P - 1) / (v*M + P - 1) | v times higher than 1F1B | Synchronous |
| Chimera | Li & Hoefler, 2021 (SC) | up to 50% lower than 1F1B | Comparable to 1F1B | Synchronous |
| Zero Bubble (ZB-H1) | Qi et al., 2024 (ICLR) | ~1/3 of 1F1B | Same as 1F1B | Synchronous |
| Zero Bubble (ZB-H2) | Qi et al., 2024 (ICLR) | 0 (exact) | ~2x 1F1B | Synchronous |
The largest hidden cost of pipeline parallelism is the activation memory required to hold in-flight micro-batches between their forward and backward passes. For 1F1B on stage k of P, this peaks at roughly (P - k) micro-batch activations; for GPipe it is M activations on every stage. With long sequence lengths (32K, 128K, or longer), activations easily exceed weight memory. Three complementary techniques are typically combined:

- Activation recomputation (full or selective): discard activations after the forward pass and recompute them on demand during the backward pass.
- Sequence parallelism: shard the activations that tensor parallelism would otherwise replicate along the sequence dimension across the tensor-parallel group.
- Activation offloading: stash in-flight activations in host (CPU) memory and prefetch them back just before the backward pass needs them.
In modern Megatron-LM and NeMo configurations, these three techniques are layered together to push the maximum micro-batch count and sequence length while still fitting in HBM.
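For planning, per-layer activation memory can be approximated with the fp16 estimate from Korthikanti et al. (2022), sbh(34 + 5as/h) bytes per micro-batch; treat it as a rough guide rather than a guarantee, since it ignores implementation details:

```python
# Per-layer activation bytes for one transformer micro-batch in fp16,
# following Korthikanti et al. (2022): s = sequence length,
# b = micro-batch size, h = hidden size, a = attention heads.
def activation_bytes_per_layer(s: int, b: int, h: int, a: int) -> float:
    return s * b * h * (34 + 5 * a * s / h)

# Example: a GPT-3-scale layer (h=12288, a=96) at s=2048, b=1.
gb = activation_bytes_per_layer(2048, 1, 12288, 96) / 2**30
print(f"{gb:.1f} GiB per layer per micro-batch")
# Multiply by layers-per-stage and in-flight micro-batches to see why
# raising M or the sequence length quickly exhausts HBM.
```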
Deciding how to split the model across stages is a one-shot optimization problem: balance per-stage compute time, balance per-stage memory, and minimize the size of the activation tensor crossing each stage boundary. Common partitioning strategies:
| Method | How It Works | When To Use |
|---|---|---|
| Uniform layers | Equal number of layers per stage | Homogeneous transformer stacks where all layers have similar cost |
| Parameter-balanced | Equal number of parameters per stage | Models with non-uniform layer sizes such as GPT with embedding layers |
| Time-balanced (profile-driven) | Equal forward time per stage measured by profiler | Mixed-architecture models such as encoder-decoders |
| Memory-balanced | Equal activation memory per stage | Long-context models where activation memory dominates |
| Optimal (DP-based, used by PipeDream and DAPPLE) | Dynamic programming over a profiled cost model | Heterogeneous clusters or non-uniform pipelines |
DeepSpeed's PipelineModule exposes a partition_method argument accepting "uniform", "parameters", or type-based regex matching ("type:<regex>"). Megatron-LM auto-balances embeddings by placing the input embedding on stage 0 and the output embedding on the last stage, with optional weight tying that requires an extra all-reduce across the first and last stages.
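A minimal sketch of the DeepSpeed pipeline API; the layer sizes and stage count are illustrative, and a DeepSpeed launcher plus JSON config are assumed:

```python
import torch
from deepspeed.pipe import PipelineModule, LayerSpec

# LayerSpec defers construction so each rank only materializes the
# layers of its own stage.
layers = [LayerSpec(torch.nn.Linear, 1024, 1024) for _ in range(24)]

model = PipelineModule(
    layers=layers,
    num_stages=4,                    # pipeline depth P
    partition_method="parameters",   # balance parameter counts per stage
)
# deepspeed.initialize(model=model, config=...) returns an engine whose
# train_batch() drives the schedule; micro-batching comes from the
# gradient accumulation settings in the config.
```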
Most production training frameworks expose pipeline parallelism as a first-class feature, often combined into a 3D-parallel runtime.
| Framework | Pipeline API | Schedules Supported | Notes |
|---|---|---|---|
| Megatron-LM (NVIDIA) | pipeline_model_parallel_size argument | 1F1B, interleaved 1F1B, ZB | Reference implementation for 3D parallelism, used to train Megatron-Turing NLG, GPT-3 reproductions, and BLOOM |
| DeepSpeed (Microsoft) | deepspeed.pipe.PipelineModule | F-then-B, 1F1B | Combines with ZeRO for hybrid 3D-parallel runs; uses gradient accumulation to express pipeline batches |
| FairScale (Meta) | fairscale.nn.Pipe | GPipe-style F-then-B | Single-host implementation forked from torchgpipe, later upstreamed into PyTorch |
| torchgpipe | torchgpipe.GPipe | GPipe with checkpointing | Standalone PyTorch library by Kim et al., 2020, basis for FairScale's Pipe |
| PyTorch native (torch.distributed.pipelining) | pipe.build_stage() | GPipe, 1F1B, interleaved 1F1B, looped BFS | Migrated from PiPPy in 2024, supports cross-host PP and composes with DDP, FSDP, and TP |
| PipeDream / PipeDream-2BW (MSR) | Research prototype | 1F1B asynchronous, 2BW | Original 1F1B work, supports asynchronous semantics with weight stashing |
| Alpa | Compiler-driven | Inter-operator pipeline schedule, auto-found | Treats pipeline parallelism as inter-operator parallelism in a hierarchical search space |
| Galvatron / Hetu-Galvatron (PKU) | Auto-parallel planner | Hybrid PP + TP + DP | Automatically searches the 3D-parallel configuration for transformer training |
| Colossal-AI | colossalai.pipeline | 1F1B, interleaved 1F1B, ZB | User-friendly hybrid-parallel framework |
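For the PyTorch-native package, a two-stage sketch (torch >= 2.4; the model, split point, and single-rank setup below are illustrative, and a process group with one process per stage is assumed):

```python
import torch
from torch.distributed.pipelining import pipeline, SplitPoint, ScheduleGPipe

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(16)])
example = torch.randn(32, 1024)

# Trace the model and split it into two stages at submodule "8".
pipe = pipeline(model, mb_args=(example,), split_spec={"8": SplitPoint.BEGINNING})

# Each rank builds only its own stage and runs the chosen schedule.
rank, device = 0, torch.device("cuda", 0)   # set per process in practice
stage = pipe.build_stage(rank, device)
schedule = ScheduleGPipe(stage, n_microbatches=8)
# First rank: schedule.step(example.to(device)); the last rank's
# schedule.step() returns the pipeline output.
```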
The most influential application of pipeline parallelism is 3D parallelism, the combination of pipeline, tensor, and data parallelism introduced in NVIDIA's 2021 Megatron-LM paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. The usual mapping for a transformer of D layers running on N GPUs arranged in a P x T x R mesh (N = P * T * R) is:

- T-way tensor parallelism splits each layer's weight matrices across the T GPUs inside a single NVLink-connected server;
- P-way pipeline parallelism assigns D / P contiguous layers to each of P groups of those servers;
- R-way data parallelism replicates the resulting model shards R times, with each replica training on a different slice of the global batch.
NVIDIA's reference results trained a 1 trillion parameter GPT model on 3,072 A100 GPUs at 502 PFLOP/s aggregate (163 TFLOP/s per GPU, 52% of peak) using interleaved 1F1B with the scatter/gather optimization. Modern Llama-class training runs continue to use this template, often with fully-sharded data parallelism (FSDP) replacing classic DP on the outer dimension.
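The mesh layout can be expressed with PyTorch's DeviceMesh; the dimension names and degrees below are illustrative conventions under the assumption of a 256-rank job, not a fixed contract:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh

# Data (R), pipeline (P), and tensor (T) parallel degrees: 8*4*8 = 256 ranks.
R, P, T = 8, 4, 8
mesh = init_device_mesh("cuda", (R, P, T), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh["tp"].get_group()   # NVLink-domain collectives (all-gather/reduce-scatter)
pp_group = mesh["pp"].get_group()   # P2P sends/recvs between adjacent stages
dp_group = mesh["dp"].get_group()   # gradient all-reduce across replicas
```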
| Strength | Detail |
|---|---|
| Low communication volume | Only activations cross stage boundaries; messages are P2P, never collective. Scales well across slow interconnects. |
| Memory savings scale linearly | Each stage holds 1/P of the model weights, optimizer states, and gradients. |
| Composes cleanly | Pipeline parallelism is orthogonal to tensor and data parallelism; the three combine into 3D parallelism without conflicts. |
| Hardware-friendly | Adjacent stages can be placed in the same rack or rail; no NVLink Switch is required between distant stages. |
| Limitation | Detail |
|---|---|
| Pipeline bubble | The (P - 1) / (M + P - 1) overhead is fundamental for synchronous schedules. Mitigated, not eliminated, by interleaving and zero-bubble schedules. |
| Activation memory | Multiple in-flight micro-batches consume activation memory; full recomputation eliminates most of it at roughly 30% extra FLOPs, while selective recomputation costs only ~2.7% extra FLOPs for ~70% of the savings. |
| Load-balancing fragility | A single slow stage stalls the whole pipeline. Heterogeneous models (encoder-decoders, MoE) are harder to balance. |
| Optimization semantics | Asynchronous pipelines (PipeDream original, PipeDream-2BW) sacrifice exact synchronous semantics, requiring weight stashing or version buffering to converge. |
| Cross-stage dependencies | Tied embeddings (input and output sharing weights) require an extra all-reduce between the first and last stages on every step. |
Pipeline parallelism is also used at inference. For latency-bound interactive serving, the bubble cost is painful because there is no opportunity to amortize it over many micro-batches, so pipelining is typically reserved for models too large to fit in a single tensor-parallel domain. Throughput-optimized batched inference can pack many requests into the pipeline and approach training bubble efficiency. Frameworks like vLLM and NVIDIA TensorRT-LLM expose pipeline parallelism as a deployment knob, most useful for very large MoE or trillion-parameter dense models served across multiple nodes.
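In vLLM, for example, the two parallelism degrees are constructor arguments (the model name below is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="some-very-large-model",   # placeholder checkpoint name
    tensor_parallel_size=8,          # TP inside one NVLink domain
    pipeline_parallel_size=2,        # PP spanning nodes
)
outputs = llm.generate(["Pipeline parallelism is"])
```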
Consider a 70-billion-parameter dense transformer with 80 transformer layers, trained on 256 H100 GPUs arranged as T = 8, P = 4, R = 8. Each pipeline stage owns 20 layers (~17.5B parameters), and with the global batch split into M = 32 micro-batches per replica the synchronous bubble fraction is (4 - 1) / (32 + 4 - 1) = 8.6%. Interleaved 1F1B with v = 2 chunks per stage cuts that to (4 - 1) / (64 + 4 - 1) = 4.5%, and a zero-bubble schedule eliminates it entirely at the cost of additional activation memory. This kind of arithmetic guides how engineers choose (T, P, R, M, schedule) for any given model and cluster.
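The same arithmetic in a few lines of Python, with v denoting the number of interleaving chunks per stage:

```python
# Synchronous bubble fraction with optional interleaving.
def bubble(P: int, M: int, v: int = 1) -> float:
    return (P - 1) / (v * M + P - 1)

P, M = 4, 32
print(f"1F1B:              {bubble(P, M):.1%}")        # 8.6%
print(f"interleaved (v=2): {bubble(P, M, v=2):.1%}")   # 4.5%
```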
Pipeline parallelism for neural networks was foreshadowed by classical systolic-array architectures and by asynchronous pipelining ideas such as PipeMare in the late 2010s. The first widely cited modern systems were:

- GPipe (Huang et al., 2018 arXiv, NeurIPS 2019), the synchronous fill-drain schedule paired with activation recomputation;
- PipeDream (Narayanan et al., SOSP 2019), the asynchronous 1F1B schedule with weight stashing;
- torchgpipe (Kim et al., 2020), the standalone PyTorch reimplementation of GPipe on which FairScale's Pipe was based.
Recent research continues to refine the activation-memory tradeoff (BitPipe, breadth-first PP, Seq1F1B for long-context training) and to handle heterogeneous edge clusters (PipeEdge, CollaPipe).
A few practical rules consistently appear across NVIDIA's NeMo, Microsoft's DeepSpeed, and the Hugging Face Accelerate playbooks:

- Keep tensor parallelism inside the NVLink domain and let pipeline parallelism span nodes; its P2P traffic tolerates slower interconnects.
- Choose the micro-batch count so that M >= 4P, which keeps the synchronous bubble to a few percent.
- When M cannot grow because of memory or global-batch constraints, move to interleaved 1F1B or a zero-bubble schedule instead.
- Profile and rebalance the stage partition: the slowest stage sets the throughput of the whole pipeline.