Pipeline parallelism (often abbreviated PP) is a distributed training strategy that partitions the layers of a deep neural network across multiple accelerator devices, with each device holding one contiguous block of layers called a stage. During training, mini-batches of data are split into smaller micro-batches that flow through the stages assembly-line style: while stage 2 runs the forward pass for micro-batch 1, stage 1 can already begin the forward pass for micro-batch 2. The technique is one of the three main parallelism strategies used to train neural networks too large to fit on a single GPU, alongside data parallelism and tensor parallelism.
Pipeline parallelism became critical in the era of giant transformer models. Google introduced the foundational system GPipe in 2018, Microsoft Research followed with PipeDream in 2019, and NVIDIA combined pipeline parallelism with tensor and data parallelism into the 3D parallelism strategy used to train trillion-parameter large language models on thousands of GPUs. Modern training stacks such as Megatron-LM, DeepSpeed, FairScale, PyTorch's native torch.distributed.pipelining package, and automated planners like Alpa and Galvatron all use it as a primitive building block.
The defining performance challenge of pipeline parallelism is the pipeline bubble, the idle time on each device when the pipeline is filling at the start of a step or draining at the end. Years of scheduling research, from naive GPipe to one-forward-one-backward (1F1B), interleaved 1F1B, and most recently zero-bubble schedules, have steadily reduced this overhead.
A single modern accelerator such as an NVIDIA H100 has roughly 80 GB of high-bandwidth memory, while a 175-billion-parameter model in 16-bit precision needs about 350 GB just for the weights, before adding optimizer state, gradients, and activations. Even with mixed precision, ZeRO sharding, and activation checkpointing, large language models with hundreds of billions of parameters cannot fit on one device. Three complementary parallelism dimensions are typically combined to spread the workload: data parallelism replicates the model and shards the data; tensor parallelism splits the weight matrices of a single layer across several devices; pipeline parallelism splits the model along its depth dimension, assigning a contiguous set of layers to each device.
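As a back-of-envelope check on these numbers, the sketch below tallies bytes per parameter under the common mixed-precision-Adam accounting (fp16 weights and gradients plus fp32 master weights and two fp32 Adam moments, about 16 bytes per parameter); the exact breakdown varies by framework and optimizer.

```python
# Rough training-memory arithmetic for a dense model under mixed
# precision with Adam. Activation memory is deliberately excluded.
def training_bytes_per_param() -> int:
    fp16_weights = 2          # bytes
    fp16_grads = 2
    fp32_master_weights = 4   # mixed-precision master copy
    fp32_momentum = 4         # Adam first moment
    fp32_variance = 4         # Adam second moment
    return fp16_weights + fp16_grads + fp32_master_weights + fp32_momentum + fp32_variance

params = 175e9
weights_only_gb = params * 2 / 1e9                          # ~350 GB, as quoted above
full_state_gb = params * training_bytes_per_param() / 1e9   # ~2.8 TB before activations

print(f"fp16 weights alone: {weights_only_gb:.0f} GB")
print(f"weights + grads + Adam state: {full_state_gb:.0f} GB")
```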
Pipeline parallelism is attractive because the only required communication is the activation tensor passed from one stage to the next on the forward pass, plus the gradient tensor sent back on the backward pass. These point-to-point messages are far smaller than the all-reduce traffic of data parallelism or the all-gather and reduce-scatter traffic of tensor parallelism. As a result, pipeline parallelism scales well across slower inter-node interconnects such as InfiniBand or Ethernet, while tensor parallelism is usually confined to the high-bandwidth NVLink domain inside a single node.
In the simplest form of pipeline parallelism, a model with L layers is partitioned into P contiguous stages, with each stage assigned to a different accelerator. Training proceeds as follows for one optimization step:

1. The mini-batch is split into M micro-batches.
2. Micro-batch 1 enters the first stage; as soon as that stage finishes, it sends its output activations to the next stage and immediately starts on micro-batch 2, and so on down the pipeline.
3. When a micro-batch reaches the last stage, the loss is computed and the backward pass sends gradient tensors back up through the stages in reverse order.
4. After all M micro-batches have completed both passes, the accumulated gradients are applied in a single optimizer step on every stage.
Because each stage must wait for input from its predecessor, devices sit idle during the warm-up and cool-down. This idle time is the pipeline bubble.
For a synchronous pipeline with no overlap between forward and backward, the fraction of time each device spends idle is approximately:
bubble_fraction = (P - 1) / (M + P - 1)
where P is the number of stages and M is the number of micro-batches per step. The bubble shrinks as M grows (when M >> P, the fraction collapses to ~(P - 1) / M) and grows linearly in P. The original GPipe paper observed that the overhead becomes negligible when M >= 4P, a heuristic still cited today. Raising M is not free, however: each in-flight micro-batch keeps its activations in memory until its backward pass completes, so activation memory scales linearly with M. This is the central tradeoff every subsequent pipeline schedule tries to relax.
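The closed form can be sanity-checked with a toy timeline model in which every stage takes exactly one time unit per micro-batch (a sketch under that idealized assumption; the helper names are illustrative):

```python
# Idealized fill-drain pipeline: stage s processes micro-batch m in
# time slot s + m, so the makespan per stage is (P - 1) + M slots,
# of which only M are busy. This reproduces the closed-form bubble.
def simulated_bubble_fraction(P: int, M: int) -> float:
    total_slots = (P - 1) + M   # fill + steady work + drain
    busy_slots = M              # each stage does M units of real work
    return (total_slots - busy_slots) / total_slots

def closed_form(P: int, M: int) -> float:
    return (P - 1) / (M + P - 1)

for P, M in [(4, 8), (4, 16), (8, 32), (8, 64)]:
    sim, cf = simulated_bubble_fraction(P, M), closed_form(P, M)
    assert abs(sim - cf) < 1e-12
    print(f"P={P:2d} M={M:3d}  bubble={cf:.1%}  (GPipe heuristic M >= 4P: {M >= 4 * P})")
```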
Unlike data and tensor parallelism, which require collective operations across the whole group, pipeline parallelism only needs point-to-point (P2P) sends and receives between adjacent stages: stage k sends its forward activation to stage k+1 and receives the gradient back on the backward pass. This makes pipeline parallelism the most communication-efficient of the three dimensions measured in bytes per FLOP, and lets it scale gracefully across slower inter-node interconnects such as InfiniBand HDR or 400 Gb Ethernet, whereas tensor parallelism collapses outside the NVLink domain. In NVIDIA's 3D parallelism arrangement, tensor parallelism is bound inside a single server (NVLink at roughly 900 GB/s on H100 and B200), pipeline parallelism stretches across servers within a rack or pod, and data parallelism spans the outermost cluster dimension.
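A minimal sketch of this P2P pattern using torch.distributed's blocking send/recv, assuming the process group is already initialized with one process per stage, that rank equals the stage index, and that tensor shapes are known in advance (all of which real frameworks handle for you):

```python
import torch
import torch.distributed as dist

def forward_send(activations: torch.Tensor, rank: int, world_size: int):
    # Stage k hands its activations to stage k+1.
    if rank < world_size - 1:
        dist.send(activations, dst=rank + 1)

def forward_recv(shape, rank: int, device) -> torch.Tensor:
    # Stage k receives activations from stage k-1.
    buf = torch.empty(shape, device=device)
    if rank > 0:
        dist.recv(buf, src=rank - 1)
    return buf

def backward_send(grad: torch.Tensor, rank: int):
    # Gradients flow upstream to the previous stage.
    if rank > 0:
        dist.send(grad, dst=rank - 1)

def backward_recv(shape, rank: int, world_size: int, device) -> torch.Tensor:
    buf = torch.empty(shape, device=device)
    if rank < world_size - 1:
        dist.recv(buf, src=rank + 1)
    return buf
```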
The choice of how to interleave forward and backward passes for the M micro-batches across the P stages is called the pipeline schedule. Schedules differ in their bubble overhead, their peak activation memory, and the precise weight-update semantics they offer.
The original schedule introduced by GPipe runs all M forward passes first on every stage, then all M backward passes in reverse, then a single optimizer step. It is sometimes called fill-drain or F-then-B scheduling. GPipe is fully synchronous, so convergence matches a single-device baseline. The downside is activation memory: every stage must keep activations for all M in-flight micro-batches simultaneously. To control this, GPipe pairs the schedule with activation recomputation (gradient checkpointing), discarding activations after the forward pass and recomputing them on demand at the cost of roughly 30% extra FLOPs.
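The recomputation half of the design can be sketched with PyTorch's built-in gradient checkpointing; the stage composition below is an illustrative stand-in, not GPipe's actual implementation:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(torch.nn.Module):
    """One pipeline stage whose internal activations are recomputed."""
    def __init__(self, layers: torch.nn.Sequential):
        super().__init__()
        self.layers = layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activations inside `layers` are discarded after the forward
        # pass and recomputed on demand during backward.
        return checkpoint(self.layers, x, use_reentrant=False)

stage = CheckpointedStage(torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
))
y = stage(torch.randn(8, 1024, requires_grad=True))
y.sum().backward()   # the stage's forward is re-run here, not read from memory
```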
Microsoft's PipeDream introduced the 1F1B schedule, where each stage strictly alternates between one forward pass and one backward pass once the pipeline reaches steady state. The number of in-flight forward activations on stage k (0-indexed) is bounded by P - k, so the deepest stage holds only one set of activations and the first stage at most P. This dramatically reduces activation memory compared to GPipe while preserving the same theoretical bubble fraction. Two main variants exist (a per-stage scheduling sketch follows the list):

- Asynchronous 1F1B (the original PipeDream): there is no end-of-step pipeline flush, so a micro-batch's backward pass may see newer weights than its forward pass did; PipeDream therefore stashes a weight copy per in-flight micro-batch to keep the two passes consistent.
- Synchronous 1F1B (popularized by DAPPLE and adopted by Megatron-LM): the pipeline is flushed before a single optimizer step at the end of each iteration, restoring exact single-device semantics at the cost of the usual (P - 1) / (M + P - 1) bubble.
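A per-stage 1F1B loop, sketched in plain Python with illustrative fwd/bwd callbacks standing in for the real compute and P2P transfers:

```python
# 1F1B schedule for stage k (0-indexed) of P, over M micro-batches.
# fwd(i) / bwd(i) are placeholders for this stage's forward/backward
# on micro-batch i, including the sends/recvs to adjacent stages.
def one_f_one_b(k: int, P: int, M: int, fwd, bwd):
    num_warmup = min(P - 1 - k, M)   # deeper stages warm up less
    # Warm-up: forwards only, while the pipeline fills.
    for i in range(num_warmup):
        fwd(i)
    # Steady state: strictly alternate one forward, one backward.
    # Peak in-flight activations here is num_warmup + 1 = P - k.
    for i in range(M - num_warmup):
        fwd(num_warmup + i)
        bwd(i)
    # Cool-down: drain the remaining backwards.
    for i in range(M - num_warmup, M):
        bwd(i)
```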
NVIDIA's 2021 Megatron-LM paper introduced interleaved 1F1B, which partitions each stage into multiple sub-stages called virtual pipeline chunks. With v chunks per device, the bubble fraction drops to roughly (P - 1) / (v * M + P - 1), at the cost of v times more cross-stage communication. In practice, interleaved 1F1B is the default schedule for training the largest LLMs, with NVIDIA reporting 10%+ throughput gains over standard 1F1B on GPT-3-scale models. Activation memory grows because more partial micro-batches are in flight.
The 2024 ICLR paper by Qi, Wan, Huang, and Lin introduced Zero Bubble Pipeline Parallelism, the first family of synchronous schedules to (nearly) eliminate the pipeline bubble. The key insight is that the backward pass for a transformer layer splits into two distinct computations: B, the input gradient that the previous stage needs to continue its own backward pass, and W, the weight gradient that only the local stage uses. By scheduling W computations into slots that would otherwise be idle, the pipeline becomes nearly bubble-free without changing the number of in-flight activations. The handcrafted ZB-H1 schedule cuts the bubble to roughly a third of 1F1B's within the same activation memory budget; ZB-H2 reaches an exact zero bubble at about double the activation memory, and the automatic ZB-1p/ZB-2p variants search for schedules under an explicit memory limit.
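The B/W split can be illustrated with a single linear layer y = x @ W; this is a hand-rolled sketch of the idea, not the paper's scheduler:

```python
import torch

x = torch.randn(32, 1024)        # activations saved from the forward pass
W = torch.randn(1024, 1024)      # this stage's weight matrix
grad_y = torch.randn(32, 1024)   # gradient arriving from the next stage

# B: dL/dx = dL/dy @ W^T -- the upstream stage is blocked on this,
# so compute and send it immediately.
grad_x = grad_y @ W.t()

# W: dL/dW = x^T @ dL/dy -- nobody downstream or upstream waits on
# this, so a zero-bubble schedule defers it into an idle slot.
def deferred_weight_grad() -> torch.Tensor:
    return x.t() @ grad_y

grad_W = deferred_weight_grad()  # executed later in the real schedule
```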
The SC '21 paper by Shigang Li and Torsten Hoefler introduced Chimera, which runs two pipelines simultaneously in opposite directions, one descending from stage 0 to P-1 and another ascending. Chimera reports up to 50% fewer bubbles than synchronous 1F1B at comparable memory and 1.16x-2.34x speedups over Megatron-LM on a 1.3B-parameter GPT-2 across 2,048 Piz Daint nodes. Later work such as BitPipe combines bidirectional pipelines with interleaved scheduling.
The table below summarizes the main schedule families.
| Schedule | Year & Paper | Bubble Fraction | Peak Activation Memory | Weight Update Semantics |
|---|---|---|---|---|
| GPipe (F-then-B) | Huang et al., 2019 (NeurIPS) | (P - 1) / (M + P - 1) | M micro-batches per stage | Synchronous |
| 1F1B asynchronous | Narayanan et al., 2019 (SOSP) | ~0 (no flush) | O(P) on first stage | Asynchronous, weight stashing |
| 1F1B synchronous (DAPPLE) | Fan et al., 2021 (PPoPP) | (P - 1) / (M + P - 1) | O(P - k) on stage k | Synchronous |
| PipeDream-2BW | Narayanan et al., 2021 (ICML) | very small | 2 weight copies per stage | Asynchronous, double-buffered |
| Interleaved 1F1B | Narayanan et al., 2021 (SC) | (P - 1) / (v*M + P - 1) | v times higher than 1F1B | Synchronous |
| Chimera | Li & Hoefler, 2021 (SC) | up to 50% lower than 1F1B | Comparable to 1F1B | Synchronous |
| Zero Bubble (ZB-H1) | Qi et al., 2024 (ICLR) | ~1/3 of 1F1B | Same as 1F1B | Synchronous |
| Zero Bubble (ZB-H2) | Qi et al., 2024 (ICLR) | 0 (exact) | ~2x 1F1B | Synchronous |
The largest hidden cost of pipeline parallelism is the activation memory required to hold in-flight micro-batches between their forward and backward passes. For 1F1B on stage k of P, this peaks at roughly (P - k) micro-batch activations; for GPipe it is M activations on every stage. With long sequence lengths (32K, 128K, or longer), activations easily exceed weight memory. Three complementary techniques are typically combined:

- Activation recomputation (full or selective): discard activations after the forward pass and recompute them on demand during the backward pass.
- Sequence parallelism: shard the activations that tensor parallelism would otherwise replicate along the sequence dimension across the tensor-parallel group.
- Activation offloading: stash in-flight activations in host (CPU) memory and prefetch them back just before the backward pass needs them.
In modern Megatron-LM and NeMo configurations, these three techniques are layered together to push the maximum micro-batch count and sequence length while still fitting in HBM.
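For planning, per-layer activation memory can be approximated with the fp16 estimate from Korthikanti et al. (2022), sbh(34 + 5as/h) bytes per micro-batch; treat it as a rough guide rather than a guarantee, since it ignores implementation details:

```python
# Per-layer activation bytes for one transformer micro-batch in fp16,
# following Korthikanti et al. (2022): s = sequence length,
# b = micro-batch size, h = hidden size, a = attention heads.
def activation_bytes_per_layer(s: int, b: int, h: int, a: int) -> float:
    return s * b * h * (34 + 5 * a * s / h)

# Example: a GPT-3-scale layer (h=12288, a=96) at s=2048, b=1.
gb = activation_bytes_per_layer(2048, 1, 12288, 96) / 2**30
print(f"{gb:.1f} GiB per layer per micro-batch")
# Multiply by layers-per-stage and in-flight micro-batches to see why
# raising M or the sequence length quickly exhausts HBM.
```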
Deciding how to split the model across stages is a one-shot optimization problem: balance per-stage compute time, balance per-stage memory, and minimize the size of the activation tensor crossing each stage boundary. Common partitioning strategies:
| Method | How It Works | When To Use |
|---|---|---|
| Uniform layers | Equal number of layers per stage | Homogeneous transformer stacks where all layers have similar cost |
| Parameter-balanced | Equal number of parameters per stage | Models with non-uniform layer sizes such as GPT with embedding layers |
| Time-balanced (profile-driven) | Equal forward time per stage measured by profiler | Mixed-architecture models such as encoder-decoders |
| Memory-balanced | Equal activation memory per stage | Long-context models where activation memory dominates |
| Optimal (DP-based, used by PipeDream and DAPPLE) | Dynamic programming over a profiled cost model | Heterogeneous clusters or non-uniform pipelines |
DeepSpeed's PipelineModule exposes a partition_method argument accepting "uniform", "parameters", or type-based regex matching ("type:<regex>"). Megatron-LM auto-balances embeddings by placing the input embedding on stage 0 and the output embedding on the last stage, with optional weight tying that requires an extra all-reduce across the first and last stages.
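A minimal sketch of the DeepSpeed pipeline API; the layer sizes and stage count are illustrative, and a DeepSpeed launcher plus JSON config are assumed:

```python
import torch
from deepspeed.pipe import PipelineModule, LayerSpec

# LayerSpec defers construction so each rank only materializes the
# layers of its own stage.
layers = [LayerSpec(torch.nn.Linear, 1024, 1024) for _ in range(24)]

model = PipelineModule(
    layers=layers,
    num_stages=4,                    # pipeline depth P
    partition_method="parameters",   # balance parameter counts per stage
)
# deepspeed.initialize(model=model, config=...) returns an engine whose
# train_batch() drives the schedule; micro-batching comes from the
# gradient accumulation settings in the config.
```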
Most production training frameworks expose pipeline parallelism as a first-class feature, often combined into a 3D-parallel runtime.
| Framework | Pipeline API | Schedules Supported | Notes |
|---|---|---|---|
| Megatron-LM (NVIDIA) | pipeline_model_parallel_size argument | 1F1B, interleaved 1F1B, ZB | Reference implementation for 3D parallelism, used to train Megatron-Turing NLG, GPT-3 reproductions, and BLOOM |
| DeepSpeed (Microsoft) | deepspeed.pipe.PipelineModule | F-then-B, 1F1B | Combines with ZeRO for hybrid 3D-parallel runs; uses gradient accumulation to express pipeline batches |
| FairScale (Meta) | fairscale.nn.Pipe | GPipe-style F-then-B | Single-host implementation forked from torchgpipe, later upstreamed into PyTorch |
| torchgpipe | torchgpipe.GPipe | GPipe with checkpointing | Standalone PyTorch library by Kim et al., 2020, basis for FairScale's Pipe |
| PyTorch native (torch.distributed.pipelining) | pipe.build_stage() | GPipe, 1F1B, interleaved 1F1B, looped BFS | Migrated from PiPPy in 2024, supports cross-host PP and composes with DDP, FSDP, and TP |
| PipeDream / PipeDream-2BW (MSR) | Research prototype | 1F1B asynchronous, 2BW | Original 1F1B work, supports asynchronous semantics with weight stashing |
| Alpa | Compiler-driven | Inter-operator pipeline schedule, auto-found | Treats pipeline parallelism as inter-operator parallelism in a hierarchical search space |
| Galvatron / Hetu-Galvatron (PKU) | Auto-parallel planner | Hybrid PP + TP + DP | Automatically searches the 3D-parallel configuration for transformer training |
| Colossal-AI | colossalai.pipeline | 1F1B, interleaved 1F1B, ZB | User-friendly hybrid-parallel framework |
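For the PyTorch-native package, a two-stage sketch (torch >= 2.4; the model, split point, and single-rank setup below are illustrative, and a process group with one process per stage is assumed):

```python
import torch
from torch.distributed.pipelining import pipeline, SplitPoint, ScheduleGPipe

model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(16)])
example = torch.randn(32, 1024)

# Trace the model and split it into two stages at submodule "8".
pipe = pipeline(model, mb_args=(example,), split_spec={"8": SplitPoint.BEGINNING})

# Each rank builds only its own stage and runs the chosen schedule.
rank, device = 0, torch.device("cuda", 0)   # set per process in practice
stage = pipe.build_stage(rank, device)
schedule = ScheduleGPipe(stage, n_microbatches=8)
# First rank: schedule.step(example.to(device)); the last rank's
# schedule.step() returns the pipeline output.
```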
The most influential application of pipeline parallelism is 3D parallelism, the combination of pipeline, tensor, and data parallelism introduced in NVIDIA's 2021 Megatron-LM paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. The usual mapping for a transformer of D layers running on N GPUs arranged in a P x T x R mesh (N = P * T * R) is:

- T-way tensor parallelism splits each layer's weight matrices across the T GPUs inside a single NVLink-connected server;
- P-way pipeline parallelism assigns D / P contiguous layers to each of P groups of those servers;
- R-way data parallelism replicates the resulting model shards R times, with each replica training on a different slice of the global batch.
NVIDIA's reference results trained a 1 trillion parameter GPT model on 3,072 A100 GPUs at 502 PFLOP/s aggregate (163 TFLOP/s per GPU, 52% of peak) using interleaved 1F1B with the scatter/gather optimization. Modern Llama-class training runs continue to use this template, often with fully-sharded data parallelism (FSDP) replacing classic DP on the outer dimension.
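The mesh layout can be expressed with PyTorch's DeviceMesh; the dimension names and degrees below are illustrative conventions under the assumption of a 256-rank job, not a fixed contract:

```python
import torch
from torch.distributed.device_mesh import init_device_mesh

# Data (R), pipeline (P), and tensor (T) parallel degrees: 8*4*8 = 256 ranks.
R, P, T = 8, 4, 8
mesh = init_device_mesh("cuda", (R, P, T), mesh_dim_names=("dp", "pp", "tp"))

tp_group = mesh["tp"].get_group()   # NVLink-domain collectives (all-gather/reduce-scatter)
pp_group = mesh["pp"].get_group()   # P2P sends/recvs between adjacent stages
dp_group = mesh["dp"].get_group()   # gradient all-reduce across replicas
```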
| Strength | Detail |
|---|---|
| Low communication volume | Only activations cross stage boundaries; messages are P2P, never collective. Scales well across slow interconnects. |
| Memory savings scale linearly | Each stage holds 1/P of the model weights, optimizer states, and gradients. |
| Composes cleanly | Pipeline parallelism is orthogonal to tensor and data parallelism; the three combine into 3D parallelism without conflicts. |
| Hardware-friendly | Adjacent stages can be placed in the same rack or rail; no NVLink Switch is required between distant stages. |
| Limitation | Detail |
|---|---|
| Pipeline bubble | The (P - 1) / (M + P - 1) overhead is fundamental for synchronous schedules. Mitigated, not eliminated, by interleaving and zero-bubble schedules. |
| Activation memory | Multiple in-flight micro-batches consume activation memory; full recomputation eliminates most of it at roughly 30% extra FLOPs, while selective recomputation costs only ~2.7% extra FLOPs for ~70% of the savings. |
| Load-balancing fragility | A single slow stage stalls the whole pipeline. Heterogeneous models (encoder-decoders, MoE) are harder to balance. |
| Optimization semantics | Asynchronous pipelines (PipeDream original, PipeDream-2BW) sacrifice exact synchronous semantics, requiring weight stashing or version buffering to converge. |
| Cross-stage dependencies | Tied embeddings (input and output sharing weights) require an extra all-reduce between the first and last stages on every step. |
Pipeline parallelism is also used at inference. For latency-bound interactive serving, the bubble cost is painful because there is no opportunity to amortize it over many micro-batches, so pipelining is typically reserved for models too large to fit in a single tensor-parallel domain. Throughput-optimized batched inference can pack many requests into the pipeline and approach training bubble efficiency. Frameworks like vLLM and NVIDIA TensorRT-LLM expose pipeline parallelism as a deployment knob, most useful for very large MoE or trillion-parameter dense models served across multiple nodes.
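In vLLM, for example, the two parallelism degrees are constructor arguments (the model name below is a placeholder):

```python
from vllm import LLM

llm = LLM(
    model="some-very-large-model",   # placeholder checkpoint name
    tensor_parallel_size=8,          # TP inside one NVLink domain
    pipeline_parallel_size=2,        # PP spanning nodes
)
outputs = llm.generate(["Pipeline parallelism is"])
```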
Consider a 70-billion-parameter dense transformer with 80 transformer layers, trained on 256 H100 GPUs arranged as T = 8, P = 4, R = 8. Each pipeline stage owns 20 layers (~17.5B parameters), and with the global batch split into M = 32 micro-batches per replica the synchronous bubble fraction is (4 - 1) / (32 + 4 - 1) = 8.6%. Interleaved 1F1B with v = 2 chunks per stage cuts that to (4 - 1) / (64 + 4 - 1) = 4.5%, and a zero-bubble schedule eliminates it entirely at the cost of additional activation memory. This kind of arithmetic guides how engineers choose (T, P, R, M, schedule) for any given model and cluster.
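The same arithmetic in a few lines of Python, with v denoting the number of interleaving chunks per stage:

```python
# Synchronous bubble fraction with optional interleaving.
def bubble(P: int, M: int, v: int = 1) -> float:
    return (P - 1) / (v * M + P - 1)

P, M = 4, 32
print(f"1F1B:              {bubble(P, M):.1%}")        # 8.6%
print(f"interleaved (v=2): {bubble(P, M, v=2):.1%}")   # 4.5%
```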
Pipeline parallelism for neural networks was foreshadowed by classical systolic-array architectures and by asynchronous pipelining ideas such as PipeMare in the late 2010s. The first widely cited modern systems were:

- GPipe (Huang et al., 2018 arXiv, NeurIPS 2019), the synchronous fill-drain schedule paired with activation recomputation;
- PipeDream (Narayanan et al., SOSP 2019), the asynchronous 1F1B schedule with weight stashing;
- torchgpipe (Kim et al., 2020), the standalone PyTorch reimplementation of GPipe on which FairScale's Pipe was based.
Recent research continues to refine the activation-memory tradeoff (BitPipe, breadth-first PP, Seq1F1B for long-context training) and to handle heterogeneous edge clusters (PipeEdge, CollaPipe).
A few practical rules consistently appear across NVIDIA's NeMo, Microsoft's DeepSpeed, and the Hugging Face Accelerate playbooks:

- Keep tensor parallelism inside the NVLink domain and let pipeline parallelism span nodes; its P2P traffic tolerates slower interconnects.
- Choose the micro-batch count so that M >= 4P, which keeps the synchronous bubble to a few percent.
- When M cannot grow because of memory or global-batch constraints, move to interleaved 1F1B or a zero-bubble schedule instead.
- Profile and rebalance the stage partition: the slowest stage sets the throughput of the whole pipeline.