# Pipelining

> Source: https://aiwiki.ai/wiki/pipelining
> Updated: 2026-06-09
> Categories: MLOps, Machine Learning, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**Pipelining** is a term used in two distinct senses within machine learning and artificial intelligence. The first refers to the orchestration of an end-to-end [machine learning](/wiki/machine_learning) workflow as a chained sequence of stages such as data ingestion, preprocessing, feature engineering, training, evaluation, and deployment. The second refers to **pipeline parallelism**, a strategy used in [distributed training](/wiki/distributed_training) of large neural networks where the model itself is partitioned across devices and micro-batches flow through the resulting stages, similar in spirit to instruction pipelining in classical computer architecture.

This article describes both meanings. The first part covers ML workflow pipelining, including the [scikit-learn](/wiki/scikit-learn) `Pipeline` class, [TensorFlow Extended](/wiki/tensorflow) (TFX), [Kubeflow](/wiki/kubeflow) Pipelines, [MLflow](/wiki/mlflow), [Apache Airflow](/wiki/mlops), and other workflow systems. The second part covers pipeline parallelism, including foundational systems such as GPipe and PipeDream, the 1F1B and interleaved 1F1B schedules used in [Megatron-LM](/wiki/distributed_training), [DeepSpeed](/wiki/deepspeed) pipeline parallelism, and how pipeline parallelism is combined with [data parallelism](/wiki/data_parallelism) and [tensor parallelism](/wiki/distributed_training) when training [large language models](/wiki/large_language_model) such as [GPT-4](/wiki/gpt4), [PaLM](/wiki/palm), and [LLaMA](/wiki/llama).

## Disambiguation

The two senses are quite different in scope and audience.

| Sense | Domain | Granularity | Typical users |
|---|---|---|---|
| ML workflow pipelining | [MLOps](/wiki/mlops), data engineering | Stages of an ML lifecycle (ingest, train, evaluate, serve) | Data scientists, ML engineers |
| Pipeline parallelism | [Distributed training](/wiki/distributed_training) | Layers or sub-modules of a single neural network | Systems researchers, large-scale training engineers |

The two ideas share a metaphor (sequential stages handing data to the next stage) but operate at completely different levels. A workflow pipeline runs once per training job and may take hours or days, with each stage scheduled by an orchestrator. A pipeline-parallel forward pass runs millions of times during a single training job and is scheduled at the level of individual micro-batches by the deep learning runtime.

## Part 1: machine learning workflow pipelining

Pipelining in this sense refers to chaining together the discrete steps of a machine learning workflow, from data preprocessing and feature extraction to model training, validation, and deployment, into a single coherent and reproducible system. Pipelining is used to simplify implementation, manage complex projects, support reproducibility, and automate retraining.

### Data preprocessing and feature extraction

Before a machine learning model can be trained, raw data must be transformed into a format the model can process. Common preprocessing steps include:

- Data cleaning: removing or correcting inconsistencies, missing values, and errors in the data.
- [Feature engineering](/wiki/feature_engineering): creating new, informative features from the raw data that can help improve the model's performance.
- Feature scaling: standardizing or normalizing features so they share a similar scale, which can improve training stability and convergence.
- Encoding: converting categorical variables into numerical form via [one-hot encoding](/wiki/one-hot_encoding), ordinal encoding, or learned [embeddings](/wiki/embeddings).

These preprocessing tasks are typically combined into a single pipeline so that the same transformations are applied consistently to training, validation, and inference data.

### Model training and evaluation

Once the data is preprocessed and features are extracted, the next step is to train a [machine learning](/wiki/machine_learning) model. This usually involves:

- Model selection: choosing an appropriate algorithm or architecture, such as a [Support Vector Machine](/wiki/support_vector_machine_svm), [Random Forest](/wiki/random_forest), [gradient boosted tree](/wiki/gradient_boosting), or [deep neural network](/wiki/deep_neural_network).
- [Hyperparameter](/wiki/hyperparameter) tuning: adjusting the algorithm's parameters to optimize performance, often via grid search, random search, or [Bayesian optimization](/wiki/bayesian_optimization).
- Model evaluation: assessing performance using metrics such as [accuracy](/wiki/accuracy), [precision](/wiki/precision), [recall](/wiki/recall), [F1 score](/wiki/f1_score), [AUC](/wiki/auc), or domain-specific scores.

Incorporating these steps into a pipeline lets researchers compare configurations under identical preprocessing and evaluation procedures, which reduces the risk of biased comparisons.

### Cross-validation and model selection

To select the most suitable model and hyperparameters for a given problem, practitioners often use [cross-validation](/wiki/cross-validation). This involves splitting the dataset into multiple subsets, training on a portion of the data, and evaluating on the remainder. The process is repeated multiple times with different splits.

Placing preprocessing inside the pipeline, and running cross-validation around the whole pipeline, is important to avoid data leakage.[16] If preprocessing steps such as scaling are fit on the full dataset before cross-validation splits are created, statistics from the held-out fold leak into the training fold, biasing the score upward. Cross-validating the pipeline as a unit ensures that, for each fold, the transformers are fit only on the samples used to train the predictor.
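A minimal sketch of leak-free evaluation, assuming a synthetic dataset from `make_classification` and a `LogisticRegression` classifier (both are illustrative choices, not taken from a specific source). Because the scaler is a step of the pipeline, `cross_val_score` refits it on the training portion of each fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (illustrative assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),                # refit on each training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score clones and refits the whole pipeline per fold, so the
# held-out fold never contributes to the scaler's mean and variance.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

The leaky alternative would call `StandardScaler().fit_transform(X)` once on the full dataset and cross-validate only the classifier.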

### scikit-learn Pipeline

The `Pipeline` class in `sklearn.pipeline` is one of the most widely used implementations of ML workflow pipelining.[14] A `Pipeline` chains a list of named transformers, ending in a final estimator.[15] Every step except the last must implement `fit` and `transform`; the final estimator only needs to implement `fit`, and the pipeline exposes whichever of `predict`, `transform`, or `fit_predict` that estimator provides.[15]

Key features of `sklearn.pipeline.Pipeline`:

- A single call to `fit` runs every transformer's `fit_transform` in sequence and then fits the final estimator on the transformed output.
- A single call to `predict` runs every transformer's `transform` and feeds the result into the final estimator's `predict`.
- Hyperparameters of any step can be addressed using the `step_name__param_name` syntax, allowing grid search across all stages simultaneously.[15]
- Fitted transformers can be cached to disk via the `memory` argument, which avoids refitting expensive preprocessors during repeated grid searches.[15]
- Setting `verbose=True` prints elapsed time per step.

A minimal example:

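```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (30 numeric features); substitute your own X and y.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step parameters are addressed as <step_name>__<param_name>.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

# Each CV fold refits the scaler and PCA on that fold's training portion only.
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
```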

For heterogeneous tabular data, scikit-learn provides two related composers.[16] `ColumnTransformer` applies different transformers to different columns and concatenates the results, useful when numeric features need scaling while categorical features need encoding.[17] `FeatureUnion` applies multiple transformers to the same input and concatenates their outputs, useful for combining different feature extraction strategies on the same column. Both compose with `Pipeline`. The helper `make_pipeline` constructs a `Pipeline` with auto-generated step names, and `make_column_transformer` does the same for `ColumnTransformer`.[16]
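As a rough illustration of `ColumnTransformer` composed with `Pipeline`, the sketch below scales two numeric columns and one-hot encodes one categorical column. The toy table, its column names, and the classifier are hypothetical and chosen only for illustration; the example assumes pandas is installed.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy table with hypothetical column names, for illustration only.
df = pd.DataFrame({
    "age": [22, 35, 47, 51, 29, 63],
    "income": [28_000, 52_000, 71_000, 64_000, 39_000, 80_000],
    "city": ["Oslo", "Lima", "Oslo", "Pune", "Lima", "Pune"],
})
y = [0, 0, 1, 1, 0, 1]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),                # scale numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),   # encode categorical column
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])
model.fit(df, y)

# Two scaled numeric columns plus one indicator column per distinct city.
features = model.named_steps["preprocess"].transform(df)
print(features.shape)
```

Keeping the `ColumnTransformer` as a pipeline step means the same column-wise preprocessing is replayed automatically at prediction time.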

### TensorFlow Extended (TFX)

TensorFlow Extended (TFX) is an end-to-end platform from Google for deploying production ML pipelines.[18] A TFX pipeline is a directed sequence of components that pass artifacts (datasets, schemas, models, evaluations) between steps and record metadata in an ML Metadata store.[20]

Standard TFX components include:

| Component | Purpose |
|---|---|
| `ExampleGen` | Ingests data from sources such as CSV, TFRecord, BigQuery, or custom inputs and splits it into train and eval sets. |
| `StatisticsGen` | Computes feature statistics over each split using TensorFlow Data Validation (TFDV). |
| `SchemaGen` | Infers a schema describing expected feature types, ranges, and presence. |
| `ExampleValidator` | Detects anomalies and training/serving drift by comparing statistics to the schema. |
| `Transform` | Performs full-pass preprocessing using TensorFlow Transform (TFT), producing a graph that is reused at serving time to avoid [training-serving skew](/wiki/training-serving_skew). |
| `Trainer` | Trains a TensorFlow model with a user-supplied `run_fn`, optionally on accelerators or distributed clusters. |
| `Tuner` | Runs hyperparameter search via KerasTuner. |
| `Evaluator` | Computes sliced metrics with TensorFlow Model Analysis (TFMA) and decides whether the candidate model is good enough to be blessed. |
| `InfraValidator` | Loads the model in a sandboxed serving binary to confirm it can be served. |
| `Pusher` | Pushes blessed models to TensorFlow Serving, TFLite, or another deployment target. |
| `BulkInferrer` | Runs batch inference using a blessed model. |

TFX pipelines can be authored once and executed on multiple orchestrators including Apache Airflow, [Apache Beam](/wiki/mlops), Kubeflow Pipelines, and Vertex AI Pipelines, which allows local development on a workstation to be promoted to a managed cloud environment without rewriting the pipeline code.[19]

### Kubeflow Pipelines

Kubeflow Pipelines is a Kubernetes-native platform for building and running portable ML workflows.[21] Pipelines are defined in Python using the Kubeflow Pipelines SDK (KFP) and compiled into a static workflow specification that the controller executes by spinning up containers as Kubernetes pods.[21] Each component runs in its own container, which makes language-agnostic steps easy to mix and supports GPU-heavy workloads, parallel hyperparameter search, and large training jobs on shared clusters.

Kubeflow also includes Notebooks for in-cluster JupyterLab, Katib for hyperparameter and neural architecture search, and KServe for model serving. Vertex AI Pipelines on Google Cloud is a managed service that runs KFP-compatible pipelines without requiring users to operate their own Kubernetes cluster.

### MLflow

[MLflow](/wiki/mlflow), originally developed at Databricks, is an open-source platform for managing the ML lifecycle.[22] It is organized around four components: MLflow Tracking for logging parameters, metrics, and artifacts of each run; MLflow Projects for packaging code in a reproducible format; MLflow Models for a standard model packaging format with multiple flavors (sklearn, PyTorch, TensorFlow, ONNX, custom Python); and MLflow Model Registry for versioning and stage promotion (Staging, Production, Archived).[22] MLflow is often paired with Airflow as a tracking and registry layer in MLOps stacks.

### Apache Airflow

Apache Airflow is a general-purpose workflow orchestrator that defines pipelines as directed acyclic graphs (DAGs) in Python.[23] Although Airflow is not ML-specific, it is widely used in production data and ML stacks because it integrates with most data warehouses, supports rich scheduling, and has mature operators for cloud services.[23] ML teams often use Airflow to orchestrate retraining DAGs, with MLflow handling experiment tracking and a model registry.

### Other ML pipeline frameworks

There is a long tail of frameworks targeting ML workflow pipelining. They differ in scope, orchestration model, and how opinionated they are about MLOps practices.

| Tool | Origin | Strengths | Notes |
|---|---|---|---|
| [Kubeflow](/wiki/kubeflow) Pipelines | Google | Kubernetes-native; scales to clusters; ML-specific components | Requires Kubernetes infrastructure |
| TFX | Google | Tight integration with TensorFlow; production-grade components | Best when the model is a TensorFlow model |
| MLflow | Databricks | Tracking, model packaging, registry; tool-agnostic | Lighter than full pipeline systems |
| Apache Airflow | Airbnb | General-purpose DAG orchestration; large ecosystem | Not ML-specific |
| Prefect | Prefect Technologies | Pythonic flows; dynamic DAGs; managed cloud | Positioned as an alternative to Airflow-style orchestration |
| Kedro | QuantumBlack (McKinsey) | Software engineering structure; data catalog; modular pipelines | Not an orchestrator on its own |
| ZenML | ZenML GmbH | Pluggable stack; integrates 50+ MLOps tools; cloud-agnostic | Framework rather than runtime |
| Metaflow | Netflix | Pythonic flows; versioning; AWS-native scaling | Battle-tested at Netflix scale |
| Dagster | Elementl | Asset-based pipelines; type checks; rich UI | Strong data-engineering features |
| TorchX | PyTorch | Job specification for distributed PyTorch training | Lower level than full ML pipelines |

Kedro emphasizes software engineering practices such as modularity, separation of concerns, and a versioned data catalog, but it does not provide orchestration on its own. ZenML is structured around a pluggable stack model in which artifact stores, orchestrators, and deployers are interchangeable plugins. Metaflow, originating at Netflix, is known for its Pythonic decorator-based flows and tight AWS integration. TorchX is a job launcher for distributed PyTorch training and inference rather than a full ML lifecycle pipeline.

### Feature stores

A feature store is a data platform that centralizes the definition, storage, serving, and monitoring of [features](/wiki/feature) used by ML models. It addresses two recurring problems: training/serving skew, where a feature is computed differently in batch training and online inference, and feature reuse, where similar features are recomputed in many places. A feature store typically exposes an offline store optimized for training (warehouses such as Snowflake, BigQuery, or Redshift) and an online store optimized for low-latency lookups during inference (backed by stores such as Redis, DynamoDB, or Postgres).

Notable systems include Feast, an open-source feature store with a pluggable architecture;[25] Tecton, a managed platform that adds real-time streaming feature pipelines and became a core contributor to Feast;[26] and Hopsworks, an enterprise feature store with strong ties to Apache Hudi. Major cloud vendors offer feature stores as part of their managed ML platforms, including Vertex AI Feature Store, Amazon SageMaker Feature Store, and Databricks Feature Store.

### CI/CD for machine learning (MLOps)

Continuous integration and continuous delivery for ML extends classical software CI/CD with the realities of data and models. Three things must be versioned and tested together: code, data, and models. A typical MLOps pipeline triggers when any of the three changes (a code commit, a new data partition, or a periodic retraining schedule), then runs the workflow pipeline end to end, validates the candidate model against a baseline, and either promotes it to production or rejects it.
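The final promote-or-reject step can be sketched as a simple gate. Everything here is hypothetical (the `EvalResult` container, the model identifiers, the metric values, and the `min_gain` threshold); real systems usually compare several metrics and data slices rather than a single number.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical container for one model's validation metric."""
    model_id: str
    accuracy: float


def should_promote(candidate: EvalResult, baseline: EvalResult,
                   min_gain: float = 0.0) -> bool:
    """Promote only if the candidate beats the baseline by at least min_gain."""
    return candidate.accuracy >= baseline.accuracy + min_gain


# Illustrative values, not measurements from any real system.
baseline = EvalResult("prod-v7", accuracy=0.912)
candidate = EvalResult("retrain-candidate", accuracy=0.921)

decision = "promote" if should_promote(candidate, baseline, min_gain=0.005) else "reject"
print(decision)
```

Requiring a minimum gain, rather than any improvement at all, is one way to avoid redeploying on differences that are within evaluation noise.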

Google Cloud's MLOps maturity model describes three levels: level 0 with manual handoffs between data scientists and ML engineers; level 1 with automated retraining pipelines but manual deployment; and level 2 with automated CI/CD/CT (continuous training) pipelines that retrain, validate, and redeploy without human intervention.[24] Tools such as Kubeflow, MLflow, Seldon Core, BentoML, GitHub Actions, and GitLab CI are commonly combined to implement these flows.

### Use cases for workflow pipelining

ML workflow pipelines are valuable in any setting where models are retrained, evaluated, or redeployed more than once. Recommendation systems at large platforms retrain daily or hourly. Fraud detection systems retrain on new labeled fraud as it arrives. Demand forecasting models for retail and logistics retrain on the latest sales data. Computer vision models for self-driving and robotics retrain whenever new edge cases are collected. In each case, pipelining ensures that the chain from raw data to deployed model can be replayed reliably.

## Part 2: pipeline parallelism in deep learning

The second meaning of pipelining concerns how a single neural network is trained across multiple accelerators. **Pipeline parallelism** is a form of [model parallelism](/wiki/model_parallelism) in which the layers of a model are partitioned into a sequence of stages, each placed on a separate device, and a mini-batch of training data is split into smaller micro-batches that flow through the stages in pipelined fashion, similar to instruction pipelining in a CPU.

When a model has too many parameters to fit on a single accelerator, the choices are roughly: shard parameters across devices (data-parallel sharding such as ZeRO or FSDP), split each layer's tensors across devices (tensor parallelism), or split the model into vertical stages and pipeline micro-batches through them. In practice, all three are used together for the largest models.

### Why pipeline parallelism

A naive form of model parallelism, where layer 1 runs on device 1, then sends its activations to device 2 for layer 2, and so on, leaves all but one device idle at any moment. With four stages that each take time `t`, for example, a single forward pass takes `4t` and each device is busy only 25% of the time. Pipeline parallelism fixes this by splitting the mini-batch into many micro-batches so that, in the steady state, all stages are busy on different micro-batches at the same time.

The efficiency gain is not free. Compared to a non-pipelined run, every pipelined run incurs **bubble overhead**: idle time at the beginning while the pipeline fills, and at the end while it drains. For a synchronous pipeline with `p` stages and `m` micro-batches, the fraction of time spent in the bubble is approximately:

```
bubble_fraction ≈ (p - 1) / (m + p - 1)
```

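This follows from a simple count, assuming every stage takes one time unit per micro-batch for the forward pass and one for the backward pass. Each device performs `2m` units of useful work (one forward and one backward per micro-batch), but the iteration lasts `2(m + p - 1)` units, because the forward phase and the backward phase each need `p - 1` extra units to fill or drain the pipeline. The idle time per device is therefore `2(p - 1)` units, and `2(p - 1) / (2(m + p - 1))` reduces to the expression above. Increasing the number of micro-batches `m` shrinks the bubble fraction, but only at the cost of more activation memory because more in-flight activations must be retained for the backward pass.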
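To make the formula concrete, the following sketch evaluates it for a four-stage pipeline. The function name is illustrative, and the calculation assumes that every stage takes the same time per micro-batch.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle fraction of a synchronous pipeline with p stages and m micro-batches."""
    if p < 1 or m < 1:
        raise ValueError("p and m must be positive integers")
    return (p - 1) / (m + p - 1)


# With 4 stages, adding micro-batches shrinks the bubble.
for m in (1, 4, 16, 64):
    print(f"p=4, m={m:>2}: bubble = {bubble_fraction(4, m):.1%}")
```

With a single micro-batch (`m = 1`) the bubble is 75%, which matches the 25% utilization of naive model parallelism on four devices; at `m = 64` it falls below 5%.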

### GPipe

GPipe is a pipeline parallelism library introduced by Yanping Huang and colleagues at Google in the 2018 arXiv paper *GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism*, which appeared at NeurIPS 2019.[1] GPipe partitions a sequential network into cells placed on separate accelerators and uses a synchronous batch-splitting algorithm: a mini-batch is divided into micro-batches, all forward passes for the mini-batch are run through the pipeline, then all backward passes, and finally a single synchronous parameter update.[1]

GPipe demonstrated near-linear speedup on two flagship workloads: a 557-million-parameter AmoebaNet vision model that achieved 84.4% top-1 accuracy on ImageNet-2012, and a single 6-billion-parameter, 128-layer Transformer trained for multilingual neural machine translation across more than 100 languages.[1] To reduce activation memory, GPipe uses **rematerialization** (also known as gradient checkpointing): only the inputs to each stage are stored during the forward pass, and intermediate activations are recomputed during the backward pass.[1]

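A simplified single-process sketch of GPipe-style training in PyTorch. The layer sizes and random data are illustrative, and device placement and overlapped execution across stages are omitted:

```python
import torch
from torch import nn

# Partition the model into p stages (in a real run, stage s lives on device s).
stages = [nn.Sequential(nn.Linear(16, 32), nn.ReLU()), nn.Linear(32, 4)]
params = [w for stage in stages for w in stage.parameters()]
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Split mini-batch (X, y) into m equally sized micro-batches.
X, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
m = 4
micro_x, micro_y = X.chunk(m), y.chunk(m)

optimizer.zero_grad()

# Forward pass: every micro-batch flows through all stages in order.
losses = []
for x_i, y_i in zip(micro_x, micro_y):
    a = x_i
    for stage in stages:
        a = stage(a)
    # Divide by m so the accumulated gradient matches the mini-batch mean.
    losses.append(loss_fn(a, y_i) / m)

# Backward pass (after all forwards complete), in reverse micro-batch order.
for loss in reversed(losses):
    loss.backward()  # autograd walks the stages from last to first

# Single synchronous optimizer step
optimizer.step()
```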

The scheme is synchronous and produces gradients identical to a non-pipelined run on the same global batch. Open implementations include `torchgpipe` from Kakao Brain, which was later folded into PyTorch.[27]

### PipeDream and 1F1B scheduling

*PipeDream: Generalized Pipeline Parallelism for DNN Training* by Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons, and Matei Zaharia, published at SOSP 2019, generalized pipeline parallelism in two important ways.[2] First, it introduced the **1F1B (one-forward-one-backward)** schedule, in which each worker alternates between forward and backward passes after the pipeline reaches steady state, instead of running all forward passes first as in GPipe.[2] This significantly reduces the number of in-flight activations a worker must retain, because each forward result is consumed by a backward pass much sooner.

Second, PipeDream introduced **weight stashing**: because forward passes for different micro-batches may run against different versions of the parameters, the runtime stores the version used during a micro-batch's forward pass and reuses it during the corresponding backward pass to keep gradients numerically consistent.[2] Combined with 1F1B, this enables **asynchronous** pipeline parallelism that avoids the synchronous flush at the end of each mini-batch.[30]

PipeDream reported up to 5.3x speedup over data parallelism for several deep models on small clusters where communication was a bottleneck.[2] Its main downsides are extra memory for stashed weights and slightly different optimization semantics from data parallelism.
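The memory effect of 1F1B can be illustrated by listing the order of operations on each stage and counting how many micro-batches are "in flight" (forward done, backward pending). This is a sketch of the per-stage ordering only, using the flush-style variant with `p - stage - 1` warm-up forwards; it ignores timing, communication, and weight stashing, and the function names are hypothetical.

```python
def gpipe_order(m: int) -> list[str]:
    """All forwards, then all backwards, for one stage."""
    return [f"F{i}" for i in range(m)] + [f"B{i}" for i in reversed(range(m))]


def one_f_one_b_order(stage: int, p: int, m: int) -> list[str]:
    """Warm-up forwards, alternating forward/backward, then cool-down backwards."""
    warmup = min(p - stage - 1, m)
    ops = [f"F{i}" for i in range(warmup)]
    for k in range(m - warmup):                      # steady state: 1 forward, 1 backward
        ops += [f"F{warmup + k}", f"B{k}"]
    ops += [f"B{i}" for i in range(m - warmup, m)]   # drain the remaining backwards
    return ops


def peak_in_flight(ops: list[str]) -> int:
    """Largest number of micro-batches with a forward done but no backward yet."""
    live = peak = 0
    for op in ops:
        live += 1 if op.startswith("F") else -1
        peak = max(peak, live)
    return peak


p, m = 4, 8
for stage in range(p):
    ops = one_f_one_b_order(stage, p, m)
    print(stage, peak_in_flight(gpipe_order(m)), peak_in_flight(ops), " ".join(ops))
```

For `p = 4` and `m = 8`, the all-forward-then-all-backward ordering holds 8 micro-batches in flight on every stage, while 1F1B peaks at 4, 3, 2, and 1 on stages 0 through 3. The 1F1B peak is bounded by the pipeline depth rather than by the number of micro-batches.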

### PipeDream-2BW

*Memory-Efficient Pipeline-Parallel DNN Training* (Narayanan et al., ICML 2021) introduced **PipeDream-2BW**, which uses a double-buffered weight update scheme.[3] Instead of stashing one weight version per in-flight micro-batch, PipeDream-2BW maintains exactly two weight versions per worker: a current version used by newly admitted micro-batches and a shadow version used by micro-batches still in flight.[3] Gradients are accumulated across a configurable number of micro-batches before each weight update, which preserves data-parallel-like semantics while keeping memory bounded. The paper reports end-to-end speedups of 1.3x to 20x for various GPT models versus an optimized model-parallel baseline, and up to 3.2x faster than GPipe.[3]

A related variant, **PipeDream-Flush**, retains the 1F1B schedule but inserts periodic pipeline flushes so that all workers use the same weight version, recovering bit-for-bit data-parallel semantics at the cost of reintroducing a small bubble per flush.[3]

### Megatron-LM and interleaved 1F1B

Megatron-LM, NVIDIA's framework for large transformer training described in *Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM* (Narayanan et al., SC 2021), introduced the **interleaved 1F1B** schedule.[4] Instead of giving each device a single contiguous block of layers, the model is divided into more chunks than there are devices and the chunks are assigned in an interleaved (round-robin) pattern.[4] With four devices and eight layers, for example, device 0 gets layers 0 and 4, device 1 gets layers 1 and 5, device 2 gets layers 2 and 6, and device 3 gets layers 3 and 7.

The interleaved schedule shrinks the pipeline bubble by a factor of `v`, the number of model chunks per device, relative to the non-interleaved 1F1B schedule of PipeDream-Flush, at the cost of proportionally more inter-stage communication.[4] Like 1F1B, it bounds the number of micro-batches whose activations must be stashed, making it more memory-efficient than GPipe in the presence of many micro-batches.[4] Megatron-LM combines interleaved 1F1B with tensor parallelism inside each transformer layer and data parallelism across replicas to scale to thousands of GPUs.[4]
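The round-robin layer assignment used by the interleaved schedule can be sketched as follows. The function name and signature are hypothetical and are not part of Megatron-LM's API; the sketch only reproduces the mapping from layers to devices.

```python
def interleaved_assignment(num_layers: int, num_devices: int,
                           chunks_per_device: int) -> dict[int, list[int]]:
    """Map each device to its layers under round-robin chunk assignment."""
    num_chunks = num_devices * chunks_per_device
    if num_layers % num_chunks != 0:
        raise ValueError("num_layers must divide evenly into chunks")
    layers_per_chunk = num_layers // num_chunks
    assignment: dict[int, list[int]] = {d: [] for d in range(num_devices)}
    for chunk in range(num_chunks):
        device = chunk % num_devices                 # round-robin over devices
        start = chunk * layers_per_chunk
        assignment[device].extend(range(start, start + layers_per_chunk))
    return assignment


print(interleaved_assignment(num_layers=8, num_devices=4, chunks_per_device=2))
```

For eight layers on four devices with two chunks per device, this yields layers 0 and 4 on device 0, layers 1 and 5 on device 1, and so on, matching the example above.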

### DeepSpeed pipeline parallelism

[DeepSpeed](/wiki/deepspeed), Microsoft's deep learning optimization library, implements pipeline parallelism using gradient accumulation across micro-batches and integrates it with two other strategies.[6] **ZeRO** (Zero Redundancy Optimizer) is a memory-efficient form of data parallelism that partitions optimizer state (Stage 1), gradients (Stage 2), and parameters (Stage 3) across data-parallel ranks instead of replicating them on every rank, giving access to the aggregate GPU memory of the cluster.[5] **3D parallelism** combines ZeRO-style data parallelism, pipeline parallelism, and tensor parallelism (typically borrowed from Megatron-LM) to scale models past one trillion parameters.[7] The Megatron-DeepSpeed integration was used to train models such as BLOOM-176B and Megatron-Turing NLG 530B.[28]
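The per-rank memory effect of the three ZeRO stages can be sketched with simple arithmetic. The byte counts are an assumption not stated in this article: they follow the mixed-precision Adam accounting commonly used to describe ZeRO (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter). The function name is hypothetical, and activations and temporary buffers are ignored.

```python
def zero_bytes_per_param(stage: int, n_ranks: int, k: int = 12) -> float:
    """Model-state bytes per parameter held by each data-parallel rank.

    Assumes 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and
    k bytes of fp32 optimizer state per parameter (k = 12 for Adam).
    """
    params, grads, optim = 2.0, 2.0, float(k)
    if stage >= 1:
        optim /= n_ranks     # Stage 1 partitions optimizer state
    if stage >= 2:
        grads /= n_ranks     # Stage 2 also partitions gradients
    if stage >= 3:
        params /= n_ranks    # Stage 3 also partitions parameters
    return params + grads + optim


num_params = 7.5e9  # illustrative model size
for stage in range(4):
    gb = zero_bytes_per_param(stage, n_ranks=64) * num_params / 1e9
    print(f"ZeRO stage {stage}: {gb:.1f} GB of model state per rank")
```

Under these assumptions, a 7.5-billion-parameter model needs about 120 GB of model state per rank without ZeRO and under 2 GB per rank with Stage 3 across 64 ranks.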

### PyTorch pipelining

PyTorch initially shipped a pipeline parallelism API as `torch.distributed.pipeline.sync`, derived from `torchgpipe`.[27] A separate research-oriented project, **PiPPy** (Pipeline Parallelism for PyTorch), provided more flexible model splitting via `torch.fx` and supported asynchronous schedules.[10] PiPPy has since been folded into PyTorch as `torch.distributed.pipelining`, which is the recommended API as of PyTorch 2.x.[8] The new package handles automatic model splitting into stages, manages micro-batch communication, and provides multiple schedule implementations including `ScheduleGPipe`, `Schedule1F1B`, `ScheduleInterleaved1F1B`, and `ScheduleLoopedBFS`.[8] It composes with [DDP](/wiki/data_parallelism), [FSDP](/wiki/data_parallelism), and tensor parallelism, allowing PyTorch users to assemble 3D parallelism in pure PyTorch.[9]

### Comparison of pipeline schedules

| Schedule | First proposed by | Synchronous? | Bubble fraction | Activation memory | Notes |
|---|---|---|---|---|---|
| GPipe | Huang et al., 2018 | Yes | (p-1)/(m+p-1) | High; stashes activations for all m micro-batches | Bit-for-bit equivalent to non-pipelined SGD; uses rematerialization |
| PipeDream (async) | Narayanan et al., SOSP 2019 | No | Negligible in steady state | Lower; 1F1B reduces in-flight activations | Stashes multiple weight versions; weight-update semantics differ |
| PipeDream-Flush (1F1B) | Narayanan et al., 2020 | Yes | Same as GPipe | Lower than GPipe (1F1B shape) | Flushes periodically to keep one weight version |
| PipeDream-2BW | Narayanan et al., ICML 2021 | Approximately | Negligible in steady state | Bounded; only two weight versions | Designed for very large models like GPT-3 |
| Interleaved 1F1B | Narayanan et al., SC 2021 (Megatron-LM) | Yes | Smaller than 1F1B by factor of v (chunks per device) | Slightly higher than 1F1B | Standard for modern LLM training |

### Pipeline parallelism vs. other parallelism strategies

Pipeline parallelism is one of several axes along which a training job can be parallelized. Modern large-model training combines several at once.

| Strategy | What is split | Communication pattern | Typical scale |
|---|---|---|---|
| [Data parallelism](/wiki/data_parallelism) | The mini-batch (each replica trains on a different shard) | All-reduce of gradients | Tens to thousands of devices |
| Tensor parallelism | Individual weight matrices within a layer | All-reduce of partial activations and gradients within each layer | Within a node, typically 4 or 8 GPUs |
| Pipeline parallelism | Layers (or contiguous groups of layers) into stages | Point-to-point activation and gradient sends between adjacent stages | Across nodes, typically 4 to 64 stages |
| [Sequence parallelism](/wiki/sequence_parallelism) | The sequence dimension within selected operators | Reduce-scatter and all-gather along the sequence | Combined with tensor parallelism |
| Expert parallelism (MoE) | Different experts of a [Mixture of Experts](/wiki/mixture_of_experts) layer | All-to-all of tokens to experts | Across nodes |

In modern LLM stacks, the typical configuration is tensor parallelism within a server (where bandwidth is highest), pipeline parallelism across servers within a rack or pod, and data parallelism across racks. Sequence parallelism is layered on top of tensor parallelism to reduce activation memory in LayerNorm and dropout.[11]

### Use cases in large language model training

Pipeline parallelism is a standard ingredient in training large language models, although the exact configuration varies by system.

- **Megatron-Turing NLG 530B** (2022) was trained on 2,240 NVIDIA A100 GPUs using Megatron-DeepSpeed with 8-way tensor parallelism, 35-way pipeline parallelism, and the remaining dimension as data parallelism.[28]
- **BLOOM-176B** (2022), trained by the BigScience collaboration, used 4-way tensor parallelism and 12-way pipeline parallelism on 384 A100 GPUs across 48 nodes.[29]
- **PaLM 540B** (Google, 2022) is a notable counterexample: Google reported training PaLM on 6,144 [TPU v4](/wiki/google_tpu_v4) chips across two pods using only data and model parallelism *without* pipeline parallelism.[12] The paper argues that high-bandwidth TPU interconnects and a parallel formulation of attention and feed-forward layers made pipeline parallelism unnecessary at this scale.[12]
- **LLaMA 3** (Meta, 2024) used four-dimensional parallelism: fully sharded data parallelism, tensor parallelism, pipeline parallelism, and [context parallelism](/wiki/context_parallelism).[13] The LLaMA 3 training paper notes that the team removed one layer from the first and last pipeline stages to compensate for the additional embedding and output projection work on those stages, balancing computation across pipeline ranks.[13]
- **GPT-4** (OpenAI, 2023): OpenAI has not publicly disclosed GPT-4's training configuration. Independent reporting has suggested a Mixture-of-Experts architecture and 3D parallelism, but specifics are not officially confirmed.
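The "remaining dimension" of data parallelism can be derived from the figures quoted in this list, since each model replica occupies tensor-parallel degree times pipeline-parallel degree GPUs. The sketch below is plain arithmetic on those quoted numbers; the function name is hypothetical, and the resulting degrees are implied by the figures rather than quoted from the cited sources.

```python
def data_parallel_degree(total_gpus: int, tensor_parallel: int,
                         pipeline_parallel: int) -> int:
    """Number of model replicas when each replica uses TP x PP GPUs."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if total_gpus % gpus_per_replica != 0:
        raise ValueError("total_gpus must be a multiple of TP x PP")
    return total_gpus // gpus_per_replica


# GPU counts and parallelism degrees as quoted in the list above.
mt_nlg = data_parallel_degree(2240, tensor_parallel=8, pipeline_parallel=35)
bloom = data_parallel_degree(384, tensor_parallel=4, pipeline_parallel=12)
print(mt_nlg, bloom)
```

Both configurations work out to 8 data-parallel replicas: 280 GPUs per replica for Megatron-Turing NLG and 48 GPUs per replica for BLOOM.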

### Limitations of pipeline parallelism

Pipeline parallelism is not free.

- **Bubble overhead** is unavoidable in synchronous schedules and grows with the number of stages relative to micro-batches.
- **Load imbalance** across stages can dominate the bubble. Embedding layers, output projections, and final losses make the first and last stages heavier than middle stages, which is why production systems often hand-tune the layer assignment per stage.
- **Activation memory** scales with the number of in-flight micro-batches, so reducing the bubble by raising `m` competes with memory budgets unless rematerialization or 1F1B-style schedules are used.
- **Implementation complexity** is high. Pipeline parallelism interacts with optimizer state, mixed precision, gradient accumulation, gradient clipping, and checkpointing in nontrivial ways, and combining it with tensor parallelism, ZeRO, and FSDP requires careful engineering.
- **Fault tolerance** is harder than in pure data parallelism: a failed device in a pipeline stage can stall the whole pipeline rather than just dropping one replica.
- **Asynchronous schedules** trade strict gradient correctness for throughput. The optimization landscape for asynchronous pipelines is less well understood, and most large-scale runs use synchronous variants such as PipeDream-Flush (1F1B with periodic flushes) or interleaved 1F1B.

## History

The concept of pipelining originates from computer architecture. Pipelined CPUs date back to the IBM 7030 Stretch (1961) and were popularized by RISC processors in the 1980s. Pipelining as a general principle for chaining computational stages was applied to data processing systems for decades before it appeared in machine learning frameworks.

In the ML ecosystem, scikit-learn introduced its `Pipeline` class in 2010 as part of its 0.4 release, providing a clean composition primitive for sequential transformers and a final estimator.[14] TFX evolved at Google starting around 2017 from internal systems used by Google Search, YouTube, and Google Play, and was released as open source in 2019. Kubeflow Pipelines was announced in 2018 and MLflow in June 2018, while Apache Airflow predates these ML-specific tools by several years (Airbnb open-sourced it in 2015).

Pipeline parallelism in the deep learning sense traces to GPipe (Huang et al., 2018)[1] and PipeDream (Harlap et al., 2018;[30] Narayanan et al., SOSP 2019[2]). The 1F1B schedule popularized by PipeDream and the interleaved 1F1B schedule introduced in Megatron-LM (Narayanan et al., 2021)[4] became standard ingredients in large language model training systems by 2022, when models such as the successors to GPT-3, BLOOM, and Megatron-Turing NLG were being trained at scale. PyTorch's official pipeline parallelism API matured between 2020 and 2024, from `torchgpipe` and `torch.distributed.pipeline` through PiPPy into `torch.distributed.pipelining`.

## Explain Like I'm 5 (ELI5)

Imagine making a sandwich with several steps in order: take out the bread, put on some mayo, add lettuce and tomatoes, then close it with another slice of bread. **ML workflow pipelining** is like writing down those steps so that anyone (including a computer) can follow them in the same order every time, for every sandwich.

Now imagine a sandwich shop with four cooks standing in a line. The first cook only does bread, the second only mayo, the third only lettuce and tomatoes, and the fourth only the top slice. If each cook works on a different sandwich at the same time, then once the line is full the shop makes four sandwiches in roughly the time it used to take to make one. That is **pipeline parallelism**: instead of one big computer doing everything, you cut the work into stations and let each station work on a different piece at the same time. The catch is that at the very start the last cook has nothing to do, and at the very end the first cook is finished before everyone else. Those quiet moments are called "bubbles," and a lot of clever scheduling tries to make them as small as possible.

## See also

- [Distributed training](/wiki/distributed_training)
- [Model parallelism](/wiki/model_parallelism)
- [Data parallelism](/wiki/data_parallelism)
- [DeepSpeed](/wiki/deepspeed)
- [MLflow](/wiki/mlflow)
- [Kubeflow](/wiki/kubeflow)
- [scikit-learn](/wiki/scikit-learn)
- [MLOps](/wiki/mlops)
- [Mixture of experts](/wiki/mixture_of_experts)

## References

1. Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M. X., Chen, D., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., & Chen, Z. (2018). "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." arXiv:1811.06965. Presented at NeurIPS 2019. https://arxiv.org/abs/1811.06965
2. Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., & Zaharia, M. (2019). "PipeDream: Generalized Pipeline Parallelism for DNN Training." In *Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19)*. https://www.microsoft.com/en-us/research/publication/pipedream-generalized-pipeline-parallelism-for-dnn-training/
3. Narayanan, D., Phanishayee, A., Shi, K., Chen, X., & Zaharia, M. (2021). "Memory-Efficient Pipeline-Parallel DNN Training." In *Proceedings of the 38th International Conference on Machine Learning (ICML)*. arXiv:2006.09503. https://arxiv.org/abs/2006.09503
4. Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., & Zaharia, M. (2021). "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." In *Proceedings of SC '21*. https://arxiv.org/abs/2104.04473
5. Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." In *Proceedings of SC '20*. https://arxiv.org/abs/1910.02054
6. DeepSpeed Team. "Pipeline Parallelism." DeepSpeed documentation. https://www.deepspeed.ai/tutorials/pipeline/
7. DeepSpeed Team. "Training Overview and Features." https://www.deepspeed.ai/training/
8. PyTorch Team. "Pipeline Parallelism (`torch.distributed.pipelining`)." PyTorch documentation. https://docs.pytorch.org/docs/stable/distributed.pipelining.html
9. PyTorch Team. "Introduction to Distributed Pipeline Parallelism." PyTorch tutorials. https://docs.pytorch.org/tutorials/intermediate/pipelining_tutorial.html
10. PyTorch project. "PiPPy: Pipeline Parallelism for PyTorch." GitHub repository. https://github.com/pytorch/PiPPy
11. Korthikanti, V., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., & Catanzaro, B. (2022). "Reducing Activation Recomputation in Large Transformer Models." arXiv:2205.05198. (Introduces sequence parallelism as a complement to tensor parallelism.) https://arxiv.org/abs/2205.05198
12. Chowdhery, A., Narang, S., Devlin, J., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. https://arxiv.org/abs/2204.02311
13. Meta AI. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. https://arxiv.org/abs/2407.21783
14. Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830. https://jmlr.org/papers/v12/pedregosa11a.html
15. scikit-learn developers. "Pipeline." scikit-learn 1.8 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
16. scikit-learn developers. "Pipelines and composite estimators." scikit-learn user guide. https://scikit-learn.org/stable/modules/compose.html
17. scikit-learn developers. "ColumnTransformer." scikit-learn 1.8 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
18. TensorFlow Team. "TFX: ML Production Pipelines." https://www.tensorflow.org/tfx
19. TensorFlow Team. "The TFX User Guide." https://www.tensorflow.org/tfx/guide
20. TensorFlow Team. "Understanding TFX Pipelines." https://www.tensorflow.org/tfx/guide/understanding_tfx_pipelines
21. Kubeflow Project. "Kubeflow Pipelines." https://www.kubeflow.org/docs/components/pipelines/
22. MLflow Project. "MLflow Documentation." https://mlflow.org/docs/latest/index.html
23. Apache Software Foundation. "Apache Airflow Documentation." https://airflow.apache.org/docs/
24. Google Cloud. "MLOps: Continuous delivery and automation pipelines in machine learning." Google Cloud Architecture Center. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
25. Feast Authors. "Feast: The Open Source Feature Store for Machine Learning." https://feast.dev/
26. Tecton. "Feature Store for Machine Learning." https://www.tecton.ai/feature-store/
27. Kakao Brain. "torchgpipe: A GPipe implementation in PyTorch." GitHub repository. https://github.com/kakaobrain/torchgpipe
28. Smith, S., Patwary, M., Norick, B., et al. (2022). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model." arXiv:2201.11990. https://arxiv.org/abs/2201.11990
29. BigScience Workshop. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv:2211.05100. https://arxiv.org/abs/2211.05100
30. Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., & Gibbons, P. B. (2018). "PipeDream: Fast and Efficient Pipeline Parallel DNN Training." arXiv:1806.03377. https://arxiv.org/abs/1806.03377

