Pipelining

Pipelining is a term used in two distinct senses within machine learning and artificial intelligence. The first refers to the orchestration of an end-to-end machine learning workflow as a chained sequence of stages such as data ingestion, preprocessing, feature engineering, training, evaluation, and deployment. The second refers to pipeline parallelism, a strategy used in distributed training of large neural networks where the model itself is partitioned across devices and micro-batches flow through the resulting stages, similar in spirit to instruction pipelining in classical computer architecture.

This article describes both meanings. The first part covers ML workflow pipelining, including the scikit-learn Pipeline class, TensorFlow Extended (TFX), Kubeflow Pipelines, MLflow, Apache Airflow, and other workflow systems. The second part covers pipeline parallelism, including foundational systems such as GPipe and PipeDream, the 1F1B and interleaved 1F1B schedules used in Megatron-LM, DeepSpeed pipeline parallelism, and how pipeline parallelism is combined with data parallelism and tensor parallelism when training large language models such as Megatron-Turing NLG, BLOOM, and LLaMA 3, with PaLM as a notable counterexample that was trained without it.

Disambiguation

The two senses are quite different in scope and audience.

Sense	Domain	Granularity	Typical users
ML workflow pipelining	MLOps, data engineering	Stages of an ML lifecycle (ingest, train, evaluate, serve)	Data scientists, ML engineers
Pipeline parallelism	Distributed training	Layers or sub-modules of a single neural network	Systems researchers, large-scale training engineers

The two ideas share a metaphor (sequential stages handing data to the next stage) but operate at completely different levels. A workflow pipeline runs once per training job and may take hours or days, with each stage scheduled by an orchestrator. A pipeline-parallel forward pass runs millions of times during a single training job and is scheduled at the level of individual micro-batches by the deep learning runtime.

Part 1: machine learning workflow pipelining

Pipelining in this sense refers to chaining together the discrete steps of a machine learning workflow, from data preprocessing and feature extraction to model training, validation, and deployment, into a single coherent and reproducible system. Pipelining is used to simplify implementation, manage complex projects, support reproducibility, and automate retraining.

Data preprocessing and feature extraction

Before a machine learning model can be trained, raw data must be transformed into a format the model can process. Common preprocessing steps include:

Data cleaning: removing or correcting inconsistencies, missing values, and errors in the data.
Feature engineering: creating new, informative features from the raw data that can help improve the model's performance.
Feature scaling: standardizing or normalizing features so they share a similar scale, which can improve training stability and convergence.
Encoding: converting categorical variables into numerical form via one-hot encoding, ordinal encoding, or learned embeddings.

These preprocessing tasks are typically combined into a single pipeline so that the same transformations are applied consistently to training, validation, and inference data.

Model training and evaluation

Once the data is preprocessed and features are extracted, the next step is to train a machine learning model. This usually involves:

Model selection: choosing an appropriate algorithm or architecture, such as a Support Vector Machine, Random Forest, gradient boosted tree, or deep neural network.
Hyperparameter tuning: adjusting the algorithm's parameters to optimize performance, often via grid search, random search, or Bayesian optimization.
Model evaluation: assessing performance using metrics such as accuracy, precision, recall, F1 score, AUC, or domain-specific scores.

Incorporating these steps into a pipeline allows researchers to compare configurations consistently and helps keep evaluation free of leakage between training and held-out data.

Cross-validation and model selection

To select the most suitable model and hyperparameters for a given problem, practitioners often use cross-validation. This involves splitting the dataset into multiple subsets, training on a portion of the data, and evaluating on the remainder. The process is repeated multiple times with different splits.

Placing the preprocessing steps inside the pipeline, and cross-validating the pipeline as a whole, is important to avoid data leakage. If preprocessing steps such as scaling are fit on the full dataset before cross-validation splits are created, statistics from the held-out fold leak into the training fold and bias the score upward. When the transformers are pipeline steps, each fold refits them on that fold's training samples only, so the transformer and the predictor for a fold never see the held-out samples.
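As a minimal sketch of this point (the synthetic dataset, the fold count, and the variable names are assumptions made for illustration), the scaler below is a step of the pipeline, so cross_validate refits it on the training rows of every fold. The check at the end confirms that each fold's scaler statistics come from that fold's training rows, which differ from the full-dataset statistics that a scaler fitted before splitting would use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
cv = KFold(n_splits=5)

# The scaler is a pipeline step, so it is refit inside every fold.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
result = cross_validate(pipe, X, y, cv=cv, return_estimator=True)

fold_means_match = []
for (train_idx, _), fitted in zip(cv.split(X), result["estimator"]):
    scaler = fitted.named_steps["standardscaler"]
    # The fitted mean should equal the mean of this fold's training rows only.
    fold_means_match.append(np.allclose(scaler.mean_, X[train_idx].mean(axis=0)))

# What a scaler fitted before splitting (the leaky variant) would have used.
leaky_mean = StandardScaler().fit(X).mean_

print(all(fold_means_match))
print(result["test_score"].mean())
```

The leaky variant would call StandardScaler().fit_transform(X) once and then cross-validate only the classifier on the already scaled data.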

scikit-learn Pipeline

The Pipeline class in sklearn.pipeline is one of the most widely used implementations of ML workflow pipelining. A Pipeline chains a list of named transformers, ending in a final estimator. Every step except the last must implement fit and transform; the final estimator only needs to implement fit, and the pipeline exposes whichever of predict, transform, or fit_predict that final step provides.

Key features of sklearn.pipeline.Pipeline:

A single call to fit runs every transformer's fit_transform in sequence and then fits the final estimator on the transformed output.
A single call to predict runs every transformer's transform and feeds the result into the final estimator's predict.
Hyperparameters of any step can be addressed using the step_name__param_name syntax, allowing grid search across all stages simultaneously.
Fitted transformers can be cached to disk via the memory argument, which avoids refitting expensive preprocessors during repeated grid searches.
Setting verbose=True prints elapsed time per step.

A minimal example:

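from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data with 30 numeric features; any (X, y) with at least 20 features works.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each step is a (name, estimator) pair; the names prefix the grid keys below.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# step_name__param_name addresses a hyperparameter of a single step.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

# Every candidate refits the whole pipeline on each training fold.
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))  # accuracy of the refit best pipeline on held-out data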

For heterogeneous tabular data, scikit-learn provides two related composers. ColumnTransformer applies different transformers to different columns and concatenates the results, useful when numeric features need scaling while categorical features need encoding. FeatureUnion applies multiple transformers to the same input and concatenates their outputs, useful for combining different feature extraction strategies on the same column. Both compose with Pipeline. The helper make_pipeline constructs a Pipeline with auto-generated step names, and make_column_transformer does the same for ColumnTransformer.
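A short sketch of ColumnTransformer composed with Pipeline follows. The toy table, the column positions, and the step names are hypothetical and chosen only to show numeric columns being scaled while a categorical column is one-hot encoded.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table: columns 0 and 1 are numeric, column 2 is categorical.
X = np.array([
    [25, 40000, "red"],
    [32, 52000, "blue"],
    [47, 81000, "green"],
    [51, 90000, "red"],
    [23, 38000, "blue"],
    [36, 60000, "green"],
    [58, 99000, "red"],
    [29, 45000, "blue"],
], dtype=object)
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), [0, 1]),                      # scale numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),   # encode the categorical column
    ],
    sparse_threshold=0.0,  # always return a dense array
)

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X, y)

# Two scaled numeric columns plus one indicator column per category.
features = model.named_steps["prep"].transform(X)
print(features.shape)
print(model.predict(X[:2]))
```

Because the ColumnTransformer is itself a pipeline step, its scaler and encoder are refit together with the classifier whenever the pipeline is fit, including inside cross-validation.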

TensorFlow Extended (TFX)

TensorFlow Extended (TFX) is an end-to-end platform from Google for deploying production ML pipelines. A TFX pipeline is a directed sequence of components that pass artifacts (datasets, schemas, models, evaluations) between steps and record metadata in an ML Metadata store.

Standard TFX components include:

Component	Purpose
`ExampleGen`	Ingests data from sources such as CSV, TFRecord, BigQuery, or custom inputs and splits it into train and eval sets.
`StatisticsGen`	Computes feature statistics over each split using TensorFlow Data Validation (TFDV).
`SchemaGen`	Infers a schema describing expected feature types, ranges, and presence.
`ExampleValidator`	Detects anomalies, including training-serving skew and data drift, by comparing statistics against the schema.
`Transform`	Performs full-pass preprocessing using TensorFlow Transform (TFT), producing a graph that is reused at serving time to avoid training-serving skew.
`Trainer`	Trains a TensorFlow model with a user-supplied `run_fn`, optionally on accelerators or distributed clusters.
`Tuner`	Runs hyperparameter search via KerasTuner.
`Evaluator`	Computes sliced metrics with TensorFlow Model Analysis (TFMA) and decides whether the candidate model is good enough to be blessed.
`InfraValidator`	Loads the model in a sandboxed serving binary to confirm it can be served.
`Pusher`	Pushes blessed models to TensorFlow Serving, TFLite, or another deployment target.
`BulkInferrer`	Runs batch inference using a blessed model.

TFX pipelines can be authored once and executed on multiple orchestrators including Apache Airflow, Apache Beam, Kubeflow Pipelines, and Vertex AI Pipelines, which allows local development on a workstation to be promoted to a managed cloud environment without rewriting the pipeline code.

Kubeflow Pipelines

Kubeflow Pipelines is a Kubernetes-native platform for building and running portable ML workflows. Pipelines are defined in Python using the Kubeflow Pipelines SDK (KFP) and compiled into a static workflow specification that the controller executes by spinning up containers as Kubernetes pods. Each component runs in its own container, which makes language-agnostic steps easy to mix and supports GPU-heavy workloads, parallel hyperparameter search, and large training jobs on shared clusters.

Kubeflow also includes Notebooks for in-cluster JupyterLab, Katib for hyperparameter and neural architecture search, and KServe for model serving. Vertex AI Pipelines on Google Cloud is a managed service that runs KFP-compatible pipelines without requiring users to operate their own Kubernetes cluster.

MLflow

MLflow, originally developed at Databricks, is an open-source platform for managing the ML lifecycle. It is organized around four components: MLflow Tracking for logging parameters, metrics, and artifacts of each run; MLflow Projects for packaging code in a reproducible format; MLflow Models for a standard model packaging format with multiple flavors (sklearn, PyTorch, TensorFlow, ONNX, custom Python); and MLflow Model Registry for versioning and stage promotion (Staging, Production, Archived). MLflow is often paired with Airflow as a tracking and registry layer in MLOps stacks.

Apache Airflow

Apache Airflow is a general-purpose workflow orchestrator that defines pipelines as directed acyclic graphs (DAGs) in Python. Although Airflow is not ML-specific, it is widely used in production data and ML stacks because it integrates with most data warehouses, supports rich scheduling, and has mature operators for cloud services. ML teams often use Airflow to orchestrate retraining DAGs, with MLflow handling experiment tracking and a model registry.

Other ML pipeline frameworks

There is a long tail of frameworks targeting ML workflow pipelining. They differ in scope, orchestration model, and how opinionated they are about MLOps practices.

Tool	Origin	Strengths	Notes
Kubeflow Pipelines	Google	Kubernetes-native; scales to clusters; ML-specific components	Requires Kubernetes infrastructure
TFX	Google	Tight integration with TensorFlow; production-grade components	Best when the model is a TensorFlow model
MLflow	Databricks	Tracking, model packaging, registry; tool-agnostic	Lighter than full pipeline systems
Apache Airflow	Airbnb	General-purpose DAG orchestration; large ecosystem	Not ML-specific
Prefect	Prefect Technologies	Pythonic flows; dynamic DAGs; managed cloud	Positioned as an alternative to Airflow-style orchestration
Kedro	QuantumBlack (McKinsey)	Software engineering structure; data catalog; modular pipelines	Not an orchestrator on its own
ZenML	ZenML GmbH	Pluggable stack; integrates 50+ MLOps tools; cloud-agnostic	Framework rather than runtime
Metaflow	Netflix	Pythonic flows; versioning; AWS-native scaling	Battle-tested at Netflix scale
Dagster	Elementl	Asset-based pipelines; type checks; rich UI	Strong data-engineering features
TorchX	PyTorch	Job specification for distributed PyTorch training	Lower level than full ML pipelines

Kedro emphasizes software engineering practices such as modularity, separation of concerns, and a versioned data catalog, but it does not provide orchestration on its own. ZenML is structured around a pluggable stack model in which artifact stores, orchestrators, and deployers are interchangeable plugins. Metaflow, originating at Netflix, is known for its Pythonic decorator-based flows and tight AWS integration. TorchX is a job launcher for distributed PyTorch training and inference rather than a full ML lifecycle pipeline.

Feature stores

A feature store is a data platform that centralizes the definition, storage, serving, and monitoring of features used by ML models. It addresses two recurring problems: training/serving skew, where a feature is computed differently in batch training and online inference, and feature reuse, where similar features are recomputed in many places. A feature store typically exposes an offline store optimized for training (warehouses such as Snowflake, BigQuery, or Redshift) and an online store optimized for low-latency lookups during inference (stores such as Redis, DynamoDB, or Postgres).

Notable systems include Feast, an open-source feature store with a pluggable architecture that originated at Gojek; Tecton, a managed platform that adds real-time streaming feature pipelines and became a core contributor to Feast in 2020; and Hopsworks, an enterprise feature store with strong ties to Apache Hudi. Major cloud vendors offer feature stores as part of their managed ML platforms, including Vertex AI Feature Store, Amazon SageMaker Feature Store, and Databricks Feature Store.

CI/CD for machine learning (MLOps)

Continuous integration and continuous delivery for ML extends classical software CI/CD with the realities of data and models. Three things must be versioned and tested together: code, data, and models. A typical MLOps pipeline triggers when any of the three changes (a code commit, a new data partition, or a periodic retraining schedule), then runs the workflow pipeline end to end, validates the candidate model against a baseline, and either promotes it to production or rejects it.
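The final validate-then-promote step can be sketched as a small gate function. Everything in this sketch is hypothetical (the EvalReport fields, the thresholds, and the version labels); a real pipeline would read these values from its evaluation stage and model registry.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    """Hypothetical summary emitted by the evaluation stage."""
    model_version: str
    auc: float
    latency_ms: float


def should_promote(candidate, baseline, min_auc_gain=0.0, max_latency_ms=100.0):
    """Return True when the candidate beats the baseline within the latency budget."""
    improves = candidate.auc >= baseline.auc + min_auc_gain
    fast_enough = candidate.latency_ms <= max_latency_ms
    return improves and fast_enough


baseline = EvalReport("v12", auc=0.910, latency_ms=40.0)
candidate = EvalReport("v13", auc=0.925, latency_ms=45.0)

decision = "promote" if should_promote(candidate, baseline, min_auc_gain=0.005) else "reject"
print(decision)
```

Keeping the gate as a pure function of two reports makes it easy to unit-test in the same CI run that tests the pipeline code.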

Google Cloud's MLOps maturity model describes three levels: level 0 with manual handoffs between data scientists and ML engineers; level 1 with automated retraining pipelines but manual deployment; and level 2 with automated CI/CD/CT (continuous training) pipelines that retrain, validate, and redeploy without human intervention. Tools such as Kubeflow, MLflow, Seldon Core, BentoML, GitHub Actions, and GitLab CI are commonly combined to implement these flows.

Use cases for workflow pipelining

ML workflow pipelines are valuable in any setting where models are retrained, evaluated, or redeployed more than once. Recommendation systems at large platforms retrain daily or hourly. Fraud detection systems retrain on new labeled fraud as it arrives. Demand forecasting models for retail and logistics retrain on the latest sales data. Computer vision models for self-driving and robotics retrain whenever new edge cases are collected. In each case, pipelining ensures that the chain from raw data to deployed model can be replayed reliably.

Part 2: pipeline parallelism in deep learning

The second meaning of pipelining concerns how a single neural network is trained across multiple accelerators. Pipeline parallelism is a form of model parallelism in which the layers of a model are partitioned into a sequence of stages, each placed on a separate device, and a mini-batch of training data is split into smaller micro-batches that flow through the stages in pipelined fashion, similar to instruction pipelining in a CPU.

When a model has too many parameters to fit on a single accelerator, the choices are roughly: shard parameters across devices (data-parallel sharding such as ZeRO or FSDP), split each layer's tensors across devices (tensor parallelism), or split the model into vertical stages and pipeline micro-batches through them. In practice, all three are used together for the largest models.

Why pipeline parallelism

A naive form of model parallelism, where layer 1 runs on device 1, then sends its activations to device 2 for layer 2, and so on, leaves all but one device idle at any moment. With four stages that each take time t, for example, a single forward pass takes 4t and the average device utilization is only 25%. Pipeline parallelism fixes this by splitting the mini-batch into many micro-batches so that, in the steady state, all stages are busy on different micro-batches at the same time.

The efficiency gain is not free. Compared to a non-pipelined run, every pipelined run incurs bubble overhead: idle time at the beginning while the pipeline fills, and at the end while it drains. For a synchronous pipeline with p stages and m micro-batches, the fraction of time spent in the bubble is approximately:

bubble_fraction ≈ (p - 1) / (m + p - 1)

This follows from counting time slots per device, where one slot is one forward or one backward pass of a micro-batch through a stage (assuming both take the same time). Each device does useful work in 2m slots (m forwards and m backwards) and sits idle for 2(p - 1) slots while the pipeline fills and drains, so the whole schedule spans 2(m + p - 1) slots and the idle share is 2(p - 1) / 2(m + p - 1) = (p - 1) / (m + p - 1). Increasing the number of micro-batches m shrinks the bubble fraction, but only at the cost of more activation memory because more in-flight activations must be retained for the backward pass.
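The formula can be evaluated directly to see how quickly the bubble shrinks. The sketch below is only arithmetic on the expression above; the function name and the choice of p = 4 are assumptions for illustration.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Idle share of a synchronous pipeline with p stages and m micro-batches."""
    return (p - 1) / (m + p - 1)


# With p = 4 stages, m = 1 reproduces the naive case: 75% idle, 25% utilization.
fractions = {m: bubble_fraction(4, m) for m in (1, 4, 8, 32)}
for m, frac in fractions.items():
    print(f"m={m:2d}  bubble={frac:.3f}  utilization={1 - frac:.3f}")
```

A common rule of thumb that follows from the formula is to choose m several times larger than p, since the bubble only becomes small once m dominates p - 1.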

GPipe

GPipe is a pipeline parallelism library introduced by Yanping Huang and colleagues at Google in the 2018 arXiv paper GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, which appeared at NeurIPS 2019. GPipe partitions a sequential network into cells placed on separate accelerators and uses a synchronous batch-splitting algorithm: a mini-batch is divided into micro-batches, all forward passes for the mini-batch are run through the pipeline, then all backward passes, and finally a single synchronous parameter update.

GPipe demonstrated near-linear speedup on two flagship workloads: a 557-million-parameter AmoebaNet vision model that achieved 84.4% top-1 accuracy on ImageNet-2012, and a single 6-billion-parameter, 128-layer Transformer trained for multilingual neural machine translation across more than 100 languages. To reduce activation memory, GPipe uses rematerialization (also known as gradient checkpointing): only the inputs to each stage are stored during the forward pass, and intermediate activations are recomputed during the backward pass.

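A simplified single-process PyTorch sketch of GPipe-style training (the stages run one after another here; a real implementation places each stage on its own device and overlaps micro-batches):

import torch
from torch import nn

torch.manual_seed(0)
p, m = 4, 8  # number of stages, number of micro-batches

# Partition the model into p stages; in a real run stage[s] lives on device s.
stage = [nn.Sequential(nn.Linear(16, 16), nn.ReLU()) for _ in range(p)]
params = [w for s in stage for w in s.parameters()]
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.MSELoss()

# Split mini-batch (X, Y) into m micro-batches: x[0..m-1], y[0..m-1]
X, Y = torch.randn(32, 16), torch.randn(32, 16)
x, y = X.chunk(m), Y.chunk(m)

optimizer.zero_grad()

# Forward pass for every micro-batch; autograd keeps the activations needed later.
loss = []
for i in range(m):
    a = x[i]
    for s in range(p):
        a = stage[s](a)                    # runs on device s
    loss.append(loss_fn(a, y[i]) / m)      # scale so the sum matches the full mini-batch loss

# Backward pass (after all forwards complete); gradients accumulate in .grad
for i in reversed(range(m)):
    loss[i].backward()                     # autograd walks stages p-1..0

# Single synchronous optimizer step
optimizer.step()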

The scheme is synchronous and produces gradients identical to a non-pipelined run on the same global batch. Open implementations include torchgpipe from Kakao Brain, which was later folded into PyTorch.

PipeDream and 1F1B scheduling

PipeDream: Generalized Pipeline Parallelism for DNN Training by Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons, and Matei Zaharia, published at SOSP 2019, generalized pipeline parallelism in two important ways. First, it introduced the 1F1B (one-forward-one-backward) schedule, in which each worker alternates between forward and backward passes after the pipeline reaches steady state, instead of running all forward passes first as in GPipe. This significantly reduces the number of in-flight activations a worker must retain, because each forward result is consumed by a backward pass much sooner.
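To make the memory argument concrete, the sketch below lists the order of operations a single stage would execute under an all-forward-then-all-backward schedule and under a 1F1B schedule, then counts the peak number of micro-batches whose activations are held at once. It assumes the flushed (synchronous) form of 1F1B with m micro-batches per batch and zero-indexed stages; the function names are hypothetical.

```python
def gpipe_ops(stage: int, p: int, m: int):
    """All forwards, then all backwards (same order on every stage)."""
    return [("F", i) for i in range(m)] + [("B", i) for i in reversed(range(m))]


def one_f_one_b_ops(stage: int, p: int, m: int):
    """Warm-up forwards, then alternate one forward and one backward, then drain."""
    warmup = min(p - 1 - stage, m)
    ops = [("F", i) for i in range(warmup)]
    next_f, next_b = warmup, 0
    while next_f < m:                 # steady state
        ops.append(("F", next_f))
        next_f += 1
        ops.append(("B", next_b))
        next_b += 1
    while next_b < m:                 # cool-down
        ops.append(("B", next_b))
        next_b += 1
    return ops


def peak_in_flight(ops) -> int:
    """Largest number of micro-batches with a forward done but no backward yet."""
    live = peak = 0
    for kind, _ in ops:
        live += 1 if kind == "F" else -1
        peak = max(peak, live)
    return peak


p, m = 4, 8
gpipe_peaks = [peak_in_flight(gpipe_ops(s, p, m)) for s in range(p)]
f1b1_peaks = [peak_in_flight(one_f_one_b_ops(s, p, m)) for s in range(p)]
print(gpipe_peaks)
print(f1b1_peaks)
```

Under these assumptions the all-forward schedule holds activations for all m micro-batches on every stage, while 1F1B holds at most p - s on stage s, independent of m.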

Second, PipeDream introduced weight stashing: because forward passes for different micro-batches may run against different versions of the parameters, the runtime stores the version used during a micro-batch's forward pass and reuses it during the corresponding backward pass to keep gradients numerically consistent. Combined with 1F1B, this enables asynchronous pipeline parallelism that avoids the synchronous flush at the end of each mini-batch.
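Weight stashing can be illustrated with a toy stage that holds a single scalar weight. This is a hypothetical sketch, not PipeDream's implementation: the class, its learning rate, and the y = w * x layer are all made up to show that a backward pass reuses the weight version recorded during its own forward pass even after newer updates have landed.

```python
class StashingStage:
    """Toy one-weight stage computing y = w * x (hypothetical, for illustration)."""

    def __init__(self, weight: float, lr: float = 0.1):
        self.weight = weight
        self.lr = lr
        self.stash = {}  # micro-batch id -> (weight version used in forward, input)

    def forward(self, mb_id: int, x: float) -> float:
        # Record the weight version this micro-batch saw.
        self.stash[mb_id] = (self.weight, x)
        return self.weight * x

    def backward(self, mb_id: int, grad_out: float) -> float:
        w_used, x = self.stash.pop(mb_id)
        grad_in = w_used * grad_out             # dL/dx computed with the stashed weight
        self.weight -= self.lr * x * grad_out   # dL/dw applied to the latest weight
        return grad_in


stage = StashingStage(weight=2.0)
stage.forward(0, 1.0)           # sees weight 2.0
stage.forward(1, 1.0)           # sees weight 2.0
g0 = stage.backward(0, 1.0)     # updates the weight from 2.0 to 1.9
stage.forward(2, 1.0)           # sees the newer weight 1.9
g1 = stage.backward(1, 1.0)     # still uses the stashed 2.0
g2 = stage.backward(2, 1.0)     # uses the stashed 1.9
print(g0, g1, g2)
```

The stash grows with the number of in-flight micro-batches, which is the extra memory cost that later variants try to bound.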

PipeDream reported up to 5.3x speedup over data parallelism for several deep models on small clusters where communication was a bottleneck. Its main downsides are extra memory for stashed weights and slightly different optimization semantics from data parallelism.

PipeDream-2BW

Memory-Efficient Pipeline-Parallel DNN Training (Narayanan et al., ICML 2021) introduced PipeDream-2BW, which uses a double-buffered weight update scheme. Instead of stashing one weight version per in-flight micro-batch, PipeDream-2BW maintains exactly two weight versions per worker: a current version used by newly admitted micro-batches and a shadow version used by micro-batches still in flight. Gradients are accumulated across a configurable number of micro-batches before each weight update, which preserves data-parallel-like semantics while keeping memory bounded. The paper reports end-to-end speedups of 1.3x to 20x for various GPT models versus an optimized model-parallel baseline, and up to 3.2x faster than GPipe.

A related variant, PipeDream-Flush, retains the 1F1B schedule but inserts periodic pipeline flushes so that all workers use the same weight version, recovering the semantics of synchronous data-parallel training at the cost of reintroducing a small bubble per flush.

Megatron-LM and interleaved 1F1B

Megatron-LM, NVIDIA's framework for large transformer training described in Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (Narayanan et al., SC 2021), introduced the interleaved 1F1B schedule. Instead of giving each device a single contiguous block of layers, the model is divided into more chunks than there are devices and the chunks are assigned in an interleaved (round-robin) pattern. With four devices and eight layers, for example, device 0 gets layers 0 and 4, device 1 gets layers 1 and 5, device 2 gets layers 2 and 6, and device 3 gets layers 3 and 7.
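The round-robin assignment in this example can be written as a few lines of arithmetic. The helper below is a sketch with a hypothetical name, and it assumes the layer count divides evenly into devices times chunks per device.

```python
def interleaved_assignment(num_layers: int, num_devices: int, chunks_per_device: int):
    """Map each device to the layers it holds under round-robin chunk placement."""
    num_chunks = num_devices * chunks_per_device
    if num_layers % num_chunks != 0:
        raise ValueError("num_layers must be divisible by num_devices * chunks_per_device")
    layers_per_chunk = num_layers // num_chunks

    assignment = {device: [] for device in range(num_devices)}
    for chunk in range(num_chunks):
        device = chunk % num_devices              # chunks are dealt out round-robin
        start = chunk * layers_per_chunk
        assignment[device].extend(range(start, start + layers_per_chunk))
    return assignment


# Four devices, eight layers, two chunks per device, as in the example above.
small = interleaved_assignment(8, 4, 2)
print(small)

# The same pattern with 16 layers gives two-layer chunks.
large = interleaved_assignment(16, 4, 2)
print(large[0])
```

With chunks_per_device set to 1 the function falls back to the ordinary contiguous partition.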

The interleaved schedule shrinks the pipeline bubble by a factor of v, the number of model chunks per device, compared with PipeDream-Flush, because each device can start useful work after a shorter fill. The price is roughly v times more point-to-point communication between stages and somewhat higher activation memory, and the schedule requires the number of micro-batches to be a multiple of the number of pipeline stages. Megatron-LM combines interleaved 1F1B with tensor parallelism inside each transformer layer and data parallelism across replicas to scale to thousands of GPUs.

DeepSpeed pipeline parallelism

DeepSpeed, Microsoft's deep learning optimization library, implements pipeline parallelism using gradient accumulation across micro-batches and integrates it with two other strategies. ZeRO (Zero Redundancy Optimizer) is a memory-efficient form of data parallelism that partitions optimizer state (Stage 1), gradients (Stage 2), and parameters (Stage 3) across data-parallel ranks instead of replicating them on every rank, giving access to the aggregate GPU memory of the cluster. 3D parallelism combines ZeRO-style data parallelism, pipeline parallelism, and tensor parallelism (typically borrowed from Megatron-LM) to scale models past one trillion parameters. The Megatron-DeepSpeed integration was used to train models such as BLOOM-176B and Megatron-Turing NLG 530B.
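The effect of the three ZeRO stages on per-GPU memory can be estimated with the accounting used in the ZeRO paper listed in the references: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for mixed-precision Adam state. The sketch below assumes exactly that accounting, ignores activations and buffers, and uses a hypothetical function name.

```python
def zero_memory_gb(num_params: float, num_ranks: int, stage: int,
                   optimizer_bytes_per_param: int = 12) -> float:
    """Approximate per-GPU model-state memory in GB (1e9 bytes) for a ZeRO stage."""
    params = 2 * num_params                        # fp16 parameters
    grads = 2 * num_params                         # fp16 gradients
    optim = optimizer_bytes_per_param * num_params # fp32 copy, momentum, variance

    if stage >= 1:
        optim /= num_ranks    # Stage 1 partitions optimizer state
    if stage >= 2:
        grads /= num_ranks    # Stage 2 also partitions gradients
    if stage >= 3:
        params /= num_ranks   # Stage 3 also partitions parameters
    return (params + grads + optim) / 1e9


# 7.5 billion parameters on 64 data-parallel ranks.
table = {stage: zero_memory_gb(7.5e9, 64, stage) for stage in range(4)}
for stage, gb in table.items():
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Stage 0 here means plain data parallelism with full replication, which is the baseline the three stages are compared against.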

PyTorch pipelining

PyTorch initially shipped a pipeline parallelism API as torch.distributed.pipeline.sync, derived from torchgpipe. A separate research-oriented project, PiPPy (Pipeline Parallelism for PyTorch), provided more flexible model splitting via torch.fx and supported asynchronous schedules. PiPPy has since been folded into PyTorch as torch.distributed.pipelining, which is the recommended API as of PyTorch 2.x. The new package handles automatic model splitting into stages, manages micro-batch communication, and provides multiple schedule implementations including ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, and ScheduleLoopedBFS. It composes with DDP, FSDP, and tensor parallelism, allowing PyTorch users to assemble 3D parallelism in pure PyTorch.

Comparison of pipeline schedules

Schedule	First proposed by	Synchronous?	Bubble fraction	Activation memory	Notes
GPipe	Huang et al., 2018	Yes	(p-1)/(m+p-1)	High; stashes activations for all m micro-batches	Mathematically equivalent to non-pipelined mini-batch SGD; uses rematerialization
PipeDream (async)	Narayanan et al., SOSP 2019	No	Negligible in steady state	Lower; 1F1B reduces in-flight activations	Stashes multiple weight versions; weight-update semantics differ
PipeDream-Flush (1F1B)	Narayanan et al., 2020	Yes	Same as GPipe	Lower than GPipe (1F1B shape)	Flushes periodically to keep one weight version
PipeDream-2BW	Narayanan et al., ICML 2021	Approximately	Negligible in steady state	Bounded; only two weight versions	Designed for very large models like GPT-3
Interleaved 1F1B	Narayanan et al., SC 2021 (Megatron-LM)	Yes	Smaller than 1F1B by factor of v (chunks per device)	Slightly higher than 1F1B	Standard for modern LLM training

Pipeline parallelism vs. other parallelism strategies

Pipeline parallelism is one of several axes along which a training job can be parallelized. Modern large-model training combines several at once.

Strategy	What is split	Communication pattern	Typical scale
Data parallelism	The mini-batch (each replica trains on a different shard)	All-reduce of gradients	Tens to thousands of devices
Tensor parallelism	Individual weight matrices within a layer	All-reduce of partial activations and gradients within each layer	Within a node, typically 4 or 8 GPUs
Pipeline parallelism	Layers (or contiguous groups of layers) into stages	Point-to-point activation and gradient sends between adjacent stages	Across nodes, typically 4 to 64 stages
Sequence parallelism	The sequence dimension within selected operators	Reduce-scatter and all-gather along the sequence	Combined with tensor parallelism
Expert parallelism (MoE)	Different experts of a Mixture of Experts layer	All-to-all of tokens to experts	Across nodes

In modern LLM stacks, the typical configuration is tensor parallelism within a server (where bandwidth is highest), pipeline parallelism across servers within a rack or pod, and data parallelism across racks. Sequence parallelism is layered on top of tensor parallelism to reduce activation memory in LayerNorm and dropout.

Use cases in large language model training

Pipeline parallelism is a standard ingredient in training large language models, although the exact configuration varies by system.

Megatron-Turing NLG 530B (2022) was trained on 2,240 NVIDIA A100 GPUs using Megatron-DeepSpeed with 8-way tensor parallelism, 35-way pipeline parallelism, and the remaining dimension as data parallelism.
BLOOM-176B (2022), trained by the BigScience collaboration, used 4-way tensor parallelism and 12-way pipeline parallelism on 384 A100 GPUs across 48 nodes.
PaLM 540B (Google, 2022) is a notable counterexample: Google reported training PaLM on 6,144 TPU v4 chips across two pods using only data and model parallelism without pipeline parallelism. The paper argues that high-bandwidth TPU interconnects and a parallel formulation of attention and feed-forward layers made pipeline parallelism unnecessary at this scale.
LLaMA 3 (Meta, 2024) used four-dimensional parallelism: fully sharded data parallelism, tensor parallelism, pipeline parallelism, and context parallelism. The LLaMA 3 training paper notes that the team removed one layer from the first and last pipeline stages to compensate for the additional embedding and output projection work on those stages, balancing computation across pipeline ranks.
GPT-4 (OpenAI, 2023): OpenAI has not publicly disclosed GPT-4's training configuration. Independent reporting has suggested a Mixture-of-Experts architecture and 3D parallelism, but specifics are not officially confirmed.
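For the configurations above that report a GPU count together with tensor-parallel and pipeline-parallel degrees, the remaining data-parallel degree follows from dividing the GPU count by the product of the other two. The helper below is a small sketch of that arithmetic with a hypothetical name.

```python
def data_parallel_degree(total_gpus: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Number of model replicas left once tensor and pipeline degrees are fixed."""
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if total_gpus % gpus_per_replica != 0:
        raise ValueError("total_gpus must be a multiple of tensor_parallel * pipeline_parallel")
    return total_gpus // gpus_per_replica


# Figures as stated in the list above.
mt_nlg_dp = data_parallel_degree(2240, tensor_parallel=8, pipeline_parallel=35)
bloom_dp = data_parallel_degree(384, tensor_parallel=4, pipeline_parallel=12)
print(mt_nlg_dp, bloom_dp)
```

Both configurations work out to eight data-parallel replicas, each spanning 280 and 48 GPUs respectively.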

Limitations of pipeline parallelism

Pipeline parallelism has several costs and failure modes:

Bubble overhead is unavoidable in synchronous schedules and grows with the number of stages relative to micro-batches.
Load imbalance across stages can dominate the bubble. Embedding layers, output projections, and final losses make the first and last stages heavier than middle stages, which is why production systems often hand-tune the layer assignment per stage.
Activation memory scales with the number of in-flight micro-batches, so reducing the bubble by raising m competes with memory budgets unless rematerialization or 1F1B-style schedules are used.
Implementation complexity is high. Pipeline parallelism interacts with optimizer state, mixed precision, gradient accumulation, gradient clipping, and checkpointing in nontrivial ways, and combining it with tensor parallelism, ZeRO, and FSDP requires careful engineering.
Fault tolerance is harder than in pure data parallelism: a failed device in a pipeline stage can stall the whole pipeline rather than just dropping one replica.
Asynchronous schedules trade strict gradient correctness for throughput. The optimization behavior of asynchronous pipelines is less well understood, and most large-scale runs use synchronous variants such as PipeDream-Flush (1F1B with periodic flushes) or interleaved 1F1B.
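The load-imbalance point can be illustrated with a simple layer-assignment rule in the spirit of the LLaMA 3 adjustment described earlier: give the first and last stages one layer fewer than the middle stages. The sketch assumes, purely for illustration, that the embedding and the output projection each cost about as much as one transformer layer; the function name and the 30-layer example are hypothetical.

```python
def balanced_stage_sizes(num_layers: int, num_stages: int):
    """Layers per stage, with the first and last stage holding one layer fewer."""
    if num_stages < 2:
        raise ValueError("need at least two stages")
    # Treat the embedding and the output head as one extra layer each.
    padded = num_layers + 2
    if padded % num_stages != 0:
        raise ValueError("num_layers + 2 must be divisible by num_stages")
    per_stage = padded // num_stages

    sizes = [per_stage] * num_stages
    sizes[0] -= 1    # first stage also runs the embedding
    sizes[-1] -= 1   # last stage also runs the output projection and loss
    return sizes


sizes = balanced_stage_sizes(num_layers=30, num_stages=4)
print(sizes)
```

Production systems usually replace the one-layer assumption with measured per-stage timings before fixing the assignment.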

History

The concept of pipelining originates from computer architecture. Pipelined CPUs date back to the IBM 7030 Stretch (1961) and were popularized by RISC processors in the 1980s. Pipelining as a general principle for chaining computational stages was applied to data processing systems for decades before it appeared in machine learning frameworks.

In the ML ecosystem, scikit-learn introduced its Pipeline class in 2010 as part of its 0.4 release, providing a clean composition primitive for sequential transformers and a final estimator. TFX evolved at Google starting around 2017 from internal systems used by Google Search, YouTube, and Google Play, and was released as open source in 2019. Kubeflow Pipelines was announced in 2018 and MLflow in June 2018. Apache Airflow predates these ML-specific tools by several years (Airbnb open-sourced it in 2015).

Pipeline parallelism in the deep learning sense traces to GPipe (Huang et al., 2018) and PipeDream (Harlap et al., 2018; Narayanan et al., SOSP 2019). The 1F1B schedule popularized by PipeDream and the interleaved 1F1B schedule introduced in Megatron-LM (Narayanan et al., 2021) became standard ingredients in large language model training systems by 2022, when models such as GPT-3 successors, BLOOM, and Megatron-Turing NLG were being trained at scale. PyTorch's official pipeline parallelism API matured from torchgpipe and torch.distributed.pipeline through PiPPy into torch.distributed.pipelining over 2020 to 2024.

Explain Like I'm 5 (ELI5)

Imagine making a sandwich with several steps in order: take out the bread, put on some mayo, add lettuce and tomatoes, then close it with another slice of bread. ML workflow pipelining is like writing down those steps so that anyone (including a computer) can follow them in the same order every time, for every sandwich.

Now imagine a sandwich shop with four cooks standing in a line. The first cook only does bread, the second only mayo, the third only lettuce and tomatoes, and the fourth only the top slice. If each cook works on their own sandwich at the same time, the shop makes four sandwiches in roughly the time it used to take to make one. That is pipeline parallelism: instead of one big computer doing everything, you cut the work into stations and let each station work on a different piece at the same time. The catch is that at the very start the last cook has nothing to do, and at the very end the first cook is finished before everyone else. Those quiet moments are called "bubbles," and a lot of clever scheduling tries to make them as small as possible.

References

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, M. X., Chen, D., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., & Chen, Z. (2018). "GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism." arXiv:1811.06965. Presented at NeurIPS 2019. https://arxiv.org/abs/1811.06965
Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., & Zaharia, M. (2019). "PipeDream: Generalized Pipeline Parallelism for DNN Training." In *Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19)*. https://www.microsoft.com/en-us/research/publication/pipedream-generalized-pipeline-parallelism-for-dnn-training/
Narayanan, D., Phanishayee, A., Shi, K., Chen, X., & Zaharia, M. (2021). "Memory-Efficient Pipeline-Parallel DNN Training." In *Proceedings of the 38th International Conference on Machine Learning (ICML)*. arXiv:2006.09503. https://arxiv.org/abs/2006.09503
Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., & Zaharia, M. (2021). "Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM." In *Proceedings of SC '21*. https://arxiv.org/abs/2104.04473
Rajbhandari, S., Rasley, J., Ruwase, O., & He, Y. (2020). "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models." In *Proceedings of SC '20*. https://arxiv.org/abs/1910.02054
DeepSpeed Team. "Pipeline Parallelism." DeepSpeed documentation. https://www.deepspeed.ai/tutorials/pipeline/
DeepSpeed Team. "Training Overview and Features." https://www.deepspeed.ai/training/
PyTorch Team. "Pipeline Parallelism (`torch.distributed.pipelining`)." PyTorch documentation. https://docs.pytorch.org/docs/stable/distributed.pipelining.html
PyTorch Team. "Introduction to Distributed Pipeline Parallelism." PyTorch tutorials. https://docs.pytorch.org/tutorials/intermediate/pipelining_tutorial.html
PyTorch project. "PiPPy: Pipeline Parallelism for PyTorch." GitHub repository. https://github.com/pytorch/PiPPy
Korthikanti, V., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., & Catanzaro, B. (2022). "Reducing Activation Recomputation in Large Transformer Models." arXiv:2205.05198. (Introduces sequence parallelism as a complement to tensor parallelism.) https://arxiv.org/abs/2205.05198
Chowdhery, A., Narang, S., Devlin, J., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." arXiv:2204.02311. https://arxiv.org/abs/2204.02311
Meta AI. (2024). "The Llama 3 Herd of Models." arXiv:2407.21783. https://arxiv.org/abs/2407.21783
Pedregosa, F., et al. (2011). "Scikit-learn: Machine Learning in Python." *Journal of Machine Learning Research*, 12, 2825-2830. https://jmlr.org/papers/v12/pedregosa11a.html
scikit-learn developers. "Pipeline." scikit-learn 1.8 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
scikit-learn developers. "Pipelines and composite estimators." scikit-learn user guide. https://scikit-learn.org/stable/modules/compose.html
scikit-learn developers. "ColumnTransformer." scikit-learn 1.8 documentation. https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
TensorFlow Team. "TFX: ML Production Pipelines." https://www.tensorflow.org/tfx
TensorFlow Team. "The TFX User Guide." https://www.tensorflow.org/tfx/guide
TensorFlow Team. "Understanding TFX Pipelines." https://www.tensorflow.org/tfx/guide/understanding_tfx_pipelines
Kubeflow Project. "Kubeflow Pipelines." https://www.kubeflow.org/docs/components/pipelines/
MLflow Project. "MLflow Documentation." https://mlflow.org/docs/latest/index.html
Apache Software Foundation. "Apache Airflow Documentation." https://airflow.apache.org/docs/
Google Cloud. "MLOps: Continuous delivery and automation pipelines in machine learning." Google Cloud Architecture Center. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning
Feast Authors. "Feast: The Open Source Feature Store for Machine Learning." https://feast.dev/
Tecton. "Feature Store for Machine Learning." https://www.tecton.ai/feature-store/
Kakao Brain. "torchgpipe: A GPipe implementation in PyTorch." GitHub repository. https://github.com/kakaobrain/torchgpipe
Smith, S., Patwary, M., Norick, B., et al. (2022). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model." arXiv:2201.11990. https://arxiv.org/abs/2201.11990
BigScience Workshop. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv:2211.05100. https://arxiv.org/abs/2211.05100
Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., & Gibbons, P. B. (2018). "PipeDream: Fast and Efficient Pipeline Parallel DNN Training." arXiv:1806.03377. https://arxiv.org/abs/1806.03377
