Pipelining
Last reviewed
Apr 30, 2026
Sources
30 citations
Review status
Source-backed
Revision
v4 · 5,301 words
See also: Machine learning terms
Pipelining is a term used in two distinct senses within machine learning and artificial intelligence. The first refers to the orchestration of an end-to-end machine learning workflow as a chained sequence of stages such as data ingestion, preprocessing, feature engineering, training, evaluation, and deployment. The second refers to pipeline parallelism, a strategy used in distributed training of large neural networks where the model itself is partitioned across devices and micro-batches flow through the resulting stages, similar in spirit to instruction pipelining in classical computer architecture.
This article describes both meanings. The first part covers ML workflow pipelining, including the scikit-learn Pipeline class, TensorFlow Extended (TFX), Kubeflow Pipelines, MLflow, Apache Airflow, and other workflow systems. The second part covers pipeline parallelism, including foundational systems such as GPipe and PipeDream, the 1F1B and interleaved 1F1B schedules used in Megatron-LM, DeepSpeed pipeline parallelism, and how pipeline parallelism is combined with data parallelism and tensor parallelism when training large language models such as BLOOM and Megatron-Turing NLG.
The two senses are quite different in scope and audience.
| Sense | Domain | Granularity | Typical users |
|---|---|---|---|
| ML workflow pipelining | MLOps, data engineering | Stages of an ML lifecycle (ingest, train, evaluate, serve) | Data scientists, ML engineers |
| Pipeline parallelism | Distributed training | Layers or sub-modules of a single neural network | Systems researchers, large-scale training engineers |
The two ideas share a metaphor (sequential stages handing data to the next stage) but operate at completely different levels. A workflow pipeline runs once per training job and may take hours or days, with each stage scheduled by an orchestrator. A pipeline-parallel forward pass runs millions of times during a single training job and is scheduled at the level of individual micro-batches by the deep learning runtime.
In the workflow sense, pipelining refers to chaining together the discrete steps of a machine learning workflow, from data preprocessing and feature extraction to model training, validation, and deployment, into a single coherent and reproducible system. It is used to simplify implementation, manage complex projects, support reproducibility, and automate retraining.
Before a machine learning model can be trained, raw data must be transformed into a format the model can process. Common preprocessing steps include scaling numeric features and encoding categorical ones.
These preprocessing tasks are typically combined into a single pipeline so that the same transformations are applied consistently to training, validation, and inference data.
Once the data is preprocessed and features are extracted, the next step is to train a machine learning model. This usually involves fitting the model on training data, tuning its hyperparameters, and evaluating it on held-out data.
Incorporating these steps into a pipeline allows researchers to compare configurations consistently and ensures evaluation is unbiased.
To select the most suitable model and hyperparameters for a given problem, practitioners often use cross-validation. This involves splitting the dataset into multiple subsets, training on a portion of the data, and evaluating on the remainder. The process is repeated multiple times with different splits.
Fitting preprocessing steps inside each cross-validation split, rather than once on the full dataset beforehand, is important to avoid data leakage. If steps such as scaling are fit on the full dataset before the splits are created, statistics from the held-out fold leak into training and bias the score upward. Wrapping the transformers and the predictor in one pipeline and cross-validating that pipeline as a whole ensures that, for each fold, the transformers are fit only on the samples used to train the predictor.
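A minimal sketch of the leak, assuming scikit-learn and NumPy are installed. It swaps in univariate feature selection (SelectKBest) for scaling because the effect is much easier to see: the features and labels are pure noise, so an honest estimate should sit near chance. The data sizes, seed, and k are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))   # pure-noise features
y = rng.integers(0, 2, size=100)   # labels unrelated to X

# Leaky: the selector sees every label, including those of future test folds.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(
    LogisticRegression(max_iter=1000), X_selected, y, cv=5
).mean()

# Honest: the selector is a pipeline step, so it is refit on each training fold.
pipe = make_pipeline(
    SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000)
)
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection before CV: {leaky:.2f}")
print(f"selection inside CV: {honest:.2f}")
```

The only difference between the two estimates is where the selector is fit, which is exactly what wrapping it in the pipeline controls.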
The Pipeline class in sklearn.pipeline is one of the most widely used implementations of ML workflow pipelining. A Pipeline chains a list of named transformers, ending in a final estimator. Every step except the last must implement fit and transform; the final estimator only needs to implement fit, and the pipeline exposes whichever of predict, transform, or fit_predict that estimator provides.
Key features of sklearn.pipeline.Pipeline:
- fit runs every transformer's fit_transform in sequence and then fits the final estimator on the transformed output.
- predict runs every transformer's transform and feeds the result into the final estimator's predict.
- Parameters of any step can be set with the step_name__param_name syntax, allowing grid search across all stages simultaneously.
- Fitted transformers can be cached with the memory argument, which avoids refitting expensive preprocessors during repeated grid searches.
- verbose=True prints elapsed time per step.

A minimal example:
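```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Example data (30 numeric features); any feature matrix and label vector works.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys use the step_name__param_name syntax to reach into each step.
param_grid = {
    "pca__n_components": [5, 10, 20],
    "clf__C": [0.1, 1.0, 10.0],
}

# Each CV split refits the whole pipeline on that split's training fold.
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))
```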
For heterogeneous tabular data, scikit-learn provides two related composers. ColumnTransformer applies different transformers to different columns and concatenates the results, useful when numeric features need scaling while categorical features need encoding. FeatureUnion applies multiple transformers to the same input and concatenates their outputs, useful for combining different feature extraction strategies on the same column. Both compose with Pipeline. The helper make_pipeline constructs a Pipeline with auto-generated step names, and make_column_transformer does the same for ColumnTransformer.
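As a hedged illustration of composing ColumnTransformer with Pipeline, the sketch below scales two numeric columns and one-hot encodes a categorical one. The DataFrame, column names, and labels are invented for the example, and pandas is assumed to be installed alongside scikit-learn.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data invented for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [40_000, 52_000, 88_000, 95_000, 61_000, 45_000],
    "city": ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Nice"],
})
y = [0, 0, 1, 1, 1, 0]

# Numeric columns are scaled; the categorical column is one-hot encoded.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(df, y)

# 2 scaled numeric columns + 3 one-hot city columns
features = model.named_steps["prep"].transform(df)
print(features.shape)
predictions = model.predict(df)
```

Because the ColumnTransformer is a pipeline step, the same column-wise preprocessing is applied automatically at prediction time.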
TensorFlow Extended (TFX) is an end-to-end platform from Google for deploying production ML pipelines. A TFX pipeline is a directed sequence of components that pass artifacts (datasets, schemas, models, evaluations) between steps and record metadata in an ML Metadata store.
Standard TFX components include:
| Component | Purpose |
|---|---|
| ExampleGen | Ingests data from sources such as CSV, TFRecord, BigQuery, or custom inputs and splits it into train and eval sets. |
| StatisticsGen | Computes feature statistics over each split using TensorFlow Data Validation (TFDV). |
| SchemaGen | Infers a schema describing expected feature types, ranges, and presence. |
| ExampleValidator | Detects anomalies and training/serving drift by comparing statistics to the schema. |
| Transform | Performs full-pass preprocessing using TensorFlow Transform (TFT), producing a graph that is reused at serving time to avoid training-serving skew. |
| Trainer | Trains a TensorFlow model with a user-supplied run_fn, optionally on accelerators or distributed clusters. |
| Tuner | Runs hyperparameter search via KerasTuner. |
| Evaluator | Computes sliced metrics with TensorFlow Model Analysis (TFMA) and decides whether the candidate model is good enough to be blessed. |
| InfraValidator | Loads the model in a sandboxed serving binary to confirm it can be served. |
| Pusher | Pushes blessed models to TensorFlow Serving, TFLite, or another deployment target. |
| BulkInferrer | Runs batch inference using a blessed model. |
TFX pipelines can be authored once and executed on multiple orchestrators including Apache Airflow, Apache Beam, Kubeflow Pipelines, and Vertex AI Pipelines, which allows local development on a workstation to be promoted to a managed cloud environment without rewriting the pipeline code.
Kubeflow Pipelines is a Kubernetes-native platform for building and running portable ML workflows. Pipelines are defined in Python using the Kubeflow Pipelines SDK (KFP) and compiled into a static workflow specification that the controller executes by spinning up containers as Kubernetes pods. Each component runs in its own container, which makes language-agnostic steps easy to mix and supports GPU-heavy workloads, parallel hyperparameter search, and large training jobs on shared clusters.
Kubeflow also includes Notebooks for in-cluster JupyterLab, Katib for hyperparameter and neural architecture search, and KServe for model serving. Vertex AI Pipelines on Google Cloud is a managed service that runs KFP-compatible pipelines without requiring users to operate their own Kubernetes cluster.
MLflow, originally developed at Databricks, is an open-source platform for managing the ML lifecycle. It is organized around four components: MLflow Tracking for logging parameters, metrics, and artifacts of each run; MLflow Projects for packaging code in a reproducible format; MLflow Models for a standard model packaging format with multiple flavors (sklearn, PyTorch, TensorFlow, ONNX, custom Python); and MLflow Model Registry for versioning and stage promotion (Staging, Production, Archived). MLflow is often paired with Airflow as a tracking and registry layer in MLOps stacks.
Apache Airflow is a general-purpose workflow orchestrator that defines pipelines as directed acyclic graphs (DAGs) in Python. Although Airflow is not ML-specific, it is widely used in production data and ML stacks because it integrates with most data warehouses, supports rich scheduling, and has mature operators for cloud services. ML teams often use Airflow to orchestrate retraining DAGs, with MLflow handling experiment tracking and a model registry.
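The DAG idea itself can be sketched without any orchestrator. The example below is not Airflow's API: it uses Python's standard-library graphlib to run a hypothetical retraining workflow in dependency order, and the task names are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical retraining workflow: task -> set of tasks it depends on.
dag = {
    "extract": set(),
    "validate": {"extract"},
    "featurize": {"validate"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
    "register": {"evaluate"},
}

def run_dag(dag, tasks):
    """Run each task once, after all of its upstream tasks have run."""
    order = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        order.append(name)
    return order

log = []
# Stand-in task bodies; a real task would read data, train a model, and so on.
tasks = {name: (lambda n=name: log.append(f"ran {n}")) for name in dag}
order = run_dag(dag, tasks)
print(order)
```

A production orchestrator adds what this sketch omits: scheduling, retries, parallel execution of independent tasks, and persistence of run state.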
There is a long tail of frameworks targeting ML workflow pipelining. They differ in scope, orchestration model, and how opinionated they are about MLOps practices.
| Tool | Origin | Strengths | Notes |
|---|---|---|---|
| Kubeflow Pipelines | Google | Kubernetes-native; scales to clusters; ML-specific components | Requires Kubernetes infrastructure |
| TFX | Google | Tight integration with TensorFlow; production-grade components | Best when the model is a TensorFlow model |
| MLflow | Databricks | Tracking, model packaging, registry; tool-agnostic | Lighter than full pipeline systems |
| Apache Airflow | Airbnb | General-purpose DAG orchestration; large ecosystem | Not ML-specific |
| Prefect | Prefect Technologies | Pythonic flows; dynamic DAGs; managed cloud | Successor to Airflow-style orchestration |
| Kedro | QuantumBlack (McKinsey) | Software engineering structure; data catalog; modular pipelines | Not an orchestrator on its own |
| ZenML | ZenML GmbH | Pluggable stack; integrates 50+ MLOps tools; cloud-agnostic | Framework rather than runtime |
| Metaflow | Netflix | Pythonic flows; versioning; AWS-native scaling | Battle-tested at Netflix scale |
| Dagster | Elementl | Asset-based pipelines; type checks; rich UI | Strong data-engineering features |
| TorchX | PyTorch | Job specification for distributed PyTorch training | Lower level than full ML pipelines |
Kedro emphasizes software engineering practices such as modularity, separation of concerns, and a versioned data catalog, but it does not provide orchestration on its own. ZenML is structured around a pluggable stack model in which artifact stores, orchestrators, and deployers are interchangeable plugins. Metaflow, originating at Netflix, is known for its Pythonic decorator-based flows and tight AWS integration. TorchX is a job launcher for distributed PyTorch training and inference rather than a full ML lifecycle pipeline.
A feature store is a data platform that centralizes the definition, storage, serving, and monitoring of features used by ML models. It addresses two recurring problems: training/serving skew, where a feature is computed differently in batch training and online inference, and feature reuse, where similar features are recomputed in many places. A feature store typically exposes an offline store optimized for training (warehouses such as Snowflake, BigQuery, or Redshift) and an online store optimized for low-latency lookups during inference (key-value stores such as Redis, DynamoDB, or Postgres).
Notable systems include Feast, an open-source feature store with a pluggable architecture that was originally developed at Gojek; Tecton, a managed platform that adds real-time streaming feature pipelines and later became a major contributor to Feast; and Hopsworks, an enterprise feature store with strong ties to Apache Hudi. Major cloud vendors offer feature stores as part of their managed ML platforms, including Vertex AI Feature Store, Amazon SageMaker Feature Store, and Databricks Feature Store.
Continuous integration and continuous delivery for ML extends classical software CI/CD with the realities of data and models. Three things must be versioned and tested together: code, data, and models. A typical MLOps pipeline triggers when any of the three changes (a code commit, a new data partition, or a periodic retraining schedule), then runs the workflow pipeline end to end, validates the candidate model against a baseline, and either promotes it to production or rejects it.
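The trigger-and-gate logic described above can be sketched in a few lines. This is an illustrative outline rather than any particular tool's behavior: the function names, the seven-day schedule, and the metric threshold are all hypothetical.

```python
import hashlib

def fingerprint(*parts: bytes) -> str:
    """Hash the inputs that define a training run (here: code and data)."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(hashlib.sha256(part).digest())
    return digest.hexdigest()

def should_retrain(last_fingerprint, code, data, days_since_last_run, max_age_days=7):
    """Trigger on a code change, a data change, or the periodic schedule."""
    changed = fingerprint(code, data) != last_fingerprint
    stale = days_since_last_run >= max_age_days
    return changed or stale

def should_promote(candidate_metric, baseline_metric, min_gain=0.0):
    """Promote only if the candidate beats the baseline by at least min_gain."""
    return candidate_metric >= baseline_metric + min_gain

last = fingerprint(b"train.py@abc123", b"sales/2026-04-29")
new_data = should_retrain(last, b"train.py@abc123", b"sales/2026-04-30", days_since_last_run=1)
unchanged = should_retrain(last, b"train.py@abc123", b"sales/2026-04-29", days_since_last_run=1)
promote = should_promote(candidate_metric=0.91, baseline_metric=0.90, min_gain=0.005)
print(new_data, unchanged, promote)
```

In practice the fingerprint would come from a commit hash and a dataset version, and the gate would compare several sliced metrics rather than a single number.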
Google Cloud's MLOps maturity model describes three levels: level 0 with manual handoffs between data scientists and ML engineers; level 1 with automated retraining pipelines but manual deployment; and level 2 with automated CI/CD/CT (continuous training) pipelines that retrain, validate, and redeploy without human intervention. Tools such as Kubeflow, MLflow, Seldon Core, BentoML, GitHub Actions, and GitLab CI are commonly combined to implement these flows.
ML workflow pipelines are valuable in any setting where models are retrained, evaluated, or redeployed more than once. Recommendation systems at large platforms retrain daily or hourly. Fraud detection systems retrain on new labeled fraud as it arrives. Demand forecasting models for retail and logistics retrain on the latest sales data. Computer vision models for self-driving and robotics retrain whenever new edge cases are collected. In each case, pipelining ensures that the chain from raw data to deployed model can be replayed reliably.
The second meaning of pipelining concerns how a single neural network is trained across multiple accelerators. Pipeline parallelism is a form of model parallelism in which the layers of a model are partitioned into a sequence of stages, each placed on a separate device, and a mini-batch of training data is split into smaller micro-batches that flow through the stages in pipelined fashion, similar to instruction pipelining in a CPU.
When a model has too many parameters to fit on a single accelerator, the choices are roughly: shard parameters across devices (data-parallel sharding such as ZeRO or FSDP), split each layer's tensors across devices (tensor parallelism), or split the model into vertical stages and pipeline micro-batches through them. In practice, all three are used together for the largest models.
A naive form of model parallelism, where layer 1 runs on device 1, then sends its activations to device 2 for layer 2, and so on, leaves all but one device idle at any moment. With four stages that each take time t, a single forward pass takes 4t and the average device utilization is only 25%. Pipeline parallelism fixes this by splitting the mini-batch into many micro-batches so that, in the steady state, all stages are busy on different micro-batches at the same time.
The efficiency gain is not free. Compared to a non-pipelined run, every pipelined run incurs bubble overhead: idle time at the beginning while the pipeline fills, and at the end while it drains. For a synchronous pipeline with p stages and m micro-batches, the fraction of time spent in the bubble is approximately:
bubble_fraction ≈ (p - 1) / (m + p - 1)
This follows from counting time slots on one device. If a forward step and a backward step on one micro-batch each take one unit of stage time, every device performs 2m units of useful work and waits for 2(p - 1) units while the pipeline fills and drains, so one training step spans 2(m + p - 1) units and the idle share is 2(p - 1) / 2(m + p - 1). Increasing the number of micro-batches m shrinks the bubble fraction, but in a GPipe-style schedule this comes at the cost of more activation memory because more in-flight activations must be retained for the backward pass.
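The formula can be checked against a small slot-counting simulation. The sketch below assumes unit-time forward and backward steps and an all-forwards-then-all-backwards schedule; the function names are illustrative.

```python
def bubble_fraction(p: int, m: int) -> float:
    """Closed-form idle share for p stages and m micro-batches."""
    return (p - 1) / (m + p - 1)

def simulated_idle_fraction(p: int, m: int) -> float:
    """Count idle slots in a forward-then-backward pipeline with unit-time steps."""
    half = m + p - 1                  # slots in the forward (or backward) phase
    total_slots = 2 * half
    busy = [[False] * total_slots for _ in range(p)]
    for i in range(m):
        for s in range(p):
            busy[s][i + s] = True                     # forward of micro-batch i on stage s
            busy[s][half + i + (p - 1 - s)] = True    # backward flows from the last stage back
    idle = sum(row.count(False) for row in busy)
    return idle / (p * total_slots)

for m in (1, 4, 8, 32):
    print(m, round(bubble_fraction(4, m), 4), round(simulated_idle_fraction(4, m), 4))
```

With p = 4 and m = 1 the idle share is 0.75, which matches the 25% utilization of naive model parallelism; at m = 32 it falls below 9%.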
GPipe is a pipeline parallelism library introduced by Yanping Huang and colleagues at Google in the 2018 arXiv paper GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, which appeared at NeurIPS 2019. GPipe partitions a sequential network into cells placed on separate accelerators and uses a synchronous batch-splitting algorithm: a mini-batch is divided into micro-batches, all forward passes for the mini-batch are run through the pipeline, then all backward passes, and finally a single synchronous parameter update.
GPipe demonstrated near-linear speedup on two flagship workloads: a 557-million-parameter AmoebaNet vision model that achieved 84.4% top-1 accuracy on ImageNet-2012, and a single 6-billion-parameter, 128-layer Transformer trained for multilingual neural machine translation across more than 100 languages. To reduce activation memory, GPipe uses rematerialization (also known as gradient checkpointing): only the inputs to each stage are stored during the forward pass, and intermediate activations are recomputed during the backward pass.
A simplified pseudocode view of GPipe-style training:
```
# Partition model into p stages: stage[0..p-1]
# Split mini-batch X into m micro-batches: x[0..m-1]

# Forward pass
for i in 0..m-1:
    a = x[i]
    for s in 0..p-1:
        a = stage[s].forward(a)      # runs on device s
    activations[i] = a
    loss[i] = loss_fn(a, y[i])

# Backward pass (after all forwards complete)
for i in m-1..0:
    g = grad_loss(loss[i])
    for s in p-1..0:
        g = stage[s].backward(g)     # runs on device s

# Single synchronous optimizer step
optimizer.step()
```
The scheme is synchronous and produces gradients identical to a non-pipelined run on the same global batch. Open implementations include torchgpipe from Kakao Brain, which was later folded into PyTorch.
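The equivalence claim can be checked numerically. The sketch below is a single-process NumPy toy, not GPipe itself: the stages are plain objects rather than devices, the tanh layers and squared-error loss are invented for illustration, and activations are stored rather than rematerialized.

```python
import numpy as np

class Stage:
    """One pipeline stage: a tanh layer that saves what backward needs per micro-batch."""
    def __init__(self, rng, n_in, n_out):
        self.W = rng.normal(scale=0.5, size=(n_in, n_out))
        self.grad_W = np.zeros_like(self.W)
        self.saved = {}

    def forward(self, x, mb):
        y = np.tanh(x @ self.W)
        self.saved[mb] = (x, y)
        return y

    def backward(self, grad_y, mb):
        x, y = self.saved.pop(mb)
        grad_pre = grad_y * (1.0 - y ** 2)   # derivative of tanh
        self.grad_W += x.T @ grad_pre        # accumulate across micro-batches
        return grad_pre @ self.W.T

def gpipe_gradients(stages, X, Y, m):
    """All forwards, then all backwards, accumulating gradients over m micro-batches."""
    for s in stages:
        s.grad_W[:] = 0.0
    xs, ys = np.array_split(X, m), np.array_split(Y, m)
    outputs = []
    for i in range(m):                       # forward phase
        a = xs[i]
        for s in stages:
            a = s.forward(a, i)
        outputs.append(a)
    for i in reversed(range(m)):             # backward phase
        # Squared error, averaged over the full mini-batch rather than the micro-batch.
        g = 2.0 * (outputs[i] - ys[i]) / len(X)
        for s in reversed(stages):
            g = s.backward(g, i)
    return [s.grad_W.copy() for s in stages]

rng = np.random.default_rng(0)
stages = [Stage(rng, 8, 16), Stage(rng, 16, 16), Stage(rng, 16, 4)]
X, Y = rng.normal(size=(32, 8)), rng.normal(size=(32, 4))

full = gpipe_gradients(stages, X, Y, m=1)    # non-pipelined reference
micro = gpipe_gradients(stages, X, Y, m=4)   # four micro-batches
gradients_match = all(np.allclose(a, b) for a, b in zip(full, micro))
print(gradients_match)
```

The gradients agree because the loss is normalized by the full mini-batch size and the weights stay fixed until every micro-batch has contributed, which is the role of the single synchronous optimizer step.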
PipeDream: Generalized Pipeline Parallelism for DNN Training by Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons, and Matei Zaharia, published at SOSP 2019, generalized pipeline parallelism in two important ways. First, it introduced the 1F1B (one-forward-one-backward) schedule, in which each worker alternates between forward and backward passes after the pipeline reaches steady state, instead of running all forward passes first as in GPipe. This significantly reduces the number of in-flight activations a worker must retain, because each forward result is consumed by a backward pass much sooner.
Second, PipeDream introduced weight stashing: because forward passes for different micro-batches may run against different versions of the parameters, the runtime stores the version used during a micro-batch's forward pass and reuses it during the corresponding backward pass to keep gradients numerically consistent. Combined with 1F1B, this enables asynchronous pipeline parallelism that avoids the synchronous flush at the end of each mini-batch.
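A toy sketch of weight stashing, assuming a one-parameter stage that computes y = w * x. The class and its method names are hypothetical and only illustrate the bookkeeping: the weight seen in a micro-batch's forward pass is saved and reused in its backward pass, even if the live weight has been updated in between.

```python
class StashingStage:
    """Toy stage y = w * x that stashes the weight used in each forward pass."""
    def __init__(self, w: float, lr: float):
        self.w = w
        self.lr = lr
        self.stash = {}  # micro-batch id -> (weight used in forward, input)

    def forward(self, mb_id: int, x: float) -> float:
        self.stash[mb_id] = (self.w, x)
        return self.w * x

    def backward(self, mb_id: int, grad_y: float) -> float:
        w_used, x = self.stash.pop(mb_id)
        grad_w = grad_y * x
        grad_x = grad_y * w_used      # stashed weight, not the current one
        self.w -= self.lr * grad_w    # asynchronous update, no pipeline flush
        return grad_x

stage = StashingStage(w=2.0, lr=0.1)
stage.forward(0, x=1.0)
stage.forward(1, x=3.0)                 # also sees w = 2.0
stage.backward(0, grad_y=1.0)           # live weight moves to 1.9
grad_x = stage.backward(1, grad_y=1.0)  # still uses the stashed w = 2.0
print(stage.w, grad_x)
```

Without the stash, the second backward pass would multiply by 1.9 while its forward pass had used 2.0, so the gradient sent upstream would not correspond to the computation that actually produced the activation.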
PipeDream reported up to 5.3x speedup over data parallelism for several deep models on small clusters where communication was a bottleneck. Its main downsides are extra memory for stashed weights and slightly different optimization semantics from data parallelism.
Memory-Efficient Pipeline-Parallel DNN Training (Narayanan et al., ICML 2021) introduced PipeDream-2BW, which uses a double-buffered weight update scheme. Instead of stashing one weight version per in-flight micro-batch, PipeDream-2BW maintains exactly two weight versions per worker: a current version used by newly admitted micro-batches and a shadow version used by micro-batches still in flight. Gradients are accumulated across a configurable number of micro-batches before each weight update, which preserves data-parallel-like semantics while keeping memory bounded. The paper reports end-to-end speedups of 1.3x to 20x for various GPT models versus an optimized model-parallel baseline, and up to 3.2x faster than GPipe.
A related variant, PipeDream-Flush, retains the 1F1B schedule but inserts periodic pipeline flushes so that all workers use the same weight version, recovering bit-for-bit data-parallel semantics at the cost of reintroducing a small bubble per flush.
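The flushed 1F1B ordering can be written down per stage. The sketch below assumes the common formulation in which stage s runs p - s - 1 warm-up forwards, alternates one forward with one backward, and then drains; the function names are illustrative, and timing between stages is not modeled.

```python
def one_f_one_b(stage: int, p: int, m: int) -> list[str]:
    """Order of forward (F) and backward (B) micro-batch steps on one stage."""
    warmup = min(p - stage - 1, m)
    steps = [f"F{i}" for i in range(warmup)]
    for i in range(m - warmup):                       # steady state
        steps += [f"F{warmup + i}", f"B{i}"]
    steps += [f"B{i}" for i in range(m - warmup, m)]  # cool-down
    return steps

def peak_in_flight(steps: list[str]) -> int:
    """Largest number of micro-batches whose activations are held at once."""
    live = peak = 0
    for step in steps:
        live += 1 if step.startswith("F") else -1
        peak = max(peak, live)
    return peak

p, m = 4, 8
schedules = [one_f_one_b(stage, p, m) for stage in range(p)]
for stage, steps in enumerate(schedules):
    print(stage, " ".join(steps), "peak:", peak_in_flight(steps))

# GPipe-style ordering for comparison: every forward before any backward.
gpipe = [f"F{i}" for i in range(m)] + [f"B{i}" for i in reversed(range(m))]
print("GPipe peak:", peak_in_flight(gpipe))
```

The first stage holds at most p micro-batches of activations under 1F1B, compared with all m under the GPipe ordering, which is why the memory saving grows with the number of micro-batches.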
Megatron-LM, NVIDIA's framework for large transformer training described in Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (Narayanan et al., SC 2021), introduced the interleaved 1F1B schedule. Instead of giving each device a single contiguous block of layers, the model is divided into more chunks than there are devices and the chunks are assigned in an interleaved (round-robin) pattern. With four devices and eight layers, for example, device 0 gets layers 0 and 4, device 1 gets layers 1 and 5, device 2 gets layers 2 and 6, and device 3 gets layers 3 and 7.
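The round-robin assignment can be sketched as a small helper. The function name and signature are illustrative, not Megatron-LM's API, and the sketch assumes the layer count divides evenly into chunks.

```python
def interleaved_assignment(
    num_layers: int, num_devices: int, chunks_per_device: int
) -> dict[int, list[int]]:
    """Assign contiguous layer chunks to devices in round-robin order."""
    num_chunks = num_devices * chunks_per_device
    if num_layers % num_chunks:
        raise ValueError("num_layers must be divisible by num_devices * chunks_per_device")
    layers_per_chunk = num_layers // num_chunks
    assignment = {device: [] for device in range(num_devices)}
    for chunk in range(num_chunks):
        first = chunk * layers_per_chunk
        assignment[chunk % num_devices].extend(range(first, first + layers_per_chunk))
    return assignment

print(interleaved_assignment(8, 4, 2))
# {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
print(interleaved_assignment(16, 4, 2))
```

With 16 layers the chunks hold two layers each, so device 0 receives layers 0, 1, 8, and 9.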
The interleaved schedule shrinks the pipeline bubble by roughly a factor of v, the number of model chunks per device, relative to the non-interleaved 1F1B schedule used by PipeDream-Flush, at the cost of proportionally more point-to-point communication between stages. Like 1F1B, it keeps far fewer micro-batches' activations in flight than GPipe when there are many micro-batches. Megatron-LM combines interleaved 1F1B with tensor parallelism inside each transformer layer and data parallelism across replicas to scale to thousands of GPUs.
DeepSpeed, Microsoft's deep learning optimization library, implements pipeline parallelism using gradient accumulation across micro-batches and integrates it with two other strategies. ZeRO (Zero Redundancy Optimizer) is a memory-efficient form of data parallelism that partitions optimizer state (Stage 1), gradients (Stage 2), and parameters (Stage 3) across data-parallel ranks instead of replicating them on every rank, giving access to the aggregate GPU memory of the cluster. 3D parallelism combines ZeRO-style data parallelism, pipeline parallelism, and tensor parallelism (typically borrowed from Megatron-LM) to scale models past one trillion parameters. The Megatron-DeepSpeed integration was used to train models such as BLOOM-176B and Megatron-Turing NLG 530B.
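The effect of the three ZeRO stages on memory can be estimated with simple arithmetic. The sketch below assumes the accounting commonly used for mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state) and ignores activations and buffers; the function name is illustrative.

```python
def zero_bytes_per_gpu(num_params: float, dp_degree: int, stage: int) -> float:
    """Model-state bytes per GPU under ZeRO stage 0 (plain data parallelism) to 3."""
    params = 2 * num_params    # fp16 weights (assumed)
    grads = 2 * num_params     # fp16 gradients (assumed)
    optim = 12 * num_params    # fp32 master weights, momentum, variance (assumed)
    if stage >= 1:
        optim /= dp_degree     # Stage 1 partitions optimizer state
    if stage >= 2:
        grads /= dp_degree     # Stage 2 also partitions gradients
    if stage >= 3:
        params /= dp_degree    # Stage 3 also partitions parameters
    return params + grads + optim

for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB")
```

For a 7.5-billion-parameter model on 64 data-parallel ranks, this estimate falls from 120 GB per GPU with full replication to under 2 GB with Stage 3.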
PyTorch initially shipped a pipeline parallelism API as torch.distributed.pipeline.sync, derived from torchgpipe. A separate research-oriented project, PiPPy (Pipeline Parallelism for PyTorch), provided more flexible model splitting via torch.fx and supported asynchronous schedules. PiPPy has since been folded into PyTorch as torch.distributed.pipelining, which is the recommended API as of PyTorch 2.x. The new package handles automatic model splitting into stages, manages micro-batch communication, and provides multiple schedule implementations including ScheduleGPipe, Schedule1F1B, ScheduleInterleaved1F1B, and ScheduleLoopedBFS. It composes with DDP, FSDP, and tensor parallelism, allowing PyTorch users to assemble 3D parallelism in pure PyTorch.
| Schedule | First proposed by | Synchronous? | Bubble fraction | Activation memory | Notes |
|---|---|---|---|---|---|
| GPipe | Huang et al., 2018 | Yes | (p-1)/(m+p-1) | High; stashes activations for all m micro-batches | Bit-for-bit equivalent to non-pipelined SGD; uses rematerialization |
| PipeDream (async) | Narayanan et al., SOSP 2019 | No | Negligible in steady state | Lower; 1F1B reduces in-flight activations | Stashes multiple weight versions; weight-update semantics differ |
| PipeDream-Flush (1F1B) | Narayanan et al., 2020 | Yes | Same as GPipe | Lower than GPipe (1F1B shape) | Flushes periodically to keep one weight version |
| PipeDream-2BW | Narayanan et al., ICML 2021 | Approximately | Negligible in steady state | Bounded; only two weight versions | Designed for very large models like GPT-3 |
| Interleaved 1F1B | Narayanan et al., SC 2021 (Megatron-LM) | Yes | Smaller than 1F1B by factor of v (chunks per device) | Slightly higher than 1F1B | Standard for modern LLM training |
Pipeline parallelism is one of several axes along which a training job can be parallelized. Modern large-model training combines several at once.
| Strategy | What is split | Communication pattern | Typical scale |
|---|---|---|---|
| Data parallelism | The mini-batch (each replica trains on a different shard) | All-reduce of gradients | Tens to thousands of devices |
| Tensor parallelism | Individual weight matrices within a layer | All-reduce of partial activations and gradients within each layer | Within a node, typically 4 or 8 GPUs |
| Pipeline parallelism | Layers (or contiguous groups of layers) into stages | Point-to-point activation and gradient sends between adjacent stages | Across nodes, typically 4 to 64 stages |
| Sequence parallelism | The sequence dimension within selected operators | Reduce-scatter and all-gather along the sequence | Combined with tensor parallelism |
| Expert parallelism (MoE) | Different experts of a Mixture of Experts layer | All-to-all of tokens to experts | Across nodes |
In modern LLM stacks, the typical configuration is tensor parallelism within a server (where bandwidth is highest), pipeline parallelism across servers within a rack or pod, and data parallelism across racks. Sequence parallelism is layered on top of tensor parallelism to reduce activation memory in LayerNorm and dropout.
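One way to picture this layout is as a mapping from a global rank to a position along each axis. The sketch below assumes a hypothetical rank ordering in which tensor parallelism varies fastest, so consecutive ranks on one server form a tensor-parallel group; real frameworks may order the axes differently.

```python
def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp, pp, tp) indices, with tensor parallelism varying fastest."""
    if not 0 <= rank < tp * pp * dp:
        raise ValueError("rank out of range")
    tp_idx = rank % tp
    pp_idx = (rank // tp) % pp
    dp_idx = rank // (tp * pp)
    return dp_idx, pp_idx, tp_idx

tp, pp, dp = 8, 4, 2   # 64 GPUs, assuming 8 per server
for rank in (0, 7, 8, 31, 32, 63):
    print(rank, rank_to_coords(rank, tp, pp, dp))
```

Ranks 0 to 7 share a pipeline stage and a data-parallel replica and differ only in their tensor-parallel index, rank 8 starts the next pipeline stage, and rank 32 starts the second data-parallel replica.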
Pipeline parallelism is a standard ingredient in training large language models, although the exact configuration varies by system.
Pipeline parallelism is not free. Raising the number of micro-batches m to shrink the bubble competes with memory budgets unless rematerialization or 1F1B-style schedules are used.
The concept of pipelining originates from computer architecture. Pipelined CPUs date back to the IBM 7030 Stretch (1961) and were popularized by RISC processors in the 1980s. Pipelining as a general principle for chaining computational stages was applied to data processing systems for decades before it appeared in machine learning frameworks.
In the ML ecosystem, scikit-learn introduced its Pipeline class in 2010, providing a clean composition primitive for sequential transformers and a final estimator. TFX evolved at Google starting around 2017 from internal systems used by Google Search, YouTube, and Google Play, and was released as open source in 2019. Kubeflow Pipelines was announced in 2018 and MLflow in June 2018. Apache Airflow predates these ML-specific tools by several years (Airbnb open-sourced it in 2015).
Pipeline parallelism in the deep learning sense traces to GPipe (Huang et al., 2018) and PipeDream (Harlap et al., 2018; Narayanan et al., SOSP 2019). The 1F1B schedule popularized by PipeDream and the interleaved 1F1B schedule introduced in Megatron-LM (Narayanan et al., 2021) became standard ingredients in large language model training systems by 2022, when models such as GPT-3 successors, BLOOM, and Megatron-Turing NLG were being trained at scale. PyTorch's official pipeline parallelism API matured from torchgpipe and torch.distributed.pipeline through PiPPy into torch.distributed.pipelining over 2020 to 2024.
Imagine making a sandwich with several steps in order: take out the bread, put on some mayo, add lettuce and tomatoes, then close it with another slice of bread. ML workflow pipelining is like writing down those steps so that anyone (including a computer) can follow them in the same order every time, for every sandwich.
Now imagine a sandwich shop with four cooks standing in a line. The first cook only does bread, the second only mayo, the third only lettuce and tomatoes, and the fourth only the top slice. If each cook works on their own sandwich at the same time, the shop makes four sandwiches in roughly the time it used to take to make one. That is pipeline parallelism: instead of one big computer doing everything, you cut the work into stations and let each station work on a different piece at the same time. The catch is that at the very start the last cook has nothing to do, and at the very end the first cook is finished before everyone else. Those quiet moments are called "bubbles," and a lot of clever scheduling tries to make them as small as possible.