See also: Machine learning, MLOps, Model deployment
A pipeline in machine learning is a sequence of data processing steps chained together to form an end-to-end workflow. Each step in the pipeline takes input from the previous step, performs a specific transformation or computation, and passes its output to the next step. Pipelines are fundamental to modern ML practice because they enforce reproducibility, reduce manual errors, and make complex workflows easier to manage, test, and deploy.
The concept of a pipeline draws from software engineering and manufacturing, where assembly lines process raw materials through a series of stations. In ML, the "raw material" is data, and the "stations" include everything from data cleaning to model training to production serving. By formalizing these steps into a pipeline, teams ensure that every run follows the same process and produces consistent, auditable results.
A typical ML pipeline consists of several distinct stages. While the exact steps vary depending on the project, most pipelines include the following components.
The pipeline begins by collecting raw data from one or more sources. These sources can include databases, APIs, data lakes, streaming platforms like Apache Kafka, or flat files such as CSV and Parquet. The ingestion step handles connection logic, schema validation, and initial quality checks to confirm that the incoming data meets expected formats and constraints.
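As a sketch, ingestion with basic schema validation might look like the following, assuming tabular CSV input; the file path, column names, and constraints are illustrative assumptions:

```python
# A minimal ingestion sketch with basic schema and quality checks
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "age", "income", "signup_date"}

def ingest(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, parse_dates=["signup_date"])
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Schema check failed; missing columns: {missing}")
    if (df["age"] < 0).any():
        raise ValueError("Quality check failed: negative ages found")
    return df
```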
Preprocessing is often the most time-consuming stage. It involves cleaning and transforming raw data into a structured format suitable for analysis. Common preprocessing tasks include:

- Handling missing values through imputation or removal
- Removing duplicate records
- Correcting data types and inconsistent formats
- Encoding categorical variables
- Scaling or normalizing numerical features
- Detecting and handling outliers
Preprocessing quality directly affects downstream model performance, making this stage critical to the overall pipeline.
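A minimal preprocessing sketch, continuing with the illustrative columns from the ingestion example above (real pipelines typically chain many more such steps):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                          # remove duplicate records
    df["age"] = df["age"].fillna(df["age"].median())   # impute missing values
    df = df[df["income"].between(0, 1_000_000)]        # drop implausible outliers
    return df
```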
Feature engineering involves creating new input variables or modifying existing ones to improve model predictive power. Examples include generating polynomial features, computing rolling averages, extracting text embeddings, creating interaction terms between variables, and applying domain-specific transformations. Feature engineering often requires deep understanding of the problem domain and is considered one of the most impactful steps in the pipeline.
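For instance, a rolling average and an interaction term could be added with pandas; the window size and column names here are illustrative:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("signup_date")
    # Rolling average over the last 7 rows as a simple trend feature
    df["income_7d_avg"] = df["income"].rolling(window=7, min_periods=1).mean()
    # Interaction term between two existing variables
    df["age_x_income"] = df["age"] * df["income"]
    return df
```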
During training, a machine learning algorithm fits a model to the prepared data. This step involves selecting an appropriate algorithm (linear regression, random forests, neural networks, gradient boosting, and so on), configuring hyperparameters, and running the optimization process. Many pipelines integrate hyperparameter tuning methods such as grid search, random search, or Bayesian optimization at this stage.
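A minimal training sketch with grid search, assuming `X_train` and `y_train` were produced by the earlier stages; the algorithm and parameter grid are illustrative choices:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,           # 5-fold cross-validation for each parameter combination
    scoring="f1",
)
search.fit(X_train, y_train)
model = search.best_estimator_  # best model, refit on all training data
```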
After training, the model is evaluated on a held-out validation or test set using metrics appropriate to the task. For classification problems, common metrics include accuracy, precision, recall, F1 score, and area under the ROC curve. For regression, metrics like mean squared error, mean absolute error, and R-squared are typical. The evaluation stage may also include fairness audits, bias detection, and error analysis to ensure the model behaves as expected across different subgroups.
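For a binary classifier, computing these metrics might look like the following sketch, assuming `model`, `X_test`, and `y_test` come from the preceding stages:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probabilities

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```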
Once a model passes evaluation, it is deployed to a production environment where it can serve predictions. Deployment can take several forms: a REST API endpoint, a batch prediction job, an embedded model within a mobile app, or integration into an existing software system. The deployment stage also involves packaging the model with its dependencies, setting up version control, and configuring scaling and load balancing.
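As one illustration, a REST endpoint could be sketched with FastAPI; the model path, feature names, and framework choice are assumptions, not a prescribed approach:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # serialized artifact from the training pipeline

class Features(BaseModel):
    age: float
    income: float

@app.post("/predict")
def predict(features: Features) -> dict:
    prediction = model.predict([[features.age, features.income]])
    return {"prediction": prediction.tolist()[0]}
```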
After deployment, continuous monitoring tracks model performance, data drift, and system health. Models can degrade over time as the distribution of incoming data shifts away from the training distribution (data drift), or as the relationship between inputs and targets changes (concept drift). Monitoring systems detect these changes and can trigger automated retraining pipelines to keep the model current.
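A simple drift check can compare a live feature sample against a training-time reference, for example with a two-sample Kolmogorov-Smirnov test; the significance threshold here is an illustrative choice:

```python
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference: np.ndarray, live: np.ndarray,
                alpha: float = 0.05) -> bool:
    """Return True if the live feature distribution differs significantly
    from the training-time reference distribution."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha
```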
Machine learning systems typically require two separate but related pipelines.
| Aspect | Training Pipeline | Inference Pipeline |
|---|---|---|
| Purpose | Discover patterns in historical data and produce a trained model | Apply the trained model to new data to generate predictions |
| Execution frequency | Runs periodically (daily, weekly) or on trigger events | Runs continuously in production, often millions of times per day |
| Compute requirements | Requires large datasets, significant GPU/TPU compute, and substantial processing time | Requires less compute; optimized for low latency (milliseconds to seconds) |
| Data volume | Processes entire training datasets (gigabytes to terabytes) | Processes individual requests or small batches |
| Output | A serialized model artifact (weights, parameters) | Predictions, scores, or classifications |
| Learning | The model updates its parameters through backpropagation or other optimization | The model is frozen; no parameter updates occur during inference |
| Infrastructure | Often runs on cloud GPU clusters or dedicated training servers | May run on edge devices, CPUs, or lightweight inference servers |
A well-designed system keeps these two pipelines decoupled but connected through a shared model registry and feature store, ensuring consistency between the features used during training and those used during inference.
The scikit-learn library provides a Pipeline class that is one of the most widely used implementations for building ML pipelines in Python. It chains multiple processing steps (transformers) with a final estimator (classifier or regressor) into a single object that can be fitted and used for predictions.
A scikit-learn Pipeline is constructed from a list of (name, estimator) tuples. All steps except the last must implement a transform() method (making them transformers). The last step can be any estimator. When fit() is called on the pipeline, each transformer in sequence calls fit_transform() on the data, and the final estimator calls fit(). When predict() is called, each transformer calls transform(), and the final estimator calls predict().
```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Explicit construction with named steps
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=10)),
    ('clf', SVC())
])

# Shorthand with auto-generated names
pipe = make_pipeline(StandardScaler(), PCA(n_components=10), SVC())

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```
One of the most important benefits of using a scikit-learn Pipeline is its ability to prevent data leakage. Data leakage occurs when information from the test set (or future data) inadvertently influences the training process, leading to overly optimistic performance estimates that do not generalize to real-world data.
Without a pipeline, a common mistake is to fit a scaler or other transformer on the entire dataset before splitting into training and test sets. This allows statistics from the test set (such as the mean and standard deviation) to "leak" into the training process.
```python
from sklearn.model_selection import train_test_split

# BAD: data leakage - the scaler learns statistics from the test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fits on ALL data, including future test rows
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# GOOD: a pipeline ensures the scaler is fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)          # scaler fits ONLY on X_train
score = pipe.score(X_test, y_test)  # scaler only transforms X_test
```
When used with cross-validation functions like cross_val_score, the pipeline ensures that transformers are fit exclusively on each training fold, and the validation fold is only transformed (never fitted), preserving a clean separation between training and evaluation data.
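Because the whole pipeline is passed to the cross-validation function, each fold re-fits the transformers from scratch:

```python
from sklearn.model_selection import cross_val_score

# Transformers are re-fit on each training fold; validation folds are
# only transformed, so no leakage occurs across folds.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```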
Scikit-learn extends the basic Pipeline with two additional composition tools: ColumnTransformer, which applies different transformers to different subsets of columns, and FeatureUnion, which runs several transformers in parallel and concatenates their outputs. The example below uses ColumnTransformer to scale numerical columns while one-hot encoding categorical ones:
```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Apply different transformers to numerical and categorical columns
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(), ['city', 'gender'])
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', LogisticRegression())
])
```
At the infrastructure level, ML pipelines are often represented as Directed Acyclic Graphs (DAGs). In a DAG, each node represents a processing step and edges define the dependencies between steps. The "acyclic" constraint means there are no circular dependencies, so the pipeline has a clear execution order. DAG-based orchestration allows the system to determine which steps can run in parallel and which must wait for upstream steps to complete.
Pipeline orchestrators are tools that manage the scheduling, execution, monitoring, and error handling of these DAG-based workflows. They handle concerns like retrying failed steps, managing compute resources, passing data between steps, and sending alerts when something goes wrong.
Several frameworks have emerged to support building, orchestrating, and managing ML pipelines at scale.
TFX is Google's open-source, end-to-end platform for deploying production ML pipelines. It is designed specifically for TensorFlow models and provides a set of standardized components:
| TFX Component | Purpose |
|---|---|
| ExampleGen | Ingests and splits data into training and evaluation sets |
| StatisticsGen | Computes descriptive statistics over the dataset |
| SchemaGen | Infers the data schema (types, ranges, domains) |
| ExampleValidator | Detects anomalies and drift in incoming data |
| Transform | Performs feature engineering using TensorFlow Transform |
| Trainer | Trains TensorFlow/Keras models |
| Tuner | Runs hyperparameter tuning |
| Evaluator | Validates model quality against baseline metrics |
| Pusher | Deploys validated models to a serving infrastructure |
TFX pipelines can be orchestrated using Apache Airflow, Apache Beam, or Kubeflow Pipelines. The platform excels in scenarios requiring enterprise-grade reliability and can handle real-time serving at scale.
Kubeflow Pipelines is a Kubernetes-native platform for building and deploying portable ML workflows. Each pipeline step runs in its own container, providing strong isolation and reproducibility. The system uses Argo Workflows under the hood to orchestrate Kubernetes Pods that carry out each step.
Key features include:

- A web-based UI for managing, tracking, and comparing pipeline runs and experiments
- A Python SDK for defining pipelines and components programmatically
- Reusable, containerized components that can be shared across pipelines
- Scheduling of recurring pipeline runs
- Artifact and metadata tracking for lineage and reproducibility
Kubeflow is well suited for organizations that already run Kubernetes and need fine-grained control over resource allocation and scaling.
MLflow is a framework-agnostic, open-source platform that manages the entire ML lifecycle. Unlike TFX and Kubeflow, which focus heavily on pipeline orchestration, MLflow provides a broader set of capabilities:

- MLflow Tracking logs parameters, metrics, and artifacts for each experiment run
- MLflow Projects packages code in a reusable, reproducible format
- MLflow Models defines a standard format for packaging models for diverse serving tools
- MLflow Model Registry stores, versions, and manages models through their lifecycle stages
MLflow supports Python, R, and Java, and integrates with virtually any ML framework. Many teams combine MLflow with a dedicated orchestrator (like Airflow or Kubeflow) to get the best of both worlds: MLflow handles experiment tracking and model management, while the orchestrator handles scheduling and execution.
Apache Airflow is a general-purpose workflow orchestration platform widely adopted for ML pipelines. Pipelines are defined as Python DAGs, and Airflow provides a scheduler, a web UI for monitoring, and a plugin system for integrating with external services.
A typical ML pipeline in Airflow might define tasks for data extraction, preprocessing, model training, evaluation, and deployment, with dependencies ensuring they run in the correct order. Airflow's distributed architecture consists of a scheduler that orchestrates workflow execution, executors responsible for running task instances on workers, and a metadata database that stores state and history.
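A minimal sketch of such a DAG, assuming the Airflow 2.x TaskFlow API; the schedule and task bodies are illustrative placeholders:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def ml_training_pipeline():
    @task
    def extract():
        ...  # pull raw data from the source system

    @task
    def preprocess(raw):
        ...  # clean and transform the raw data

    @task
    def train(features):
        ...  # fit the model and return a reference to the artifact

    @task
    def evaluate(model_ref):
        ...  # compute validation metrics against a baseline

    # Chaining the calls defines the task dependencies
    evaluate(train(preprocess(extract())))

ml_training_pipeline()
```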
Airflow is tool-agnostic and can orchestrate actions across any service with an API, making it a flexible backbone for MLOps workflows. However, Airflow was originally designed for data engineering workflows, so it lacks some ML-specific features (like built-in experiment tracking) that purpose-built ML platforms provide.
| Feature | scikit-learn Pipeline | TFX | Kubeflow Pipelines | MLflow | Apache Airflow |
|---|---|---|---|---|---|
| Primary use case | Single-machine ML workflows | TensorFlow production pipelines | Kubernetes-native ML | Experiment tracking and lifecycle | General workflow orchestration |
| Framework support | Scikit-learn estimators | TensorFlow/Keras | Any (containerized) | Any | Any |
| Orchestration | In-process (Python) | Airflow, Beam, Kubeflow | Argo Workflows on K8s | External (pluggable) | Built-in scheduler |
| Scalability | Single machine | Distributed (Beam) | Kubernetes clusters | Varies by deployment | Distributed workers |
| Data leakage prevention | Built-in (fit/transform) | Manual | Manual | Manual | Manual |
| Model registry | No | Limited | Via integration | Built-in | No |
| Learning curve | Low | High | High | Medium | Medium |
A feature store is a centralized repository that manages feature engineering outputs for ML models. It serves as a single source of truth for feature definitions, ensuring that the same features are used consistently during both training and inference. Without a feature store, teams often end up with duplicated feature computation code and inconsistencies between training and serving environments, a problem known as training-serving skew.
A feature store typically has two storage layers:

- An offline store that holds large volumes of historical feature values and serves them in batch for model training
- An online store that holds the latest feature values and serves them at low latency for real-time inference
Popular feature store implementations include Feast (open source), Tecton, Hopsworks, and built-in feature stores from cloud providers such as Amazon SageMaker Feature Store, Google Vertex AI Feature Store, and Databricks Feature Store.
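With Feast, for example, online retrieval at inference time might look like this sketch; the feature names and entity key are illustrative assumptions:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a Feast feature repository

# Fetch the latest feature values for one entity from the online store
features = store.get_online_features(
    features=["driver_stats:avg_rating", "driver_stats:trips_today"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```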
A model registry is a centralized system for storing, versioning, and managing trained ML models throughout their lifecycle. It acts as the bridge between training pipelines and deployment pipelines, providing a structured way to track which models are in production, which are in staging, and which are experimental.
Key capabilities of a model registry include:

- Versioning every registered model and its associated artifacts
- Stage transitions (for example, from staging to production to archived)
- Lineage and metadata tracking that links a model to the code, data, and parameters that produced it
- Annotations and approval workflows for governance and review
MLflow Model Registry, Amazon SageMaker Model Registry, Google Vertex AI Model Registry, and Azure ML Model Registry are widely used implementations.
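With MLflow, for example, registering and later loading a model might look like this sketch; the run URI, model name, and version are illustrative placeholders:

```python
import mlflow

# Register a model that was logged during a training run
mlflow.register_model(model_uri="runs:/<run_id>/model", name="churn-model")

# Load a specific registered version for inference
loaded = mlflow.pyfunc.load_model("models:/churn-model/1")
```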
Continuous Integration and Continuous Deployment (CI/CD) practices, originally developed for traditional software, have been adapted for ML pipelines. However, ML CI/CD differs in important ways from conventional software CI/CD because in ML, the final product depends on both code and data. A change in either one can alter model behavior.
Google has defined three levels of MLOps maturity that describe how organizations adopt pipeline automation:
| Maturity Level | Description | Characteristics |
|---|---|---|
| Level 0: Manual | All steps performed manually | Data scientists hand off models to engineers; infrequent deployments; no monitoring |
| Level 1: Pipeline Automation | Training pipeline is automated | Continuous training with fresh data; feature stores and metadata management; automated data and model validation |
| Level 2: CI/CD Automation | Full automation of build, test, and deployment | Rapid experimentation; automated testing of data, features, and models; robust deployment strategies |
Monitoring a deployed ML pipeline involves tracking several dimensions beyond traditional software monitoring:

- Data drift: changes in the distribution of input features relative to the training data
- Concept drift: changes in the relationship between inputs and the target variable
- Prediction quality: model performance metrics computed against ground-truth labels as they become available
- Data quality: missing values, schema violations, and anomalous feature values in incoming requests
- System health: latency, throughput, error rates, and resource utilization of the serving infrastructure
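One common drift measure is the Population Stability Index (PSI), sketched below; the binning choice and the conventional alert thresholds (around 0.1 to 0.2) are rules of thumb, not fixed standards:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training-time (expected) and a
    live (actual) feature distribution; higher values indicate more drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) and division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```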
Tools such as Evidently AI, WhyLabs, Arize AI, and Amazon SageMaker Model Monitor provide specialized capabilities for ML monitoring.
Imagine you are making cookies. A machine learning pipeline is like the recipe and all the steps you follow to make the cookies. First, you gather and prepare the ingredients, which is like collecting and cleaning data. Then you mix them together in just the right way, which is like feature engineering. Next, you put the cookies in the oven and bake them, which is like training the model. You taste one cookie to check if they are good, which is like evaluation. Finally, you put all the cookies on a plate for your friends, which is like deployment. If you write down every step and follow it the same way each time, your cookies will always come out right. That is what a pipeline does for machine learning: it writes down all the steps so the computer follows them in the same order every time.