See also: Machine learning, MLOps, Model deployment
A pipeline in machine learning is a sequence of data processing steps chained together to form an end-to-end workflow. Each step in the pipeline takes input from the previous step, performs a specific transformation or computation, and passes its output to the next step. Pipelines are fundamental to modern ML practice because they enforce reproducibility, reduce manual errors, and make complex workflows easier to manage, test, and deploy.
The concept of a pipeline draws from software engineering and manufacturing, where assembly lines process raw materials through a series of stations. In ML, the "raw material" is data, and the "stations" include everything from data cleaning to model training to production serving. By formalizing these steps into a pipeline, teams ensure that every run follows the same process and produces consistent, auditable results.
A typical ML pipeline consists of several distinct stages. While the exact steps vary depending on the project, most pipelines include the following components.
The pipeline begins by collecting raw data from one or more sources. These sources can include databases, APIs, data lakes, streaming platforms like Apache Kafka, or flat files such as CSV and Parquet. The ingestion step handles connection logic, schema validation, and initial quality checks to confirm that the incoming data meets expected formats and constraints.
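As a sketch, ingestion with basic schema validation might look like the following, assuming tabular CSV input; the file path, column names, and constraints are illustrative assumptions:

```python
# A minimal ingestion sketch with basic schema and quality checks
import pandas as pd

EXPECTED_COLUMNS = {"user_id", "age", "income", "signup_date"}

def ingest(path: str) -> pd.DataFrame:
    df = pd.read_csv(path, parse_dates=["signup_date"])
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Schema check failed; missing columns: {missing}")
    if (df["age"] < 0).any():
        raise ValueError("Quality check failed: negative ages found")
    return df
```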
Preprocessing is often the most time-consuming stage. It involves cleaning and transforming raw data into a structured format suitable for analysis. Common preprocessing tasks include:

- Handling missing values through imputation or removal
- Removing duplicate records
- Correcting data types and inconsistent formats
- Encoding categorical variables
- Scaling or normalizing numerical features
- Detecting and handling outliers
Preprocessing quality directly affects downstream model performance, making this stage critical to the overall pipeline.
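A minimal preprocessing sketch, continuing with the illustrative columns from the ingestion example above (real pipelines typically chain many more such steps):

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates()                          # remove duplicate records
    df["age"] = df["age"].fillna(df["age"].median())   # impute missing values
    df = df[df["income"].between(0, 1_000_000)]        # drop implausible outliers
    return df
```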
Feature engineering involves creating new input variables or modifying existing ones to improve model predictive power. Examples include generating polynomial features, computing rolling averages, extracting text embeddings, creating interaction terms between variables, and applying domain-specific transformations. Feature engineering often requires deep understanding of the problem domain and is considered one of the most impactful steps in the pipeline.
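For instance, a rolling average and an interaction term could be added with pandas; the window size and column names here are illustrative:

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.sort_values("signup_date")
    # Rolling average over the last 7 rows as a simple trend feature
    df["income_7d_avg"] = df["income"].rolling(window=7, min_periods=1).mean()
    # Interaction term between two existing variables
    df["age_x_income"] = df["age"] * df["income"]
    return df
```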
During training, a machine learning algorithm fits a model to the prepared data. This step involves selecting an appropriate algorithm (linear regression, random forests, neural networks, gradient boosting, and so on), configuring hyperparameters, and running the optimization process. Many pipelines integrate hyperparameter tuning methods such as grid search, random search, or Bayesian optimization at this stage.
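A minimal training sketch with grid search, assuming `X_train` and `y_train` were produced by the earlier stages; the algorithm and parameter grid are illustrative choices:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,           # 5-fold cross-validation for each parameter combination
    scoring="f1",
)
search.fit(X_train, y_train)
model = search.best_estimator_  # best model, refit on all training data
```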
After training, the model is evaluated on a held-out validation or test set using metrics appropriate to the task. For classification problems, common metrics include accuracy, precision, recall, F1 score, and area under the ROC curve. For regression, metrics like mean squared error, mean absolute error, and R-squared are typical. The evaluation stage may also include fairness audits, bias detection, and error analysis to ensure the model behaves as expected across different subgroups.
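For a binary classifier, computing these metrics might look like the following sketch, assuming `model`, `X_test`, and `y_test` come from the preceding stages:

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probabilities

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```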
Once a model passes evaluation, it is deployed to a production environment where it can serve predictions. Deployment can take several forms: a REST API endpoint, a batch prediction job, an embedded model within a mobile app, or integration into an existing software system. The deployment stage also involves packaging the model with its dependencies, setting up version control, and configuring scaling and load balancing.
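As one illustration, a REST endpoint could be sketched with FastAPI; the model path, feature names, and framework choice are assumptions, not a prescribed approach:

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # serialized artifact from the training pipeline

class Features(BaseModel):
    age: float
    income: float

@app.post("/predict")
def predict(features: Features) -> dict:
    prediction = model.predict([[features.age, features.income]])
    return {"prediction": prediction.tolist()[0]}
```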
After deployment, continuous monitoring tracks model performance, data drift, and system health. Models can degrade over time as the distribution of incoming data shifts away from the training distribution (data drift), or as the relationship between inputs and targets changes (concept drift). Monitoring systems detect these changes and can trigger automated retraining pipelines to keep the model current.
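A simple drift check can compare a live feature sample against a training-time reference, for example with a two-sample Kolmogorov-Smirnov test; the significance threshold here is an illustrative choice:

```python
import numpy as np
from scipy.stats import ks_2samp

def has_drifted(reference: np.ndarray, live: np.ndarray,
                alpha: float = 0.05) -> bool:
    """Return True if the live feature distribution differs significantly
    from the training-time reference distribution."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha
```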
Machine learning systems typically require two separate but related pipelines.
| Aspect | Training Pipeline | Inference Pipeline |
|---|---|---|
| Purpose | Discover patterns in historical data and produce a trained model | Apply the trained model to new data to generate predictions |
| Execution frequency | Runs periodically (daily, weekly) or on trigger events | Runs continuously in production, often millions of times per day |
| Compute requirements | Requires large datasets, significant GPU/TPU compute, and substantial processing time | Requires less compute; optimized for low latency (milliseconds to seconds) |
| Data volume | Processes entire training datasets (gigabytes to terabytes) | Processes individual requests or small batches |
| Output | A serialized model artifact (weights, parameters) | Predictions, scores, or classifications |
| Learning | The model updates its parameters through backpropagation or other optimization | The model is frozen; no parameter updates occur during inference |
| Infrastructure | Often runs on cloud GPU clusters or dedicated training servers | May run on edge devices, CPUs, or lightweight inference servers |
A well-designed system keeps these two pipelines decoupled but connected through a shared model registry and feature store, ensuring consistency between the features used during training and those used during inference.
The scikit-learn library provides a Pipeline class that is one of the most widely used implementations for building ML pipelines in Python. It chains multiple processing steps (transformers) with a final estimator (classifier or regressor) into a single object that can be fitted and used for predictions.
A scikit-learn Pipeline is constructed from a list of (name, estimator) tuples. All steps except the last must implement a transform() method (making them transformers). The last step can be any estimator. When fit() is called on the pipeline, each transformer in sequence calls fit_transform() on the data, and the final estimator calls fit(). When predict() is called, each transformer calls transform(), and the final estimator calls predict().
```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Explicit construction with named steps
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=10)),
    ('clf', SVC())
])

# Shorthand with auto-generated names
pipe = make_pipeline(StandardScaler(), PCA(n_components=10), SVC())

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```
One of the most important benefits of using a scikit-learn Pipeline is its ability to prevent data leakage. Data leakage occurs when information from the test set (or future data) inadvertently influences the training process, leading to overly optimistic performance estimates that do not generalize to real-world data.
Without a pipeline, a common mistake is to fit a scaler or other transformer on the entire dataset before splitting into training and test sets. This allows statistics from the test set (such as the mean and standard deviation) to "leak" into the training process.
```python
from sklearn.model_selection import train_test_split

# BAD: data leakage - the scaler learns statistics from the test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # fits on ALL data, including future test rows
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# GOOD: a pipeline ensures the scaler is fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)          # scaler fits ONLY on X_train
score = pipe.score(X_test, y_test)  # scaler only transforms X_test
```
When used with cross-validation functions like cross_val_score, the pipeline ensures that transformers are fit exclusively on each training fold, and the validation fold is only transformed (never fitted), preserving a clean separation between training and evaluation data.
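Because the whole pipeline is passed to the cross-validation function, each fold re-fits the transformers from scratch:

```python
from sklearn.model_selection import cross_val_score

# Transformers are re-fit on each training fold; validation folds are
# only transformed, so no leakage occurs across folds.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```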
Scikit-learn extends the basic Pipeline with two additional composition tools: ColumnTransformer, which applies different transformers to different subsets of columns, and FeatureUnion, which runs several transformers in parallel and concatenates their outputs. The example below uses ColumnTransformer to scale numerical columns while one-hot encoding categorical ones:
```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Apply different transformers to numerical and categorical columns
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(), ['city', 'gender'])
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', LogisticRegression())
])
```
At the infrastructure level, ML pipelines are often represented as Directed Acyclic Graphs (DAGs). In a DAG, each node represents a processing step and edges define the dependencies between steps. The "acyclic" constraint means there are no circular dependencies, so the pipeline has a clear execution order. DAG-based orchestration allows the system to determine which steps can run in parallel and which must wait for upstream steps to complete.
Pipeline orchestrators are tools that manage the scheduling, execution, monitoring, and error handling of these DAG-based workflows. They handle concerns like retrying failed steps, managing compute resources, passing data between steps, and sending alerts when something goes wrong.
Several frameworks have emerged to support building, orchestrating, and managing ML pipelines at scale.
TFX is Google's open-source, end-to-end platform for deploying production ML pipelines. It is designed specifically for TensorFlow models and provides a set of standardized components:
| TFX Component | Purpose |
|---|---|
| ExampleGen | Ingests and splits data into training and evaluation sets |
| StatisticsGen | Computes descriptive statistics over the dataset |
| SchemaGen | Infers the data schema (types, ranges, domains) |
| ExampleValidator | Detects anomalies and drift in incoming data |
| Transform | Performs feature engineering using TensorFlow Transform |
| Trainer | Trains TensorFlow/Keras models |
| Tuner | Runs hyperparameter tuning |
| Evaluator | Validates model quality against baseline metrics |
| Pusher | Deploys validated models to a serving infrastructure |
TFX pipelines can be orchestrated using Apache Airflow, Apache Beam, or Kubeflow Pipelines. The platform excels in scenarios requiring enterprise-grade reliability and can handle real-time serving at scale.
Kubeflow Pipelines is a Kubernetes-native platform for building and deploying portable ML workflows. Each pipeline step runs in its own container, providing strong isolation and reproducibility. The system uses Argo Workflows under the hood to orchestrate Kubernetes Pods that carry out each step.
Key features include:

- A web-based UI for managing, tracking, and comparing pipeline runs and experiments
- A Python SDK for defining pipelines and components programmatically
- Reusable, containerized components that can be shared across pipelines
- Scheduling of recurring pipeline runs
- Artifact and metadata tracking for lineage and reproducibility
Kubeflow is well suited for organizations that already run Kubernetes and need fine-grained control over resource allocation and scaling.
MLflow is a framework-agnostic, open-source platform that manages the entire ML lifecycle. Unlike TFX and Kubeflow, which focus heavily on pipeline orchestration, MLflow provides a broader set of capabilities:

- MLflow Tracking logs parameters, metrics, and artifacts for each experiment run
- MLflow Projects packages code in a reusable, reproducible format
- MLflow Models defines a standard format for packaging models for diverse serving tools
- MLflow Model Registry stores, versions, and manages models through their lifecycle stages
MLflow supports Python, R, and Java, and integrates with virtually any ML framework. Many teams combine MLflow with a dedicated orchestrator (like Airflow or Kubeflow) to get the best of both worlds: MLflow handles experiment tracking and model management, while the orchestrator handles scheduling and execution.
Apache Airflow is a general-purpose workflow orchestration platform widely adopted for ML pipelines. Pipelines are defined as Python DAGs, and Airflow provides a scheduler, a web UI for monitoring, and a plugin system for integrating with external services.
A typical ML pipeline in Airflow might define tasks for data extraction, preprocessing, model training, evaluation, and deployment, with dependencies ensuring they run in the correct order. Airflow's distributed architecture consists of a scheduler that orchestrates workflow execution, executors responsible for running task instances on workers, and a metadata database that stores state and history.
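A minimal sketch of such a DAG, assuming the Airflow 2.x TaskFlow API; the schedule and task bodies are illustrative placeholders:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def ml_training_pipeline():
    @task
    def extract():
        ...  # pull raw data from the source system

    @task
    def preprocess(raw):
        ...  # clean and transform the raw data

    @task
    def train(features):
        ...  # fit the model and return a reference to the artifact

    @task
    def evaluate(model_ref):
        ...  # compute validation metrics against a baseline

    # Chaining the calls defines the task dependencies
    evaluate(train(preprocess(extract())))

ml_training_pipeline()
```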
Airflow is tool-agnostic and can orchestrate actions across any service with an API, making it a flexible backbone for MLOps workflows. However, Airflow was originally designed for data engineering workflows, so it lacks some ML-specific features (like built-in experiment tracking) that purpose-built ML platforms provide.
| Feature | scikit-learn Pipeline | TFX | Kubeflow Pipelines | MLflow | Apache Airflow |
|---|---|---|---|---|---|
| Primary use case | Single-machine ML workflows | TensorFlow production pipelines | Kubernetes-native ML | Experiment tracking and lifecycle | General workflow orchestration |
| Framework support | Scikit-learn estimators | TensorFlow/Keras | Any (containerized) | Any | Any |
| Orchestration | In-process (Python) | Airflow, Beam, Kubeflow | Argo Workflows on K8s | External (pluggable) | Built-in scheduler |
| Scalability | Single machine | Distributed (Beam) | Kubernetes clusters | Varies by deployment | Distributed workers |
| Data leakage prevention | Built-in (fit/transform) | Manual | Manual | Manual | Manual |
| Model registry | No | Limited | Via integration | Built-in | No |
| Learning curve | Low | High | High | Medium | Medium |
A feature store is a centralized repository that manages feature engineering outputs for ML models. It serves as a single source of truth for feature definitions, ensuring that the same features are used consistently during both training and inference. Without a feature store, teams often end up with duplicated feature computation code and inconsistencies between training and serving environments, a problem known as training-serving skew.
A feature store typically has two storage layers:

- An offline store that holds large volumes of historical feature values and serves them in batch for model training
- An online store that holds the latest feature values and serves them at low latency for real-time inference
Popular feature store implementations include Feast (open source), Tecton, Hopsworks, and built-in feature stores from cloud providers such as Amazon SageMaker Feature Store, Google Vertex AI Feature Store, and Databricks Feature Store.
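With Feast, for example, online retrieval at inference time might look like this sketch; the feature names and entity key are illustrative assumptions:

```python
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a Feast feature repository

# Fetch the latest feature values for one entity from the online store
features = store.get_online_features(
    features=["driver_stats:avg_rating", "driver_stats:trips_today"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
```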
A model registry is a centralized system for storing, versioning, and managing trained ML models throughout their lifecycle. It acts as the bridge between training pipelines and deployment pipelines, providing a structured way to track which models are in production, which are in staging, and which are experimental.
Key capabilities of a model registry include:

- Versioning every registered model and its associated artifacts
- Stage transitions (for example, from staging to production to archived)
- Lineage and metadata tracking that links a model to the code, data, and parameters that produced it
- Annotations and approval workflows for governance and review
MLflow Model Registry, Amazon SageMaker Model Registry, Google Vertex AI Model Registry, and Azure ML Model Registry are widely used implementations.
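With MLflow, for example, registering and later loading a model might look like this sketch; the run URI, model name, and version are illustrative placeholders:

```python
import mlflow

# Register a model that was logged during a training run
mlflow.register_model(model_uri="runs:/<run_id>/model", name="churn-model")

# Load a specific registered version for inference
loaded = mlflow.pyfunc.load_model("models:/churn-model/1")
```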
Continuous Integration and Continuous Deployment (CI/CD) practices, originally developed for traditional software, have been adapted for ML pipelines. However, ML CI/CD differs in important ways from conventional software CI/CD because in ML, the final product depends on both code and data. A change in either one can alter model behavior.
Google has defined three levels of MLOps maturity that describe how organizations adopt pipeline automation:
| Maturity Level | Description | Characteristics |
|---|---|---|
| Level 0: Manual | All steps performed manually | Data scientists hand off models to engineers; infrequent deployments; no monitoring |
| Level 1: Pipeline Automation | Training pipeline is automated | Continuous training with fresh data; feature stores and metadata management; automated data and model validation |
| Level 2: CI/CD Automation | Full automation of build, test, and deployment | Rapid experimentation; automated testing of data, features, and models; robust deployment strategies |
Monitoring a deployed ML pipeline involves tracking several dimensions beyond traditional software monitoring:

- Data drift: changes in the distribution of input features relative to the training data
- Concept drift: changes in the relationship between inputs and the target variable
- Prediction quality: model performance metrics computed against ground-truth labels as they become available
- Data quality: missing values, schema violations, and anomalous feature values in incoming requests
- System health: latency, throughput, error rates, and resource utilization of the serving infrastructure
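One common drift measure is the Population Stability Index (PSI), sketched below; the binning choice and the conventional alert thresholds (around 0.1 to 0.2) are rules of thumb, not fixed standards:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training-time (expected) and a
    live (actual) feature distribution; higher values indicate more drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) and division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```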
Tools such as Evidently AI, WhyLabs, Arize AI, and Amazon SageMaker Model Monitor provide specialized capabilities for ML monitoring.
Imagine you are making cookies. A machine learning pipeline is like the recipe and all the steps you follow to make the cookies. First, you gather and prepare the ingredients, which is like collecting and cleaning data. Then you mix them together in just the right way, which is like feature engineering. Next, you put the cookies in the oven and bake them, which is like training the model. You taste one cookie to check if they are good, which is like evaluation. Finally, you put all the cookies on a plate for your friends, which is like deployment. If you write down every step and follow it the same way each time, your cookies will always come out right. That is what a pipeline does for machine learning: it writes down all the steps so the computer follows them in the same order every time.