# Pipeline

> Source: https://aiwiki.ai/wiki/pipeline
> Updated: 2026-04-06
> Categories: MLOps, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

*See also: [Machine learning](/wiki/machine_learning), [MLOps](/wiki/mlops), [Model deployment](/wiki/model_deployment)*

A **pipeline** in [machine learning](/wiki/machine_learning) is a sequence of data processing steps chained together to form an end-to-end workflow. Each step in the pipeline takes input from the previous step, performs a specific transformation or computation, and passes its output to the next step. Pipelines are fundamental to modern ML practice because they enforce reproducibility, reduce manual errors, and make complex workflows easier to manage, test, and deploy.

The concept of a pipeline draws from software engineering and manufacturing, where assembly lines process raw materials through a series of stations. In ML, the "raw material" is data, and the "stations" include everything from data cleaning to model [training](/wiki/training) to production [serving](/wiki/serving). By formalizing these steps into a pipeline, teams ensure that every run follows the same process and produces consistent, auditable results.

## Core Stages of a Machine Learning Pipeline

A typical ML pipeline consists of several distinct stages. While the exact steps vary depending on the project, most pipelines include the following components.

### Data Ingestion

The pipeline begins by collecting raw data from one or more sources. These sources can include databases, APIs, data lakes, streaming platforms like Apache Kafka, or flat files such as CSV and Parquet. The ingestion step handles connection logic, schema validation, and initial quality checks to confirm that the incoming data meets expected formats and constraints.
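As a rough illustration of schema validation at ingestion time, the sketch below parses CSV text against an expected schema and separates valid rows from rejected ones. The schema, the column names, and the `ingest_csv` helper are assumptions made for this example, not part of any particular library.

```python
import csv
import io

# Hypothetical schema: column name -> type used to parse each value.
EXPECTED_SCHEMA = {"user_id": int, "age": int, "income": float}

def ingest_csv(text, schema=EXPECTED_SCHEMA):
    """Parse CSV text and return (valid_rows, rejected_rows)."""
    reader = csv.DictReader(io.StringIO(text))
    missing = set(schema) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    valid, rejected = [], []
    for row in reader:
        try:
            valid.append({col: cast(row[col]) for col, cast in schema.items()})
        except (TypeError, ValueError):
            rejected.append(row)  # keep bad rows for later inspection
    return valid, rejected

raw = "user_id,age,income\n1,34,52000.0\n2,abc,41000.5\n"
rows, bad = ingest_csv(raw)
```

Here the second row is rejected because `abc` cannot be parsed as an integer age. A production ingestion step would usually also log or quarantine rejected rows rather than only returning them.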

### Data Preprocessing

[Preprocessing](/wiki/preprocessing) is often the most time-consuming stage. It involves cleaning and transforming raw data into a structured format suitable for analysis. Common preprocessing tasks include:

- Handling missing values through imputation or removal
- Removing duplicate records
- Encoding categorical variables (one-hot encoding, label encoding, or target encoding)
- Scaling and normalizing numerical features using techniques like min-max scaling or standardization
- Parsing dates, text fields, and other unstructured data
- Detecting and handling outliers

Preprocessing quality directly affects downstream model performance, making this stage critical to the overall pipeline.
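Three of the tasks listed above (mean imputation, min-max scaling, and one-hot encoding) can be sketched in plain Python. The function names are illustrative; in practice, libraries such as scikit-learn or pandas provide tested equivalents.

```python
def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Encode categories as 0/1 indicator lists, one column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

ages = impute_mean([20, None, 40])           # [20, 30.0, 40]
scaled = min_max_scale(ages)                 # [0.0, 0.5, 1.0]
cities = one_hot(["Oslo", "Rome", "Oslo"])   # [[1, 0], [0, 1], [1, 0]]
```

In a real pipeline, the statistics used here (the mean, the minimum and maximum, the category list) should be learned from training data only and then reused unchanged on new data.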

### Feature Engineering

[Feature engineering](/wiki/feature_engineering) involves creating new input variables or modifying existing ones to improve model predictive power. Examples include generating polynomial features, computing rolling averages, extracting text embeddings, creating interaction terms between variables, and applying domain-specific transformations. Feature engineering often requires deep understanding of the problem domain and is considered one of the most impactful steps in the pipeline.
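Two of the examples mentioned above, rolling averages and interaction terms, can be sketched as follows. The helper names and the sample data are hypothetical.

```python
def rolling_mean(values, window):
    """Trailing mean over the last `window` values (shorter at the start)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def add_interaction(rows, a, b):
    """Return copies of the rows with an extra feature equal to a * b."""
    return [{**r, f"{a}_x_{b}": r[a] * r[b]} for r in rows]

daily_sales = [10, 20, 30, 40]
smoothed = rolling_mean(daily_sales, window=2)   # [10.0, 15.0, 25.0, 35.0]
rows = add_interaction([{"price": 2.0, "quantity": 3}], "price", "quantity")
```

The trailing window uses only past and current values, which matters for time series: a centered window would let future observations leak into each feature.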

### Model Training

During [training](/wiki/training), a machine learning algorithm fits a model to the prepared data. This step involves selecting an appropriate algorithm (linear regression, random forests, neural networks, gradient boosting, and so on), configuring hyperparameters, and running the optimization process. Many pipelines integrate hyperparameter tuning methods such as grid search, random search, or Bayesian optimization at this stage.
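The idea behind grid search can be shown with a minimal sketch. The `validation_loss` function below is a hypothetical stand-in for "train a model with these hyperparameters and return its validation loss"; the parameter names and grid values are likewise assumptions.

```python
from itertools import product

def validation_loss(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it."""
    return (learning_rate - 0.1) ** 2 + (depth - 4) ** 2

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}

def grid_search(grid, objective):
    """Evaluate every combination and return (best_params, best_loss)."""
    names = list(grid)
    best_params, best_loss = None, float("inf")
    for combo in product(*(grid[n] for n in names)):
        params = dict(zip(names, combo))
        loss = objective(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = grid_search(grid, validation_loss)
```

Grid search evaluates all nine combinations here. Random search would instead sample a fixed number of combinations, which tends to scale better when there are many hyperparameters.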

### Model Evaluation

After training, the model is evaluated on a held-out validation or test set using metrics appropriate to the task. For classification problems, common metrics include accuracy, precision, recall, F1 score, and area under the ROC curve. For regression, metrics like mean squared error, mean absolute error, and R-squared are typical. The evaluation stage may also include fairness audits, bias detection, and error analysis to ensure the model behaves as expected across different subgroups.
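For reference, precision, recall, and F1 can be computed directly from true-positive, false-positive, and false-negative counts. The sketch below uses made-up labels; libraries such as scikit-learn provide equivalent, more complete implementations.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With these labels there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3.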

### Model Deployment

Once a model passes evaluation, it is deployed to a production environment where it can serve predictions. Deployment can take several forms: a REST API endpoint, a batch prediction job, an embedded model within a mobile app, or integration into an existing software system. The deployment stage also involves packaging the model with its dependencies, setting up version control, and configuring scaling and load balancing.

### Monitoring and Maintenance

After deployment, continuous monitoring tracks model performance, data drift, and system health. Models can degrade over time as the distribution of incoming data shifts away from the training distribution (data drift) or as the relationship between the inputs and the target changes (concept drift). Monitoring systems detect these changes and can trigger automated retraining pipelines to keep the model current.

## Training Pipeline vs. Inference Pipeline

Machine learning systems typically require two separate but related pipelines.

| Aspect | Training Pipeline | Inference Pipeline |
|---|---|---|
| **Purpose** | Discover patterns in historical data and produce a trained model | Apply the trained model to new data to generate predictions |
| **Execution frequency** | Runs periodically (daily, weekly) or on trigger events | Runs continuously in production, often millions of times per day |
| **Compute requirements** | Requires large datasets, significant GPU/TPU compute, and substantial processing time | Requires less compute; optimized for low latency (milliseconds to seconds) |
| **Data volume** | Processes entire training datasets (gigabytes to terabytes) | Processes individual requests or small batches |
| **Output** | A serialized model artifact (weights, parameters) | Predictions, scores, or classifications |
| **Learning** | The model updates its parameters through backpropagation or other optimization | The model is frozen; no parameter updates occur during [inference](/wiki/inference) |
| **Infrastructure** | Often runs on cloud GPU clusters or dedicated training servers | May run on edge devices, CPUs, or lightweight inference servers |

A well-designed system keeps these two pipelines decoupled but connected through a shared model registry and feature store, ensuring consistency between the features used during training and those used during inference.
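The split between the two pipelines can be illustrated with a toy example in which training produces a serialized artifact and inference only reads it. The "model" here (a threshold on a centered value) and the JSON artifact format are assumptions chosen for brevity.

```python
import json

def train(values, labels):
    """Training pipeline: learn parameters from historical data."""
    mean = sum(values) / len(values)
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    # Toy model: threshold halfway between the class means, after centering.
    midpoint = (sum(positives) / len(positives) + sum(negatives) / len(negatives)) / 2
    return json.dumps({"mean": mean, "threshold": midpoint - mean})

def predict(artifact, value):
    """Inference pipeline: apply frozen parameters; nothing is updated."""
    params = json.loads(artifact)
    return int(value - params["mean"] > params["threshold"])

artifact = train([1.0, 2.0, 8.0, 9.0], [0, 0, 1, 1])
high = predict(artifact, 7.0)   # 1
low = predict(artifact, 3.0)    # 0
```

Because the centering statistic travels inside the artifact, inference applies exactly the same feature transformation that training used, which is the consistency a shared registry and feature store are meant to guarantee at larger scale.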

## Scikit-learn Pipeline

The [scikit-learn](/wiki/scikit-learn) library provides a `Pipeline` class that is one of the most widely used implementations for building ML pipelines in Python. It chains multiple processing steps (transformers) with a final estimator (classifier or regressor) into a single object that can be fitted and used for predictions.

### How It Works

A scikit-learn Pipeline is constructed from a list of `(name, estimator)` tuples. Every step except the last must be a transformer, meaning it implements `fit()` and `transform()`. The last step can be any estimator. When `fit()` is called on the pipeline, the pipeline calls `fit_transform()` on each transformer in sequence, passing the transformed output to the next step, and then calls `fit()` on the final estimator. When `predict()` is called, the pipeline calls `transform()` on each transformer and then `predict()` on the final estimator.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Explicit construction
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('reduce_dim', PCA(n_components=10)),
    ('clf', SVC())
])

# Shorthand with auto-generated names
pipe = make_pipeline(StandardScaler(), PCA(n_components=10), SVC())

pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```
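To make the fit/predict chaining described above concrete, here is a deliberately simplified re-implementation. It is a teaching sketch, not scikit-learn's actual code; `MiniPipeline`, `Center`, and `SignClassifier` are hypothetical classes written for this example.

```python
class MiniPipeline:
    """Simplified illustration of step chaining; not scikit-learn's code."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) tuples

    def fit(self, X, y=None):
        for _, transformer in self.steps[:-1]:
            X = transformer.fit_transform(X, y)  # learn parameters, then apply
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)         # apply only, never refit
        return self.steps[-1][1].predict(X)

class Center:
    """Toy transformer: subtract the mean seen during fitting."""
    def fit_transform(self, X, y=None):
        self.mean_ = sum(X) / len(X)
        return self.transform(X)

    def transform(self, X):
        return [x - self.mean_ for x in X]

class SignClassifier:
    """Toy estimator: predict 1 for positive inputs, else 0."""
    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [int(x > 0) for x in X]

pipe = MiniPipeline([("center", Center()), ("clf", SignClassifier())])
pipe.fit([1.0, 2.0, 3.0], [0, 0, 1])
preds = pipe.predict([0.0, 5.0])   # [0, 1]
```

The mean learned during `fit()` (2.0) is reused at prediction time, so new inputs are centered with training statistics rather than their own.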

### Preventing Data Leakage

One of the most important benefits of using a scikit-learn Pipeline is its ability to prevent data leakage. Data leakage occurs when information from the test set (or future data) inadvertently influences the training process, leading to overly optimistic performance estimates that do not generalize to real-world data.

Without a pipeline, a common mistake is to fit a scaler or other transformer on the entire dataset before splitting into training and test sets. This allows statistics from the test set (such as the mean and standard deviation) to "leak" into the training process.

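```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X, y: feature matrix and labels, assumed to be loaded already

# BAD: data leakage - scaler learns from test data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # fits on ALL data, including future test rows
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# GOOD: split the raw data first, then let the pipeline fit the scaler
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_train, y_train)           # scaler is fit ONLY on X_train
score = pipe.score(X_test, y_test)   # X_test is transformed, never used for fitting
```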

When used with cross-validation functions like `cross_val_score`, the pipeline ensures that transformers are fit exclusively on each training fold, and the validation fold is only transformed (never fitted), preserving a clean separation between training and evaluation data.
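A minimal sketch of this usage follows. The synthetic dataset from `make_classification` and the choice of five folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())

# For each of the 5 folds, a fresh copy of the pipeline is fit on the
# training portion only and then scored on the held-out portion.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Passing the whole pipeline, rather than pre-scaled data, is what keeps the scaler's statistics from being computed on validation rows.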

### ColumnTransformer and FeatureUnion

Scikit-learn extends the basic Pipeline with two additional composition tools:

- **ColumnTransformer** applies different transformations to different columns of the input data. For example, numerical columns might be scaled while categorical columns are one-hot encoded.
- **FeatureUnion** runs multiple transformers in parallel and concatenates their outputs side by side, enabling the combination of different feature extraction strategies.

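```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names assume the input is a pandas DataFrame with these columns.
preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),    # scale numeric columns
    ('cat', OneHotEncoder(), ['city', 'gender'])     # encode categorical columns
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', LogisticRegression())
])
```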
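FeatureUnion can be sketched in a similar way. The example below combines two PCA components with the single best univariate feature on the bundled iris dataset; the choice of transformers and dataset is an assumption for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Both transformers see the same input; their outputs are concatenated
# column-wise into a single feature matrix.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(k=1)),
])

X_combined = union.fit_transform(X, y)
print(X_combined.shape)
```

The result has three columns: two from PCA and one from SelectKBest. A FeatureUnion can itself be used as a step inside a Pipeline.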

## Pipeline Orchestration and DAGs

At the infrastructure level, ML pipelines are often represented as Directed Acyclic Graphs (DAGs). In a DAG, each node represents a processing step and edges define the dependencies between steps. The "acyclic" constraint means there are no circular dependencies, so the pipeline has a clear execution order. DAG-based orchestration allows the system to determine which steps can run in parallel and which must wait for upstream steps to complete.

Pipeline orchestrators are tools that manage the scheduling, execution, monitoring, and error handling of these DAG-based workflows. They handle concerns like retrying failed steps, managing compute resources, passing data between steps, and sending alerts when something goes wrong.
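How an orchestrator might derive an execution order from a DAG can be sketched with Python's standard-library `graphlib` module (Python 3.9+). The step names and dependencies below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
dag = {
    "ingest": set(),
    "preprocess": {"ingest"},
    "validate_data": {"ingest"},
    "features": {"preprocess"},
    "train": {"features"},
    "evaluate": {"train", "validate_data"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
    batches.append(ready)               # steps in one batch could run in parallel
    sorter.done(*ready)
```

Here `preprocess` and `validate_data` land in the same batch because both depend only on `ingest`, while `evaluate` waits for both of its upstream steps. Real orchestrators add scheduling, retries, and resource management on top of this ordering logic.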

## Major Pipeline Frameworks and Tools

Several frameworks have emerged to support building, orchestrating, and managing ML pipelines at scale.

### TensorFlow Extended (TFX)

TFX is Google's open-source, end-to-end platform for deploying production ML pipelines. It is designed specifically for [TensorFlow](/wiki/tensorflow) models and provides a set of standardized components:

| TFX Component | Purpose |
|---|---|
| ExampleGen | Ingests and splits data into training and evaluation sets |
| StatisticsGen | Computes descriptive statistics over the dataset |
| SchemaGen | Infers the data schema (types, ranges, domains) |
| ExampleValidator | Detects anomalies and drift in incoming data |
| Transform | Performs feature engineering using TensorFlow Transform |
| Trainer | Trains TensorFlow/Keras models |
| Tuner | Runs hyperparameter tuning |
| Evaluator | Validates model quality against baseline metrics |
| Pusher | Deploys validated models to a serving infrastructure |

TFX pipelines can be orchestrated using Apache Airflow, Apache Beam, or Kubeflow Pipelines. The platform targets production deployments in which each stage is validated before the next runs, and a model that passes the Evaluator is typically pushed to a serving system such as TensorFlow Serving.

### Kubeflow Pipelines

Kubeflow Pipelines is a Kubernetes-native platform for building and deploying portable ML workflows. Each pipeline step runs in its own container, providing strong isolation and reproducibility. The system uses Argo Workflows under the hood to orchestrate Kubernetes Pods that carry out each step.

Key features include:

- A Python SDK for defining pipelines as code
- A web UI for visualizing pipeline runs, comparing experiments, and inspecting outputs
- Built-in support for distributed [training](/wiki/training), hyperparameter tuning (via Katib), and model serving (via KServe)
- Integration with feature stores (Feast) and model registries

Kubeflow is well suited for organizations that already run Kubernetes and need fine-grained control over resource allocation and scaling.

### MLflow

MLflow is a framework-agnostic, open-source platform for managing the ML lifecycle. Unlike TFX and Kubeflow, which focus heavily on pipeline orchestration, MLflow concentrates on experiment tracking, packaging, and model management:

- **Tracking**: Log parameters, metrics, and artifacts for every experiment run
- **Projects**: Package ML code in a reproducible format
- **Models**: Serve models across multiple deployment targets
- **Model Registry**: Version, stage, and manage models through their lifecycle

MLflow supports Python, R, and Java, and integrates with virtually any ML framework. Many teams combine MLflow with a dedicated orchestrator (like Airflow or Kubeflow) to get the best of both worlds: MLflow handles experiment tracking and model management, while the orchestrator handles scheduling and execution.

### Apache Airflow

Apache Airflow is a general-purpose workflow orchestration platform widely adopted for ML pipelines. Pipelines are defined as Python DAGs, and Airflow provides a scheduler, a web UI for monitoring, and a plugin system for integrating with external services.

A typical ML pipeline in Airflow might define tasks for data extraction, preprocessing, model training, evaluation, and deployment, with dependencies ensuring they run in the correct order. Airflow's distributed architecture consists of a scheduler that orchestrates workflow execution, executors responsible for running task instances on workers, and a metadata database that stores state and history.

Airflow is tool-agnostic and can orchestrate actions across any service with an API, making it a flexible backbone for MLOps workflows. However, Airflow was originally designed for data engineering workflows, so it lacks some ML-specific features (like built-in experiment tracking) that purpose-built ML platforms provide.

### Comparison of Pipeline Frameworks

| Feature | scikit-learn Pipeline | TFX | Kubeflow Pipelines | MLflow | Apache Airflow |
|---|---|---|---|---|---|
| **Primary use case** | Single-machine ML workflows | TensorFlow production pipelines | Kubernetes-native ML | Experiment tracking and lifecycle | General workflow orchestration |
| **Framework support** | Scikit-learn estimators | TensorFlow/Keras | Any (containerized) | Any | Any |
| **Orchestration** | In-process (Python) | Airflow, Beam, Kubeflow | Argo Workflows on K8s | External (pluggable) | Built-in scheduler |
| **Scalability** | Single machine | Distributed (Beam) | Kubernetes clusters | Varies by deployment | Distributed workers |
| **Data leakage prevention** | Built-in (fit/transform) | Manual | Manual | Manual | Manual |
| **Model registry** | No | Limited | Via integration | Built-in | No |
| **Learning curve** | Low | High | High | Medium | Medium |

## Feature Stores

A feature store is a centralized repository that manages [feature engineering](/wiki/feature_engineering) outputs for ML models. It serves as a single source of truth for feature definitions, ensuring that the same features are used consistently during both training and [inference](/wiki/inference). Without a feature store, teams often end up with duplicated feature computation code and inconsistencies between training and serving environments, a problem known as training-serving skew.

A feature store typically has two storage layers:

- **Offline store**: Holds historical feature data for batch training jobs, usually backed by a data warehouse or data lake
- **Online store**: Serves precomputed features at low latency for real-time inference, often backed by a key-value store like Redis or DynamoDB

Popular feature store implementations include Feast (open source), Tecton, Hopsworks, and built-in feature stores from cloud providers such as Amazon SageMaker Feature Store, Google Vertex AI Feature Store, and Databricks Feature Store.
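The offline/online split can be illustrated with a toy in-memory store. This sketch does not reflect the API of Feast or any other product; the class, its method names, and the sample feature are all hypothetical.

```python
class ToyFeatureStore:
    """In-memory illustration of the offline/online split."""

    def __init__(self):
        self.offline = []  # append-only history: (entity_id, timestamp, features)
        self.online = {}   # latest features per entity, for fast lookup

    def write(self, entity_id, timestamp, features):
        self.offline.append((entity_id, timestamp, features))
        latest = self.online.get(entity_id)
        if latest is None or timestamp >= latest[0]:
            self.online[entity_id] = (timestamp, features)

    def get_online(self, entity_id):
        """Serving path: return the most recent features."""
        return self.online[entity_id][1]

    def get_historical(self, entity_id, as_of):
        """Training path: newest features at or before `as_of`."""
        rows = [(ts, f) for eid, ts, f in self.offline
                if eid == entity_id and ts <= as_of]
        return max(rows, key=lambda r: r[0])[1] if rows else None

store = ToyFeatureStore()
store.write("user_1", 1, {"purchases_7d": 2})
store.write("user_1", 5, {"purchases_7d": 4})
```

The point-in-time lookup in `get_historical` matters for training: it returns the feature values as they were at the time of each training example, which avoids leaking later information into the training set.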

## Model Registry

A model registry is a centralized system for storing, versioning, and managing trained ML models throughout their lifecycle. It acts as the bridge between training pipelines and deployment pipelines, providing a structured way to track which models are in production, which are in staging, and which are experimental.

Key capabilities of a model registry include:

- **Version control**: Automatically tracks each model version along with its training parameters, metrics, and data lineage
- **Stage management**: Models can be assigned stages (e.g., "staging," "production," "archived") to manage promotion workflows
- **Traceability**: Each registered model version links back to the specific training run, dataset, and code that produced it, enabling full reproducibility
- **Governance**: Role-based access controls and approval workflows ensure that only validated models reach production

MLflow Model Registry, Amazon SageMaker Model Registry, Google Vertex AI Model Registry, and Azure ML Model Registry are widely used implementations.
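The versioning and stage-management ideas above can be sketched with a toy in-memory registry. It does not mirror the API of MLflow or any cloud registry; the class, the stage names, the artifact paths, and the run identifiers are hypothetical.

```python
class ToyModelRegistry:
    """In-memory illustration only; real registries persist this metadata."""

    def __init__(self):
        self.versions = {}  # model name -> list of version records

    def register(self, name, artifact_uri, metrics, run_id):
        records = self.versions.setdefault(name, [])
        records.append({
            "version": len(records) + 1,
            "artifact_uri": artifact_uri,
            "metrics": metrics,
            "run_id": run_id,   # link back to the training run
            "stage": "none",
        })
        return records[-1]["version"]

    def transition(self, name, version, stage):
        records = self.versions[name]
        if stage == "production":
            for record in records:
                if record["stage"] == "production":
                    record["stage"] = "archived"  # keep one production version
        records[version - 1]["stage"] = stage

    def get_production(self, name):
        return next((r for r in self.versions[name]
                     if r["stage"] == "production"), None)

registry = ToyModelRegistry()
v1 = registry.register("churn", "models/churn/1", {"auc": 0.88}, run_id="run-001")
v2 = registry.register("churn", "models/churn/2", {"auc": 0.91}, run_id="run-002")
registry.transition("churn", v1, "production")
registry.transition("churn", v2, "production")
```

After the second transition, version 2 is in production and version 1 has been archived, so a deployment pipeline that asks for the production model always gets exactly one answer.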

## CI/CD for Machine Learning

Continuous Integration and Continuous Deployment (CI/CD) practices, originally developed for traditional software, have been adapted for ML pipelines. However, ML CI/CD differs in important ways from conventional software CI/CD because in ML, the final product depends on both code and data. A change in either one can alter model behavior.

Google has defined three levels of MLOps maturity that describe how organizations adopt pipeline automation:

| Maturity Level | Description | Characteristics |
|---|---|---|
| **Level 0: Manual** | All steps performed manually | Data scientists hand off models to engineers; infrequent deployments; no monitoring |
| **Level 1: Pipeline Automation** | Training pipeline is automated | Continuous training with fresh data; feature stores and metadata management; automated data and model validation |
| **Level 2: CI/CD Automation** | Full automation of build, test, and deployment | Rapid experimentation; automated testing of data, features, and models; robust deployment strategies |

### Key CI/CD Components for ML

- **Data validation**: Automated checks for schema changes, missing values, and data drift
- **Model validation**: Performance benchmarking against baseline models before promotion
- **Deployment strategies**: Blue-green deployment (two identical environments with traffic switching), canary releases (routing a small percentage of traffic to the new model), and shadow deployment (running the new model alongside the old one without affecting users)
- **Automated retraining**: Pipelines that trigger when performance drops below a threshold or new data arrives
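
Two of these components, a model validation gate and canary routing, can be sketched as follows. The metric names, the assumption that higher metric values are better, and the hash-based bucketing scheme are choices made for this illustration.

```python
import hashlib

def passes_gate(candidate, baseline, min_gain=0.0):
    """Promote only if the candidate matches or beats the baseline on
    every baseline metric (assumes higher is better for all metrics)."""
    return all(candidate[m] >= baseline[m] + min_gain for m in baseline)

def route(request_id, canary_percent):
    """Deterministically send about canary_percent% of requests to the
    new model, based on a hash of the request identifier."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_percent else "stable"

baseline = {"auc": 0.90, "recall": 0.80}
ok = passes_gate({"auc": 0.91, "recall": 0.80}, baseline)        # True
blocked = passes_gate({"auc": 0.93, "recall": 0.79}, baseline)   # False
```

Hashing the request identifier, rather than choosing randomly per request, means the same caller is consistently routed to the same model version during the rollout.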

## Pipeline Monitoring

Monitoring a deployed ML pipeline involves tracking several dimensions beyond traditional software monitoring:

- **Data quality**: Monitoring input data distributions for drift, missing values, and schema violations
- **Model performance**: Tracking prediction accuracy, latency, and throughput over time
- **Feature drift**: Detecting when the statistical properties of input features change relative to the training distribution
- **Concept drift**: Identifying when the relationship between input features and the target variable shifts
- **System health**: Monitoring CPU, memory, GPU utilization, and pipeline execution times

Tools such as Evidently AI, WhyLabs, Arize AI, and Amazon SageMaker Model Monitor provide specialized capabilities for ML monitoring.
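As one illustration of feature drift detection, the sketch below computes the Population Stability Index (PSI) between a training sample and a live sample over shared bins. The bin edges, the sample values, and the smoothing constant `eps` are assumptions; values outside the bin range are simply ignored in this simplified version.

```python
import math

def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins."""
    n_bins = len(edges) - 1

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            for i in range(n_bins):
                is_last = i == n_bins - 1
                # The last bin also includes its right edge.
                if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                    counts[i] += 1
                    break
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [20, 30, 40, 50, 60]
train_ages = [22, 25, 31, 38, 44, 47, 53, 58]
live_shifted = [51, 52, 54, 55, 57, 58, 59, 60]

no_drift = psi(train_ages, list(train_ages), edges)   # 0.0
drift = psi(train_ages, live_shifted, edges)
```

Identical distributions give a PSI of zero, while the shifted sample produces a large value. Alert thresholds are application-specific; a commonly cited rule of thumb treats values above roughly 0.2 as a shift worth investigating.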

## Best Practices for ML Pipelines

1. **Modularize pipeline steps.** Each step should have a single responsibility, clearly defined inputs and outputs, and be independently testable.
2. **Version everything.** Track versions of data, code, models, and pipeline configurations. Tools like DVC (Data Version Control) complement Git for data and model versioning.
3. **Automate testing.** Include data validation tests, unit tests for transformation logic, integration tests for the full pipeline, and model performance tests.
4. **Use containers.** Package each pipeline step in a Docker container to ensure consistent execution across development, staging, and production environments.
5. **Implement idempotency.** Pipeline steps should produce the same output when run multiple times with the same input, making reruns and debugging safer.
6. **Log metadata extensively.** Record parameters, metrics, data lineage, and execution times for every pipeline run to support debugging and compliance.
7. **Separate concerns.** Keep training pipelines, inference pipelines, and monitoring systems as independent components connected through well-defined interfaces.
8. **Plan for failure.** Design pipelines with retry logic, graceful error handling, and alerting so that transient failures do not require manual intervention.
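
The retry logic mentioned in the last practice can be sketched with exponential backoff. The `with_retries` helper, the choice of `ConnectionError` as the retryable error, and the deliberately flaky step are hypothetical.

```python
import time

def with_retries(func, attempts=3, base_delay=0.01, retry_on=(ConnectionError,)):
    """Call func(); on a retryable error, back off exponentially and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))

calls = {"count": 0}

def flaky_step():
    """Hypothetical step that fails twice with a transient error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = with_retries(flaky_step)
```

Retries are only safe when the step is idempotent, which is why the two practices are usually adopted together. Orchestrators such as Airflow expose retry settings per task, so hand-written wrappers like this are mainly useful inside a single step.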

## Explain Like I'm 5 (ELI5)

Imagine you are making cookies. A machine learning pipeline is like the recipe and all the steps you follow to make the cookies. First, you gather and prepare the ingredients, which is like collecting and cleaning data. Then you mix them together in just the right way, which is like feature engineering. Next, you put the cookies in the oven and bake them, which is like training the model. You taste one cookie to check if they are good, which is like evaluation. Finally, you put all the cookies on a plate for your friends, which is like deployment. If you write down every step and follow it the same way each time, your cookies will always come out right. That is what a pipeline does for machine learning: it writes down all the steps so the computer follows them in the same order every time.

## References

1. Scikit-learn documentation. "Pipelines and composite estimators." [https://scikit-learn.org/stable/modules/compose.html](https://scikit-learn.org/stable/modules/compose.html)
2. Google Cloud Architecture Center. "MLOps: Continuous delivery and automation pipelines in machine learning." [https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
3. Valohai. "What is a Machine Learning Pipeline?" [https://valohai.com/machine-learning-pipeline/](https://valohai.com/machine-learning-pipeline/)
4. Kubeflow documentation. "Pipelines Overview." [https://www.kubeflow.org/docs/components/pipelines/overview/](https://www.kubeflow.org/docs/components/pipelines/overview/)
5. MLflow documentation. "MLflow Model Registry." [https://mlflow.org/docs/latest/ml/model-registry/](https://mlflow.org/docs/latest/ml/model-registry/)
6. TensorFlow. "TFX: TensorFlow Extended." [https://www.tensorflow.org/tfx](https://www.tensorflow.org/tfx)
7. Apache Airflow documentation. [https://airflow.apache.org/](https://airflow.apache.org/)
8. Snowflake. "Feature Store for Machine Learning." [https://www.snowflake.com/en/fundamentals/feature-store/](https://www.snowflake.com/en/fundamentals/feature-store/)
9. DataCamp. "A Beginner's Guide to CI/CD for Machine Learning." [https://www.datacamp.com/tutorial/ci-cd-for-machine-learning](https://www.datacamp.com/tutorial/ci-cd-for-machine-learning)
10. Scikit-learn documentation. "Common pitfalls and recommended practices." [https://scikit-learn.org/stable/common_pitfalls.html](https://scikit-learn.org/stable/common_pitfalls.html)
