# MLOps

> Source: https://aiwiki.ai/wiki/mlops
> Updated: 2026-06-21
> Categories: MLOps
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

MLOps (Machine Learning Operations) is a set of practices, principles, and tools for deploying, monitoring, and maintaining [machine learning](/wiki/machine_learning) models in production reliably and efficiently. The discipline draws heavily from [DevOps](/wiki/devops) but extends those ideas to the unique requirements of ML systems, where data, model weights, hyperparameters, and shifting real-world conditions are first-class artifacts alongside code. MLOps exists because the model is the smallest part of a production ML system: the influential 2015 Google paper that named the problem observed that "only a small fraction of real-world ML systems are composed of ML code," with the rest being configuration, data collection, feature extraction, serving infrastructure, and monitoring [4]. MLOps has become essential as organizations move from experimental notebooks to systems that must operate at scale, remain accurate over time, and meet business and regulatory requirements for reliability and governance [1].

The field crystallized around 2018 to 2019, when teams that had built early production ML infrastructure (Google with TFX, Uber with Michelangelo, Facebook with FBLearner Flow, Netflix with Metaflow) began publishing the operational lessons they had learned. The launch of MLflow by Databricks in June 2018 gave the open-source community a common vocabulary, and the term "MLOps" itself entered widespread use shortly after. By the mid-2020s, MLOps had branched into specialized subdisciplines such as [LLMOps](/wiki/llmops) for generative models, while regulatory frameworks like the EU AI Act made formal lifecycle governance a legal requirement for many systems.

## what is MLOps and what does it cover?

At its core, MLOps addresses a fundamental problem: machine learning models that perform well on a held-out test set during research often fail or quietly degrade once deployed. Unlike traditional software, an ML system depends not only on its source code but also on the data it was trained on, the model weights produced by training, the hyperparameters used, and the statistical properties of the environment it operates in. When any of these change, the system can break in ways that are hard to detect because the code is still running and returning numerical answers.

MLOps formalizes the practices needed to manage that complexity. It covers the entire lifecycle of an ML model, from initial problem framing and data collection through deployment, monitoring, retraining, and eventual retirement. The goal is to make ML systems as reliable, reproducible, and maintainable as conventional software systems while accounting for the data and model dependencies that make ML different.

Google Cloud defines MLOps as a practice that "aims to unify ML system development and ML system deployment in order to standardize and streamline the continuous delivery of high-performing models in production" [2]. AWS describes it as combining "ML system development (the ML element) and ML system operations (the Ops element)" to automate the end-to-end ML lifecycle [1]. Databricks frames MLOps around three concerns: end-to-end machine learning workflow automation, reproducibility and collaboration, and continuous integration, delivery, and training of models in production [3].

The scope of MLOps typically includes data management, experiment tracking, model training and validation, packaging and deployment, monitoring and drift detection, retraining and continuous learning, governance and compliance, cost management, and the people and processes needed to coordinate all of the above.

## when did MLOps emerge?

MLOps did not appear fully formed. It emerged gradually as teams running ML in production discovered that the engineering practices that worked for ordinary software did not translate cleanly to systems whose behavior depended on training data and learned parameters.

### sculley et al. and hidden technical debt

The field's intellectual foundation is widely traced to a 2015 NeurIPS paper by D. Sculley and colleagues at Google titled "Hidden Technical Debt in Machine Learning Systems" [4]. The paper applied the software engineering concept of technical debt to ML and argued that ML systems have a special capacity for incurring debt because they have all the maintenance problems of traditional code plus an additional set of ML-specific issues. Crucially, this debt may be hard to detect because it exists at the system level rather than the code level.

The paper enumerated several recurring problems: boundary erosion (where ML systems mix concerns that traditional engineering keeps separate), entanglement (changing anything changes everything, often summarized as the "CACE" principle), hidden feedback loops (where a model's predictions influence the data it later trains on), undeclared consumers downstream of model outputs, data dependencies that are harder to track than code dependencies, configuration debt, and external-world changes that silently invalidate assumptions baked into a model. The authors illustrated their point with a now-famous diagram showing that the actual ML code in a real system is a tiny black box surrounded by a large infrastructure of configuration, data collection, feature extraction, serving infrastructure, monitoring, and analysis tools. As the paper put it, "only a small fraction of real-world ML systems are composed of ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex" [4].

The paper did not propose a solution catalog. Instead it argued that ML practitioners needed to think beyond model accuracy and consider the long-term operational cost of the systems they were building. That argument became the seed for MLOps as a discipline.

### early platform efforts

While Sculley's paper described the problem, several companies had already begun building internal ML platforms to manage it.

Facebook (now Meta) launched FBLearner Flow internally in 2014 and publicly described it in May 2016 as the "AI backbone" of the company [5]. The platform handled experiment management, training pipelines, and model deployment for engineers across the company. According to that 2016 announcement, it was already used by more than 25 percent of Facebook's engineering team, and the prediction service was making more than six million predictions per second. FBLearner Flow eventually evolved into a generic workflow engine handling tasks well beyond ML.

Uber introduced Michelangelo in 2017 to power business-critical ML use cases such as ride ETAs, Eats delivery time predictions, fraud detection, and ranking [6]. Michelangelo combined open-source components such as HDFS, Spark, Cassandra, MLLib, XGBoost, and TensorFlow with internal infrastructure for feature management, training, and serving. A key innovation was its early feature store, which allowed batch-computed features to be reused at serving time and effectively created the design pattern that the broader industry later adopted. Several members of the Michelangelo team went on to found Tecton, the commercial feature platform.

Google built TFX (TensorFlow Extended) as a portable, end-to-end production ML platform on top of TensorFlow and described it publicly in 2017 [7]. TFX exposed reusable components for data validation, feature transformation, training, model analysis, and serving, all coordinated through ML metadata that recorded artifact lineage. The publication of the TFX paper at KDD 2017 was an important moment because it described how Google actually ran ML in production, not just how it trained models.

Netflix took a different path. The company developed Metaflow internally and open-sourced it in 2019, focusing on the data scientist's experience rather than on Kubernetes-native infrastructure. Netflix built Metaflow on top of AWS Batch, EC2, and S3 and integrated it with Titus (Netflix's container platform) and Maestro (its workflow orchestrator) to support hundreds of personalization models in production [8].

### mlflow and the open-source convergence

The public turning point for MLOps as a field arrived in June 2018 when Databricks announced MLflow at the Spark + AI Summit [9]. The first alpha included three components: MLflow Tracking for logging parameters, metrics, and artifacts; MLflow Projects for packaging reusable training code; and MLflow Models for a standard model format. MLflow was deliberately framework-agnostic, designed to work with any ML library and exposed through REST APIs and simple file formats.

MLflow's adoption was unusually rapid. Databricks reported that the project grew its contributor count in nine months to a level that Apache Spark had taken three years to reach. Within a couple of years MLflow was the de facto open-source standard for experiment tracking and model registry, and many other tools built on it or interoperated with it. The term "MLOps" itself, modeled on "DevOps," began appearing in conference talks and job postings around the same time.

From 2019 onward the ecosystem expanded rapidly: Kubeflow released its first 1.0 version in March 2020, Weights & Biases became a popular commercial alternative for experiment tracking, Feast emerged as the open-source feature store, and Evidently and Arize built monitoring and observability tools focused on ML-specific failure modes.

## how does MLOps differ from DevOps?

MLOps borrows the philosophical core of DevOps: automation, continuous integration and delivery, monitoring, and tight collaboration between development and operations. However, ML systems introduce dimensions that pure DevOps does not address.

| Aspect | DevOps | MLOps |
|--------|--------|-------|
| Primary artifacts | Source code | Code, data, model weights, prompts, configurations |
| Primary inputs | Code commits | Code commits, new data, drift signals, schedule |
| Testing | Unit tests, integration tests, end-to-end tests | All of those plus data validation, model validation, fairness audits, drift checks |
| Versioning | Git for code | Code in Git, data in DVC or LakeFS, models in a registry, prompts in a prompt store |
| CI trigger | Pull request | Pull request, data update, performance regression |
| CD target | A binary or container image | A container plus a model artifact plus configuration plus possibly a feature pipeline |
| Reproducibility | Deterministic builds from source | Probabilistic; needs data snapshots, random seeds, hardware notes, library pins |
| Rollback | Deploy previous code version | Deploy previous model and possibly previous data and feature pipeline together |
| Failure mode | Crashes, exceptions, latency spikes | All of those plus silent accuracy decay |
| Monitoring | Latency, throughput, error rate | All of those plus accuracy, calibration, drift, fairness, prediction distribution |

The critical difference is that a traditional application produces the same output for the same input given the same code. An ML system's behavior depends on the training data and the statistical relationship between features and target, both of which can change without anyone changing a line of code. This is why MLOps requires monitoring concepts (data drift, concept drift, prediction drift) that have no direct equivalent in DevOps, and why CI pipelines often need to validate data and trained models alongside code.

Culturally, MLOps inherits the DevOps shift away from siloed teams. In a mature MLOps organization, data scientists, ML engineers, platform engineers, and product owners share responsibility for what runs in production rather than throwing artifacts over the wall.

## what are the stages of the ML lifecycle?

MLOps spans the full machine learning lifecycle. The exact stage names vary across vendors, but the phases below appear in nearly every MLOps reference architecture.

### problem framing

Before any data is collected, the team has to decide what business problem the ML system is solving, what success looks like in measurable terms, and whether ML is the right tool at all. This stage produces a problem statement, a success metric, an offline evaluation plan, and a rough idea of the data that will be required. Skipping this step is a common cause of projects that ship technically successful models nobody uses.

### data collection and labeling

Data is the substrate of any ML system. This stage covers identifying data sources, building ingestion pipelines, cleaning and deduplicating records, and labeling examples when supervised learning is required. Labeling can be done in-house, through labeling vendors such as Scale AI or Surge AI, or through programmatic labeling tools such as Snorkel.

Key practices include:

- Data versioning so any model can be reproduced with its exact training data
- Data validation that catches schema changes, missing values, outliers, and distribution shifts before they reach training
- Data lineage that records where each record came from and what transformations were applied
- Privacy and consent reviews, especially for personal data covered by GDPR or similar regimes

### feature engineering

Raw data is rarely fed directly into a model. Feature engineering transforms raw inputs into the numerical or categorical features the model consumes. The same transformations must be applied during training and during serving, otherwise the model encounters distributions at inference time that it never saw at training time. This problem is called training-serving skew and is a common cause of degraded production performance [10].

Feature stores were invented to address this problem. They provide a single definition of each feature and serve those features consistently to both batch training jobs and real-time inference services.
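
The core idea, one feature definition shared by the training and serving paths, can be sketched without a feature store at all. The following is a minimal illustration, not how any particular feature store works internally; the field names (`event_ts`, `amount`) and the three derived features are hypothetical.

```python
import math
from datetime import datetime, timezone


def build_features(raw: dict) -> dict:
    """Single definition of the feature logic (hypothetical feature names)."""
    ts = datetime.fromtimestamp(raw["event_ts"], tz=timezone.utc)
    return {
        "log_amount": math.log1p(max(raw["amount"], 0.0)),
        "hour_of_day": ts.hour,
        "is_weekend": int(ts.weekday() >= 5),  # Saturday=5, Sunday=6
    }


def build_training_rows(history: list) -> list:
    """Batch path: apply the shared definition to every historical record."""
    return [build_features(record) for record in history]


def features_for_request(request: dict) -> dict:
    """Online path: apply the same definition to a single live request."""
    return build_features(request)
```

Because both paths call `build_features`, a change to the transformation cannot reach training without also reaching serving, which is the property that prevents training-serving skew.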

### training

Training selects a model architecture, picks hyperparameters, and runs an optimization process over labeled data to produce a set of model weights. MLOps practices for training include:

- Experiment tracking that records every run with its hyperparameters, code version, data version, metrics, and outputs
- Reproducibility, which requires fixing random seeds, pinning library versions, and documenting hardware
- Automated training pipelines that can be triggered by code changes, new data, or a drift signal rather than by a human running a notebook
- Resource management for GPUs and TPUs, including spot instance handling and elastic scaling
- [Distributed training](/wiki/distributed_training) for models too large to train on a single machine
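
The first two practices in the list above can be illustrated with a small run record. This is a standard-library sketch under stated assumptions, not the API of MLflow or any other tracker: `fingerprint_data` and `make_run_record` are hypothetical helpers, the dataset is assumed to be a small JSON-serializable list, and only Python's built-in `random` generator is seeded (a real training job would also seed its ML framework).

```python
import hashlib
import json
import platform
import random
import sys


def fingerprint_data(rows: list) -> str:
    """Content hash of the training data, used as a simple data version."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_run_record(params: dict, rows: list, code_version: str, seed: int) -> dict:
    """Seed the RNG and capture what is needed to reproduce this run."""
    random.seed(seed)  # only seeds Python's built-in generator
    return {
        "params": params,
        "seed": seed,
        "code_version": code_version,  # e.g. a Git commit hash
        "data_sha256": fingerprint_data(rows),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
```

Storing the record next to the resulting metrics and artifacts gives each run the hyperparameters, code version, data version, and environment notes that the list calls for.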

### validation and evaluation

A trained model is not necessarily a production-ready model. Validation evaluates it against a held-out test set, against the current production model on a shared benchmark, against fairness and safety criteria, and against business metrics. Common validation checks include:

- Performance across data slices and demographic groups, not just overall averages
- Comparison against the current production baseline on the same evaluation data
- Stress tests with adversarial or edge-case inputs
- Calibration of predicted probabilities
- Compliance with documented requirements (latency budget, fairness thresholds, regulatory rules)
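
The first check in the list, per-slice performance, is easy to sketch. The example below assumes each evaluation record is a dict with hypothetical `label` and `prediction` keys plus whatever column defines the slice; accuracy stands in for whichever metric the team actually uses.

```python
from collections import defaultdict


def accuracy_by_slice(records: list, slice_key: str) -> dict:
    """Accuracy computed separately for each value of `slice_key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[slice_key]
        total[group] += 1
        correct[group] += int(record["label"] == record["prediction"])
    return {group: correct[group] / total[group] for group in total}


def worst_slice(records: list, slice_key: str) -> tuple:
    """Return (slice value, accuracy) for the weakest slice."""
    scores = accuracy_by_slice(records, slice_key)
    group = min(scores, key=scores.get)
    return group, scores[group]
```

An overall accuracy of 75 percent can hide a slice at 50 percent, which is why a validation gate often checks the worst slice rather than the average.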

### deployment

Deployment packages the validated model with its dependencies and exposes it for use by downstream consumers. Common deployment patterns include:

| Pattern | Description | Typical use case |
|---------|-------------|-------|
| [Online inference](/wiki/online_inference) | Model responds to individual requests via an API at low latency | Recommendations, fraud scoring, ad ranking, chat |
| [Offline inference](/wiki/offline_inference) | Model processes large batches on a schedule | Credit scoring, churn prediction, lead scoring |
| Streaming inference | Model processes events from a queue with sub-second latency | Real-time anomaly detection |
| Edge deployment | Model runs on a device with intermittent or no connectivity | Mobile keyboards, on-device speech, IoT sensors |
| Shadow deployment | New model runs alongside the production model but does not serve responses | Validation under live traffic before cutover |
| Canary deployment | A small percentage of traffic is routed to the new model | Risk-bounded rollout |
| Blue-green deployment | Two identical environments swap roles after validation | Fast rollback at the cost of double infrastructure |
| A/B test | Traffic is split between two or more model versions | Statistical comparison of business metrics |

Beyond the rollout pattern, deployment also has to handle packaging (typically Docker), serving framework (Triton, KServe, BentoML, vLLM, or a custom server), hardware selection, autoscaling policies, and version pinning.
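
The canary row in the table above can be sketched as a deterministic traffic split. This is one common way to do it, assumed here for illustration: the request (or user) identifier is hashed into a bucket so that the same identifier always reaches the same model version. The function name `route` and the 10,000-bucket resolution are hypothetical choices.

```python
import hashlib


def route(request_id: str, canary_percent: float) -> str:
    """Send roughly `canary_percent` percent of identifiers to the canary model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return "canary" if bucket < canary_percent * 100 else "stable"
```

Hashing rather than random sampling keeps a given user on one version for the whole rollout, which makes per-user metrics comparable and makes the split reproducible in tests.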

### monitoring and observability

Once a model is in production, it must be continuously monitored. Unlike traditional software where bugs are usually introduced by code changes, ML models can degrade silently because the world they operate in changes. Monitoring covers system health (latency, throughput, error rate, GPU and CPU utilization), model behavior (prediction distribution, confidence calibration, output distribution shift), input data (feature drift, missing or out-of-range values), and business outcomes (downstream conversion, revenue, user satisfaction).

### retraining and continuous learning

When monitoring detects performance decay or when fresh labeled data accumulates, the model must be retrained. Mature MLOps automates this loop. Triggers can be schedule-based (retrain every week), threshold-based (retrain when accuracy drops below X), or event-based (retrain when a drift detector fires). The retrained model goes through the same validation and deployment pipeline as the original.
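
The three trigger types can be combined in one small decision function. The sketch below is illustrative only: `RetrainPolicy` and its thresholds are hypothetical, accuracy may be `None` when labels have not yet arrived, and the drift score is assumed to be a single number such as a Population Stability Index.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrainPolicy:
    max_age: timedelta    # schedule-based trigger
    min_accuracy: float   # threshold-based trigger
    max_drift: float      # event-based trigger on a drift score


def retrain_reasons(
    policy: RetrainPolicy,
    last_trained: datetime,
    now: datetime,
    accuracy: Optional[float],
    drift_score: float,
) -> list:
    """Return the list of triggers that fired; an empty list means no retrain."""
    reasons = []
    if now - last_trained >= policy.max_age:
        reasons.append("schedule")
    if accuracy is not None and accuracy < policy.min_accuracy:
        reasons.append("threshold")
    if drift_score > policy.max_drift:
        reasons.append("drift")
    return reasons
```

Returning the reasons rather than a bare boolean lets the pipeline log why a retraining run started, which helps later when auditing model lineage.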

### retirement

Models eventually outlive their usefulness because the underlying business has changed, a better model has replaced them, or they can no longer be safely maintained. MLOps includes a process for cleanly removing a model from production, archiving its weights and documentation, and rerouting any consumers.

## what are the MLOps maturity levels?

Google has proposed a widely cited maturity model for MLOps that defines three levels of automation [2]. Microsoft and other vendors have published variants, but the Google model is the most often referenced.

| Level | Name | What is automated | What is still manual |
|-------|------|-------------------|----------------------|
| 0 | Manual process | Almost nothing. Data scientists train models in notebooks and hand off model files to engineers, who deploy them. | Data pulls, training, validation, deployment, monitoring. Retraining is rare and often forgotten. |
| 1 | ML pipeline automation | Training is wrapped in a reusable pipeline. Continuous training (CT) is enabled, so the model can be retrained automatically when new data arrives or a trigger fires. Feature stores and metadata stores are common. | Code changes still require manual deployment of the pipeline itself. CI/CD for the pipeline is not yet automated. |
| 2 | CI/CD pipeline automation | The training pipeline itself is built, tested, and deployed automatically. Code changes flow through a [CI/CD](/wiki/cicd) system that builds, tests, and pushes new pipeline versions. Models, data, and code are all versioned, validated, and deployed automatically. | Strategy decisions, problem framing, governance review. |

Most organizations begin at Level 0 and progress incrementally. Reaching Level 2 requires substantial investment in tooling, infrastructure, and organizational practices. In practice, many successful teams operate at Level 1 for the bulk of their models and reserve full Level 2 automation for their most critical or highest-volume systems.

Microsoft's Azure-flavored maturity model adds intermediate levels and is sometimes referenced when teams want a more granular self-assessment. Both models share the same underlying message: maturity is about automation, repeatability, and the absence of manual handoffs, not about model sophistication.

## what are the key components of an MLOps stack?

A production MLOps stack is rarely a single product. It is an assembly of components, each addressing a specific concern in the lifecycle.

### data versioning

Data versioning tools track changes to datasets the way Git tracks changes to source code. They allow a model to be rebuilt against the exact data it was trained on. Common tools include DVC (Data Version Control), which layers a Git-like interface over object storage; LakeFS, which provides Git-like operations on top of S3 or Azure Blob Storage; and Pachyderm, which links data versioning with pipeline orchestration on Kubernetes.

### experiment tracking

Experiment tracking systems record every training run with its hyperparameters, training data version, code version, environment details, evaluation metrics, and produced artifacts. By keeping a complete history, they enable comparison of approaches, reproduction of past results, and audit trails for governance. The dominant tools are [MLflow](/wiki/mlflow) Tracking, Weights & Biases, Neptune, and Comet. MLflow is the most widely deployed of these: its maintainers report more than 60 million monthly downloads as of 2025, making it the de facto open-source standard for experiment tracking and model registry [21].

### feature store

A [feature store](/wiki/feature_store) is a centralized service for defining, computing, storing, and serving ML features [10]. It addresses two problems at once: training-serving skew (because the same definition is used for both batch backfills and real-time serving) and feature reuse (because teams can discover and reuse features rather than recomputing them). The category took shape after Uber's Michelangelo Palette demonstrated the pattern. Today the leading options are Feast (open source, modular), Tecton (commercial, opinionated end-to-end), Hopsworks (open source, includes its own RonDB online store), Vertex AI Feature Store (Google), and SageMaker Feature Store (AWS).

### model registry

A model registry is a versioned repository for trained models with their metadata: training data versions, hyperparameters, evaluation metrics, lineage, and deployment status. It is the source of truth for what models exist, what their performance characteristics are, and which one is currently serving traffic. Registries typically support tagging (staging, production, archived), approval workflows, and rollback. MLflow Model Registry, Vertex AI Model Registry, and SageMaker Model Registry are the most widely used implementations.
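
The stage tagging and rollback behavior described above can be illustrated with a toy in-memory registry. This is a sketch of the concept only and does not mirror the API of MLflow, Vertex AI, or SageMaker; the class and method names are hypothetical, and approval workflows are omitted.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    version: int
    metrics: dict
    stage: str = "staging"  # staging -> production -> archived


class Registry:
    def __init__(self) -> None:
        self.versions = {}  # version number -> ModelVersion
        self.history = []   # versions promoted to production, in order

    def register(self, metrics: dict) -> int:
        version = len(self.versions) + 1
        self.versions[version] = ModelVersion(version, metrics)
        return version

    def promote(self, version: int) -> None:
        """Make `version` the production model and archive the previous one."""
        if self.history:
            self.versions[self.history[-1]].stage = "archived"
        self.versions[version].stage = "production"
        self.history.append(version)

    def rollback(self) -> int:
        """Archive the current production model and restore the one before it."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self.versions[self.history.pop()].stage = "archived"
        previous = self.history[-1]
        self.versions[previous].stage = "production"
        return previous
```

Keeping the promotion history separate from the version table is what makes rollback a constant-time operation instead of a search through metadata.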

### pipeline orchestration

Pipeline orchestrators define and execute the directed graph of steps in an ML workflow: extract data, validate it, transform features, train, evaluate, and deploy. Each tool occupies a slightly different niche.

| Orchestrator | Origin | Typical strength | Notes |
|--------------|--------|------------------|-------|
| Apache Airflow | Airbnb, 2014 | Mature ecosystem, hundreds of operators | Originally a general data pipeline tool; widely used for ML by extension |
| [Kubeflow](/wiki/kubeflow) Pipelines | Google, 2018 | Native Kubernetes execution, TFX integration | Powerful but operationally heavy; demands Kubernetes expertise |
| Metaflow | Netflix, 2019 | Data scientist ergonomics, AWS-first | Notebook-friendly DSL; less flexible outside AWS |
| Flyte | Lyft, 2020 | Strongly typed, Kubernetes-native, scalable | Reproducibility guarantees and good big-job support |
| Prefect | 2018 | Python-native decorators, hybrid execution | Good for data engineers transitioning to ML |
| Dagster | 2019 | Asset-centric model, software-defined assets | Strong observability and testing built in |
| ZenML | 2021 | Framework-agnostic abstraction layer | Plugs into other orchestrators as backends |

### model serving

Serving infrastructure runs the model in production, handling autoscaling, batching, and the runtime details of inference. The right choice depends on the model type and the latency budget.

| Server | Best for | Notable features |
|--------|----------|------------------|
| NVIDIA Triton | Heterogeneous models on GPUs | Concurrent model execution, dynamic batching, multi-framework |
| KServe | Kubernetes-native serving | Operates at the orchestration layer, supports many predictors |
| Seldon Core | Kubernetes-native serving with A/B testing | Canary deployments, multi-armed bandits |
| BentoML | Python-first model packaging | Builds OpenAI-compatible APIs, supports many backends |
| TensorFlow Serving | TensorFlow models | Mature, REST and gRPC, optimized for TF graphs |
| TorchServe | PyTorch models | Native PyTorch packaging |
| vLLM | LLM inference | PagedAttention, continuous batching, best-in-class TTFT for LLMs |
| Ray Serve | Composable Python services | Good for ensembles and serving DAGs |

For [large language models](/wiki/large_language_model), server choice is driven by token-level latency and throughput rather than by per-request latency alone. Benchmarks reported by BentoML in 2024 found that vLLM achieved best-in-class time-to-first-token across concurrency levels, while TensorRT-LLM and LMDeploy delivered the highest token generation rates [11]. Triton can host vLLM as a backend, which is a common production pattern for organizations that want vLLM's performance with Triton's multi-model orchestration.

### monitoring and observability

ML monitoring tools track data drift, concept drift, prediction quality, latency, and cost. The category includes Evidently (open source, statistical drift reports), WhyLabs (managed, with WhyLogs as the open-source profiling library), Arize and Arize Phoenix (model and LLM observability), Fiddler AI (model performance management with explainability), and Datadog Model Monitoring (integrated with Datadog APM). Most of these tools have added LLM-specific features since 2023.

## continuous training

Continuous training (CT) is the ML-specific extension of continuous delivery. In a CI/CD pipeline for code, a commit triggers a build, a test run, and a deployment. In a CI/CD/CT pipeline for ML, the same flow applies, but additional triggers can also start a retraining run: new labeled data has accumulated, a drift detector has fired, scheduled retraining is due, or a downstream metric has dropped below a threshold.

A CT pipeline typically includes:

1. A trigger (schedule, drift signal, performance regression, or new data event)
2. Data extraction and validation, including schema and distribution checks
3. Feature transformation, often via a feature store
4. Model training with the chosen architecture and hyperparameters
5. Model evaluation on a held-out set and against the current production baseline
6. A validation gate (the new model must beat the baseline by at least X on metric Y)
7. Registration of the new model in the model registry with full metadata
8. Deployment, typically as a canary or shadow before full rollout
9. Post-deployment monitoring and automatic rollback if metrics regress

The gating step is what distinguishes CT from naive automation. A model that fails validation should not be deployed even if it was produced automatically. This is why a model registry with explicit promotion stages is a near-universal feature of mature MLOps stacks.
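
A validation gate of the kind described in step 6 can be sketched as a pure function over metric dictionaries. The example assumes that higher is better for the primary metric and that guardrails are upper bounds (for instance a latency budget); the function name, metric names, and thresholds are hypothetical.

```python
from typing import Optional


def validation_gate(
    candidate: dict,
    baseline: dict,
    metric: str,
    min_gain: float,
    guardrails: Optional[dict] = None,
) -> tuple:
    """Return (approved, reasons); `reasons` explains every failed check."""
    reasons = []
    gain = candidate[metric] - baseline[metric]
    if gain < min_gain:
        reasons.append(f"{metric} gain {gain:.4f} is below the required {min_gain:.4f}")
    for name, limit in (guardrails or {}).items():
        if candidate[name] > limit:
            reasons.append(f"{name}={candidate[name]} exceeds the limit {limit}")
    return (not reasons, reasons)
```

In a CT pipeline, only a candidate for which this returns `True` would move on to registration and a canary or shadow deployment; a rejected candidate is still logged so the failure reasons are auditable.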

## ci/cd for machine learning

CI/CD for ML extends traditional CI/CD with steps that have no equivalent in pure software delivery.

- Code CI runs the standard checks (lint, unit tests, integration tests) and adds tests that exercise the training pipeline on a small sample dataset to catch errors before a full training run.
- Data CI validates that incoming data matches the expected schema, that distributions have not shifted beyond a tolerance, and that no new categorical values have appeared.
- Model CI trains a new model, evaluates it, and registers it in the model registry. This stage often requires significant compute, so it is usually triggered by changes to training code or by scheduled CT rather than on every commit.
- Model CD packages the validated model into a serving image and rolls it out, typically through a canary or blue-green pattern, with automatic rollback if monitoring detects a regression.

A common practice is to keep these stages in separate pipelines that share a model registry and feature store as connecting interfaces, so that data, model, and code teams can work asynchronously without blocking each other.
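
The Data CI stage can be sketched as a batch validator that checks schema and categorical values. This is a simplified illustration, not the behavior of any specific validation library: the schema format (column name mapped to a Python type) and the `allowed_categories` argument are assumptions, and the distribution check is left out.

```python
def validate_batch(rows: list, schema: dict, allowed_categories: dict) -> list:
    """Return human-readable errors; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(rows):
        # Schema check: every expected column must be present.
        for col in sorted(schema.keys() - row.keys()):
            errors.append(f"row {i}: missing column '{col}'")
        # Type check: present columns must have the expected type.
        for col, expected in schema.items():
            if col in row and not isinstance(row[col], expected):
                errors.append(f"row {i}: '{col}' is not {expected.__name__}")
        # Category check: flag values never seen at training time.
        for col, allowed in allowed_categories.items():
            if col in row and row[col] not in allowed:
                errors.append(f"row {i}: unseen value {row[col]!r} in '{col}'")
    return errors
```

A CI job would fail the build when the returned list is non-empty, so that a schema change or a new categorical value is caught before it reaches a training run.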

## monitoring in detail

Production monitoring is where MLOps does most of its day-to-day work, and it is also where the ML-specific failure modes show up.

### data drift

[Data drift](/wiki/concept_drift), also called covariate shift, occurs when the distribution of input features changes between training and serving. A model trained on customer behavior from 2023 may encounter very different behavior in 2026 as user preferences, payment methods, or platform mix evolve. Drift can be gradual (seasonality, slow demographic change) or sudden (a global event, a competitor launch, a feature pipeline bug).

Detection requires statistical tests on incoming feature distributions compared to a reference distribution (typically the training set). Common tests include the Kolmogorov-Smirnov (KS) test for continuous features, the chi-square test for categorical features, the Population Stability Index (PSI) used heavily in credit risk, and the Jensen-Shannon and Kullback-Leibler divergences. Each test has tradeoffs. The KS test is sensitive but can fire too often on large datasets where small changes are statistically significant but practically irrelevant. PSI gives a single interpretable number with widely used thresholds (0.1 for minor change, 0.25 for major change). KL divergence is asymmetric and requires care when comparing distributions [12].
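
Two of the measures above are short enough to write out. The sketch below computes PSI as the sum over bins of (current fraction minus reference fraction) times the log of their ratio, and KL divergence as the sum of p times log(p/q). Bin edges are supplied by the caller, and empty bins are floored at a small `eps` to avoid taking the log of zero; both choices are assumptions made for illustration, and production tools typically derive bins from reference quantiles.

```python
import math
from bisect import bisect_right


def bin_fractions(values: list, edges: list, eps: float = 1e-6) -> list:
    """Fraction of values per bin; empty bins are floored at eps."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(reference: list, current: list, edges: list) -> float:
    """Population Stability Index between two samples of one feature."""
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def kl_divergence(p: list, q: list) -> float:
    """KL(p || q) for two discrete distributions with nonzero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Comparing a sample with itself gives a PSI of exactly zero, while shifting every value produces a score far above the 0.25 threshold. Swapping the arguments of `kl_divergence` generally changes the result, which is the asymmetry noted above.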

### [concept drift](/wiki/concept_drift)

Concept drift refers to a change in the relationship between input features and the target variable, even when the input distribution looks stable. A churn model can decay because the same demographic and behavioral signals now predict churn at different rates than they used to. Concept drift is harder to detect than data drift because it requires labeled outcome data, and labels are usually delayed.

### prediction drift

Prediction drift watches the distribution of model outputs. A sudden shift in the fraction of positive predictions, the average predicted score, or the entropy of class probabilities can indicate either upstream data drift or concept drift, even before labels arrive. Prediction drift is often the first signal an MLOps team has that something has changed.
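
Two of the signals mentioned above, the positive-prediction rate and the entropy of predicted probabilities, need no labels and can be computed directly from a window of scores. The sketch assumes a binary classifier that emits probabilities; the 0.5 decision threshold is a hypothetical default.

```python
import math


def positive_rate(scores: list, threshold: float = 0.5) -> float:
    """Fraction of predictions at or above the decision threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def mean_binary_entropy(scores: list) -> float:
    """Average entropy in bits of the predicted probabilities (0 = certain)."""
    def entropy(p: float) -> float:
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(entropy(s) for s in scores) / len(scores)
```

Tracking both numbers per time window and comparing them with a reference window gives an early warning: a jump in positive rate suggests the inputs moved, while rising entropy suggests the model is becoming less certain.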

### system metrics

In addition to ML-specific monitoring, the team must watch the same system metrics any production service has: request rate, latency at p50 and p99, error rate, queue depth, GPU and CPU utilization, memory, and cost. For LLM services, latency is often broken into time-to-first-token (TTFT) and inter-token latency rather than treated as a single number, because user perception of quality depends heavily on how quickly the response begins streaming.
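
The split between time-to-first-token and inter-token latency can be measured by timestamping each token as it arrives from a stream. The helper below is a hypothetical sketch that works with any iterable of tokens; the clock is injectable so the function can be tested deterministically, and `time.perf_counter` is used by default.

```python
import time
from typing import Callable, Iterable, Optional


def stream_latency(
    tokens: Iterable,
    clock: Callable[[], float] = time.perf_counter,
) -> Optional[dict]:
    """Measure TTFT and mean inter-token latency (seconds) for a token stream."""
    start = clock()
    arrivals = []
    for _ in tokens:              # each iteration waits for the next token
        arrivals.append(clock())  # timestamp taken once the token has arrived
    if not arrivals:
        return None
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return {
        "ttft": arrivals[0] - start,
        "mean_inter_token": sum(gaps) / len(gaps) if gaps else 0.0,
        "tokens": len(arrivals),
    }
```

In practice the iterable would be the streaming response of an LLM client, and the resulting numbers would be exported to the same metrics system that tracks p50 and p99 request latency.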

### calibration and fairness

Calibration measures whether predicted probabilities match observed frequencies. A well-calibrated binary classifier that outputs 0.7 should be correct about 70 percent of the time on examples in that score band. Calibration can decay independently of accuracy and matters wherever the model's score is used to drive a downstream decision.
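
One common way to summarize calibration is the expected calibration error (ECE): predictions are grouped into equal-width score bins, and the gap between average predicted probability and observed positive rate is averaged across bins, weighted by bin size. The sketch below assumes binary labels encoded as 0 and 1 and uses ten bins as an arbitrary default.

```python
def expected_calibration_error(probs: list, labels: list, n_bins: int = 10) -> float:
    """Weighted average gap between predicted and observed positive rates."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(p for p, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(avg_confidence - observed_rate)
    return ece
```

For the example in the text, ten predictions of 0.7 with seven positives give an ECE of essentially zero, whereas ten predictions of 0.9 with only five positives give an ECE of about 0.4.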

Fairness monitoring tracks model behavior across protected groups (race, gender, age, geography) using metrics such as disparate impact, equal opportunity difference, and demographic parity. Fairness regressions are particularly important under regimes such as the EU AI Act that explicitly require ongoing bias monitoring for high-risk systems.
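
The three metrics named above reduce to simple ratios and differences of per-group rates. The sketch below assumes binary predictions and labels encoded as 0 and 1, two groups per comparison, and at least one positive prediction and one positive label in each group (otherwise the divisions fail); the function names are illustrative rather than taken from a fairness library.

```python
def selection_rate(preds: list) -> float:
    """Fraction of the group that received a positive prediction."""
    return sum(preds) / len(preds)


def disparate_impact(preds_a: list, preds_b: list) -> float:
    """Ratio of selection rates, group A over group B (1.0 means parity)."""
    return selection_rate(preds_a) / selection_rate(preds_b)


def demographic_parity_difference(preds_a: list, preds_b: list) -> float:
    """Difference in selection rates, group A minus group B."""
    return selection_rate(preds_a) - selection_rate(preds_b)


def true_positive_rate(preds: list, labels: list) -> float:
    """Fraction of actual positives that the model predicted as positive."""
    hits = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 1)
    return hits / sum(labels)


def equal_opportunity_difference(preds_a, labels_a, preds_b, labels_b) -> float:
    """Difference in true positive rates, group A minus group B."""
    return true_positive_rate(preds_a, labels_a) - true_positive_rate(preds_b, labels_b)
```

Monitoring these values over time, rather than only at validation, is what turns a one-off fairness audit into the ongoing bias monitoring that the paragraph describes.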

## how does LLMOps differ from MLOps?

[LLMOps](/wiki/llmops) is the operational discipline for systems built around [large language models](/wiki/large_language_model). It is a specialization of MLOps rather than a replacement, but enough is different that it has acquired its own tooling and vocabulary.

The biggest differences are that the model itself is often a third-party API (OpenAI, Anthropic, Google), the most important artifacts are prompts and retrieval pipelines rather than weights, the cost driver is inference tokens rather than training compute, and quality cannot be reduced to a single accuracy number because outputs are open-ended text.

| Aspect | Traditional MLOps | LLMOps |
|--------|------------------|--------|
| Model | Trained or fine-tuned in-house | Often consumed via API; sometimes fine-tuned with LoRA or QLoRA |
| Versioned artifacts | Code, data, model weights | Code, prompts, retrieval indexes, evaluation datasets, sometimes weights |
| Quality measurement | Standard metrics (accuracy, F1, AUC) | LLM-as-judge, human evaluation, rubric-based scoring, red-teaming |
| Latency | Single inference time | Time-to-first-token plus inter-token latency |
| Cost driver | Training compute | Per-token inference cost; long contexts and large output windows |
| Safety concerns | Bias, fairness | Hallucinations, prompt injection, PII leakage, jailbreaks, content policy |
| Data pipeline | Feature engineering | RAG corpus curation, embedding management, chunking strategy |
| Failure mode | Numeric drift | Hallucinated facts, format breakage, refusal regressions, drift in tone |

Specialized LLMOps tooling has emerged rapidly. LangSmith (closed source, deeply integrated with LangChain) provides tracing, evaluations, and prompt versioning. Langfuse is open source and framework-agnostic, popular for self-hosting. Helicone is a proxy-based observability tool that requires almost no code changes. PromptLayer focuses on prompt management with a no-code editor. Arize Phoenix offers open-source LLM tracing with notebook-friendly visualizations. Humanloop targets prompt evaluation workflows for product teams.

Generative AI gateways are another LLMOps category that did not exist in classical MLOps. LiteLLM provides a unified Python interface across more than 100 LLM providers and integrates logging out of the box. Portkey adds routing, retries, fallbacks, guardrails, and budgets on top of an OpenAI-compatible API. These gateways play a role analogous to a service mesh in microservices: they sit between the application and the model providers and enforce cross-cutting concerns.
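
The retry-and-fallback behavior a gateway provides can be sketched in a few lines. This is not the LiteLLM or Portkey API: each provider here is a hypothetical callable that takes a prompt and returns text, standing in for a real client.

```python
from typing import Callable, Sequence

# A provider is any callable mapping a prompt to a completion (hypothetical stand-in).
Provider = Callable[[str], str]


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Provider]],
    retries_per_provider: int = 1,
) -> tuple[str, str]:
    """Try providers in order, retrying each, and return (provider name, completion)."""
    errors: list[str] = []
    for name, call in providers:
        for attempt in range(retries_per_provider + 1):
            try:
                return name, call(prompt)
            except Exception as exc:  # a real gateway would filter retryable errors
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Centralizing this logic in one layer is what lets logging, budgets, and guardrails be enforced once rather than in every application.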

## ml governance and responsible ai

As ML systems have moved into hiring, lending, healthcare, and law enforcement, governance has become a first-class MLOps concern rather than an afterthought.

### model cards and datasheets

The [model card](/wiki/model_card) was introduced by Margaret Mitchell and colleagues at Google in 2018 (published 2019) as a short document accompanying a trained model that describes its intended use, performance across demographic groups, evaluation conditions, ethical considerations, and known limitations [13]. Model cards have since been adopted by Hugging Face (where a model repository's README doubles as its card), Google's Vertex AI, and many internal MLOps platforms.

Datasheets for datasets, proposed by Timnit Gebru and colleagues in 2018, play the analogous role for training data. A datasheet describes the dataset's motivation, composition, collection process, labeling, recommended uses, and maintenance plan, inspired by the datasheets that accompany electronic components.

Google later introduced Data Cards (a slightly different format from datasheets) and Hugging Face uses Dataset Cards on its Hub. The general pattern of structured, machine-readable model and data documentation is now a standard MLOps practice.
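
A machine-readable model card can be as simple as a typed record serialized to JSON and stored next to the model artifact. The sketch below uses the sections listed for model cards above; the field names are illustrative assumptions and do not follow the Hugging Face or Google schemas.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelCard:
    model_name: str
    version: str
    intended_use: str
    evaluation_conditions: str
    # metric values keyed by group name, then metric name
    metrics_by_group: dict[str, dict[str, float]] = field(default_factory=dict)
    ethical_considerations: list[str] = field(default_factory=list)
    limitations: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections left empty, useful as a release gate."""
        return [name for name, value in asdict(self).items() if not value]

    def to_json(self) -> str:
        """Stable JSON form suitable for storing alongside the model artifact."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

A CI step could refuse to register a model whose `missing_sections()` is non-empty, turning documentation into an enforced part of the pipeline.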

### nist ai risk management framework

The US National Institute of Standards and Technology released the AI Risk Management Framework (AI RMF 1.0) in January 2023 as a voluntary, sector-agnostic guide for managing AI risk across the lifecycle [14]. The framework defines four functions:

| Function | Purpose |
|----------|---------|
| GOVERN | Establish leadership, accountability, and risk-tolerance policies for AI across the organization |
| MAP | Understand the context of use, stakeholders, and potential impacts |
| MEASURE | Quantitatively and qualitatively assess AI risks and impacts |
| MANAGE | Prioritize and allocate resources to treat the risks identified through MAP and MEASURE, following the policies set under GOVERN |

NIST followed in July 2024 with AI 600-1, a generative-AI-specific profile, and also publishes a voluntary playbook that maps the framework to concrete actions. AI RMF has become a common reference for MLOps governance even outside the US, because it is technology-agnostic and easy to overlay on existing development processes.

### eu ai act

The EU AI Act (Regulation EU 2024/1689) entered into force on 1 August 2024 and is the world's first comprehensive AI regulation [15]. The Act carries heavy penalties that make MLOps governance a financial imperative: under Article 99, engaging in a prohibited AI practice can draw administrative fines of up to 35 million euros or 7 percent of total worldwide annual turnover, whichever is higher, while non-compliance with high-risk system obligations can reach 15 million euros or 3 percent of turnover [22]. The Act classifies AI systems by risk: prohibited practices (such as social scoring) are banned, high-risk systems (in areas like employment, education, credit, law enforcement, and biometrics) face strict compliance obligations, limited-risk systems carry transparency requirements, and minimal-risk systems are largely unregulated. General-purpose AI (GPAI) models have their own dedicated rules.
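
The "whichever is higher" rule means the percentage cap dominates for large companies. The sketch below works through that arithmetic for the two tiers quoted above. It is illustrative only: the tier names are made up for the example, and it ignores every other provision of Article 99.

```python
# (fixed cap in euros, percent of worldwide annual turnover) per tier, as quoted above
FINE_CAPS = {
    "prohibited_practice": (35_000_000, 7),
    "high_risk_obligation": (15_000_000, 3),
}


def max_fine_eur(worldwide_turnover_eur: int, tier: str) -> float:
    """Upper bound of the fine: the higher of the fixed cap and the turnover share."""
    fixed_cap, percent = FINE_CAPS[tier]
    return max(fixed_cap, percent * worldwide_turnover_eur / 100)
```

For a company with 1 billion euros in turnover, the prohibited-practice ceiling is 70 million euros rather than 35 million, because 7 percent of turnover exceeds the fixed amount.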

Providers of high-risk systems must implement a continuous risk management process across the lifecycle, use representative and high-quality training data, maintain technical documentation and event logs, ensure human oversight, achieve adequate accuracy and cybersecurity, register the system in an EU database, and conduct conformity assessments. Most of these obligations map directly onto MLOps concerns: lineage, monitoring, model cards, audit logs, and structured deployment processes.

The Act's timeline is staggered. Prohibitions and AI literacy obligations applied from 2 February 2025. Governance rules and obligations for GPAI models took effect on 2 August 2025. The full regime applies on 2 August 2026, with high-risk systems embedded in regulated products getting until 2 August 2027. As of early 2026, MLOps teams operating in or selling into the EU are actively building documentation pipelines, conformity-assessment artifacts, and quality management systems to meet these deadlines.

### data-centric ai

A related strand of governance work is [data-centric AI](/wiki/data-centric_ai), the argument popularized by Andrew Ng that improving data quality often yields more reliable production performance than tweaking model architecture. Data-centric practices (consistent labeling guidelines, error analysis on slices, targeted data collection for failure cases) have become standard parts of MLOps for high-stakes systems.

## anti-patterns

Despite a decade of public MLOps writing, certain anti-patterns recur in organizations that are early in their ML journey.

- The throw-it-over-the-wall handoff, where data scientists produce a model artifact and engineers are expected to make it work in production without context.
- Notebook in production, where Jupyter notebooks are run on a schedule as if they were production services. Notebooks are designed for interactive exploration, not for reproducible, monitored execution.
- Manual deployment, where the steps to deploy a new model exist only in someone's head or in a Confluence page. Eventually that person leaves.
- Fire-and-forget models, where a model is deployed once and never monitored or retrained. These models silently decay until a downstream metric drops far enough for someone to notice.
- Single-model monitoring, where the team monitors one production model in detail but cannot say what is happening with the dozens of other models also serving traffic.
- Untracked feature pipelines, where the same features are computed differently in training jobs and serving services, causing training-serving skew.
- Reproducibility theater, where the code is in Git but the data and library versions are not pinned, so the model cannot actually be rebuilt.
- Last-mile fragility, where everything works in staging but fails on the first production traffic spike because autoscaling and batching were never tested.

MLOps maturity is largely about systematically eliminating these patterns. The Sculley paper's central message holds up: most of the cost of ML in production is not in the model but in the surrounding infrastructure and process.
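
Some of these anti-patterns can be caught with small automated checks. As one example, training-serving skew from untracked feature pipelines can be detected by running both code paths on the same sample records and comparing the results. The two feature functions below are hypothetical and contain a deliberate bug (integer division on the serving side) to show what the check reports.

```python
import math
from typing import Callable, Mapping, Sequence

FeatureFn = Callable[[Mapping[str, float]], Mapping[str, float]]


def feature_skew(records: Sequence[Mapping[str, float]],
                 training_fn: FeatureFn, serving_fn: FeatureFn,
                 rel_tol: float = 1e-9) -> list[tuple]:
    """Return (record index, feature name, training value, serving value) per mismatch."""
    mismatches = []
    for i, record in enumerate(records):
        train_feats = training_fn(record)
        serve_feats = serving_fn(record)
        for name in sorted(set(train_feats) | set(serve_feats)):
            a, b = train_feats.get(name), serve_feats.get(name)
            # a feature missing on either side also counts as skew
            if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
                mismatches.append((i, name, a, b))
    return mismatches


def training_features(record):  # hypothetical batch-job implementation
    return {"account_age_days": record["age_seconds"] / 86400}


def serving_features(record):  # hypothetical online implementation with a bug
    return {"account_age_days": record["age_seconds"] // 86400}
```

Running such a parity check in CI on a fixed sample of records is cheaper than discovering the mismatch through degraded production metrics.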

## tooling landscape

The modern MLOps ecosystem is wide and continues to consolidate around a few dominant patterns.

### open source

| Tool | Category | Notes |
|------|----------|-------|
| [MLflow](/wiki/mlflow) | Tracking, registry, deployment | Most widely adopted open-source MLOps platform; MLflow 3 (June 2025) added GenAI support |
| [Kubeflow](/wiki/kubeflow) | Orchestration, training, serving | Kubernetes-native; broad but operationally heavy |
| Apache Airflow | Workflow orchestration | General-purpose; widely used as the default DAG runner |
| Prefect | Workflow orchestration | Python-native decorators, hybrid execution |
| Dagster | Workflow orchestration | Asset-centric model with strong observability |
| Flyte | Workflow orchestration | Strongly typed, Kubernetes-native |
| Metaflow | Workflow orchestration | Netflix-origin, AWS-first, data scientist friendly |
| ZenML | Abstraction layer | Plugs into other orchestrators as backends |
| DVC | Data versioning | Git-like data tracking |
| lakeFS | Data versioning | Git-like operations on object storage |
| Pachyderm | Data versioning + pipelines | Lineage-aware processing on Kubernetes |
| Feast | Feature store | The reference open-source feature store |
| Hopsworks | Feature store + lakehouse | Includes RonDB online store |
| Great Expectations | Data validation | Tests for data quality and schema |
| BentoML | Model packaging and serving | OpenAI-compatible APIs, multiple backends |
| Seldon Core | Model serving on Kubernetes | A/B testing and canary deployment patterns |
| KServe | Model serving on Kubernetes | Standardized inference runtime |
| Ray | Distributed compute, serving | Ray Train, Ray Tune, Ray Serve cover much of the lifecycle |
| Triton Inference Server | Model serving | Multi-framework, GPU-optimized |
| vLLM | LLM serving | PagedAttention and continuous batching; competitive TTFT in published benchmarks |
| Evidently | Monitoring | Open-source data and model drift reports |
| whylogs | Data profiling | Lightweight statistical fingerprints |
| Arize Phoenix | LLM observability | Notebook-friendly tracing |
| Langfuse | LLM observability | Open-source, framework agnostic |
| LiteLLM | LLM gateway | Unified API across providers |

### cloud platforms

| Platform | Provider | Strength |
|----------|----------|----------|
| Amazon SageMaker | [AWS](/wiki/amazon_web_services) | End-to-end platform with notebooks, training, serving, registry, and pipelines; deep integration with the AWS ecosystem |
| Vertex AI | [Google Cloud](/wiki/google_cloud_terms) | Unified ML platform with AutoML, custom training, feature store, Pipelines, and tight integration with Gemini |
| Azure Machine Learning | [Azure](/wiki/azure) | Enterprise ML with responsible AI dashboards, managed endpoints, and tight integration with the Microsoft ecosystem |
| Databricks | Databricks | Lakehouse-centric with Mosaic AI, Unity Catalog, and managed MLflow |
| Snowflake Cortex | Snowflake | Brings models close to data inside the warehouse |

### managed and commercial tools

| Tool | Category | Notes |
|------|----------|-------|
| Weights & Biases | Experiment tracking, model registry, evaluations | The dominant commercial alternative to MLflow Tracking |
| Neptune | Experiment tracking | Strong for large-scale experiment management |
| Comet | Experiment tracking | Includes LLM features as Opik |
| Tecton | Feature platform | Born from the Uber Michelangelo team |
| Modal | Compute platform | Serverless GPUs for training and inference |
| Anyscale | Managed Ray | Ray-as-a-service for distributed ML |
| Hugging Face | Model and dataset hub plus Inference Endpoints | The de facto registry for open-weight models |
| Arize | Model and LLM observability | Combines classical ML monitoring with LLM tracing |
| Fiddler AI | Model performance management | Explainability and bias monitoring |
| WhyLabs | Monitoring | Built on the open-source whylogs library |
| LangSmith | LLM observability | LangChain-integrated tracing and evaluations |
| Helicone | LLM observability | Proxy-based, low setup overhead |
| Portkey | LLM gateway | Routing, fallbacks, guardrails, budgets |
| Humanloop | Prompt evaluation | Product-team-friendly LLM workflows |

The consensus pattern in 2026 is composition. A typical mature stack might combine MLflow for experiment tracking and registry, Feast or Tecton for features, a Kubernetes-based orchestrator like Kubeflow or Flyte for pipelines, KServe or Triton for serving, and Evidently or Arize for monitoring. LLM workloads add LangSmith or Langfuse for tracing and a gateway like LiteLLM or Portkey for provider abstraction.

## case studies

### uber michelangelo

Michelangelo is one of the longest-running production MLOps platforms at internet scale [6]. It has gone through three major architectural generations: an initial monolithic platform focused on classical ML, a redesign that introduced a Lego-like plug-and-play model so teams could swap in best-of-breed open-source components, and a third generation that added support for deep learning and generative AI. By 2024 the platform was running 100 percent of Uber's business-critical ML workloads, including LLM-based customer service bots. Michelangelo's early feature store was an industry milestone, and the team that built it later founded Tecton.

### netflix metaflow and metaflow hosting

Netflix's machine learning platform team built Metaflow with the explicit goal of optimizing for the data scientist's experience [8]. Metaflow's Python DSL lets a scientist describe a workflow as decorated Python steps, and Metaflow handles versioning, scheduling, and resource provisioning. Backing Metaflow are Titus (Netflix's container platform on AWS), Maestro (Netflix's open-source workflow orchestrator), Atlas (Netflix's time-series telemetry system, processing 17 billion metrics per day), and Metaflow Hosting for deploying models behind a managed inference endpoint. The system supports hundreds of personalization models in production and serves predictions to more than 230 million members.

### facebook fblearner flow

FBLearner Flow is one of the longest-tenured large-scale workflow platforms in the industry's MLOps story [5]. By the time of public disclosure in 2016 it was running ML workflows for ranking, ads, and content understanding across Facebook. Its evolution illustrates an important pattern: a platform built for ML often becomes a general-purpose workflow runner, because the orchestration, versioning, and resource-management problems are not actually ML-specific.

### google tfx

TFX is the open-source distillation of how Google runs production ML on TensorFlow [7]. A TFX pipeline is a sequence of components (ExampleGen, StatisticsGen, SchemaGen, Transform, Trainer, Evaluator, Pusher) that share data through ML Metadata, a service that records artifact lineage and configuration. TFX runs on multiple orchestrators (Apache Beam, Apache Airflow, Kubeflow Pipelines, Vertex AI Pipelines), which means a single TFX pipeline definition can move between local development, on-premises Kubernetes, and managed cloud execution.

## the people side

MLOps is a sociotechnical practice. The technology is necessary but not sufficient.

The MLOps Engineer role emerged as a distinct title around 2020 and has become common at companies running ML at scale. The role sits between data science, DevOps, and platform engineering. According to 2025-2026 salary data, MLOps engineers in the United States earn on average around 130,000 to 161,000 dollars per year, with senior roles reaching 200,000 dollars or more [16]. Compensation has grown roughly 20 percent year over year through 2025, reflecting strong demand for people who can bridge ML and operations.

A common organizational model in larger companies is to have a dedicated ML platform team that builds internal tools (a feature store, a model registry, training infrastructure, monitoring) used by multiple product teams. The platform team's job is to make ML deployment a self-service experience for data scientists rather than a custom engineering project for each model. This is essentially platform engineering applied to ML.

Smaller organizations often combine MLOps with data engineering, DevOps, or ML engineering. The boundary between roles matters less than the practices: someone has to own the production pipeline, someone has to monitor it, and someone has to retrain when it decays.

## what is the state of MLOps in 2025 and 2026?

MLOps in 2026 is mature in the sense that the core practices and tools are well established, but the field is still moving rapidly because the underlying technology continues to evolve.

A few patterns dominate the recent landscape:

The convergence of MLOps and LLMOps is well underway. The same platforms increasingly support both classical models and LLMs, and many MLOps teams now serve both kinds of workload. MLflow's 3.0 release in 2025 added generative AI tracking. Vertex AI, SageMaker, and Azure ML have all added LLM-specific features. Tools that started as LLM observability platforms (Arize Phoenix, Langfuse) are increasingly used for classical ML too.

Modular and composable architectures have replaced the monolithic vision of "one platform for all of MLOps." Teams pick best-of-breed tools and connect them through standard interfaces, especially the model registry and feature store. The AI gateway has emerged as a new architectural component for the LLM stack.

Governance has shifted from optional to mandatory for many systems. The EU AI Act and sector-specific regulators (FDA for medical devices, OCC for credit, FINRA for finance) set binding requirements, and voluntary frameworks such as the NIST AI RMF and ISO/IEC 42001 are widely used as references; all of them call for structured documentation and ongoing oversight. MLOps teams that already practiced version control, monitoring, and audit logging are well positioned; those that did not are scrambling.

Cost optimization has become central to MLOps for LLM workloads. Token-based pricing, large context windows, and high-volume inference make the cost dimension impossible to ignore. Techniques like model routing, prompt caching, [retrieval](/wiki/retrieval_augmented_generation) to reduce context size, quantization, and serving smaller distilled models for less demanding queries are now standard practice.
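
Token-based cost and a simple routing rule can be sketched as follows. The prices, the context limit, and the `needs_reasoning` flag are hypothetical placeholders, not any provider's actual pricing or API; a real router would typically use a classifier or heuristics to decide which requests need the larger model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    name: str
    input_per_million: float   # hypothetical USD per 1M input tokens
    output_per_million: float  # hypothetical USD per 1M output tokens


def request_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request under per-token pricing."""
    total = input_tokens * price.input_per_million + output_tokens * price.output_per_million
    return total / 1_000_000


def route(prompt_tokens: int, needs_reasoning: bool,
          small: ModelPrice, large: ModelPrice,
          small_context_limit: int = 8_000) -> ModelPrice:
    """Send a request to the small model unless it is flagged as hard or too long."""
    if needs_reasoning or prompt_tokens > small_context_limit:
        return large
    return small
```

With the example prices used in testing (0.5 and 2.0 for the small model, 4.0 and 16.0 for the large one), the same 2,000-token prompt with a 500-token answer costs eight times more on the large model, which is the gap that routing tries to exploit.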

Agent orchestration is the newest frontier. Production AI systems in 2026 are increasingly composed of multiple interacting components: foundation models, fine-tuned adapters, retrieval systems, tools, guardrails, routing logic, and feedback loops. Each piece has its own lifecycle and failure modes. Tracing tools that visualize an entire agent run as a tree of nested spans (Langfuse, LangSmith, Arize Phoenix, OpenTelemetry semantic conventions for AI) are becoming the operational equivalent of distributed tracing for microservices.
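
The "tree of nested spans" idea can be illustrated with a toy tracer. This is a standard-library sketch for explanation only; it is not the OpenTelemetry, Langfuse, or LangSmith API, and the class and function names are assumptions.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start: float = 0.0
    end: float = 0.0
    children: list["Span"] = field(default_factory=list)


class Tracer:
    """Records spans as a tree based on how `with` blocks are nested."""

    def __init__(self) -> None:
        self.roots: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str):
        node = Span(name, start=time.perf_counter())
        # attach to the currently open span, or start a new root
        parent_list = self._stack[-1].children if self._stack else self.roots
        parent_list.append(node)
        self._stack.append(node)
        try:
            yield node
        finally:
            node.end = time.perf_counter()
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten a span tree into indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines
```

Wrapping the retrieval step, each model call, and each tool call in its own span produces a tree that shows where an agent run spent its time and which step failed.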

The enduring mantra of MLOps remains "version everything": code, data, models, prompts, configurations, environments. Teams that adopt this discipline along with automated CI/CD and production-grade monitoring consistently achieve better outcomes than those that rely on manual processes. The Sculley paper from 2015 turned out to be exactly right: the model itself is the smallest part of an ML system, and most of the work is in everything around it.
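
One lightweight way to apply "version everything" is to derive a single fingerprint from the version identifiers of every input to a run. The sketch below is an assumption-laden illustration: the key names are made up, and in practice the values would come from Git, a data versioning tool, and a prompt registry.

```python
import hashlib
import json


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of raw content such as a prompt or a data file."""
    return hashlib.sha256(data).hexdigest()


def run_fingerprint(versions: dict[str, str]) -> str:
    """Short, order-independent fingerprint over all version identifiers of a run."""
    # sort_keys makes the serialization canonical, so key order does not matter
    canonical = json.dumps(versions, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

If any single input changes (a prompt edit, a new data snapshot, a config tweak), the fingerprint changes too, which makes it easy to tell whether two deployed models were really built from the same inputs.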

## see also

- [DevOps](/wiki/devops)
- [LLMOps](/wiki/llmops)
- [Feature store](/wiki/feature_store)
- [Model card](/wiki/model_card)
- [MLflow](/wiki/mlflow)
- [Kubeflow](/wiki/kubeflow)
- [CI/CD](/wiki/cicd)
- [Distributed training](/wiki/distributed_training)
- [Online inference](/wiki/online_inference)
- [Offline inference](/wiki/offline_inference)
- [Concept drift](/wiki/concept_drift)
- [Data-centric AI](/wiki/data-centric_ai)

## references

1. [What is MLOps? (AWS)](https://aws.amazon.com/what-is/mlops/)
2. [MLOps: Continuous delivery and automation pipelines in machine learning (Google Cloud)](https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning)
3. [What is MLOps? (Databricks)](https://www.databricks.com/glossary/mlops)
4. [Sculley et al., "Hidden Technical Debt in Machine Learning Systems," NeurIPS 2015](https://papers.nips.cc/paper/5656-hidden-technical-debt-in-machine-learning-systems)
5. [Introducing FBLearner Flow: Facebook's AI backbone (Engineering at Meta, 2016)](https://engineering.fb.com/2016/05/09/core-infra/introducing-fblearner-flow-facebook-s-ai-backbone/)
6. [Meet Michelangelo: Uber's Machine Learning Platform (Uber Blog)](https://www.uber.com/blog/michelangelo-machine-learning-platform/)
7. [TFX: ML Production Pipelines (TensorFlow)](https://www.tensorflow.org/tfx)
8. [Supporting Diverse ML Systems at Netflix (Netflix Tech Blog)](https://netflixtechblog.com/supporting-diverse-ml-systems-at-netflix-2d2e6b6d205d)
9. [Introducing MLflow: an Open Source Machine Learning Platform (Databricks, June 2018)](https://www.databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html)
10. [A Comparative Analysis of Modern Feature Stores: Feast vs. Tecton vs. Hopsworks (Uplatz)](https://uplatz.com/blog/a-comparative-analysis-of-modern-feature-stores-feast-vs-tecton-vs-hopsworks/)
11. [Benchmarking LLM Inference Backends (BentoML)](https://www.bentoml.com/blog/benchmarking-llm-inference-backends)
12. [Which test is the best? We compared 5 methods to detect data drift (Evidently AI)](https://www.evidentlyai.com/blog/data-drift-detection-large-datasets)
13. [Mitchell et al., "Model Cards for Model Reporting," FAT* 2019](https://arxiv.org/abs/1810.03993)
14. [NIST AI 100-1: Artificial Intelligence Risk Management Framework (AI RMF 1.0)](https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf)
15. [High-level summary of the EU AI Act](https://artificialintelligenceact.eu/high-level-summary/)
16. [MLOps Engineer Salary Guide (KORE1, 2026)](https://www.kore1.com/mlops-engineer-salary-guide/)
17. [What is MLOps? Benefits, Challenges & Best Practices (lakeFS)](https://lakefs.io/mlops/)
18. [vLLM vs Triton vs KServe: Model Serving on Kubernetes (Kubenatives)](https://www.kubenatives.com/p/vllm-vs-triton-vs-kserve-kubernetes)
19. [The Complete Guide to LLM Observability Platforms (Helicone)](https://www.helicone.ai/blog/the-complete-guide-to-LLM-observability-platforms)
20. [Practitioners Guide to Machine Learning Operations (MLOps) (Google Cloud)](https://cloud.google.com/resources/mlops-whitepaper)
21. [MLflow: open source AI platform (GitHub, mlflow/mlflow)](https://github.com/mlflow/mlflow)
22. [Article 99: Penalties (EU Artificial Intelligence Act)](https://artificialintelligenceact.eu/article/99/)