MLOps (Machine Learning Operations) is a set of practices, principles, and tools for deploying, monitoring, and maintaining machine learning models in production reliably and efficiently. The discipline draws heavily from DevOps but extends those ideas to the unique requirements of ML systems, where data, model weights, hyperparameters, and shifting real-world conditions are first-class artifacts alongside code. MLOps has become essential as organizations move from experimental notebooks to systems that must operate at scale, remain accurate over time, and meet business and regulatory requirements for reliability and governance [1].
The field crystallized around 2018 to 2019, when teams that had built early production ML infrastructure (Google with TFX, Uber with Michelangelo, Facebook with FBLearner Flow, Netflix with Metaflow) began publishing the operational lessons they had learned. The launch of MLflow by Databricks in June 2018 gave the open-source community a common vocabulary, and the term "MLOps" itself entered widespread use shortly after. By the mid-2020s, MLOps had branched into specialized subdisciplines such as LLMOps for generative models, while regulatory frameworks like the EU AI Act made formal lifecycle governance a legal requirement for many systems.
At its core, MLOps addresses a fundamental problem: machine learning models that perform well on a held-out test set during research often fail or quietly degrade once deployed. Unlike traditional software, an ML system depends not only on its source code but also on the data it was trained on, the model weights produced by training, the hyperparameters used, and the statistical properties of the environment it operates in. When any of these change, the system can break in ways that are hard to detect because the code is still running and returning numerical answers.
MLOps formalizes the practices needed to manage that complexity. It covers the entire lifecycle of an ML model, from initial problem framing and data collection through deployment, monitoring, retraining, and eventual retirement. The goal is to make ML systems as reliable, reproducible, and maintainable as conventional software systems while accounting for the data and model dependencies that make ML different.
Google Cloud defines MLOps as a practice that "aims to unify ML system development and ML system deployment in order to standardize and streamline the continuous delivery of high-performing models in production" [2]. AWS describes it as combining "ML system development (the ML element) and ML system operations (the Ops element)" to automate the end-to-end ML lifecycle [1]. Databricks frames MLOps around three concerns: end-to-end machine learning workflow automation, reproducibility and collaboration, and continuous integration, delivery, and training of models in production [3].
The scope of MLOps typically includes data management, experiment tracking, model training and validation, packaging and deployment, monitoring and drift detection, retraining and continuous learning, governance and compliance, cost management, and the people and processes needed to coordinate all of the above.
MLOps did not appear fully formed. It emerged gradually as teams running ML in production discovered that the engineering practices that worked for ordinary software did not translate cleanly to systems whose behavior depended on training data and learned parameters.
The field's intellectual foundation is widely traced to a 2015 NeurIPS paper by D. Sculley and colleagues at Google titled "Hidden Technical Debt in Machine Learning Systems" [4]. The paper applied the software engineering concept of technical debt to ML and argued that ML systems have a special capacity for incurring debt because they have all the maintenance problems of traditional code plus an additional set of ML-specific issues. Crucially, this debt may be hard to detect because it exists at the system level rather than the code level.
The paper enumerated several recurring problems: boundary erosion (where ML systems mix concerns that traditional engineering keeps separate), entanglement (changing anything changes everything, often summarized as the "CACE" principle), hidden feedback loops (where a model's predictions influence the data it later trains on), undeclared consumers downstream of model outputs, data dependencies that are harder to track than code dependencies, configuration debt, and external-world changes that silently invalidate assumptions baked into a model. The authors illustrated their point with a now-famous diagram showing that the actual ML code in a real system is a tiny black box surrounded by a large infrastructure of configuration, data collection, feature extraction, serving infrastructure, monitoring, and analysis tools.
The paper did not propose a solution catalog. Instead it argued that ML practitioners needed to think beyond model accuracy and consider the long-term operational cost of the systems they were building. That argument became the seed for MLOps as a discipline.
While Sculley's paper described the problem, several companies had already begun building internal ML platforms to manage it.
Facebook (now Meta) launched FBLearner Flow internally in 2014 and publicly described it in May 2016 as the "AI backbone" of the company [5]. The platform handled experiment management, training pipelines, and model deployment for engineers across the company. By 2018 it was used by more than 25 percent of Facebook's engineering team, and the prediction service was running more than six million predictions per second. FBLearner Flow eventually evolved into a generic workflow engine handling tasks well beyond ML.
Uber introduced Michelangelo in 2017 to power business-critical ML use cases such as ride ETAs, Eats delivery time predictions, fraud detection, and ranking [6]. Michelangelo combined open-source components such as HDFS, Spark, Cassandra, MLlib, XGBoost, and TensorFlow with internal infrastructure for feature management, training, and serving. A key innovation was its early feature store, which allowed batch-computed features to be reused at serving time and effectively created the design pattern that the broader industry later adopted. Several members of the Michelangelo team went on to found Tecton, the commercial feature platform.
Google had been developing TFX (TensorFlow Extended) internally as a portable, end-to-end production ML platform built on TensorFlow, and described it publicly at KDD 2017 [7]. TFX exposed reusable components for data validation, feature transformation, training, model analysis, and serving, all coordinated through ML Metadata, which recorded artifact lineage. The KDD paper was an important moment because it described how Google actually ran ML in production, not just how it trained models.
Netflix took a different path. The company developed Metaflow internally and open-sourced it in 2019, focusing on the data scientist's experience rather than on Kubernetes-native infrastructure. Netflix built Metaflow on top of AWS Batch, EC2, and S3 and integrated it with Titus (Netflix's container platform) and Maestro (its workflow orchestrator) to support hundreds of personalization models in production [8].
The public turning point for MLOps as a field arrived in June 2018 when Databricks announced MLflow at the Spark + AI Summit [9]. The first alpha included three components: MLflow Tracking for logging parameters, metrics, and artifacts; MLflow Projects for packaging reusable training code; and MLflow Models for a standard model format. MLflow was deliberately framework-agnostic, designed to work with any ML library and exposed through REST APIs and simple file formats.
MLflow's adoption was unusually rapid. Databricks reported that the project grew its contributor count in nine months to a level that Apache Spark had taken three years to reach. Within a couple of years MLflow was the de facto open-source standard for experiment tracking and model registry, and many other tools built on it or interoperated with it. The term "MLOps" itself, modeled on "DevOps," began appearing in conference talks and job postings around the same time.
From 2019 onward the ecosystem expanded rapidly: Kubeflow released its first 1.0 version in March 2020, Weights & Biases became a popular commercial alternative for experiment tracking, Feast emerged as the open-source feature store, and Evidently and Arize built monitoring and observability tools focused on ML-specific failure modes.
MLOps borrows the philosophical core of DevOps: automation, continuous integration and delivery, monitoring, and tight collaboration between development and operations. However, ML systems introduce dimensions that pure DevOps does not address.
| Aspect | DevOps | MLOps |
|---|---|---|
| Primary artifacts | Source code | Code, data, model weights, prompts, configurations |
| Primary inputs | Code commits | Code commits, new data, drift signals, schedule |
| Testing | Unit tests, integration tests, end-to-end tests | All of those plus data validation, model validation, fairness audits, drift checks |
| Versioning | Git for code | Code in Git, data in DVC or LakeFS, models in a registry, prompts in a prompt store |
| CI trigger | Pull request | Pull request, data update, performance regression |
| CD target | A binary or container image | A container plus a model artifact plus configuration plus possibly a feature pipeline |
| Reproducibility | Deterministic builds from source | Probabilistic; needs data snapshots, random seeds, hardware notes, library pins |
| Rollback | Deploy previous code version | Deploy previous model and possibly previous data and feature pipeline together |
| Failure mode | Crashes, exceptions, latency spikes | All of those plus silent accuracy decay |
| Monitoring | Latency, throughput, error rate | All of those plus accuracy, calibration, drift, fairness, prediction distribution |
The critical difference is that a traditional application produces the same output for the same input given the same code. An ML system's behavior depends on the training data and the statistical relationship between features and target, both of which can change without anyone changing a line of code. This is why MLOps requires monitoring concepts (data drift, concept drift, prediction drift) that have no direct equivalent in DevOps, and why CI pipelines often need to validate data and trained models alongside code.
Culturally, MLOps inherits the DevOps shift away from siloed teams. In a mature MLOps organization, data scientists, ML engineers, platform engineers, and product owners share responsibility for what runs in production rather than throwing artifacts over the wall.
MLOps spans the full machine learning lifecycle. The exact stage names vary across vendors, but the phases below appear in nearly every MLOps reference architecture.
Before any data is collected, the team has to decide what business problem the ML system is solving, what success looks like in measurable terms, and whether ML is the right tool at all. This stage produces a problem statement, a success metric, an offline evaluation plan, and a rough idea of the data that will be required. Skipping this step is a common cause of projects that ship technically successful models nobody uses.
Data is the substrate of any ML system. This stage covers identifying data sources, building ingestion pipelines, cleaning and deduplicating records, and labeling examples when supervised learning is required. Labeling can be done in-house, through crowdsourcing platforms such as Scale or Surge, or through programmatic labeling tools such as Snorkel.
Key practices include:

- versioning raw and processed datasets so any model can later be traced back to the exact data it was trained on (tools such as DVC and LakeFS are covered below);
- validating each incoming batch against an expected schema, value ranges, and null-rate thresholds before it reaches training (see the sketch after this list);
- documenting datasets with datasheets or data cards;
- recording lineage from source systems through every transformation step.
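The kind of batch check described above can be expressed in a few lines; dedicated tools such as Great Expectations formalize the same idea. The column names and thresholds below are hypothetical.

```python
import pandas as pd

# Hypothetical expectations for an incoming batch; in practice these would live
# in a schema file or a data validation tool rather than in application code.
EXPECTED_COLUMNS = {"user_id", "amount", "country"}
MAX_NULL_FRACTION = 0.01

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch passes."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "amount" in df.columns:
        if (df["amount"].dropna() < 0).any():
            problems.append("negative values in 'amount'")
        null_fraction = df["amount"].isna().mean()
        if null_fraction > MAX_NULL_FRACTION:
            problems.append(f"'amount' null fraction {null_fraction:.3f} exceeds {MAX_NULL_FRACTION}")
    return problems

batch = pd.DataFrame({"user_id": [1, 2], "amount": [10.0, None], "country": ["DE", "US"]})
for problem in validate_batch(batch):
    print("validation issue:", problem)  # a pipeline would typically fail fast here
```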
Raw data is rarely fed directly into a model. Feature engineering transforms raw inputs into the numerical or categorical features the model consumes. The same transformations must be applied during training and during serving, otherwise the model encounters distributions at inference time that it never saw at training time. This problem is called training-serving skew and is a common cause of degraded production performance [10].
Feature stores were invented to address this problem. They provide a single definition of each feature and serve those features consistently to both batch training jobs and real-time inference services.
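The simplest defense against training-serving skew, even without a feature store, is a single feature definition imported by both the offline training job and the online service. A minimal sketch, with a hypothetical feature, might look like this:

```python
from datetime import datetime, timezone

def days_since_last_order(last_order_at: datetime, now: datetime) -> float:
    """One feature definition, imported by both the batch training pipeline and the API server."""
    return max((now - last_order_at).total_seconds() / 86400.0, 0.0)

# Offline: applied to a historical snapshot to build the training set.
# Online: applied to the live record at request time, so the model sees the same
# transformation it was trained on instead of a reimplementation that can drift.
print(days_since_last_order(datetime(2026, 1, 1, tzinfo=timezone.utc),
                            datetime(2026, 1, 15, tzinfo=timezone.utc)))  # 14.0
```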
Training selects a model architecture, picks hyperparameters, and runs an optimization process over labeled data to produce a set of model weights. MLOps practices for training include:

- tracking every run, with its hyperparameters, data version, code version, metrics, and artifacts, in an experiment tracker;
- pinning random seeds, library versions, and hardware details so a run can be reproduced later;
- running training as a versioned pipeline rather than an ad hoc notebook session, so the same steps execute identically when automation triggers them.
A trained model is not necessarily a production-ready model. Validation evaluates it against a held-out test set, against the current production model on a shared benchmark, against fairness and safety criteria, and against business metrics. Common validation checks include:

- performance on the held-out test set, broken down by slice rather than reported only as a single aggregate;
- a champion-challenger comparison against the model currently serving traffic;
- fairness metrics across protected groups;
- calibration of predicted probabilities;
- agreement with the business metric defined during problem framing (see the sketch after this list).
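A minimal sketch of such a validation gate, assuming an evaluation job has already produced overall and per-slice AUC numbers (the metric names and thresholds are placeholders):

```python
def passes_gate(candidate: dict, production: dict, slice_floor: float = 0.70) -> bool:
    """Promote the candidate only if it beats the production model and clears per-slice minimums."""
    # No regression against the model currently serving traffic.
    if candidate["auc_overall"] < production["auc_overall"]:
        return False
    # Every slice (e.g. region or device type) must clear an absolute floor.
    return all(auc >= slice_floor for auc in candidate["auc_by_slice"].values())

candidate = {"auc_overall": 0.84, "auc_by_slice": {"mobile": 0.81, "desktop": 0.86}}
production = {"auc_overall": 0.82}
print("promote" if passes_gate(candidate, production) else "reject")
```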
Deployment packages the validated model with its dependencies and exposes it for use by downstream consumers. Common deployment patterns include:
| Pattern | Description | Typical use case |
|---|---|---|
| Online inference | Model responds to individual requests via an API at low latency | Recommendations, fraud scoring, ad ranking, chat |
| Offline inference | Model processes large batches on a schedule | Credit scoring, churn prediction, lead scoring |
| Streaming inference | Model processes events from a queue with sub-second latency | Real-time anomaly detection |
| Edge deployment | Model runs on a device with intermittent or no connectivity | Mobile keyboards, on-device speech, IoT sensors |
| Shadow deployment | New model runs alongside the production model but does not serve responses | Validation under live traffic before cutover |
| Canary deployment | A small percentage of traffic is routed to the new model | Risk-bounded rollout |
| Blue-green deployment | Two identical environments swap roles after validation | Fast rollback at the cost of double infrastructure |
| A/B test | Traffic is split between two or more model versions | Statistical comparison of business metrics |
Beyond the rollout pattern, deployment also has to handle packaging (typically Docker), serving framework (Triton, KServe, BentoML, vLLM, or a custom server), hardware selection, autoscaling policies, and version pinning.
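As an illustration of the "custom server" option, a minimal online-inference endpoint can be written with FastAPI. The model file, feature names, and the assumption of a scikit-learn-style `predict_proba` are all hypothetical; a production server would add batching, health checks, and authentication.

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# The artifact produced by the training pipeline (assumed to be a scikit-learn-style model).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class Features(BaseModel):
    amount: float
    days_since_last_order: float

@app.post("/predict")
def predict(features: Features) -> dict:
    score = model.predict_proba([[features.amount, features.days_since_last_order]])[0][1]
    # Expose the model version alongside the score so monitoring can segment by version.
    return {"score": float(score), "model_version": "2026-01-15"}
```

Assuming the file is named main.py, this would be run with `uvicorn main:app` behind whatever autoscaling layer the platform provides.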
Once a model is in production, it must be continuously monitored. Unlike traditional software where bugs are usually introduced by code changes, ML models can degrade silently because the world they operate in changes. Monitoring covers system health (latency, throughput, error rate, GPU and CPU utilization), model behavior (prediction distribution, confidence calibration, output distribution shift), input data (feature drift, missing or out-of-range values), and business outcomes (downstream conversion, revenue, user satisfaction).
When monitoring detects performance decay or when fresh labeled data accumulates, the model must be retrained. Mature MLOps automates this loop. Triggers can be schedule-based (retrain every week), threshold-based (retrain when accuracy drops below X), or event-based (retrain when a drift detector fires). The retrained model goes through the same validation and deployment pipeline as the original.
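A sketch of how these triggers might be combined; the thresholds and the drift flag stand in for whatever the monitoring system actually reports:

```python
from datetime import datetime, timedelta, timezone

def should_retrain(last_trained: datetime,
                   current_accuracy: float,
                   drift_detected: bool,
                   max_age: timedelta = timedelta(days=7),
                   accuracy_floor: float = 0.80) -> bool:
    schedule_due = datetime.now(timezone.utc) - last_trained > max_age   # schedule-based
    accuracy_low = current_accuracy < accuracy_floor                     # threshold-based
    return schedule_due or accuracy_low or drift_detected                # event-based

if should_retrain(datetime(2026, 1, 1, tzinfo=timezone.utc), 0.78, drift_detected=False):
    print("kick off the retraining pipeline")  # the retrained model re-enters validation
```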
Models eventually outlive their usefulness, either because the underlying business has changed, a better model has replaced them, or they cannot be safely maintained. MLOps includes a process for cleanly removing a model from production, archiving its weights and documentation, and rerouting any consumers.
Google has proposed a widely cited maturity model for MLOps that defines three levels of automation [2]. Microsoft and other vendors have published variants, but the Google model is the most often referenced.
| Level | Name | What is automated | What is still manual |
|---|---|---|---|
| 0 | Manual process | Almost nothing. Data scientists train models in notebooks and hand off model files to engineers, who deploy them. | Data pulls, training, validation, deployment, monitoring. Retraining is rare and often forgotten. |
| 1 | ML pipeline automation | Training is wrapped in a reusable pipeline. Continuous training (CT) is enabled, so the model can be retrained automatically when new data arrives or a trigger fires. Feature stores and metadata stores are common. | Code changes still require manual deployment of the pipeline itself. CI/CD for the pipeline is not yet automated. |
| 2 | CI/CD pipeline automation | The training pipeline itself is built, tested, and deployed automatically. Code changes flow through a CI/CD system that builds, tests, and pushes new pipeline versions. Models, data, and code are all versioned, validated, and deployed automatically. | Strategy decisions, problem framing, governance review. |
Most organizations begin at Level 0 and progress incrementally. Reaching Level 2 requires substantial investment in tooling, infrastructure, and organizational practices. In practice, many successful teams operate at Level 1 for the bulk of their models and reserve full Level 2 automation for their most critical or highest-volume systems.
Microsoft's Azure-flavored maturity model adds intermediate levels and is sometimes referenced when teams want a more granular self-assessment. Both models share the same underlying message: maturity is about automation, repeatability, and the absence of manual handoffs, not about model sophistication.
A production MLOps stack is rarely a single product. It is an assembly of components, each addressing a specific concern in the lifecycle.
Data versioning tools track changes to datasets the way Git tracks changes to source code. They allow a model to be rebuilt against the exact data it was trained on. Common tools include DVC (Data Version Control), which layers a Git-like interface over object storage; LakeFS, which provides Git-like operations on top of S3 or Azure Blob Storage; and Pachyderm, which links data versioning with pipeline orchestration on Kubernetes.
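As a sketch of what data versioning looks like in practice, DVC's Python API can read a dataset as it existed at a specific Git revision. The path, repository URL, and tag below are hypothetical:

```python
import dvc.api
import pandas as pd

# Read the training data exactly as it existed at the tagged revision.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",  # Git repo that tracks the data with DVC
    rev="v1.2.0",                                  # Git tag or commit pinning the data version
) as f:
    train = pd.read_csv(f)

print(len(train), "rows from the pinned snapshot")
```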
Experiment tracking systems record every training run with its hyperparameters, training data version, code version, environment details, evaluation metrics, and produced artifacts. By keeping a complete history, they enable comparison of approaches, reproduction of past results, and audit trails for governance. The dominant tools are MLflow Tracking, Weights & Biases, Neptune, and Comet.
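A minimal sketch of an MLflow Tracking run; the tracking server URI, experiment name, and parameter values are illustrative:

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # assumed internal tracking server
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("data_version", "v1.2.0")   # ties the run to the data snapshot it used
    mlflow.log_metric("auc", 0.84)
    mlflow.log_artifact("model.pkl")             # or a flavor-specific log_model call
```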
A feature store is a centralized service for defining, computing, storing, and serving ML features [10]. It addresses two problems at once: training-serving skew (because the same definition is used for both batch backfills and real-time serving) and feature reuse (because teams can discover and reuse features rather than recomputing them). The category took shape after Uber's Michelangelo Palette demonstrated the pattern. Today the leading options are Feast (open source, modular), Tecton (commercial, opinionated end-to-end), Hopsworks (open source, includes its own RonDB online store), Vertex AI Feature Store (Google), and SageMaker Feature Store (AWS).
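A sketch of both halves of the pattern using Feast; the feature view name, feature names, and entity are hypothetical, but the point is that training and serving read from the same definitions:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing the Feast feature definitions

# Offline: point-in-time correct features joined onto labeled training examples.
entity_df = pd.DataFrame({
    "user_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2026-01-01", "2026-01-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:days_since_last_order", "user_stats:order_count_30d"],
).to_df()

# Online: the same features fetched at low latency at inference time.
online_features = store.get_online_features(
    features=["user_stats:days_since_last_order", "user_stats:order_count_30d"],
    entity_rows=[{"user_id": 1}],
).to_dict()
```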
A model registry is a versioned repository for trained models with their metadata: training data versions, hyperparameters, evaluation metrics, lineage, and deployment status. It is the source of truth for what models exist, what their performance characteristics are, and which one is currently serving traffic. Registries typically support tagging (staging, production, archived), approval workflows, and rollback. MLflow Model Registry, Vertex AI Model Registry, and SageMaker Model Registry are the most widely used implementations.
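A sketch of registering and promoting a model with the MLflow Model Registry; the run ID and model name are placeholders, and newer MLflow versions favor aliases over the older stage labels shown here:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the artifact logged by a training run as a new version of the model.
result = mlflow.register_model("runs:/<run_id>/model", "churn-model")

client = MlflowClient()
client.transition_model_version_stage(
    name="churn-model",
    version=result.version,
    stage="Staging",  # promoted to "Production" only after the validation gate passes
)
```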
Pipeline orchestrators define and execute the directed graph of steps in an ML workflow: extract data, validate it, transform features, train, evaluate, and deploy. Each tool occupies a slightly different niche.
| Orchestrator | Origin | Typical strength | Notes |
|---|---|---|---|
| Apache Airflow | Airbnb, 2014 | Mature ecosystem, hundreds of operators | Originally a general data pipeline tool; widely used for ML by extension |
| Kubeflow Pipelines | Google, 2018 | Native Kubernetes execution, TFX integration | Powerful but operationally heavy; demands Kubernetes expertise |
| Metaflow | Netflix, 2019 | Data scientist ergonomics, AWS-first | Notebook-friendly DSL; less flexible outside AWS |
| Flyte | Lyft, 2020 | Strongly typed, Kubernetes-native, scalable | Reproducibility guarantees and good big-job support |
| Prefect | 2018 | Python-native decorators, hybrid execution | Good for data engineers transitioning to ML |
| Dagster | 2019 | Asset-centric model, software-defined assets | Strong observability and testing built in |
| ZenML | 2021 | Framework-agnostic abstraction layer | Plugs into other orchestrators as backends |
Serving infrastructure runs the model in production, handling autoscaling, batching, and the runtime details of inference. The right choice depends on the model type and the latency budget.
| Server | Best for | Notable features |
|---|---|---|
| NVIDIA Triton | Heterogeneous models on GPUs | Concurrent model execution, dynamic batching, multi-framework |
| KServe | Kubernetes-native serving | Operates at the orchestration layer, supports many predictors |
| Seldon Core | Kubernetes-native serving with A/B testing | Canary deployments, multi-armed bandits |
| BentoML | Python-first model packaging | Builds OpenAI-compatible APIs, supports many backends |
| TensorFlow Serving | TensorFlow models | Mature, REST and gRPC, optimized for TF graphs |
| TorchServe | PyTorch models | Native PyTorch packaging |
| vLLM | LLM inference | PagedAttention, continuous batching, best-in-class TTFT for LLMs |
| Ray Serve | Composable Python services | Good for ensembles and serving DAGs |
For large language models the calculus is different. Benchmarks reported by BentoML in 2024 found that vLLM achieved best-in-class time-to-first-token across concurrency levels, while TensorRT-LLM and LMDeploy delivered the highest token generation rates [11]. Triton can host vLLM as a backend, which is a common production pattern for organizations that want vLLM's performance with Triton's multi-model orchestration.
ML monitoring tools track data drift, concept drift, prediction quality, latency, and cost. The category includes Evidently (open source, statistical drift reports), WhyLabs (managed, with WhyLogs as the open-source profiling library), Arize and Arize Phoenix (model and LLM observability), Fiddler AI (model performance management with explainability), and Datadog Model Monitoring (integrated with Datadog APM). Most of these tools have added LLM-specific features since 2023.
Continuous training (CT) is the ML-specific extension of continuous delivery. In a CI/CD pipeline for code, a commit triggers a build, a test run, and a deployment. In a CI/CD/CT pipeline for ML, the same flow applies, but additional triggers can also start a retraining run: new labeled data has accumulated, a drift detector has fired, scheduled retraining is due, or a downstream metric has dropped below a threshold.
A CT pipeline typically includes:

- extraction and validation of fresh training data;
- feature transformation using the same definitions as serving;
- training with the pinned configuration;
- evaluation of the candidate against the current production model and against absolute quality thresholds;
- a gating step that registers the candidate in the model registry only if it passes;
- automated deployment of the approved model through the same rollout patterns described above.
The gating step is what distinguishes CT from naive automation. A model that fails validation should not be deployed even if it was produced automatically. This is why a model registry with explicit promotion stages is a near-universal feature of mature MLOps stacks.
CI/CD for ML extends traditional CI/CD with steps that have no equivalent in pure software delivery.
A common practice is to keep these stages in separate pipelines that share a model registry and feature store as connecting interfaces, so that data, model, and code teams can work asynchronously without blocking each other.
Production monitoring is where MLOps does most of its day-to-day work, and it is also where the ML-specific failure modes show up.
Data drift, also called covariate shift, occurs when the distribution of input features changes between training and serving. A model trained on customer behavior from 2023 may encounter very different behavior in 2026 as user preferences, payment methods, or platform mix evolve. Drift can be gradual (seasonality, slow demographic change) or sudden (a global event, a competitor launch, a feature pipeline bug).
Detection requires statistical tests on incoming feature distributions compared to a reference distribution (typically the training set). Common tests include the Kolmogorov-Smirnov (KS) test for continuous features, the chi-square test for categorical features, the Population Stability Index (PSI) used heavily in credit risk, and the Jensen-Shannon and Kullback-Leibler divergences. Each test has tradeoffs. The KS test is sensitive but can fire too often on large datasets where small changes are statistically significant but practically irrelevant. PSI gives a single interpretable number with widely used thresholds (0.1 for minor change, 0.25 for major change). KL divergence is asymmetric and requires care when comparing distributions [12].
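A sketch of two of these tests on a single continuous feature, using scipy for the KS test and a small hand-rolled PSI; the reference and current arrays stand in for the training distribution and recent live traffic:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index using bins derived from the reference distribution."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid division by zero and log of zero for empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)   # training-time feature values
current = rng.normal(0.3, 1.0, 10_000)     # shifted live values

stat, p_value = ks_2samp(reference, current)
print(f"KS p-value: {p_value:.4f}, PSI: {psi(reference, current):.3f}")  # PSI > 0.25 signals major change
```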
Concept drift refers to a change in the relationship between input features and the target variable, even when the input distribution looks stable. A churn model can decay because the same demographic and behavioral signals now predict churn at different rates than they used to. Concept drift is harder to detect than data drift because it requires labeled outcome data, and labels are usually delayed.
Prediction drift watches the distribution of model outputs. A sudden shift in the fraction of positive predictions, the average predicted score, or the entropy of class probabilities can indicate either upstream data drift or concept drift, even before labels arrive. Prediction drift is often the first signal an MLOps team has that something has changed.
In addition to ML-specific monitoring, the team must watch the same system metrics any production service has: request rate, latency at p50 and p99, error rate, queue depth, GPU and CPU utilization, memory, and cost. For LLM services, latency is often broken into time-to-first-token (TTFT) and inter-token latency rather than treated as a single number, because user perception of quality depends heavily on how quickly the response begins streaming.
Calibration measures whether predicted probabilities match observed frequencies. A well-calibrated binary classifier that outputs 0.7 should be correct about 70 percent of the time on examples in that score band. Calibration can decay independently of accuracy and matters wherever the model's score is used to drive a downstream decision.
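A sketch of a calibration check with scikit-learn's `calibration_curve`; the labels and scores stand in for logged production outcomes and the model's predicted probabilities:

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 1])          # observed outcomes
y_prob = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9, 0.6, 0.7, 0.2, 0.8])  # model scores

observed, predicted = calibration_curve(y_true, y_prob, n_bins=5)
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")   # large gaps indicate calibration decay
```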
Fairness monitoring tracks model behavior across protected groups (race, gender, age, geography) using metrics such as disparate impact, equal opportunity difference, and demographic parity. Fairness regressions are particularly important under regimes such as the EU AI Act that explicitly require ongoing bias monitoring for high-risk systems.
LLMOps is the operational discipline for systems built around large language models. It is a specialization of MLOps rather than a replacement, but enough is different that it has acquired its own tooling and vocabulary.
The biggest differences are that the model itself is often a third-party API (OpenAI, Anthropic, Google), the most important artifacts are prompts and retrieval pipelines rather than weights, the cost driver is inference tokens rather than training compute, and quality cannot be reduced to a single accuracy number because outputs are open-ended text.
| Aspect | Traditional MLOps | LLMOps |
|---|---|---|
| Model | Trained or fine-tuned in-house | Often consumed via API; sometimes fine-tuned with LoRA or QLoRA |
| Versioned artifacts | Code, data, model weights | Code, prompts, retrieval indexes, evaluation datasets, sometimes weights |
| Quality measurement | Standard metrics (accuracy, F1, AUC) | LLM-as-judge, human evaluation, rubric-based scoring, red-teaming |
| Latency | Single inference time | Time-to-first-token plus inter-token latency |
| Cost driver | Training compute | Per-token inference cost; long contexts and large output windows |
| Safety concerns | Bias, fairness | Hallucinations, prompt injection, PII leakage, jailbreaks, content policy |
| Data pipeline | Feature engineering | RAG corpus curation, embedding management, chunking strategy |
| Failure mode | Numeric drift | Hallucinated facts, format breakage, refusal regressions, drift in tone |
Specialized LLMOps tooling has emerged rapidly. LangSmith (closed source, deeply integrated with LangChain) provides tracing, evaluations, and prompt versioning. Langfuse is open source and framework-agnostic, popular for self-hosting. Helicone is a proxy-based observability tool that requires almost no code changes. PromptLayer focuses on prompt management with a no-code editor. Arize Phoenix offers open-source LLM tracing with notebook-friendly visualizations. Humanloop targets prompt evaluation workflows for product teams.
Generative AI gateways are another LLMOps category that did not exist in classical MLOps. LiteLLM provides a unified Python interface across more than 100 LLM providers and integrates logging out of the box. Portkey adds routing, retries, fallbacks, guardrails, and budgets on top of an OpenAI-compatible API. These gateways play a role analogous to a service mesh in microservices: they sit between the application and the model providers and enforce cross-cutting concerns.
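A sketch of what routing through such a gateway looks like with LiteLLM; the model identifiers and the cost-based routing rule are illustrative, and provider API keys are assumed to be set in the environment:

```python
from litellm import completion

def ask(prompt: str, cheap: bool = True) -> str:
    # Route simple requests to a cheaper model and harder ones to a stronger one.
    model = "openai/gpt-4o-mini" if cheap else "anthropic/claude-3-5-sonnet-20241022"
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

print(ask("Summarize this support ticket in one sentence."))
```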
As ML systems have moved into hiring, lending, healthcare, and law enforcement, governance has become a first-class MLOps concern rather than an afterthought.
The model card was introduced by Margaret Mitchell and colleagues at Google in 2018 (published 2019) as a short document accompanying a trained model that describes its intended use, performance across demographic groups, evaluation conditions, ethical considerations, and known limitations [13]. Model cards have since been adopted by Hugging Face (where every model on the Hub has a card), Google's Vertex AI, and many internal MLOps platforms.
Datasheets for datasets, proposed by Timnit Gebru and colleagues in 2018, play the analogous role for training data. A datasheet describes the dataset's motivation, composition, collection process, labeling, recommended uses, and maintenance plan, inspired by the datasheets that accompany electronic components.
Google later introduced Data Cards (a slightly different format from datasheets) and Hugging Face uses Dataset Cards on its Hub. The general pattern of structured, machine-readable model and data documentation is now a standard MLOps practice.
The US National Institute of Standards and Technology released the AI Risk Management Framework (AI RMF 1.0) in January 2023 as a voluntary, sector-agnostic guide for managing AI risk across the lifecycle [14]. The framework defines four functions:
| Function | Purpose |
|---|---|
| GOVERN | Establish leadership, accountability, and risk-tolerance policies for AI across the organization |
| MAP | Understand the context of use, stakeholders, and potential impacts |
| MEASURE | Quantitatively and qualitatively assess AI risks and impacts |
| MANAGE | Allocate resources to mitigate risks identified through GOVERN, MAP, and MEASURE |
NIST followed in 2024 with AI 600-1, a generative-AI-specific profile, and a voluntary playbook that maps the framework to concrete actions. AI RMF has become a common reference for MLOps governance even outside the US, because it is technology-agnostic and easy to overlay on existing development processes.
The EU AI Act (Regulation EU 2024/1689) entered into force on 1 August 2024 and is the world's first comprehensive AI regulation [15]. It classifies AI systems by risk: prohibited practices (such as social scoring) are banned, high-risk systems (in areas like employment, education, credit, law enforcement, and biometrics) face strict compliance obligations, limited-risk systems carry transparency requirements, and minimal-risk systems are largely unregulated. General-purpose AI (GPAI) models have their own dedicated rules.
For high-risk systems the providers must implement a continuous risk management process across the lifecycle, use representative and high-quality training data, maintain technical documentation and event logs, ensure human oversight, achieve adequate accuracy and cybersecurity, register the system in an EU database, and conduct conformity assessments. Most of these obligations map directly onto MLOps concerns: lineage, monitoring, model cards, audit logs, and structured deployment processes.
The Act's timeline is staggered. Prohibitions and AI literacy obligations applied from 2 February 2025. Governance rules and obligations for GPAI models took effect on 2 August 2025. The full regime applies on 2 August 2026, with high-risk systems embedded in regulated products getting until 2 August 2027. As of early 2026, MLOps teams operating in or selling into the EU are actively building documentation pipelines, conformity-assessment artifacts, and quality management systems to meet these deadlines.
A related strand of governance work is data-centric AI, the argument popularized by Andrew Ng that improving data quality often yields more reliable production performance than tweaking model architecture. Data-centric practices (consistent labeling guidelines, error analysis on slices, targeted data collection for failure cases) have become standard parts of MLOps for high-stakes systems.
Despite a decade of public MLOps writing, certain anti-patterns recur in organizations that are early in their ML journey: models handed over the wall from notebooks to engineering with no shared pipeline, unversioned data and models that make results irreproducible, feature logic duplicated between training code and serving code, deployments with no monitoring so accuracy decays silently, and retraining that is rare, manual, and easily forgotten.
MLOps maturity is largely about systematically eliminating these patterns. The Sculley paper's central message holds up: most of the cost of ML in production is not in the model but in the surrounding infrastructure and process.
The modern MLOps ecosystem is wide and continues to consolidate around a few dominant patterns.
| Tool | Category | Notes |
|---|---|---|
| MLflow | Tracking, registry, deployment | Most widely adopted open-source MLOps platform; MLflow 3 (June 2025) added GenAI support |
| Kubeflow | Orchestration, training, serving | Kubernetes-native; broad but operationally heavy |
| Apache Airflow | Workflow orchestration | General-purpose; widely used as the default DAG runner |
| Prefect | Workflow orchestration | Python-native decorators, hybrid execution |
| Dagster | Workflow orchestration | Asset-centric model with strong observability |
| Flyte | Workflow orchestration | Strongly typed, Kubernetes-native |
| Metaflow | Workflow orchestration | Netflix-origin, AWS-first, data scientist friendly |
| ZenML | Abstraction layer | Plugs into other orchestrators as backends |
| DVC | Data versioning | Git-like data tracking |
| LakeFS | Data versioning | Git-like operations on object storage |
| Pachyderm | Data versioning + pipelines | Lineage-aware processing on Kubernetes |
| Feast | Feature store | The reference open-source feature store |
| Hopsworks | Feature store + lakehouse | Includes RonDB online store |
| Great Expectations | Data validation | Tests for data quality and schema |
| BentoML | Model packaging and serving | OpenAI-compatible APIs, multiple backends |
| Seldon Core | Model serving on Kubernetes | A/B testing and canary deployment patterns |
| KServe | Model serving on Kubernetes | Standardized inference runtime |
| Ray | Distributed compute, serving | Ray Train, Ray Tune, Ray Serve cover much of the lifecycle |
| Triton Inference Server | Model serving | Multi-framework, GPU-optimized |
| vLLM | LLM serving | PagedAttention, leading TTFT for LLMs |
| Evidently | Monitoring | Open-source data and model drift reports |
| WhyLogs | Data profiling | Lightweight statistical fingerprints |
| Arize Phoenix | LLM observability | Notebook-friendly tracing |
| Langfuse | LLM observability | Open-source, framework agnostic |
| LiteLLM | LLM gateway | Unified API across providers |
| Platform | Provider | Strength |
|---|---|---|
| Amazon SageMaker | AWS | End-to-end platform with notebooks, training, serving, registry, and pipelines; deep integration with the AWS ecosystem |
| Vertex AI | Google Cloud | Unified ML platform with AutoML, custom training, feature store, Pipelines, and tight integration with Gemini |
| Azure Machine Learning | Azure | Enterprise ML with responsible AI dashboards, managed endpoints, and tight integration with the Microsoft ecosystem |
| Databricks | Databricks | Lakehouse-centric with Mosaic AI, Unity Catalog, and managed MLflow |
| Snowflake Cortex | Snowflake | Brings models close to data inside the warehouse |
| Tool | Category | Notes |
|---|---|---|
| Weights & Biases | Experiment tracking, model registry, evaluations | The dominant commercial alternative to MLflow Tracking |
| Neptune | Experiment tracking | Strong for large-scale experiment management |
| Comet | Experiment tracking | Includes LLM features as Opik |
| Tecton | Feature platform | Born from the Uber Michelangelo team |
| Modal | Compute platform | Serverless GPUs for training and inference |
| Anyscale | Managed Ray | Ray-as-a-service for distributed ML |
| Hugging Face | Model and dataset hub plus Inference Endpoints | The de facto registry for open-weight models |
| Arize | Model and LLM observability | Combines classical ML monitoring with LLM tracing |
| Fiddler AI | Model performance management | Explainability and bias monitoring |
| WhyLabs | Monitoring | Built on the open-source WhyLogs library |
| LangSmith | LLM observability | LangChain-integrated tracing and evaluations |
| Helicone | LLM observability | Proxy-based, low setup overhead |
| Portkey | LLM gateway | Routing, fallbacks, guardrails, budgets |
| Humanloop | Prompt evaluation | Product-team-friendly LLM workflows |
The consensus pattern in 2026 is composition. A typical mature stack might combine MLflow for experiment tracking and registry, Feast or Tecton for features, a Kubernetes-based orchestrator like Kubeflow or Flyte for pipelines, KServe or Triton for serving, and Evidently or Arize for monitoring. LLM workloads add LangSmith or Langfuse for tracing and a gateway like LiteLLM or Portkey for provider abstraction.
Uber's Michelangelo is the longest-running production MLOps platform at internet scale [6]. It has gone through three major architectural generations: an initial monolithic platform focused on classical ML, a redesign that introduced a Lego-like plug-and-play model so teams could swap in best-of-breed open-source components, and a third generation that added support for deep learning and generative AI. By 2024 the platform was running 100 percent of Uber's business-critical ML workloads, including LLM-based customer service bots. Michelangelo's early feature store was an industry milestone, and the team that built it later founded Tecton.
Netflix's machine learning platform team built Metaflow with the explicit goal of optimizing for the data scientist's experience [8]. Metaflow's Python DSL lets a scientist describe a workflow as decorated Python steps, and Metaflow handles versioning, scheduling, and resource provisioning. Backing Metaflow are Titus (Netflix's container platform on AWS), Maestro (Netflix's open-source workflow orchestrator), Atlas (Netflix's time-series telemetry system, processing 17 billion metrics per day), and Metaflow Hosting for deploying models behind a managed inference endpoint. The system supports hundreds of personalization models in production and serves predictions to more than 230 million members.
FBLearner Flow is the longest-tenured large-scale workflow platform in the industry's MLOps story [5]. By the time of public disclosure in 2016 it was running ML workflows for ranking, ads, and content understanding across Facebook. Its evolution illustrates an important pattern: a platform built for ML often becomes a general-purpose workflow runner, because the orchestration, versioning, and resource-management problems are not actually ML-specific.
TFX is the open-source distillation of how Google runs production ML on TensorFlow [7]. A TFX pipeline is a sequence of components (ExampleGen, StatisticsGen, SchemaGen, Transform, Trainer, Evaluator, Pusher) that share data through ML Metadata, a service that records artifact lineage and configuration. TFX runs on multiple orchestrators (Apache Beam, Apache Airflow, Kubeflow Pipelines, Vertex AI Pipelines), which means a single TFX pipeline definition can move between local development, on-premise Kubernetes, and managed cloud execution.
MLOps is a sociotechnical practice. The technology is necessary but not sufficient.
The MLOps Engineer role emerged as a distinct title around 2020 and has become common at companies running ML at scale. The role sits between data science, DevOps, and platform engineering. According to 2025-2026 salary data, MLOps engineers in the United States earn on average around 130,000 to 161,000 dollars per year, with senior roles reaching 200,000 dollars or more [16]. Compensation has grown roughly 20 percent year over year through 2025, reflecting strong demand for people who can bridge ML and operations.
A common organizational model in larger companies is to have a dedicated ML platform team that builds internal tools (a feature store, a model registry, training infrastructure, monitoring) used by multiple product teams. The platform team's job is to make ML deployment a self-service experience for data scientists rather than a custom engineering project for each model. This is essentially platform engineering applied to ML.
Smaller organizations often combine MLOps with data engineering, DevOps, or ML engineering. The boundary between roles matters less than the practices: someone has to own the production pipeline, someone has to monitor it, and someone has to retrain when it decays.
MLOps in 2026 is mature in the sense that the core practices and tools are well established, but the field is still moving rapidly because the underlying technology continues to evolve.
A few patterns dominate the recent landscape:
The convergence of MLOps and LLMOps is well underway. The same platforms increasingly support both classical models and LLMs, and many MLOps teams now serve both kinds of workload. MLflow's 3.0 release in 2025 added generative AI tracking. Vertex AI, SageMaker, and Azure ML have all added LLM-specific features. Tools that started as LLM observability platforms (Arize Phoenix, Langfuse) are increasingly used for classical ML too.
Modular and composable architectures have replaced the monolithic vision of "one platform for all of MLOps." Teams pick best-of-breed tools and connect them through standard interfaces, especially the model registry and feature store. The AI gateway has emerged as a new architectural component for the LLM stack.
Governance has shifted from optional to mandatory for many systems. The EU AI Act, NIST AI RMF, ISO/IEC 42001, and sector-specific regulations (FDA for medical devices, OCC for credit, FINRA for finance) all require structured documentation and ongoing oversight. MLOps teams that already practiced version control, monitoring, and audit logging are well positioned; those that did not are scrambling.
Cost optimization has become central to MLOps for LLM workloads. Token-based pricing, large context windows, and high-volume inference make the cost dimension impossible to ignore. Techniques like model routing, prompt caching, retrieval to reduce context size, quantization, and serving smaller distilled models for less demanding queries are now standard practice.
Agent orchestration is the newest frontier. Production AI systems in 2026 are increasingly composed of multiple interacting components: foundation models, fine-tuned adapters, retrieval systems, tools, guardrails, routing logic, and feedback loops. Each piece has its own lifecycle and failure modes. Tracing tools that visualize an entire agent run as a tree of nested spans (Langfuse, LangSmith, Arize Phoenix, OpenTelemetry semantic conventions for AI) are becoming the operational equivalent of distributed tracing for microservices.
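A sketch of what span-based tracing of an agent run looks like with the OpenTelemetry API; the span names and steps are placeholders, and exporting the spans would require configuring a TracerProvider, which is omitted here:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-demo")

def run_agent(question: str) -> str:
    # Each step of the agent becomes a child span, so the whole run renders as a tree.
    with tracer.start_as_current_span("agent_run"):
        with tracer.start_as_current_span("retrieval"):
            context = "...retrieved passages..."       # e.g. a vector-store lookup
        with tracer.start_as_current_span("llm_call"):
            answer = f"answer to {question!r} using {len(context)} chars of context"
        with tracer.start_as_current_span("guardrail_check"):
            pass                                       # e.g. PII and policy filters
    return answer

print(run_agent("What is our refund policy?"))
```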
The enduring mantra of MLOps remains "version everything": code, data, models, prompts, configurations, environments. Teams that adopt this discipline along with automated CI/CD and production-grade monitoring consistently achieve better outcomes than those that rely on manual processes. The Sculley paper from 2015 turned out to be exactly right: the model itself is the smallest part of an ML system, and most of the work is in everything around it.