MLOps (Machine Learning Operations) is a set of practices, principles, and tools for deploying, monitoring, and maintaining machine learning models in production reliably and efficiently. The discipline draws heavily from DevOps but extends those ideas to the unique requirements of ML systems, where data, model weights, hyperparameters, and shifting real-world conditions are first-class artifacts alongside code. MLOps has become essential as organizations move from experimental notebooks to systems that must operate at scale, remain accurate over time, and meet business and regulatory requirements for reliability and governance [1].
The field crystallized around 2018 to 2019, when teams that had built early production ML infrastructure (Google with TFX, Uber with Michelangelo, Facebook with FBLearner Flow, Netflix with Metaflow) began publishing the operational lessons they had learned. The launch of MLflow by Databricks in June 2018 gave the open-source community a common vocabulary, and the term "MLOps" itself entered widespread use shortly after. By the mid-2020s, MLOps had branched into specialized subdisciplines such as LLMOps for generative models, while regulatory frameworks like the EU AI Act made formal lifecycle governance a legal requirement for many systems.
At its core, MLOps addresses a fundamental problem: machine learning models that perform well on a held-out test set during research often fail or quietly degrade once deployed. Unlike traditional software, an ML system depends not only on its source code but also on the data it was trained on, the model weights produced by training, the hyperparameters used, and the statistical properties of the environment it operates in. When any of these change, the system can break in ways that are hard to detect because the code is still running and returning numerical answers.
MLOps formalizes the practices needed to manage that complexity. It covers the entire lifecycle of an ML model, from initial problem framing and data collection through deployment, monitoring, retraining, and eventual retirement. The goal is to make ML systems as reliable, reproducible, and maintainable as conventional software systems while accounting for the data and model dependencies that make ML different.
Google Cloud defines MLOps as a practice that "aims to unify ML system development and ML system deployment in order to standardize and streamline the continuous delivery of high-performing models in production" [2]. AWS describes it as combining "ML system development (the ML element) and ML system operations (the Ops element)" to automate the end-to-end ML lifecycle [1]. Databricks frames MLOps around three concerns: end-to-end machine learning workflow automation, reproducibility and collaboration, and continuous integration, delivery, and training of models in production [3].
The scope of MLOps typically includes data management, experiment tracking, model training and validation, packaging and deployment, monitoring and drift detection, retraining and continuous learning, governance and compliance, cost management, and the people and processes needed to coordinate all of the above.
MLOps did not appear fully formed. It emerged gradually as teams running ML in production discovered that the engineering practices that worked for ordinary software did not translate cleanly to systems whose behavior depended on training data and learned parameters.
The field's intellectual foundation is widely traced to a 2015 NeurIPS paper by D. Sculley and colleagues at Google titled "Hidden Technical Debt in Machine Learning Systems" [4]. The paper applied the software engineering concept of technical debt to ML and argued that ML systems have a special capacity for incurring debt because they have all the maintenance problems of traditional code plus an additional set of ML-specific issues. Crucially, this debt may be hard to detect because it exists at the system level rather than the code level.
The paper enumerated several recurring problems: boundary erosion (where ML systems mix concerns that traditional engineering keeps separate), entanglement (changing anything changes everything, often summarized as the "CACE" principle), hidden feedback loops (where a model's predictions influence the data it later trains on), undeclared consumers downstream of model outputs, data dependencies that are harder to track than code dependencies, configuration debt, and external-world changes that silently invalidate assumptions baked into a model. The authors illustrated their point with a now-famous diagram showing that the actual ML code in a real system is a tiny black box surrounded by a large infrastructure of configuration, data collection, feature extraction, serving infrastructure, monitoring, and analysis tools.
The paper did not propose a solution catalog. Instead it argued that ML practitioners needed to think beyond model accuracy and consider the long-term operational cost of the systems they were building. That argument became the seed for MLOps as a discipline.
While Sculley's paper described the problem, several companies had already begun building internal ML platforms to manage it.
Facebook (now Meta) launched FBLearner Flow internally in 2014 and publicly described it in May 2016 as the "AI backbone" of the company [5]. The platform handled experiment management, training pipelines, and model deployment for engineers across the company. By 2018 it was used by more than 25 percent of Facebook's engineering team, and the prediction service was running more than six million predictions per second. FBLearner Flow eventually evolved into a generic workflow engine handling tasks well beyond ML.
Uber introduced Michelangelo in 2017 to power business-critical ML use cases such as ride ETAs, Eats delivery time predictions, fraud detection, and ranking [6]. Michelangelo combined open-source components such as HDFS, Spark, Cassandra, MLlib, XGBoost, and TensorFlow with internal infrastructure for feature management, training, and serving. A key innovation was its early feature store, which allowed batch-computed features to be reused at serving time and effectively created the design pattern that the broader industry later adopted. Several members of the Michelangelo team went on to found Tecton, the commercial feature platform.
Google had been developing TFX (TensorFlow Extended) internally as a portable, end-to-end production ML platform built on TensorFlow, and described it publicly at KDD 2017 [7]. TFX exposed reusable components for data validation, feature transformation, training, model analysis, and serving, all coordinated through ML Metadata, which recorded artifact lineage. The KDD paper was an important moment because it described how Google actually ran ML in production, not just how it trained models.
Netflix took a different path. The company developed Metaflow internally and open-sourced it in 2019, focusing on the data scientist's experience rather than on Kubernetes-native infrastructure. Netflix built Metaflow on top of AWS Batch, EC2, and S3 and integrated it with Titus (Netflix's container platform) and Maestro (its workflow orchestrator) to support hundreds of personalization models in production [8].
The public turning point for MLOps as a field arrived in June 2018 when Databricks announced MLflow at the Spark + AI Summit [9]. The first alpha included three components: MLflow Tracking for logging parameters, metrics, and artifacts; MLflow Projects for packaging reusable training code; and MLflow Models for a standard model format. MLflow was deliberately framework-agnostic, designed to work with any ML library and exposed through REST APIs and simple file formats.
MLflow's adoption was unusually rapid. Databricks reported that the project grew its contributor count in nine months to a level that Apache Spark had taken three years to reach. Within a couple of years MLflow was the de facto open-source standard for experiment tracking and model registry, and many other tools built on it or interoperated with it. The term "MLOps" itself, modeled on "DevOps," began appearing in conference talks and job postings around the same time.
From 2019 onward the ecosystem expanded rapidly: Kubeflow released its first 1.0 version in March 2020, Weights & Biases became a popular commercial alternative for experiment tracking, Feast emerged as the open-source feature store, and Evidently and Arize built monitoring and observability tools focused on ML-specific failure modes.
MLOps borrows the philosophical core of DevOps: automation, continuous integration and delivery, monitoring, and tight collaboration between development and operations. However, ML systems introduce dimensions that pure DevOps does not address.
| Aspect | DevOps | MLOps |
|---|---|---|
| Primary artifacts | Source code | Code, data, model weights, prompts, configurations |
| Primary inputs | Code commits | Code commits, new data, drift signals, schedule |
| Testing | Unit tests, integration tests, end-to-end tests | All of those plus data validation, model validation, fairness audits, drift checks |
| Versioning | Git for code | Code in Git, data in DVC or LakeFS, models in a registry, prompts in a prompt store |
| CI trigger | Pull request | Pull request, data update, performance regression |
| CD target | A binary or container image | A container plus a model artifact plus configuration plus possibly a feature pipeline |
| Reproducibility | Deterministic builds from source | Probabilistic; needs data snapshots, random seeds, hardware notes, library pins |
| Rollback | Deploy previous code version | Deploy previous model and possibly previous data and feature pipeline together |
| Failure mode | Crashes, exceptions, latency spikes | All of those plus silent accuracy decay |
| Monitoring | Latency, throughput, error rate | All of those plus accuracy, calibration, drift, fairness, prediction distribution |
The critical difference is that a traditional application produces the same output for the same input given the same code. An ML system's behavior depends on the training data and the statistical relationship between features and target, both of which can change without anyone changing a line of code. This is why MLOps requires monitoring concepts (data drift, concept drift, prediction drift) that have no direct equivalent in DevOps, and why CI pipelines often need to validate data and trained models alongside code.
Culturally, MLOps inherits the DevOps shift away from siloed teams. In a mature MLOps organization, data scientists, ML engineers, platform engineers, and product owners share responsibility for what runs in production rather than throwing artifacts over the wall.
MLOps spans the full machine learning lifecycle. The exact stage names vary across vendors, but the phases below appear in nearly every MLOps reference architecture.
Before any data is collected, the team has to decide what business problem the ML system is solving, what success looks like in measurable terms, and whether ML is the right tool at all. This stage produces a problem statement, a success metric, an offline evaluation plan, and a rough idea of the data that will be required. Skipping this step is a common cause of projects that ship technically successful models nobody uses.
Data is the substrate of any ML system. This stage covers identifying data sources, building ingestion pipelines, cleaning and deduplicating records, and labeling examples when supervised learning is required. Labeling can be done in-house, through crowdsourcing platforms such as Scale or Surge, or through programmatic labeling tools such as Snorkel.
Key practices include:

- versioning raw and processed datasets so any model can later be traced back to the exact data it was trained on (tools such as DVC and LakeFS are covered below);
- validating each incoming batch against an expected schema, value ranges, and null-rate thresholds before it reaches training (see the sketch after this list);
- documenting datasets with datasheets or data cards;
- recording lineage from source systems through every transformation step.
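The kind of batch check described above can be expressed in a few lines; dedicated tools such as Great Expectations formalize the same idea. The column names and thresholds below are hypothetical.

```python
import pandas as pd

# Hypothetical expectations for an incoming batch; in practice these would live
# in a schema file or a data validation tool rather than in application code.
EXPECTED_COLUMNS = {"user_id", "amount", "country"}
MAX_NULL_FRACTION = 0.01

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means the batch passes."""
    problems = []
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "amount" in df.columns:
        if (df["amount"].dropna() < 0).any():
            problems.append("negative values in 'amount'")
        null_fraction = df["amount"].isna().mean()
        if null_fraction > MAX_NULL_FRACTION:
            problems.append(f"'amount' null fraction {null_fraction:.3f} exceeds {MAX_NULL_FRACTION}")
    return problems

batch = pd.DataFrame({"user_id": [1, 2], "amount": [10.0, None], "country": ["DE", "US"]})
for problem in validate_batch(batch):
    print("validation issue:", problem)  # a pipeline would typically fail fast here
```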
Raw data is rarely fed directly into a model. Feature engineering transforms raw inputs into the numerical or categorical features the model consumes. The same transformations must be applied during training and during serving, otherwise the model encounters distributions at inference time that it never saw at training time. This problem is called training-serving skew and is a common cause of degraded production performance [10].
Feature stores were invented to address this problem. They provide a single definition of each feature and serve those features consistently to both batch training jobs and real-time inference services.
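The simplest defense against training-serving skew, even without a feature store, is a single feature definition imported by both the offline training job and the online service. A minimal sketch, with a hypothetical feature, might look like this:

```python
from datetime import datetime, timezone

def days_since_last_order(last_order_at: datetime, now: datetime) -> float:
    """One feature definition, imported by both the batch training pipeline and the API server."""
    return max((now - last_order_at).total_seconds() / 86400.0, 0.0)

# Offline: applied to a historical snapshot to build the training set.
# Online: applied to the live record at request time, so the model sees the same
# transformation it was trained on instead of a reimplementation that can drift.
print(days_since_last_order(datetime(2026, 1, 1, tzinfo=timezone.utc),
                            datetime(2026, 1, 15, tzinfo=timezone.utc)))  # 14.0
```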
Training selects a model architecture, picks hyperparameters, and runs an optimization process over labeled data to produce a set of model weights. MLOps practices for training include:

- tracking every run, with its hyperparameters, data version, code version, metrics, and artifacts, in an experiment tracker;
- pinning random seeds, library versions, and hardware details so a run can be reproduced later;
- running training as a versioned pipeline rather than an ad hoc notebook session, so the same steps execute identically when automation triggers them.
A trained model is not necessarily a production-ready model. Validation evaluates it against a held-out test set, against the current production model on a shared benchmark, against fairness and safety criteria, and against business metrics. Common validation checks include:

- performance on the held-out test set, broken down by slice rather than reported only as a single aggregate;
- a champion-challenger comparison against the model currently serving traffic;
- fairness metrics across protected groups;
- calibration of predicted probabilities;
- agreement with the business metric defined during problem framing (see the sketch after this list).
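A minimal sketch of such a validation gate, assuming an evaluation job has already produced overall and per-slice AUC numbers (the metric names and thresholds are placeholders):

```python
def passes_gate(candidate: dict, production: dict, slice_floor: float = 0.70) -> bool:
    """Promote the candidate only if it beats the production model and clears per-slice minimums."""
    # No regression against the model currently serving traffic.
    if candidate["auc_overall"] < production["auc_overall"]:
        return False
    # Every slice (e.g. region or device type) must clear an absolute floor.
    return all(auc >= slice_floor for auc in candidate["auc_by_slice"].values())

candidate = {"auc_overall": 0.84, "auc_by_slice": {"mobile": 0.81, "desktop": 0.86}}
production = {"auc_overall": 0.82}
print("promote" if passes_gate(candidate, production) else "reject")
```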
Deployment packages the validated model with its dependencies and exposes it for use by downstream consumers. Common deployment patterns include:
| Pattern | Description | Typical use case |
|---|---|---|
| Online inference | Model responds to individual requests via an API at low latency | Recommendations, fraud scoring, ad ranking, chat |
| Offline inference | Model processes large batches on a schedule | Credit scoring, churn prediction, lead scoring |
| Streaming inference | Model processes events from a queue with sub-second latency | Real-time anomaly detection |
| Edge deployment | Model runs on a device with intermittent or no connectivity | Mobile keyboards, on-device speech, IoT sensors |
| Shadow deployment | New model runs alongside the production model but does not serve responses | Validation under live traffic before cutover |
| Canary deployment | A small percentage of traffic is routed to the new model | Risk-bounded rollout |
| Blue-green deployment | Two identical environments swap roles after validation | Fast rollback at the cost of double infrastructure |
| A/B test | Traffic is split between two or more model versions | Statistical comparison of business metrics |
Beyond the rollout pattern, deployment also has to handle packaging (typically Docker), serving framework (Triton, KServe, BentoML, vLLM, or a custom server), hardware selection, autoscaling policies, and version pinning.
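As an illustration of the "custom server" option, a minimal online-inference endpoint can be written with FastAPI. The model file, feature names, and the assumption of a scikit-learn-style `predict_proba` are all hypothetical; a production server would add batching, health checks, and authentication.

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# The artifact produced by the training pipeline (assumed to be a scikit-learn-style model).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

class Features(BaseModel):
    amount: float
    days_since_last_order: float

@app.post("/predict")
def predict(features: Features) -> dict:
    score = model.predict_proba([[features.amount, features.days_since_last_order]])[0][1]
    # Expose the model version alongside the score so monitoring can segment by version.
    return {"score": float(score), "model_version": "2026-01-15"}
```

Assuming the file is named main.py, this would be run with `uvicorn main:app` behind whatever autoscaling layer the platform provides.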
Once a model is in production, it must be continuously monitored. Unlike traditional software where bugs are usually introduced by code changes, ML models can degrade silently because the world they operate in changes. Monitoring covers system health (latency, throughput, error rate, GPU and CPU utilization), model behavior (prediction distribution, confidence calibration, output distribution shift), input data (feature drift, missing or out-of-range values), and business outcomes (downstream conversion, revenue, user satisfaction).
When monitoring detects performance decay or when fresh labeled data accumulates, the model must be retrained. Mature MLOps automates this loop. Triggers can be schedule-based (retrain every week), threshold-based (retrain when accuracy drops below X), or event-based (retrain when a drift detector fires). The retrained model goes through the same validation and deployment pipeline as the original.
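A sketch of how these triggers might be combined; the thresholds and the drift flag stand in for whatever the monitoring system actually reports:

```python
from datetime import datetime, timedelta, timezone

def should_retrain(last_trained: datetime,
                   current_accuracy: float,
                   drift_detected: bool,
                   max_age: timedelta = timedelta(days=7),
                   accuracy_floor: float = 0.80) -> bool:
    schedule_due = datetime.now(timezone.utc) - last_trained > max_age   # schedule-based
    accuracy_low = current_accuracy < accuracy_floor                     # threshold-based
    return schedule_due or accuracy_low or drift_detected                # event-based

if should_retrain(datetime(2026, 1, 1, tzinfo=timezone.utc), 0.78, drift_detected=False):
    print("kick off the retraining pipeline")  # the retrained model re-enters validation
```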
Models eventually outlive their usefulness, either because the underlying business has changed, a better model has replaced them, or they cannot be safely maintained. MLOps includes a process for cleanly removing a model from production, archiving its weights and documentation, and rerouting any consumers.
Google has proposed a widely cited maturity model for MLOps that defines three levels of automation [2]. Microsoft and other vendors have published variants, but the Google model is the most often referenced.
| Level | Name | What is automated | What is still manual |
|---|---|---|---|
| 0 | Manual process | Almost nothing. Data scientists train models in notebooks and hand off model files to engineers, who deploy them. | Data pulls, training, validation, deployment, monitoring. Retraining is rare and often forgotten. |
| 1 | ML pipeline automation | Training is wrapped in a reusable pipeline. Continuous training (CT) is enabled, so the model can be retrained automatically when new data arrives or a trigger fires. Feature stores and metadata stores are common. | Code changes still require manual deployment of the pipeline itself. CI/CD for the pipeline is not yet automated. |
| 2 | CI/CD pipeline automation | The training pipeline itself is built, tested, and deployed automatically. Code changes flow through a CI/CD system that builds, tests, and pushes new pipeline versions. Models, data, and code are all versioned, validated, and deployed automatically. | Strategy decisions, problem framing, governance review. |
Most organizations begin at Level 0 and progress incrementally. Reaching Level 2 requires substantial investment in tooling, infrastructure, and organizational practices. In practice, many successful teams operate at Level 1 for the bulk of their models and reserve full Level 2 automation for their most critical or highest-volume systems.
Microsoft's Azure-flavored maturity model adds intermediate levels and is sometimes referenced when teams want a more granular self-assessment. Both models share the same underlying message: maturity is about automation, repeatability, and the absence of manual handoffs, not about model sophistication.
A production MLOps stack is rarely a single product. It is an assembly of components, each addressing a specific concern in the lifecycle.
Data versioning tools track changes to datasets the way Git tracks changes to source code. They allow a model to be rebuilt against the exact data it was trained on. Common tools include DVC (Data Version Control), which layers a Git-like interface over object storage; LakeFS, which provides Git-like operations on top of S3 or Azure Blob Storage; and Pachyderm, which links data versioning with pipeline orchestration on Kubernetes.
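As a sketch of what data versioning looks like in practice, DVC's Python API can read a dataset as it existed at a specific Git revision. The path, repository URL, and tag below are hypothetical:

```python
import dvc.api
import pandas as pd

# Read the training data exactly as it existed at the tagged revision.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/example/ml-project",  # Git repo that tracks the data with DVC
    rev="v1.2.0",                                  # Git tag or commit pinning the data version
) as f:
    train = pd.read_csv(f)

print(len(train), "rows from the pinned snapshot")
```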
Experiment tracking systems record every training run with its hyperparameters, training data version, code version, environment details, evaluation metrics, and produced artifacts. By keeping a complete history, they enable comparison of approaches, reproduction of past results, and audit trails for governance. The dominant tools are MLflow Tracking, Weights & Biases, Neptune, and Comet.
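A minimal sketch of an MLflow Tracking run; the tracking server URI, experiment name, and parameter values are illustrative:

```python
import mlflow

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # assumed internal tracking server
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.05)
    mlflow.log_param("data_version", "v1.2.0")   # ties the run to the data snapshot it used
    mlflow.log_metric("auc", 0.84)
    mlflow.log_artifact("model.pkl")             # or a flavor-specific log_model call
```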
A feature store is a centralized service for defining, computing, storing, and serving ML features [10]. It addresses two problems at once: training-serving skew (because the same definition is used for both batch backfills and real-time serving) and feature reuse (because teams can discover and reuse features rather than recomputing them). The category took shape after Uber's Michelangelo Palette demonstrated the pattern. Today the leading options are Feast (open source, modular), Tecton (commercial, opinionated end-to-end), Hopsworks (open source, includes its own RonDB online store), Vertex AI Feature Store (Google), and SageMaker Feature Store (AWS).
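A sketch of both halves of the pattern using Feast; the feature view name, feature names, and entity are hypothetical, but the point is that training and serving read from the same definitions:

```python
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing the Feast feature definitions

# Offline: point-in-time correct features joined onto labeled training examples.
entity_df = pd.DataFrame({
    "user_id": [1, 2],
    "event_timestamp": pd.to_datetime(["2026-01-01", "2026-01-02"]),
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:days_since_last_order", "user_stats:order_count_30d"],
).to_df()

# Online: the same features fetched at low latency at inference time.
online_features = store.get_online_features(
    features=["user_stats:days_since_last_order", "user_stats:order_count_30d"],
    entity_rows=[{"user_id": 1}],
).to_dict()
```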
A model registry is a versioned repository for trained models with their metadata: training data versions, hyperparameters, evaluation metrics, lineage, and deployment status. It is the source of truth for what models exist, what their performance characteristics are, and which one is currently serving traffic. Registries typically support tagging (staging, production, archived), approval workflows, and rollback. MLflow Model Registry, Vertex AI Model Registry, and SageMaker Model Registry are the most widely used implementations.
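A sketch of registering and promoting a model with the MLflow Model Registry; the run ID and model name are placeholders, and newer MLflow versions favor aliases over the older stage labels shown here:

```python
import mlflow
from mlflow.tracking import MlflowClient

# Register the artifact logged by a training run as a new version of the model.
result = mlflow.register_model("runs:/<run_id>/model", "churn-model")

client = MlflowClient()
client.transition_model_version_stage(
    name="churn-model",
    version=result.version,
    stage="Staging",  # promoted to "Production" only after the validation gate passes
)
```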
Pipeline orchestrators define and execute the directed graph of steps in an ML workflow: extract data, validate it, transform features, train, evaluate, and deploy. Each tool occupies a slightly different niche.
| Orchestrator | Origin | Typical strength | Notes |
|---|---|---|---|
| Apache Airflow | Airbnb, 2014 | Mature ecosystem, hundreds of operators | Originally a general data pipeline tool; widely used for ML by extension |
| Kubeflow Pipelines | Google, 2018 | Native Kubernetes execution, TFX integration | Powerful but operationally heavy; demands Kubernetes expertise |
| Metaflow | Netflix, 2019 | Data scientist ergonomics, AWS-first | Notebook-friendly DSL; less flexible outside AWS |
| Flyte | Lyft, 2020 | Strongly typed, Kubernetes-native, scalable | Reproducibility guarantees and good big-job support |
| Prefect | 2018 | Python-native decorators, hybrid execution | Good for data engineers transitioning to ML |
| Dagster | 2019 | Asset-centric model, software-defined assets | Strong observability and testing built in |
| ZenML | 2021 | Framework-agnostic abstraction layer | Plugs into other orchestrators as backends |
Serving infrastructure runs the model in production, handling autoscaling, batching, and the runtime details of inference. The right choice depends on the model type and the latency budget.
| Server | Best for | Notable features |
|---|---|---|
| NVIDIA Triton | Heterogeneous models on GPUs | Concurrent model execution, dynamic batching, multi-framework |
| KServe | Kubernetes-native serving | Operates at the orchestration layer, supports many predictors |
| Seldon Core | Kubernetes-native serving with A/B testing | Canary deployments, multi-armed bandits |
| BentoML | Python-first model packaging | Builds OpenAI-compatible APIs, supports many backends |
| TensorFlow Serving | TensorFlow models | Mature, REST and gRPC, optimized for TF graphs |
| TorchServe | PyTorch models | Native PyTorch packaging |
| vLLM | LLM inference | PagedAttention, continuous batching, best-in-class TTFT for LLMs |
| Ray Serve | Composable Python services | Good for ensembles and serving DAGs |
For large language models the calculus is different. Benchmarks reported by BentoML in 2024 found that vLLM achieved best-in-class time-to-first-token across concurrency levels, while TensorRT-LLM and LMDeploy delivered the highest token generation rates [11]. Triton can host vLLM as a backend, which is a common production pattern for organizations that want vLLM's performance with Triton's multi-model orchestration.
ML monitoring tools track data drift, concept drift, prediction quality, latency, and cost. The category includes Evidently (open source, statistical drift reports), WhyLabs (managed, with WhyLogs as the open-source profiling library), Arize and Arize Phoenix (model and LLM observability), Fiddler AI (model performance management with explainability), and Datadog Model Monitoring (integrated with Datadog APM). Most of these tools have added LLM-specific features since 2023.
Continuous training (CT) is the ML-specific extension of continuous delivery. In a CI/CD pipeline for code, a commit triggers a build, a test run, and a deployment. In a CI/CD/CT pipeline for ML, the same flow applies, but additional triggers can also start a retraining run: new labeled data has accumulated, a drift detector has fired, scheduled retraining is due, or a downstream metric has dropped below a threshold.
A CT pipeline typically includes:

- extraction and validation of fresh training data;
- feature transformation using the same definitions as serving;
- training with the pinned configuration;
- evaluation of the candidate against the current production model and against absolute quality thresholds;
- a gating step that registers the candidate in the model registry only if it passes;
- automated deployment of the approved model through the same rollout patterns described above.
The gating step is what distinguishes CT from naive automation. A model that fails validation should not be deployed even if it was produced automatically. This is why a model registry with explicit promotion stages is a near-universal feature of mature MLOps stacks.
CI/CD for ML extends traditional CI/CD with steps that have no equivalent in pure software delivery.
A common practice is to keep these stages in separate pipelines that share a model registry and feature store as connecting interfaces, so that data, model, and code teams can work asynchronously without blocking each other.
Production monitoring is where MLOps does most of its day-to-day work, and it is also where the ML-specific failure modes show up.
Data drift, also called covariate shift, occurs when the distribution of input features changes between training and serving. A model trained on customer behavior from 2023 may encounter very different behavior in 2026 as user preferences, payment methods, or platform mix evolve. Drift can be gradual (seasonality, slow demographic change) or sudden (a global event, a competitor launch, a feature pipeline bug).
Detection requires statistical tests on incoming feature distributions compared to a reference distribution (typically the training set). Common tests include the Kolmogorov-Smirnov (KS) test for continuous features, the chi-square test for categorical features, the Population Stability Index (PSI) used heavily in credit risk, and the Jensen-Shannon and Kullback-Leibler divergences. Each test has tradeoffs. The KS test is sensitive but can fire too often on large datasets where small changes are statistically significant but practically irrelevant. PSI gives a single interpretable number with widely used thresholds (0.1 for minor change, 0.25 for major change). KL divergence is asymmetric and requires care when comparing distributions [12].
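A sketch of two of these tests on a single continuous feature, using scipy for the KS test and a small hand-rolled PSI; the reference and current arrays stand in for the training distribution and recent live traffic:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index using bins derived from the reference distribution."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Avoid division by zero and log of zero for empty bins.
    ref_pct = np.clip(ref_pct, 1e-6, None)
    cur_pct = np.clip(cur_pct, 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)   # training-time feature values
current = rng.normal(0.3, 1.0, 10_000)     # shifted live values

stat, p_value = ks_2samp(reference, current)
print(f"KS p-value: {p_value:.4f}, PSI: {psi(reference, current):.3f}")  # PSI > 0.25 signals major change
```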
Concept drift refers to a change in the relationship between input features and the target variable, even when the input distribution looks stable. A churn model can decay because the same demographic and behavioral signals now predict churn at different rates than they used to. Concept drift is harder to detect than data drift because it requires labeled outcome data, and labels are usually delayed.
Prediction drift watches the distribution of model outputs. A sudden shift in the fraction of positive predictions, the average predicted score, or the entropy of class probabilities can indicate either upstream data drift or concept drift, even before labels arrive. Prediction drift is often the first signal an MLOps team has that something has changed.
In addition to ML-specific monitoring, the team must watch the same system metrics any production service has: request rate, latency at p50 and p99, error rate, queue depth, GPU and CPU utilization, memory, and cost. For LLM services, latency is often broken into time-to-first-token (TTFT) and inter-token latency rather than treated as a single number, because user perception of quality depends heavily on how quickly the response begins streaming.
Calibration measures whether predicted probabilities match observed frequencies. A well-calibrated binary classifier that outputs 0.7 should be correct about 70 percent of the time on examples in that score band. Calibration can decay independently of accuracy and matters wherever the model's score is used to drive a downstream decision.
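A sketch of a calibration check with scikit-learn's `calibration_curve`; the labels and scores stand in for logged production outcomes and the model's predicted probabilities:

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 1, 0, 1])          # observed outcomes
y_prob = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9, 0.6, 0.7, 0.2, 0.8])  # model scores

observed, predicted = calibration_curve(y_true, y_prob, n_bins=5)
for p, o in zip(predicted, observed):
    print(f"predicted {p:.2f} -> observed {o:.2f}")   # large gaps indicate calibration decay
```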
Fairness monitoring tracks model behavior across protected groups (race, gender, age, geography) using metrics such as disparate impact, equal opportunity difference, and demographic parity. Fairness regressions are particularly important under regimes such as the EU AI Act that explicitly require ongoing bias monitoring for high-risk systems.
LLMOps is the operational discipline for systems built around large language models. It is a specialization of MLOps rather than a replacement, but enough is different that it has acquired its own tooling and vocabulary.
The biggest differences are that the model itself is often a third-party API (OpenAI, Anthropic, Google), the most important artifacts are prompts and retrieval pipelines rather than weights, the cost driver is inference tokens rather than training compute, and quality cannot be reduced to a single accuracy number because outputs are open-ended text.
| Aspect | Traditional MLOps | LLMOps |
|---|---|---|
| Model | Trained or fine-tuned in-house | Often consumed via API; sometimes fine-tuned with LoRA or QLoRA |
| Versioned artifacts | Code, data, model weights | Code, prompts, retrieval indexes, evaluation datasets, sometimes weights |
| Quality measurement | Standard metrics (accuracy, F1, AUC) | LLM-as-judge, human evaluation, rubric-based scoring, red-teaming |
| Latency | Single inference time | Time-to-first-token plus inter-token latency |
| Cost driver | Training compute | Per-token inference cost; long contexts and large output windows |
| Safety concerns | Bias, fairness | Hallucinations, prompt injection, PII leakage, jailbreaks, content policy |
| Data pipeline | Feature engineering | RAG corpus curation, embedding management, chunking strategy |
| Failure mode | Numeric drift | Hallucinated facts, format breakage, refusal regressions, drift in tone |
Specialized LLMOps tooling has emerged rapidly. LangSmith (closed source, deeply integrated with LangChain) provides tracing, evaluations, and prompt versioning. Langfuse is open source and framework-agnostic, popular for self-hosting. Helicone is a proxy-based observability tool that requires almost no code changes. PromptLayer focuses on prompt management with a no-code editor. Arize Phoenix offers open-source LLM tracing with notebook-friendly visualizations. Humanloop targets prompt evaluation workflows for product teams.
Generative AI gateways are another LLMOps category that did not exist in classical MLOps. LiteLLM provides a unified Python interface across more than 100 LLM providers and integrates logging out of the box. Portkey adds routing, retries, fallbacks, guardrails, and budgets on top of an OpenAI-compatible API. These gateways play a role analogous to a service mesh in microservices: they sit between the application and the model providers and enforce cross-cutting concerns.
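A sketch of what routing through such a gateway looks like with LiteLLM; the model identifiers and the cost-based routing rule are illustrative, and provider API keys are assumed to be set in the environment:

```python
from litellm import completion

def ask(prompt: str, cheap: bool = True) -> str:
    # Route simple requests to a cheaper model and harder ones to a stronger one.
    model = "openai/gpt-4o-mini" if cheap else "anthropic/claude-3-5-sonnet-20241022"
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

print(ask("Summarize this support ticket in one sentence."))
```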
As ML systems have moved into hiring, lending, healthcare, and law enforcement, governance has become a first-class MLOps concern rather than an afterthought.
The model card was introduced by Margaret Mitchell and colleagues at Google in 2018 (published 2019) as a short document accompanying a trained model that describes its intended use, performance across demographic groups, evaluation conditions, ethical considerations, and known limitations [13]. Model cards have since been adopted by Hugging Face (where every model on the Hub has a card), Google's Vertex AI, and many internal MLOps platforms.
Datasheets for datasets, proposed by Timnit Gebru and colleagues in 2018, play the analogous role for training data. A datasheet describes the dataset's motivation, composition, collection process, labeling, recommended uses, and maintenance plan, inspired by the datasheets that accompany electronic components.
Google later introduced Data Cards (a slightly different format from datasheets) and Hugging Face uses Dataset Cards on its Hub. The general pattern of structured, machine-readable model and data documentation is now a standard MLOps practice.
The US National Institute of Standards and Technology released the AI Risk Management Framework (AI RMF 1.0) in January 2023 as a voluntary, sector-agnostic guide for managing AI risk across the lifecycle [14]. The framework defines four functions:
| Function | Purpose |
|---|---|
| GOVERN | Establish leadership, accountability, and risk-tolerance policies for AI across the organization |
| MAP | Understand the context of use, stakeholders, and potential impacts |
| MEASURE | Quantitatively and qualitatively assess AI risks and impacts |
| MANAGE | Allocate resources to mitigate risks identified through GOVERN, MAP, and MEASURE |
NIST followed in 2024 with AI 600-1, a generative-AI-specific profile, and a voluntary playbook that maps the framework to concrete actions. AI RMF has become a common reference for MLOps governance even outside the US, because it is technology-agnostic and easy to overlay on existing development processes.
The EU AI Act (Regulation EU 2024/1689) entered into force on 1 August 2024 and is the world's first comprehensive AI regulation [15]. It classifies AI systems by risk: prohibited practices (such as social scoring) are banned, high-risk systems (in areas like employment, education, credit, law enforcement, and biometrics) face strict compliance obligations, limited-risk systems carry transparency requirements, and minimal-risk systems are largely unregulated. General-purpose AI (GPAI) models have their own dedicated rules.
For high-risk systems the providers must implement a continuous risk management process across the lifecycle, use representative and high-quality training data, maintain technical documentation and event logs, ensure human oversight, achieve adequate accuracy and cybersecurity, register the system in an EU database, and conduct conformity assessments. Most of these obligations map directly onto MLOps concerns: lineage, monitoring, model cards, audit logs, and structured deployment processes.
The Act's timeline is staggered. Prohibitions and AI literacy obligations applied from 2 February 2025. Governance rules and obligations for GPAI models took effect on 2 August 2025. The full regime applies on 2 August 2026, with high-risk systems embedded in regulated products getting until 2 August 2027. As of early 2026, MLOps teams operating in or selling into the EU are actively building documentation pipelines, conformity-assessment artifacts, and quality management systems to meet these deadlines.
A related strand of governance work is data-centric AI, the argument popularized by Andrew Ng that improving data quality often yields more reliable production performance than tweaking model architecture. Data-centric practices (consistent labeling guidelines, error analysis on slices, targeted data collection for failure cases) have become standard parts of MLOps for high-stakes systems.
Despite a decade of public MLOps writing, certain anti-patterns recur in organizations that are early in their ML journey: models handed over the wall from notebooks to engineering with no shared pipeline, unversioned data and models that make results irreproducible, feature logic duplicated between training code and serving code, deployments with no monitoring so accuracy decays silently, and retraining that is rare, manual, and easily forgotten.
MLOps maturity is largely about systematically eliminating these patterns. The Sculley paper's central message holds up: most of the cost of ML in production is not in the model but in the surrounding infrastructure and process.
The modern MLOps ecosystem is wide and continues to consolidate around a few dominant patterns.
| Tool | Category | Notes |
|---|---|---|
| MLflow | Tracking, registry, deployment | Most widely adopted open-source MLOps platform; MLflow 3 (June 2025) added GenAI support |
| Kubeflow | Orchestration, training, serving | Kubernetes-native; broad but operationally heavy |
| Apache Airflow | Workflow orchestration | General-purpose; widely used as the default DAG runner |
| Prefect | Workflow orchestration | Python-native decorators, hybrid execution |
| Dagster | Workflow orchestration | Asset-centric model with strong observability |
| Flyte | Workflow orchestration | Strongly typed, Kubernetes-native |
| Metaflow | Workflow orchestration | Netflix-origin, AWS-first, data scientist friendly |
| ZenML | Abstraction layer | Plugs into other orchestrators as backends |
| DVC | Data versioning | Git-like data tracking |
| LakeFS | Data versioning | Git-like operations on object storage |
| Pachyderm | Data versioning + pipelines | Lineage-aware processing on Kubernetes |
| Feast | Feature store | The reference open-source feature store |
| Hopsworks | Feature store + lakehouse | Includes RonDB online store |
| Great Expectations | Data validation | Tests for data quality and schema |
| BentoML | Model packaging and serving | OpenAI-compatible APIs, multiple backends |
| Seldon Core | Model serving on Kubernetes | A/B testing and canary deployment patterns |
| KServe | Model serving on Kubernetes | Standardized inference runtime |
| Ray | Distributed compute, serving | Ray Train, Ray Tune, Ray Serve cover much of the lifecycle |
| Triton Inference Server | Model serving | Multi-framework, GPU-optimized |
| vLLM | LLM serving | PagedAttention, leading TTFT for LLMs |
| Evidently | Monitoring | Open-source data and model drift reports |
| WhyLogs | Data profiling | Lightweight statistical fingerprints |
| Arize Phoenix | LLM observability | Notebook-friendly tracing |
| Langfuse | LLM observability | Open-source, framework agnostic |
| LiteLLM | LLM gateway | Unified API across providers |
| Platform | Provider | Strength |
|---|---|---|
| Amazon SageMaker | AWS | End-to-end platform with notebooks, training, serving, registry, and pipelines; deep integration with the AWS ecosystem |
| Vertex AI | Google Cloud | Unified ML platform with AutoML, custom training, feature store, Pipelines, and tight integration with Gemini |
| Azure Machine Learning | Azure | Enterprise ML with responsible AI dashboards, managed endpoints, and tight integration with the Microsoft ecosystem |
| Databricks | Databricks | Lakehouse-centric with Mosaic AI, Unity Catalog, and managed MLflow |
| Snowflake Cortex | Snowflake | Brings models close to data inside the warehouse |
| Tool | Category | Notes |
|---|---|---|
| Weights & Biases | Experiment tracking, model registry, evaluations | The dominant commercial alternative to MLflow Tracking |
| Neptune | Experiment tracking | Strong for large-scale experiment management |
| Comet | Experiment tracking | Includes LLM features as Opik |
| Tecton | Feature platform | Born from the Uber Michelangelo team |
| Modal | Compute platform | Serverless GPUs for training and inference |
| Anyscale | Managed Ray | Ray-as-a-service for distributed ML |
| Hugging Face | Model and dataset hub plus Inference Endpoints | The de facto registry for open-weight models |
| Arize | Model and LLM observability | Combines classical ML monitoring with LLM tracing |
| Fiddler AI | Model performance management | Explainability and bias monitoring |
| WhyLabs | Monitoring | Built on the open-source WhyLogs library |
| LangSmith | LLM observability | LangChain-integrated tracing and evaluations |
| Helicone | LLM observability | Proxy-based, low setup overhead |
| Portkey | LLM gateway | Routing, fallbacks, guardrails, budgets |
| Humanloop | Prompt evaluation | Product-team-friendly LLM workflows |
The consensus pattern in 2026 is composition. A typical mature stack might combine MLflow for experiment tracking and registry, Feast or Tecton for features, a Kubernetes-based orchestrator like Kubeflow or Flyte for pipelines, KServe or Triton for serving, and Evidently or Arize for monitoring. LLM workloads add LangSmith or Langfuse for tracing and a gateway like LiteLLM or Portkey for provider abstraction.
Uber's Michelangelo is the longest-running production MLOps platform at internet scale [6]. It has gone through three major architectural generations: an initial monolithic platform focused on classical ML, a redesign that introduced a Lego-like plug-and-play model so teams could swap in best-of-breed open-source components, and a third generation that added support for deep learning and generative AI. By 2024 the platform was running 100 percent of Uber's business-critical ML workloads, including LLM-based customer service bots. Michelangelo's early feature store was an industry milestone, and the team that built it later founded Tecton.
Netflix's machine learning platform team built Metaflow with the explicit goal of optimizing for the data scientist's experience [8]. Metaflow's Python DSL lets a scientist describe a workflow as decorated Python steps, and Metaflow handles versioning, scheduling, and resource provisioning. Backing Metaflow are Titus (Netflix's container platform on AWS), Maestro (Netflix's open-source workflow orchestrator), Atlas (Netflix's time-series telemetry system, processing 17 billion metrics per day), and Metaflow Hosting for deploying models behind a managed inference endpoint. The system supports hundreds of personalization models in production and serves predictions to more than 230 million members.
FBLearner Flow is the longest-tenured large-scale workflow platform in the industry's MLOps story [5]. By the time of public disclosure in 2016 it was running ML workflows for ranking, ads, and content understanding across Facebook. Its evolution illustrates an important pattern: a platform built for ML often becomes a general-purpose workflow runner, because the orchestration, versioning, and resource-management problems are not actually ML-specific.
TFX is the open-source distillation of how Google runs production ML on TensorFlow [7]. A TFX pipeline is a sequence of components (ExampleGen, StatisticsGen, SchemaGen, Transform, Trainer, Evaluator, Pusher) that share data through ML Metadata, a service that records artifact lineage and configuration. TFX runs on multiple orchestrators (Apache Beam, Apache Airflow, Kubeflow Pipelines, Vertex AI Pipelines), which means a single TFX pipeline definition can move between local development, on-premise Kubernetes, and managed cloud execution.
MLOps is a sociotechnical practice. The technology is necessary but not sufficient.
The MLOps Engineer role emerged as a distinct title around 2020 and has become common at companies running ML at scale. The role sits between data science, DevOps, and platform engineering. According to 2025-2026 salary data, MLOps engineers in the United States earn on average around 130,000 to 161,000 dollars per year, with senior roles reaching 200,000 dollars or more [16]. Compensation has grown roughly 20 percent year over year through 2025, reflecting strong demand for people who can bridge ML and operations.
A common organizational model in larger companies is to have a dedicated ML platform team that builds internal tools (a feature store, a model registry, training infrastructure, monitoring) used by multiple product teams. The platform team's job is to make ML deployment a self-service experience for data scientists rather than a custom engineering project for each model. This is essentially platform engineering applied to ML.
Smaller organizations often combine MLOps with data engineering, DevOps, or ML engineering. The boundary between roles matters less than the practices: someone has to own the production pipeline, someone has to monitor it, and someone has to retrain when it decays.
MLOps in 2026 is mature in the sense that the core practices and tools are well established, but the field is still moving rapidly because the underlying technology continues to evolve.
A few patterns dominate the recent landscape:
The convergence of MLOps and LLMOps is well underway. The same platforms increasingly support both classical models and LLMs, and many MLOps teams now serve both kinds of workload. MLflow's 3.0 release in 2025 added generative AI tracking. Vertex AI, SageMaker, and Azure ML have all added LLM-specific features. Tools that started as LLM observability platforms (Arize Phoenix, Langfuse) are increasingly used for classical ML too.
Modular and composable architectures have replaced the monolithic vision of "one platform for all of MLOps." Teams pick best-of-breed tools and connect them through standard interfaces, especially the model registry and feature store. The AI gateway has emerged as a new architectural component for the LLM stack.
Governance has shifted from optional to mandatory for many systems. The EU AI Act, NIST AI RMF, ISO/IEC 42001, and sector-specific regulations (FDA for medical devices, OCC for credit, FINRA for finance) all require structured documentation and ongoing oversight. MLOps teams that already practiced version control, monitoring, and audit logging are well positioned; those that did not are scrambling.
Cost optimization has become central to MLOps for LLM workloads. Token-based pricing, large context windows, and high-volume inference make the cost dimension impossible to ignore. Techniques like model routing, prompt caching, retrieval to reduce context size, quantization, and serving smaller distilled models for less demanding queries are now standard practice.
Agent orchestration is the newest frontier. Production AI systems in 2026 are increasingly composed of multiple interacting components: foundation models, fine-tuned adapters, retrieval systems, tools, guardrails, routing logic, and feedback loops. Each piece has its own lifecycle and failure modes. Tracing tools that visualize an entire agent run as a tree of nested spans (Langfuse, LangSmith, Arize Phoenix, OpenTelemetry semantic conventions for AI) are becoming the operational equivalent of distributed tracing for microservices.
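A sketch of what span-based tracing of an agent run looks like with the OpenTelemetry API; the span names and steps are placeholders, and exporting the spans would require configuring a TracerProvider, which is omitted here:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent-demo")

def run_agent(question: str) -> str:
    # Each step of the agent becomes a child span, so the whole run renders as a tree.
    with tracer.start_as_current_span("agent_run"):
        with tracer.start_as_current_span("retrieval"):
            context = "...retrieved passages..."       # e.g. a vector-store lookup
        with tracer.start_as_current_span("llm_call"):
            answer = f"answer to {question!r} using {len(context)} chars of context"
        with tracer.start_as_current_span("guardrail_check"):
            pass                                       # e.g. PII and policy filters
    return answer

print(run_agent("What is our refund policy?"))
```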
The enduring mantra of MLOps remains "version everything": code, data, models, prompts, configurations, environments. Teams that adopt this discipline along with automated CI/CD and production-grade monitoring consistently achieve better outcomes than those that rely on manual processes. The Sculley paper from 2015 turned out to be exactly right: the model itself is the smallest part of an ML system, and most of the work is in everything around it.