MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. Created by Databricks and first released in June 2018, MLflow provides tools for experiment tracking, model packaging, a model registry, and deployment. The project was co-created by Matei Zaharia, who also created Apache Spark, alongside other engineers at Databricks. MLflow is licensed under the Apache 2.0 license and joined the Linux Foundation in 2020 as a vendor-neutral open-source project.
With over 24,000 GitHub stars, more than 60 million monthly downloads, and 900+ contributors, MLflow has become one of the most widely adopted MLOps platforms in the industry. It supports Python, Java, R, and REST APIs, and integrates with a wide range of ML and deep learning frameworks.
MLflow was announced on June 5, 2018, at the Spark + AI Summit in San Francisco. Databricks released it as an open-source alpha, aiming to address a persistent problem in machine learning: the difficulty of tracking experiments, reproducing results, and deploying models in a consistent manner. At the time of its introduction, the ML ecosystem lacked standardized tools for managing the full model lifecycle, and teams often relied on ad hoc scripts and manual processes.
The initial alpha included three core components: MLflow Tracking, MLflow Projects, and MLflow Models. The Model Registry was added later to provide centralized model management with versioning and stage transitions.
MLflow 1.0 was released on June 4, 2019, marking the project's first stable release with guaranteed API stability across Python, Java, R, and REST interfaces. By this time, MLflow had already accumulated a growing user base with over 1 million downloads.
In June 2020, Databricks contributed MLflow to the Linux Foundation, establishing it as a vendor-neutral project governed by an independent community. This move was intended to ensure long-term open-source stewardship and encourage broader industry participation.
MLflow 2.0 arrived on November 15, 2022, with a focus on simplifying data science workflows and introducing new evaluation and deployment capabilities. The 2.x release series also brought initial support for large language models (LLMs), including the MLflow AI Gateway and tracing features.
MLflow 3.0 was released on June 9, 2025, representing a substantial architectural shift. This version introduced first-class support for generative AI applications and agents, a unified evaluation framework, and the LoggedModel entity as a new core abstraction. MLflow 3 removed several deprecated components, including MLflow Recipes and the fastai flavor.
As of March 2026, the latest stable release is MLflow 3.10.x, with active development continuing on the 3.x series.
| Version | Release date | Highlights |
|---|---|---|
| 0.1 (alpha) | June 2018 | Initial release with Tracking, Projects, and Models |
| 1.0 | June 2019 | Stable API; Python, Java, R, and REST interface guarantees |
| 2.0 | November 2022 | Revamped Tracking UI, MLflow Recipes, Keras/TensorFlow unification |
| 2.7 | 2023 | Experimental AI Gateway for LLM providers |
| 2.8 | Late 2023 | LLM-as-a-Judge evaluation metrics for RAG applications |
| 3.0 | June 2025 | LoggedModel entity, GenAI-first architecture, removal of Recipes |
| 3.10.x | February 2026 | Latest stable release |
MLflow is organized around several distinct components that can be used independently or together. Each component addresses a specific stage of the machine learning lifecycle.
MLflow Tracking is an API and UI for logging parameters, code versions, metrics, and artifacts during ML experiments. It is organized around the concept of "runs," where each run represents a single execution of training code (for example, a single invocation of a Python training script). For each run, MLflow records:
| Data type | Description | Examples |
|---|---|---|
| Parameters | Input configuration values | Learning rate, batch size, number of epochs |
| Metrics | Output measurements logged over time | Accuracy, loss, F1 score |
| Artifacts | Output files from the run | Model weights, images, serialized pipelines |
| Tags and metadata | Custom labels and run information | Source code version, start and end times |
Runs are grouped into "experiments," which allow users to organize related training sessions. The Tracking UI provides visualization tools for comparing runs side by side, plotting metric curves, and filtering results. MLflow Tracking supports multiple backend storage options, including local file storage, SQLite, PostgreSQL, MySQL, and cloud-based solutions such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
Autologging is a feature that automatically captures parameters and metrics from supported frameworks without requiring manual instrumentation. Frameworks with autologging support include scikit-learn, TensorFlow, PyTorch, Keras, XGBoost, LightGBM, Spark MLlib, and Statsmodels.
MLflow Models provides a standard format for packaging ML models so they can be used across different serving environments. The packaging format uses "flavors" to describe how a model can be interpreted by different tools. A single model can have multiple flavors; for example, a scikit-learn model might be saved with both an sklearn flavor (for native scikit-learn loading) and a python_function flavor (for generic Python-based inference).
Every MLflow Model includes an MLmodel YAML file that lists the flavors the model supports. The python_function flavor is the most universal, providing a generic Python interface for inference regardless of the original training framework. This allows any MLflow Model to be deployed to any platform that supports Python.
The built-in flavors supported by MLflow include:
| Flavor | Module | Description |
|---|---|---|
| Python Function | mlflow.pyfunc | Generic Python callable; all other flavors can be loaded as pyfunc |
| Scikit-learn | mlflow.sklearn | Classification, regression, and clustering models |
| TensorFlow | mlflow.tensorflow | TensorFlow SavedModel format with autologging |
| Keras | mlflow.keras | Keras 3.0 with multi-backend support (TensorFlow, JAX, PyTorch) |
| PyTorch | mlflow.pytorch | PyTorch models with custom training loop tracking |
| Spark MLlib | mlflow.spark | Apache Spark ML pipeline models |
| XGBoost | mlflow.xgboost | Gradient boosting models |
| LightGBM | mlflow.lightgbm | Microsoft LightGBM models |
| CatBoost | mlflow.catboost | Yandex CatBoost models |
| ONNX | mlflow.onnx | Open Neural Network Exchange format for cross-platform deployment |
| Transformers | mlflow.transformers | Hugging Face Transformers models for NLP and LLMs |
| Sentence Transformers | mlflow.sentence_transformers | Embedding and similarity models |
| spaCy | mlflow.spacy | NLP pipeline models |
| Statsmodels | mlflow.statsmodels | Statistical models |
| Prophet | mlflow.prophet | Facebook Prophet time series forecasting |
| Pmdarima | mlflow.pmdarima | Auto-ARIMA time series models |
| H2O | mlflow.h2o | H2O.ai models |
| John Snow Labs | mlflow.johnsnowlabs | Healthcare and biomedical NLP |
Beyond built-in flavors, the community has developed additional flavors for frameworks such as sktime, orbit, and other specialized libraries through the MLflavors package.
Models can be deployed with `mlflow models serve` (which creates a local REST API endpoint), packaged into Docker containers with `mlflow models build-docker`, or deployed to cloud platforms such as Amazon SageMaker, Azure ML, and Databricks Model Serving.
The Model Registry is a centralized store for managing the full lifecycle of MLflow Models. It provides model versioning, stage transitions, and annotations. Teams can use the registry to:

- Register models under a shared name and track every version over time
- Transition versions between lifecycle stages (or assign aliases, which recent releases favor over stages)
- Annotate models and versions with descriptions and tags for discovery and review
The registry exposes both a UI and a set of APIs for programmatic access. On Databricks, the Model Registry integrates with Unity Catalog for access control and governance.
MLflow Projects is a format for packaging reusable and reproducible data science code. A project is simply a directory or Git repository containing code, along with an MLproject file that specifies dependencies and entry points. The MLproject file can reference a Conda environment, a Docker container, or a system environment to define the execution context.
Projects allow users to run the same code on different platforms (local machine, cloud, or Kubernetes) with consistent behavior. They also support parameterized execution, so users can pass different hyperparameters or data paths at runtime. For example:
```shell
mlflow run git@github.com:mlflow/mlflow-example.git -P alpha=0.5
```
This command fetches the project from GitHub and executes it with the specified parameter.
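The command assumes the repository contains an MLproject file declaring the `alpha` parameter. A minimal sketch of such a file (the project name, environment file, and script are hypothetical):

```yaml
name: example-project
conda_env: conda.yaml   # or docker_env / python_env

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.1}
    command: "python train.py --alpha {alpha}"
```

Parameters passed on the command line with `-P` override the defaults declared under `entry_points`.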
MLflow Recipes, previously known as MLflow Pipelines, was a framework that provided predefined templates for common ML tasks such as regression and classification. Recipes automated many steps of the ML workflow, including data ingestion, feature engineering, model training, and evaluation. The framework included an intelligent execution engine that cached intermediate results and re-ran only the steps affected by code changes.
MLflow Recipes was deprecated in MLflow 2.x and removed entirely in MLflow 3.0. Users who relied on Recipes are encouraged to use standard MLflow Tracking and Model Registry functionality directly, or to adopt MLflow Projects for reproducible workflows.
The MLflow 2.x release series (2022 to 2025) expanded the platform beyond traditional ML to support large language models, generative AI applications, and AI agents. This evolution reflects the broader industry shift toward LLM-powered applications.
MLflow 2.x added native support for logging and evaluating LLM outputs. The mlflow.evaluate() API allows users to run evaluation suites against model outputs, using built-in or custom metrics. Evaluation metrics for LLMs include answer relevance, faithfulness, toxicity, and other quality dimensions. Users can evaluate both live model endpoints and pre-computed output datasets.
The evaluation framework supports two categories of metrics:
| Metric type | How it works | Examples |
|---|---|---|
| Heuristic-based | Deterministic scoring functions | ROUGE, BLEU, Flesch-Kincaid readability, latency |
| LLM-as-a-Judge | Uses a language model to assess quality | Faithfulness, answer correctness, toxicity, custom criteria |
LLM-as-a-Judge metrics, introduced in MLflow 2.8, use language models to score output quality. They address the limitations of heuristic metrics for nuanced language tasks and can reduce evaluation time from weeks (with human evaluators) to under an hour while maintaining useful quality approximations. MLflow supports multiple LLM providers as judges, including OpenAI, Anthropic, Amazon Bedrock, and Mistral AI.
The MLflow AI Gateway (introduced experimentally in MLflow 2.7) is a centralized proxy that sits between applications and LLM providers. It provides:

- A single, unified API for querying models from multiple LLM providers
- Centralized management of provider API keys, so credentials are not embedded in application code
- Rate limiting and access controls for LLM endpoints
- The ability to swap the backing provider or model without changing application code
The AI Gateway integrates natively with MLflow Tracing, so every request routed through the gateway automatically becomes a trace. This provides a complete audit trail of LLM interactions across the organization without requiring additional instrumentation in application code.
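In the 2.x releases, a gateway instance is typically started from a YAML configuration file (for example, `mlflow gateway start --config-path config.yaml`). The following sketch uses the endpoints schema with hypothetical names and assumes an OpenAI API key is available in the environment:

```yaml
endpoints:
  - name: chat            # hypothetical endpoint name
    endpoint_type: llm/v1/chat
    model:
      provider: openai
      name: gpt-4o-mini   # hypothetical model choice
      config:
        openai_api_key: $OPENAI_API_KEY
```

Applications then call the gateway's `chat` endpoint; switching providers is a config change, not a code change.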
MLflow Tracing captures the complete execution flow of LLM applications and AI agents. Built on OpenTelemetry, it records inputs, outputs, and metadata for each step of a request, including LLM calls, retrieval operations, tool invocations, and error details.
Key tracing capabilities include:

- Automatic instrumentation for supported libraries and frameworks, enabled with one-line autologging calls
- Manual instrumentation of custom application code through decorators and span APIs
- A trace viewer in the Tracking UI for inspecting the inputs, outputs, latency, and errors of each step
Since MLflow Tracing is built on OpenTelemetry, it is compatible with any language or framework that supports the OTLP standard, including Java, Go, and Rust. The MLflow tracking server exposes an OTLP endpoint at /v1/traces for direct ingestion. MLflow 3.6.0 added formal support for ingesting OpenTelemetry traces directly through this endpoint, enabling teams to combine MLflow SDK instrumentation with OpenTelemetry auto-instrumentation from third-party libraries.
Tracing supports automatic instrumentation for over 20 frameworks and libraries, including LangChain, LlamaIndex, OpenAI, Anthropic, Amazon Bedrock, Google ADK, PydanticAI, and smolagents.
MLflow 3.0 (released June 9, 2025) introduced architectural changes to support generative AI workloads as first-class citizens alongside traditional ML. The release was built around three major pillars: observability, systematic quality evaluation, and application lifecycle management.
MLflow 3 introduced the LoggedModel as a new first-class entity, moving beyond the run-centric model that characterized earlier versions. A LoggedModel tracks the complete identity of a model or agent, including its lineage, evaluation results, and deployment status. This allows users to compare model variants and GenAI agents within and across experiments more effectively.
The evaluation framework in MLflow 3 supports customizable scorers that can assess multiple quality dimensions simultaneously. Users can define custom evaluation judges or use pre-built judges for tasks like relevance scoring, hallucination detection, and safety assessment. The framework works for both GenAI applications and traditional ML models through a consistent API.
MLflow 3 treats GenAI applications as versioned artifacts. A complete application, including model weights, prompts, retrieval logic, and dependencies, can be packaged and versioned as a single unit. This enables atomic deployments and rollbacks, bringing the same rigor to GenAI application management that the Model Registry brought to traditional ML models.
The Prompt Registry, introduced as a standalone component, enables versioning, tracking, and reuse of prompts across an organization. Each prompt can be versioned independently, tagged with metadata, and referenced by downstream applications.
MLflow 3 removed several deprecated components to simplify the framework:

- MLflow Recipes (formerly MLflow Pipelines)
- The fastai model flavor
- Other long-deprecated flavors and legacy APIs
These removals were part of an effort to focus on core functionality and the GenAI capabilities that are central to the 3.x series.
Databricks offers a fully managed version of MLflow as part of the Databricks Data Intelligence Platform. Managed MLflow extends the open-source version with enterprise features designed for production workloads at scale.
| Feature | Open-source MLflow | Managed MLflow (Databricks) |
|---|---|---|
| Experiment tracking | Yes | Yes, with managed storage and automatic scaling |
| Model Registry | Yes | Integrated with Unity Catalog |
| AI Gateway | Yes | Managed endpoints with enterprise governance |
| Tracing | Yes | Production-scale with managed infrastructure |
| Access control | Manual configuration | Unity Catalog role-based access control |
| Data lineage | Basic run-level | End-to-end with lakehouse integration |
| Hosting | Self-managed | Fully managed by Databricks |
| Multi-cloud support | Manual deployment | AWS, Azure, GCP via Databricks |
| Model serving | CLI/Docker-based | One-click REST API deployment with auto-scaling |
| Feature store | Not included | Integrated feature store with automated lookups |
Unity Catalog integration is a distinguishing feature of managed MLflow. It allows organizations to enforce access controls, track lineage across models and data, and maintain compliance policies from a central governance layer. Models registered in managed MLflow can be discovered and shared across teams using the Unity Catalog interface.
Managed MLflow is available on all three major cloud providers through the Databricks platform: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. Azure Databricks includes native MLflow integration documented through the Microsoft Learn platform.
MLflow competes with several other platforms for experiment tracking and ML lifecycle management. The following table compares MLflow with three widely used alternatives.
| Feature | MLflow | Weights & Biases (W&B) | Neptune | ClearML |
|---|---|---|---|---|
| License | Apache 2.0 (open source) | Proprietary (free tier available) | Proprietary (free tier available) | SSPL (open source) + managed offering |
| Pricing | Free (self-hosted); paid via Databricks | Free for individuals; team plans ~$50/user/month | Usage-based; team plans from ~$49/month | Free (self-hosted); managed plans negotiable |
| Experiment tracking | Yes | Yes (advanced interactive dashboards) | Yes (high-scale querying and comparison) | Yes |
| Model registry | Yes | Yes (Artifacts + Registry) | Yes | Yes |
| Hyperparameter tuning | Via integrations (Optuna, Ray Tune) | Sweeps (built-in) | Via integrations | HyperParameter Optimizer (built-in) |
| LLM/GenAI support | AI Gateway, Tracing, LLM evaluation | Weave (tracing and evaluation) | Limited | Limited |
| Deployment tools | Built-in serving, Docker, cloud | No built-in deployment | No built-in deployment | Built-in serving and orchestration |
| Visualization UI | Functional, improving | Best-in-class interactive dashboards | Advanced querying and filtering | Comprehensive dashboard |
| Collaboration | Basic (shared tracking server) | Strong (teams, reports, annotations) | Strong (workspaces, sharing) | Moderate |
| Self-hosting | Full support | Enterprise plan only | No | Full support |
| Framework integrations | 19+ built-in model flavors | Broad framework support | Broad framework support | Broad framework support |
| Community size | ~24,800 GitHub stars | ~20,000 GitHub stars | Smaller community | ~6,000 GitHub stars |
MLflow's primary advantage over proprietary alternatives is its open-source nature and the absence of licensing costs for self-hosted deployments. It is also the only platform in this comparison with a built-in AI Gateway and native LLM tracing based on OpenTelemetry. Weights & Biases is often preferred for its visualization capabilities and collaboration features, while Neptune is known for its ability to handle high-volume experiment metadata efficiently. ClearML offers a modular, all-in-one approach with built-in pipeline orchestration but has a steeper initial setup process compared to MLflow.
MLflow integrates with a broad range of ML and AI frameworks. Beyond the built-in model flavors listed above, MLflow supports automatic logging (autologging) for several popular libraries. When autologging is enabled, MLflow automatically captures parameters, metrics, and model artifacts without requiring manual instrumentation code.
| Framework | Autologging support | What gets captured |
|---|---|---|
| Scikit-learn | Yes | Parameters, metrics, and model for classifiers and regressors |
| TensorFlow/Keras | Yes | Training parameters, epoch metrics, and model checkpoints |
| PyTorch Lightning | Yes | Lightning-specific parameters, metrics, and checkpoints |
| XGBoost | Yes | Booster parameters, evaluation metrics, and feature importance |
| LightGBM | Yes | Training parameters and evaluation metrics |
| Spark MLlib | Yes | Pipeline parameters and model artifacts |
| Statsmodels | Yes | Model summary statistics |
| Hugging Face Transformers | Yes | Training arguments, metrics, and model artifacts |
| OpenAI | Yes | API calls, token usage, prompts, and completions |
| LangChain | Yes | Chain traces, model signatures, and input/output examples |
MLflow also integrates with orchestration and deployment tools, including Kubernetes, Docker, Amazon SageMaker, Azure ML, and Databricks Model Serving. The ONNX flavor allows models trained in one framework to be exported and deployed in another, supporting cross-platform inference scenarios.
The MLflow Tracking Server consists of two storage components:

- A backend store, which records experiment and run metadata (parameters, metrics, and tags) in either a file store or a SQL database such as SQLite, PostgreSQL, or MySQL
- An artifact store, which holds larger output files such as models, images, and serialized pipelines, typically on local disk or in object storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage
MLflow supports several deployment configurations:
| Topology | Description | Suitable for |
|---|---|---|
| Local | Tracking server and storage on a single machine | Individual development and prototyping |
| Remote tracking server | Centralized server with database backend and cloud artifact store | Team collaboration |
| Databricks managed | Fully hosted on the Databricks platform | Enterprise production workloads |
The tracking server exposes REST APIs that clients use to log and query experiment data, and multiple team members can connect to a shared tracking server to collaborate on experiments. In the 3.x series, the server also exposes an OTLP endpoint for ingesting OpenTelemetry traces from applications written in any language.
MLflow has experienced steady growth since its initial release in 2018. Key adoption milestones include:
| Year | Milestone |
|---|---|
| 2018 | MLflow released as open-source alpha at Spark + AI Summit |
| 2019 | MLflow 1.0 released; surpassed 1 million total downloads |
| 2020 | MLflow joins the Linux Foundation |
| 2021 | Surpassed 10 million monthly downloads |
| 2022 | MLflow 2.0 released; surpassed 100 million total downloads |
| 2024 | Surpassed 200 million total downloads |
| 2025 | MLflow 3.0 released; reached 20,000 GitHub stars |
As of early 2026, the MLflow GitHub repository reports over 24,000 stars, more than 5,500 forks, and contributions from over 900 developers. The project receives more than 60 million downloads per month from PyPI.
MLflow is used by thousands of organizations across industries including technology, finance, healthcare, and retail. Major cloud providers have built integrations with MLflow: Amazon SageMaker supports MLflow tracking, Microsoft Azure Machine Learning has native MLflow integration, and Google Cloud Vertex AI provides MLflow compatibility.
The project maintains active communication channels including a GitHub Discussions forum, a Slack workspace, and regular community meetups. Contributions are accepted through the standard GitHub pull request process, and the project follows a regular release cadence.