Kubeflow is an open-source machine learning platform designed to run on Kubernetes. Originally created by Google engineers and announced at KubeCon + CloudNativeCon North America in December 2017, the project aims to make deploying, scaling, and managing ML workflows on Kubernetes simple, portable, and composable. Kubeflow is licensed under the Apache 2.0 license and is currently a Cloud Native Computing Foundation (CNCF) incubating project.
The name "Kubeflow" combines "Kube" (from Kubernetes) and "flow" (from TensorFlow), reflecting its origins as a way for Google to open-source how it ran TensorFlow jobs internally. Over the years, the platform has expanded well beyond TensorFlow to support a broad range of ML frameworks, including PyTorch, XGBoost, JAX, MXNet, and others.
Kubeflow was first announced at KubeCon + CloudNativeCon North America 2017 by Google engineers David Aronchick, Jeremy Lewi, and Vishnu Kannan. The project was created to address a perceived lack of flexible, open-source options for building production-ready machine learning systems on Kubernetes. At the time, organizations frequently struggled with the gap between developing ML models in notebooks and deploying them in production environments. Kubeflow was designed to bridge that gap by providing a consistent, Kubernetes-native platform for the entire ML lifecycle.
The first official release, Kubeflow 0.1, was announced at KubeCon + CloudNativeCon Europe 2018. This release established the foundational architecture and included early versions of several core components: TFJob for distributed TensorFlow training, JupyterHub integration for interactive notebooks, and basic support for model serving. The release attracted significant community interest, and contributions began flowing in from organizations beyond Google, including Cisco, Red Hat, IBM, and others.
Kubeflow 1.0 was released on March 2, 2020, marking a major milestone. This release signaled that many Kubeflow components had graduated to "stable status" and were considered ready for production use. Key stable components in version 1.0 included Kubeflow Pipelines for orchestrating ML workflows, the Jupyter Notebook server for interactive development, the training operators (TFJob and PyTorchJob), and Katib for hyperparameter tuning.
In October 2022, Google announced that the Kubeflow project had applied to join the Cloud Native Computing Foundation. On July 25, 2023, the CNCF Technical Oversight Committee voted to accept Kubeflow as an incubating-stage project. This was a significant endorsement of the project's maturity, governance, and community health. At the time of acceptance, the project had over 150 contributing companies and ten commercial distributions based on Kubeflow.
| Version | Release Date | Key Highlights |
|---|---|---|
| 1.0 | March 2020 | First stable release; Pipelines, Notebooks, TFJob, Katib stable |
| 1.3 | September 2021 | Unified Training Operator merging TFJob, PyTorchJob, MPIJob |
| 1.7 | March 2023 | KServe improvements, Pipelines v2 backend |
| 1.8 | October 2023 | Enhanced multi-tenancy, Pipelines v2 SDK |
| 1.9 | July 2024 | Simplified LLM workflows, security improvements |
| 1.10 | March 2025 | Trainer v2, new Katib API, Spark Operator integration, Model Registry UI |
Kubeflow is not a single monolithic application. Instead, it is a collection of loosely coupled components, each addressing a different stage of the machine learning lifecycle. These components are deployed as Kubernetes-native resources (Custom Resource Definitions, Pods, Services, and so on) and are managed through the Kubeflow Central Dashboard.
The platform follows a microservices architecture where each component can be installed independently or as part of the full Kubeflow platform. All components share a common authentication layer (typically Istio and Dex for identity management) and a multi-tenancy system based on Kubernetes namespaces.
At a high level, the Kubeflow architecture covers these stages of the ML lifecycle:

- Interactive development and experimentation (Kubeflow Notebooks)
- Workflow orchestration (Kubeflow Pipelines)
- Distributed model training (Training Operator)
- Hyperparameter tuning and AutoML (Katib)
- Model serving and inference (KServe)
- Model cataloging and versioning (Model Registry)
Kubeflow Pipelines (KFP) is one of the most widely used components of the platform. It provides a system for building, deploying, and managing multi-step ML workflows as reusable, portable pipelines. Each step in a pipeline runs in its own container, which ensures reproducibility and isolation.
Kubeflow Pipelines consists of several sub-components:

- A web UI for submitting pipeline runs, visualizing pipeline DAGs, and inspecting artifacts
- An API server and orchestration backend that schedules pipeline steps, historically by compiling them to Argo Workflows
- A metadata store (ML Metadata, or MLMD) that records executions, artifacts, and lineage
- A Python SDK that provides decorators (`@component` and `@pipeline`) to author pipelines in pure Python

The v2 release of Kubeflow Pipelines introduced several major improvements. The updated intermediate representation makes pipelines executable by backends other than Argo. Input and output artifacts (datasets, models, metrics) are treated as first-class nodes in the DAG visualization. The v2 SDK also allows compiling and running individual components independently, and supports nesting pipelines as components of larger pipelines. The v2 backend maintains backward compatibility with v1 APIs.
Kubeflow Notebooks provides a way to run web-based interactive development environments inside a Kubernetes cluster. Rather than developing on local workstations, data scientists can spin up notebook servers directly within the cluster, gaining immediate access to cluster resources, GPUs, and shared storage.
Kubeflow Notebooks natively supports three types of development environments:

- JupyterLab
- RStudio
- Visual Studio Code (code-server)
Each notebook server runs as a container inside a Kubernetes Pod. Users can select custom container images with pre-installed libraries, request specific resource allocations (CPU, memory, GPUs), and attach persistent volumes for data storage.
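A notebook server is declared through the `Notebook` custom resource. The manifest below is a minimal sketch; the names, namespace, and image tag are illustrative assumptions, not values prescribed by the Kubeflow documentation:

```yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: my-notebook          # illustrative name
  namespace: ml-team-a       # the user's profile namespace
spec:
  template:
    spec:
      containers:
        - name: my-notebook
          # Custom images with pre-installed libraries can be used here;
          # this tag is an assumption for the example.
          image: kubeflownotebookswg/jupyter-scipy:v1.10.0
          resources:
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              nvidia.com/gpu: "1"   # optional GPU request
```

Applying this manifest causes the notebook controller to create the backing Pod, and the server then appears in the Central Dashboard for its namespace.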
In early versions of Kubeflow (v0.4 and earlier), the notebook component relied on JupyterHub for managing notebook servers. Starting with Kubeflow v0.5, the project replaced JupyterHub with a custom Jupyter web application that provides tighter integration with Kubeflow's authentication, multi-tenancy, and namespace management systems.
KServe is a Kubernetes-native inference platform for deploying and serving machine learning models. Originally developed under the name KFServing, the project was renamed to KServe to reflect its broader scope and independence as a standalone component that can be used with or without the rest of Kubeflow.
KServe provides out-of-the-box support for serving models trained with a wide variety of frameworks:
| Framework | Runtime | Protocol |
|---|---|---|
| TensorFlow | TensorFlow Serving | REST/gRPC |
| PyTorch | TorchServe | REST/gRPC |
| scikit-learn | MLServer | REST/gRPC |
| XGBoost | MLServer | REST/gRPC |
| ONNX | Triton Inference Server | REST/gRPC |
| LightGBM | MLServer | REST/gRPC |
| Custom Models | Any container | REST/gRPC |
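In practice, each row of the table above maps to a short manifest. The sketch below deploys a scikit-learn model, following the shape of KServe's quickstart examples; the resource name and storage URI are illustrative:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn        # selects the MLServer runtime from the table
      # Location of the trained model artifact; path is an example value.
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

KServe pulls the model from the storage URI, selects a matching serving runtime, and exposes REST and gRPC inference endpoints.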
Each KServe deployment, called an InferenceService, can consist of up to three components:

- **Predictor** — the required component that hosts the model and exposes the inference endpoint
- **Transformer** — an optional component that applies pre- and post-processing around prediction requests
- **Explainer** — an optional component that produces model explanations alongside predictions

KServe also supports canary rollouts: an InferenceService can set `canaryTrafficPercent` to split traffic between a new (canary) version and the stable version.

The Kubeflow Training Operator provides Kubernetes custom resources for running distributed training jobs. Originally, Kubeflow maintained separate operators for each framework (TFJob, PyTorchJob, MPIJob, and XGBoostJob). In version 1.3, these were merged into a single unified Training Operator to simplify maintenance and provide a consistent user experience.
| Job Type | Framework | Use Case |
|---|---|---|
| TFJob | TensorFlow | Distributed TensorFlow training with parameter servers or workers |
| PyTorchJob | PyTorch | Distributed PyTorch training using torch.distributed |
| MPIJob | MPI / Horovod | High-performance computing workloads using Message Passing Interface |
| XGBoostJob | XGBoost | Distributed gradient boosting training |
| PaddleJob | PaddlePaddle | Distributed PaddlePaddle training |
The Training Operator handles the complexity of coordinating workers, parameter servers, and chief nodes for distributed training. It manages Pod creation, networking between workers, and fault tolerance. If a worker fails, the operator can restart it without restarting the entire job.
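The worker topology described above is declared in the job spec. The following PyTorchJob is a minimal sketch of a one-master, two-worker layout; the job name and container image are illustrative assumptions:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist        # illustrative name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure   # lets the operator restart a failed pod
      template:
        spec:
          containers:
            - name: pytorch      # PyTorchJob expects this container name
              image: my-registry/train:latest   # illustrative image
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/train:latest
```

The operator injects the environment variables that `torch.distributed` needs (rank, world size, master address), so the training script itself stays unchanged between local and distributed runs.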
Introduced in Kubeflow 1.10, Trainer v2 represents a significant redesign of the training component. It simplifies the process of fine-tuning large language models on Kubernetes by providing high-level abstractions. The first release supports TorchTune as a built-in LLM trainer with pre-configured runtimes for models like Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct. Trainer v2 also provides dedicated initializers for datasets and models, reducing the amount of boilerplate configuration needed.
Katib is Kubeflow's Kubernetes-native component for automated machine learning (AutoML). It supports both hyperparameter tuning and neural architecture search (NAS), allowing users to define search spaces and optimization objectives while Katib handles the rest.
Katib supports a rich set of optimization algorithms, many of which are provided through integrations with frameworks like Hyperopt and Optuna:
| Algorithm | Type | Description |
|---|---|---|
| Random Search | Basic | Samples hyperparameters randomly from the search space |
| Grid Search | Basic | Exhaustively evaluates all combinations in a discretized search space |
| Bayesian Optimization | Advanced | Uses Gaussian processes to model the objective function |
| Tree of Parzen Estimators (TPE) | Advanced | Models good and bad hyperparameter distributions separately |
| Multivariate TPE | Advanced | Extension of TPE that considers parameter dependencies |
| Hyperband | Advanced | Early-stopping-based method that allocates resources efficiently |
| CMA-ES | Advanced | Covariance Matrix Adaptation Evolution Strategy for continuous optimization |
| ENAS | NAS | Efficient Neural Architecture Search using parameter sharing |
| DARTS | NAS | Differentiable Architecture Search using continuous relaxation |
| PBT | Advanced | Population Based Training that jointly optimizes hyperparameters and training |
Katib integrates directly with the Training Operator, allowing it to launch PyTorchJob, TFJob, or other training jobs as trial runs during hyperparameter optimization. This integration means that hyperparameter tuning can be performed on distributed training jobs, not just single-node experiments.
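A search space and objective are declared through Katib's `Experiment` custom resource. The sketch below uses random search over two hyperparameters; the experiment name, namespace, metric name, and bounds are illustrative assumptions:

```yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: random-search-demo   # illustrative name
  namespace: ml-team-a
spec:
  objective:
    type: maximize
    goal: 0.95
    objectiveMetricName: accuracy   # metric reported by the training code
  algorithm:
    algorithmName: random           # any algorithm from the table above
  parallelTrialCount: 3
  maxTrialCount: 12
  maxFailedTrialCount: 3
  parameters:
    - name: lr
      parameterType: double
      feasibleSpace:
        min: "0.01"
        max: "0.1"
    - name: num-layers
      parameterType: int
      feasibleSpace:
        min: "2"
        max: "5"
  # trialTemplate (omitted here) defines the job each trial runs --
  # e.g. a Kubernetes Job or a PyTorchJob whose arguments reference
  # the sampled trial parameters.
```

Katib launches up to `parallelTrialCount` trials at a time, stops after `maxTrialCount` trials (or when the goal is reached), and records each trial's metric against its sampled parameters.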
In Kubeflow 1.10, Katib introduced a new high-level Python API that integrates Katib and the Training Operator to automate hyperparameter optimization for LLM fine-tuning workflows, reducing the amount of manual configuration required.
The Kubeflow Central Dashboard serves as the unified web interface for the entire platform. It provides a single entry point for accessing all Kubeflow components, including Notebooks, Pipelines, training jobs, Katib experiments, and model serving endpoints.
The Dashboard also manages Kubeflow's multi-tenancy system. Each user has a profile, which maps to a Kubernetes namespace. Users can only see and interact with resources in namespaces they have access to. Profile owners can add contributors with either view (read-only) or modify (read and write) access.
The Model Registry is a newer Kubeflow component that provides a centralized catalog for ML models, versions, and their associated metadata. It allows teams to track which models have been trained, their lineage, performance metrics, and serving status. In Kubeflow 1.10, the Model Registry received a new web UI and deeper integration with KServe, enabling a smoother path from model registration to deployment.
The Spark Operator, integrated as a core Kubeflow component in version 1.10, enables running Apache Spark jobs on Kubernetes. It was rebuilt with Controller Runtime for improved architecture and includes YuniKorn gang scheduling support for efficient group scheduling of driver and executor pods. This component is particularly useful for data preprocessing and feature engineering stages of ML workflows.
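Spark jobs are submitted through the operator's `SparkApplication` custom resource. The manifest below is a minimal sketch of a Python preprocessing job; the application name, image tag, file path, and service account are illustrative assumptions:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: feature-prep         # illustrative name
  namespace: ml-team-a
spec:
  type: Python
  mode: cluster
  image: spark:3.5.1                              # illustrative image
  mainApplicationFile: local:///opt/app/prep.py   # illustrative path
  sparkVersion: "3.5.1"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark-operator-spark          # illustrative SA name
  executor:
    instances: 2
    cores: 1
    memory: 1g
```

The operator translates this spec into a driver pod and the requested executor pods, which is where the YuniKorn gang-scheduling integration ensures the group is scheduled together.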
Kubeflow is designed to be cloud-agnostic and can be deployed on any conformant Kubernetes cluster. However, the major cloud providers offer specific distributions and documentation for running Kubeflow on their managed Kubernetes services.
Google Cloud provides first-party support for Kubeflow on GKE. The deployment leverages Google Cloud's IAP (Identity-Aware Proxy) for authentication and integrates with Google Cloud Storage for artifact storage. Google also offers Vertex AI Pipelines, which is a managed service based on Kubeflow Pipelines, providing a fully managed alternative for organizations that prefer not to manage the infrastructure themselves.
Kubeflow on AWS is an open-source distribution maintained by AWS Labs. It integrates Kubeflow with AWS-native services such as Amazon S3 for artifact storage, Amazon RDS for metadata storage, and Amazon Cognito for authentication. Canonical also offers Charmed Kubeflow, which can be deployed on EKS.
Kubeflow on AKS provides a straightforward deployment path for organizations running ML workloads on Microsoft's cloud. The deployment integrates with Azure Active Directory for identity management and Azure Blob Storage for artifacts.
Kubeflow can also be deployed on bare-metal Kubernetes clusters, OpenShift, Rancher, and local development environments using Kind or Minikube. The Kubeflow manifests repository provides community-maintained configurations for a variety of Kubernetes distributions.
Kubeflow provides a robust multi-tenancy model built on top of Kubernetes namespaces. The Profile custom resource wraps a Kubernetes namespace and associates it with an owner. Access control is enforced through Kubernetes RBAC, and the Kubeflow Central Dashboard ensures that users can only see resources within namespaces they are authorized to access.
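Creating a tenant is a matter of creating a Profile. The sketch below is minimal; the profile name and owner email are illustrative assumptions:

```yaml
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: ml-team-a            # also becomes the namespace name
spec:
  owner:
    kind: User
    name: alice@example.com  # identity as reported by the auth layer
```

The profile controller creates the `ml-team-a` namespace, the RBAC bindings granting the owner access, and the service-mesh policies that keep other users' traffic out.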
Authentication is typically handled through Istio's ingress gateway combined with Dex, an OpenID Connect identity provider. This setup supports integration with external identity providers such as LDAP, SAML, and OAuth2 providers like Google, GitHub, and Microsoft.
Kubeflow Pipelines enforces namespace-level isolation, meaning that pipeline runs, experiments, and artifacts in one namespace are not visible to users in other namespaces unless access has been explicitly granted.
Kubeflow occupies a specific niche in the MLOps ecosystem. The following table compares it with other popular platforms:
| Feature | Kubeflow | MLflow | Amazon SageMaker | Vertex AI |
|---|---|---|---|---|
| Type | Open-source platform | Open-source toolkit | Managed cloud service | Managed cloud service |
| License | Apache 2.0 | Apache 2.0 | Proprietary | Proprietary |
| Infrastructure | Any Kubernetes cluster | Any (standalone) | AWS only | Google Cloud only |
| Pipeline Orchestration | Kubeflow Pipelines (Argo) | MLflow Recipes | SageMaker Pipelines | Vertex AI Pipelines (KFP-based) |
| Experiment Tracking | MLMD, Pipelines UI | MLflow Tracking (native) | SageMaker Experiments | Vertex ML Metadata |
| Model Serving | KServe | MLflow Models (basic) | SageMaker Endpoints | Vertex AI Prediction |
| Distributed Training | Training Operator (native) | Not built-in | SageMaker Training | Vertex AI Training |
| Hyperparameter Tuning | Katib | Not built-in | Automatic Model Tuning | Vertex AI Vizier |
| Notebook Support | JupyterLab, RStudio, VS Code | Not built-in | SageMaker Studio | Vertex AI Workbench |
| Multi-Tenancy | Native (Profiles/Namespaces) | Not built-in | IAM-based | IAM-based |
| Cloud Lock-in | None | None | AWS | Google Cloud |
| Setup Complexity | High | Low | Low (managed) | Low (managed) |
| Operational Overhead | High (self-managed) | Low to Medium | Low (managed) | Low (managed) |
| Best For | Kubernetes-native orgs, multi-cloud | Experiment tracking, small teams | AWS-centric enterprises | Google Cloud users |
Kubeflow vs. MLflow: MLflow excels at experiment tracking and model registry and is often the easiest tool to adopt for small teams. It does not provide infrastructure for distributed training, model serving, or pipeline orchestration at the same level as Kubeflow. Many organizations use MLflow for experiment tracking alongside Kubeflow for orchestration and serving.
Kubeflow vs. SageMaker: Amazon SageMaker is a fully managed service that eliminates the operational burden of managing infrastructure. It provides a tightly integrated experience within the AWS ecosystem. Kubeflow, by contrast, runs on any Kubernetes cluster and avoids vendor lock-in, but requires significant operational expertise to deploy and maintain.
Kubeflow vs. Vertex AI: Google's Vertex AI is, in many ways, the managed cloud evolution of Kubeflow. Vertex AI Pipelines is built on top of the Kubeflow Pipelines SDK, meaning pipeline code is largely portable between the two. Organizations that want the Kubeflow programming model without the operational overhead often choose Vertex AI if they are on Google Cloud.
Kubeflow is best suited for organizations that meet several of the following criteria:

- They already run workloads on Kubernetes and have in-house platform or DevOps engineering expertise
- They need to avoid cloud vendor lock-in or operate across multiple clouds and on-premises environments
- Their workloads call for distributed training, hyperparameter tuning, and pipeline orchestration at scale
- They require strong multi-tenancy, with namespace-level isolation between teams
Kubeflow may not be the right choice in these scenarios:

- Small teams without Kubernetes experience, for whom the setup and operational overhead outweigh the benefits
- Organizations committed to a single cloud provider that prefer a fully managed service such as SageMaker or Vertex AI
- Projects that primarily need experiment tracking or a model registry, where a lighter-weight tool such as MLflow suffices
Kubeflow has a large and active community. As of 2025, the project has accumulated over 14,000 GitHub stars on the main repository, with contributions from over 200 developers across more than 30 organizations. The broader Kubeflow organization on GitHub, which includes repositories for all sub-projects, has over 22,000 stars combined.
The community communicates through the CNCF Slack workspace, the kubeflow-discuss mailing list, and regular community meetings. Kubeflow Meetups have over 3,500 members globally, and over 5,000 students have enrolled in free Kubeflow training courses.
Notable adopters of Kubeflow include Google, Spotify, Bloomberg, US Department of Defense, and various financial institutions. The project maintains an ADOPTERS.md file where organizations can publicly declare their use of Kubeflow.
Several companies offer commercial distributions of and support for Kubeflow. Canonical's Charmed Kubeflow is one example, and at the time of the project's CNCF acceptance there were ten commercial distributions based on Kubeflow.
The simplest way to try Kubeflow locally is to deploy it on a local Kubernetes cluster using Kind (Kubernetes in Docker) or Minikube. The official Kubeflow manifests repository on GitHub provides configuration files for these environments. For production deployments, the recommended approach is to use one of the cloud-specific distributions or the community manifests tailored to the target Kubernetes platform.
The minimum Kubernetes version required varies by Kubeflow release, but version 1.10 generally requires Kubernetes 1.27 or later. Kubeflow requires a cluster with at least 16 GB of RAM and 4 CPUs for the platform components, with additional resources needed for actual ML workloads.