Kubeflow
Last reviewed
Sources
15 citations
Review status
Source-backed
Revision
v6 · 4,123 words
Kubeflow is an open-source MLOps platform that runs the entire machine learning lifecycle on Kubernetes, described by its creators as a project "dedicated to making using ML stacks on Kubernetes easy, fast and extensible." [1] Google open-sourced Kubeflow on December 21, 2017, originally as a way to simplify running TensorFlow jobs on Kubernetes, and the project was accepted into the Cloud Native Computing Foundation (CNCF) as an incubating-stage project on July 25, 2023. [1][3] It is licensed under Apache 2.0 and bundles a suite of components, including Kubeflow Pipelines, Notebooks, the Training Operator, Katib (hyperparameter tuning), and KServe (model serving), under a single Central Dashboard.
The name "Kubeflow" combines "Kube" (from Kubernetes) and "flow" (from TensorFlow), reflecting its origins as a way for Google to open-source how it ran TensorFlow jobs internally. Over the years, the platform has expanded well beyond TensorFlow to support a broad range of ML frameworks, including PyTorch, XGBoost, JAX, MXNet, and others. As of its CNCF acceptance, Kubeflow had more than 28,000 GitHub stars, over 15,000 committers, more than 150 contributing companies, and 10 commercial distributions. [3]
Kubeflow was created to close the gap between developing ML models on a laptop and deploying them reliably in production. The original 2017 announcement framed the core problem this way: "Building any production-ready machine learning system involves various components, often mixing vendors and hand-rolled solutions. Connecting and managing these services for even moderately sophisticated setups introduces huge barriers of complexity in adopting machine learning." [1]
A second motivation was portability. Before Kubeflow, ML stacks were often "so tied to the clusters they have been deployed to that these stacks are immobile," making it "effectively impossible" to move a model from a laptop to a scalable cloud cluster without significant re-architecture. [1] By packaging the ML toolchain as Kubernetes-native resources, Kubeflow aimed to let the same workflow run anywhere Kubernetes runs. As the announcement put it: "Because this solution relies on Kubernetes, it runs wherever Kubernetes runs. Just spin up a cluster and go!" [1]
Kubeflow was first presented in December 2017 at KubeCon + CloudNativeCon North America by Google engineers David Aronchick, Jeremy Lewi, and Vishnu Kannan, and the public announcement followed on December 21, 2017. [1] The project drew on Google's internal experience running TensorFlow at scale and was created to address a perceived lack of flexible, open-source options for building production-ready machine learning systems on Kubernetes. At the time, organizations frequently struggled with the gap between developing ML models in notebooks and deploying them in production environments. Kubeflow was designed to bridge that gap by providing a consistent, Kubernetes-native platform for the entire ML lifecycle. The initial repository shipped with JupyterHub for managing interactive notebooks, a TensorFlow Custom Resource Definition (CRD) that could target CPUs or GPUs, and a TensorFlow Serving container for model serving. [1]
The first official release, Kubeflow 0.1, was announced at KubeCon + CloudNativeCon Europe 2018. [2] This release established the foundational architecture and included early versions of several core components: TFJob for distributed TensorFlow training, JupyterHub integration for interactive notebooks, and basic support for model serving. The release attracted significant community interest, and contributions began flowing in from organizations beyond Google, including Cisco, Red Hat, IBM, and others.
Kubeflow 1.0 was released on March 2, 2020, marking a major milestone. [4] This release signaled that a core set of Kubeflow applications had graduated to "stable" status and were considered ready for production use. The stable components in version 1.0 included the Central Dashboard, Kubeflow Notebooks, the Profile Controller for multi-tenancy, TensorFlow Serving, and the training operators (TFJob and PyTorchJob). Kubeflow Pipelines and Katib (hyperparameter tuning) shipped in 1.0 as beta components and were matured toward stable status in later releases. [4]
In October 2022, Google announced that the Kubeflow project had applied to join the Cloud Native Computing Foundation. [5] On July 25, 2023, the CNCF Technical Oversight Committee voted to accept Kubeflow as an incubating-stage project. [3] This was a significant endorsement of the project's maturity, governance, and community health. At the time of acceptance, the project reported over 150 contributing companies, 10 commercial distributions based on Kubeflow, more than 28,000 GitHub stars, over 15,000 total committers, and 15 major releases since 2017. [3] As of 2025, Kubeflow remains at the CNCF incubating maturity level and the community is actively working toward graduation; it has not yet been promoted to the graduated tier. [6][7]
Ricardo Rocha, the project's CNCF Technical Oversight Committee sponsor, described Kubeflow's role at acceptance: "Kubeflow helps fill a gap by delivering machine learning pipelines and MLOps while working closely with its extensive community and other tools and initiatives." [3]
| Version | Release Date | Key Highlights |
|---|---|---|
| 1.0 | March 2020 | First stable release; Notebooks, Central Dashboard, TFJob/PyTorchJob stable; Pipelines and Katib beta |
| 1.3 | September 2021 | Unified Training Operator merging TFJob, PyTorchJob, MPIJob |
| 1.7 | March 2023 | KServe improvements, Pipelines v2 backend |
| 1.8 | October 2023 | Enhanced multi-tenancy, Pipelines v2 SDK |
| 1.9 | July 2024 | Simplified LLM workflows, security improvements |
| 1.10 | March 2025 | Trainer v2, new Katib API, Spark Operator integration, Model Registry UI |
Kubeflow is not a single monolithic application. Instead, it is a collection of loosely coupled components, each addressing a different stage of the machine learning lifecycle. These components are deployed as Kubernetes-native resources (Custom Resource Definitions, Pods, Services, and so on) and are managed through the Kubeflow Central Dashboard.
The platform follows a microservices architecture where each component can be installed independently or as part of the full Kubeflow platform. All components share a common authentication layer (typically Istio and Dex for identity management) and a multi-tenancy system based on Kubernetes namespaces.
At a high level, the Kubeflow architecture covers these stages of the ML lifecycle:
Kubeflow Pipelines (KFP) is one of the most widely used components of the platform. It provides a system for building, deploying, and managing multi-step ML workflows as reusable, portable pipelines. Each step in a pipeline runs in its own container, which ensures reproducibility and isolation.
Kubeflow Pipelines consists of several sub-components:
The KFP SDK provides Python decorators (@component and @pipeline) to author pipelines in pure Python.
The v2 release of Kubeflow Pipelines introduced several major improvements. [11] The updated intermediate representation makes pipelines executable by backends other than Argo. Input and output artifacts (datasets, models, metrics) are treated as first-class nodes in the DAG visualization. The v2 SDK also allows compiling and running individual components independently, and supports nesting pipelines as components of larger pipelines. The v2 backend maintains backward compatibility with v1 APIs.
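The idea of steps and artifacts sharing one DAG can be illustrated with a toy model. The sketch below is not the KFP SDK and launches no containers; the step and artifact names are hypothetical, and it only shows how an execution order falls out of the declared dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a node, each value is the set of nodes
# it depends on. Steps and artifacts are both nodes, as in the v2 DAG view.
STEPS = {"preprocess", "train", "evaluate"}
dag = {
    "preprocess": set(),
    "dataset": {"preprocess"},         # artifact produced by preprocess
    "train": {"dataset"},
    "model": {"train"},                # artifact produced by train
    "evaluate": {"model", "dataset"},
    "metrics": {"evaluate"},           # artifact produced by evaluate
}

# A valid execution order never visits a node before its dependencies.
order = list(TopologicalSorter(dag).static_order())
steps_in_order = [node for node in order if node in STEPS]
artifacts = [node for node in order if node not in STEPS]

print("execution order:", " -> ".join(order))
print("container steps:", steps_in_order)
print("artifacts:", artifacts)
```

In a real pipeline each entry in the steps list would run as its own container, and the artifacts would be stored and passed between steps by the pipeline backend.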
Kubeflow Notebooks provides a way to run web-based interactive development environments inside a Kubernetes cluster. Rather than developing on local workstations, data scientists can spin up notebook servers directly within the cluster, gaining immediate access to cluster resources, GPUs, and shared storage.
Kubeflow Notebooks natively supports three types of development environments: JupyterLab, RStudio, and Visual Studio Code.
Each notebook server runs as a container inside a Kubernetes Pod. Users can select custom container images with pre-installed libraries, request specific resource allocations (CPU, memory, GPUs), and attach persistent volumes for data storage.
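As a rough illustration of those choices, the sketch below builds a Notebook custom resource as a plain Python dict and prints it as JSON. The image, namespace, and claim names are placeholders invented for the example, and the field layout should be checked against the Notebook CRD installed in the target cluster.

```python
import json


def notebook_manifest(name, namespace, image, cpu, memory, gpus, pvc_name):
    """Build a Notebook custom resource as a plain dict (illustrative only)."""
    resources = {"requests": {"cpu": cpu, "memory": memory}}
    if gpus:
        # Extended resources such as GPUs are requested through limits.
        resources["limits"] = {"nvidia.com/gpu": str(gpus)}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Notebook",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"template": {"spec": {
            "containers": [{
                "name": name,
                "image": image,  # custom image with pre-installed libraries
                "resources": resources,
                "volumeMounts": [
                    {"name": "workspace", "mountPath": "/home/jovyan"}
                ],
            }],
            # Persistent volume so work survives Pod restarts.
            "volumes": [{
                "name": "workspace",
                "persistentVolumeClaim": {"claimName": pvc_name},
            }],
        }}},
    }


# All values below are hypothetical placeholders.
notebook = notebook_manifest(
    name="my-notebook",
    namespace="team-a",
    image="registry.example.com/jupyter-scipy:latest",
    cpu="2",
    memory="8Gi",
    gpus=1,
    pvc_name="my-notebook-workspace",
)
print(json.dumps(notebook, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON as well as YAML.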
In early versions of Kubeflow (v0.4 and earlier), the notebook component relied on JupyterHub for managing notebook servers. Starting with Kubeflow v0.5, the project replaced JupyterHub with a custom Jupyter web application that provides tighter integration with Kubeflow's authentication, multi-tenancy, and namespace management systems.
KServe is a Kubernetes-native inference platform for deploying and serving machine learning models. Originally developed under the name KFServing, the project was renamed to KServe to reflect its broader scope and independence as a standalone component that can be used with or without the rest of Kubeflow. [9] KServe itself became a CNCF incubating project in November 2025. [9]
KServe provides out-of-the-box support for serving models trained with a wide variety of frameworks:
| Framework | Runtime | Protocol |
|---|---|---|
| TensorFlow | TensorFlow Serving | REST/gRPC |
| PyTorch | TorchServe | REST/gRPC |
| scikit-learn | MLServer | REST/gRPC |
| XGBoost | MLServer | REST/gRPC |
| ONNX | Triton Inference Server | REST/gRPC |
| LightGBM | MLServer | REST/gRPC |
| Custom Models | Any container | REST/gRPC |
Each KServe deployment, called an InferenceService, can consist of up to three components: a predictor, plus an optional transformer and an optional explainer.
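A minimal sketch of an InferenceService with only a predictor is shown below, built as a Python dict and printed as JSON. The service name, namespace, and storage URI are hypothetical; the optional canary percentage is included to show where a traffic split would be declared.

```python
import json


def inference_service(name, namespace, model_format, storage_uri,
                      canary_percent=None):
    """Build an InferenceService manifest as a plain dict (illustrative)."""
    predictor = {
        "model": {
            "modelFormat": {"name": model_format},
            "storageUri": storage_uri,
        }
    }
    if canary_percent is not None:
        if not 0 <= canary_percent <= 100:
            raise ValueError("canary_percent must be between 0 and 100")
        # Share of traffic routed to the newest revision of the predictor.
        predictor["canaryTrafficPercent"] = canary_percent
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"predictor": predictor},
    }


# Hypothetical model location and names.
isvc = inference_service(
    name="churn-model",
    namespace="team-a",
    model_format="sklearn",
    storage_uri="s3://example-bucket/models/churn/v2",
    canary_percent=10,
)
print(json.dumps(isvc, indent=2))
```

With the value set to 10, roughly a tenth of requests would reach the new revision while the rest continue to hit the previously rolled-out one.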
The canaryTrafficPercent field can be set to split traffic between a new (canary) version and the stable version.
The Kubeflow Training Operator provides Kubernetes custom resources for running distributed training jobs. Originally, Kubeflow maintained separate operators for each framework (TFJob, PyTorchJob, MPIJob, and XGBoostJob). In version 1.3, these were merged into a single unified Training Operator to simplify maintenance and provide a consistent user experience. [13]
| Job Type | Framework | Use Case |
|---|---|---|
| TFJob | TensorFlow | Distributed TensorFlow training with parameter servers or workers |
| PyTorchJob | PyTorch | Distributed PyTorch training using torch.distributed |
| MPIJob | MPI / Horovod | High-performance computing workloads using Message Passing Interface |
| XGBoostJob | XGBoost | Distributed gradient boosting training |
| PaddleJob | PaddlePaddle | Distributed PaddlePaddle training |
The Training Operator handles the complexity of coordinating workers, parameter servers, and chief nodes for distributed training. It manages Pod creation, networking between workers, and fault tolerance. If a worker fails, the operator can restart it without restarting the entire job.
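To make the worker coordination concrete, here is a sketch of a PyTorchJob manifest with one master and a configurable number of workers, built as a Python dict. The image and script path are hypothetical; the OnFailure restart policy is what lets the operator restart a failed replica rather than the whole job.

```python
import json


def pytorch_job(name, namespace, image, workers):
    """Build a PyTorchJob manifest as a plain dict (illustrative only)."""

    def replica(count):
        return {
            "replicas": count,
            # Restart only the failed Pod instead of failing the whole job.
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [{
                "name": "pytorch",  # container name the operator looks for
                "image": image,
                "command": ["python", "/workspace/train.py"],
            }]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"pytorchReplicaSpecs": {
            "Master": replica(1),
            "Worker": replica(workers),
        }},
    }


# Hypothetical image and names.
job = pytorch_job("mnist-ddp", "team-a",
                  "registry.example.com/mnist-train:latest", workers=3)
print(json.dumps(job, indent=2))
```

The training script itself would still need to initialise torch.distributed; the operator's role is to create the Pods and give them the addresses and ranks needed to find each other.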
Introduced in Kubeflow 1.10, Trainer v2 represents a significant redesign of the training component, building on more than seven years of experience running ML workloads on Kubernetes. [12] It replaces the multiple framework-specific CRDs with a unified TrainJob API and simplifies the process of fine-tuning large language models on Kubernetes by providing high-level abstractions. The first release supports TorchTune as a built-in LLM trainer with pre-configured runtimes for models like Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct. [12] Trainer v2 also provides dedicated initializers for datasets and models, reducing the amount of boilerplate configuration needed.
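For comparison with the framework-specific resources, the sketch below shows roughly what a minimal TrainJob could look like as a Python dict. It assumes the v1alpha1 API group and a cluster-wide runtime named torch-distributed; both, along with the image and command, should be treated as assumptions and verified against the Trainer version actually installed.

```python
import json


def train_job(name, namespace, runtime, image, command, num_nodes):
    """Build a TrainJob manifest as a plain dict (illustrative only)."""
    return {
        # Assumed API group/version for Trainer v2; check your cluster's CRD.
        "apiVersion": "trainer.kubeflow.org/v1alpha1",
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The runtime carries the framework-specific setup, so the job
            # itself stays framework-agnostic.
            "runtimeRef": {"name": runtime},
            "trainer": {
                "image": image,
                "command": command,
                "numNodes": num_nodes,
            },
        },
    }


# Hypothetical runtime name, image, and script.
tj = train_job(
    name="finetune-demo",
    namespace="team-a",
    runtime="torch-distributed",
    image="registry.example.com/finetune:latest",
    command=["python", "/workspace/finetune.py"],
    num_nodes=2,
)
print(json.dumps(tj, indent=2))
```

The point of the design is visible in the shape of the manifest: there is one job kind, and switching frameworks means referencing a different runtime rather than a different custom resource.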
Katib is Kubeflow's Kubernetes-native component for automated machine learning (AutoML). [8] It supports both hyperparameter tuning and neural architecture search (NAS), allowing users to define search spaces and optimization objectives while Katib handles the rest. The name comes from the Arabic word for "secretary" or "scribe."
Katib supports a rich set of optimization algorithms, many of which are provided through integrations with frameworks like Hyperopt and Optuna:
| Algorithm | Type | Description |
|---|---|---|
| Random Search | Basic | Samples hyperparameters randomly from the search space |
| Grid Search | Basic | Exhaustively evaluates all combinations in a discretized search space |
| Bayesian Optimization | Advanced | Uses Gaussian processes to model the objective function |
| Tree of Parzen Estimators (TPE) | Advanced | Models good and bad hyperparameter distributions separately |
| Multivariate TPE | Advanced | Extension of TPE that considers parameter dependencies |
| Hyperband | Advanced | Early-stopping-based method that allocates resources efficiently |
| CMA-ES | Advanced | Covariance Matrix Adaptation Evolution Strategy for continuous optimization |
| ENAS | NAS | Efficient Neural Architecture Search using parameter sharing |
| DARTS | NAS | Differentiable Architecture Search using continuous relaxation |
| PBT | Advanced | Population Based Training that jointly optimizes hyperparameters and training |
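An experiment ties together a search space, an objective, and one of the algorithms in the table. The sketch below builds a Katib Experiment as a Python dict with a plain Kubernetes Job as the trial; the metric name, parameter ranges, image, and script are hypothetical, and the algorithm name defaults to random search.

```python
import json


def katib_experiment(name, namespace, image, algorithm="random",
                     max_trials=12, parallel_trials=3):
    """Build a Katib Experiment manifest as a plain dict (illustrative)."""
    trial_spec = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "spec": {"template": {"spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "training-container",
                "image": image,
                # Katib substitutes the placeholder for each trial.
                "command": [
                    "python", "/workspace/train.py",
                    "--lr=${trialParameters.learningRate}",
                ],
            }],
        }}},
    }
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "objective": {
                "type": "maximize",
                "goal": 0.95,
                "objectiveMetricName": "accuracy",
            },
            "algorithm": {"algorithmName": algorithm},
            "parallelTrialCount": parallel_trials,
            "maxTrialCount": max_trials,
            "maxFailedTrialCount": 3,
            # Search space: one continuous hyperparameter.
            "parameters": [{
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.001", "max": "0.1"},
            }],
            "trialTemplate": {
                "primaryContainerName": "training-container",
                "trialParameters": [{
                    "name": "learningRate",
                    "description": "Learning rate for the trial",
                    "reference": "lr",
                }],
                "trialSpec": trial_spec,
            },
        },
    }


exp = katib_experiment("lr-search", "team-a",
                       "registry.example.com/train:latest", algorithm="tpe")
print(json.dumps(exp, indent=2))
```

Swapping the algorithm argument is the only change needed to move between the basic and advanced search strategies; the search space and trial template stay the same.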
Katib integrates directly with the Training Operator, allowing it to launch PyTorchJob, TFJob, or other training jobs as trial runs during hyperparameter optimization. This integration means that hyperparameter tuning can be performed on distributed training jobs, not just single-node experiments.
In Kubeflow 1.10, Katib introduced a new high-level Python API that integrates Katib and the Training Operator to automate hyperparameter optimization for LLM fine-tuning workflows, reducing the amount of manual configuration required. [5]
The Kubeflow Central Dashboard serves as the unified web interface for the entire platform. It provides a single entry point for accessing all Kubeflow components, including Notebooks, Pipelines, training jobs, Katib experiments, and model serving endpoints.
The Dashboard also manages Kubeflow's multi-tenancy system. Each user has a profile, which maps to a Kubernetes namespace. Users can only see and interact with resources in namespaces they have access to. Profile owners can add contributors with either view (read-only) or modify (read and write) access.
The Model Registry is a newer Kubeflow component that provides a centralized catalog for ML models, versions, and their associated metadata. It allows teams to track which models have been trained, their lineage, performance metrics, and serving status. In Kubeflow 1.10, the Model Registry received a new web UI and deeper integration with KServe, enabling a smoother path from model registration to deployment. [5]
The Spark Operator, integrated as a core Kubeflow component in version 1.10, enables running Apache Spark jobs on Kubernetes. [5] It was rebuilt with Controller Runtime for improved architecture and includes YuniKorn gang scheduling support for efficient group scheduling of driver and executor pods. This component is particularly useful for data preprocessing and feature engineering stages of ML workflows.
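A preprocessing job of that kind might be declared roughly as follows. The sketch builds a SparkApplication as a Python dict; the image, script path, Spark version, and service account are hypothetical, and the batch scheduler setting only has an effect if YuniKorn is installed and enabled in the cluster.

```python
import json


def spark_application(name, namespace, image, main_file, executors,
                      gang_scheduler=None):
    """Build a SparkApplication manifest as a plain dict (illustrative)."""
    spec = {
        "type": "Python",
        "pythonVersion": "3",
        "mode": "cluster",
        "image": image,
        "mainApplicationFile": main_file,
        "sparkVersion": "3.5.0",  # assumed; match the version in the image
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"instances": executors, "cores": 1, "memory": "2g"},
    }
    if gang_scheduler:
        # Ask the named batch scheduler to place driver and executors together.
        spec["batchScheduler"] = gang_scheduler
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


# Hypothetical image and script location.
app = spark_application(
    name="feature-prep",
    namespace="team-a",
    image="registry.example.com/spark-py:latest",
    main_file="local:///opt/app/preprocess.py",
    executors=4,
    gang_scheduler="yunikorn",
)
print(json.dumps(app, indent=2))
```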
Kubeflow is designed to be cloud-agnostic and can be deployed on any conformant Kubernetes cluster. However, the major cloud providers offer specific distributions and documentation for running Kubeflow on their managed Kubernetes services.
Google Cloud provides first-party support for Kubeflow on GKE. The deployment leverages Google Cloud's IAP (Identity-Aware Proxy) for authentication and integrates with Google Cloud Storage for artifact storage. Google also offers Vertex AI Pipelines, which is a managed service based on Kubeflow Pipelines, providing a fully managed alternative for organizations that prefer not to manage the infrastructure themselves.
Kubeflow on AWS is an open-source distribution maintained by AWS Labs. It integrates Kubeflow with AWS-native services such as Amazon S3 for artifact storage, Amazon RDS for metadata storage, and Amazon Cognito for authentication. Canonical also offers Charmed Kubeflow, which can be deployed on EKS.
Kubeflow can be deployed on Azure Kubernetes Service (AKS) by organizations running ML workloads on Microsoft's cloud. The deployment integrates with Azure Active Directory (now Microsoft Entra ID) for identity management and Azure Blob Storage for artifacts.
Kubeflow can also be deployed on bare-metal Kubernetes clusters, OpenShift, Rancher, and local development environments using Kind or Minikube. The Kubeflow manifests repository provides community-maintained configurations for a variety of Kubernetes distributions.
Kubeflow provides a robust multi-tenancy model built on top of Kubernetes namespaces. The Profile custom resource wraps a Kubernetes namespace and associates it with an owner. Access control is enforced through Kubernetes RBAC, and the Kubeflow Central Dashboard ensures that users can only see resources within namespaces they are authorized to access.
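The relationship between a profile, its owner, and the underlying namespace can be sketched as a manifest. The example below builds a Profile as a Python dict; the profile name, owner email, and quota values are hypothetical, and the optional quota section is included only to show where per-namespace limits would go.

```python
import json


def profile_manifest(name, owner_email, cpu_quota=None, memory_quota=None):
    """Build a Profile manifest as a plain dict (illustrative only)."""
    spec = {"owner": {"kind": "User", "name": owner_email}}
    hard = {}
    if cpu_quota:
        hard["cpu"] = cpu_quota
    if memory_quota:
        hard["memory"] = memory_quota
    if hard:
        # Optional ResourceQuota applied to the profile's namespace.
        spec["resourceQuotaSpec"] = {"hard": hard}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        # Profiles are cluster-scoped; the name becomes the namespace name.
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical owner and limits.
profile = profile_manifest("team-a", "alice@example.com",
                           cpu_quota="16", memory_quota="64Gi")
print(json.dumps(profile, indent=2))
```

Once such a profile exists, the namespace of the same name is where that user's notebooks, pipeline runs, and other resources live, and RBAC bindings in that namespace decide who else can view or modify them.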
Authentication is typically handled through Istio's ingress gateway combined with Dex, an OpenID Connect identity provider. This setup supports integration with external identity providers such as LDAP, SAML, and OAuth2 providers like Google, GitHub, and Microsoft.
Kubeflow Pipelines enforces namespace-level isolation, meaning that pipeline runs, experiments, and artifacts in one namespace are not visible to users in other namespaces unless access has been explicitly granted.
Kubeflow occupies a specific niche in the MLOps ecosystem. The following table compares it with other popular platforms:
| Feature | Kubeflow | MLflow | Amazon SageMaker | Vertex AI |
|---|---|---|---|---|
| Type | Open-source platform | Open-source toolkit | Managed cloud service | Managed cloud service |
| License | Apache 2.0 | Apache 2.0 | Proprietary | Proprietary |
| Infrastructure | Any Kubernetes cluster | Any (standalone) | AWS only | Google Cloud only |
| Pipeline Orchestration | Kubeflow Pipelines (Argo) | MLflow Recipes | SageMaker Pipelines | Vertex AI Pipelines (KFP-based) |
| Experiment Tracking | MLMD, Pipelines UI | MLflow Tracking (native) | SageMaker Experiments | Vertex ML Metadata |
| Model Serving | KServe | MLflow Models (basic) | SageMaker Endpoints | Vertex AI Prediction |
| Distributed Training | Training Operator (native) | Not built-in | SageMaker Training | Vertex AI Training |
| Hyperparameter Tuning | Katib | Not built-in | Automatic Model Tuning | Vertex AI Vizier |
| Notebook Support | JupyterLab, RStudio, VS Code | Not built-in | SageMaker Studio | Vertex AI Workbench |
| Multi-Tenancy | Native (Profiles/Namespaces) | Not built-in | IAM-based | IAM-based |
| Cloud Lock-in | None | None | AWS | Google Cloud |
| Setup Complexity | High | Low | Low (managed) | Low (managed) |
| Operational Overhead | High (self-managed) | Low to Medium | Low (managed) | Low (managed) |
| Best For | Kubernetes-native orgs, multi-cloud | Experiment tracking, small teams | AWS-centric enterprises | Google Cloud users |
Kubeflow vs. MLflow: MLflow excels at experiment tracking and model registry and is often the easiest tool to adopt for small teams. It does not provide infrastructure for distributed training, model serving, or pipeline orchestration at the same level as Kubeflow. Many organizations use MLflow for experiment tracking alongside Kubeflow for orchestration and serving.
Kubeflow vs. SageMaker: Amazon SageMaker is a fully managed service that eliminates the operational burden of managing infrastructure. It provides a tightly integrated experience within the AWS ecosystem. Kubeflow, by contrast, runs on any Kubernetes cluster and avoids vendor lock-in, but requires significant operational expertise to deploy and maintain.
Kubeflow vs. Vertex AI: Google's Vertex AI is, in many ways, the managed cloud evolution of Kubeflow. Vertex AI Pipelines is built on top of the Kubeflow Pipelines SDK, meaning pipeline code is largely portable between the two. Organizations that want the Kubeflow programming model without the operational overhead often choose Vertex AI if they are on Google Cloud.
Kubeflow is best suited for organizations that meet several of the following criteria:
Kubeflow may not be the right choice in these scenarios:
Kubeflow has a large and active community. At the time of its CNCF acceptance in July 2023, the project reported more than 28,000 GitHub stars on the main repository, over 15,000 total committers, more than 55,000 total GitHub contributions, and over 9,000 members in its Slack workspace. [3] The project had also shipped 15 major releases since its 2017 launch. [3]
The community communicates through the CNCF Slack workspace, the kubeflow-discuss mailing list, and regular community meetings. Notable adopters of Kubeflow include Google, Spotify, Bloomberg, the US Department of Defense, and various financial institutions. The project maintains an ADOPTERS.md file where organizations can publicly declare their use of Kubeflow.
Several companies offer commercial distributions of Kubeflow along with paid support; at the time of its CNCF acceptance, there were 10 such distributions. [3]
The simplest way to try Kubeflow locally is to deploy it on a local Kubernetes cluster using Kind (Kubernetes in Docker) or Minikube. [14] The official Kubeflow manifests repository on GitHub provides configuration files for these environments. For production deployments, the recommended approach is to use one of the cloud-specific distributions or the community manifests tailored to the target Kubernetes platform.
The minimum Kubernetes version required varies by Kubeflow release. Kubeflow requires a cluster with at least 16 GB of RAM and 4 CPUs for the platform components, with additional resources needed for actual ML workloads.