# Kubeflow

> Source: https://aiwiki.ai/wiki/kubeflow
> Updated: 2026-06-24
> Categories: Developer Tools, MLOps, Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Kubeflow** is an open-source [MLOps](/wiki/mlops) platform that runs the entire [machine learning](/wiki/machine_learning) lifecycle on [Kubernetes](/wiki/kubernetes), described by its creators as a project "dedicated to making using ML stacks on Kubernetes easy, fast and extensible." [1] Google open-sourced Kubeflow on December 21, 2017, originally as a way to simplify running [TensorFlow](/wiki/tensorflow) jobs on Kubernetes, and the project was accepted into the [Cloud Native Computing Foundation](/wiki/cncf) (CNCF) as an incubating-stage project on July 25, 2023. [1][3] It is licensed under Apache 2.0 and bundles a suite of components, including Kubeflow Pipelines, Notebooks, the Training Operator, Katib (hyperparameter tuning), and KServe (model serving), under a single Central Dashboard.

The name "Kubeflow" combines "Kube" (from Kubernetes) and "flow" (from TensorFlow), reflecting its origins as a way for Google to open-source how it ran TensorFlow jobs internally. Over the years, the platform has expanded well beyond TensorFlow to support a broad range of ML frameworks, including [PyTorch](/wiki/pytorch), [XGBoost](/wiki/xgboost), [JAX](/wiki/jax), [MXNet](/wiki/mxnet), and others. As of its CNCF acceptance, Kubeflow had more than 28,000 GitHub stars, over 15,000 committers, more than 150 contributing companies, and 10 commercial distributions. [3]

## What problem does Kubeflow solve?

Kubeflow was created to close the gap between developing ML models on a laptop and deploying them reliably in production. The original 2017 announcement framed the core problem this way: "Building any production-ready machine learning system involves various components, often mixing vendors and hand-rolled solutions. Connecting and managing these services for even moderately sophisticated setups introduces huge barriers of complexity in adopting machine learning." [1]

A second motivation was portability. Before Kubeflow, ML stacks were often "so tied to the clusters they have been deployed to that these stacks are immobile," making it "effectively impossible" to move a model from a laptop to a scalable cloud cluster without significant re-architecture. [1] By packaging the ML toolchain as Kubernetes-native resources, Kubeflow aimed to let the same workflow run anywhere Kubernetes runs. As the announcement put it: "Because this solution relies on Kubernetes, it runs wherever Kubernetes runs. Just spin up a cluster and go!" [1]

## History and Development

### When was Kubeflow created?

Kubeflow was introduced by Google engineers David Aronchick, Jeremy Lewi, and Vishnu Kannan at KubeCon + CloudNativeCon North America in early December 2017, and the public announcement on the Kubernetes blog followed on December 21, 2017. [1] The project drew on Google's internal experience running TensorFlow at scale and was created to address a perceived lack of flexible, open-source options for building production-ready machine learning systems on Kubernetes. At the time, organizations frequently struggled with the gap between developing ML models in notebooks and deploying them in production environments. Kubeflow was designed to bridge that gap by providing a consistent, Kubernetes-native platform for the entire ML lifecycle. The initial repository shipped with JupyterHub for managing interactive notebooks, a TensorFlow Custom Resource Definition (CRD) that could target CPUs or GPUs, and a TensorFlow Serving container for model serving. [1]

### Kubeflow 0.1 (2018)

The first official release, Kubeflow 0.1, was announced at KubeCon + CloudNativeCon Europe 2018. [2] This release established the foundational architecture and included early versions of several core components: TFJob for distributed TensorFlow training, JupyterHub integration for interactive notebooks, and basic support for model serving. The release attracted significant community interest, and contributions began flowing in from organizations beyond Google, including Cisco, Red Hat, IBM, and others.

### Kubeflow 1.0 (March 2020)

Kubeflow 1.0 was released on March 2, 2020, marking a major milestone. [4] This release signaled that a core set of Kubeflow applications had graduated to "stable" status and were considered ready for production use. The stable components in version 1.0 included the Central Dashboard, Kubeflow Notebooks, the Profile Controller for multi-tenancy, TensorFlow Serving, and the training operators (TFJob and PyTorchJob). Kubeflow Pipelines and Katib (hyperparameter tuning) shipped in 1.0 as beta components and were matured toward stable status in later releases. [4]

### Is Kubeflow a CNCF project?

Yes. In October 2022, Google announced that the Kubeflow project had applied to join the Cloud Native Computing Foundation. [5] On July 25, 2023, the CNCF Technical Oversight Committee voted to accept Kubeflow as an incubating-stage project. [3] This was a significant endorsement of the project's maturity, governance, and community health. At the time of acceptance, the project reported over 150 contributing companies, 10 commercial distributions based on Kubeflow, more than 28,000 GitHub stars, over 15,000 total committers, and 15 major releases since 2017. [3] As of 2025, Kubeflow remains at the CNCF incubating maturity level and the community is actively working toward graduation; it has not yet been promoted to the graduated tier. [6][7]

Ricardo Rocha, the project's CNCF Technical Oversight Committee sponsor, described Kubeflow's role at acceptance: "Kubeflow helps fill a gap by delivering machine learning pipelines and MLOps while working closely with its extensive community and other tools and initiatives." [3]

### Recent Releases

| Version | Release Date | Key Highlights |
|---|---|---|
| 1.0 | March 2020 | First stable release; Notebooks, Central Dashboard, TFJob/PyTorchJob stable; Pipelines and Katib beta |
| 1.4 | October 2021 | Unified Training Operator (Training Operator v1.3) merging TFJob, PyTorchJob, MPIJob |
| 1.7 | March 2023 | KServe improvements, Pipelines v2 backend |
| 1.8 | October 2023 | Enhanced multi-tenancy, Pipelines v2 SDK |
| 1.9 | July 2024 | Simplified LLM workflows, security improvements |
| 1.10 | March 2025 | Trainer v2, new Katib API, Spark Operator integration, Model Registry UI |

## Architecture Overview

Kubeflow is not a single monolithic application. Instead, it is a collection of loosely coupled components, each addressing a different stage of the [machine learning lifecycle](/wiki/mlops). These components are deployed as Kubernetes-native resources (Custom Resource Definitions, Pods, Services, and so on) and are managed through the Kubeflow Central Dashboard.

The platform follows a microservices architecture where each component can be installed independently or as part of the full Kubeflow platform. All components share a common authentication layer (typically [Istio](/wiki/istio) and Dex for identity management) and a multi-tenancy system based on Kubernetes namespaces.

At a high level, the Kubeflow architecture covers these stages of the ML lifecycle:

1. **Data exploration and model development** using Kubeflow Notebooks
2. **Pipeline orchestration** using Kubeflow Pipelines
3. **Distributed model training** using the Training Operator
4. **[Hyperparameter](/wiki/hyperparameter) tuning and AutoML** using Katib
5. **Model serving and inference** using KServe
6. **Model management** using the Model Registry

## Core Components

### Kubeflow Pipelines

Kubeflow Pipelines (KFP) is one of the most widely used components of the platform. It provides a system for building, deploying, and managing multi-step ML workflows as reusable, portable pipelines. Each step in a pipeline runs in its own container, which ensures reproducibility and isolation.

#### Architecture

Kubeflow Pipelines consists of several sub-components:

- **Python SDK**: Used to define pipeline components and compose them into directed acyclic graphs (DAGs). The SDK provides decorators (`@dsl.component` and `@dsl.pipeline` in the v2 SDK) to author pipelines in pure Python.
- **Pipeline Compiler**: Converts the Python pipeline definition into an intermediate representation (IR) that can be executed by a backend engine.
- **Workflow Engine**: By default, KFP uses [Argo Workflows](/wiki/argo_workflows) as the underlying execution engine. The KFP backend translates the pipeline IR into Argo Workflow resources, which Argo then runs as Kubernetes Pods.
- **ML Metadata (MLMD)**: Tracks execution lineage, artifacts, and metadata throughout pipeline runs. MLMD enables caching, provenance tracking, and lineage visualization.
- **Pipeline UI**: A web-based interface for viewing pipeline runs, comparing experiments, inspecting artifacts, and visualizing the DAG structure.
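
The DAG idea behind these sub-components can be illustrated without the KFP SDK. The sketch below is plain Python using the standard-library `graphlib` module (Python 3.9+), not KFP code, and the step names are hypothetical. It groups steps into waves in which every upstream dependency has already finished, which is roughly the ordering constraint a workflow engine has to respect.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
dependencies = {
    "load_data": set(),
    "validate": {"load_data"},
    "featurize": {"load_data"},
    "train": {"validate", "featurize"},
    "evaluate": {"train"},
}


def execution_waves(deps):
    """Group steps into waves whose dependencies are all satisfied."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # steps that may run in parallel
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
for index, wave in enumerate(waves, start=1):
    print(f"wave {index}: {', '.join(wave)}")
```

In this toy graph `validate` and `featurize` land in the same wave because both depend only on `load_data`; in a real pipeline each such step would run in its own container.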

#### Pipelines v2

The v2 release of Kubeflow Pipelines introduced several major improvements. [11] The updated intermediate representation makes pipelines executable by backends other than Argo. Input and output artifacts (datasets, models, metrics) are treated as first-class nodes in the DAG visualization. The v2 SDK also allows compiling and running individual components independently, and supports nesting pipelines as components of larger pipelines. The v2 backend maintains backward compatibility with v1 APIs.

### Kubeflow Notebooks

Kubeflow Notebooks provides a way to run web-based interactive development environments inside a Kubernetes cluster. Rather than developing on local workstations, data scientists can spin up notebook servers directly within the cluster, gaining immediate access to cluster resources, GPUs, and shared storage.

#### Supported Environments

Kubeflow Notebooks natively supports three types of development environments:

- **JupyterLab**: The most popular choice for interactive Python development and data exploration.
- **RStudio**: For data scientists working with the R programming language.
- **Visual Studio Code (code-server)**: A browser-based version of VS Code for general-purpose development.

Each notebook server runs as a container inside a Kubernetes Pod. Users can select custom container images with pre-installed libraries, request specific resource allocations (CPU, memory, GPUs), and attach persistent volumes for data storage.
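
As a rough illustration of those options, the sketch below builds a `Notebook` custom resource as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The helper name, image reference, namespace, and claim name are placeholders, and the `kubeflow.org/v1` API version and `/home/jovyan` mount path are assumptions that should be checked against the installed Kubeflow release.

```python
import json


def notebook_manifest(name, namespace, image, cpu="1", memory="2Gi",
                      gpus=0, pvc_name=None):
    """Build a Notebook custom resource as a plain dict (illustrative helper)."""
    resources = {"requests": {"cpu": cpu, "memory": memory}}
    if gpus > 0:
        # GPUs are requested through the container's resource limits.
        resources["limits"] = {"nvidia.com/gpu": gpus}
    container = {"name": name, "image": image, "resources": resources}
    pod_spec = {"containers": [container]}
    if pvc_name is not None:
        # Mount an existing PersistentVolumeClaim so work survives restarts.
        container["volumeMounts"] = [
            {"name": "workspace", "mountPath": "/home/jovyan"}
        ]
        pod_spec["volumes"] = [
            {"name": "workspace",
             "persistentVolumeClaim": {"claimName": pvc_name}}
        ]
    return {
        "apiVersion": "kubeflow.org/v1",  # assumed API version
        "kind": "Notebook",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"template": {"spec": pod_spec}},
    }


manifest = notebook_manifest(
    name="demo-notebook",
    namespace="team-a",  # placeholder profile namespace
    image="example.registry.local/jupyter-scipy:latest",  # placeholder image
    gpus=1,
    pvc_name="demo-workspace",  # placeholder claim name
)
print(json.dumps(manifest, indent=2))
```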

#### Evolution from JupyterHub

In early versions of Kubeflow (v0.4 and earlier), the notebook component relied on JupyterHub for managing notebook servers. Starting with Kubeflow v0.5, the project replaced JupyterHub with a custom Jupyter web application that provides tighter integration with Kubeflow's authentication, multi-tenancy, and namespace management systems.

### What is KServe (formerly KFServing)?

[KServe](/wiki/kserve) is a Kubernetes-native inference platform for deploying and serving machine learning models. Originally developed under the name KFServing, the project was renamed to KServe to reflect its broader scope and independence as a standalone component that can be used with or without the rest of Kubeflow. [9] KServe itself became a CNCF incubating project in November 2025. [9]

#### Supported Frameworks

KServe provides out-of-the-box support for serving models trained with a wide variety of frameworks:

| Framework | Runtime | Protocol |
|---|---|---|
| [TensorFlow](/wiki/tensorflow) | TensorFlow Serving | REST/gRPC |
| [PyTorch](/wiki/pytorch) | TorchServe | REST/gRPC |
| [scikit-learn](/wiki/scikit-learn) | MLServer | REST/gRPC |
| [XGBoost](/wiki/xgboost) | MLServer | REST/gRPC |
| [ONNX](/wiki/onnx) | Triton Inference Server | REST/gRPC |
| [LightGBM](/wiki/lightgbm) | MLServer | REST/gRPC |
| Custom Models | Any container | REST/gRPC |

#### InferenceService Architecture

Each KServe deployment, called an InferenceService, can consist of up to three components:

- **Predictor**: The required component that loads the model and serves predictions. This is the core of the InferenceService.
- **[Transformer](/wiki/transformer)**: An optional component for pre-processing and post-processing requests and responses. KServe provides built-in transformers for common use cases such as feature retrieval from Feast.
- **Explainer**: An optional component that provides model explanations alongside predictions. Built-in explainers include integrations with Alibi Explain for techniques like SHAP and anchors.

#### Key Features

- **Autoscaling and Scale-to-Zero**: KServe supports request-based autoscaling. In its Knative-based serverless deployment mode, it can scale an idle service down to zero replicas and back up when requests arrive.
- **Canary Deployments**: Supports gradual traffic shifting between model versions. Operators can configure a `canaryTrafficPercent` to split traffic between a new (canary) version and the stable version.
- **[Inference](/wiki/inference) Graphs**: The InferenceGraph resource enables building complex inference pipelines where multiple models work together. It supports four routing node types: Sequence, Switch, Ensemble, and Splitter.
- **GPU Autoscaling**: Automatic scaling of GPU-backed inference endpoints based on request load.
- **Batching**: Automatic request batching to improve throughput for high-volume inference workloads.
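
To make the predictor, canary, and scale-to-zero settings concrete, here is a minimal sketch that assembles an `InferenceService` manifest as a Python dictionary. The helper name, service name, namespace, and storage URI are hypothetical, and the field layout follows the `serving.kserve.io/v1beta1` API as commonly documented; treat it as an assumption to verify against the KServe version in use. Scaling to zero via `minReplicas: 0` is assumed to apply only in the serverless deployment mode.

```python
import json


def inference_service(name, namespace, model_format, storage_uri,
                      canary_percent=None, min_replicas=0):
    """Build an InferenceService manifest as a dict (illustrative helper)."""
    predictor = {
        "minReplicas": min_replicas,  # 0 permits scale-to-zero when idle
        "model": {
            "modelFormat": {"name": model_format},
            "storageUri": storage_uri,
        },
    }
    if canary_percent is not None:
        if not 0 <= canary_percent <= 100:
            raise ValueError("canary_percent must be between 0 and 100")
        # Share of traffic routed to the newest revision of this service.
        predictor["canaryTrafficPercent"] = canary_percent
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"predictor": predictor},
    }


service = inference_service(
    name="iris-classifier",
    namespace="team-a",
    model_format="sklearn",
    storage_uri="s3://example-bucket/models/iris",  # placeholder location
    canary_percent=10,
)
print(json.dumps(service, indent=2))
```

With `canary_percent=10`, roughly one request in ten would be routed to the newly rolled-out revision while the remainder continues to hit the previous one.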

### Training Operators

The Kubeflow Training Operator provides Kubernetes custom resources for running distributed training jobs. Originally, Kubeflow maintained a separate operator for each job type (TFJob, PyTorchJob, MPIJob, and XGBoostJob). With release 1.3 of the Training Operator, these were merged into a single unified operator to simplify maintenance and provide a consistent user experience. [13]

#### Supported Job Types

| Job Type | Framework | Use Case |
|---|---|---|
| TFJob | [TensorFlow](/wiki/tensorflow) | Distributed TensorFlow training with parameter servers or workers |
| PyTorchJob | [PyTorch](/wiki/pytorch) | Distributed PyTorch training using `torch.distributed` |
| MPIJob | MPI / [Horovod](/wiki/horovod) | High-performance computing workloads using Message Passing Interface |
| XGBoostJob | [XGBoost](/wiki/xgboost) | Distributed gradient boosting training |
| PaddleJob | PaddlePaddle | Distributed PaddlePaddle training |

The Training Operator handles the complexity of coordinating workers, parameter servers, and chief nodes for distributed training. It manages Pod creation, networking between workers, and fault tolerance. If a worker fails, the operator can restart it according to the replica's restart policy without restarting the entire job.
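
A job of this kind is declared as a custom resource. The sketch below builds a `PyTorchJob` with one master and several workers as a Python dictionary; the helper name, image, and command are hypothetical, and the structure follows the `kubeflow.org/v1` Training Operator API as an assumption to confirm against the installed version.

```python
import json


def pytorch_job(name, namespace, image, command, workers=2,
                gpus_per_replica=0):
    """Build a PyTorchJob manifest as a dict (illustrative helper)."""

    def replica_spec(count):
        # The training container is conventionally named "pytorch".
        container = {"name": "pytorch", "image": image, "command": command}
        if gpus_per_replica > 0:
            container["resources"] = {
                "limits": {"nvidia.com/gpu": gpus_per_replica}
            }
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",  # restart a failed Pod, not the job
            "template": {"spec": {"containers": [container]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                "Worker": replica_spec(workers),
            }
        },
    }


job = pytorch_job(
    name="mnist-ddp",
    namespace="team-a",
    image="example.registry.local/mnist-train:latest",  # placeholder image
    command=["python", "train.py", "--epochs=5"],       # placeholder command
    workers=3,
    gpus_per_replica=1,
)
print(json.dumps(job, indent=2))
```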

#### Trainer v2

Introduced in Kubeflow 1.10, Trainer v2 represents a significant redesign of the training component, building on more than seven years of experience running ML workloads on Kubernetes. [12] It replaces the multiple framework-specific CRDs with a unified TrainJob API and simplifies the process of fine-tuning [large language models](/wiki/large_language_model) on Kubernetes by providing high-level abstractions. The first release supports TorchTune as a built-in LLM trainer with pre-configured runtimes for models like Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct. [12] Trainer v2 also provides dedicated initializers for datasets and models, reducing the amount of boilerplate configuration needed.

### What is Katib used for?

Katib is Kubeflow's Kubernetes-native component for automated machine learning ([AutoML](/wiki/automl)). [8] It supports both [hyperparameter tuning](/wiki/hyperparameter_optimization) and [neural architecture search](/wiki/neural_architecture_search) (NAS), allowing users to define search spaces and optimization objectives while Katib handles the rest. The name comes from the Arabic word for "secretary" or "scribe."

#### Supported Algorithms

Katib supports a rich set of optimization algorithms, many of which are provided through integrations with frameworks like Hyperopt and Optuna:

| Algorithm | Type | Description |
|---|---|---|
| Random Search | Basic | Samples hyperparameters randomly from the search space |
| Grid Search | Basic | Exhaustively evaluates all combinations in a discretized search space |
| Bayesian Optimization | Advanced | Uses Gaussian processes to model the objective function |
| Tree of Parzen Estimators (TPE) | Advanced | Models good and bad hyperparameter distributions separately |
| Multivariate TPE | Advanced | Extension of TPE that considers parameter dependencies |
| Hyperband | Advanced | Early-stopping-based method that allocates resources efficiently |
| CMA-ES | Advanced | Covariance Matrix Adaptation Evolution Strategy for continuous optimization |
| ENAS | NAS | Efficient Neural Architecture Search using parameter sharing |
| DARTS | NAS | Differentiable Architecture Search using continuous relaxation |
| PBT | Advanced | Population Based Training that jointly optimizes hyperparameters and training |
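
The simplest entry in the table, random search, can be sketched in a few lines of plain Python. This is a conceptual illustration, not Katib's implementation: the search space, the toy objective (a stand-in for a validation loss), and all names are hypothetical.

```python
import random

# Hypothetical search space: name -> (min, max) for continuous parameters.
search_space = {
    "learning_rate": (1e-4, 1e-1),
    "dropout": (0.0, 0.5),
}


def toy_objective(params):
    """Stand-in for a validation loss; smallest near lr=0.01, dropout=0.2."""
    return (params["learning_rate"] - 0.01) ** 2 + (params["dropout"] - 0.2) ** 2


def random_search(space, objective, max_trials, seed=0):
    """Sample each parameter uniformly and keep the lowest-loss trial."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_trials):
        params = {name: rng.uniform(low, high)
                  for name, (low, high) in space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(search_space, toy_objective,
                                       max_trials=50)
print(best_params, best_loss)
```

In Katib each trial would be a training job rather than a function call, and learning rates are often sampled on a logarithmic scale instead of uniformly; the uniform sampling here is only for brevity.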

#### Integration with Training Operator

Katib integrates directly with the Training Operator, allowing it to launch PyTorchJob, TFJob, or other training jobs as trial runs during hyperparameter optimization. This integration means that hyperparameter tuning can be performed on distributed training jobs, not just single-node experiments.
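
The sketch below shows roughly how such an integration is expressed: a Katib `Experiment` whose trial template embeds a `PyTorchJob`. It is built as a Python dictionary with hypothetical names, image, and metric; the field layout follows the `kubeflow.org/v1beta1` Experiment API as an assumption, and a working setup may need further fields (for example metrics-collector or pod-label settings) that are omitted here.

```python
import json


def katib_experiment(name, namespace, image, max_trials=12, parallel_trials=3):
    """Build a Katib Experiment that runs PyTorchJob trials (illustrative)."""
    trial_spec = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": {
                    "replicas": 1,
                    "restartPolicy": "OnFailure",
                    "template": {"spec": {"containers": [{
                        "name": "pytorch",
                        "image": image,
                        # Katib substitutes the placeholder for each trial.
                        "command": [
                            "python", "train.py",
                            "--lr=${trialParameters.learningRate}",
                        ],
                    }]}},
                }
            }
        },
    }
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "objective": {"type": "minimize", "objectiveMetricName": "loss"},
            "algorithm": {"algorithmName": "random"},
            "maxTrialCount": max_trials,
            "parallelTrialCount": parallel_trials,
            "maxFailedTrialCount": 3,
            "parameters": [{
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.0001", "max": "0.1"},
            }],
            "trialTemplate": {
                "primaryContainerName": "pytorch",
                "trialParameters": [{
                    "name": "learningRate",
                    "description": "Learning rate passed to the trainer",
                    "reference": "lr",
                }],
                "trialSpec": trial_spec,
            },
        },
    }


experiment = katib_experiment(
    name="lr-sweep",
    namespace="team-a",
    image="example.registry.local/mnist-train:latest",  # placeholder image
)
print(json.dumps(experiment, indent=2))
```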

In Kubeflow 1.10, Katib introduced a new high-level Python API that integrates Katib and the Training Operator to automate hyperparameter optimization for LLM fine-tuning workflows, reducing the amount of manual configuration required. [5]

### Central Dashboard

The Kubeflow Central Dashboard serves as the unified web interface for the entire platform. It provides a single entry point for accessing all Kubeflow components, including Notebooks, Pipelines, training jobs, Katib experiments, and model serving endpoints.

The Dashboard also manages Kubeflow's multi-tenancy system. Each user has a profile, which maps to a Kubernetes namespace. Users can only see and interact with resources in namespaces they have access to. Profile owners can add contributors with either view (read-only) or modify (read and write) access.

### Model Registry

The Model Registry is a newer Kubeflow component that provides a centralized catalog for ML models, versions, and their associated metadata. It allows teams to track which models have been trained, their lineage, performance metrics, and serving status. In Kubeflow 1.10, the Model Registry received a new web UI and deeper integration with KServe, enabling a smoother path from model registration to deployment. [5]

### Spark Operator

The Spark Operator, integrated as a core Kubeflow component in version 1.10, enables running Apache Spark jobs on Kubernetes. [5] It was rebuilt with Controller Runtime for improved architecture and includes YuniKorn gang scheduling support for efficient group scheduling of driver and executor pods. This component is particularly useful for data preprocessing and feature engineering stages of ML workflows.
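
For orientation, a Spark job is submitted to the operator as a `SparkApplication` resource. The sketch below assembles one as a Python dictionary. The helper name, image, file path, and service account are placeholders, and the `sparkoperator.k8s.io/v1beta2` layout and the `batchScheduler` field are assumptions based on the operator's commonly documented API, to be verified against the deployed version.

```python
import json


def spark_application(name, namespace, image, main_file, spark_version,
                      executors=2, service_account="spark",
                      batch_scheduler=None):
    """Build a SparkApplication manifest as a dict (illustrative helper)."""
    spec = {
        "type": "Python",
        "mode": "cluster",
        "image": image,
        "mainApplicationFile": main_file,
        "sparkVersion": spark_version,
        "driver": {
            "cores": 1,
            "memory": "1g",
            "serviceAccount": service_account,
        },
        "executor": {"instances": executors, "cores": 1, "memory": "1g"},
    }
    if batch_scheduler is not None:
        # Assumed field for gang scheduling, e.g. "yunikorn".
        spec["batchScheduler"] = batch_scheduler
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


app = spark_application(
    name="feature-prep",
    namespace="team-a",
    image="example.registry.local/spark-py:latest",  # placeholder image
    main_file="local:///opt/app/prepare_features.py",  # placeholder path
    spark_version="3.5.0",
    executors=4,
    batch_scheduler="yunikorn",
)
print(json.dumps(app, indent=2))
```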

## Deployment on Cloud Platforms

Kubeflow is designed to be cloud-agnostic and can be deployed on any conformant Kubernetes cluster. However, the major cloud providers offer specific distributions and documentation for running Kubeflow on their managed Kubernetes services.

### Google Kubernetes Engine (GKE)

Google Cloud provides first-party support for Kubeflow on [GKE](/wiki/google_kubernetes_engine). The deployment leverages Google Cloud's IAP (Identity-Aware Proxy) for authentication and integrates with Google Cloud Storage for artifact storage. Google also offers [Vertex AI](/wiki/vertex_ai) Pipelines, which is a managed service based on Kubeflow Pipelines, providing a fully managed alternative for organizations that prefer not to manage the infrastructure themselves.

### Amazon Elastic Kubernetes Service (EKS)

Kubeflow on [AWS](/wiki/amazon_web_services) is an open-source distribution maintained by AWS Labs. It integrates Kubeflow with AWS-native services such as Amazon S3 for artifact storage, Amazon RDS for metadata storage, and Amazon Cognito for authentication. Canonical also offers Charmed Kubeflow, which can be deployed on EKS.

### Azure Kubernetes Service (AKS)

Kubeflow can be deployed on [AKS](/wiki/azure_openai), Microsoft's managed Kubernetes service. The deployment integrates with Azure Active Directory (since renamed Microsoft Entra ID) for identity management and Azure Blob Storage for artifacts.

### On-Premises and Other Platforms

Kubeflow can also be deployed on bare-metal Kubernetes clusters, OpenShift, Rancher, and local development environments using Kind or Minikube. The Kubeflow manifests repository provides community-maintained configurations for a variety of Kubernetes distributions.

## Multi-Tenancy and Security

Kubeflow provides a robust multi-tenancy model built on top of Kubernetes namespaces. The Profile custom resource wraps a Kubernetes namespace and associates it with an owner. Access control is enforced through Kubernetes RBAC, and the Kubeflow Central Dashboard ensures that users can only see resources within namespaces they are authorized to access.
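
A Profile is itself a small custom resource. The following sketch builds one as a Python dictionary; the helper name, profile name, owner address, and quota values are hypothetical, and the `kubeflow.org/v1` layout (including the optional `resourceQuotaSpec`) is an assumption to check against the installed Profile controller.

```python
import json


def profile_manifest(name, owner_email, quota=None):
    """Build a Profile manifest as a dict (illustrative helper).

    The profile name is also used as the name of the namespace it wraps.
    """
    spec = {"owner": {"kind": "User", "name": owner_email}}
    if quota:
        # Optional limits applied to the namespace as a ResourceQuota.
        spec["resourceQuotaSpec"] = {"hard": dict(quota)}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": name},  # cluster-scoped, so no namespace field
        "spec": spec,
    }


profile = profile_manifest(
    name="team-a",
    owner_email="alice@example.com",  # placeholder identity
    quota={"cpu": "16", "memory": "64Gi", "requests.nvidia.com/gpu": "2"},
)
print(json.dumps(profile, indent=2))
```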

Authentication is typically handled through Istio's ingress gateway combined with Dex, an OpenID Connect identity provider. This setup supports integration with external identity providers such as LDAP, SAML, and OAuth2 providers like Google, GitHub, and Microsoft.

Kubeflow Pipelines enforces namespace-level isolation, meaning that pipeline runs, experiments, and artifacts in one namespace are not visible to users in other namespaces unless access has been explicitly granted.

## How does Kubeflow compare with other ML platforms?

Kubeflow occupies a specific niche in the MLOps ecosystem. The following table compares it with other popular platforms:

| Feature | Kubeflow | [MLflow](/wiki/mlflow) | [Amazon SageMaker](/wiki/amazon_sagemaker) | [Vertex AI](/wiki/vertex_ai) |
|---|---|---|---|---|
| **Type** | Open-source platform | Open-source toolkit | Managed cloud service | Managed cloud service |
| **License** | Apache 2.0 | Apache 2.0 | Proprietary | Proprietary |
| **Infrastructure** | Any Kubernetes cluster | Any (standalone) | AWS only | Google Cloud only |
| **Pipeline Orchestration** | Kubeflow Pipelines (Argo) | MLflow Recipes | SageMaker Pipelines | Vertex AI Pipelines (KFP-based) |
| **Experiment Tracking** | MLMD, Pipelines UI | MLflow Tracking (native) | SageMaker Experiments | Vertex ML Metadata |
| **Model Serving** | KServe | MLflow Models (basic) | SageMaker Endpoints | Vertex AI Prediction |
| **Distributed Training** | Training Operator (native) | Not built-in | SageMaker Training | Vertex AI Training |
| **Hyperparameter Tuning** | Katib | Not built-in | Automatic Model Tuning | Vertex AI Vizier |
| **Notebook Support** | JupyterLab, RStudio, VS Code | Not built-in | SageMaker Studio | Vertex AI Workbench |
| **Multi-Tenancy** | Native (Profiles/Namespaces) | Not built-in | IAM-based | IAM-based |
| **Cloud Lock-in** | None | None | AWS | Google Cloud |
| **Setup Complexity** | High | Low | Low (managed) | Low (managed) |
| **Operational Overhead** | High (self-managed) | Low to Medium | Low (managed) | Low (managed) |
| **Best For** | Kubernetes-native orgs, multi-cloud | Experiment tracking, small teams | AWS-centric enterprises | Google Cloud users |

### Key Differentiators

**Kubeflow vs. MLflow**: MLflow excels at experiment tracking and model registry and is often the easiest tool to adopt for small teams. It does not provide infrastructure for distributed training, model serving, or pipeline orchestration at the same level as Kubeflow. Many organizations use MLflow for experiment tracking alongside Kubeflow for orchestration and serving.

**Kubeflow vs. SageMaker**: Amazon SageMaker is a fully managed service that eliminates the operational burden of managing infrastructure. It provides a tightly integrated experience within the AWS ecosystem. Kubeflow, by contrast, runs on any Kubernetes cluster and avoids vendor lock-in, but requires significant operational expertise to deploy and maintain.

**Kubeflow vs. Vertex AI**: Google's Vertex AI is, in many ways, the managed cloud evolution of Kubeflow. Vertex AI Pipelines is built on top of the Kubeflow Pipelines SDK, meaning pipeline code is largely portable between the two. Organizations that want the Kubeflow programming model without the operational overhead often choose Vertex AI if they are on Google Cloud.

## When should you use Kubeflow?

Kubeflow is best suited for organizations that meet several of the following criteria:

- **Existing Kubernetes expertise**: Kubeflow requires a working Kubernetes cluster and operational knowledge to manage. Teams that already run production workloads on Kubernetes will find the learning curve more manageable.
- **Multi-cloud or hybrid-cloud requirements**: Because Kubeflow runs on any Kubernetes cluster, it is an excellent choice for organizations that need to avoid cloud vendor lock-in or operate across multiple cloud providers.
- **Large-scale distributed training**: The Training Operator provides native support for distributed training across multiple nodes and GPUs, which is essential for training large models.
- **End-to-end ML platform needs**: Kubeflow covers the entire ML lifecycle from experimentation to production serving. Organizations looking for a single platform rather than stitching together multiple tools may benefit from this integrated approach.
- **Regulatory or data sovereignty requirements**: Organizations that need to keep ML workloads on-premises or in specific regions can deploy Kubeflow on their own infrastructure.

### When Not to Use Kubeflow

Kubeflow may not be the right choice in these scenarios:

- **Small teams without Kubernetes experience**: The operational complexity of running Kubeflow is significant. Small teams may be better served by managed services like SageMaker, Vertex AI, or simpler tools like MLflow.
- **Single-cloud environments**: If an organization is fully committed to a single cloud provider, the managed ML platforms from that provider (SageMaker, Vertex AI, Azure ML) typically offer a smoother experience with less operational burden.
- **Quick prototyping**: For rapid experimentation and prototyping, lighter tools such as MLflow or even simple notebook environments may be more appropriate.

## Community and Ecosystem

Kubeflow has a large and active community. At the time of its CNCF acceptance in July 2023, the project reported more than 28,000 GitHub stars on the main repository, over 15,000 total committers, more than 55,000 total GitHub contributions, and over 9,000 members in its Slack workspace. [3] The project had also shipped 15 major releases since its 2017 launch. [3]

The community communicates through the CNCF Slack workspace, the kubeflow-discuss mailing list, and regular community meetings. Notable adopters of Kubeflow include Google, Spotify, Bloomberg, the US Department of Defense, and various financial institutions. The project maintains an ADOPTERS.md file where organizations can publicly declare their use of Kubeflow.

### Commercial Distributions

At the time of its CNCF acceptance, Kubeflow had 10 commercial distributions. [3] Vendor-backed distributions and offerings include:

- **Canonical Charmed Kubeflow**: An enterprise distribution from Canonical (the company behind Ubuntu) that provides lifecycle management through Juju.
- **Google Cloud Kubeflow on GKE**: Google's supported distribution with integration into Google Cloud services.
- **AWS Kubeflow on EKS**: Amazon's open-source distribution with AWS service integrations.
- **Red Hat Open Data Hub**: Includes Kubeflow components integrated with the OpenShift ecosystem.

## Getting Started

The simplest way to try Kubeflow locally is to deploy it on a local Kubernetes cluster using Kind (Kubernetes in Docker) or Minikube. [14] The official Kubeflow manifests repository on GitHub provides configuration files for these environments. For production deployments, the recommended approach is to use one of the cloud-specific distributions or the community manifests tailored to the target Kubernetes platform.

The minimum Kubernetes version required varies by Kubeflow release. Kubeflow requires a cluster with at least 16 GB of RAM and 4 CPUs for the platform components, with additional resources needed for actual ML workloads.

## See Also

- [MLflow](/wiki/mlflow)
- [Kubernetes](/wiki/kubernetes)
- [MLOps](/wiki/mlops)
- [KServe](/wiki/kserve)
- [Argo Workflows](/wiki/argo_workflows)
- [TensorFlow](/wiki/tensorflow)
- [PyTorch](/wiki/pytorch)

## References

1. "Introducing Kubeflow - A Composable, Portable, Scalable ML Stack Built for Kubernetes." Kubernetes Blog, December 21, 2017. https://kubernetes.io/blog/2017/12/introducing-kubeflow-composable/
2. "Announcing Kubeflow 0.1." Kubernetes Blog, May 4, 2018. https://kubernetes.io/blog/2018/05/04/announcing-kubeflow-0.1/
3. "Kubeflow brings MLOps to the CNCF Incubator." CNCF Blog, July 25, 2023. https://www.cncf.io/blog/2023/07/25/kubeflow-brings-mlops-to-the-cncf-incubator/
4. "Kubeflow 1.0 - Cloud Native ML for Everyone." Kubeflow Blog, March 2, 2020. https://blog.kubeflow.org/releases/2020/03/02/kubeflow-1-0-cloud-native-ml-for-everyone.html
5. "Kubeflow 1.10 Release Announcement." Kubeflow Blog, March 2025. https://blog.kubeflow.org/kubeflow-1.10-release/
6. "Kubeflow Advances Cloud Native AI: A Glimpse into KubeCon + CloudNativeCon Europe 2025." CNCF Blog, June 6, 2025. https://www.cncf.io/blog/2025/06/06/kubeflow-advances-cloud-native-ai-a-glimpse-into-kubecon-cloudnativecon-europe-2025/
7. "Kubeflow." CNCF Project Page. https://www.cncf.io/projects/kubeflow/
8. "Overview - Katib." Kubeflow Documentation. https://www.kubeflow.org/docs/components/katib/overview/
9. "KServe becomes a CNCF incubating project." CNCF Blog, November 11, 2025. https://www.cncf.io/blog/2025/11/11/kserve-becomes-a-cncf-incubating-project/
10. "Kubeflow Pipelines." GitHub. https://github.com/kubeflow/pipelines
11. "What's new in Kubeflow Pipelines v2." Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/whats-new-in-kubeflow-pipelines-v2/
12. "Democratizing AI Model Training on Kubernetes: Introducing Kubeflow Trainer V2." Kubeflow Blog. https://blog.kubeflow.org/trainer/intro/
13. "Unified Training Operator release announcement." Kubeflow Blog. https://blog.kubeflow.org/unified-training-operator-1.3-release/
14. "Installing Kubeflow." Kubeflow Documentation. https://www.kubeflow.org/docs/started/installing-kubeflow/
15. "Kubeflow." Wikipedia. https://en.wikipedia.org/wiki/Kubeflow