Kubeflow

Updated Jun 24, 2026

Kubeflow is an open-source MLOps platform that runs the entire machine learning lifecycle on Kubernetes, described by its creators as a project "dedicated to making using ML stacks on Kubernetes easy, fast and extensible." ^[1] Google open-sourced Kubeflow on December 21, 2017, originally as a way to simplify running TensorFlow jobs on Kubernetes, and the project was accepted into the Cloud Native Computing Foundation (CNCF) as an incubating-stage project on July 25, 2023. ^[1]^[3] It is licensed under Apache 2.0 and bundles a suite of components, including Kubeflow Pipelines, Notebooks, the Training Operator, Katib (hyperparameter tuning), and KServe (model serving), under a single Central Dashboard.

The name "Kubeflow" combines "Kube" (from Kubernetes) and "flow" (from TensorFlow), reflecting its origins as a way for Google to open-source how it ran TensorFlow jobs internally. Over the years, the platform has expanded well beyond TensorFlow to support a broad range of ML frameworks, including PyTorch, XGBoost, JAX, MXNet, and others. As of its CNCF acceptance, Kubeflow had more than 28,000 GitHub stars, over 15,000 committers, more than 150 contributing companies, and 10 commercial distributions. ^[3]

What problem does Kubeflow solve?

Kubeflow was created to close the gap between developing ML models on a laptop and deploying them reliably in production. The original 2017 announcement framed the core problem this way: "Building any production-ready machine learning system involves various components, often mixing vendors and hand-rolled solutions. Connecting and managing these services for even moderately sophisticated setups introduces huge barriers of complexity in adopting machine learning." ^[1]

A second motivation was portability. Before Kubeflow, ML stacks were often "so tied to the clusters they have been deployed to that these stacks are immobile," making it "effectively impossible" to move a model from a laptop to a scalable cloud cluster without significant re-architecture. ^[1] By packaging the ML toolchain as Kubernetes-native resources, Kubeflow aimed to let the same workflow run anywhere Kubernetes runs. As the announcement put it: "Because this solution relies on Kubernetes, it runs wherever Kubernetes runs. Just spin up a cluster and go!" ^[1]

History and Development

When was Kubeflow created?

Google engineers David Aronchick, Jeremy Lewi, and Vishnu Kannan unveiled Kubeflow at KubeCon + CloudNativeCon North America in December 2017, and the introductory post on the Kubernetes Blog followed on December 21, 2017. ^[1] The project drew on Google's internal experience running TensorFlow at scale and was created to address a perceived lack of flexible, open-source options for building production-ready machine learning systems on Kubernetes. At the time, organizations frequently struggled to move models from notebooks into production; Kubeflow was designed to bridge that gap with a consistent, Kubernetes-native platform for the entire ML lifecycle. The initial repository shipped with JupyterHub for managing interactive notebooks, a TensorFlow custom resource definition (CRD) that could target CPUs or GPUs, and a TensorFlow Serving container for model serving. ^[1]

Kubeflow 0.1 (2018)

The first official release, Kubeflow 0.1, was announced at KubeCon + CloudNativeCon Europe 2018. ^[2] This release established the foundational architecture and included early versions of several core components: TFJob for distributed TensorFlow training, JupyterHub integration for interactive notebooks, and basic support for model serving. The release attracted significant community interest, and contributions began flowing in from organizations beyond Google, including Cisco, Red Hat, IBM, and others.

Kubeflow 1.0 (March 2020)

Kubeflow 1.0 was released on March 2, 2020, marking a major milestone. ^[4] This release signaled that a core set of Kubeflow applications had graduated to "stable" status and were considered ready for production use. The stable components in version 1.0 included the Central Dashboard, Kubeflow Notebooks, the Profile Controller for multi-tenancy, TensorFlow Serving, and the training operators (TFJob and PyTorchJob). Kubeflow Pipelines and Katib (hyperparameter tuning) shipped in 1.0 as beta components and were matured toward stable status in later releases. ^[4]

Is Kubeflow a CNCF project?

Yes. In October 2022, Google announced that the Kubeflow project had applied to join the Cloud Native Computing Foundation. ^[5] On July 25, 2023, the CNCF Technical Oversight Committee voted to accept Kubeflow as an incubating-stage project. ^[3] This was a significant endorsement of the project's maturity, governance, and community health. At the time of acceptance, the project reported over 150 contributing companies, 10 commercial distributions based on Kubeflow, more than 28,000 GitHub stars, over 15,000 total committers, and 15 major releases since 2017. ^[3] As of 2025, Kubeflow remains at the CNCF incubating maturity level and the community is actively working toward graduation; it has not yet been promoted to the graduated tier. ^[6]^[7]

Ricardo Rocha, the project's CNCF Technical Oversight Committee sponsor, described Kubeflow's role at acceptance: "Kubeflow helps fill a gap by delivering machine learning pipelines and MLOps while working closely with its extensive community and other tools and initiatives." ^[3]

Recent Releases

Version	Release Date	Key Highlights
1.0	March 2020	First stable release; Notebooks, Central Dashboard, TFJob/PyTorchJob stable; Pipelines and Katib beta
1.3	September 2021	Unified Training Operator merging TFJob, PyTorchJob, MPIJob
1.7	March 2023	KServe improvements, Pipelines v2 backend
1.8	October 2023	Enhanced multi-tenancy, Pipelines v2 SDK
1.9	July 2024	Simplified LLM workflows, security improvements
1.10	March 2025	Trainer v2, new Katib API, Spark Operator integration, Model Registry UI

Architecture Overview

Kubeflow is not a single monolithic application. Instead, it is a collection of loosely coupled components, each addressing a different stage of the machine learning lifecycle. These components are deployed as Kubernetes-native resources (Custom Resource Definitions, Pods, Services, and so on) and are managed through the Kubeflow Central Dashboard.

The platform follows a microservices architecture where each component can be installed independently or as part of the full Kubeflow platform. All components share a common authentication layer (typically Istio and Dex for identity management) and a multi-tenancy system based on Kubernetes namespaces.

At a high level, the Kubeflow architecture covers these stages of the ML lifecycle:

Data exploration and model development using Kubeflow Notebooks
Pipeline orchestration using Kubeflow Pipelines
Distributed model training using the Training Operator
Hyperparameter tuning and AutoML using Katib
Model serving and inference using KServe
Model management using the Model Registry

Core Components

Kubeflow Pipelines

Kubeflow Pipelines (KFP) is one of the most widely used components of the platform. It provides a system for building, deploying, and managing multi-step ML workflows as reusable, portable pipelines. Each step in a pipeline runs in its own container, which ensures reproducibility and isolation.

Architecture

Kubeflow Pipelines consists of several sub-components:

Python SDK: Used to define pipeline components and compose them into directed acyclic graphs (DAGs). In the v2 SDK, the @dsl.component and @dsl.pipeline decorators let users author pipelines in pure Python.
Pipeline Compiler: Converts the Python pipeline definition into an intermediate representation (IR) that can be executed by a backend engine.
Workflow Engine: By default, KFP uses Argo Workflows as the underlying execution engine. The KFP backend translates the pipeline IR into Argo Workflow resources, which Argo then runs as Kubernetes Pods.
ML Metadata (MLMD): Tracks execution lineage, artifacts, and metadata throughout pipeline runs. MLMD enables caching, provenance tracking, and lineage visualization.
Pipeline UI: A web-based interface for viewing pipeline runs, comparing experiments, inspecting artifacts, and visualizing the DAG structure.
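
To make the DAG idea concrete, the sketch below orders a hypothetical four-step workflow by its dependencies using only the Python standard library. It is a toy illustration, not the KFP SDK or compiler: the step names and the `execution_order` helper are assumptions made for this example.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a step, each value is the set of steps
# whose outputs that step consumes.
dependencies = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "train": {"preprocess"},
    "evaluate": {"train", "preprocess"},
}


def execution_order(deps):
    """Return one valid launch order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A step becomes eligible to run only after every step it depends on has finished, which is the property a workflow engine relies on when it schedules one container per step.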

Pipelines v2

The v2 release of Kubeflow Pipelines introduced several major improvements. ^[11] The updated intermediate representation makes pipelines executable by backends other than Argo. Input and output artifacts (datasets, models, metrics) are treated as first-class nodes in the DAG visualization. The v2 SDK also allows compiling and running individual components independently, and supports nesting pipelines as components of larger pipelines. The v2 backend maintains backward compatibility with v1 APIs.

Kubeflow Notebooks

Kubeflow Notebooks provides a way to run web-based interactive development environments inside a Kubernetes cluster. Rather than developing on local workstations, data scientists can spin up notebook servers directly within the cluster, gaining immediate access to cluster resources, GPUs, and shared storage.

Supported Environments

Kubeflow Notebooks natively supports three types of development environments:

JupyterLab: The most popular choice for interactive Python development and data exploration.
RStudio: For data scientists working with the R programming language.
Visual Studio Code (code-server): A browser-based version of VS Code for general-purpose development.

Each notebook server runs as a container inside a Kubernetes Pod. Users can select custom container images with pre-installed libraries, request specific resource allocations (CPU, memory, GPUs), and attach persistent volumes for data storage.
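
As a rough sketch of what such a request looks like, the helper below builds a Notebook custom resource as a plain Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The image name, namespace, claim name, and the `notebook_manifest` helper are placeholders invented for this example; the field layout assumes the `kubeflow.org/v1` Notebook resource, which wraps an ordinary Pod template.

```python
import json


def notebook_manifest(name, namespace, image, cpu="1", memory="2Gi",
                      gpus=0, pvc=None):
    """Build a Notebook resource; `pvc` is an existing claim name (assumed)."""
    container = {
        "name": name,
        "image": image,
        "resources": {"requests": {"cpu": cpu, "memory": memory}},
    }
    if gpus:
        # GPUs are requested through the device plugin's extended resource.
        container["resources"]["limits"] = {"nvidia.com/gpu": gpus}
    pod_spec = {"containers": [container]}
    if pvc:
        container["volumeMounts"] = [
            {"name": "workspace", "mountPath": "/home/jovyan"}
        ]
        pod_spec["volumes"] = [
            {"name": "workspace", "persistentVolumeClaim": {"claimName": pvc}}
        ]
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Notebook",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"template": {"spec": pod_spec}},
    }


example_notebook = notebook_manifest(
    "demo", "team-a", "example.registry/jupyter:latest",  # placeholder image
    gpus=1, pvc="demo-workspace",
)
print(json.dumps(example_notebook, indent=2))
```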

Evolution from JupyterHub

In early versions of Kubeflow (v0.4 and earlier), the notebook component relied on JupyterHub for managing notebook servers. Starting with Kubeflow v0.5, the project replaced JupyterHub with a custom Jupyter web application that provides tighter integration with Kubeflow's authentication, multi-tenancy, and namespace management systems.

What is KServe (formerly KFServing)?

KServe is a Kubernetes-native inference platform for deploying and serving machine learning models. Originally developed under the name KFServing, the project was renamed to KServe to reflect its broader scope and independence as a standalone component that can be used with or without the rest of Kubeflow. ^[9] KServe itself became a CNCF incubating project in November 2025. ^[9]

Supported Frameworks

KServe provides out-of-the-box support for serving models trained with a wide variety of frameworks:

Framework	Runtime	Protocol
TensorFlow	TensorFlow Serving	REST/gRPC
PyTorch	TorchServe	REST/gRPC
scikit-learn	MLServer	REST/gRPC
XGBoost	MLServer	REST/gRPC
ONNX	Triton Inference Server	REST/gRPC
LightGBM	MLServer	REST/gRPC
Custom Models	Any container	REST/gRPC

InferenceService Architecture

Each KServe deployment, called an InferenceService, can consist of up to three components:

Predictor: The required component that loads the model and serves predictions. This is the core of the InferenceService.
Transformer: An optional component for pre-processing and post-processing requests and responses. KServe provides built-in transformers for common use cases such as feature retrieval from Feast.
Explainer: An optional component that provides model explanations alongside predictions. Built-in explainers include integrations with Alibi Explain for techniques like SHAP and anchors.

Key Features

Autoscaling and Scale-to-Zero: KServe supports request-based autoscaling, including the ability to scale down to zero replicas when no traffic is being received and scale back up when requests arrive.
Canary Deployments: Supports gradual traffic shifting between model versions. Operators can configure a canaryTrafficPercent to split traffic between a new (canary) version and the stable version.
Inference Graphs: The InferenceGraph resource enables building complex inference pipelines where multiple models work together. It supports four routing node types: Sequence, Switch, Ensemble, and Splitter.
GPU Autoscaling: Automatic scaling of GPU-backed inference endpoints based on request load.
Batching: Automatic request batching to improve throughput for high-volume inference workloads.
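
The following sketch shows how a few of these features might appear on an InferenceService, again as a Python dictionary that can be dumped to JSON for `kubectl`. The storage URI, names, and the `inference_service` helper are hypothetical. The field names (`minReplicas`, `canaryTrafficPercent`, `modelFormat`, `storageUri`) assume the `serving.kserve.io/v1beta1` API, and scale-to-zero is assumed to require KServe's Knative-based serverless mode.

```python
import json


def inference_service(name, namespace, model_format, storage_uri,
                      min_replicas=0, canary_percent=None):
    """Build an InferenceService with a single predictor."""
    predictor = {
        # minReplicas=0 asks for scale-to-zero when no traffic arrives.
        "minReplicas": min_replicas,
        "model": {
            "modelFormat": {"name": model_format},
            "storageUri": storage_uri,
        },
    }
    if canary_percent is not None:
        if not 0 <= canary_percent <= 100:
            raise ValueError("canary_percent must be between 0 and 100")
        # Share of traffic routed to the newest revision of this predictor.
        predictor["canaryTrafficPercent"] = canary_percent
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"predictor": predictor},
    }


example_isvc = inference_service(
    "iris", "team-a", "sklearn",
    "s3://example-bucket/models/iris",  # placeholder location
    canary_percent=10,
)
print(json.dumps(example_isvc, indent=2))
```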

Training Operators

The Kubeflow Training Operator provides Kubernetes custom resources for running distributed training jobs. Originally, Kubeflow maintained separate operators for each framework (TFJob, PyTorchJob, MPIJob, and XGBoostJob). In version 1.3, these were merged into a single unified Training Operator to simplify maintenance and provide a consistent user experience. ^[13]

Supported Job Types

Job Type	Framework	Use Case
TFJob	TensorFlow	Distributed TensorFlow training with parameter servers or workers
PyTorchJob	PyTorch	Distributed PyTorch training using `torch.distributed`
MPIJob	MPI / Horovod	High-performance computing workloads using Message Passing Interface
XGBoostJob	XGBoost	Distributed gradient boosting training
PaddleJob	PaddlePaddle	Distributed PaddlePaddle training

The Training Operator handles the complexity of coordinating workers, parameter servers, and chief nodes for distributed training. It manages Pod creation, networking between workers, and fault tolerance: depending on the job's restart policy, the operator can restart a failed worker without restarting the entire job.
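
A minimal sketch of a PyTorchJob manifest, built in Python, illustrates how the master and worker replicas are declared. The image, command, and the `pytorch_job` helper are placeholders for this example; the layout assumes the `kubeflow.org/v1` PyTorchJob resource, in which the training container is conventionally named `pytorch`.

```python
import json


def pytorch_job(name, namespace, image, command, workers=2):
    """Build a PyTorchJob with one master and `workers` worker replicas."""

    def replica(count):
        return {
            "replicas": count,
            # Restart a failed Pod in place rather than failing the whole job.
            "restartPolicy": "OnFailure",
            "template": {"spec": {"containers": [
                {"name": "pytorch", "image": image, "command": list(command)}
            ]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"pytorchReplicaSpecs": {
            "Master": replica(1),
            "Worker": replica(workers),
        }},
    }


example_job = pytorch_job(
    "mnist-ddp", "team-a",
    "example.registry/train:latest",  # placeholder image
    ["python", "train.py"], workers=3,
)
print(json.dumps(example_job, indent=2))
```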

Trainer v2

Introduced in Kubeflow 1.10, Trainer v2 represents a significant redesign of the training component, building on more than seven years of experience running ML workloads on Kubernetes. ^[12] It replaces the multiple framework-specific CRDs with a unified TrainJob API and simplifies the process of fine-tuning large language models on Kubernetes by providing high-level abstractions. The first release supports TorchTune as a built-in LLM trainer with pre-configured runtimes for models like Llama-3.2-1B-Instruct and Llama-3.2-3B-Instruct. ^[12] Trainer v2 also provides dedicated initializers for datasets and models, reducing the amount of boilerplate configuration needed.
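
For comparison, a TrainJob can be sketched the same way. This example assumes the `trainer.kubeflow.org/v1alpha1` API with a `runtimeRef` pointing at a pre-installed runtime and a `trainer` section carrying the node count; those field names and the `torch-distributed` runtime name are my reading of the alpha API and should be checked against the installed version. The image, command, and `train_job` helper are placeholders.

```python
import json


def train_job(name, namespace, runtime, image, command, num_nodes=2):
    """Build a TrainJob that delegates framework setup to a named runtime."""
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",  # assumed alpha version
        "kind": "TrainJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The runtime holds the framework-specific template, so the job
            # itself stays framework-agnostic.
            "runtimeRef": {"name": runtime},
            "trainer": {
                "numNodes": num_nodes,
                "image": image,
                "command": list(command),
            },
        },
    }


example_train_job = train_job(
    "finetune-demo", "team-a", "torch-distributed",
    "example.registry/train:latest",  # placeholder image
    ["python", "train.py"],
)
print(json.dumps(example_train_job, indent=2))
```

Compared with a PyTorchJob, there are no per-role replica specs here: one resource kind covers every framework, and the runtime decides how the nodes are wired together.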

What is Katib used for?

Katib is Kubeflow's Kubernetes-native component for automated machine learning (AutoML). ^[8] It supports both hyperparameter tuning and neural architecture search (NAS), allowing users to define search spaces and optimization objectives while Katib handles the rest. The name comes from the Arabic word for "secretary" or "scribe."

Supported Algorithms

Katib supports a rich set of optimization algorithms, many of which are provided through integrations with frameworks like Hyperopt and Optuna:

Algorithm	Type	Description
Random Search	Basic	Samples hyperparameters randomly from the search space
Grid Search	Basic	Exhaustively evaluates all combinations in a discretized search space
Bayesian Optimization	Advanced	Uses Gaussian processes to model the objective function
Tree of Parzen Estimators (TPE)	Advanced	Models good and bad hyperparameter distributions separately
Multivariate TPE	Advanced	Extension of TPE that considers parameter dependencies
Hyperband	Advanced	Early-stopping-based method that allocates resources efficiently
CMA-ES	Advanced	Covariance Matrix Adaptation Evolution Strategy for continuous optimization
ENAS	NAS	Efficient Neural Architecture Search using parameter sharing
DARTS	NAS	Differentiable Architecture Search using continuous relaxation
PBT	Advanced	Population Based Training that jointly optimizes hyperparameters and training
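
The two basic strategies in the table can be sketched in a few lines of plain Python. This is a toy illustration and not Katib code: the `objective` function is an invented stand-in for a validation metric, and the search space is hypothetical.

```python
import itertools
import random


def objective(lr, batch_size):
    """Toy score that peaks at lr=0.01 and batch_size=64."""
    return -((lr - 0.01) ** 2) * 1e4 - ((batch_size - 64) / 64) ** 2


def grid_search(lrs, batch_sizes):
    """Evaluate every combination and return (params, score) of the best."""
    trials = [((lr, bs), objective(lr, bs))
              for lr, bs in itertools.product(lrs, batch_sizes)]
    return max(trials, key=lambda trial: trial[1])


def random_search(lr_range, batch_sizes, n_trials, seed=0):
    """Sample n_trials random points and return (params, score) of the best."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = rng.uniform(*lr_range)
        bs = rng.choice(batch_sizes)
        trials.append(((lr, bs), objective(lr, bs)))
    return max(trials, key=lambda trial: trial[1])


grid_best = grid_search([0.001, 0.01, 0.1], [32, 64, 128])
random_best = random_search((0.001, 0.1), [32, 64, 128], n_trials=20)
print("grid:", grid_best)
print("random:", random_best)
```

Grid search costs one trial per combination, so its budget grows multiplicatively with each added parameter, while random search keeps a fixed trial budget regardless of how many parameters are searched.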

Integration with Training Operator

Katib integrates directly with the Training Operator, allowing it to launch PyTorchJob, TFJob, or other training jobs as trial runs during hyperparameter optimization. This integration means that hyperparameter tuning can be performed on distributed training jobs, not just single-node experiments.
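
A sketch of such an Experiment, with a PyTorchJob as the trial, might look like the following. The metric name, image, training script, and the `katib_experiment` helper are hypothetical; the structure assumes the `kubeflow.org/v1beta1` Experiment API. A real Experiment for distributed jobs may need additional fields, for example to tell Katib which Pod's metrics to collect.

```python
import json


def katib_experiment(name, namespace, image, max_trials=12, parallel_trials=3):
    """Build a random-search Experiment whose trials are PyTorchJobs."""
    trial_spec = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "spec": {"pytorchReplicaSpecs": {
            role: {
                "replicas": count,
                "restartPolicy": "OnFailure",
                "template": {"spec": {"containers": [{
                    "name": "pytorch",
                    "image": image,
                    # Katib substitutes the placeholder for each trial.
                    "command": ["python", "train.py",
                                "--lr=${trialParameters.learningRate}"],
                }]}},
            }
            for role, count in (("Master", 1), ("Worker", 2))
        }},
    }
    return {
        "apiVersion": "kubeflow.org/v1beta1",
        "kind": "Experiment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "objective": {"type": "maximize",
                          "objectiveMetricName": "accuracy"},
            "algorithm": {"algorithmName": "random"},
            "maxTrialCount": max_trials,
            "parallelTrialCount": parallel_trials,
            "maxFailedTrialCount": 3,
            "parameters": [{
                "name": "lr",
                "parameterType": "double",
                "feasibleSpace": {"min": "0.001", "max": "0.1"},
            }],
            "trialTemplate": {
                "primaryContainerName": "pytorch",
                "trialParameters": [{
                    "name": "learningRate",
                    "description": "Learning rate passed to the script",
                    "reference": "lr",
                }],
                "trialSpec": trial_spec,
            },
        },
    }


example_experiment = katib_experiment(
    "lr-search", "team-a", "example.registry/train:latest"  # placeholder image
)
print(json.dumps(example_experiment, indent=2))
```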

In Kubeflow 1.10, Katib introduced a new high-level Python API that integrates Katib and the Training Operator to automate hyperparameter optimization for LLM fine-tuning workflows, reducing the amount of manual configuration required. ^[5]

Central Dashboard

The Kubeflow Central Dashboard serves as the unified web interface for the entire platform. It provides a single entry point for accessing all Kubeflow components, including Notebooks, Pipelines, training jobs, Katib experiments, and model serving endpoints.

The Dashboard also manages Kubeflow's multi-tenancy system. Each user has a profile, which maps to a Kubernetes namespace. Users can only see and interact with resources in namespaces they have access to. Profile owners can add contributors with either view (read-only) or modify (read and write) access.

Model Registry

The Model Registry is a newer Kubeflow component that provides a centralized catalog for ML models, versions, and their associated metadata. It allows teams to track which models have been trained, their lineage, performance metrics, and serving status. In Kubeflow 1.10, the Model Registry received a new web UI and deeper integration with KServe, enabling a smoother path from model registration to deployment. ^[5]

Spark Operator

The Spark Operator, integrated as a core Kubeflow component in version 1.10, enables running Apache Spark jobs on Kubernetes. ^[5] It was rebuilt with Controller Runtime for improved architecture and includes YuniKorn gang scheduling support for efficient group scheduling of driver and executor pods. This component is particularly useful for data preprocessing and feature engineering stages of ML workflows.
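
As an illustration of a preprocessing job, the sketch below builds a SparkApplication manifest in Python. It assumes the `sparkoperator.k8s.io/v1beta2` API and that `batchScheduler: yunikorn` is how gang scheduling is requested; the image, script path, Spark version, service account name, and the `spark_application` helper are placeholders.

```python
import json


def spark_application(name, namespace, image, main_file, spark_version,
                      executors=2):
    """Build a PySpark SparkApplication with one driver and N executors."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            # Ask the batch scheduler to place driver and executors together.
            "batchScheduler": "yunikorn",
            "driver": {"cores": 1, "memory": "1g",
                       "serviceAccount": "spark"},  # hypothetical account
            "executor": {"instances": executors, "cores": 1, "memory": "2g"},
        },
    }


example_spark_app = spark_application(
    "feature-prep", "team-a",
    "example.registry/spark-py:latest",       # placeholder image
    "local:///opt/app/prepare_features.py",   # placeholder script path
    "3.5.0", executors=4,
)
print(json.dumps(example_spark_app, indent=2))
```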

Deployment on Cloud Platforms

Kubeflow is designed to be cloud-agnostic and can be deployed on any conformant Kubernetes cluster. However, the major cloud providers offer specific distributions and documentation for running Kubeflow on their managed Kubernetes services.

Google Kubernetes Engine (GKE)

Google Cloud provides first-party support for Kubeflow on GKE. The deployment leverages Google Cloud's IAP (Identity-Aware Proxy) for authentication and integrates with Google Cloud Storage for artifact storage. Google also offers Vertex AI Pipelines, which is a managed service based on Kubeflow Pipelines, providing a fully managed alternative for organizations that prefer not to manage the infrastructure themselves.

Amazon Elastic Kubernetes Service (EKS)

Kubeflow on AWS is an open-source distribution maintained by AWS Labs. It integrates Kubeflow with AWS-native services such as Amazon S3 for artifact storage, Amazon RDS for metadata storage, and Amazon Cognito for authentication. Canonical also offers Charmed Kubeflow, which can be deployed on EKS.

Azure Kubernetes Service (AKS)

Kubeflow on AKS provides a straightforward deployment path for organizations running ML workloads on Microsoft's cloud. The deployment integrates with Azure Active Directory for identity management and Azure Blob Storage for artifacts.

On-Premises and Other Platforms

Kubeflow can also be deployed on bare-metal Kubernetes clusters, OpenShift, Rancher, and local development environments using Kind or Minikube. The Kubeflow manifests repository provides community-maintained configurations for a variety of Kubernetes distributions.

Multi-Tenancy and Security

Kubeflow provides a robust multi-tenancy model built on top of Kubernetes namespaces. The Profile custom resource wraps a Kubernetes namespace and associates it with an owner. Access control is enforced through Kubernetes RBAC, and the Kubeflow Central Dashboard ensures that users can only see resources within namespaces they are authorized to access.
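
The Profile resource itself is small. The sketch below builds one in Python, assuming the `kubeflow.org/v1` Profile API with an `owner` subject and an optional `resourceQuotaSpec`; the profile name, email address, quota values, and the `profile` helper are invented for this example.

```python
import json


def profile(name, owner_email, cpu=None, memory=None):
    """Build a cluster-scoped Profile; its name becomes the namespace name."""
    spec = {"owner": {"kind": "User", "name": owner_email}}
    hard = {}
    if cpu:
        hard["cpu"] = cpu
    if memory:
        hard["memory"] = memory
    if hard:
        # Optional quota applied to the namespace the Profile creates.
        spec["resourceQuotaSpec"] = {"hard": hard}
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": name},
        "spec": spec,
    }


example_profile = profile("team-a", "alice@example.com", cpu="8", memory="32Gi")
print(json.dumps(example_profile, indent=2))
```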

Authentication is typically handled through Istio's ingress gateway combined with Dex, an OpenID Connect identity provider. This setup supports integration with external identity providers such as LDAP, SAML, and OAuth2 providers like Google, GitHub, and Microsoft.

Kubeflow Pipelines enforces namespace-level isolation, meaning that pipeline runs, experiments, and artifacts in one namespace are not visible to users in other namespaces unless access has been explicitly granted.

How does Kubeflow compare with other ML platforms?

Kubeflow occupies a specific niche in the MLOps ecosystem. The following table compares it with other popular platforms:

Feature	Kubeflow	MLflow	Amazon SageMaker	Vertex AI
Type	Open-source platform	Open-source toolkit	Managed cloud service	Managed cloud service
License	Apache 2.0	Apache 2.0	Proprietary	Proprietary
Infrastructure	Any Kubernetes cluster	Any (standalone)	AWS only	Google Cloud only
Pipeline Orchestration	Kubeflow Pipelines (Argo)	MLflow Recipes	SageMaker Pipelines	Vertex AI Pipelines (KFP-based)
Experiment Tracking	MLMD, Pipelines UI	MLflow Tracking (native)	SageMaker Experiments	Vertex ML Metadata
Model Serving	KServe	MLflow Models (basic)	SageMaker Endpoints	Vertex AI Prediction
Distributed Training	Training Operator (native)	Not built-in	SageMaker Training	Vertex AI Training
Hyperparameter Tuning	Katib	Not built-in	Automatic Model Tuning	Vertex AI Vizier
Notebook Support	JupyterLab, RStudio, VS Code	Not built-in	SageMaker Studio	Vertex AI Workbench
Multi-Tenancy	Native (Profiles/Namespaces)	Not built-in	IAM-based	IAM-based
Cloud Lock-in	None	None	AWS	Google Cloud
Setup Complexity	High	Low	Low (managed)	Low (managed)
Operational Overhead	High (self-managed)	Low to Medium	Low (managed)	Low (managed)
Best For	Kubernetes-native orgs, multi-cloud	Experiment tracking, small teams	AWS-centric enterprises	Google Cloud users

Key Differentiators

Kubeflow vs. MLflow: MLflow excels at experiment tracking and model registry and is often the easiest tool to adopt for small teams. It does not provide infrastructure for distributed training, model serving, or pipeline orchestration at the same level as Kubeflow. Many organizations use MLflow for experiment tracking alongside Kubeflow for orchestration and serving.

Kubeflow vs. SageMaker: Amazon SageMaker is a fully managed service that eliminates the operational burden of managing infrastructure. It provides a tightly integrated experience within the AWS ecosystem. Kubeflow, by contrast, runs on any Kubernetes cluster and avoids vendor lock-in, but requires significant operational expertise to deploy and maintain.

Kubeflow vs. Vertex AI: Google's Vertex AI is, in many ways, the managed cloud evolution of Kubeflow. Vertex AI Pipelines is built on top of the Kubeflow Pipelines SDK, meaning pipeline code is largely portable between the two. Organizations that want the Kubeflow programming model without the operational overhead often choose Vertex AI if they are on Google Cloud.

When should you use Kubeflow?

Kubeflow is best suited for organizations that meet several of the following criteria:

Existing Kubernetes expertise: Kubeflow requires a working Kubernetes cluster and operational knowledge to manage. Teams that already run production workloads on Kubernetes will find the learning curve more manageable.
Multi-cloud or hybrid-cloud requirements: Because Kubeflow runs on any Kubernetes cluster, it is an excellent choice for organizations that need to avoid cloud vendor lock-in or operate across multiple cloud providers.
Large-scale distributed training: The Training Operator provides native support for distributed training across multiple nodes and GPUs, which is essential for training large models.
End-to-end ML platform needs: Kubeflow covers the entire ML lifecycle from experimentation to production serving. Organizations looking for a single platform rather than stitching together multiple tools may benefit from this integrated approach.
Regulatory or data sovereignty requirements: Organizations that need to keep ML workloads on-premises or in specific regions can deploy Kubeflow on their own infrastructure.

When Not to Use Kubeflow

Kubeflow may not be the right choice in these scenarios:

Small teams without Kubernetes experience: The operational complexity of running Kubeflow is significant. Small teams may be better served by managed services like SageMaker, Vertex AI, or simpler tools like MLflow.
Single-cloud environments: If an organization is fully committed to a single cloud provider, the managed ML platforms from that provider (SageMaker, Vertex AI, Azure ML) typically offer a smoother experience with less operational burden.
Quick prototyping: For rapid experimentation and prototyping, lighter tools such as MLflow or even simple notebook environments may be more appropriate.

Community and Ecosystem

Kubeflow has a large and active community. At the time of its CNCF acceptance in July 2023, the project reported more than 28,000 GitHub stars on the main repository, over 15,000 total committers, more than 55,000 total GitHub contributions, and over 9,000 members in its Slack workspace. ^[3] The project had also shipped 15 major releases since its 2017 launch. ^[3]

The community communicates through the CNCF Slack workspace, the kubeflow-discuss mailing list, and regular community meetings. Notable adopters of Kubeflow include Google, Spotify, Bloomberg, the US Department of Defense, and various financial institutions. The project maintains an ADOPTERS.md file where organizations can publicly declare their use of Kubeflow.

Commercial Distributions

At the time of its CNCF acceptance, Kubeflow had 10 commercial distributions. ^[3] Vendor-maintained distributions and supported offerings include:

Canonical Charmed Kubeflow: An enterprise distribution from Canonical (the company behind Ubuntu) that provides lifecycle management through Juju.
Google Cloud Kubeflow on GKE: Google's supported distribution with integration into Google Cloud services.
AWS Kubeflow on EKS: Amazon's open-source distribution with AWS service integrations.
Red Hat Open Data Hub: Includes Kubeflow components integrated with the OpenShift ecosystem.

Getting Started

The simplest way to try Kubeflow locally is to deploy it on a local Kubernetes cluster using Kind (Kubernetes in Docker) or Minikube. ^[14] The official Kubeflow manifests repository on GitHub provides configuration files for these environments. For production deployments, the recommended approach is to use one of the cloud-specific distributions or the community manifests tailored to the target Kubernetes platform.

The minimum Kubernetes version required varies by Kubeflow release. Kubeflow requires a cluster with at least 16 GB of RAM and 4 CPUs for the platform components, with additional resources needed for actual ML workloads.

References

1. "Introducing Kubeflow - A Composable, Portable, Scalable ML Stack Built for Kubernetes." Kubernetes Blog, December 21, 2017. https://kubernetes.io/blog/2017/12/introducing-kubeflow-composable/
2. "Announcing Kubeflow 0.1." Kubernetes Blog, May 4, 2018. https://kubernetes.io/blog/2018/05/04/announcing-kubeflow-0.1/
3. "Kubeflow brings MLOps to the CNCF Incubator." CNCF Blog, July 25, 2023. https://www.cncf.io/blog/2023/07/25/kubeflow-brings-mlops-to-the-cncf-incubator/
4. "Kubeflow 1.0 - Cloud Native ML for Everyone." Kubeflow Blog, March 2, 2020. https://blog.kubeflow.org/releases/2020/03/02/kubeflow-1-0-cloud-native-ml-for-everyone.html
5. "Kubeflow 1.10 Release Announcement." Kubeflow Blog, March 2025. https://blog.kubeflow.org/kubeflow-1.10-release/
6. "Kubeflow Advances Cloud Native AI: A Glimpse into KubeCon + CloudNativeCon Europe 2025." CNCF Blog, June 6, 2025. https://www.cncf.io/blog/2025/06/06/kubeflow-advances-cloud-native-ai-a-glimpse-into-kubecon-cloudnativecon-europe-2025/
7. "Kubeflow." CNCF Project Page. https://www.cncf.io/projects/kubeflow/
8. "Overview - Katib." Kubeflow Documentation. https://www.kubeflow.org/docs/components/katib/overview/
9. "KServe becomes a CNCF incubating project." CNCF Blog, November 11, 2025. https://www.cncf.io/blog/2025/11/11/kserve-becomes-a-cncf-incubating-project/
10. "Kubeflow Pipelines." GitHub. https://github.com/kubeflow/pipelines
11. "What's new in Kubeflow Pipelines v2." Google Cloud Blog. https://cloud.google.com/blog/products/ai-machine-learning/whats-new-in-kubeflow-pipelines-v2/
12. "Democratizing AI Model Training on Kubernetes: Introducing Kubeflow Trainer V2." Kubeflow Blog. https://blog.kubeflow.org/trainer/intro/
13. "Unified Training Operator release announcement." Kubeflow Blog. https://blog.kubeflow.org/unified-training-operator-1.3-release/
14. "Installing Kubeflow." Kubeflow Documentation. https://www.kubeflow.org/docs/started/installing-kubeflow/
15. "Kubeflow." Wikipedia. https://en.wikipedia.org/wiki/Kubeflow
