kubeflow

Open Source

# Kubeflow

[![Join Kubeflow Slack](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
[![CLOMonitor](https://img.shields.io/endpoint?url=https://clomonitor.io/api/projects/cncf/kubeflow/badge)](https://clomonitor.io/projects/cncf/kubeflow)

<img src="./logo/stacked.svg" width="120">

## What is Kubeflow

[Kubeflow](https://www.kubeflow.org/) is the foundation of tools for AI Platforms on Kubernetes.

AI platform teams can build on top of Kubeflow by using each project independently or deploying the entire AI reference platform to meet their specific needs. The Kubeflow AI reference platform is composable, modular, portable, and scalable, backed by an ecosystem of Kubernetes-native projects that cover every stage of the [AI lifecycle](https://www.kubeflow.org/docs/started/architecture/#kubeflow-projects-in-the-ai-lifecycle).

Whether you’re an AI practitioner, a platform administrator, or a team of developers, Kubeflow offers modular, scalable, and extensible tools to support your AI use cases.

Please refer to [the official documentation](https://www.kubeflow.org/docs/) for more information.

## What are Kubeflow Projects

Kubeflow is composed of multiple open source projects that address different aspects of the AI lifecycle. These projects are designed to be usable both independently and as part of the Kubeflow AI reference platform. This provides flexibility for users who may not need the full end-to-end AI platform capabilities but want to leverage specific functionalities, such as model training or model serving.

| Kubeflow Project | Source Code |
| --- | --- |
| [KServe](https://www.kubeflow.org/docs/external-add-ons/kserve/) | [`kserve/kserve`](https://github.com/kserve/kserve) |
| [Kubeflow Katib](https://www.kubeflow.org/docs/components/katib/) | [`kubeflow/katib`](https://github.com/kubeflow/katib) |
| [Kubeflow Model Registry](https://www.kubeflow.org/docs/components/model-registry/) | [`kubeflow/model-registry`](https://github.com/kubeflow/model-registry) |
| [Kubeflow Notebooks](https://www.kubeflow.org/docs/components/notebooks/) | [`kubeflow/notebooks`](https://github.com/kubeflow/notebooks) |
| [Kubeflow Pipelines](https://www.kubeflow.org/docs/components/pipelines/) | [`kubeflow/pipelines`](https://github.com/kubeflow/pipelines) |
| [Kubeflow SDK](https://github.com/kubeflow/sdk) | [`kubeflow/sdk`](https://github.com/kubeflow/sdk) |
| [Kubeflow Spark Operator](https://www.kubeflow.org/docs/components/spark-operator/) | [`kubeflow/spark-operator`](https://github.com/kubeflow/spark-operator) |
| [Kubeflow Trainer](https://www.kubeflow.org/docs/components/trainer/) | [`kubeflow/trainer`](https://github.com/kubeflow/trainer) |

## What is the Kubeflow AI Reference Platform

The Kubeflow AI reference platform refers to the full suite of Kubeflow projects bundled together with additional integration and management tools. The Kubeflow AI reference platform deploys a comprehensive toolkit for the entire AI lifecycle. It can be installed via [Packaged Distributions](https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions) or [Kubeflow Manifests](https://www.kubeflow.org/docs/started/installing-kubeflow/#kubeflow-manifests).

| Kubeflow AI Reference Platform Tool | Source Code |
| --- | --- |
| [Central Dashboard](https://www.kubeflow.org/docs/components/central-dash/) | [`kubeflow/dashboard`](https://github.com/kubeflow/dashboard) |
| [Profile Controller](https://www.kubeflow.org/docs/components/central-dash/profiles/) | [`kubeflow/dashboard`](https://github.com/kubeflow/dashboard) |
| [Kubeflow Manifests](https://www.kubeflow.org/docs/started/installing-kubeflow/#kubeflow-manifests) | [`kubeflow/manifests`](https://github.com/kubeflow/manifests) |

## Kubeflow Community

Kubeflow is a community-led project maintained by the [Kubeflow Working Groups](https://www.kubeflow.org/docs/about/governance/#4-working-groups) under the guidance of the [Kubeflow Steering Committee](https://www.kubeflow.org/docs/about/governance/#2-kubeflow-steering-committee-ksc).

We encourage you to learn about the [Kubeflow Community](https://www.kubeflow.org/docs/about/community/) and how to [contribute](https://www.kubeflow.org/docs/about/contributing/) to the project!

DevOps & Infrastructure ML Frameworks

15.7K GitHub Stars

Open Source

pipelines

Podcast Analytics Data collected from Spotify and stored in Open Podcast API

DevOps & Infrastructure ML Frameworks Event Tracking & CDP

14 GitHub Stars

Open Source

trainer

# Kubeflow Trainer

[![Join Slack](https://img.shields.io/badge/Join_Slack-blue?logo=slack)](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
[![Coverage Status](https://coveralls.io/repos/github/kubeflow/trainer/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/trainer?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/trainer)](https://goreportcard.com/report/github.com/kubeflow/trainer)
[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/10435/badge)](https://www.bestpractices.dev/projects/10435)
[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/kubeflow/trainer)
[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fkubeflow%2Ftrainer.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2Fkubeflow%2Ftrainer?ref=badge_shield)

<h1 align="center">
  <img src="./docs/images/trainer-logo.svg" alt="logo" width="200">
  <br>
</h1>

## Latest News 🔥

- [2026/03] Kubeflow Trainer v2.2 is officially released with support for JAX and XGBoost Training Runtimes, enhanced observability with metrics propagation to TrainJob status, and Flux Framework integration for HPC and MPI workloads. Check out [the blog post announcement](https://blog.kubeflow.org/kubeflow-trainer-v2.2-release/).
- [2025/11] Kubeflow Trainer v2.1 is officially released with support for [Distributed Data Cache](https://www.kubeflow.org/docs/components/trainer/user-guides/data-cache/), topology-aware scheduling with Kueue and Volcano, and LLM post-training enhancements. Check out [the GitHub release notes](https://github.com/kubeflow/trainer/releases/tag/v2.1.0).
- [2025/09] Kubeflow SDK v0.1 is officially released with support for CustomTrainer, BuiltinTrainer, and local PyTorch execution. Check out [the GitHub release notes](https://github.com/kubeflow/sdk/releases/tag/0.1.0).
- [2025/07] PyTorch on Kubernetes: Kubeflow Trainer Joins the PyTorch Ecosystem. Find the announcement in [the PyTorch blog post](https://pytorch.org/blog/pytorch-on-kubernetes-kubeflow-trainer-joins-the-pytorch-ecosystem/).

<details>
<summary>More</summary>

- [2025/07] Kubeflow Trainer v2.0 has been officially released. Check out [the blog post announcement](https://blog.kubeflow.org/trainer/intro/) and [the release notes](https://github.com/kubeflow/trainer/releases/tag/v2.0.0).
- [2025/04] From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob. See the [KubeCon + CloudNativeCon London talk](https://youtu.be/Fnb1a5Kaxgo).

</details>

## Overview

Kubeflow Trainer is a Kubernetes-native distributed AI platform for scalable large language model (LLM) fine-tuning and training of AI models across a wide range of frameworks, including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and more.

Kubeflow Trainer brings MPI to Kubernetes, orchestrating multi-node, multi-GPU distributed jobs efficiently across high-performance computing (HPC) clusters. This enables high-throughput communication between processes, making it ideal for large-scale AI training that requires ultra-fast synchronization between GPU nodes.

Kubeflow Trainer integrates with the Cloud Native AI ecosystem, including [Kueue](https://kueue.sigs.k8s.io/docs/tasks/run/trainjobs/) for topology-aware scheduling and multi-cluster job dispatching, as well as [JobSet](https://github.com/kubernetes-sigs/jobset) and [LeaderWorkerSet](https://github.com/kubernetes-sigs/lws) for AI workload orchestration.

Kubeflow Trainer provides a distributed data cache designed to stream large-scale data with zero-copy transfer directly to GPU nodes. This keeps training jobs memory-efficient while maximizing GPU utilization.

With [the Kubeflow Python SDK](https://github.com/kubeflow/sdk), AI practitioners can develop and fine-tune LLMs while leveraging the Kubeflow Trainer APIs: TrainJob and Runtimes.
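
As a rough sketch of how the SDK and the TrainJob and Runtime APIs might fit together, the snippet below defines a self-contained training function and a helper that submits it as a TrainJob. `CustomTrainer` is named in the release notes above. `TrainerClient`, its `train` and `get_runtime` methods, and the `torch-distributed` runtime name follow the SDK's published examples as understood here, so treat them as assumptions and check them against the SDK version you install. `train_fn`, `submit`, and the strided sharding scheme are hypothetical, as is the assumption that workers are launched with torchrun-style `RANK` and `WORLD_SIZE` environment variables.

```python
def train_fn(num_samples: int = 10):
    """Hypothetical training entry point that every worker would execute."""
    # Imports stay inside the function so that it is self-contained when
    # its source is shipped to the training workers.
    import os

    # Assumption: the runtime exports torchrun-style variables. The
    # defaults let the function also run as a single local process.
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))

    # Strided sharding: worker r handles samples r, r + world_size, ...
    my_samples = list(range(rank, num_samples, world_size))
    print(f"worker {rank} of {world_size} handles samples {my_samples}")
    return my_samples


def submit(num_nodes: int = 2) -> str:
    """Submit train_fn as a TrainJob (needs the SDK and a cluster)."""
    # Imported lazily so this module loads without the SDK installed.
    # Assumption: these names match the installed Kubeflow SDK.
    from kubeflow.trainer import CustomTrainer, TrainerClient

    client = TrainerClient()
    return client.train(
        # Assumption: a runtime with this name is installed on the cluster.
        runtime=client.get_runtime("torch-distributed"),
        trainer=CustomTrainer(func=train_fn, num_nodes=num_nodes),
    )
```

Calling `train_fn()` on its own runs locally as a single worker. Calling `submit(num_nodes=2)` requires the SDK to be installed and access to a cluster where Kubeflow Trainer is running.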
<h1 align="center">
  <img src="./docs/images/trainer-tech-stack.drawio.svg" alt="logo" width="500">
  <br>
</h1>

## Kubeflow Trainer Introduction

Check out the following KubeCon + CloudNativeCon talks to learn about Kubeflow Trainer capabilities:

[![Kubeflow Trainer](https://img.youtube.com/vi/Lgy4ir1AhYw/0.jpg)](https://www.youtube.com/watch?v=Lgy4ir1AhYw)

Additional talks:

- [From High Performance Computing To AI Workloads on Kubernetes: MPI Runtime in Kubeflow TrainJob](https://youtu.be/Fnb1a5Kaxgo)
- [Streamline LLM Fine-tuning on Kubernetes With Kubeflow LLM Trainer](https://youtu.be/O7cNlaz3Hqs)

## Getting Started

Please check [the official Kubeflow Trainer documentation](https://www.kubeflow.org/docs/components/trainer/getting-started) to install and get started with Kubeflow Trainer.

## Community

The following links provide information on how to get involved in the community:

- Join our [`#kubeflow-trainer` Slack channel](https://www.kubeflow.org/docs/about/community/#kubeflow-slack).
- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV) community meeting.
- Check out [who is using Kubeflow Trainer](ADOPTERS.md).

## Contributing

Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md).

## Changelog

Please refer to the [CHANGELOG](CHANGELOG.md).

## Kubeflow Training Operator V1

The Kubeflow Trainer project is currently in <strong>alpha</strong> status, and APIs may change. If you are using Kubeflow Training Operator V1, please refer to [this migration document](https://www.kubeflow.org/docs/components/trainer/operator-guides/migration/).

The Kubeflow community will maintain the Training Operator V1 source code at [the `release-1.9` branch](https://github.com/kubeflow/trainer/tree/release-1.9).

You can find the documentation for Kubeflow Training Operator V1 in [these guides](https://www.kubeflow.org/docs/components/trainer/legacy-v1).

## Acknowledgement

This project started as a distributed training operator for TensorFlow. We later merged efforts from other Kubeflow Training Operators to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We'd also like to thank everyone who has contributed to and maintained the original operators.

- PyTorch Operator: [list of contributors](https://github.com/kubeflow/pytorch-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/pytorch-operator/blob/master/OWNERS).
- MPI Operator: [list of contributors](https://github.com/kubeflow/mpi-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS).
- XGBoost Operator: [list of contributors](https://github.com/kubeflow/xgboost-operator/graphs/contributors) and [maintainers](https://github.com/kubeflow/xgboost-operator/blob/master/OWNERS).
- Common library: [list of contributors](https://github.com/kubeflow/common/graphs/contributors) and [maintainers](https://github.com/kubeflow/common/blob/master/OWNERS).

ML Frameworks Container Management

2.1K GitHub Stars

Open Source

katib

# Kubeflow Katib

[![Build Status](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml/badge.svg?branch=master)](https://github.com/kubeflow/katib/actions/workflows/test-go.yaml?branch=master)
[![Coverage Status](https://coveralls.io/repos/github/kubeflow/katib/badge.svg?branch=master)](https://coveralls.io/github/kubeflow/katib?branch=master)
[![Go Report Card](https://goreportcard.com/badge/github.com/kubeflow/katib)](https://goreportcard.com/report/github.com/kubeflow/katib)
[![Releases](https://img.shields.io/github/release-pre/kubeflow/katib.svg?sort=semver)](https://github.com/kubeflow/katib/releases)
[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels)
[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/9941/badge)](https://www.bestpractices.dev/projects/9941)
[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fkubeflow%2Fkatib.svg?type=shield)](https://app.fossa.com/projects/git%2Bgithub.com%2Fkubeflow%2Fkatib?ref=badge_shield)

<h1 align="center">
  <img src="./docs/images/logo-title.png" alt="logo" width="200">
  <br>
</h1>

Kubeflow Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports [Hyperparameter Tuning](https://en.wikipedia.org/wiki/Hyperparameter_optimization), [Early Stopping](https://en.wikipedia.org/wiki/Early_stopping) and [Neural Architecture Search](https://en.wikipedia.org/wiki/Neural_architecture_search).

Katib is agnostic to machine learning (ML) frameworks. It can tune hyperparameters of applications written in any language of the users’ choice and natively supports many ML frameworks, such as [TensorFlow](https://www.tensorflow.org/), [PyTorch](https://pytorch.org/), [XGBoost](https://xgboost.readthedocs.io/en/latest/), and others.

Katib can perform training jobs using any Kubernetes [Custom Resources](https://www.kubeflow.org/docs/components/katib/trial-template/) with out-of-the-box support for [Kubeflow Training Operator](https://github.com/kubeflow/training-operator), [Argo Workflows](https://github.com/argoproj/argo-workflows), [Tekton Pipelines](https://github.com/tektoncd/pipeline) and many more.

Katib stands for `secretary` in Arabic.

## Search Algorithms

Katib supports several search algorithms. See the [Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#hp-tuning-algorithms) to learn more about each algorithm, and check [this guide](https://www.kubeflow.org/docs/components/katib/user-guides/hp-tuning/configure-algorithm/#use-custom-algorithm-in-katib) to implement a custom algorithm.

<table>
  <tbody>
    <tr align="center">
      <td><b>Hyperparameter Tuning</b></td>
      <td><b>Neural Architecture Search</b></td>
      <td><b>Early Stopping</b></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#random-search">Random Search</a></td>
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#neural-architecture-search-based-on-enas">ENAS</a></td>
      <td><a href="https://www.kubeflow.org/docs/components/katib/early-stopping/#median-stopping-rule">Median Stop</a></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#grid-search">Grid Search</a></td>
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#differentiable-architecture-search-darts">DARTS</a></td>
      <td></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#bayesian-optimization">Bayesian Optimization</a></td>
      <td></td>
      <td></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#tree-of-parzen-estimators-tpe">TPE</a></td>
      <td></td>
      <td></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#multivariate-tpe">Multivariate TPE</a></td>
      <td></td>
      <td></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#covariance-matrix-adaptation-evolution-strategy-cma-es">CMA-ES</a></td>
      <td></td>
      <td></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#sobols-quasirandom-sequence">Sobol's Quasirandom Sequence</a></td>
      <td></td>
      <td></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#hyperband">HyperBand</a></td>
      <td></td>
      <td></td>
    </tr>
    <tr align="center">
      <td><a href="https://www.kubeflow.org/docs/components/katib/experiment/#pbt">Population Based Training</a></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>

To run the algorithms above, Katib supports the following frameworks:

- [Goptuna](https://github.com/c-bata/goptuna)
- [Hyperopt](https://github.com/hyperopt/hyperopt)
- [Optuna](https://github.com/optuna/optuna)
- [Scikit Optimize](https://github.com/scikit-optimize/scikit-optimize)

## Prerequisites

Please check [the official Kubeflow documentation](https://www.kubeflow.org/docs/components/katib/installation/#prerequisites) for prerequisites to install Katib.

## Installation

Please follow [the Kubeflow Katib guide](https://www.kubeflow.org/docs/components/katib/installation/#installing-katib) for detailed instructions on how to install Katib.
### Installing the Control Plane

Run the following command to install the latest stable release of the Katib control plane:

```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.17.0"
```

Run the following command to install the latest changes of the Katib control plane:

```
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"
```

For Katib Experiments, check the [complete examples list](./examples/v1beta1).

### Installing the Python SDK

Katib implements [a Python SDK](https://pypi.org/project/kubeflow-katib/) to simplify the creation of hyperparameter tuning jobs for data scientists.

Run the following command to install the latest stable release of the Katib SDK:

```sh
pip install -U kubeflow-katib
```

## Getting Started

Please refer to [the getting started guide](https://www.kubeflow.org/docs/components/katib/getting-started/#getting-started-with-katib-python-sdk) to quickly create your first hyperparameter tuning Experiment using the Python SDK.

## Community

The following links provide information on how to get involved in the community:

- Attend [the bi-weekly AutoML and Training Working Group](https://bit.ly/2PWVCkV) community meeting.
- Join our [`#kubeflow-katib`](https://www.kubeflow.org/docs/about/community/#kubeflow-slack-channels) Slack channel.
- Check out [who is using Katib](ADOPTERS.md) and [presentations about the Katib project](docs/presentations.md).

## Contributing

Please refer to the [CONTRIBUTING guide](CONTRIBUTING.md).

## Citation

If you use Katib in a scientific publication, we would appreciate citations to the following paper:

[A Scalable and Cloud-Native Hyperparameter Tuning System](https://arxiv.org/abs/2006.02085), George _et al._, arXiv:2006.02085, 2020.

BibTeX entry:

```
@misc{george2020katib,
    title={A Scalable and Cloud-Native Hyperparameter Tuning System},
    author={Johnu George and Ce Gao and Richard Liu and Hou Gang Liu and Yuan Tang and Ramdoot Pydipaty and Amit Kumar Saha},
    year={2020},
    eprint={2006.02085},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}
```
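
As a rough illustration of two of the search algorithms listed in the Search Algorithms table above, grid search and random search, the plain-Python sketch below runs both against a toy objective. It is not Katib's implementation or API. The names `grid_search`, `random_search`, and `toy_loss`, the parameter names, and the objective itself are all hypothetical.

```python
import itertools
import random


def grid_search(space, objective):
    """Evaluate every combination of the listed values; return the best."""
    names = list(space)
    trials = [
        dict(zip(names, combo))
        for combo in itertools.product(*(space[name] for name in names))
    ]
    return min(trials, key=objective)


def random_search(bounds, objective, max_trials, seed=0):
    """Sample each parameter uniformly within its bounds; return the best."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    best, best_loss = None, float("inf")
    for _ in range(max_trials):
        trial = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        loss = objective(trial)
        if loss < best_loss:
            best, best_loss = trial, loss
    return best


def toy_loss(params):
    """Hypothetical objective, smallest at lr=0.1 and momentum=0.9."""
    return (params["lr"] - 0.1) ** 2 + (params["momentum"] - 0.9) ** 2


grid_best = grid_search(
    {"lr": [0.01, 0.1, 1.0], "momentum": [0.5, 0.9]}, toy_loss
)
random_best = random_search(
    {"lr": (0.01, 1.0), "momentum": (0.0, 1.0)}, toy_loss, max_trials=20
)
print("grid search best:", grid_best)
print("random search best:", random_best)
```

Grid search evaluates a fixed number of trials (six here), while random search spends a chosen trial budget on points anywhere inside the bounds, which is one reason both appear as separate options.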

ML Frameworks Container Management

1.7K GitHub Stars

Open Source

sdk

TinyVG software development kit

Developer Tools Mobile Development AI Agents

301 GitHub Stars
