KPilot
Unified control plane for multi-cluster Kubernetes management, GPU compute scheduling, and model serving.
What is KPilot
KPilot is a control plane for running GPU workloads on Kubernetes. Cluster operations, Volcano-based batch scheduling, vGPU governance, hardware telemetry, plugin lifecycle, model serving, and an AI ops assistant that drives all of the above live behind one console with a consistent permission and audit surface.
Multi-cluster is the default — a single KPilot Server manages many clusters, with the in-cluster agent dialing back over a single long-lived multiplexed connection. No inbound ports on the cluster side, no shared kubeconfigs, no per-cloud divergence.
Why KPilot
-
Zero inbound ports; kubeconfigs never leave the cluster. Worker dials Server outbound over a single multiplexed connection — every K8s API call, Helm install, Pod log / exec session, embedded reverse proxy (Grafana, VictoriaMetrics, VictoriaLogs), and inference SSE stream shares the same connection without head-of-line blocking.
-
GPU + Volcano as one integrated platform. Volcano gang scheduling across 10 CR kinds with typed authoring forms and a visual scheduler-policy editor; vGPU slicing (slot / framebuffer / SM cores) parsed per-card with the Pods currently holding each slice; DCGM-driven GPU-Hour reports across 1h / 24h / 7d / 30d windows plus alerts on XID, ECC, thermal, and framebuffer pressure.
-
In-app model serving from catalog to API. One-click deploy curated open-source LLMs (Qwen3, DeepSeek-R1, Llama-4, Mistral, Phi-4, GLM-5.1, Gemma-4, Kimi-K2.6 — all on vLLM by default) to any managed cluster, debug each instance through the in-browser chat playground, then mint scoped OpenAI-compatible Bearer keys for application teams.
-
Plugin-first platform. KPilot's own observability stack — VictoriaMetrics, VictoriaLogs, Grafana, DCGM Exporter, Metrics Server, kube-state-metrics — ships through the same built-in Helm registry operators use to install arbitrary customer charts. Per-cluster enable / disable / upgrade from the UI; per-cluster values overrides; bring your own charts the same way.
-
An ops assistant that can act. Point KPilot AI at any OpenAI-compatible model and it drives the platform with KPilot's own capabilities as tools: query nodes / workloads / Pods across clusters, run PromQL and LogsQL, apply / delete resources, cordon nodes. Read tools run immediately; every write tool pauses for an in-chat approval card and is recorded to an audit log. Platform docs ship as progressive-disclosure skills, and KPilot AI distills new skills from experience and remembers operator preferences across sessions.
Architecture
Server owns the UI, API, and durable state (cluster registry, plugin metadata, accounts, API keys, model templates) but holds no kubeconfigs. Worker runs inside each managed cluster, dials the Server over a single long-lived multiplexed connection and brokers every Kubernetes operation on its behalf — no inbound ports, no shared credentials, no cross-cloud divergence. Plugins ship as Helm charts and reconcile via an in-cluster CRD, executing in the cluster's own RBAC context.
Quick Start
Server + local Worker in one shot (the common "manage the cluster you're installing into" path):
helm install kpilot oci://ghcr.io/togettoyou/charts/kpilot \
--version 0.0.0-dev \
--namespace kpilot-system --create-namespace \
--set server.admin.password='<change-me>' \
--set worker.enabled=true \
--set server.ai.baseURL='https://openrouter.ai/api/v1' \
--set server.ai.apiKey='<model endpoint API key>' \
--set server.ai.model='gpt-4o'
The chart auto-generates a shared bootstrap token, points the Worker at the in-cluster transport Service, and the Server registers a cluster row named local on first start. No need to click through the UI to mint a token — the cluster shows up Online within a few seconds.
The three
server.ai.*flags are optional — they enable KPilot AI (point at any OpenAI-compatible endpoint; the model must support function calling). Omit them and the platform still mounts but the chat page shows a "not configured" hint.
Port-forward the UI and log in with kpilot / <your password>:
kubectl -n kpilot-system port-forward svc/kpilot-server 8080:80
open http://localhost:8080
Optional: add a remote managed cluster (one per cluster). Create a cluster row in the UI, copy the one-time ClusterToken, then on the target cluster:
helm install kpilot-worker oci://ghcr.io/togettoyou/charts/kpilot \
--version 0.0.0-dev \
--namespace kpilot-system --create-namespace \
--set server.enabled=false,worker.enabled=true,postgresql.enabled=false \
--set worker.serverAddr='<Server transport external addr>:9090' \
--set worker.clusterToken='<paste-token>'
Production exposure (Ingress, external Postgres, image registry mirrors) is covered in deploy/README.md.
Key Features
Cluster Management
|
Compute Scheduling
|
GPU Observability
|
Plugin Management
|
Model Serving
|
OpenAI-Compatible Gateway
|
System Monitoring
|
KPilot AI
|
Screenshots
Cluster Management — docs/clusters.md
![]() |
![]() |
![]() |
![]() |
Compute Scheduling — docs/compute.md
![]() |
![]() |
![]() |
![]() |
Model Serving — docs/models.md
![]() |
![]() |
![]() |
![]() |
KPilot AI — docs/kpilot-ai.md
An ops assistant that drives the other platforms. Point KPilot AI at any OpenAI-compatible model and it uses KPilot's own capabilities as tools: query resources across clusters, run PromQL and LogsQL, apply / delete, cordon nodes. Read tools run immediately; write tools pause for an in-chat approval card and are fully audited. Platform docs ship as progressive-disclosure skills, and KPilot AI distills new skills from experience and remembers operator preferences across sessions. Enable it by setting KPILOT_AI_BASE_URL and KPILOT_AI_API_KEY.
![]() |
![]() |
![]() |
![]() |
Plugin Management — docs/plugins.md
![]() |
![]() |
System Management — docs/system.md
![]() |
![]() |



















