About kpilot

KPilot: Unified control plane for multi-cluster Kubernetes management, GPU compute scheduling, and model serving.

t

Published by

togettoyou

Visit View Profile

README.md

View on GitHub

KPilot

Unified control plane for multi-cluster Kubernetes management, GPU compute scheduling, and model serving.

English · 中文

What is KPilot

KPilot is a control plane for running GPU workloads on Kubernetes. Cluster operations, Volcano-based batch scheduling, vGPU governance, hardware telemetry, plugin lifecycle, model serving, and an AI ops assistant that drives all of the above live behind one console with a consistent permission and audit surface.

Multi-cluster is the default — a single KPilot Server manages many clusters, with the in-cluster agent dialing back over a single long-lived multiplexed connection. No inbound ports on the cluster side, no shared kubeconfigs, no per-cloud divergence.

Why KPilot

Zero inbound ports; kubeconfigs never leave the cluster. Worker dials Server outbound over a single multiplexed connection — every K8s API call, Helm install, Pod log / exec session, embedded reverse proxy (Grafana, VictoriaMetrics, VictoriaLogs), and inference SSE stream shares the same connection without head-of-line blocking.
GPU + Volcano as one integrated platform. Volcano gang scheduling across 10 CR kinds with typed authoring forms and a visual scheduler-policy editor; vGPU slicing (slot / framebuffer / SM cores) parsed per-card with the Pods currently holding each slice; DCGM-driven GPU-Hour reports across 1h / 24h / 7d / 30d windows plus alerts on XID, ECC, thermal, and framebuffer pressure.
In-app model serving from catalog to API. One-click deploy curated open-source LLMs (Qwen3, DeepSeek-R1, Llama-4, Mistral, Phi-4, GLM-5.1, Gemma-4, Kimi-K2.6 — all on vLLM by default) to any managed cluster, debug each instance through the in-browser chat playground, then mint scoped OpenAI-compatible Bearer keys for application teams.
Plugin-first platform. KPilot's own observability stack — VictoriaMetrics, VictoriaLogs, Grafana, DCGM Exporter, Metrics Server, kube-state-metrics — ships through the same built-in Helm registry operators use to install arbitrary customer charts. Per-cluster enable / disable / upgrade from the UI; per-cluster values overrides; bring your own charts the same way.
An ops assistant that can act. Point KPilot AI at any OpenAI-compatible model and it drives the platform with KPilot's own capabilities as tools: query nodes / workloads / Pods across clusters, run PromQL and LogsQL, apply / delete resources, cordon nodes. Read tools run immediately; every write tool pauses for an in-chat approval card and is recorded to an audit log. Platform docs ship as progressive-disclosure skills, and KPilot AI distills new skills from experience and remembers operator preferences across sessions.

Architecture

KPilot architecture (C4 container diagram)

Server owns the UI, API, and durable state (cluster registry, plugin metadata, accounts, API keys, model templates) but holds no kubeconfigs. Worker runs inside each managed cluster, dials the Server over a single long-lived multiplexed connection and brokers every Kubernetes operation on its behalf — no inbound ports, no shared credentials, no cross-cloud divergence. Plugins ship as Helm charts and reconcile via an in-cluster CRD, executing in the cluster's own RBAC context.

Quick Start

Server + local Worker in one shot (the common "manage the cluster you're installing into" path):

helm install kpilot oci://ghcr.io/togettoyou/charts/kpilot \
  --version 0.0.0-dev \
  --namespace kpilot-system --create-namespace \
  --set server.admin.password='<change-me>' \
  --set worker.enabled=true \
  --set server.ai.baseURL='https://openrouter.ai/api/v1' \
  --set server.ai.apiKey='<model endpoint API key>' \
  --set server.ai.model='gpt-4o'

The chart auto-generates a shared bootstrap token, points the Worker at the in-cluster transport Service, and the Server registers a cluster row named local on first start. No need to click through the UI to mint a token — the cluster shows up Online within a few seconds.

The three server.ai.* flags are optional — they enable KPilot AI (point at any OpenAI-compatible endpoint; the model must support function calling). Omit them and the platform still mounts but the chat page shows a "not configured" hint.

Port-forward the UI and log in with kpilot / <your password>:

kubectl -n kpilot-system port-forward svc/kpilot-server 8080:80
open http://localhost:8080

Optional: add a remote managed cluster (one per cluster). Create a cluster row in the UI, copy the one-time ClusterToken, then on the target cluster:

helm install kpilot-worker oci://ghcr.io/togettoyou/charts/kpilot \
  --version 0.0.0-dev \
  --namespace kpilot-system --create-namespace \
  --set server.enabled=false,worker.enabled=true,postgresql.enabled=false \
  --set worker.serverAddr='<Server transport external addr>:9090' \
  --set worker.clusterToken='<paste-token>'

Production exposure (Ingress, external Postgres, image registry mirrors) is covered in deploy/README.md.

Key Features


Cluster Management Multi-cluster onboarding via a single-use token; no kubeconfig sharing Live node and workload browser covering native and custom resources In-browser Pod logs, terminal, and per-container CPU / memory metrics Inline YAML editor with apply / describe / delete for any resource Self-rendered monitoring and logging — no Grafana iframe required	Compute Scheduling Volcano gang scheduling across Queue, Job, CronJob, PodGroup, HyperNode Fine-grained GPU sharing via volcano-vgpu-device-plugin (slot / framebuffer / SM cores) Multi-resource queue quotas with capability, guarantee, allocated, and deserved views Visual scheduler-policy editor for actions, tiers, and plugin parameters
GPU Observability Per-card panels for utilization, temperature, power, framebuffer, SM clock, tensor activity Multi-select node / GPU filter, threshold reference lines, and per-chart fullscreen "GPUs that need attention" — idle / hot / OOM-risk highlighted automatically DCGM-driven GPU-Hour reports across 1h / 24h / 7d / 30d windows Alerts on DCGM XID, ECC, thermal, and framebuffer-pressure conditions vGPU view mapping every physical card to its current slice holders	Plugin Management Built-in Helm registry covering Volcano, DCGM Exporter, VictoriaMetrics, VictoriaLogs, Grafana, Metrics Server, kube-state-metrics Per-cluster enable / disable / upgrade with the install log streamed live Bring-your-own charts with per-cluster values overrides The same pipeline that installs customer charts also bootstraps KPilot's own observability stack
Model Serving Curated catalog: Qwen3-0.6B/8B/14B/32B-Instruct, Qwen3-30B-A3B (MoE), DeepSeek-R1, Llama-4-Scout-17B-16E (MoE), Mistral-Small-3.2-24B, Phi-4, GLM-5.1, Gemma-4-31B, Kimi-K2.6 — all on vLLM by default One-click deploy to any managed cluster with `nvidia.com/gpu` or Volcano vGPU resource shaping Cross-cluster table of running instances with per-row chat / Describe / cascade Delete In-browser chat playground with token/sec, `<think>` chain-of-thought collapse, and markdown rendering	OpenAI-Compatible Gateway Per-deployment Bearer keys (`kp-sk-…`), shown once at creation; sha256-hashed at rest End-to-end SSE streaming — vLLM tokens reach the SDK live; client-side Stop kills the upstream call immediately Per-key token and request metering; operator-resettable counters Two-stage scope picker (cluster → deployment) prevents key/deployment mismatch Soft revoke (preserve audit row) and hard delete
System Monitoring Live Go runtime view of the KPilot server and every connected worker — goroutines, heap, GC pauses, scheduler latency, CPU and memory usage, file descriptors One day of history retained server-side; range picker (1h / 3h / 6h / 12h / 24h or custom) for arbitrary lookback — offline workers stay inspectable for post-mortem One-click pprof downloads (heap, goroutine, CPU, allocs, block, mutex, trace) for offline flame-graph analysis with `go tool pprof` Custom business collectors — yamux sessions and streams (split by cluster), HTTP RPS / 5xx / p99 latency, DB pool, SSE clients, inference proxy in-flight Self-observability for the control plane — independent of VictoriaMetrics / Grafana	KPilot AI An ops assistant that drives the other five platforms — its tools are KPilot's existing operations, its preset knowledge is the platform docs Read-only recon (cross-cluster resource queries / PromQL / LogsQL) runs immediately; writes (apply / delete / cordon) pause for an in-chat approval card the operator must approve Progressive-disclosure skills + a self-evolving curator + cross-session memory — it gets to know your clusters over time Every write action is fully audited Point at any OpenAI-compatible model (base_url / key via env); the platform stays visible when unconfigured

Screenshots

Cluster Management — `docs/clusters.md`

Compute Scheduling — `docs/compute.md`

Model Serving — `docs/models.md`

KPilot AI — `docs/kpilot-ai.md`

An ops assistant that drives the other platforms. Point KPilot AI at any OpenAI-compatible model and it uses KPilot's own capabilities as tools: query resources across clusters, run PromQL and LogsQL, apply / delete, cordon nodes. Read tools run immediately; write tools pause for an in-chat approval card and are fully audited. Platform docs ship as progressive-disclosure skills, and KPilot AI distills new skills from experience and remembers operator preferences across sessions. Enable it by setting KPILOT_AI_BASE_URL and KPILOT_AI_API_KEY.

Plugin Management — `docs/plugins.md`

System Management — `docs/system.md`

Runtime detail — KPIs + time series + pprof

kpilot

About kpilot

Platforms

Languages

Links

README.md

KPilot

What is KPilot

Why KPilot

Architecture

Quick Start

Key Features

Screenshots

Cluster Management — `docs/clusters.md`

Compute Scheduling — `docs/compute.md`

Model Serving — `docs/models.md`

KPilot AI — `docs/kpilot-ai.md`

Plugin Management — `docs/plugins.md`

System Management — `docs/system.md`

kpilot

About kpilot

Platforms

Languages

Links

README.md

KPilot

What is KPilot

Why KPilot

Architecture

Quick Start

Key Features

Screenshots

Cluster Management — docs/clusters.md

Compute Scheduling — docs/compute.md

Model Serving — docs/models.md

KPilot AI — docs/kpilot-ai.md

Plugin Management — docs/plugins.md

System Management — docs/system.md

Cluster Management — `docs/clusters.md`

Compute Scheduling — `docs/compute.md`

Model Serving — `docs/models.md`

KPilot AI — `docs/kpilot-ai.md`

Plugin Management — `docs/plugins.md`

System Management — `docs/system.md`