Home
Softono
kpilot

kpilot

Open source MIT Go
23
Stars
1
Forks
0
Issues
0
Watchers
1 week
Last Commit

About kpilot

KPilot: Unified control plane for multi-cluster Kubernetes management, GPU compute scheduling, and model serving.

Platforms

Web Self-hosted Cloud Kubernetes

Languages

Go

KPilot

Unified control plane for multi-cluster Kubernetes management, GPU compute scheduling, and model serving.

English · 中文

License Stars Last commit Helm chart


What is KPilot

KPilot is a control plane for running GPU workloads on Kubernetes. Cluster operations, Volcano-based batch scheduling, vGPU governance, hardware telemetry, plugin lifecycle, model serving, and an AI ops assistant that drives all of the above live behind one console with a consistent permission and audit surface.

Multi-cluster is the default — a single KPilot Server manages many clusters, with the in-cluster agent dialing back over a single long-lived multiplexed connection. No inbound ports on the cluster side, no shared kubeconfigs, no per-cloud divergence.

Why KPilot

  • Zero inbound ports; kubeconfigs never leave the cluster. Worker dials Server outbound over a single multiplexed connection — every K8s API call, Helm install, Pod log / exec session, embedded reverse proxy (Grafana, VictoriaMetrics, VictoriaLogs), and inference SSE stream shares the same connection without head-of-line blocking.

  • GPU + Volcano as one integrated platform. Volcano gang scheduling across 10 CR kinds with typed authoring forms and a visual scheduler-policy editor; vGPU slicing (slot / framebuffer / SM cores) parsed per-card with the Pods currently holding each slice; DCGM-driven GPU-Hour reports across 1h / 24h / 7d / 30d windows plus alerts on XID, ECC, thermal, and framebuffer pressure.

  • In-app model serving from catalog to API. One-click deploy curated open-source LLMs (Qwen3, DeepSeek-R1, Llama-4, Mistral, Phi-4, GLM-5.1, Gemma-4, Kimi-K2.6 — all on vLLM by default) to any managed cluster, debug each instance through the in-browser chat playground, then mint scoped OpenAI-compatible Bearer keys for application teams.

  • Plugin-first platform. KPilot's own observability stack — VictoriaMetrics, VictoriaLogs, Grafana, DCGM Exporter, Metrics Server, kube-state-metrics — ships through the same built-in Helm registry operators use to install arbitrary customer charts. Per-cluster enable / disable / upgrade from the UI; per-cluster values overrides; bring your own charts the same way.

  • An ops assistant that can act. Point KPilot AI at any OpenAI-compatible model and it drives the platform with KPilot's own capabilities as tools: query nodes / workloads / Pods across clusters, run PromQL and LogsQL, apply / delete resources, cordon nodes. Read tools run immediately; every write tool pauses for an in-chat approval card and is recorded to an audit log. Platform docs ship as progressive-disclosure skills, and KPilot AI distills new skills from experience and remembers operator preferences across sessions.

Architecture

KPilot architecture (C4 container diagram)

Server owns the UI, API, and durable state (cluster registry, plugin metadata, accounts, API keys, model templates) but holds no kubeconfigs. Worker runs inside each managed cluster, dials the Server over a single long-lived multiplexed connection and brokers every Kubernetes operation on its behalf — no inbound ports, no shared credentials, no cross-cloud divergence. Plugins ship as Helm charts and reconcile via an in-cluster CRD, executing in the cluster's own RBAC context.

Quick Start

Server + local Worker in one shot (the common "manage the cluster you're installing into" path):

helm install kpilot oci://ghcr.io/togettoyou/charts/kpilot \
  --version 0.0.0-dev \
  --namespace kpilot-system --create-namespace \
  --set server.admin.password='<change-me>' \
  --set worker.enabled=true \
  --set server.ai.baseURL='https://openrouter.ai/api/v1' \
  --set server.ai.apiKey='<model endpoint API key>' \
  --set server.ai.model='gpt-4o'

The chart auto-generates a shared bootstrap token, points the Worker at the in-cluster transport Service, and the Server registers a cluster row named local on first start. No need to click through the UI to mint a token — the cluster shows up Online within a few seconds.

The three server.ai.* flags are optional — they enable KPilot AI (point at any OpenAI-compatible endpoint; the model must support function calling). Omit them and the platform still mounts but the chat page shows a "not configured" hint.

Port-forward the UI and log in with kpilot / <your password>:

kubectl -n kpilot-system port-forward svc/kpilot-server 8080:80
open http://localhost:8080

Optional: add a remote managed cluster (one per cluster). Create a cluster row in the UI, copy the one-time ClusterToken, then on the target cluster:

helm install kpilot-worker oci://ghcr.io/togettoyou/charts/kpilot \
  --version 0.0.0-dev \
  --namespace kpilot-system --create-namespace \
  --set server.enabled=false,worker.enabled=true,postgresql.enabled=false \
  --set worker.serverAddr='<Server transport external addr>:9090' \
  --set worker.clusterToken='<paste-token>'

Production exposure (Ingress, external Postgres, image registry mirrors) is covered in deploy/README.md.

Key Features

Cluster Management
  • Multi-cluster onboarding via a single-use token; no kubeconfig sharing
  • Live node and workload browser covering native and custom resources
  • In-browser Pod logs, terminal, and per-container CPU / memory metrics
  • Inline YAML editor with apply / describe / delete for any resource
  • Self-rendered monitoring and logging — no Grafana iframe required
Compute Scheduling
  • Volcano gang scheduling across Queue, Job, CronJob, PodGroup, HyperNode
  • Fine-grained GPU sharing via volcano-vgpu-device-plugin (slot / framebuffer / SM cores)
  • Multi-resource queue quotas with capability, guarantee, allocated, and deserved views
  • Visual scheduler-policy editor for actions, tiers, and plugin parameters
GPU Observability
  • Per-card panels for utilization, temperature, power, framebuffer, SM clock, tensor activity
  • Multi-select node / GPU filter, threshold reference lines, and per-chart fullscreen
  • "GPUs that need attention" — idle / hot / OOM-risk highlighted automatically
  • DCGM-driven GPU-Hour reports across 1h / 24h / 7d / 30d windows
  • Alerts on DCGM XID, ECC, thermal, and framebuffer-pressure conditions
  • vGPU view mapping every physical card to its current slice holders
Plugin Management
  • Built-in Helm registry covering Volcano, DCGM Exporter, VictoriaMetrics, VictoriaLogs, Grafana, Metrics Server, kube-state-metrics
  • Per-cluster enable / disable / upgrade with the install log streamed live
  • Bring-your-own charts with per-cluster values overrides
  • The same pipeline that installs customer charts also bootstraps KPilot's own observability stack
Model Serving
  • Curated catalog: Qwen3-0.6B/8B/14B/32B-Instruct, Qwen3-30B-A3B (MoE), DeepSeek-R1, Llama-4-Scout-17B-16E (MoE), Mistral-Small-3.2-24B, Phi-4, GLM-5.1, Gemma-4-31B, Kimi-K2.6 — all on vLLM by default
  • One-click deploy to any managed cluster with nvidia.com/gpu or Volcano vGPU resource shaping
  • Cross-cluster table of running instances with per-row chat / Describe / cascade Delete
  • In-browser chat playground with token/sec, <think> chain-of-thought collapse, and markdown rendering
OpenAI-Compatible Gateway
  • Per-deployment Bearer keys (kp-sk-…), shown once at creation; sha256-hashed at rest
  • End-to-end SSE streaming — vLLM tokens reach the SDK live; client-side Stop kills the upstream call immediately
  • Per-key token and request metering; operator-resettable counters
  • Two-stage scope picker (cluster → deployment) prevents key/deployment mismatch
  • Soft revoke (preserve audit row) and hard delete
System Monitoring
  • Live Go runtime view of the KPilot server and every connected worker — goroutines, heap, GC pauses, scheduler latency, CPU and memory usage, file descriptors
  • One day of history retained server-side; range picker (1h / 3h / 6h / 12h / 24h or custom) for arbitrary lookback — offline workers stay inspectable for post-mortem
  • One-click pprof downloads (heap, goroutine, CPU, allocs, block, mutex, trace) for offline flame-graph analysis with go tool pprof
  • Custom business collectors — yamux sessions and streams (split by cluster), HTTP RPS / 5xx / p99 latency, DB pool, SSE clients, inference proxy in-flight
  • Self-observability for the control plane — independent of VictoriaMetrics / Grafana
KPilot AI
  • An ops assistant that drives the other five platforms — its tools are KPilot's existing operations, its preset knowledge is the platform docs
  • Read-only recon (cross-cluster resource queries / PromQL / LogsQL) runs immediately; writes (apply / delete / cordon) pause for an in-chat approval card the operator must approve
  • Progressive-disclosure skills + a self-evolving curator + cross-session memory — it gets to know your clusters over time
  • Every write action is fully audited
  • Point at any OpenAI-compatible model (base_url / key via env); the platform stays visible when unconfigured

Screenshots

Cluster Management — docs/clusters.md

Pod browser with live logs and terminal Self-rendered cluster monitoring
Cluster logging Embedded Grafana escape hatch

Compute Scheduling — docs/compute.md

Scheduler overview Visual scheduler policy editor
vGPU view Volcano Job authoring

Model Serving — docs/models.md

Model catalog One-click deploy drawer
In-browser chat playground API key management

KPilot AI — docs/kpilot-ai.md

An ops assistant that drives the other platforms. Point KPilot AI at any OpenAI-compatible model and it uses KPilot's own capabilities as tools: query resources across clusters, run PromQL and LogsQL, apply / delete, cordon nodes. Read tools run immediately; write tools pause for an in-chat approval card and are fully audited. Platform docs ship as progressive-disclosure skills, and KPilot AI distills new skills from experience and remembers operator preferences across sessions. Enable it by setting KPILOT_AI_BASE_URL and KPILOT_AI_API_KEY.

Chat — streaming replies + inline tool cards + write-approval cards Skills — bundled / learned, progressive disclosure
Cross-session long-term memory Write-action audit

Plugin Management — docs/plugins.md

Plugin admin Plugin install

System Management — docs/system.md

System monitoring — node overview Runtime detail — KPIs + time series + pprof