Leoflow
The workflow orchestrator that ate Apache Airflow's lunch.
Same UI. Same vocabulary. Ten times the speed. Zero of the pain.
Native map-reduce for ML/AI β fan-out + reduce as a Python list comprehension, no XCom plumbing, no broker, no special operator.
π Documentation
Full docs β https://neochaotic.github.io/leoflow/ (DAG authoring, deploy, API reference, architecture).
| Operating modes Β· Editions | Lite Β· Pro Β· Demo β the runtime split and the packaging split |
| DAG authoring | write a DAG; the Lite β deploy lifecycle |
| Map-reduce for ML | fan-out + reduce as a Python list comprehension |
| CI/CD & deploy examples | GitHub Actions Β· GitLab Β· Cloud Build/Run Β· generic |
| Helm chart | Pro install: values reference, hardening, PoC recipe |
| HTTP API (Scalar) Β· Go packages | API references |
| Concepts & glossary Β· Architecture | the model & the why |
β‘ Install
Leoflow ships in two editions. Pick the track that matches your target:
Lite β the full engine on a single host
The full Leoflow control plane on one machine β the same engine, the same
Airflow-3.2 UI, the same dag.py + leoflow.yaml as Pro β in a one-command,
no-Kubernetes package. Light to run, powerful where it counts: real
pod-per-task execution, a durable Postgres datastore, hot-reload, a real
admin login. It is not a stripped-down demo β it is the whole engine scoped to
one host, carrying real workloads from local development to light-to-medium
production on a laptop, a single VM, or an internal server (trusted network).
curl -fsSL https://raw.githubusercontent.com/neochaotic/leoflow/main/install.sh | sh
leoflow lite # hot-reload at http://localhost:8088 (LITE badge)
Installs the binaries and runs leoflow setup (ensures Python, provisions the
parser, creates your workspace at ~/leoflow/) β no sudo, no system Python,
no package manager. Docker is optional and only unlocks the Kubernetes
executor for higher-fidelity local runs. Linux + macOS, amd64 + arm64
(Windows via WSL2). Each DAG gets its own per-DAG venv under
~/.leoflow/dev/venvs/<dag_id>/ so conflicting dependencies between DAGs
coexist out of the box.
Pro β Kubernetes cluster (Helm)
helm repo add leoflow https://neochaotic.github.io/leoflow # (charts published per release)
kubectl create namespace leoflow
helm install lf leoflow/leoflow -n leoflow \
--set image.tag=v0.0.1 \
--set migrations.image.tag=v0.0.1 \
--set database.url='postgres://USER:PASS@HOST:5432/leoflow?sslmode=verify-full' \
--set redis.url='rediss://HOST:6380/0' \
--set auth.jwtSecret="$(openssl rand -base64 64)" \
--set secretKey="$(openssl rand -hex 16)" \
--set bootstrap.password='change-me'
Pro deploys the control plane on a real cluster (leoflow-server Deployment
- RBAC for the pod-per-task executor + a pre-install migrations Job). External
Postgres 13+ and Redis 6+ are required (the chart fails the install otherwise β
embedded datastores are Lite-only). Managed datastores work out of the box
(Cloud SQL / RDS / Memorystore / ElastiCache / Azure Cache), with optional
caConfigMapknobs for verified TLS. See the chart docs.
Full guide for both tracks β Installation.
The Five Wounds Apache Airflow Will Not Heal
Airflow is the most widely deployed workflow orchestrator on earth. It is also the one that bleeds the most in production. Every data engineer recognizes these wounds:
- The scheduler that stalls. Three seconds between tasks. Ten when the cluster is busy. Pipelines that should run in two minutes take twenty.
- The triggerer that suffocates. Above five hundred concurrent sensors, the Python asyncio loop chokes. Sensors stop firing. SLAs miss.
- The DAG file that re-parses itself to death. Every scheduler loop opens every
.pyfile in/dags. CPU spikes for nothing. Memory grows. Restarts become a ritual. - The worker that leaks until it dies. Long-running Celery workers accumulate file descriptors, database connections, half-loaded modules. OOMKilled at three in the morning. Always.
- The dependency hell that has no door.
pandas==1.0for the legacy DAG,pandas==2.0for the new one. One Airflow image. Pick a side. Cry either way.
Leoflow was built to close these five wounds, on day one, by construction.
How It Closes Them
Leoflow does not invent a new execution model. Pod-per-task is the right pattern, and Airflow's KubernetesExecutor proved it years ago. What Leoflow does is strip out the Python overhead from every layer of the orchestration stack:
| Wound | Airflow today | Leoflow |
|---|---|---|
| Scheduler latency | 3-10 seconds per decision | <200 ms β native Go, zero GIL |
| Sensor concurrency | ~500 (asyncio Triggerer) | 100,000+ β each sensor is a 2 KB goroutine |
| DAG parsing cost | Re-parsed every scheduler loop | Zero β DAG is pre-compiled to immutable JSON |
| Worker lifecycle | Long-lived, leak-prone | Ephemeral pod per task β spawn, run, die |
| Worker image size | 1.5 GB+ Airflow base | 200 MB typical β each DAG is its own slim image |
| Dependency isolation | Workaround via KubernetesPodOperator |
Native β every DAG is a container |
| Cold start | 15-45 s | 2-5 s target β agent is a 15 MB static binary |
| Observability | Retrofitted with effort | Native β Prometheus + OpenTelemetry + structured logs from commit one |
This is not marketing. This is what falls out of replacing a Python control plane with Go and embracing the container as the unit of isolation.
What Leoflow Is
Leoflow is a GitOps-first, container-native workflow orchestrator written in Go. Each phrase carries weight:
- GitOps-first. Your DAG is a versioned artifact (
dag.json+ container image), not live source code. CI builds it. The registry stores it. Rollback is a tag change. - Container-native. Each DAG is its own container image, with its own dependencies, its own Python version, its own everything. Built automatically from a one-page
leoflow.yamlβ you never touch Docker unless you want to. - Airflow-UI compatible. The MVP runs the unmodified Apache Airflow 3.2.x UI. Your team's muscle memory survives the migration. No new tool to learn.
- Go performance, Go discipline. Static binary. No GIL. Goroutines for concurrency. Test-driven from the first commit. Go Report Card A+ enforced in CI.
What It Looks Like to Use
A complete Leoflow DAG project. No Dockerfile. No requirements.txt. No CI plumbing to invent.
# leoflow.yaml
dag_id: etl_vendas
python_version: "3.11"
dependencies:
- pandas==2.1.0
- requests==2.31.0
# dag.py
from leoflow import DAG, task
@task
def fetch():
import requests
return requests.get("https://api.example.com/orders").json()
@task
def transform(orders):
return [{"id": o["id"], "value": o["amount"] * 1.1} for o in orders]
with DAG("etl_vendas", schedule="0 5 * * *") as dag:
raw = fetch()
processed = transform(raw)
leoflow compile . # generates Dockerfile, builds image, produces dag.json
leoflow push ./dag.json # registers with the control plane
That is the entire developer surface. The CLI builds the image on the published Leoflow task base (ghcr.io/neochaotic/leoflow-runtime:py3.11, selected by python_version), pushes to your registry, and registers a versioned DAG. The Airflow UI shows it at the next refresh.
Native map-reduce for ML/AI
Hyperparameter search, k-fold cross-validation, ensemble training, batch inference, sharded preprocessing, Monte Carlo β every parallel ML workload is map-reduce. Most orchestrators make you build it: an operator per fan-out, a broker for the intermediate values, shared storage for the artifacts, and a custom reducer that knows how to find them all. Leoflow expresses the whole pattern in two lines of Python:
from airflow.sdk import DAG, task
@task
def trial(lr: float) -> dict:
return train_one(lr) # map
@task
def select_best(trials: list[dict]) -> dict:
return max(trials, key=lambda r: r["score"]) # reduce
with DAG("hparam_search", schedule=None):
select_best([trial(lr) for lr in [0.001, 0.01, 0.05, 0.1, 0.5]])
That [trial(lr) for lr in β¦] is the whole map. trials: list[dict] is the
whole reduce. No XCom plumbing, no broker setup, no shared filesystem, no
special operator β the parser captures the list shape at compile time; the
runtime assembles the upstream XComs in declaration order and delivers them
as a real Python list. Per-trial isolation (own pod / own process), per-trial
retry, deterministic ordering, and a 256 KB cap per upstream β and a
null slot for any upstream that legitimately produced no result.
| ML pattern | Map | Reduce |
|---|---|---|
| Hyperparameter search | one task per (lr, batch, seed) triple |
pick the best metric |
| K-fold cross-validation | one task per fold | average the metrics |
| Ensemble training | one task per base model | combine predictions / stack |
| Batch inference | one task per partition | collect predictions to a sink |
| Monte Carlo | one task per worker | average / sum results |
Runnable example: examples/ml_hparam_search/. Full reference:
Map-reduce for ML β guarantees, limits, what
activates fan-in vs what does not, and the on-disk dag.json shape.
Architecture
A DAG is compiled into an immutable artifact (a dag.json spec plus a
container image) and pushed to the control plane. A Go control plane
schedules it and, for each task, dispatches an ephemeral worker pod whose
leoflow-agent runs the user code and reports back over gRPC. Postgres holds
metadata; Redis holds XCom values and live-log fan-out.
flowchart LR
subgraph dev["Author / CI"]
src["leoflow.yaml Β· dag.py Β· Dockerfile"]
end
subgraph cp["Control plane β Go"]
direction TB
api["HTTP API /api/v2<br/>JWT Β· RBAC Β· multi-tenant"]
sched["Scheduler<br/>state machine Β· cron<br/>leader election Β· retries"]
asvc["Agent gRPC service<br/>task spec Β· state Β· XCom Β· logs"]
api --- sched --- asvc
end
src -->|"leoflow compile / push"| api
sched --- pg[("Postgres<br/>metadata")]
asvc --- redis[("Redis<br/>XCom Β· log tail")]
sched -->|"dispatch: one pod per task"| pod
subgraph k8s["Kubernetes"]
pod["Worker pod = your DAG image<br/>leoflow-agent β your Python / Bash"]
end
pod -->|"gRPC: register Β· fetch spec Β· push XCom Β· stream logs Β· report state"| asvc
classDef store fill:#1f2937,stroke:#4b5563,color:#e5e7eb;
class pg,redis store;
Short-lived http_api tasks skip the pod and run inline as goroutines in the
control plane (capped); everything else runs pod-per-task. Read
the ADRs for the reasoning behind every decision.
Status
π§ͺ Experimental β pre-1.0. The HTTP API (/api/v2), CLI, and Helm chart
values may change between minor versions until v1.0.0 locks them. Production
Pro deployments are supported by the Helm chart today, with the usual caveats
that come with a pre-1.0 codebase: pin to a specific tag, read the
upgrades guide before bumping, and exercise
backup/restore before you need to.
Versioning follows ADR 0037:
v0.0.1 ends the pre-alpha series; every release after is
vX.Y.Z-rc.N β vX.Y.Z.
Implemented today (Phases 1β4):
- CLI + parser β
leoflow init / validate / compile / push / runs trigger / runs status / auth create-token; the Python DAG parser;compile --build / --pushbuilds and pushes the DAG image (out-of-process). - Control plane β Airflow-compatible
/api/v2API, JWT auth + RBAC + multi-tenant, the scheduler state machine with cron scheduling, Postgres advisory-lock leader election, task retries, embedded Scalar API docs, and Prometheus + OpenTelemetry observability. - Execution β real pod-per-task execution via the
leoflow-agentover gRPC (Kubernetes, ADR 0015), plus inlinehttp_apigoroutines for short calls; orphaned-pod reconciliation and completed-pod garbage collection. - Data flow β XCom on Redis (256 KB limit, TTL, optional schema validation) passed between tasks; log shipping to disk with a read API and live tailing over Redis pub/sub.
Not yet implemented: load tests (Phase 6) and S3/GCS log sinks. Tracked refinements live in the issue tracker.
Features in the MVP
Shipping in v0.1.0:
- Python, Bash, and HTTP API operators
- DAG-as-Image model with automatic image build via
leoflow.yaml - Hybrid DAG authoring: Python source parsed at compile time, or declarative YAML
- XCom on Redis with 256 KB limit, TTL, and optional schema validation
- Apache Airflow 3.2.x UI compatibility (no fork required)
- JWT authentication, RBAC, multi-tenant schema (OIDC-ready)
- Kubernetes-native execution (no worker pool to manage)
- Local development on Kubernetes (k3d/kind) or a dev-only subprocess executor (ADR 0015)
- Trigger rules:
all_success,all_failed,all_done,one_success,one_failed - Clear task instance to re-run failed tasks
- Leader election via Postgres advisory locks
- OpenSSF Best Practices compliance, signed releases (cosign), supply chain scanning (govulncheck + Trivy + CodeQL)
On the post-MVP roadmap:
- Optimized backfill (parallel execution with throttling)
- UI scaling for 10,000+ DAGs (caching, server-side pagination)
- Dynamic task mapping
- OIDC authentication (Google, Azure AD, Keycloak, Okta)
- Mark success/failed manually
- Custom UI (replacing the Airflow UI)
- Deferrable tasks (efficient dispatch + long-poll pattern, native Go implementation without a separate Triggerer process β see ADR 0016)
Getting Started
Try it locally (one command)
After the Lite install above, just run:
leoflow lite
β¦then open http://localhost:8088 (the LITE badge confirms you're on the
Lite instance). The first run provisions a managed Postgres + admin login, drops
example DAGs in ~/leoflow/examples/, and hot-reloads every save. Recover the
admin password any time with leoflow lite reset-password.
Lite is the primary local path. The legacy Docker-Compose demo profile (
docker compose --profile demo up --build, login[email protected]/admin) still works for CI / containerized-only environments β see docs/local-deploy.md. The pinned Airflow 3.2.x UI is a tactical MVP choice; a purpose-built Leoflow UI is the long-term direction (ADR 0018).
Local development
git clone https://github.com/neochaotic/leoflow
cd leoflow
make setup # Go tools, Python parser, pre-commit hook
make build # builds bin/leoflow, bin/leoflow-server, bin/leoflow-agent
# Start Postgres + Redis (Docker) and apply migrations
make dev-up # docker compose up --wait + migrate-up; `make dev-down` to stop
# Run the control plane (bootstraps a default admin user)
LEOFLOW_AUTH_JWT_SECRET=dev LEOFLOW_BOOTSTRAP_PASSWORD=admin123 ./bin/leoflow-server &
# API docs (Scalar) at http://localhost:8080/docs ; metrics at http://localhost:9090/metrics
# Author, compile, and register a DAG
./bin/leoflow init my-dag
./bin/leoflow compile my-dag --image my-dag:dev -o my-dag/dag.json
TOKEN=$(./bin/leoflow auth create-token --username [email protected] --password admin123)
./bin/leoflow push my-dag/dag.json --token "$TOKEN"
Two dev environments.
make dev-upruns Postgres + Redis as plain Docker containers on the host for a fast inner loop (control plane on the host). For full in-cluster execution (control plane and dependencies on a local Kubernetes cluster, mirroring production and exercising real task pods), the Helm chart is installable on any K8s cluster β chart-test CI gates every change withhelm lint+helm-unittest(41 tests) + kind install/upgrade smoke. Task execution is on Kubernetes only (ADR 0015); the host containers are dev dependencies, not the execution path.
The Airflow 3.2.1 UI ships embedded in the server and is served at
/(see the one-command demo above anddocs/ui-compatibility.md). The Scalar API reference is at/docs. Load tests are the remaining Phase 6 work.
Honest Comparison
We have no patience for marketing fiction. Here is where Leoflow sits in the landscape:
| Airflow | Argo Workflows | Prefect | Dagster | Leoflow | |
|---|---|---|---|---|---|
| Language of control plane | Python | Go | Python | Python | Go |
| Pod-per-task model | Optional (KubernetesExecutor) | Yes | Optional | Optional | Yes, only mode |
| Dependency isolation per DAG | Workaround | Manual | Partial | Partial | Native |
| UI familiar to Airflow users | Yes | No | No | No | Yes (Airflow UI) |
| GitOps-first DAG model | No | Yes | Partial | Partial | Yes |
| Scheduler in Go (no GIL) | No | Yes | No | No | Yes |
| Native observability | Add-on | Partial | Partial | Yes | Built-in |
| Mental model | Celery-era | K8s-native | Python-native | Software-defined assets | K8s-native + Airflow vocabulary |
We borrow from Argo Workflows (container-native), from Prefect (modern developer experience), and from Airflow (the UI and vocabulary). We do not pretend we invented any of those. We just put them together in a way nobody had.
Documentation
- Architecture overview
- Architecture Decision Records β every major decision, with its reasoning
- HTTP API reference (Scalar) β also rendered interactively at
/docsin the running server - DAG authoring β writing your first DAG, the Lite β deploy lifecycle
- Quickstart and Installation β getting Leoflow running locally
- Map-reduce for ML Β· Variables & Connections Β· Troubleshooting
- Security policy β how to report vulnerabilities
Engineering Discipline
Leoflow holds itself to a higher bar than most open source projects, because workflow orchestrators must be boring and reliable to be useful:
- Strict TDD β every line of production code is preceded by a failing test (ADR 0011)
- Go Report Card A+ β enforced in CI from the first commit (ADR 0012)
- GoDocs on every exported identifier β no exceptions
- Supply chain security from day one β govulncheck, gosec, Trivy, CodeQL, Scorecard, signed releases (ADR 0014)
- Per-phase coverage floors β rising from 70% to 85% across the MVP phases
- Native observability β Prometheus, OpenTelemetry, structured logs from commit one (ADR 0010)
If you contribute, read the CONTRIBUTING guide first.
License
Apache License 2.0. See LICENSE.
Acknowledgements
Leoflow stands on the shoulders of Apache Airflow. The team behind Airflow defined the vocabulary, proved the architecture, and built the UI that Leoflow reuses without modification in the MVP. This project would not exist without their work, and we credit them at every layer of our documentation.
We also studied the source of Argo Workflows, Prefect, and Dagster carefully. Each made decisions worth borrowing, and we did.
Star the repo if you have ever waited five seconds for an Airflow task to start. Watch the repo if you want a heads-up on every release. Open an issue if you have a chronic Airflow pain we have not addressed yet β pre-1.0 is the time to shape the API.