Home
Softono
clawdcursor

clawdcursor

Open source MIT TypeScript
342
Stars
51
Forks
12
Issues
2
Watchers
1 week
Last Commit

About clawdcursor

MCP-powered fallback layer that lets AI agents execute tasks through the GUI when APIs, tools, or direct integrations are unavailable. Cross-OS, accessibility-first, local-only.

Platforms

Web Self-hosted Linux Windows macOS

Languages

TypeScript

Clawd Cursor

Clawd Cursor

The local MCP server that gives any agent safe desktop control.
Any model. Any app. One MCP entry. Local-only.

MIT license Latest release Node 20+ Cross-platform Tests CodeQL Discord

Quickstart · Why · Toolbox · How it works · Platforms · Changelog


The fallback execution layer

Clawd Cursor is a local MCP server. Install it once. Any tool-calling agent on the machine — Claude Code, Cursor, Windsurf, OpenClaw, Claude Agent SDK, your own loop — connects via MCP and gets safe control of the real desktop. The agent clicks, types, reads the screen, opens apps, and drives any GUI the same way a human would.

No cloud. No telemetry by default. Server binds to 127.0.0.1. Screenshots stay in RAM unless you point a cloud model at them. With Ollama or any local model, nothing leaves the machine.

Single safety.evaluate() chokepoint. Every tool call — whether it comes from an editor host over stdio, from an external agent over HTTP, or from the built-in autonomous loop — routes through one safety gate before it touches the desktop. The agent cannot bypass this path.

Bearer-token auth on HTTP. The daemon binds to 127.0.0.1:3847. Every HTTP request needs Authorization: Bearer $(cat ~/.clawdcursor/token). Local-only by default; the bind address is configurable.

If a human can do it on a screen, your AI can do it too. No API? No integration? No problem.

No task is impossible. GUI plus a mouse plus a keyboard equals everything you need. There is no "I can't do that in this app" — only the right sequence of reads, clicks, keys, and waits. Clawd Cursor gives you all of them.

It's model-agnostic (Claude, GPT, Gemini, Llama, Kimi, Ollama, …), app-agnostic (drives any window via accessibility, OCR, or vision fallback), and OS-agnostic (one PlatformAdapter covers Windows, macOS, Linux X11, and Linux Wayland).

Use as a fallback, not first choice. Native API exists? Use it. CLI exists? Use it. Direct file edit possible? Do that. A Playwright script already wired up? Use that. Clawd Cursor is for the last mile — the click, the legacy app, the GUI with no public surface.


Toolbox — 6 compound tools (recommended)

Two catalogs ship side-by-side. The toolbox (this section) is 6 compound tools, each with an action enum that covers ~10-15 verbs. Tools (next section) is the 94 underlying granular primitives, one schema per verb.

Compound is the default surface. Catalog footprint is ~1,500 tokens (about 12× smaller than granular), which keeps small models focused on the action choice instead of drowning in primitives. Same computer_20250124 shape Anthropic uses, so editor hosts already know how to drive it.

Toolbox Actions
computer screenshot, click, double_click, right_click, triple_click, hover, scroll, scroll_horizontal, drag, drag_path, type, key, wait
accessibility read_tree, find, get_element, focused, invoke, focus, set_value, get_value, expand, collapse, toggle, select, state, list_children, wait_for
window list, active, focus, maximize, minimize, restore, close, resize, list_displays, screen_size, open_app, open_file, open_url, switch_tab, navigate
system clipboard_read, clipboard_write, system_time, ocr, undo, shortcuts_list, shortcuts_run, delegate, detect_webview, relaunch_with_cdp, system_prompt
browser connect, page_context, read_text, click, type, select_option, evaluate, wait_for, list_tabs, switch_tab, scroll
task {instruction: string} — delegate the whole task to the built-in thin agent loop (the configured model takes the wheel: perceive → act → iterate until done). No action enum. Requires clawdcursor agent with an LLM configured (clawdcursor doctor) — unavailable under --no-llm or stdio clawdcursor mcp. If your agent has its own brain, drive the other five toolboxes directly instead.
batch {steps: [...]} — collapse N tool calls into one round-trip. Each step is {name, arguments, expect?}. The executor re-perceives before each expect guard, routes every step through the same safety gate, and halts with a per-step trace on any guard miss, safety stop, or error. Use dryRun:true to pre-scan tiers. The efficiency lever for a driving agent: N calls → 1.

A typical turn:

computer({ action: "key", combo: "mod+s" })          // resolves to Cmd+S / Ctrl+S
accessibility({ action: "invoke", name: "Send" })
window({ action: "open_app", name: "Outlook" })
system({ action: "ocr" })                            // OS-level OCR, no LLM vision
task({ instruction: "open Notepad and type hello" }) // delegates to the thin agent loop
batch({ steps: [                                     // collapse N calls into 1 round-trip
  { name: "accessibility", arguments: { action: "set_value", name: "To", value: "[email protected]" } },
  { name: "accessibility", arguments: { action: "set_value", name: "Subject", value: "Hi" } },
  { name: "computer",      arguments: { action: "type", text: "Body here." } }
]})

Quickstart

Sixty seconds from zero to a tool-calling agent on your desktop.

Pick your mode first:

Your situation Use Why
AI lives in your editor (Claude Code, Cursor, Windsurf, Zed) clawdcursor mcp stdio MCP server. You never run this yourself — the editor/MCP host spawns it on demand from its config (you just add the JSON below). No daemon, no port.
You're building an agent that runs unattended clawdcursor agent HTTP MCP daemon on 127.0.0.1:3847. Has its own LLM brain optionally configured via doctor.
Your agent has its own brain — you just want the tools as an HTTP endpoint clawdcursor agent --no-llm Same daemon, no built-in agent loop, no scheduler startup, no credential validation. Pure tool surface.

Simplest — any OS (now on npm):

npm i -g clawdcursor

Works as-is on Windows and Linux. On macOS, also run clawdcursor grant afterward to build the native helper (Accessibility + Screen Recording). The OS installer scripts below do this step for you.

Or one line per OS (clones the repo, builds, and handles the macOS native build automatically):

Windows (PowerShell):

powershell -c "irm https://clawdcursor.com/install.ps1 | iex"

macOS / Linux:

curl -fsSL https://clawdcursor.com/install.sh | bash

Then:

clawdcursor consent --accept   # one-time desktop-control consent (required)
clawdcursor doctor             # verify permissions + (optionally) configure an LLM provider
clawdcursor agent              # OR `clawdcursor mcp` — see the table above

The installer clones into ~/clawdcursor, runs npm install, builds, and npm links a global shim. Runtime state lives at ~/.clawdcursor/ (auth token, pidfiles, logs). It does not edit any agent host config — that step is below.

Wire it into Claude Code, Cursor, Windsurf, or Zed:

// ~/.claude/settings.json  (or your editor's MCP config)
{
  "mcpServers": {
    "clawdcursor": {
      "command": "clawdcursor",
      "args": ["mcp", "--compact"]
    }
  }
}

That's it. Ask your agent to "open Outlook and reply to the latest email from Sarah" and watch it run.

Don't run clawdcursor mcp in a terminal yourself — your editor launches it automatically over stdio when it needs the server. The only commands you run by hand are the install, consent, and doctor steps above.

Editor permission allowlist (Claude Code, Cursor, …). If your editor maintains a per-tool permission allowlist (keys like mcp__clawdcursor__window), use the server-level wildcard "mcp__clawdcursor" instead. It covers every tool in one entry and is immune to tool renames across versions — per-tool entries silently break whenever a tool is added, removed, or renamed.

macOS first run. Run clawdcursor grant to walk through the permission dialogs, then open System Settings → Privacy & Security and enable the entry named ClawdCursor under both Accessibility and Screen Recording. v1.0.0 consolidates all desktop control under this single native-app identity — both entries are required. Linux: install tesseract-ocr, python3-gi, gir1.2-atspi-2.0, and (Wayland only) ydotool or wtype.


Why Clawd Cursor

Most "let an agent use the computer" tools are browser-only, single-OS, or vision-only. Clawd Cursor is the cross-OS, accessibility-first, MCP-native one — with a single safety gate every call routes through.

Clawd Cursor browser-use Playwright computer-use
Any desktop app, not just web web only web only
Cross-OS (Win + macOS + Linux) runs in a sandbox
Accessibility-first, not pixel-only ✅ a11y → OCR → vision DOM DOM vision only
Any model / vendor not an agent Claude only
MCP-native (one config, any host) library test framework tool-use API
Single safety chokepoint
Local-only, no cloud required screenshots → cloud

Two mechanisms the others don't have:

  • Cheapest-tier-first by design. Accessibility tree (free) → OCR (cheap) → screenshot (medium) → vision (expensive); the agent climbs only when it must, so token cost tracks task difficulty. The batch tool collapses deterministic stretches into one round-trip for additional efficiency.
  • One protocol, two transports. MCP over stdio for editor hosts, MCP over HTTP for daemons — same catalog, same JSON-RPC envelope.

How it works

Where the brain lives decides how clawdcursor is used. Both modes can run side-by-side — the daemon and editor-spawned stdio child are independent processes.

Brain lives... Mode Command What you call
In your editor (Claude Code, Cursor, Windsurf, Codex, Zed) Direct tools clawdcursor mcp Each tool individually, via stdio MCP
In a headless agent with its own LLM (OpenClaw, Claude Agent SDK, your own loop) Direct tools clawdcursor agent --no-llm Same, over HTTP MCP
Inside clawdcursor itself (scheduled tasks, "submit a task and walk away") Thin agent loop clawdcursor agent + doctor-configured LLM submit_task (or scheduled_task_create)
External brain that delegates sub-tasks to the built-in loop Direct tools + delegation clawdcursor agent + your client Direct tools normally; call task({instruction:...}) to hand off a sub-task to the built-in thin loop

Direct tools — your agent drives

Your LLM picks the calls; clawdcursor supplies safe actuation and fresh observations from the real desktop. This is the primary mode for any agent with its own reasoning loop.

flowchart TB
    task["User task"] --> loop["External agent LLM loop<br/>plans, chooses tools, verifies"]
    loop --> observe{"Cheapest observation<br/>that answers the question"}

    observe -- "obs·a11y — free<br/>accessibility.read_tree/find/get_element<br/>window.list/active" --> a11y["A11y observation<br/>(structured text + element handles)"]
    observe -- "obs·ocr — cheap<br/>system.ocr<br/>a11y tree empty or sparse" --> ocr["OCR observation<br/>(OS-level text, no vision LLM)"]
    observe -- "obs·dom — medium<br/>browser.read_text / page_context<br/>WebView / Electron / Chrome" --> dom["DOM observation<br/>(CDP, structured browser content)"]
    observe -- "obs·vision — expensive<br/>computer.screenshot<br/>canvas-only or pixel reasoning" --> vision["Vision observation<br/>(image bytes into LLM context)"]

    a11y --> act
    ocr --> act
    dom --> act
    vision --> act

    loop -- "delegate subtask" --> handoff["task({instruction:...})<br/>hand off to thin loop"]
    handoff --> thinloop["Thin agent loop<br/>(daemon LLM)"]
    thinloop --> safety

    act["Act on the desktop<br/>computer.click/type/key/drag<br/>accessibility.invoke/set_value<br/>window.open_app<br/>system.shortcuts_run<br/>browser.click/type<br/>batch — N steps in 1 call"] --> safety

    safety["Single safety gate<br/>safety.evaluate()<br/>allow / confirm / block"] -- allowed --> tools["clawdcursor tool registry<br/>94 granular + 6 compound"]
    safety -- needs user --> confirm["Human confirmation"] --> tools
    safety -- denied --> blocked["blocked"]

    tools --> desktop["Real desktop<br/>native app · browser · canvas"]
    desktop --> freshobs["Fresh observation<br/>(obs·a11y → obs·ocr → obs·dom → obs·vision)"]
    freshobs --> verify{"Does state match goal?"}

    verify -- pass --> done["done"]
    verify -- fail --> retry["retry with new state"]
    retry --> loop

    classDef input fill:#f8fafc,stroke:#64748b,color:#0f172a;
    classDef agentNode fill:#dbeafe,stroke:#2563eb,color:#0f172a;
    classDef gate fill:#ede9fe,stroke:#7c3aed,color:#0f172a;
    classDef desktopNode fill:#dcfce7,stroke:#16a34a,color:#0f172a;
    classDef obsNode fill:#fef9c3,stroke:#ca8a04,color:#0f172a;
    classDef actNode fill:#ffedd5,stroke:#ea580c,color:#0f172a;
    classDef handoffNode fill:#d1fae5,stroke:#047857,color:#0f172a;
    classDef stop fill:#fee2e2,stroke:#dc2626,color:#0f172a;

    class task,done input;
    class loop,verify,retry agentNode;
    class safety,confirm,tools gate;
    class desktop,freshobs desktopNode;
    class observe,a11y,ocr,dom,vision obsNode;
    class act actNode;
    class handoff,thinloop handoffNode;
    class blocked stop;

The loop: read the a11y tree (cheap) → act on named targets → verify from fresh observations → escalate perception only when needed (OCR → screenshot). Sparse a11y tree? Call system.detect_webview — Electron/WebView2 apps render inside Chromium, switch to browser.* via CDP. Canvas-only (Paint, Figma, games)? Screenshot + coord click.

batch for deterministic stretches. When the next N steps are known (no mid-sequence branching), collapse them into one batch call. Each step still routes through the safety gate; on any guard miss, safety stop, or error the batch halts and returns a per-step trace.

Task delegation. When the daemon has an LLM configured, your external agent can delegate at any point by calling task({"instruction":"&hellip;"}). The built-in thin loop takes the wheel, reasons and acts using the configured model, and reports back. Useful for delegating grunt work to a cheaper model — e.g. "open Outlook and reply to Sarah's latest about budget" — without burning your own LLM context on the step-level details.

Thin agent loop — clawdcursor drives

You hand off a task in plain English (submit_task, the web dashboard at :3847/, or a scheduled_task_create cron tick). The configured model perceives the desktop, selects tools, and iterates until the task is done or the turn budget is exhausted.

Single safety chokepoint. Every tool call — direct or via the thin loop — routes through safety.evaluate(). The agent cannot bypass this path; it is the only way tools execute.


Transports

One protocol — MCP — two transports. Same catalog, same JSON-RPC envelope.

Transport When to use Client config
stdio MCP Editor hosts: Claude Code, Cursor, Windsurf, Zed. Tools appear on demand — no daemon. {"command": "clawdcursor", "args": ["mcp", "--compact"]}
HTTP MCP Bring-your-own-agent, headless daemons, multi-process orchestration, Claude Agent SDK. POST JSON-RPC to http://127.0.0.1:3847/mcp. Run clawdcursor agent. Then tools/list returns the catalog and tools/call invokes any tool. Bearer token at ~/.clawdcursor/token.

Both transports are stateless. No session-init handshake. Bearer-token auth on every HTTP request; stdio inherits the parent process's trust.

# HTTP MCP — list tools
curl -s -X POST http://127.0.0.1:3847/mcp \
  -H "Authorization: Bearer $(cat ~/.clawdcursor/token)" \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list"}'

Tools — 94 granular primitives

The flat catalog. Each of the 6 compound toolboxes above dispatches to one of these under the hood. Use this surface directly when:

  • Compatibility — your agent runtime requires every action as a top-level MCP tool (no action enum). Run the daemon without --compact (granular is the default for clawdcursor agent) to expose them.
  • Debugging — you want to call a specific primitive directly (key_press, mouse_click, read_screen) without going through the compound dispatcher.

The full catalog — both compact toolboxes and granular tools — is always visible through MCP tools/list on either transport. Authoritative schema lives in schema.snapshot.json.

A typical turn:

key_press({ key: "mod+s" })
invoke_element({ name: "Send" })
open_app({ name: "Outlook" })
ocr_read_screen()
// ...94 tools total

Both forms produce identical effects through the same safety.evaluate() chokepoint.


Cost Tiers

Every perception source has a cost. Start at the cheapest rung that works and climb only when it fails — the same discipline whether your agent drives the tools directly or hands a sub-task to the built-in loop via task.

Tier Label Cost Source When to use
T1 structured ~free accessibility.*, window.*, browser.read_text, clipboard Default. Returns text + bounds — no image, no vision LLM.
T2 ocr cheap system({"action":"ocr"}) A11y tree empty or sparse. OS-level OCR — text out, no LLM vision.
T3 screenshot medium computer({"action":"screenshot"}) OCR isn't enough and you need pixel context. Sends an image into LLM context.
T4 vision expensive smart_click, smart_read, smart_type Canvas-only apps (Paint, Figma, games) or spatial reasoning that text can't express. Last resort.

Rule: start at T1. Escalate only when the current tier fails. Apply the same discipline when calling compound tools directly; the built-in thin loop follows it too.


Observe vs Act

Every tool call is one of two kinds: observe (read the current state of the desktop — zero side effects) or act (change it). The log badge on each tool-call line tells you which, and which observation channel was used. This makes the cheap-first ladder visible at a glance as a task runs.

Kind Log badge What it does Example tools
Observe — a11y obs·a11y Read the accessibility tree (structured text + bounds, free) accessibility.read_tree, accessibility.find, accessibility.get_element, accessibility.get_value, accessibility.focused, window.list, window.active, accessibility.list_children, accessibility.wait_for
Observe — OCR obs·ocr Read on-screen text via OS OCR engine (no LLM vision) system.ocr (ocr_read_screen), system.smart_read
Observe — DOM obs·dom Read the browser DOM via CDP (Electron / WebView2 / Chrome) browser.read_text (cdp_read_text), browser.page_context (cdp_page_context)
Observe — vision obs·vision Take a screenshot (image bytes enter LLM context — most expensive) computer.screenshot (desktop_screenshot, desktop_screenshot_region, screenshot_full)
Act act Change the screen: click, type, key, scroll, drag, open, invoke computer.click/type/key/drag/scroll, accessibility.invoke/set_value/focus, window.open_app/open_url/open_file, system.clipboard_write, browser.click/type, batch, task

The discipline: prefer obs·a11y first — it returns structured text and element handles for free. If the a11y tree is empty or sparse, try obs·ocr. If the target is inside a WebView or Electron shell, use obs·dom via CDP. Only escalate to obs·vision (screenshot) when pixel context is genuinely needed. Act once you have enough information, then observe again to verify.

The badge column in the live log (CLAWD_LOG=pretty, the default on a TTY) shows this ladder in real time: you can watch obs·a11yactobs·a11y on a normal turn, and see when the agent is forced to climb to obs·ocr or obs·vision.

Derived from src/tools/cost-class.ts (authoritative cost-class table) + src/core/observability/logger.ts (observeActBadge).


Platform Support

Platform-specific code lives in src/platform/{windows,macos,linux}.ts (plus wayland-backend.ts) behind a single PlatformAdapter interface. Business logic never reads process.platform. Roughly 3,750 LOC across the four adapters.

Platform UI Automation OCR Browser (CDP) Input
Windows 10/11 (x64 / ARM64) UIA via PowerShell bridge Windows.Media.Ocr Chrome / Edge nut-js
macOS 12+ (Intel / Apple Silicon) JXA + System Events (TCC-safe) Apple Vision Chrome / Edge nut-js + System Events
Linux X11 AT-SPI via python3-gi Tesseract Chrome / Edge nut-js
Linux Wayland AT-SPI via python3-gi Tesseract Chrome / Edge ydotool / wtype

Per-OS setup notes:

  • Windows — no setup. PowerShell bridge spawns on demand.
  • macOS — first run needs Accessibility + Screen Recording in System Settings > Privacy & Security. clawdcursor grant walks the dialogs; enable the entry named ClawdCursor under both categories. Retina / HiDPI handled in the adapter; do not pre-scale coordinates.
  • Linux X11apt install tesseract-ocr python3-gi gir1.2-atspi-2.0 (or your distro's equivalent).
  • Linux Wayland — same a11y packages, plus ydotool + a running ydotoold daemon (preferred) or wtype (keyboard only).

Architecture

Five directories. Everything else is a leaf module.

Directory What lives here
src/core/ Thin agent loop (agent.ts, runAgent), sense layer (a11y/snapshot/fingerprint), focus guard, safety gate.
src/tools/ The 94 granular tools + 6 compound aggregators + batch, playbooks (find-replace, extract-compose), tool registry, dispatch.
src/platform/ PlatformAdapter interface + Windows / macOS / Linux / Wayland implementations, OCR engine, CDP driver, URI handler.
src/llm/ Provider clients (Claude, GPT, Gemini, Llama, Kimi, Ollama, …), credentials, model config.
src/surface/ CLI (clawdcursor), MCP server (stdio + HTTP), dashboard, doctor, onboarding, readiness probes.

The PlatformAdapter is the only thing platform code talks to. The safety.evaluate() chokepoint is the only way tools execute. Those two seams are the whole point of the architecture.


Safety & Privacy

Tier Actions Behavior
Auto Reading, opening apps, navigation, typing into non-sensitive fields Executes immediately
Preview Form fill, arbitrary input Logged before executing
Confirm Sends, deletes, purchases, transfers Pauses for user approval
Block Alt+F4 / Cmd+Q of the agent shell, Ctrl+Alt+Delete, Shift+Delete, power chords Refused outright

Hardening summary:

  • Network isolation. Server binds to 127.0.0.1. Verify with netstat -an | findstr 3847 (Windows) or | grep 3847 (Unix).
  • Bearer-token auth. Every HTTP request needs Authorization: Bearer $(cat ~/.clawdcursor/token).
  • Sensitive-app policy. Email, banking, password managers, private messaging auto-elevate to Confirm. The agent must ask the user before acting on these surfaces.
  • No telemetry by default. Nothing phones home on its own. Screenshots stay in RAM; with Ollama or any local model, nothing leaves the machine; with a cloud provider, screenshots go only to the endpoint you configured. The one exception is opt-in: clawdcursor report lets you manually send a diagnostic snapshot when you want help, and it previews exactly what's included before sending.
  • Prompt-injection defense. Screen text returned inside <untrusted-screen-content> tags is treated as data, never as instructions.
  • Log privacy. JSON logs at ~/.clawdcursor/logs/ redact password-field values (AXSecureTextField, UIA IsPassword=true).

See SECURITY.md for the private vulnerability reporting channel.


CLI

The CLI is for humans diagnosing an install or managing the guide cache. Agents should connect via MCP (stdio for editor hosts, HTTP for daemons).

# Install + setup
clawdcursor consent         Manage desktop-control consent (--accept / --revoke / --status)
clawdcursor grant           Grant macOS permissions (interactive, macOS only)
clawdcursor doctor          Verify permissions, configure AI provider + models
clawdcursor status          Readiness check (consent, permissions, AI config)

# Run
clawdcursor mcp             MCP stdio server — primary transport for editor hosts
clawdcursor agent           Daemon: HTTP MCP at /mcp on :3847, optional built-in thin loop
clawdcursor agent --no-llm  Daemon, tool surface only (no built-in brain/scheduler)
clawdcursor stop            Stop every running mode
clawdcursor uninstall       Remove all clawdcursor config and data

# Manual end-to-end testing only — agents should call submit_task via MCP.
clawdcursor task <t>        Send a task to the running agent

Options:
  --port <port>          Default: 3847
  --compact              MCP only: expose compact tools instead of 94 granular
  --provider <name>      `agent` only: anthropic | openai | gemini | ollama | ...
  --accept               `agent` and `consent` only: skip the consent prompt

Development

git clone https://github.com/AmrDab/clawdcursor.git
cd clawdcursor
npm install
npm run build       # tsc + postbuild
npm test            # vitest
npm run lint        # eslint
npm run typecheck   # tsc --noEmit
npm link            # global `clawdcursor` shim (Unix) — use Admin shell on Windows

The build emits dist/. Entry point: dist/surface/cli.js. Tests run on Node 20 and 22 against Ubuntu, macOS, and Windows in CI.


Tech Stack

TypeScript · Node.js 20+ · nut-js · Playwright · sharp · Express · Model Context Protocol SDK · Zod · commander


Contributing

PRs welcome. See CONTRIBUTING.md for the development loop, branch conventions, and the test matrix every change has to clear. Bug reports and feature requests go in issues; private security reports go to the channel listed in SECURITY.md.

License

MIT — see LICENSE.

Acknowledgments

Built on the shoulders of the Model Context Protocol SDK, nut-js, Playwright, the Anthropic computer_20250124 tool shape, and the AT-SPI / UIA / AX trees that make app-agnostic GUI automation possible at all.


clawdcursor.com · Discord · Changelog