About mac_action_orchestrator

Published by

xiyaoeva

README.md

Mac Action Orchestrator

Local-first macOS action orchestrator with FastAPI + Uvicorn + Gemini planner.

1. Prerequisites

macOS
Python 3.10+ (recommended: 3.11 or 3.12)
pip available
Apple Vision Framework (built into macOS) for OCR
Xcode Command Line Tools (required for swift runtime): xcode-select --install

Why 3.10+:

Python 3.9 is already EOL and may trigger warnings with newer google-auth.
System Python on older macOS may use LibreSSL, which can cause urllib3 warnings.
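To see which interpreter and TLS library you are actually running, a short check like the one below can help. It is a minimal sketch using only the standard library; the function name `check_runtime` is illustrative and not part of this project.

```python
import ssl
import sys


def check_runtime(version_info=sys.version_info, ssl_version=ssl.OPENSSL_VERSION):
    """Return a list of human-readable warnings about the Python runtime."""
    warnings = []
    # This README asks for Python 3.10 or newer.
    if tuple(version_info[:2]) < (3, 10):
        warnings.append(
            f"Python {version_info[0]}.{version_info[1]} is older than 3.10"
        )
    # System Python on older macOS is often linked against LibreSSL.
    if "LibreSSL" in ssl_version:
        warnings.append(f"ssl module is linked against {ssl_version}")
    return warnings


if __name__ == "__main__":
    for line in check_runtime() or ["Runtime looks fine"]:
        print(line)
```

Run it with the same `python3` you plan to use for the virtual environment; an empty warning list suggests neither issue applies.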

OCR note:

Vision OCR is executed through a Swift script (Vision/AppKit/Foundation).
On a fresh machine/session, the first OCR call may be noticeably slower due to cold-start compilation/loading, and this can cause the first run to fail (for example, OCR timeout). A retry usually succeeds once warmup is complete.
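The "retry once after warm-up" advice can be sketched as a small wrapper. This is an illustration, not the project's code: `run_with_warmup_retry` and the `ocr_call` argument are hypothetical names, and it assumes the OCR step signals a timeout by raising `subprocess.TimeoutExpired` or `TimeoutError`.

```python
import subprocess
import time


def run_with_warmup_retry(ocr_call, retries=1, delay_seconds=1.0):
    """Call ocr_call(); if it times out, wait briefly and try again."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return ocr_call()
        except (subprocess.TimeoutExpired, TimeoutError) as exc:
            # Assumption: a cold-start failure shows up as a timeout.
            last_error = exc
            if attempt < retries:
                time.sleep(delay_seconds)
    # Every attempt timed out, so surface the last error to the caller.
    raise last_error
```

Only timeouts are retried here, so other failures (for example a missing `swift` binary) still surface immediately.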

2. Setup

In project root:

cd /path/to/mac-action-orchestrator
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt --index-url https://pypi.org/simple

If you copied this project and .venv did not exist before, creating it is expected.

3. Config

Use the local-safe config. By default, no changes are required:

{
  "host": "localhost",
  "user": "local",
  "remote_tmp_screen_path": "/tmp/agent_screen.png",
  "ssh_options": [],
  "rate_limit_seconds": 2
}

The current code is set to local execution mode for hackathon use.
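One way to read "no setting is required" is that the values above act as defaults. The sketch below shows that pattern. It is an assumption about how such a config could be loaded, not a description of the app's loader; `load_config` and `DEFAULT_CONFIG` are hypothetical names.

```python
import copy
import json
from pathlib import Path

# Defaults copied from the local-safe config shown above.
DEFAULT_CONFIG = {
    "host": "localhost",
    "user": "local",
    "remote_tmp_screen_path": "/tmp/agent_screen.png",
    "ssh_options": [],
    "rate_limit_seconds": 2,
}


def load_config(path="config.json"):
    """Return the defaults, overridden by any keys found in the JSON file."""
    config = copy.deepcopy(DEFAULT_CONFIG)
    config_path = Path(path)
    # A missing file is fine: the local-safe defaults are used unchanged.
    if config_path.exists():
        config.update(json.loads(config_path.read_text(encoding="utf-8")))
    return config
```

With this shape, a `config.json` only needs to list the keys that differ from the defaults.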

4. Run

source .venv/bin/activate
uvicorn app:app --reload --port 8000

Open a new terminal window:

open -na "Google Chrome" --args --incognito http://127.0.0.1:8000

(Open this URL in a Chrome Incognito window.)

Permission note:

On first runs, macOS may show multiple permission dialogs.
Grant all required permissions (Accessibility, Screen Recording, Automation) before expecting stable end-to-end runs.
It is normal if the first few runs fail while permissions are still being approved.
In general, you cannot fully pre-grant these permissions from code before running; macOS requires user approval dialogs.
Permission setup guide (step-by-step for Terminal): https://docs.google.com/document/d/1iAPZq1aGQyLF7jpY-fye3rQRxDQVX5VCJIG8thkapnE/edit?usp=sharing

Prompt examples:

Open https://www.google.com in Chrome and then create a new tab, then move to the tab on the left, input wiki in this tab then search. On the results page, click the lowest (bottom-most) Wikipedia result visible on screen. No scrolling.
Open https://www.youtube.com/watch?v=5OR4c87Xt-E, and play the video from 00:30. Wait 5 seconds, then do the next step.
Open https://www.wikipedia.org in Chrome, search for "Yosemite National Park", press Enter, wait for results, open the print dialog, click Save once, then press Enter to confirm saving as PDF. Wait 5 seconds, then stop.
Open https://en.wikipedia.org/wiki/Main_Page. Repeatedly scroll down one page. After each scroll, look for the visible text "Contact Wikipedia". As soon as it is visible, click it and stop scrolling. (For this prompt, set the maximum number of plan_again rounds to 10 or more, and use a dependable API key.)

5. First-time macOS permissions

You may need to allow permissions for Terminal:

Accessibility
Screen Recording
Automation (System Events / Chrome, if prompted)

Permission Guide:

Use this guide to complete macOS permission setup for Terminal/iTerm: https://docs.google.com/document/d/1iAPZq1aGQyLF7jpY-fye3rQRxDQVX5VCJIG8thkapnE/edit?usp=sharing
It explains what each permission is for and where to enable it in System Settings.

Without these, actions or screen-size/screenshot related APIs can fail.

6. Run completion behavior

After all actions finish, the app shows a completion dialog popup with final status and log name.
The full execution log is saved under static/logs/ (for example: static/logs/run_<timestamp>_<id>.txt).
Example completion screenshot:

7. Common errors and fixes

source: no such file or directory: .venv/bin/activate

You have not created the virtual environment yet.
Run: python3 -m venv .venv

zsh: command not found: uvicorn

Dependencies were not installed in the active venv.
Run: source .venv/bin/activate && pip install -r requirements.txt

AssertionError: jinja2 must be installed

jinja2 is not installed.
Run: pip install -r requirements.txt (or pip install jinja2)

ModuleNotFoundError for Gemini / Google SDK

google-genai is not installed.
Run: pip install -r requirements.txt

POST /api/screen_size returns 500

Usually a macOS permissions issue, or you are running an old version of the code.
Check the permissions listed in section 5.
Confirm you are on the latest local-only code.
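To see the error detail behind the 500, you can call the endpoint directly while the server is running. The sketch below uses only the standard library. The empty JSON body is an assumption (the endpoint may expect a different payload), and `build_probe` and `probe_screen_size` are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request


def build_probe(base_url="http://127.0.0.1:8000"):
    """Build a POST request for /api/screen_size with an empty JSON body."""
    return urllib.request.Request(
        f"{base_url}/api/screen_size",
        data=json.dumps({}).encode("utf-8"),  # assumption: no fields required
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def probe_screen_size(base_url="http://127.0.0.1:8000", timeout=10):
    """Return (status_code, body_text) so the failure reason is visible."""
    try:
        with urllib.request.urlopen(build_probe(base_url), timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive here; the body often names the failure.
        return err.code, err.read().decode("utf-8", "replace")
```

Calling `probe_screen_size()` from a Python shell in the same Terminal app is useful because macOS permissions are granted per application.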

First OCR call is slow (or times out once)

This can happen on first run while Swift/Vision warms up.
Retry once; subsequent OCR calls are typically much faster.

Unexpected SSH password prompt

This means you are likely running an older remote-enabled version.
Pull the latest code and restart the server.

8. Security notes

Do not commit real API keys or machine-specific secrets.
Keep config.json local and sanitized.
Use config.example.json as a template for sharing.
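One possible way to produce a shareable template from a local config is to blank any value whose key looks secret. The sketch below is illustrative only: the `SECRET_HINTS` list and the `sanitize` function are assumptions, not part of this project, and the result should still be reviewed by hand before committing.

```python
# Assumption: keys containing these fragments hold secrets.
SECRET_HINTS = ("key", "token", "secret", "password")


def sanitize(config):
    """Return a copy of config with secret-looking values blanked out."""
    clean = {}
    for name, value in config.items():
        if isinstance(value, dict):
            # Recurse so nested sections are sanitized too.
            clean[name] = sanitize(value)
        elif any(hint in name.lower() for hint in SECRET_HINTS):
            clean[name] = ""
        else:
            clean[name] = value
    return clean
```

For example, `sanitize({"host": "localhost", "api_key": "abc"})` keeps `host` and replaces the `api_key` value with an empty string.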

9. End-to-end example (planning + click logic)

Example prompt:

Open https://www.google.com in Chrome and then create a new tab, then move to the tab on the left, input wiki in this tab then search. On the results page, click the lowest (bottom-most) Wikipedia result visible on screen. No scrolling.

What happens in this project:

Planner stage

You click Run.
/api/plan_actions sends your natural language prompt to Gemini.
Gemini returns structured batches/actions (not raw UI clicks yet), such as:
- open URL
- keyboard shortcuts (new tab / go to tab 1)
- type text + enter
- plan_again / click_at style actions for visual selection
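To make "structured batches/actions" concrete, here is a hypothetical plan for the example prompt. The field names (`batches`, `actions`, `type`, `keys`, and so on) are illustrative assumptions; only the action kinds (open URL, shortcuts, typing, `plan_again`, `click_at`) come from the description above, and the real schema returned by Gemini may differ.

```python
# Hypothetical plan shape; the field names are illustrative only.
example_plan = {
    "batches": [
        {
            "actions": [
                {"type": "open_url", "url": "https://www.google.com"},
                {"type": "shortcut", "keys": ["cmd", "t"]},   # new tab
                {"type": "shortcut", "keys": ["cmd", "1"]},   # go to tab 1
                {"type": "type_text", "text": "wiki", "press_enter": True},
            ]
        },
        {
            "actions": [
                # The visual click is deferred until the results are on screen.
                {"type": "plan_again", "reason": "need results page screenshot"},
            ]
        },
    ]
}


def flatten_actions(plan):
    """Return all actions in execution order, batch by batch."""
    return [action for batch in plan["batches"] for action in batch["actions"]]


if __name__ == "__main__":
    for step in flatten_actions(example_plan):
        print(step["type"])
```

The point of the shape is that the planner describes intent, and the later stages turn the visual steps into screen coordinates.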

Execution stage

The app runs actions locally via AppleScript/JXA.
After key steps (especially before visual clicks), it captures a fresh screenshot.
The run log is written to static/logs/... and shown in UI.

Click logic (important)

For click actions, the app does OCR on the latest screenshot and extracts visible text tokens with bounding boxes.
If the click target text is unique, it computes the center point and clicks directly.
If multiple matches exist, it enters disambiguation:
- generates an annotated image (numbered/marked candidates; “draw circles/boxes” style targeting aid),
- asks Gemini to choose the best candidate index using the screenshot + context prompt,
- executes click on the chosen candidate center.
If page state is not ready or occluded, it may trigger a recovery branch (re-plan or occlusion resolution) and continue.
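The unique-match and disambiguation branches above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: `OcrToken`, `find_candidates`, and `resolve_click` are hypothetical names, boxes are assumed to be in screen pixels with a top-left origin, and the `choose_index` callback stands in for the Gemini call that picks among annotated candidates.

```python
from dataclasses import dataclass


@dataclass
class OcrToken:
    text: str
    x: float       # left edge (assumed: screen pixels, top-left origin)
    y: float       # top edge
    width: float
    height: float

    def center(self):
        """Center point of the bounding box, used as the click position."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def find_candidates(tokens, target):
    """Return tokens whose text contains the target, ignoring case."""
    needle = target.casefold()
    return [token for token in tokens if needle in token.text.casefold()]


def resolve_click(tokens, target, choose_index):
    """Return the (x, y) point to click, or None if nothing matches."""
    candidates = find_candidates(tokens, target)
    if not candidates:
        return None  # the caller may re-plan or take a fresh screenshot
    if len(candidates) == 1:
        return candidates[0].center()  # unique match: click directly
    # Multiple matches: delegate the choice (Gemini in the real flow).
    return candidates[choose_index(candidates)].center()
```

For a quick local test, `choose_index` can be a stand-in such as `lambda c: max(range(len(c)), key=lambda i: c[i].y)`, which picks the bottom-most candidate; in the flow described above that decision is made by Gemini from the annotated screenshot.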

Why this is robust for “lowest Wikipedia result”

The instruction “bottom-most visible result” is treated as a visual grounding problem, not just text matching.
OCR tokens provide concrete on-screen candidates.
Candidate annotation + Gemini selection adds a second pass when simple exact match is ambiguous.
This reduces wrong clicks caused by duplicate text (e.g., multiple “Wikipedia” labels on one page).

Tips for better reliability with this type of prompt:

Include constraints like No scrolling, visible on screen, bottom-most.
Keep target text explicit (Wikipedia result) rather than vague (click that one).
If there are repeated labels, include nearby context in prompt (for example: “in search results list, not top nav”).
