Home
Softono
mac_action_orchestrator

mac_action_orchestrator

Open source MIT Python
35
Stars
7
Forks
0
Issues
6
Watchers
4 months
Last Commit

About mac_action_orchestrator

# Mac Action Orchestrator Local-first macOS action orchestrator with FastAPI + Uvicorn + Gemini planner. ## 1. Prerequisites - macOS - Python 3.10+ (recommended: 3.11 or 3.12) - `pip` available - Apple Vision Framework (built into macOS) for OCR - Xcode Command Line Tools (required for `swift` runtime): `xcode-select --install` Why 3.10+: - Python 3.9 is already EOL and may trigger warnings with newer `google-auth`. - System Python on older macOS may use LibreSSL, which can cause `urllib3` warnings. OCR note: - Vision OCR is executed through a Swift script (`Vision`/`AppKit`/`Foundation`). - On a fresh machine/session, the first OCR call may be noticeably slower due to cold-start compilation/loading, and this can cause the first run to fail (for example, OCR timeout). A retry usually succeeds once warmup is complete. ## 2. Setup In project root: ```bash cd /path/to/mac-action-orchestrator python3 -m venv .venv source .venv/bin/activate pip install -r requirements.txt --index-url https://pypi.org/sim ...

Platforms

Web Self-hosted macOS

Languages

Python

Mac Action Orchestrator

Local-first macOS action orchestrator with FastAPI + Uvicorn + Gemini planner.

1. Prerequisites

  • macOS
  • Python 3.10+ (recommended: 3.11 or 3.12)
  • pip available
  • Apple Vision Framework (built into macOS) for OCR
  • Xcode Command Line Tools (required for swift runtime): xcode-select --install

Why 3.10+:

  • Python 3.9 is already EOL and may trigger warnings with newer google-auth.
  • System Python on older macOS may use LibreSSL, which can cause urllib3 warnings.

OCR note:

  • Vision OCR is executed through a Swift script (Vision/AppKit/Foundation).
  • On a fresh machine/session, the first OCR call may be noticeably slower due to cold-start compilation/loading, and this can cause the first run to fail (for example, OCR timeout). A retry usually succeeds once warmup is complete.

2. Setup

In project root:

cd /path/to/mac-action-orchestrator
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt --index-url https://pypi.org/simple

If you copied this project and .venv did not exist before, creating it is expected.

3. Config

Use local-safe config - By default, no setting is required.:

{
  "host": "localhost",
  "user": "local",
  "remote_tmp_screen_path": "/tmp/agent_screen.png",
  "ssh_options": [],
  "rate_limit_seconds": 2
}

Current code is set to local execution mode for hackathon use.

4. Run

source .venv/bin/activate
uvicorn app:app --reload --port 8000

Open a new terminal window:

open -na "Google Chrome" --args --incognito http://127.0.0.1:8000

(Open this URL in a Chrome Incognito window.)

Permission note:

  • On first runs, macOS may show multiple permission dialogs.
  • Grant all required permissions (Accessibility, Screen Recording, Automation) before expecting stable end-to-end runs.
  • It is normal if the first few runs fail while permissions are still being approved.
  • In general, you cannot fully pre-grant these permissions from code before running; macOS requires user approval dialogs.
  • Permission setup guide (step-by-step for Terminal): https://docs.google.com/document/d/1iAPZq1aGQyLF7jpY-fye3rQRxDQVX5VCJIG8thkapnE/edit?usp=sharing

Prompt examples:

  • Open https://www.google.com in Chrome and then create a new tab, then move to the tab on the left, input wiki in this tab then search. On the results page, click the lowest (bottom-most) Wikipedia result visible on screen. No scrolling.
  • Open https://www.youtube.com/watch?v=5OR4c87Xt-E, and play the video from 00:30. Wait 5 seconds, than do next.
  • Open https://www.wikipedia.org in Chrome, search for "Yosemite National Park", press Enter, wait for results, open the print dialog, click Save once, then press Enter to confirm saving as PDF. Wait 5 seconds, then stop.
  • Open https://en.wikipedia.org/wiki/Main_Page. Repeatedly scroll down one page. After each scroll, look for the visible text "Contact WikiPedia". As soon as it is visible, click it and stop scrolling. (Max plan_again times should be set to 10 or more and you should have a solid API key)

5. First-time macOS permissions

You may need to allow permissions for Terminal:

  • Accessibility
  • Screen Recording
  • Automation (System Events / Chrome, if prompted)

Permission Guide:

  • Use this guide to complete macOS permission setup for Terminal/iTerm: https://docs.google.com/document/d/1iAPZq1aGQyLF7jpY-fye3rQRxDQVX5VCJIG8thkapnE/edit?usp=sharing
  • It explains what each permission is for and where to enable it in System Settings.

Without these, actions or screen-size/screenshot related APIs can fail.

6. Run completion behavior:

  • After all actions finish, the app shows a completion dialog popup with final status and log name.

  • The full execution log is saved under static/logs/ (for example: static/logs/run_<timestamp>_<id>.txt).

  • Example completion screenshot:

    Run completion dialog

7. Common errors and fixes

  1. source: no such file or directory: .venv/bin/activate
  • You have not created venv yet.
  • Run: python3 -m venv .venv
  1. zsh: command not found: uvicorn
  • Dependencies were not installed in the active venv.
  • Run: source .venv/bin/activate && pip install -r requirements.txt
  1. AssertionError: jinja2 must be installed
  • jinja2 missing.
  • Run: pip install -r requirements.txt (or pip install jinja2)
  1. ModuleNotFoundError for Gemini / Google SDK
  • google-genai missing.
  • Run: pip install -r requirements.txt
  1. POST /api/screen_size returns 500
  • Usually macOS permissions issue, or running an old code version.
  • Check permissions listed in section 5.
  • Confirm you are on latest local-only code.
  1. First OCR call is slow (or times out once)
  • This can happen on first run while Swift/Vision warms up.
  • Retry once; subsequent OCR calls are typically much faster.
  1. Unexpected SSH password prompt
  • This means you are likely running an older remote-enabled version.
  • Pull latest code and restart server.

8. Security notes

  • Do not commit real API keys or machine-specific secrets.
  • Keep config.json local and sanitized.
  • Use config.example.json as template for sharing.

9. End-to-end example (planning + click logic)

Example prompt:

Open https://www.google.com in Chrome and then create a new tab, then move to the tab on the left, input wiki in this tab then search. On the results page, click the lowest (bottom-most) Wikipedia result visible on screen. No scrolling.

What happens in this project:

  1. Planner stage
  • You click Run.
  • /api/plan_actions sends your natural language prompt to Gemini.
  • Gemini returns structured batches/actions (not raw UI clicks yet), such as:
    • open URL
    • keyboard shortcuts (new tab / go to tab 1)
    • type text + enter
    • plan_again / click_at style actions for visual selection
  1. Execution stage
  • The app runs actions locally via AppleScript/JXA.
  • After key steps (especially before visual clicks), it captures a fresh screenshot.
  • The run log is written to static/logs/... and shown in UI.
  1. Click logic (important)
  • For click actions, the app does OCR on the latest screenshot and extracts visible text tokens with bounding boxes.
  • If the click target text is unique, it computes the center point and clicks directly.
  • If multiple matches exist, it enters disambiguation:
    • generates an annotated image (numbered/marked candidates; “draw circles/boxes” style targeting aid),
    • asks Gemini to choose the best candidate index using the screenshot + context prompt,
    • executes click on the chosen candidate center.
  • If page state is not ready or occluded, it may trigger a recovery branch (re-plan or occlusion resolution) and continue.
  1. Why this is robust for “lowest Wikipedia result”
  • The instruction “bottom-most visible result” is treated as a visual grounding problem, not just text matching.
  • OCR tokens provide concrete on-screen candidates.
  • Candidate annotation + Gemini selection adds a second pass when simple exact match is ambiguous.
  • This reduces wrong clicks caused by duplicate text (e.g., multiple “Wikipedia” labels on one page).

Tips for better reliability with this type of prompt:

  • Include constraints like No scrolling, visible on screen, bottom-most.
  • Keep target text explicit (Wikipedia result) rather than vague (click that one).
  • If there are repeated labels, include nearby context in prompt (for example: “in search results list, not top nav”).