scrape-dojo

About scrape-dojo

🥷 Master the art of web scraping with JSON-powered workflows. Define scrapes declaratively · Template everything · Run and monitor in style

Platforms: Web, Self-hosted, Docker

Languages: TypeScript

Scrape Dojo

Declarative web scraping & browser automation with JSON workflows

Tech stack: NestJS, Angular, Astro, Puppeteer, TypeScript, Nx, pnpm


[!NOTE] 🤖 AI-Aided Development (AIAD)

This project openly uses AI-assisted development (e.g. Claude Code) to accelerate workflows, improve code quality, and gain more development momentum. All AI-generated code is reviewed and approved by humans. This is not a vibe-coding project, but a deliberate effort to build a useful product while exploring the boundaries, benefits, and trade-offs of AI-aided development.


🥷 What is Scrape Dojo?

Scrape Dojo is a self-hosted web scraping & browser automation platform. Instead of writing Puppeteer code for every site, you define workflows declaratively in JSON/JSONC, much like Infrastructure-as-Code, but for scraping.

Key capabilities:

  • ⚡ 25+ built-in actions: navigate, click, type, extract, loop, download, screenshot, and more
  • 🧩 Handlebars + JSONata: dynamic templates and powerful data transformations
  • ⏰ Scheduling and triggers: automate scrapes with cron, webhooks, or startup triggers
  • 🔐 Encrypted secrets: AES-256-CBC at-rest encryption for credentials
  • 📡 Real-time monitoring: SSE-powered live execution tracking in the Angular UI
  • 🛡️ Auth (optional): JWT, OIDC/SSO, MFA/TOTP, API keys
  • 🗄️ Multi-DB: SQLite (default), MySQL, PostgreSQL

[!IMPORTANT] Scrape Dojo automates real browser interactions. Please respect website terms of service and applicable legal frameworks.

Full documentation: scrape-dojo.com


🐳 Quick Start (Docker)

# 1. Generate encryption key
node -e "console.log(require('crypto').randomBytes(32).toString('hex'))"

# 2. Create docker-compose.yml
cat <<'EOF' > docker-compose.yml
services:
  scrape-dojo:
    image: ghcr.io/disane87/scrape-dojo:latest
    ports:
      - '8080:80'
    environment:
      - SCRAPE_DOJO_ENCRYPTION_KEY=your_generated_key_here
      - SCRAPE_DOJO_AUTH_JWT_SECRET=your_random_jwt_secret_here
      - SCRAPE_DOJO_AUTH_REFRESH_TOKEN_SECRET=your_random_refresh_secret_here
      - DB_TYPE=sqlite
      # - SCRAPE_DOJO_PROXY_URL=http://proxy:8080  # Optional: route scrapes through a proxy
    volumes:
      - ./data:/home/pptruser/app/data
      - ./downloads:/home/pptruser/app/downloads
      - ./logs:/home/pptruser/app/logs
      - ./config:/home/pptruser/app/config
      - ./browser-data:/home/pptruser/app/browser-data
    restart: unless-stopped
EOF

# 3. Start
docker compose up -d

Open http://localhost:8080. The UI and API are served on the same port.

[!WARNING] The SCRAPE_DOJO_ENCRYPTION_KEY encrypts all secrets. Store it safely: if it is lost, existing secrets are unrecoverable.
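
To illustrate why a lost key makes stored secrets unrecoverable, here is a minimal sketch of AES-256-CBC encryption with a 32-byte hex key, using Node's built-in crypto module. The helper names (`encryptSecret`, `decryptSecret`) and the `iv:ciphertext` storage format are assumptions for illustration, not Scrape Dojo's actual implementation.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from 'node:crypto';

// Hypothetical helper: encrypts a secret with a 64-character hex key.
function encryptSecret(plaintext: string, hexKey: string): string {
  const key = Buffer.from(hexKey, 'hex');
  if (key.length !== 32) {
    throw new Error('AES-256 needs a 32-byte key (64 hex characters)');
  }
  const iv = randomBytes(16); // CBC needs a fresh 16-byte IV per message
  const cipher = createCipheriv('aes-256-cbc', key, iv);
  const encrypted = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  // The IV is not secret, so it is stored next to the ciphertext (assumed format).
  return `${iv.toString('hex')}:${encrypted.toString('hex')}`;
}

// Hypothetical helper: reverses encryptSecret; requires the same key.
function decryptSecret(payload: string, hexKey: string): string {
  const [ivHex, dataHex] = payload.split(':');
  const key = Buffer.from(hexKey, 'hex');
  const decipher = createDecipheriv('aes-256-cbc', key, Buffer.from(ivHex, 'hex'));
  const decrypted = Buffer.concat([decipher.update(Buffer.from(dataHex, 'hex')), decipher.final()]);
  return decrypted.toString('utf8');
}

// Same key generation as step 1 of the Quick Start.
const demoKey = randomBytes(32).toString('hex');
const stored = encryptSecret('my-password', demoKey);
console.log(decryptSecret(stored, demoKey)); // prints the original plaintext
```

Without the original key, the stored payload is only an IV and ciphertext, so there is nothing to decrypt it with. Back up the key separately from the data volume.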

For local development, environment variables, auth setup, and more: see the Quickstart Guide.


⚑ Your First Scrape

Create config/sites/my-first-scrape.jsonc:

{
  "$schema": "../scrapes.schema.json",
  "scrapes": [
    {
      "id": "my-first-scrape",
      "metadata": {
        "description": "Read a page title",
        "triggers": [{ "type": "manual" }],
      },
      "steps": [
        {
          "name": "Main",
          "actions": [
            {
              "name": "open",
              "action": "navigate",
              "params": { "url": "https://example.com" },
            },
            {
              "name": "title",
              "action": "extract",
              "params": { "selector": "h1" },
            },
            {
              "name": "log",
              "action": "logger",
              "params": { "message": "Title: {{previousData.title}}" },
            },
          ],
        },
      ],
    },
  ],
}
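
The `{{previousData.title}}` placeholder in the logger step is a template expression. As a rough sketch of the idea, the snippet below resolves dot-path placeholders against a context object. It is a simplified stand-in for Handlebars, and the context shape (earlier results keyed by action name under `previousData`) is an assumption inferred from the config above.

```typescript
// Simplified stand-in for Handlebars: resolves {{dot.path}} placeholders only.
type Context = Record<string, unknown>;

// Walks a dot-separated path through nested objects; returns undefined if any part is missing.
function resolvePath(context: Context, path: string): unknown {
  return path.split('.').reduce<unknown>((current, key) => {
    if (current !== null && typeof current === 'object') {
      return (current as Record<string, unknown>)[key];
    }
    return undefined;
  }, context);
}

// Replaces each {{path}} with its value from the context, or an empty string if missing.
function renderTemplate(template: string, context: Context): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (_match, path: string) => {
    const value = resolvePath(context, path);
    return value === undefined || value === null ? '' : String(value);
  });
}

// Assumed context: the "title" action stored its extracted text under previousData.title.
const context: Context = { previousData: { title: 'Example Domain' } };
console.log(renderTemplate('Title: {{previousData.title}}', context));
// prints "Title: Example Domain"
```

Real Handlebars adds helpers, blocks, and escaping on top of this basic lookup; see the Templates page in the docs for what Scrape Dojo supports.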

The scrape auto-appears in the UI (hot reload). Click Run or use the API:

curl http://localhost:8080/api/scrape/my-first-scrape
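
The same request can be made from a script. This is a minimal sketch that mirrors the curl call with the global `fetch` available in Node 18+. The function names are hypothetical, the response format is not documented in this README, and it assumes auth is disabled (with auth enabled, the request would need credentials).

```typescript
// Builds the run URL used by the curl example above.
function buildRunUrl(baseUrl: string, scrapeId: string): string {
  const trimmed = baseUrl.replace(/\/+$/, ''); // drop trailing slashes
  return `${trimmed}/api/scrape/${encodeURIComponent(scrapeId)}`;
}

// Triggers a scrape with a GET request, as curl does by default.
async function runScrape(baseUrl: string, scrapeId: string): Promise<string> {
  const response = await fetch(buildRunUrl(baseUrl, scrapeId));
  if (!response.ok) {
    throw new Error(`Scrape request failed with HTTP ${response.status}`);
  }
  // The response format is not documented here, so return the raw body.
  return response.text();
}

console.log(buildRunUrl('http://localhost:8080/', 'my-first-scrape'));
// prints "http://localhost:8080/api/scrape/my-first-scrape"
```

Calling `runScrape('http://localhost:8080', 'my-first-scrape')` requires the container from the Quick Start to be running.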

📖 Documentation

Everything else lives in the docs:

| Topic | Link |
| --- | --- |
| 🚀 Quickstart (Docker & Source) | Getting Started |
| 📝 Config format & metadata | Configuration |
| ⚡ All built-in actions with examples | Actions Reference |
| 🧩 Templates & JSONata | Templates |
| ⏰ Scheduling & triggers | Scheduling |
| 🔐 Secrets & variables | Secrets & Variables |
| ⚙️ Environment variables | Env Reference |
| 🏗️ Architecture & API | Developer Guide |
| 🛡️ Auth (JWT/OIDC/MFA) | Authentication |
| 💡 Full examples | Examples |

🛠️ Development

git clone https://github.com/disane87/scrape-dojo.git && cd scrape-dojo
pnpm install
cp .env.example .env  # Set SCRAPE_DOJO_ENCRYPTION_KEY
pnpm start            # API (3000) + UI (4200)
pnpm test             # All tests

| Command | What it does |
| --- | --- |
| pnpm start | API + UI dev servers |
| pnpm test | All tests |
| pnpm test:api | API tests only |
| pnpm test:ui | UI tests only |
| pnpm lint | Lint all projects |
| pnpm build | Build all apps |

Commits follow Conventional Commits (feat:, fix:, docs:, etc.).
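
As a quick illustration of the commit format, the sketch below checks a commit header against the basic Conventional Commits shape (`type(scope)!: description`). The list of accepted types is an assumption based on commonly used ones; the project's own tooling may accept a different set.

```typescript
// Commonly used Conventional Commits types; the project's accepted list is an assumption.
const COMMIT_TYPES = ['feat', 'fix', 'docs', 'style', 'refactor', 'perf', 'test', 'build', 'ci', 'chore', 'revert'];

// type, optional (scope), optional "!" for breaking changes, then ": " and a description.
const HEADER_PATTERN = new RegExp(`^(${COMMIT_TYPES.join('|')})(\\([\\w.-]+\\))?!?: \\S.*$`);

// Hypothetical helper: returns true when the first line of a commit message matches.
function isConventionalHeader(header: string): boolean {
  return HEADER_PATTERN.test(header);
}

console.log(isConventionalHeader('feat(api): add webhook trigger')); // true
console.log(isConventionalHeader('added webhook trigger'));          // false
```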


🤝 Contributing

  • 🐛 Issues & bugs: GitHub Issues
  • 💡 Feature requests: New Issue
  • 🔀 Pull requests: Fork → branch → commit → PR

📄 License

MIT. Use it however you like.


🌟 Contributors

Made with ❤️ by Marco Franke

Documentation · Issues · Discussions