AnyCrawl



Sponsors

Swiftproxy (https://www.swiftproxy.net/?ref=AnyCrawl): High-performance residential proxies built for scraping, automation, and large-scale data collection. Access 80M+ rotating residential IPs across 195+ countries with stable connections, high anonymity, and developer-friendly integration. Ideal for AI agents, crawlers, browser automation, and anti-bot bypass workflows. Free trial available. Use code PROXY90 for an exclusive 10% discount.

AtlasCloud (https://www.atlascloud.ai/?utm_source=github&utm_medium=sponsor&utm_campaign=AnyCrawl): Atlas Cloud gives developers one API for 300+ models covering video, image, and LLM, including DeepSeek, GPT, Claude, Flux, Kling, and Seedance.

📖 Overview

AnyCrawl is a high‑performance crawling and scraping toolkit:

  • SERP crawling: multiple search engines, batch‑friendly
  • Web scraping: single‑page content extraction
  • Site crawling: full‑site traversal and collection
  • High performance: multi‑threading / multi‑process
  • Batch tasks: reliable and efficient
  • AI extraction: LLM‑powered structured data (JSON) extraction from pages

LLM‑friendly. Easy to integrate and use.

🚀 Quick Start

📖 See full docs: Docs

Generate an API Key (self-host)

If you enable authentication (ANYCRAWL_API_AUTH_ENABLED=true), generate an API key:

pnpm --filter api key:generate
# optionally name the key
pnpm --filter api key:generate -- default

The command prints uuid, key and credits. Use the printed key as a Bearer token.
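As a minimal sketch of using the printed key as a Bearer token, the helper below builds the two headers that the curl examples in this README send. The function name `auth_headers` and the environment variable `ANYCRAWL_API_KEY` are assumptions made for this illustration, not names defined by AnyCrawl.

```python
import os


def auth_headers(api_key: str) -> dict:
    """Return HTTP headers that carry the API key as a Bearer token."""
    if not api_key:
        raise ValueError("API key is empty")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


# ANYCRAWL_API_KEY is a hypothetical variable name chosen for this sketch;
# the fallback is the placeholder used in the curl examples.
headers = auth_headers(os.environ.get("ANYCRAWL_API_KEY", "YOUR_ANYCRAWL_API_KEY"))
```

Reading the key from the environment keeps it out of source control; any other secret store works just as well.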

Run Inside Docker

If running AnyCrawl via Docker:

  • Docker Compose:
docker compose exec api pnpm --filter api key:generate
docker compose exec api pnpm --filter api key:generate -- default
  • Single container (replace <container_name_or_id>):
docker exec -it <container_name_or_id> pnpm --filter api key:generate
docker exec -it <container_name_or_id> pnpm --filter api key:generate -- default

📚 Usage Examples

💡 Use the Playground to test APIs and generate code in your preferred language.

If self‑hosting, replace https://api.anycrawl.dev with your own server URL.
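One way to keep the server URL swappable is to build every endpoint from a single base URL. The sketch below assumes a hypothetical environment variable `ANYCRAWL_BASE_URL` and a hypothetical self-hosted address `http://localhost:8080`; neither is defined by this README.

```python
import os

DEFAULT_BASE_URL = "https://api.anycrawl.dev"


def endpoint(path, base_url=None):
    """Join a base URL and an API path without doubling or dropping slashes."""
    # ANYCRAWL_BASE_URL is an assumed variable name, used only in this sketch.
    base = base_url or os.environ.get("ANYCRAWL_BASE_URL", DEFAULT_BASE_URL)
    return base.rstrip("/") + "/" + path.lstrip("/")


# Hosted API versus a hypothetical self-hosted server.
hosted = endpoint("/v1/scrape", "https://api.anycrawl.dev")
self_hosted = endpoint("/v1/scrape", "http://localhost:8080/")
```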

Web Scraping (Scrape)

Example


curl -X POST https://api.anycrawl.dev/v1/scrape \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "cheerio"
}'

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| url | string (required) | The URL to be scraped. Must be a valid URL starting with http:// or https:// | - |
| engine | string | Scraping engine to use. Options: cheerio (static HTML parsing, fastest), playwright (JavaScript rendering with modern engine), puppeteer (JavaScript rendering with Chrome) | cheerio |
| proxy | string | Proxy URL for the request. Supports HTTP and SOCKS proxies. Format: http://[username]:[password]@proxy:port | (none) |
| max_age | number | Cache control (ms). 0 = force refresh (skip cache read); > 0 = accept cached content within this age; omit to use default. | (none) |
| store_in_cache | boolean | Cache control. Whether to store the result in cache. To bypass cache reads, use max_age=0. | true |

More parameters: see Request Parameters.

Cache details (self-host / S3 / map index): see docs/cache.md.
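The scrape parameters above can be assembled programmatically. The following is a sketch using only the Python standard library: it validates the URL scheme and engine name as described in the table, omits optional fields that are not set, and returns a prepared request without sending it. The helper name `build_scrape_request` is an assumption for this example.

```python
import json
import urllib.request

ENGINES = {"cheerio", "playwright", "puppeteer"}


def build_scrape_request(api_key, url, engine="cheerio", proxy=None,
                         max_age=None, store_in_cache=None,
                         base_url="https://api.anycrawl.dev"):
    """Prepare (but do not send) a POST request for /v1/scrape."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")

    body = {"url": url, "engine": engine}
    # Optional fields are left out so the server applies its own defaults.
    if proxy is not None:
        body["proxy"] = proxy
    if max_age is not None:
        body["max_age"] = max_age  # milliseconds; 0 skips the cache read
    if store_in_cache is not None:
        body["store_in_cache"] = store_in_cache

    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/scrape",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Force a fresh fetch by skipping the cache read.
req = build_scrape_request("YOUR_ANYCRAWL_API_KEY", "https://example.com", max_age=0)
```

To send it, pass `req` to `urllib.request.urlopen`. The response shape is not described in this section, so inspect the raw JSON before relying on specific fields.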

LLM Extraction

curl -X POST "https://api.anycrawl.dev/v1/scrape" \
  -H "Authorization: Bearer YOUR_ANYCRAWL_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "json_options": {
      "schema": {
        "type": "object",
        "properties": {
          "company_mission": { "type": "string" },
          "is_open_source": { "type": "boolean" },
          "employee_count": { "type": "number" }
        },
        "required": ["company_mission"]
      }
    }
  }'
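When the schema has many fields, it may be easier to generate the `json_options` object than to write it by hand. The sketch below rebuilds the same request body as the curl example from a simple field-to-type mapping. The helper name `extraction_payload` is an assumption, and it only covers flat schemas with primitive types.

```python
import json


def extraction_payload(url, fields, required=()):
    """Build a /v1/scrape body whose json_options.schema describes flat fields."""
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"required fields not declared: {missing}")
    return {
        "url": url,
        "json_options": {
            "schema": {
                "type": "object",
                "properties": {name: {"type": t} for name, t in fields.items()},
                "required": list(required),
            }
        },
    }


payload = extraction_payload(
    "https://example.com",
    {
        "company_mission": "string",
        "is_open_source": "boolean",
        "employee_count": "number",
    },
    required=["company_mission"],
)
print(json.dumps(payload, indent=2))
```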

Atlas Cloud Provider

AnyCrawl supports Atlas Cloud as an OpenAI-compatible LLM provider for extraction and summarization workloads.

  • Official site: Atlas Cloud
  • LLM base URL: https://api.atlascloud.ai/v1
  • Recommended env model format: atlascloud/deepseek-v3
ATLASCLOUD_BASE_URL=https://api.atlascloud.ai/v1
ATLASCLOUD_API_KEY=your-atlascloud-api-key
DEFAULT_LLM_MODEL=atlascloud/deepseek-v3
DEFAULT_EXTRACT_MODEL=atlascloud/deepseek-v3

If you prefer file-based AI config, add an atlascloud provider entry in ai.config.json and map it to any Atlas Cloud model exposed through the OpenAI-compatible chat API.

Site Crawling (Crawl)

Example


curl -X POST https://api.anycrawl.dev/v1/crawl \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "url": "https://example.com",
  "engine": "playwright",
  "max_depth": 2,
  "limit": 10,
  "strategy": "same-domain"
}'

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| url | string (required) | Starting URL to crawl | - |
| engine | string | Crawling engine. Options: cheerio, playwright, puppeteer | cheerio |
| max_depth | number | Max depth from the start URL | 10 |
| limit | number | Max number of pages to crawl | 100 |
| strategy | enum | Scope: all, same-domain, same-hostname, same-origin | same-domain |
| include_paths | array | Only crawl paths matching these patterns | (none) |
| exclude_paths | array | Skip paths matching these patterns | (none) |
| scrape_options | object | Per-page scrape options (formats, timeout, json extraction, etc.), same as Scrape options | (none) |

More parameters and endpoints: see Request Parameters.
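The crawl parameters and their documented defaults can be captured in a small data class, sketched below. The class name `CrawlRequest` and its `to_body` method are assumptions for this example, and `scrape_options` is left out for brevity.

```python
from dataclasses import asdict, dataclass, field

ENGINES = ("cheerio", "playwright", "puppeteer")
STRATEGIES = ("all", "same-domain", "same-hostname", "same-origin")


@dataclass
class CrawlRequest:
    """Body for POST /v1/crawl, with defaults taken from the table above."""
    url: str
    engine: str = "cheerio"
    max_depth: int = 10
    limit: int = 100
    strategy: str = "same-domain"
    include_paths: list = field(default_factory=list)
    exclude_paths: list = field(default_factory=list)

    def to_body(self):
        if self.engine not in ENGINES:
            raise ValueError(f"unknown engine: {self.engine}")
        if self.strategy not in STRATEGIES:
            raise ValueError(f"unknown strategy: {self.strategy}")
        # Empty path lists are dropped so they are not sent at all.
        return {k: v for k, v in asdict(self).items() if v != []}


# Same request as the curl example above.
body = CrawlRequest("https://example.com", engine="playwright",
                    max_depth=2, limit=10).to_body()
```

Validating `strategy` on the client catches typos before a crawl is queued. The pattern syntax accepted by `include_paths` and `exclude_paths` is not specified here, so check the Request Parameters docs before using them.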

Search Engine Results (SERP)

Example

curl -X POST https://api.anycrawl.dev/v1/search \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_ANYCRAWL_API_KEY' \
  -d '{
  "query": "AnyCrawl",
  "limit": 10,
  "engine": "google",
  "lang": "all"
}'

Parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| query | string (required) | Search query to be executed | - |
| engine | string | Search engine to use. Options: google | google |
| pages | integer | Number of search result pages to retrieve | 1 |
| lang | string | Language code for search results (e.g., 'en', 'zh', 'all') | en-US |

Supported search engines

  • Google
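A search body can be built the same way. This sketch applies the defaults from the parameter table and rejects engines other than Google, the only one listed as supported. The helper name `search_body` is an assumption for this example.

```python
SUPPORTED_ENGINES = ("google",)


def search_body(query, engine="google", pages=1, lang="en-US"):
    """Build the JSON body for POST /v1/search from the documented parameters."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    if pages < 1:
        raise ValueError("pages must be at least 1")
    return {"query": query, "engine": engine, "pages": pages, "lang": lang}


body = search_body("AnyCrawl", pages=2, lang="all")
```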

❓ FAQ

  1. Can I use proxies? Yes. AnyCrawl ships with a high‑quality default proxy. You can also configure your own: set the proxy request parameter (per request) or ANYCRAWL_PROXY_URL (self‑hosting).
  2. How do I handle JavaScript‑rendered pages? Use the playwright or puppeteer engine.
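For the proxy answer above, a proxy URL in the documented `http://[username]:[password]@proxy:port` format can be assembled as in this sketch. Percent-encoding the credentials is standard URL practice rather than something this README states, and the host `proxy.example.com` is a placeholder.

```python
from urllib.parse import quote


def proxy_url(host, port, username=None, password=None, scheme="http"):
    """Format a proxy URL as scheme://[username]:[password]@host:port."""
    auth = ""
    if username is not None:
        # Encode reserved characters such as "@" and ":" inside credentials.
        auth = quote(username, safe="")
        if password is not None:
            auth += ":" + quote(password, safe="")
        auth += "@"
    return f"{scheme}://{auth}{host}:{port}"


# Value for the per-request "proxy" parameter or for ANYCRAWL_PROXY_URL.
url = proxy_url("proxy.example.com", 8080, "user", "p@ss")
```

The scheme string to use for SOCKS proxies is not given in this section, so the sketch defaults to `http`.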

🤝 Contributing

We welcome contributions! See the Contributing Guide.

Backers

Support us with a monthly donation and help us continue our activities. [Become a backer]


📄 License

MIT License — see LICENSE.

🎯 Mission

We build simple, reliable, and scalable tools for the AI ecosystem.


Built with ❤️ by the Any4AI team