Awesome AI Web Scraping

A curated list of tools, libraries, and resources for AI-powered web scraping.


Frameworks, hosted APIs, browser infrastructure, MCP servers, and research for turning the web into clean, structured data for LLMs, RAG pipelines, and agents.

Scope: Tools where AI or LLMs play a meaningful role in extraction, navigation, or content understanding. General-purpose scrapers (Scrapy, BeautifulSoup) belong in awesome-web-scraping. Autonomous browser agents belong in awesome-web-agents.

Frameworks & Libraries

Self-hosted, open-source. Most pair a headless browser with an LLM for schema-based or prompt-based extraction.

  • Crawl4AI - LLM-friendly web crawler with Markdown output and JSON-schema or LLM-based extraction. Python.
  • Scrapling - Adaptive Python framework with smart element tracking that relocates elements after site changes. Cloudflare Turnstile bypass, spider framework with pause/resume, and a built-in MCP server.
  • ScrapeGraphAI - Python scraper using LLM + graph pipelines. Describe data in natural language, get typed JSON. Works with OpenAI, Anthropic, Groq, Gemini, Ollama.
  • llm-scraper - TypeScript library for structured extraction with Zod schemas. Supports GPT, Claude, Gemini, Llama, Qwen.
  • Reader - Jina AI's URL-to-Markdown converter. Engine behind r.jina.ai.
  • Stagehand - Browser automation framework with act, extract, and observe primitives over Playwright.
  • Browser-Use - Agent framework commonly used for scraping complex, login-walled sites.
  • Skyvern - Browser automation for forms, logins, and dynamic content.
  • LaVague - Natural language web automation framework.
  • CyberScraper 2077 - LLM scraper with Streamlit UI. Supports OpenAI, Gemini, and Ollama. Tor support included.
  • ScraperAI - AI scraper with auto-detection of page types, pagination, and catalog cards.
  • SpiderCreator - Generates Playwright spiders from natural language prompts.
  • PulsarRPA - AI-powered browser automation and data extraction.
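
To make "schema-based extraction" concrete, here is a minimal sketch of the general pattern, not the API of any tool listed above. It declares a target schema, asks a model for matching JSON, and validates the reply. The `Product` schema, the prompt wording, and `fake_llm` (a stand-in for a real model call) are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, fields
from typing import Callable


@dataclass
class Product:
    """Hypothetical target schema: the fields we want from a page."""
    name: str
    price: float


def build_prompt(page_text: str) -> str:
    """Ask the model for a JSON object whose keys match the schema."""
    wanted = ", ".join(f"{f.name} ({f.type.__name__})" for f in fields(Product))
    return f"Return only a JSON object with keys: {wanted}.\n\nPage:\n{page_text}"


def extract(page_text: str, llm: Callable[[str], str]) -> Product:
    """Send the prompt to `llm` and validate the reply against the schema."""
    data = json.loads(llm(build_prompt(page_text)))
    missing = [f.name for f in fields(Product) if f.name not in data]
    if missing:
        raise ValueError(f"model reply is missing fields: {missing}")
    # Coerce each value to its declared type so callers get typed data.
    return Product(**{f.name: f.type(data[f.name]) for f in fields(Product)})


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same JSON."""
    return '{"name": "Widget", "price": "19.99"}'


product = extract("Widget - $19.99", fake_llm)
print(product)
```

In a real pipeline the page text would come from a headless browser or crawler, and `llm` would wrap whichever model provider is in use.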

Hosted APIs

Managed services that turn URLs into LLM-ready Markdown or JSON. JS rendering, proxies, and anti-bot handled internally.

  • Firecrawl - Scrape, crawl, map, search, agent, and interact endpoints. LLM-ready Markdown. 500 free credits, paid plans from $16/mo.
  • Jina Reader - Prepend r.jina.ai/ to any URL for LLM-friendly text. Free tier with no API key required.
  • Diffbot - Computer vision and NLP extraction with a knowledge graph layer. Paid.
  • Apify - Marketplace of 10,000+ pre-built scrapers ("Actors") plus a runtime for your own. Free tier and paid plans.
  • Bright Data - Scraping with 150M+ proxies and pre-built APIs for 120+ sites. Free tier and paid plans.
  • Zyte - Scraping API with AI extraction. Formerly Scrapinghub. Paid.
  • ScrapingBee - JS rendering, AI extraction, Markdown, and Google SERP support. Free trial and paid plans.
  • ZenRows - Anti-bot focused scraping API with Markdown output. Free trial and paid plans.
  • Oxylabs - Proxies plus a Web Scraper API with adaptive parsing. Paid.
  • Spider - Concurrent crawler and scraper API with LLM-ready output. Free tier and paid plans.
  • WebScraping.AI - Scraping API with question-answering and field-extraction endpoints. Free tier and paid plans.
  • Scrapeless - Scraping API with anti-bot bypass and structured extraction. Free tier and paid plans.
  • Kadoa - Self-healing extraction that adapts when sites change. Paid.
  • Expand.ai - Turns any website into a type-safe API. Paid.
  • Reworkd - Agentic AI for no-code structured extraction. Paid.
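
The Jina Reader entry above describes a URL-prefix convention. A minimal sketch of it follows, assuming the `https://` scheme for the Reader host and using a hypothetical helper name, `reader_url`. The fetch helper is defined but not called, since it needs network access; response format and rate limits are not covered here.

```python
import urllib.request

READER_PREFIX = "https://r.jina.ai/"  # assumed scheme; the list only gives the host


def reader_url(target: str) -> str:
    """Prefix a page URL with the Reader endpoint."""
    return READER_PREFIX + target


def fetch_text(target: str, timeout: float = 30.0) -> str:
    """Fetch the LLM-friendly text for `target` (requires network access)."""
    with urllib.request.urlopen(reader_url(target), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


print(reader_url("https://example.com/article"))
```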

Browser Infrastructure for AI

Headless browsers designed for AI agents and scrapers.

  • Steel.dev - Open-source headless browser API for AI agents. Self-host or use the hosted service.
  • Browserbase - Hosted headless browser. Powers Stagehand. Paid.
  • Hyperbrowser - Browser platform with stealth, scraping, and agent endpoints. Free tier and paid plans.
  • Anchor Browser - Browser API with built-in auth and session persistence. Paid.
  • Browserless - Headless Chrome as a service. Free tier and paid plans.
  • Obscura - Rust-based headless browser. CDP-compatible with Puppeteer and Playwright. Built-in stealth and tracker blocking.
  • Browserable - Open-source, self-hostable browser automation library.

No-Code AI Scrapers

Visual or point-and-click tools that use AI to extract data without writing code.

  • Browse AI - Chrome extension and SaaS for AI-assisted scraping with scheduled monitoring.
  • Bardeen.ai - Chrome extension combining AI scraping with automation across 100+ apps.
  • Thunderbit - Two-click Chrome extension with AI "Suggest Fields" for instant extraction.
  • Gumloop - Visual workflow builder for scraping, LLM calls, and data transforms.
  • Octoparse - Visual scraper with AI-assisted field detection.
  • ParseHub - Visual scraper with template-based extraction.

MCP Servers for Scraping

Model Context Protocol servers that expose scraping capabilities to Claude, Cursor, Windsurf, and other LLM clients.

  • Firecrawl MCP - Official MCP wrapper for Firecrawl's scrape, crawl, and extract endpoints.
  • Bright Data MCP - Search, scrape, and extract from 60+ sources with anti-bot bypass. 5,000 free requests/month.
  • Scrapling MCP - Built-in MCP server bundled with Scrapling. Install with pip install "scrapling[ai]".
  • Fetch - Anthropic's official fetch MCP server. URL-to-Markdown.
  • Browserbase MCP - MCP server exposing Browserbase sessions and Stagehand primitives.
  • Puppeteer MCP - Browser automation for scraping and interaction.
  • Apify MCP - Run any Apify Actor as an MCP tool.
  • WebScraping.AI MCP - MCP integration for WebScraping.AI's extraction tools.
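
Many MCP clients register servers through a JSON config file. The sketch below generates one such fragment for the Fetch server. The `mcpServers` key, the `uvx` launcher, and the `mcp-server-fetch` package name are assumptions to verify against your client's and the server's own documentation; other servers in this list use different commands and usually need an API key.

```python
import json

# Assumed config shape: a map from a server label to the command that starts it.
config = {
    "mcpServers": {
        "fetch": {
            "command": "uvx",              # assumed launcher
            "args": ["mcp-server-fetch"],  # assumed package name
        },
    }
}

# Print the fragment so it can be pasted into the client's config file.
print(json.dumps(config, indent=2))
```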

Web Search APIs for LLMs

Search APIs that return structured, LLM-friendly results with full-page content.

  • Exa - Neural search API. Returns clean content alongside results.
  • Tavily - Search API optimized for LLMs and RAG.
  • Linkup - Search API with verified sources.
  • Perplexity Sonar - Perplexity's online search and answer API.
  • Serper - Fast, low-cost Google search API.
  • SerpAPI - Search engine results API.
  • Brave Search API - Independent search index.
  • You.com API - Web, news, and snippet endpoints.
  • Kagi Search API - Premium, ad-free search results.

Proxy & Anti-Bot Infrastructure

  • Bright Data - 150M+ proxies, Web Unblocker, browser cloud.
  • Oxylabs - Residential, datacenter, and ISP proxies plus Web Unblocker.
  • Decodo (formerly Smartproxy) - Residential proxies and scraping APIs.
  • NetNut - ISP and residential proxy network.
  • ZenRows - Anti-bot proxy and scraping API.
  • ScraperAPI - Proxy rotation and CAPTCHA handling.

Datasets

Pre-scraped web data for RAG, training, or benchmarking.

  • Common Crawl - The largest public web crawl. Petabytes of pages, monthly updates.
  • FineWeb - 15T-token deduplicated web dataset from Hugging Face.
  • RedPajama-Data-v2 - 30T-token open web dataset.
  • C4 - Colossal Clean Crawled Corpus derived from Common Crawl.
  • The Pile - 825 GiB diverse text corpus including web data.

Benchmarks & Research

  • SWDE - Structured Web Data Extraction benchmark from Microsoft Research.
  • WebSRC - Dataset for web-based structural reading comprehension.
  • AXE - Research on DOM pruning for token-efficient LLM extraction.
  • NEXT-EVAL - Benchmark comparing HTML representations for LLM extraction accuracy.
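
As a rough illustration of what DOM pruning means in this context, the sketch below drops noisy subtrees and all attributes before the HTML would be sent to a model. It is a generic standard-library example, not the method from AXE or NEXT-EVAL; the `SKIP` tag set and the `prune` helper are assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed set of subtrees that rarely carry extractable content.
SKIP = {"script", "style", "noscript", "svg", "nav", "footer"}


class Pruner(HTMLParser):
    """Re-emit HTML without skipped subtrees and without attributes."""

    def __init__(self) -> None:
        super().__init__()
        self.out: list[str] = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.out.append(data.strip())


def prune(html: str) -> str:
    """Return a smaller HTML string that keeps tag structure and visible text."""
    parser = Pruner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


page = (
    '<div class="card"><script>track()</script>'
    '<h1 id="t">Widget</h1><nav><a href="/">Home</a></nav>'
    "<p>$19.99</p></div>"
)
print(prune(page))  # → <div><h1>Widget</h1><p>$19.99</p></div>
```

Dropping attributes saves tokens but also removes hints such as class names, so how aggressively to prune is a trade-off that depends on the extraction task.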

Tutorials & Guides

Contributing

Contributions welcome. Open a pull request to add a new tool or resource.

Guidelines:

  • Keep entries focused on AI/LLM-powered scraping. Generic scrapers belong elsewhere.
  • Follow the format: - [Name](url) - One-line description.
  • Add the GitHub stars badge for open-source projects.
  • Mention pricing in the description if relevant (free tier, paid, etc.).
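
Contributors who want to check an entry before opening a pull request could use something like the sketch below. It is an assumption-laden reading of the format guideline, not an official lint rule of this list: it expects an `http(s)` URL, a description ending in a period, and an optional trailing badge image.

```python
import re

# Hypothetical pattern for "- [Name](url) - One-line description." with an
# optional Markdown badge image after the description.
ENTRY = re.compile(
    r"^- \[(?P<name>[^\]]+)\]\((?P<url>https?://[^\s)]+)\)"
    r" - (?P<desc>.+\.)"
    r"(?: !\[[^\]]*\]\([^)]+\))?$"
)


def is_valid_entry(line: str) -> bool:
    """Return True if `line` matches the assumed entry format."""
    return ENTRY.match(line.strip()) is not None


print(is_valid_entry("- [Example Tool](https://example.com) - One-line description."))
print(is_valid_entry("Example Tool - missing the link syntax"))
```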

License

CC0

To the extent possible under law, the contributors have waived all copyright and related rights to this work.