About awesome-ai-web-scraping

A curated list of AI-powered web scraping tools, LLM-friendly crawlers, MCP servers, and infrastructure for turning the web into data.

Published by

h4ckf0r0day

Awesome AI Web Scraping

A curated list of tools, libraries, and resources for AI-powered web scraping.

Frameworks, hosted APIs, browser infrastructure, MCP servers, and research for turning the web into clean, structured data for LLMs, RAG pipelines, and agents.

Scope: Tools where AI or LLMs play a meaningful role in extraction, navigation, or content understanding. General-purpose scrapers (Scrapy, BeautifulSoup) belong in awesome-web-scraping. Autonomous browser agents belong in awesome-web-agents.

Frameworks & Libraries

Self-hosted, open-source. Most pair a headless browser with an LLM for schema-based or prompt-based extraction.

Crawl4AI - LLM-friendly web crawler with Markdown output and JSON-schema or LLM-based extraction. Python.
Scrapling - Adaptive Python framework with smart element tracking that relocates elements after site changes. Cloudflare Turnstile bypass, spider framework with pause/resume, and a built-in MCP server.
ScrapeGraphAI - Python scraper using LLM + graph pipelines. Describe data in natural language, get typed JSON. Works with OpenAI, Anthropic, Groq, Gemini, Ollama.
llm-scraper - TypeScript library for structured extraction with Zod schemas. Supports GPT, Claude, Gemini, Llama, Qwen.
Reader - Jina AI's URL-to-Markdown converter. Engine behind r.jina.ai.
Stagehand - Browser automation framework with act, extract, and observe primitives over Playwright.
Browser-Use - Agent framework commonly used for scraping complex, login-walled sites.
Skyvern - Browser automation for forms, logins, and dynamic content.
LaVague - Natural language web automation framework.
CyberScraper 2077 - LLM scraper with Streamlit UI. Supports OpenAI, Gemini, and Ollama. Tor support included.
ScraperAI - AI scraper with auto-detection of page types, pagination, and catalog cards.
SpiderCreator - Generates Playwright spiders from natural language prompts.
PulsarRPA - AI-powered browser automation and data extraction.
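
The schema-based pattern described at the top of this section can be sketched without committing to any library listed here. This is only a minimal illustration: `Product`, `build_prompt`, `parse_into`, and `call_llm` are hypothetical names, and `call_llm` is a stub that returns a canned reply instead of calling a real model. In practice the page text would come from a headless browser.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Product:
    """Target schema: the fields we want extracted from a page."""
    name: str
    price: float


def build_prompt(page_text: str, schema: type) -> str:
    """Describe the schema's fields to the model and ask for JSON only."""
    wanted = ", ".join(f"{f.name} ({f.type.__name__})" for f in fields(schema))
    return f"Extract these fields as a JSON object: {wanted}\n\nPage:\n{page_text}"


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns a canned reply."""
    return '{"name": "Widget", "price": "19.99"}'


def parse_into(schema: type, raw_json: str):
    """Check the reply against the schema and coerce each field's type."""
    data = json.loads(raw_json)
    kwargs = {}
    for f in fields(schema):
        if f.name not in data:
            raise ValueError(f"missing field: {f.name}")
        kwargs[f.name] = f.type(data[f.name])  # e.g. float("19.99")
    return schema(**kwargs)


page_text = "Widget - now only $19.99"  # assumption: text already rendered
product = parse_into(Product, call_llm(build_prompt(page_text, Product)))
print(product)
```

Validating the reply against the schema, rather than trusting the raw JSON, is what lets a missing or mistyped field surface as an error instead of as bad data downstream.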

Hosted APIs

Managed services that turn URLs into LLM-ready Markdown or JSON. JS rendering, proxies, and anti-bot handling are managed for you.

Firecrawl - Scrape, crawl, map, search, agent, and interact endpoints. LLM-ready Markdown. 500 free credits, paid plans from $16/mo.
Jina Reader - Prepend r.jina.ai/ to any URL for LLM-friendly text. Free tier with no API key required.
Diffbot - Computer vision and NLP extraction with a knowledge graph layer. Paid.
Apify - Marketplace of 10,000+ pre-built scrapers ("Actors") plus a runtime for your own. Free tier and paid plans.
Bright Data - Scraping with 150M+ proxies and pre-built APIs for 120+ sites. Free tier and paid plans.
Zyte - Scraping API with AI extraction. Formerly Scrapinghub. Paid.
ScrapingBee - JS rendering, AI extraction, Markdown, and Google SERP support. Free trial and paid plans.
ZenRows - Anti-bot focused scraping API with Markdown output. Free trial and paid plans.
Oxylabs - Proxies plus a Web Scraper API with adaptive parsing. Paid.
Spider - Concurrent crawler and scraper API with LLM-ready output. Free tier and paid plans.
WebScraping.AI - Scraping API with question-answering and field-extraction endpoints. Free tier and paid plans.
Scrapeless - Scraping API with anti-bot bypass and structured extraction. Free tier and paid plans.
Kadoa - Self-healing extraction that adapts when sites change. Paid.
Expand.ai - Turns any website into a type-safe API. Paid.
Reworkd - Agentic AI for no-code structured extraction. Paid.
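
The Jina Reader entry above works by URL prefixing, which is simple enough to sketch. The helper names `reader_url` and `fetch_text` are hypothetical, and defaulting to HTTPS when the scheme is missing is an assumption of this sketch rather than documented behavior.

```python
from urllib.request import urlopen


def reader_url(target: str) -> str:
    """Prefix a URL with the r.jina.ai reader endpoint."""
    if not target.startswith(("http://", "https://")):
        target = "https://" + target  # assumption: default to HTTPS
    return "https://r.jina.ai/" + target


def fetch_text(target: str, timeout: float = 30.0) -> str:
    """Fetch the LLM-friendly text for a page (performs a network call)."""
    with urlopen(reader_url(target), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(reader_url("https://example.com/article"))
```

Only `reader_url` runs here; `fetch_text` is defined but not called, so the example makes no network request unless you invoke it yourself.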

Browser Infrastructure for AI

Headless browsers designed for AI agents and scrapers.

Steel.dev - Open-source headless browser API for AI agents. Self-host or use the hosted service.
Browserbase - Hosted headless browser. Powers Stagehand. Paid.
Hyperbrowser - Browser platform with stealth, scraping, and agent endpoints. Free tier and paid plans.
Anchor Browser - Browser API with built-in auth and session persistence. Paid.
Browserless - Headless Chrome as a service. Free tier and paid plans.
Obscura - Rust-based headless browser. CDP-compatible with Puppeteer and Playwright. Built-in stealth and tracker blocking.
Browserable - Open-source, self-hostable browser automation library.

No-Code AI Scrapers

Visual or point-and-click tools that use AI to extract data without writing code.

Browse AI - Chrome extension and SaaS for AI-assisted scraping with scheduled monitoring.
Bardeen.ai - Chrome extension combining AI scraping with automation across 100+ apps.
Thunderbit - Two-click Chrome extension with AI "Suggest Fields" for instant extraction.
Gumloop - Visual workflow builder for scraping, LLM calls, and data transforms.
Octoparse - Visual scraper with AI-assisted field detection.
ParseHub - Visual scraper with template-based extraction.

MCP Servers for Scraping

Model Context Protocol servers that expose scraping capabilities to Claude, Cursor, Windsurf, and other LLM clients.

Firecrawl MCP - Official MCP wrapper for Firecrawl's scrape, crawl, and extract endpoints.
Bright Data MCP - Search, scrape, and extract from 60+ sources with anti-bot bypass. 5,000 free requests/month.
Scrapling MCP - Built-in MCP server bundled with Scrapling. Install with pip install "scrapling[ai]".
Fetch - Anthropic's official fetch MCP server. URL-to-Markdown.
Browserbase MCP - MCP server exposing Browserbase sessions and Stagehand primitives.
Puppeteer MCP - Browser automation for scraping and interaction.
Apify MCP - Run any Apify Actor as an MCP tool.
WebScraping.AI MCP - MCP integration for WebScraping.AI's extraction tools.
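
Most of these servers are registered with a client through a small JSON config. The sketch below generates one for the Fetch server. It assumes a client that reads an `mcpServers` block (as Claude Desktop does), that `uvx` is installed, and that the server is published as `mcp-server-fetch`. Check each server's own README for its exact command and arguments.

```python
import json

# Assumption: the client launches each server as a subprocess using
# the "command" and "args" listed under its name.
config = {
    "mcpServers": {
        "fetch": {
            "command": "uvx",
            "args": ["mcp-server-fetch"],
        }
    }
}

config_json = json.dumps(config, indent=2)
print(config_json)
```

The printed JSON is what would go into the client's MCP configuration file; other servers in this list follow the same shape with a different command and, for hosted services, an API key passed through the environment.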

Web Search APIs for LLMs

Search APIs that return structured, LLM-friendly results with full-page content.

Exa - Neural search API. Returns clean content alongside results.
Tavily - Search API optimized for LLMs and RAG.
Linkup - Search API with verified sources.
Perplexity Sonar - Perplexity's online search and answer API.
Serper - Fast, low-cost Google search API.
SerpAPI - Search engine results API.
Brave Search API - Independent search index.
You.com API - Web, news, and snippet endpoints.
Kagi Search API - Premium, ad-free search results.

Proxy & Anti-Bot Infrastructure

Bright Data - 150M+ proxies, Web Unblocker, browser cloud.
Oxylabs - Residential, datacenter, and ISP proxies plus Web Unblocker.
Decodo (formerly Smartproxy) - Residential proxies and scraping APIs.
NetNut - ISP and residential proxy network.
ZenRows - Anti-bot proxy and scraping API.
ScraperAPI - Proxy rotation and CAPTCHA handling.

Datasets

Pre-scraped web data for RAG, training, or benchmarking.

Common Crawl - The largest public web crawl. Petabytes of pages, monthly updates.
FineWeb - 15T-token deduplicated web dataset from Hugging Face.
RedPajama-Data-v2 - 30T-token open web dataset.
C4 - Colossal Clean Crawled Corpus derived from Common Crawl.
The Pile - 825 GiB diverse text corpus including web data.

Benchmarks & Research

SWDE - Structured Web Data Extraction benchmark from Microsoft Research.
WebSRC - Dataset for web-based structural reading comprehension.
AXE - Research on DOM pruning for token-efficient LLM extraction.
NEXT-EVAL - Benchmark comparing HTML representations for LLM extraction accuracy.

Tutorials & Guides

Firecrawl Docs - Guides for RAG ingestion, structured extraction, and agent integration.
Crawl4AI Documentation - Walk-throughs for LLM-based extraction strategies.
Jina Reader Quickstart - One-line URL conversion.
LangChain Web Loaders - Document loaders for web content.
LlamaIndex Web Connectors - Web data connectors for LlamaIndex.

Contributing

Contributions welcome. Open a pull request to add a new tool or resource.

Guidelines:

Keep entries focused on AI/LLM-powered scraping. Generic scrapers belong elsewhere.
Follow the format: - [Name](url) - One-line description.
Add the GitHub stars badge for open-source projects.
Mention pricing in the description if relevant (free tier, paid, etc.).
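
The entry format above can be checked mechanically before opening a pull request. This is a rough sketch, not a tool this list ships: `ENTRY_RE` and `check_entry` are hypothetical names, and the pattern assumes an http(s) URL and a description that ends with a period. It does not account for the optional stars badge.

```python
import re

# - [Name](url) - One-line description.
ENTRY_RE = re.compile(
    r"^- \[(?P<name>[^\]]+)\]"         # "- [Name]"
    r"\((?P<url>https?://[^\s)]+)\)"   # "(url)", assumed to be http(s)
    r" - (?P<desc>.+\.)$"              # " - description." ending in a period
)


def check_entry(line: str) -> bool:
    """Return True if the line follows the list's entry format."""
    return ENTRY_RE.match(line.rstrip("\n")) is not None


examples = [
    "- [Example](https://example.com) - One-line description.",
    "* [Example](https://example.com): wrong bullet and separator",
]
for entry in examples:
    print(check_entry(entry), entry)
```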

License

To the extent possible under law, the contributors have waived all copyright and related rights to this work.