isdk

Open Source

web-fetcher.js

# 🕸️ @isdk/web-fetcher [![npm version](https://img.shields.io/npm/v/%40isdk%2Fweb-fetcher)](https://www.npmjs.com/package/@isdk/web-fetcher) [![npm downloads](https://img.shields.io/npm/dw/%40isdk%2Fweb-fetcher)](https://www.npmjs.com/package/@isdk/web-fetcher) [![License](https://img.shields.io/github/license/isdk/web-fetcher.js)](https://github.com/isdk/web-fetcher.js/blob/main/LICENSE) [![Node](https://img.shields.io/badge/node-%3E%3D18-339933?logo=node.js)](https://nodejs.org/) [![TypeScript](https://img.shields.io/badge/TypeScript-Types%20included-3178C6?logo=typescript)](https://www.typescriptlang.org/) [![GitHub Stars](https://img.shields.io/github/stars/isdk/web-fetcher.js?logo=github)](https://github.com/isdk/web-fetcher.js) ![antibot](https://img.shields.io/badge/antibot-optional-orange) English | [简体中文](./README.cn.md) > An AI-friendly web automation library that simplifies complex web interactions into a declarative JSON action script. Write your script once and run it in either a fast **`http`** mode for static content or a full **`browser`** mode for dynamic sites. An optional **`antibot`** flag helps bypass detection mechanisms. The library is designed for targeted, task-oriented data extraction (e.g., get X from page Y), not for building whole-site crawlers. --- ## ✨ Core Features * **⚙️ Dual-Engine Architecture**: Choose between **`http`** mode (powered by Cheerio) for speed on static sites, or **`browser`** mode (powered by Playwright) for full JavaScript execution on dynamic sites. * **📜 Declarative Action Scripts**: Define multi-step workflows (like logging in, filling forms, and clicking buttons) in a simple, readable JSON format. * **📊 Powerful and Flexible Data Extraction**: Easily extract all kinds of structured data, from simple text to complex nested objects, through an intuitive and powerful declarative Schema. * **🧠 Smart Engine Selection**: Automatically detects dynamic sites and can upgrade the engine from `http` to `browser` on the fly. * **🛡️ Anti-Bot Evasion**: In `browser` mode, an optional `antibot` flag helps to bypass common anti-bot measures like Cloudflare challenges. * **🕹️ High-Fidelity Interaction Simulation**: Supports Bézier curve-based mouse trajectory movement, realistic typing delay simulation, and complex keyboard interactions to significantly improve anti-bot evasion. * **🧩 Extensible**: Easily create custom, high-level "composite" actions to encapsulate reusable business logic (e.g., a `login` action). * **🧲 Advanced Collectors**: Asynchronously collect data in the background, triggered by events during the execution of a main action. --- ### Smart Upgrade and Retry Strategy When `enableSmart` is enabled, the system automatically determines whether an engine upgrade is needed based on response characteristics: - Triggers for upgrade include: - HTTP status codes: `401 / 403 / 429 / 5xx` (including network-level errors like timeouts or connection failures mapped to `408 / 503 / 504`) - Page appears to be dynamically rendered (detected typical JS framework signatures in HTML, controlled by `upgradeOnJsContent`) - `Retry-After` exceeds `upgradeThresholdMs` - During upgrade, you can choose whether to sync Cookies / Session state (`syncStateOnUpgrade`) - For `429` responses, if `Retry-After` is less than the `upgradeThresholdMs` threshold, the system will prioritize retry over upgrade | Option | Type | Default | Description | |--------|------|---------|-------------| | `enableSmart` | boolean | `true` | Enable smart detection and automatic engine upgrade | | `upgradeOnJsContent` | boolean | `false` | Upgrade to browser engine when JS rendering signatures are detected (e.g., `window.__NEXT_DATA__`, `window.__NUXT__`) | | `upgradeThresholdMs` | number | `5000` | Response time threshold (ms) to trigger upgrade; also used for 429 `Retry-After` comparison | | `syncStateOnUpgrade` | boolean | `false` | Sync cookies/session state during engine upgrade | --- ## 📦 Installation 1. **Install the Package:** ```bash npm install @isdk/web-fetcher ``` 2. **Install Browsers (For `browser` mode):** The `browser` engine is powered by Playwright, which requires separate browser binaries to be downloaded. If you plan to use the `browser` engine for interacting with dynamic websites, run the following command: ```bash npx playwright install ``` > ℹ️ **Note:** This step is only required for `browser` mode. The lightweight `http` mode works out of the box without this installation. --- ## 🚀 Quick Start The following example fetches a web page and extracts its title. ```typescript import { fetchWeb } from '@isdk/web-fetcher'; async function getTitle(url: string) { const { outputs } = await fetchWeb({ url, actions: [ { id: 'extract', params: { // Extracts the text content of the <title> tag selector: 'title', }, // Stores the result in the `outputs` object under the key 'pageTitle' storeAs: 'pageTitle', }, ], }); console.log('Page Title:', outputs.pageTitle); } getTitle('https://www.google.com'); ``` --- ## 🤖 Advanced Usage: Multi-Step Form Submission This example demonstrates how to use the `browser` engine to perform a search on Google. ```typescript import { fetchWeb } from '@isdk/web-fetcher'; async function searchGoogle(query: string) { // Search for the query on Google const { result, outputs } = await fetchWeb({ url: 'https://www.google.com', engine: 'browser', // Use the full browser engine for interaction actions: [ // The initial navigation to google.com is handled by the `url` option { id: 'fill', params: { selector: 'textarea[name=q]', value: query } }, { id: 'submit', params: { selector: 'form' } }, { id: 'waitFor', params: { selector: '#search' } }, // Wait for the search results container to appear { id: 'getContent', storeAs: 'searchResultsPage' }, ] }); console.log('Search Results URL:', result?.finalUrl); console.log('Outputs contains the full page content:', outputs.searchResultsPage.html.substring(0, 100)); } searchGoogle('gemini'); ``` --- ## 🏗️ Architecture This library is built on two core concepts: **Engines** and **Actions**. * ### Engine Architecture The library's core is its dual-engine design. It abstracts away the complexities of web interaction behind a unified API. For detailed information on the `http` (Cheerio) and `browser` (Playwright) engines, how they manage state, and how to extend them, please see the [**Fetch Engine Architecture**](./README.engine.md) document. * ### Action Architecture All workflows are defined as a series of "Actions". The library provides a set of built-in atomic actions and a powerful composition model for creating your own semantic actions. For a deep dive into creating and using actions, see the [**Action Script Architecture**](./README.action.md) document. --- ## 📚 API Reference ### `fetchWeb(options)` or `fetchWeb(url, options)` This is the main entry point for the library. **Key `FetcherOptions`**: * `url` (string): The initial URL to navigate to. * `engine` ('http' | 'browser' | 'auto'): The engine to use. Defaults to `auto`. * `proxy` (string | string[]): Proxy URL(s) to use for requests. * `debug` (boolean | string | string[]): Enable detailed execution metadata (timings, engine used, etc.) in response, or enable debug logs for specific categories (e.g., 'extract', 'submit', 'request'). * `actions` (FetchActionOptions[]): An array of action objects to execute. (Supports `action`/`name` as alias for `id`, and `args` as alias for `params`) * `headers` (Record<string, string>): Headers to use for all requests. * `cookies` (Cookie[]): Array of cookies to use. * `sessionState` (any): Crawlee session state to restore. * `storage` (StorageOptions): Controls session isolation, persistence, and cleanup. * `id` (string): Shared storage ID for cross-session data reuse. * `persist` (boolean): Whether to save data to disk. * `purge` (boolean): Whether to delete data on cleanup (defaults to `true`). * `config` (object): Raw Crawlee configuration (e.g., `{ localDataDirectory: './data' }`). * `cache` (FetchCacheOptions): Controls persistent HTTP caching with smart self-healing mechanisms. * `enabled` (boolean): Whether to enable caching. * `offline` (boolean): Enable offline mode (prohibit network requests, throw error on MISS). * `storagePath` (string): Custom path for cache storage. Shared pools are managed automatically. * `backgroundUpdate` (boolean): Whether to enable SWR (Stale-While-Revalidate) background updates. Default: `true`. * `staleIfError` (boolean): Force return of stale cache if network request fails. * `forceCache` (boolean): Ignore server directives and force caching. * `refresh` (boolean): **Force Refresh**: Ignore existing cache to re-validate and "heal" the cache entry. Useful for bypassing blocks via manual verification. * `methods`, `cacheRules`, `query`, `headers`, `cookies`, `body`: Fine-grained cache policy configuration. Supports `STALE_RESCUE` and `WAF_CHALLENGE` detection for automatic engine upgrade and cache healing when used with `enableSmart`. * `output` (object): Controls the output fields in `FetchResponse`. * `cookies` (boolean): Whether to include cookies in the response (default: `true`). * `sessionState` (boolean): Whether to include session state in the response (default: `true`). * `browser` (object): Browser engine configuration. * `headless` (boolean): Run in headless mode (default: `true`). * `launchOptions` (object): Playwright launch options (e.g., `{ slowMo: 50, args: [...] }`). * `sessionPoolOptions` (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool. * `enableSmart` (boolean): Enable smart detection and automatic engine upgrade (default: `true`). * `syncStateOnUpgrade` (boolean): Whether to sync Cookies / Session state when upgrading from http to browser engine (default: `false`). * `upgradeThresholdMs` (number): Wait time threshold in milliseconds to trigger engine upgrade; upgrades if exceeded or no explicit retry info (default: `5000`). * `maxRetries` (number): Maximum retry attempts for a single Action (default: `0`). * `failOnError` (boolean): Whether to throw an exception when an Action fails (default: `true` for main flow, `false` for collector). * `failOnTimeout` (boolean): Whether to treat timeout as failure (default: `false`). * ...and many other options for proxy, retries, etc. ### Built-in Actions The library provides a set of powerful built-in actions, many of which are engine-agnostic and handled centrally for consistency: * `goto`: Navigates to a new URL. * `click`: Clicks on an element (Engine-specific). * `fill`: Fills an input field (Engine-specific). * `submit`: Submits a form (Engine-specific). * `mouseMove`: Moves the mouse cursor to a specific coordinate or element (Bézier curve supported). * `mouseClick`: Triggers a mouse click at the current position or specified coordinates. * `mouseWheel`: Simulates a mouse wheel scroll event with horizontal and vertical deltas. Supports splitting into multiple steps and automatic scrolling to make the target element visible. * `scrollIntoView`: Scrolls the page or a container to make a specific element visible in the viewport. * `keyboardType`: Simulates human-like typing into the currently focused element. * `keyboardPress`: Simulates pressing a single key or a key combination. * `trim`: Removes elements from the DOM to clean up the page. * `waitFor`: Pauses execution to wait for a specific condition (Supports fixed timeouts centrally). * `pause`: Pauses execution for manual intervention (Handled centrally). * `getContent`: Retrieves the full content of the current page (Handled centrally). * `evaluate`: Executes custom JavaScript within the page context. * `extract`: Extracts structured data using an engine-agnostic core logic and engine-specific DOM primitives. Supports `required` fields and `strict` validation. ### Response Structure The `fetchWeb` function returns an object containing: * `result` (FetchResponse): * `url`: The final URL. * `statusCode`: HTTP status code. * `headers`: HTTP headers. * `cookies`: Array of cookies. * `sessionState`: Crawlee session state. * `text`, `html`: Page content. * `outputs` (Record<string, any>): Data extracted and stored via `storeAs`. Note: When multiple actions store objects into the same key, they are merged instead of overwritten. --- ## 📜 License [MIT](./LICENSE-MIT)

JavaScript Libraries & Components Browser Automation

24 Github Stars

Software by isdk

web-fetcher.js