Home
Softono
web-fetcher.js

web-fetcher.js

Open source MIT TypeScript
24
Stars
3
Forks
0
Issues
0
Watchers
4 weeks
Last Commit

About web-fetcher.js

# 🕸️ @isdk/web-fetcher [![npm version](https://img.shields.io/npm/v/%40isdk%2Fweb-fetcher)](https://www.npmjs.com/package/@isdk/web-fetcher) [![npm downloads](https://img.shields.io/npm/dw/%40isdk%2Fweb-fetcher)](https://www.npmjs.com/package/@isdk/web-fetcher) [![License](https://img.shields.io/github/license/isdk/web-fetcher.js)](https://github.com/isdk/web-fetcher.js/blob/main/LICENSE) [![Node](https://img.shields.io/badge/node-%3E%3D18-339933?logo=node.js)](https://nodejs.org/) [![TypeScript](https://img.shields.io/badge/TypeScript-Types%20included-3178C6?logo=typescript)](https://www.typescriptlang.org/) [![GitHub Stars](https://img.shields.io/github/stars/isdk/web-fetcher.js?logo=github)](https://github.com/isdk/web-fetcher.js) ![antibot](https://img.shields.io/badge/antibot-optional-orange) English | [简体中文](./README.cn.md) > An AI-friendly web automation library that simplifies complex web interactions into a declarative JSON action script. Write your script once and run it in either a fast **`h ...

Platforms

Web Self-hosted

Languages

TypeScript

Links

🕸️ @isdk/web-fetcher

npm version npm downloads License Node TypeScript GitHub Stars antibot

English | 简体中文

An AI-friendly web automation library that simplifies complex web interactions into a declarative JSON action script. Write your script once and run it in either a fast http mode for static content or a full browser mode for dynamic sites. An optional antibot flag helps bypass detection mechanisms. The library is designed for targeted, task-oriented data extraction (e.g., get X from page Y), not for building whole-site crawlers.


✨ Core Features

  • ⚙️ Dual-Engine Architecture: Choose between http mode (powered by Cheerio) for speed on static sites, or browser mode (powered by Playwright) for full JavaScript execution on dynamic sites.
  • 📜 Declarative Action Scripts: Define multi-step workflows (like logging in, filling forms, and clicking buttons) in a simple, readable JSON format.
  • 📊 Powerful and Flexible Data Extraction: Easily extract all kinds of structured data, from simple text to complex nested objects, through an intuitive and powerful declarative Schema.
  • 🧠 Smart Engine Selection: Automatically detects dynamic sites and can upgrade the engine from http to browser on the fly.
  • 🛡️ Anti-Bot Evasion: In browser mode, an optional antibot flag helps to bypass common anti-bot measures like Cloudflare challenges.
  • 🕹️ High-Fidelity Interaction Simulation: Supports Bézier curve-based mouse trajectory movement, realistic typing delay simulation, and complex keyboard interactions to significantly improve anti-bot evasion.
  • 🧩 Extensible: Easily create custom, high-level "composite" actions to encapsulate reusable business logic (e.g., a login action).
  • 🧲 Advanced Collectors: Asynchronously collect data in the background, triggered by events during the execution of a main action.

Smart Upgrade and Retry Strategy

When enableSmart is enabled, the system automatically determines whether an engine upgrade is needed based on response characteristics:

  • Triggers for upgrade include:
    • HTTP status codes: 401 / 403 / 429 / 5xx (including network-level errors like timeouts or connection failures mapped to 408 / 503 / 504)
    • Page appears to be dynamically rendered (detected typical JS framework signatures in HTML, controlled by upgradeOnJsContent)
    • Retry-After exceeds upgradeThresholdMs
  • During upgrade, you can choose whether to sync Cookies / Session state (syncStateOnUpgrade)
  • For 429 responses, if Retry-After is less than the upgradeThresholdMs threshold, the system will prioritize retry over upgrade
Option Type Default Description
enableSmart boolean true Enable smart detection and automatic engine upgrade
upgradeOnJsContent boolean false Upgrade to browser engine when JS rendering signatures are detected (e.g., window.__NEXT_DATA__, window.__NUXT__)
upgradeThresholdMs number 5000 Response time threshold (ms) to trigger upgrade; also used for 429 Retry-After comparison
syncStateOnUpgrade boolean false Sync cookies/session state during engine upgrade

📦 Installation

  1. Install the Package:

     npm install @isdk/web-fetcher
  2. Install Browsers (For browser mode):

    The browser engine is powered by Playwright, which requires separate browser binaries to be downloaded. If you plan to use the browser engine for interacting with dynamic websites, run the following command:

     npx playwright install

    ℹ️ Note: This step is only required for browser mode. The lightweight http mode works out of the box without this installation.


🚀 Quick Start

The following example fetches a web page and extracts its title.

import { fetchWeb } from '@isdk/web-fetcher';

async function getTitle(url: string) {
  const { outputs } = await fetchWeb({
    url,
    actions: [
      {
        id: 'extract',
        params: {
          // Extracts the text content of the <title> tag
          selector: 'title',
        },
        // Stores the result in the `outputs` object under the key 'pageTitle'
        storeAs: 'pageTitle',
      },
    ],
  });

  console.log('Page Title:', outputs.pageTitle);
}

getTitle('https://www.google.com');

🤖 Advanced Usage: Multi-Step Form Submission

This example demonstrates how to use the browser engine to perform a search on Google.

import { fetchWeb } from '@isdk/web-fetcher';

async function searchGoogle(query: string) {
  // Search for the query on Google
  const { result, outputs } = await fetchWeb({
    url: 'https://www.google.com',
    engine: 'browser', // Use the full browser engine for interaction
    actions: [
      // The initial navigation to google.com is handled by the `url` option
      { id: 'fill', params: { selector: 'textarea[name=q]', value: query } },
      { id: 'submit', params: { selector: 'form' } },
      { id: 'waitFor', params: { selector: '#search' } }, // Wait for the search results container to appear
      { id: 'getContent', storeAs: 'searchResultsPage' },
    ]
  });

  console.log('Search Results URL:', result?.finalUrl);
  console.log('Outputs contains the full page content:', outputs.searchResultsPage.html.substring(0, 100));
}

searchGoogle('gemini');

🏗️ Architecture

This library is built on two core concepts: Engines and Actions.

  • Engine Architecture

    The library's core is its dual-engine design. It abstracts away the complexities of web interaction behind a unified API. For detailed information on the http (Cheerio) and browser (Playwright) engines, how they manage state, and how to extend them, please see the Fetch Engine Architecture document.

  • Action Architecture

    All workflows are defined as a series of "Actions". The library provides a set of built-in atomic actions and a powerful composition model for creating your own semantic actions. For a deep dive into creating and using actions, see the Action Script Architecture document.


📚 API Reference

fetchWeb(options) or fetchWeb(url, options)

This is the main entry point for the library.

Key FetcherOptions:

  • url (string): The initial URL to navigate to.
  • engine ('http' | 'browser' | 'auto'): The engine to use. Defaults to auto.
  • proxy (string | string[]): Proxy URL(s) to use for requests.
  • debug (boolean | string | string[]): Enable detailed execution metadata (timings, engine used, etc.) in response, or enable debug logs for specific categories (e.g., 'extract', 'submit', 'request').
  • actions (FetchActionOptions[]): An array of action objects to execute. (Supports action/name as alias for id, and args as alias for params)
  • headers (Record<string, string>): Headers to use for all requests.
  • cookies (Cookie[]): Array of cookies to use.
  • sessionState (any): Crawlee session state to restore.
  • storage (StorageOptions): Controls session isolation, persistence, and cleanup.
    • id (string): Shared storage ID for cross-session data reuse.
    • persist (boolean): Whether to save data to disk.
    • purge (boolean): Whether to delete data on cleanup (defaults to true).
    • config (object): Raw Crawlee configuration (e.g., { localDataDirectory: './data' }).
  • cache (FetchCacheOptions): Controls persistent HTTP caching with smart self-healing mechanisms.
    • enabled (boolean): Whether to enable caching.
    • offline (boolean): Enable offline mode (prohibit network requests, throw error on MISS).
    • storagePath (string): Custom path for cache storage. Shared pools are managed automatically.
    • backgroundUpdate (boolean): Whether to enable SWR (Stale-While-Revalidate) background updates. Default: true.
    • staleIfError (boolean): Force return of stale cache if network request fails.
    • forceCache (boolean): Ignore server directives and force caching.
    • refresh (boolean): Force Refresh: Ignore existing cache to re-validate and "heal" the cache entry. Useful for bypassing blocks via manual verification.
    • methods, cacheRules, query, headers, cookies, body: Fine-grained cache policy configuration. Supports STALE_RESCUE and WAF_CHALLENGE detection for automatic engine upgrade and cache healing when used with enableSmart.
  • output (object): Controls the output fields in FetchResponse.
    • cookies (boolean): Whether to include cookies in the response (default: true).
    • sessionState (boolean): Whether to include session state in the response (default: true).
  • browser (object): Browser engine configuration.
    • headless (boolean): Run in headless mode (default: true).
    • launchOptions (object): Playwright launch options (e.g., { slowMo: 50, args: [...] }).
  • sessionPoolOptions (SessionPoolOptions): Advanced configuration for the underlying Crawlee SessionPool.
  • enableSmart (boolean): Enable smart detection and automatic engine upgrade (default: true).
  • syncStateOnUpgrade (boolean): Whether to sync Cookies / Session state when upgrading from http to browser engine (default: false).
  • upgradeThresholdMs (number): Wait time threshold in milliseconds to trigger engine upgrade; upgrades if exceeded or no explicit retry info (default: 5000).
  • maxRetries (number): Maximum retry attempts for a single Action (default: 0).
  • failOnError (boolean): Whether to throw an exception when an Action fails (default: true for main flow, false for collector).
  • failOnTimeout (boolean): Whether to treat timeout as failure (default: false).
  • ...and many other options for proxy, retries, etc.

Built-in Actions

The library provides a set of powerful built-in actions, many of which are engine-agnostic and handled centrally for consistency:

  • goto: Navigates to a new URL.
  • click: Clicks on an element (Engine-specific).
  • fill: Fills an input field (Engine-specific).
  • submit: Submits a form (Engine-specific).
  • mouseMove: Moves the mouse cursor to a specific coordinate or element (Bézier curve supported).
  • mouseClick: Triggers a mouse click at the current position or specified coordinates.
  • mouseWheel: Simulates a mouse wheel scroll event with horizontal and vertical deltas. Supports splitting into multiple steps and automatic scrolling to make the target element visible.
  • scrollIntoView: Scrolls the page or a container to make a specific element visible in the viewport.
  • keyboardType: Simulates human-like typing into the currently focused element.
  • keyboardPress: Simulates pressing a single key or a key combination.
  • trim: Removes elements from the DOM to clean up the page.
  • waitFor: Pauses execution to wait for a specific condition (Supports fixed timeouts centrally).
  • pause: Pauses execution for manual intervention (Handled centrally).
  • getContent: Retrieves the full content of the current page (Handled centrally).
  • evaluate: Executes custom JavaScript within the page context.
  • extract: Extracts structured data using an engine-agnostic core logic and engine-specific DOM primitives. Supports required fields and strict validation.

Response Structure

The fetchWeb function returns an object containing:

  • result (FetchResponse):
    • url: The final URL.
    • statusCode: HTTP status code.
    • headers: HTTP headers.
    • cookies: Array of cookies.
    • sessionState: Crawlee session state.
    • text, html: Page content.
  • outputs (Record<string, any>): Data extracted and stored via storeAs. Note: When multiple actions store objects into the same key, they are merged instead of overwritten.

📜 License

MIT