🕸️ @isdk/web-fetcher
An AI-friendly web automation library that simplifies complex web interactions into a declarative JSON action script. Write your script once and run it in either a fast `http` mode for static content or a full `browser` mode for dynamic sites. An optional `antibot` flag helps bypass detection mechanisms. The library is designed for targeted, task-oriented data extraction (e.g., get X from page Y), not for building whole-site crawlers.
✨ Core Features
- ⚙️ Dual-Engine Architecture: Choose between `http` mode (powered by Cheerio) for speed on static sites, or `browser` mode (powered by Playwright) for full JavaScript execution on dynamic sites.
- 📜 Declarative Action Scripts: Define multi-step workflows (like logging in, filling forms, and clicking buttons) in a simple, readable JSON format.
- 📊 Powerful and Flexible Data Extraction: Easily extract all kinds of structured data, from simple text to complex nested objects, through an intuitive and powerful declarative Schema.
- 🧠 Smart Engine Selection: Automatically detects dynamic sites and can upgrade the engine from `http` to `browser` on the fly.
- 🛡️ Anti-Bot Evasion: In `browser` mode, an optional `antibot` flag helps to bypass common anti-bot measures like Cloudflare challenges.
- 🕹️ High-Fidelity Interaction Simulation: Supports Bézier curve-based mouse trajectory movement, realistic typing delay simulation, and complex keyboard interactions to significantly improve anti-bot evasion.
- 🧩 Extensible: Easily create custom, high-level "composite" actions to encapsulate reusable business logic (e.g., a `login` action).
- 🧲 Advanced Collectors: Asynchronously collect data in the background, triggered by events during the execution of a main action.
Smart Upgrade and Retry Strategy
When enableSmart is enabled, the system automatically determines whether an engine upgrade is needed based on response characteristics:
- Triggers for upgrade include:
  - HTTP status codes: `401` / `403` / `429` / `5xx` (including network-level errors like timeouts or connection failures mapped to `408` / `503` / `504`)
  - Page appears to be dynamically rendered (detected typical JS framework signatures in HTML, controlled by `upgradeOnJsContent`)
  - `Retry-After` exceeds `upgradeThresholdMs`
- During upgrade, you can choose whether to sync Cookies / Session state (`syncStateOnUpgrade`)
- For `429` responses, if `Retry-After` is less than the `upgradeThresholdMs` threshold, the system will prioritize retry over upgrade
| Option | Type | Default | Description |
|---|---|---|---|
| `enableSmart` | boolean | `true` | Enable smart detection and automatic engine upgrade |
| `upgradeOnJsContent` | boolean | `false` | Upgrade to browser engine when JS rendering signatures are detected (e.g., `window.__NEXT_DATA__`, `window.__NUXT__`) |
| `upgradeThresholdMs` | number | `5000` | Response time threshold (ms) to trigger upgrade; also used for 429 `Retry-After` comparison |
| `syncStateOnUpgrade` | boolean | `false` | Sync cookies/session state during engine upgrade |
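
The upgrade rules described in this section can be sketched as a small pure function. This is only an illustrative model of the documented behavior, not the library's implementation: the `decide` function and the `SmartOptions`, `HttpObservation`, and `Decision` types are hypothetical names introduced here. Two points are assumptions: a `429` without any `Retry-After` value is treated as an upgrade, and a `Retry-After` exactly equal to the threshold is also treated as an upgrade.

```typescript
// Illustrative model only: none of these names are exported by @isdk/web-fetcher.
type Decision = 'keep' | 'retry' | 'upgrade';

interface SmartOptions {
  enableSmart: boolean;
  upgradeOnJsContent: boolean;
  upgradeThresholdMs: number;
}

interface HttpObservation {
  statusCode: number;
  retryAfterMs?: number;     // Retry-After converted to milliseconds, if present
  looksJsRendered?: boolean; // e.g. window.__NEXT_DATA__ was found in the HTML
}

// Defaults taken from the options table above.
const defaults: SmartOptions = {
  enableSmart: true,
  upgradeOnJsContent: false,
  upgradeThresholdMs: 5000,
};

function decide(obs: HttpObservation, opts: SmartOptions = defaults): Decision {
  if (!opts.enableSmart) return 'keep';

  if (obs.statusCode === 429) {
    // A short Retry-After favors retrying on the http engine.
    // A long or missing Retry-After favors upgrading (assumption for the missing case).
    const shortWait =
      obs.retryAfterMs !== undefined && obs.retryAfterMs < opts.upgradeThresholdMs;
    return shortWait ? 'retry' : 'upgrade';
  }

  // 401 / 403 / 5xx (network failures are described as mapped to 5xx-style codes).
  if (obs.statusCode === 401 || obs.statusCode === 403 || obs.statusCode >= 500) {
    return 'upgrade';
  }

  // JS-rendered pages only trigger an upgrade when the option is switched on.
  if (opts.upgradeOnJsContent && obs.looksJsRendered) return 'upgrade';

  return 'keep';
}

console.log(decide({ statusCode: 429, retryAfterMs: 1000 }));
console.log(decide({ statusCode: 403 }));
```

With the default options, a JS-rendered page that returns `200` is left on the `http` engine, because `upgradeOnJsContent` defaults to `false`.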
📦 Installation
- Install the Package: `npm install @isdk/web-fetcher`
- Install Browsers (For `browser` mode): The `browser` engine is powered by Playwright, which requires separate browser binaries to be downloaded. If you plan to use the `browser` engine for interacting with dynamic websites, run the following command: `npx playwright install`

  ℹ️ Note: This step is only required for `browser` mode. The lightweight `http` mode works out of the box without this installation.
🚀 Quick Start
The following example fetches a web page and extracts its title.
    import { fetchWeb } from '@isdk/web-fetcher';

    async function getTitle(url: string) {
      const { outputs } = await fetchWeb({
        url,
        actions: [
          {
            id: 'extract',
            params: {
              // Extracts the text content of the <title> tag
              selector: 'title',
            },
            // Stores the result in the `outputs` object under the key 'pageTitle'
            storeAs: 'pageTitle',
          },
        ],
      });

      console.log('Page Title:', outputs.pageTitle);
    }

    getTitle('https://www.google.com');
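
Because an action script is plain JSON-serializable data, the same script can be loaded from a file or generated by a model and then paired with either engine. The sketch below illustrates that idea without calling the library. The `ActionStep` and `ScriptOptions` types and the `withEngine` helper are simplified, hypothetical stand-ins defined locally (the library's own types may differ), and `https://example.com` is a placeholder URL.

```typescript
// Local, simplified types for illustration; not the library's exported types.
type Engine = 'http' | 'browser' | 'auto';

interface ActionStep {
  id: string;
  params?: Record<string, unknown>;
  storeAs?: string;
}

interface ScriptOptions {
  url: string;
  engine?: Engine;
  actions: ActionStep[];
}

// The script as it might arrive from a JSON file or an LLM response.
const scriptJson = `{
  "url": "https://example.com",
  "actions": [
    { "id": "extract", "params": { "selector": "title" }, "storeAs": "pageTitle" }
  ]
}`;

// Hypothetical helper: returns a copy of the script bound to a specific engine.
function withEngine(script: ScriptOptions, engine: Engine): ScriptOptions {
  return { ...script, engine };
}

const script = JSON.parse(scriptJson) as ScriptOptions;
const fast = withEngine(script, 'http');    // static pages
const full = withEngine(script, 'browser'); // JavaScript-heavy pages

console.log(fast.engine, full.engine, fast.actions.length);
```

Either object could then be passed to `fetchWeb` in place of the inline options shown in the Quick Start.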
🤖 Advanced Usage: Multi-Step Form Submission
This example demonstrates how to use the browser engine to perform a search on Google.
    import { fetchWeb } from '@isdk/web-fetcher';

    async function searchGoogle(query: string) {
      // Search for the query on Google
      const { result, outputs } = await fetchWeb({
        url: 'https://www.google.com',
        engine: 'browser', // Use the full browser engine for interaction
        actions: [
          // The initial navigation to google.com is handled by the `url` option
          { id: 'fill', params: { selector: 'textarea[name=q]', value: query } },
          { id: 'submit', params: { selector: 'form' } },
          { id: 'waitFor', params: { selector: '#search' } }, // Wait for the search results container to appear
          { id: 'getContent', storeAs: 'searchResultsPage' },
        ]
      });

      console.log('Search Results URL:', result?.finalUrl);
      console.log('Outputs contains the full page content:', outputs.searchResultsPage.html.substring(0, 100));
    }

    searchGoogle('gemini');
🏗️ Architecture
This library is built on two core concepts: Engines and Actions.
- Engine Architecture: The library's core is its dual-engine design. It abstracts away the complexities of web interaction behind a unified API. For detailed information on the `http` (Cheerio) and `browser` (Playwright) engines, how they manage state, and how to extend them, please see the Fetch Engine Architecture document.
- Action Architecture: All workflows are defined as a series of "Actions". The library provides a set of built-in atomic actions and a powerful composition model for creating your own semantic actions. For a deep dive into creating and using actions, see the Action Script Architecture document.
📚 API Reference
fetchWeb(options) or fetchWeb(url, options)
This is the main entry point for the library.
Key FetcherOptions:
- `url` (`string`): The initial URL to navigate to.
- `engine` (`'http' | 'browser' | 'auto'`): The engine to use. Defaults to `auto`.
- `proxy` (`string | string[]`): Proxy URL(s) to use for requests.
- `debug` (`boolean | string | string[]`): Enable detailed execution metadata (timings, engine used, etc.) in response, or enable debug logs for specific categories (e.g., 'extract', 'submit', 'request').
- `actions` (`FetchActionOptions[]`): An array of action objects to execute. (Supports `action`/`name` as alias for `id`, and `args` as alias for `params`)
- `headers` (`Record<string, string>`): Headers to use for all requests.
- `cookies` (`Cookie[]`): Array of cookies to use.
- `sessionState` (`any`): Crawlee session state to restore.
- `storage` (`StorageOptions`): Controls session isolation, persistence, and cleanup.
  - `id` (`string`): Shared storage ID for cross-session data reuse.
  - `persist` (`boolean`): Whether to save data to disk.
  - `purge` (`boolean`): Whether to delete data on cleanup (defaults to `true`).
  - `config` (`object`): Raw Crawlee configuration (e.g., `{ localDataDirectory: './data' }`).
- `cache` (`FetchCacheOptions`): Controls persistent HTTP caching with smart self-healing mechanisms.
  - `enabled` (`boolean`): Whether to enable caching.
  - `offline` (`boolean`): Enable offline mode (prohibit network requests, throw error on MISS).
  - `storagePath` (`string`): Custom path for cache storage. Shared pools are managed automatically.
  - `backgroundUpdate` (`boolean`): Whether to enable SWR (Stale-While-Revalidate) background updates. Default: `true`.
  - `staleIfError` (`boolean`): Force return of stale cache if network request fails.
  - `forceCache` (`boolean`): Ignore server directives and force caching.
  - `refresh` (`boolean`): Force Refresh: Ignore existing cache to re-validate and "heal" the cache entry. Useful for bypassing blocks via manual verification.
  - `methods`, `cacheRules`, `query`, `headers`, `cookies`, `body`: Fine-grained cache policy configuration. Supports `STALE_RESCUE` and `WAF_CHALLENGE` detection for automatic engine upgrade and cache healing when used with `enableSmart`.
- `output` (`object`): Controls the output fields in `FetchResponse`.
  - `cookies` (`boolean`): Whether to include cookies in the response (default: `true`).
  - `sessionState` (`boolean`): Whether to include session state in the response (default: `true`).
- `browser` (`object`): Browser engine configuration.
  - `headless` (`boolean`): Run in headless mode (default: `true`).
  - `launchOptions` (`object`): Playwright launch options (e.g., `{ slowMo: 50, args: [...] }`).
- `sessionPoolOptions` (`SessionPoolOptions`): Advanced configuration for the underlying Crawlee SessionPool.
- `enableSmart` (`boolean`): Enable smart detection and automatic engine upgrade (default: `true`).
- `syncStateOnUpgrade` (`boolean`): Whether to sync Cookies / Session state when upgrading from http to browser engine (default: `false`).
- `upgradeThresholdMs` (`number`): Wait time threshold in milliseconds to trigger engine upgrade; upgrades if exceeded or no explicit retry info (default: `5000`).
- `maxRetries` (`number`): Maximum retry attempts for a single Action (default: `0`).
- `failOnError` (`boolean`): Whether to throw an exception when an Action fails (default: `true` for main flow, `false` for collector).
- `failOnTimeout` (`boolean`): Whether to treat timeout as failure (default: `false`).
- ...and many other options for proxy, retries, etc.
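
The alias support noted for `actions` (`action`/`name` for `id`, `args` for `params`) means several spellings describe the same step. The following sketch shows one way such aliases could be folded into a single canonical shape. It is not the library's code: `normalizeAction` and both interfaces are hypothetical, and the precedence order when more than one alias is present (`id`, then `action`, then `name`; `params` before `args`) is an assumption.

```typescript
// Hypothetical helper for illustration; not exported by @isdk/web-fetcher.
interface RawAction {
  id?: string;
  action?: string;
  name?: string;
  params?: Record<string, unknown>;
  args?: Record<string, unknown>;
  storeAs?: string;
}

interface NormalizedAction {
  id: string;
  params: Record<string, unknown>;
  storeAs?: string;
}

function normalizeAction(raw: RawAction): NormalizedAction {
  // Assumed precedence: id, then action, then name.
  const id = raw.id ?? raw.action ?? raw.name;
  if (!id) {
    throw new Error('An action needs one of: id, action, name');
  }
  // Assumed precedence: params, then args; default to an empty object.
  const normalized: NormalizedAction = { id, params: raw.params ?? raw.args ?? {} };
  if (raw.storeAs !== undefined) {
    normalized.storeAs = raw.storeAs;
  }
  return normalized;
}

// Both spellings normalize to the same canonical step.
const a = normalizeAction({ id: 'fill', params: { selector: 'input', value: 'x' } });
const b = normalizeAction({ action: 'fill', args: { selector: 'input', value: 'x' } });
console.log(JSON.stringify(a) === JSON.stringify(b));
```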
Built-in Actions
The library provides a set of powerful built-in actions, many of which are engine-agnostic and handled centrally for consistency:
- `goto`: Navigates to a new URL.
- `click`: Clicks on an element (Engine-specific).
- `fill`: Fills an input field (Engine-specific).
- `submit`: Submits a form (Engine-specific).
- `mouseMove`: Moves the mouse cursor to a specific coordinate or element (Bézier curve supported).
- `mouseClick`: Triggers a mouse click at the current position or specified coordinates.
- `mouseWheel`: Simulates a mouse wheel scroll event with horizontal and vertical deltas. Supports splitting into multiple steps and automatic scrolling to make the target element visible.
- `scrollIntoView`: Scrolls the page or a container to make a specific element visible in the viewport.
- `keyboardType`: Simulates human-like typing into the currently focused element.
- `keyboardPress`: Simulates pressing a single key or a key combination.
- `trim`: Removes elements from the DOM to clean up the page.
- `waitFor`: Pauses execution to wait for a specific condition (Supports fixed timeouts centrally).
- `pause`: Pauses execution for manual intervention (Handled centrally).
- `getContent`: Retrieves the full content of the current page (Handled centrally).
- `evaluate`: Executes custom JavaScript within the page context.
- `extract`: Extracts structured data using an engine-agnostic core logic and engine-specific DOM primitives. Supports `required` fields and `strict` validation.
Response Structure
The fetchWeb function returns an object containing:
- `result` (`FetchResponse`):
  - `url`: The final URL.
  - `statusCode`: HTTP status code.
  - `headers`: HTTP headers.
  - `cookies`: Array of cookies.
  - `sessionState`: Crawlee session state.
  - `text`, `html`: Page content.
- `outputs` (`Record<string, any>`): Data extracted and stored via `storeAs`. Note: When multiple actions store objects into the same key, they are merged instead of overwritten.
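
The merge behavior of `outputs` can be pictured with a small model. This is a sketch under stated assumptions, not the library's implementation: `storeOutput` and `isPlainObject` are hypothetical helpers, the merge is assumed to be shallow, and non-object values are assumed to simply replace whatever was stored before.

```typescript
// Hypothetical model of how `storeAs` results might accumulate in `outputs`.
type Outputs = Record<string, unknown>;

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function storeOutput(outputs: Outputs, key: string, value: unknown): void {
  const existing = outputs[key];
  if (isPlainObject(existing) && isPlainObject(value)) {
    // Two objects under the same key: merge (shallow merge is an assumption).
    outputs[key] = { ...existing, ...value };
  } else {
    // Anything else: the new value replaces the old one (assumption).
    outputs[key] = value;
  }
}

const outputs: Outputs = {};
storeOutput(outputs, 'product', { title: 'Widget' });
storeOutput(outputs, 'product', { price: 9.99 });

console.log(JSON.stringify(outputs.product));
```

Under these assumptions, two `extract` actions that both use `storeAs: 'product'` end up contributing fields to one combined object rather than the second discarding the first.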