
apify

Professional software vendor delivering innovative solutions on the Softono platform. Specialized in both open-source and proprietary software development.

browser-pool
Open Source

# (DEPRECATED) Browser Pool - the headless browser manager

> **DEPRECATED!**
>
> The Browser Pool package now lives in the [`crawlee`](https://github.com/apify/crawlee) repository - submit all your issues and PRs there. This repository is no longer actively maintained!

Browser Pool is a small but powerful and extensible library that allows you to seamlessly control multiple headless browsers at the same time with only a little configuration and a single function call. Currently it supports [Puppeteer](https://github.com/puppeteer/puppeteer) and [Playwright](https://github.com/microsoft/playwright), and it can be easily extended with plugins.

We created Browser Pool because we regularly needed to execute tasks concurrently in many headless browsers and their pages, but we did not want to worry about launching browsers, closing browsers, restarting them after crashes and so on. We also wanted to easily and reliably manage the whole browser / page lifecycle.

You can use Browser Pool for scraping the internet at scale, testing your website in multiple browsers at the same time or launching web automation robots. We're interested to hear about your use cases in the [Discussions](https://github.com/apify/browser-pool/discussions).

<!-- toc -->

- [Installation](#installation)
- [Usage](#usage)
- [Launching multiple browsers](#launching-multiple-browsers)
- [Features](#features)
  * [Simple configuration](#simple-configuration)
  * [Proxy management](#proxy-management)
  * [Lifecycle management with hooks](#lifecycle-management-with-hooks)
  * [Manipulating playwright context using `pageOptions` or `launchOptions`](#manipulating-playwright-context-using-pageoptions-or-launchoptions)
  * [Single API for common operations](#single-api-for-common-operations)
  * [Graceful browser closing](#graceful-browser-closing)
  * [Changing browser fingerprints a.k.a. browser signatures](#changing-browser-fingerprints-aka-browser-signatures)
  * [(UNSTABLE) Extensibility with plugins](#unstable-extensibility-with-plugins)
- [API Reference](#api-reference)

<!-- tocstop -->

## Installation

Use NPM or Yarn to install `browser-pool`. Note that `browser-pool` does not come preinstalled with browser automation libraries. This allows you to choose your own libraries and their versions, and it also makes `browser-pool` much smaller.

Run this command to install `browser-pool` and the `playwright` browser automation library.

```bash
npm install browser-pool playwright
```

## Usage

This simple example shows how to open a page in a browser using Browser Pool. We use the provided `PlaywrightPlugin` to wrap a Playwright installation of your own. By calling `browserPool.newPage()` you launch a new Chromium browser and open a new page in that browser.

```js
const { BrowserPool, PlaywrightPlugin } = require('browser-pool');
const playwright = require('playwright');

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(playwright.chromium)],
});

// An asynchronous IIFE (immediately invoked function expression)
// allows us to use the 'await' keyword.
(async () => {
    // Launches Chromium with Playwright and returns a Playwright Page.
    const page1 = await browserPool.newPage();
    // You can interact with the page as you're used to.
    await page1.goto('https://example.com');
    // When you're done, close the page.
    await page1.close();

    // Opens a second page in the same browser.
    const page2 = await browserPool.newPage();
    // When everything's finished, tear down the pool.
    await browserPool.destroy();
})();
```

> Browser Pool uses the same asynchronous API as the underlying automation libraries, which means extensive use of Promises and the `async` / `await` pattern. [Visit MDN to learn more](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Asynchronous/Async_await).
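Since the example above closes the page manually, a task that throws would leave the page open. One way to guard against that is a `try` / `finally` wrapper. The following is only a minimal sketch: `withPage` is a hypothetical helper that is not part of `browser-pool`, and `fakePool` is a stand-in object that mimics `newPage()` so the snippet runs without launching a browser.

```js
// Hypothetical helper (not a browser-pool API): opens a page, runs a task,
// and always closes the page, even when the task throws.
async function withPage(pool, task) {
    const page = await pool.newPage();
    try {
        return await task(page);
    } finally {
        await page.close();
    }
}

// Stand-in for a real BrowserPool, so the sketch runs without a browser.
const closedPages = [];
const fakePool = {
    async newPage() {
        return {
            async goto(url) { this.url = url; },
            async close() { closedPages.push(this.url); },
        };
    },
};

// The task's return value is passed through once the page is closed.
const demo = withPage(fakePool, async (page) => {
    await page.goto('https://example.com');
    return page.url;
});
```

With a real pool, you would pass the `browserPool` instance instead of `fakePool` and still call `browserPool.destroy()` once all tasks are finished.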
## Launching multiple browsers

The basic example shows how to launch a single browser, but the purpose of Browser Pool is to launch many browsers. This is done automatically in the background. You only need to provide the relevant plugins and call `browserPool.newPage()`.

```js
const { BrowserPool, PlaywrightPlugin } = require('browser-pool');
const playwright = require('playwright');

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(playwright.chromium),
        new PlaywrightPlugin(playwright.firefox),
        new PlaywrightPlugin(playwright.webkit),
    ],
});

(async () => {
    // Open 4 pages in 3 browsers. The browsers are launched
    // in a round-robin fashion based on the plugin order.
    const chromiumPage = await browserPool.newPage();
    const firefoxPage = await browserPool.newPage();
    const webkitPage = await browserPool.newPage();
    const chromiumPage2 = await browserPool.newPage();

    // Don't forget to close pages / destroy pool when you're done.
})();
```

This round-robin way of opening pages may not suit you if you need to consistently run tasks in multiple environments. For that, there's the `newPageWithEachPlugin` function.

```js
const { BrowserPool, PlaywrightPlugin, PuppeteerPlugin } = require('browser-pool');
const playwright = require('playwright');
const puppeteer = require('puppeteer');

const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(playwright.chromium),
        new PuppeteerPlugin(puppeteer),
    ],
});

(async () => {
    const pages = await browserPool.newPageWithEachPlugin();
    const promises = pages.map(async page => {
        // Run some task with each page
        // pages are in order of plugins:
        // [playwrightPage, puppeteerPage]
        await page.close();
    });
    await Promise.all(promises);

    // Continue with some more work.
})();
```

## Features

Besides a simple interface for launching browsers, Browser Pool includes other helpful features that make browser management more convenient.
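As a side note on the round-robin order mentioned for `newPage()`, the selection logic can be pictured with a small sketch. This is not the library's actual code: `createRoundRobinPicker` is a hypothetical function, the plugins are represented by plain strings, and page limits and browser retirement are ignored.

```js
// Hypothetical sketch of round-robin selection; not browser-pool's real code.
function createRoundRobinPicker(plugins) {
    let next = 0;
    return function pick() {
        const plugin = plugins[next];
        // Wrap around to the first plugin after the last one.
        next = (next + 1) % plugins.length;
        return plugin;
    };
}

// Strings stand in for PlaywrightPlugin instances.
const pick = createRoundRobinPicker(['chromium', 'firefox', 'webkit']);
const order = [pick(), pick(), pick(), pick()];
console.log(order.join(', ')); // chromium, firefox, webkit, chromium
```

The fourth pick wraps around to the first plugin, which matches the "4 pages in 3 browsers" comment in the example for launching multiple browsers.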
### Simple configuration

You can easily set the maximum number of pages that can be open in a given browser and also the maximum number of pages to process before a browser [is retired](#graceful-browser-closing).

```js
const browserPool = new BrowserPool({
    maxOpenPagesPerBrowser: 20,
    retireBrowserAfterPageCount: 100,
});
```

You can configure the browser launch options either right in the plugins:

```js
const playwrightPlugin = new PlaywrightPlugin(playwright.chromium, {
    launchOptions: {
        headless: true,
    }
})
```

Or dynamically in [pre-launch hooks](#lifecycle-management-with-hooks):

```js
const browserPool = new BrowserPool({
    preLaunchHooks: [(pageId, launchContext) => {
        if (pageId === 'headful') {
            launchContext.launchOptions.headless = false;
        }
    }]
});
```

### Proxy management

When scraping at scale or testing websites from multiple geolocations, one often needs to use proxy servers. Setting up an authenticated proxy in Puppeteer can be cumbersome, so we created a helper that does all the heavy lifting for you. Simply provide a proxy URL with authentication credentials, and you're done. It works the same for Playwright too.

```js
const puppeteerPlugin = new PuppeteerPlugin(puppeteer, {
    proxyUrl: 'http://<username>:<password>@proxy.com:8000'
});
```

> We plan to extend this by adding a proxy-per-page functionality,
> allowing you to rotate proxies per page, rather than per browser.

### Lifecycle management with hooks

Browser Pool allows you to manage the full browser / page lifecycle by attaching hooks to the most important events. Asynchronous hooks are supported, and their execution order is guaranteed.

The first parameter of each hook is either a `pageId` for the hooks executed before a `page` is created or a `page` afterwards. This is useful to keep track of which hook was triggered by which `newPage()` call.
```js
const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(playwright.chromium),
    ],
    preLaunchHooks: [(pageId, launchContext) => {
        // You can use pre-launch hooks to make dynamic changes
        // to the launchContext, such as changing a proxyUrl
        // or updating the browser launchOptions
        pageId === 'my-page' // true
    }],
    postPageCreateHooks: [(page, browserController) => {
        // It makes sense to make global changes to pages
        // in post-page-create hooks. For example, you can
        // inject some JavaScript library, such as jQuery.
        browserPool.getPageId(page) === 'my-page' // true
    }]
});

await browserPool.newPage({ id: 'my-page' });
```

> See the API Documentation for all hooks and their arguments.

### Manipulating playwright context using `pageOptions` or `launchOptions`

Playwright allows customizing multiple browser attributes per browser context. You can customize some of them once the context is created, but others must be set when the context is created. This part of the documentation explains how to customize the browser context effectively.

First, consider which context strategy you chose. You can choose between two strategies with the `useIncognitoPages` option of `LaunchContext`.

Suppose you keep the `useIncognitoPages` default of `false` and create a shared context across all pages launched by one browser. In this case, you should pass the context options as `launchOptions`, since the context is created when the browser is launched. The `launchOptions` correspond to these [playwright options](https://playwright.dev/docs/api/class-browsertype#browsertypelaunchpersistentcontextuserdatadir-options). As you can see, these options contain not only ordinary playwright launch options but also the context options.

If you set `useIncognitoPages` to `true`, a new context is created with each new page, which allows each page to keep its own cookies and application data.
This approach allows you to pass the context options as `pageOptions`, because a new context is created whenever you create a new page. In this case, the `pageOptions` correspond to these [playwright options](https://playwright.dev/docs/api/class-browser#browsernewpageoptions).

**Changing context options with `LaunchContext`:**

This will only work if you keep the default value for `useIncognitoPages` (`false`).

```javascript
const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(
            playwright.chromium,
            {
                launchOptions: {
                    deviceScaleFactor: 2,
                },
            },
        ),
    ],
});
```

**Changing context options with `browserPool.newPage` options:**

```javascript
const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(
            playwright.chromium,
            {
                useIncognitoPages: true, // You must turn on incognito pages.
                launchOptions: {
                    // launch options
                    headless: false,
                    devtools: true,
                },
            },
        ),
    ],
});

(async () => {
    // Launches Chromium with Playwright and returns a Playwright Page.
    const page = await browserPool.newPage({
        pageOptions: {
            // context options
            deviceScaleFactor: 2,
            colorScheme: 'light',
            locale: 'de-DE',
        },
    });
})();
```

**Changing context options with `prePageCreateHooks` options:**

```javascript
const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(
            playwright.chromium,
            {
                useIncognitoPages: true,
                launchOptions: {
                    // launch options
                    headless: false,
                    devtools: true,
                },
            },
        ),
    ],
    prePageCreateHooks: [
        (pageId, browserController, pageOptions) => {
            pageOptions.deviceScaleFactor = 2;
            pageOptions.colorScheme = 'dark';
            pageOptions.locale = 'de-DE';

            // You must modify the 'pageOptions' object, not assign to the variable.
            // pageOptions = {deviceScaleFactor: 2, ...etc} => This will not work!
        },
    ],
});

(async () => {
    // Launches Chromium with Playwright and returns a Playwright Page.
    const page = await browserPool.newPage();
})();
```

### Single API for common operations

Puppeteer and Playwright handle some things differently.
Browser Pool attempts to remove those differences for the most common use-cases.

```js
// Playwright
const cookies = await context.cookies();
await context.addCookies(cookies);

// Puppeteer
const cookies = await page.cookies();
await page.setCookie(...cookies);

// BrowserPool uses the same API for all plugins
const cookies = await browserController.getCookies(page);
await browserController.setCookies(page, cookies);
```

### Graceful browser closing

With Browser Pool, browsers are not closed, but retired. A retired browser will no longer open new pages, but it will wait until the open pages are closed, allowing your running tasks to finish. If a browser gets stuck in limbo, it will be killed after a timeout to prevent hanging browser processes.

### Changing browser fingerprints a.k.a. browser signatures

Changing browser fingerprints helps you avoid getting blocked and simulate real user browsers. With Browser Pool, you can apply this otherwise complicated technique by enabling the `useFingerprints` option. By default, the fingerprints are tied to the respective session (and proxy URLs) so that the same unique fingerprint is not used from various IP addresses. You can disable this behavior in the [`fingerprintsOptions`](#new_BrowserPool_new). Using `fingerprintsOptions`, you can also control which fingerprints are generated, with parameters such as browser, operating system, and browser versions.

### (UNSTABLE) Extensibility with plugins

A new super cool browser automation library appears? No problem, we add a simple plugin to Browser Pool and it automagically works.

> The BrowserPlugin and BrowserController interfaces are unstable and may
> change if we find some implementation to be sub-optimal.

## API Reference

All public classes, methods and their parameters can be inspected in this API reference.

<a name="module_browser-pool"></a>

### browser-pool

The `browser-pool` module exports three constructors.
One for `BrowserPool` itself and two for the included Puppeteer and Playwright plugins.

**Example:**

```js
const {
    BrowserPool,
    PuppeteerPlugin,
    PlaywrightPlugin
} = require('browser-pool');
const puppeteer = require('puppeteer');
const playwright = require('playwright');

const browserPool = new BrowserPool({
    browserPlugins: [
        new PuppeteerPlugin(puppeteer),
        new PlaywrightPlugin(playwright.chromium),
    ]
});
```

**Properties**

| Name | Type |
| --- | --- |
| BrowserPool | [<code>BrowserPool</code>](#BrowserPool) |
| PuppeteerPlugin | <code>PuppeteerPlugin</code> |
| PlaywrightPlugin | <code>PlaywrightPlugin</code> |

* * *

<a name="BrowserPool"></a>

### BrowserPool

The `BrowserPool` class is the most important class of the `browser-pool` module. It manages opening and closing of browsers and their pages, and its constructor options allow easy configuration of the browsers' and pages' lifecycle.

The most important and useful constructor options are the various lifecycle hooks. Those allow you to sequentially call a list of (asynchronous) functions at each stage of the browser / page lifecycle.
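To make "sequentially call a list of (asynchronous) functions" concrete, here is a minimal sketch of one way such a hook runner could work. It is an illustration only: `runHooksInOrder` is a hypothetical function, not the library's internal implementation, and the hooks below are toy examples.

```js
// Hypothetical sketch: run hooks one at a time, in array order,
// waiting for each (possibly async) hook before starting the next.
async function runHooksInOrder(hooks, ...args) {
    for (const hook of hooks) {
        await hook(...args);
    }
}

const calls = [];
const hooks = [
    async (pageId) => {
        // A slow async hook still finishes before the next one starts.
        await new Promise((resolve) => setTimeout(resolve, 20));
        calls.push(`first:${pageId}`);
    },
    (pageId) => {
        // Synchronous hooks work too; awaiting a non-promise is harmless.
        calls.push(`second:${pageId}`);
    },
];

const hooksDone = runHooksInOrder(hooks, 'my-page');
```

Awaiting each hook inside a `for...of` loop, rather than using `Promise.all`, is what keeps the order deterministic in this sketch.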
**Example:**

```js
const { BrowserPool, PlaywrightPlugin } = require('browser-pool');
const playwright = require('playwright');

const browserPool = new BrowserPool({
    browserPlugins: [new PlaywrightPlugin(playwright.chromium)],
    preLaunchHooks: [(pageId, launchContext) => {
        // do something before a browser gets launched
        launchContext.launchOptions.headless = false;
    }],
    postLaunchHooks: [(pageId, browserController) => {
        // manipulate the browser right after launch
        console.dir(browserController.browser.contexts());
    }],
    prePageCreateHooks: [(pageId, browserController) => {
        if (pageId === 'my-page') {
            // make changes right before a specific page is created
        }
    }],
    postPageCreateHooks: [async (page, browserController) => {
        // update some or all new pages
        await page.evaluate(() => {
            // now all pages will have 'foo'
            window.foo = 'bar'
        })
    }],
    prePageCloseHooks: [async (page, browserController) => {
        // collect information just before a page closes
        await page.screenshot();
    }],
    postPageCloseHooks: [(pageId, browserController) => {
        // clean up or log after a job is done
        console.log('Page closed: ', pageId)
    }]
});
```

* [BrowserPool](#BrowserPool)
    * [`new BrowserPool(options)`](#new_BrowserPool_new)
    * [`.newPage(options)`](#BrowserPool+newPage) ⇒ <code>Promise.&lt;Page&gt;</code>
    * [`.newPageInNewBrowser(options)`](#BrowserPool+newPageInNewBrowser) ⇒ <code>Promise.&lt;Page&gt;</code>
    * [`.newPageWithEachPlugin(optionsList)`](#BrowserPool+newPageWithEachPlugin) ⇒ <code>Promise.&lt;Array.&lt;Page&gt;&gt;</code>
    * [`.getBrowserControllerByPage(page)`](#BrowserPool+getBrowserControllerByPage) ⇒ [<code>BrowserController</code>](#BrowserController)
    * [`.getPage(id)`](#BrowserPool+getPage) ⇒ <code>Page</code>
    * [`.getPageId(page)`](#BrowserPool+getPageId) ⇒ <code>string</code>
    * [`.retireBrowserController(browserController)`](#BrowserPool+retireBrowserController)
    * [`.retireBrowserByPage(page)`](#BrowserPool+retireBrowserByPage)
    * [`.retireAllBrowsers()`](#BrowserPool+retireAllBrowsers)
    * [`.closeAllBrowsers()`](#BrowserPool+closeAllBrowsers) ⇒ <code>Promise.&lt;void&gt;</code>
    * [`.destroy()`](#BrowserPool+destroy) ⇒ <code>Promise.&lt;void&gt;</code>

* * *

<a name="new_BrowserPool_new"></a>

#### `new BrowserPool(options)`

| Param | Type | Default | Description |
| --- | --- | --- | --- |
| options | <code>object</code> | | |
| options.browserPlugins | [<code>Array.&lt;BrowserPlugin&gt;</code>](#BrowserPlugin) | | Browser plugins are wrappers of browser automation libraries that allow `BrowserPool` to control browsers with those libraries. `browser-pool` comes with a `PuppeteerPlugin` and a `PlaywrightPlugin`. |
| [options.maxOpenPagesPerBrowser] | <code>number</code> | <code>20</code> | Sets the maximum number of pages that can be open in a browser at the same time. Once reached, a new browser will be launched to handle the excess. |
| [options.retireBrowserAfterPageCount] | <code>number</code> | <code>100</code> | Browsers tend to get bloated after processing a lot of pages. This option configures the number of processed pages after which the browser will automatically retire and close. A new browser will launch in its place. |
| [options.operationTimeoutSecs] | <code>number</code> | <code>15</code> | As we know from experience, async operations of the underlying libraries, such as launching a browser or opening a new page, can get stuck. To prevent `BrowserPool` from getting stuck, we add a timeout to those operations and you can configure it with this option. |
| [options.closeInactiveBrowserAfterSecs] | <code>number</code> | <code>300</code> | Browsers normally close immediately after their last page is processed. However, there could be situations where this does not happen. Browser Pool makes sure all inactive browsers are closed regularly, to free resources. |
| [options.preLaunchHooks] | <code>Array.&lt;function()&gt;</code> | | Pre-launch hooks are executed just before a browser is launched and provide a good opportunity to dynamically change the launch options. The hooks are called with two arguments: `pageId`: `string` and `launchContext`: [LaunchContext](#LaunchContext) |
| [options.postLaunchHooks] | <code>Array.&lt;function()&gt;</code> | | Post-launch hooks are executed as soon as a browser is launched. The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](#BrowserController). To guarantee order of execution before other hooks in the same browser, the [BrowserController](#BrowserController) methods cannot be used until the post-launch hooks complete. If you attempt to call `await browserController.close()` from a post-launch hook, it will deadlock the process. This API is subject to change. |
| [options.prePageCreateHooks] | <code>Array.&lt;function()&gt;</code> | | Pre-page-create hooks are executed just before a new page is created. They are useful to make dynamic changes to the browser before opening a page. The hooks are called with three arguments: `pageId`: `string`, `browserController`: [BrowserController](#BrowserController) and `pageOptions`: `object\|undefined`. The `pageOptions` argument only works if the underlying `BrowserController` supports new page options. So far, new page options are only supported by `PlaywrightController`. If the page options are not supported by the `BrowserController`, the `pageOptions` argument is `undefined`. |
| [options.postPageCreateHooks] | <code>Array.&lt;function()&gt;</code> | | Post-page-create hooks are called right after a new page is created and all internal actions of Browser Pool are completed. This is the place to make changes that you would like to apply to all pages, such as injecting a JavaScript library. The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](#BrowserController) |
| [options.prePageCloseHooks] | <code>Array.&lt;function()&gt;</code> | | Pre-page-close hooks give you the opportunity to make last-second changes in a page that's about to be closed, such as saving a snapshot or updating state. The hooks are called with two arguments: `page`: `Page` and `browserController`: [BrowserController](#BrowserController) |
| [options.postPageCloseHooks] | <code>Array.&lt;function()&gt;</code> | | Post-page-close hooks allow you to do page-related clean up. The hooks are called with two arguments: `pageId`: `string` and `browserController`: [BrowserController](#BrowserController) |
| [options.useFingerprints] | <code>boolean</code> | <code>false</code> | If true, Browser Pool will automatically generate fingerprints and inject them into browsers. |
| [options.fingerprintsOptions] | <code>FingerprintOptions</code> | | Fingerprint options that allow customizing the fingerprinting behavior. |
| [options.fingerprintsOptions.fingerprintGeneratorOptions] | | | See the [Fingerprint generator](https://github.com/apify/fingerprint-generator#headergeneratoroptions) documentation. |
| [options.fingerprintsOptions.useFingerprintPerProxyCache] | <code>boolean</code> | <code>true</code> | Fingerprints are automatically assigned to an IP address, so 1 IP equals 1 fingerprint. You can disable this behavior by setting this property to false. |
| [options.fingerprintsOptions.fingerprintPerProxyCacheSize] | <code>number</code> | <code>10000</code> | Maximum number of IP to fingerprint pairs. |

* * *

<a name="BrowserPool+newPage"></a>

#### `browserPool.newPage(options)` ⇒ <code>Promise.&lt;Page&gt;</code>

Opens a new page in one of the running browsers or launches a new browser and opens a page there, if no browsers are active, or their page limits have been exceeded.

| Param | Type | Description |
| --- | --- | --- |
| options | <code>object</code> | |
| [options.id] | <code>string</code> | Assign a custom ID to the page. If you don't, a random string ID will be generated. |
| [options.pageOptions] | <code>object</code> | Some libraries (Playwright) allow you to open new pages with specific options. Use this property to set those options. |
| [options.browserPlugin] | [<code>BrowserPlugin</code>](#BrowserPlugin) | Choose a plugin to open the page with. If none is provided, one of the pool's available plugins will be used. It must be one of the plugins browser pool was created with. If you wish to start a browser with a different configuration, see the `newPageInNewBrowser` function. |

* * *

<a name="BrowserPool+newPageInNewBrowser"></a>

#### `browserPool.newPageInNewBrowser(options)` ⇒ <code>Promise.&lt;Page&gt;</code>

Unlike [newPage](#BrowserPool+newPage), `newPageInNewBrowser` always launches a new browser to open the page in. Use the `launchOptions` option to configure the new browser.

| Param | Type | Description |
| --- | --- | --- |
| options | <code>object</code> | |
| [options.id] | <code>string</code> | Assign a custom ID to the page. If you don't, a random string ID will be generated. |
| [options.pageOptions] | <code>object</code> | Some libraries (Playwright) allow you to open new pages with specific options. Use this property to set those options. |
| [options.launchOptions] | <code>object</code> | Options that will be used to launch the new browser. |
| [options.browserPlugin] | [<code>BrowserPlugin</code>](#BrowserPlugin) | Provide a plugin to launch the browser. If none is provided, one of the pool's available plugins will be used. If you configured `BrowserPool` to rotate multiple libraries, such as both Puppeteer and Playwright, you should always set the `browserPlugin` when using the `launchOptions` option. The plugin will not be added to the list of plugins used by the pool. You can either use one of those, to launch a specific browser, or provide a completely new configuration. |

* * *

<a name="BrowserPool+newPageWithEachPlugin"></a>

#### `browserPool.newPageWithEachPlugin(optionsList)` ⇒ <code>Promise.&lt;Array.&lt;Page&gt;&gt;</code>

Opens new pages with all available plugins and returns an array of pages in the same order as the plugins were provided to `BrowserPool`. This is useful when you want to run a script in multiple environments at the same time, typically in testing or website analysis.

**Example:**

```js
const browserPool = new BrowserPool({
    browserPlugins: [
        new PlaywrightPlugin(playwright.chromium),
        new PlaywrightPlugin(playwright.firefox),
        new PlaywrightPlugin(playwright.webkit),
        new PuppeteerPlugin(puppeteer),
    ]
});

const pages = await browserPool.newPageWithEachPlugin();
const [chromiumPage, firefoxPage, webkitPage, puppeteerPage] = pages;
```

| Param | Type |
| --- | --- |
| optionsList | <code>Array.&lt;object&gt;</code> |

* * *

<a name="BrowserPool+getBrowserControllerByPage"></a>

#### `browserPool.getBrowserControllerByPage(page)` ⇒ [<code>BrowserController</code>](#BrowserController)

Retrieves a [BrowserController](#BrowserController) for a given page. This is useful when you're working only with pages and need to access the browser manipulation functionality.

You could access the browser directly from the page, but that would circumvent `BrowserPool` and most likely cause weird things to happen, so please always use `BrowserController` to control your browsers. The function returns `undefined` if the browser is closed.

| Param | Type | Description |
| --- | --- | --- |
| page | <code>Page</code> | Browser plugin page |

* * *

<a name="BrowserPool+getPage"></a>

#### `browserPool.getPage(id)` ⇒ <code>Page</code>

If you provided a custom ID to one of your pages or saved the randomly generated one, you can use this function to retrieve the page. If the page is no longer open, the function will return `undefined`.

| Param | Type |
| --- | --- |
| id | <code>string</code> |

* * *

<a name="BrowserPool+getPageId"></a>

#### `browserPool.getPageId(page)` ⇒ <code>string</code>

Page IDs are used throughout `BrowserPool` as a method of linking events. You can use a page ID to track the full lifecycle of the page. It is created even before a browser is launched and stays with the page until it's closed.

| Param | Type |
| --- | --- |
| page | <code>Page</code> |

* * *

<a name="BrowserPool+retireBrowserController"></a>

#### `browserPool.retireBrowserController(browserController)`

Removes a browser controller from the pool. The underlying browser will be closed after all its pages are closed.

| Param | Type |
| --- | --- |
| browserController | [<code>BrowserController</code>](#BrowserController) |

* * *

<a name="BrowserPool+retireBrowserByPage"></a>

#### `browserPool.retireBrowserByPage(page)`

Removes a browser from the pool. It will be closed after all its pages are closed.

| Param | Type |
| --- | --- |
| page | <code>Page</code> |

* * *

<a name="BrowserPool+retireAllBrowsers"></a>

#### `browserPool.retireAllBrowsers()`

Removes all active browsers from the pool. The browsers will be closed after all their pages are closed.

* * *

<a name="BrowserPool+closeAllBrowsers"></a>

#### `browserPool.closeAllBrowsers()` ⇒ <code>Promise.&lt;void&gt;</code>

Closes all managed browsers without waiting for pages to close.

* * *

<a name="BrowserPool+destroy"></a>

#### `browserPool.destroy()` ⇒ <code>Promise.&lt;void&gt;</code>

Closes all managed browsers and tears down the pool.

* * *

<a name="BrowserController"></a>

### BrowserController

The `BrowserController` serves two purposes. First, it is the base class that specialized controllers like `PuppeteerController` or `PlaywrightController` extend. Second, it defines the public interface of the specialized classes which provide only private methods.
Therefore, we do not keep documentation for the specialized classes, because it's the same for all of them.

**Properties**

| Name | Type | Description |
| --- | --- | --- |
| id | <code>string</code> | |
| browserPlugin | [<code>BrowserPlugin</code>](#BrowserPlugin) | The `BrowserPlugin` instance used to launch the browser. |
| browser | <code>Browser</code> | Browser representation of the underlying automation library. |
| launchContext | [<code>LaunchContext</code>](#LaunchContext) | The configuration the browser was launched with. |

* [BrowserController](#BrowserController)
    * [`.close()`](#BrowserController+close) ⇒ <code>Promise.&lt;void&gt;</code>
    * [`.kill()`](#BrowserController+kill) ⇒ <code>Promise.&lt;void&gt;</code>
    * [`.setCookies(page, cookies)`](#BrowserController+setCookies) ⇒ <code>Promise.&lt;void&gt;</code>
    * [`.getCookies(page)`](#BrowserController+getCookies) ⇒ <code>Promise.&lt;Array.&lt;object&gt;&gt;</code>

* * *

<a name="BrowserController+close"></a>

#### `browserController.close()` ⇒ <code>Promise.&lt;void&gt;</code>

Gracefully closes the browser and makes sure there will be no lingering browser processes.

Emits 'browserClosed' event.

* * *

<a name="BrowserController+kill"></a>

#### `browserController.kill()` ⇒ <code>Promise.&lt;void&gt;</code>

Immediately kills the browser process.

Emits 'browserClosed' event.

* * *

<a name="BrowserController+setCookies"></a>

#### `browserController.setCookies(page, cookies)` ⇒ <code>Promise.&lt;void&gt;</code>

| Param | Type |
| --- | --- |
| page | <code>Object</code> |
| cookies | <code>Array.&lt;object&gt;</code> |

* * *

<a name="BrowserController+getCookies"></a>

#### `browserController.getCookies(page)` ⇒ <code>Promise.&lt;Array.&lt;object&gt;&gt;</code>

| Param | Type |
| --- | --- |
| page | <code>Object</code> |

* * *

<a name="BrowserPlugin"></a>

### BrowserPlugin

The `BrowserPlugin` serves two purposes.
First, it is the base class that specialized plugins like `PuppeteerPlugin` or `PlaywrightPlugin` extend. Second, it allows the user to configure the automation libraries and feed them to [BrowserPool](#BrowserPool) for use.

**Properties**

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| [useIncognitoPages] | <code>boolean</code> | <code>false</code> | By default pages share the same browser context. If set to true each page uses its own context that is destroyed once the page is closed or crashes. |
| [userDataDir] | <code>object</code> | | Path to a User Data Directory, which stores browser session data like cookies and local storage. |

* * *

<a name="new_BrowserPlugin_new"></a>

#### `new BrowserPlugin(library, [options])`

| Param | Type | Description |
| --- | --- | --- |
| library | <code>object</code> | Each plugin expects an instance of the object with the `.launch()` property. For Puppeteer, it is the `puppeteer` module itself, whereas for Playwright it is one of the browser types, such as `playwright.chromium`. `BrowserPlugin` does not include the library. You can choose any version or fork of the library. It also keeps `browser-pool` installation small. |
| [options] | <code>object</code> | |
| [options.launchOptions] | <code>object</code> | Options that will be passed down to the automation library. E.g. `puppeteer.launch(launchOptions);`. This is a good place to set options that you want to apply as defaults. To dynamically override those options per-browser, see the `preLaunchHooks` of [BrowserPool](#BrowserPool). |
| [options.proxyUrl] | <code>string</code> | Automation libraries configure proxies differently. This helper allows you to set a proxy URL without worrying about specific implementations. It also allows you to use an authenticated proxy without extra code. |

* * *

<a name="LaunchContext"></a>

### LaunchContext

`LaunchContext` holds information about the launched browser.
It's useful to retrieve the `launchOptions`, the proxy the browser was launched with or any other information user chose to add to the `LaunchContext` by calling its `extend` function. This is very useful to keep track of browser-scoped values, such as session IDs. **Properties** | Name | Type | Description | | --- | --- | --- | | id | <code>string</code> | To make identification of `LaunchContext` easier, `BrowserPool` assigns the `LaunchContext` an `id` that's equal to the `id` of the page that triggered the browser launch. This is useful, because many pages share a single launch context (single browser). | | browserPlugin | [<code>BrowserPlugin</code>](#BrowserPlugin) | The `BrowserPlugin` instance used to launch the browser. | | launchOptions | <code>object</code> | The actual options the browser was launched with, after changes. Those changes would be typically made in pre-launch hooks. | | [useIncognitoPages] | <code>boolean</code> | By default pages share the same browser context. If set to true each page uses its own context that is destroyed once the page is closed or crashes. | | [userDataDir] | <code>object</code> | Path to a User Data Directory, which stores browser session data like cookies and local storage. | * [LaunchContext](#LaunchContext) * [`.proxyUrl`](#LaunchContext+proxyUrl) * [`.proxyUrl`](#LaunchContext+proxyUrl) ⇒ <code>string</code> * [`.extend(fields)`](#LaunchContext+extend) * * * <a name="LaunchContext+proxyUrl"></a> #### `launchContext.proxyUrl` Sets a proxy URL for the browser. Use `undefined` to unset existing proxy URL. | Param | Type | | --- | --- | | url | <code>string</code> | * * * <a name="LaunchContext+proxyUrl"></a> #### `launchContext.proxyUrl` ⇒ <code>string</code> Returns the proxy URL of the browser. * * * <a name="LaunchContext+extend"></a> #### `launchContext.extend(fields)` Extend the launch context with any extra fields. This is useful to keep state information relevant to the browser being launched. 
It ensures that no internal fields are overridden and should be used instead of property assignment. | Param | Type | | --- | --- | | fields | <code>object</code> | * * *

Developer Tools Browser Automation
99 Github Stars
crawlee
Open Source

crawlee

<h1 align="center"> <a href="https://crawlee.dev"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/apify/crawlee/master/website/static/img/crawlee-dark.svg?sanitize=true"> <img alt="Crawlee" src="https://raw.githubusercontent.com/apify/crawlee/master/website/static/img/crawlee-light.svg?sanitize=true" width="500"> </picture> </a> <br> <small>A web scraping and browser automation library</small> </h1> <p align=center> <a href="https://trendshift.io/repositories/5179" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5179" alt="apify%2Fcrawlee | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> </p> <p align=center> <a href="https://www.npmjs.com/package/@crawlee/core" rel="nofollow"><img src="https://img.shields.io/npm/v/@crawlee/core.svg" alt="NPM latest version" data-canonical-src="https://img.shields.io/npm/v/@crawlee/core/next.svg" style="max-width: 100%;"></a> <a href="https://www.npmjs.com/package/@crawlee/core" rel="nofollow"><img src="https://img.shields.io/npm/dm/@crawlee/core.svg" alt="Downloads" data-canonical-src="https://img.shields.io/npm/dm/@crawlee/core.svg" style="max-width: 100%;"></a> <a href="https://discord.gg/jyEM2PRvMU" rel="nofollow"><img src="https://img.shields.io/discord/801163717915574323?label=discord" alt="Chat on discord" data-canonical-src="https://img.shields.io/discord/801163717915574323?label=discord" style="max-width: 100%;"></a> <a href="https://github.com/apify/crawlee/actions/workflows/test-ci.yml"><img src="https://github.com/apify/crawlee/actions/workflows/test-ci.yml/badge.svg?branch=master" alt="Build Status" style="max-width: 100%;"></a> </p> Crawlee covers your crawling and scraping end-to-end and **helps you build reliable scrapers. Fast.** Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. 
Crawlee gives you the tools to crawl the web for links, scrape data, and store it to disk or cloud while staying configurable to suit your project's needs. Crawlee is available as the [`crawlee`](https://www.npmjs.com/package/crawlee) NPM package. > 👉 **View full documentation, guides and examples on the [Crawlee project website](https://crawlee.dev)** 👈 > Do you prefer 🐍 Python instead of JavaScript? [👉 Checkout Crawlee for Python 👈](https://github.com/apify/crawlee-python). ## Installation We recommend visiting the [Introduction tutorial](https://crawlee.dev/js/docs/introduction) in Crawlee documentation for more information. > Crawlee requires **Node.js 16 or higher**. ### With Crawlee CLI The fastest way to try Crawlee out is to use the **Crawlee CLI** and choose the **Getting started example**. The CLI will install all the necessary dependencies and add boilerplate code for you to play with. ```bash npx crawlee create my-crawler ``` ```bash cd my-crawler npm start ``` ### Manual installation If you prefer adding Crawlee **into your own project**, try the example below. Because it uses `PlaywrightCrawler` we also need to install [Playwright](https://playwright.dev). It's not bundled with Crawlee to reduce install size. ```bash npm install crawlee playwright ``` ```js import { PlaywrightCrawler, Dataset } from 'crawlee'; // PlaywrightCrawler crawls the web using a headless // browser controlled by the Playwright library. const crawler = new PlaywrightCrawler({ // Use the requestHandler to process each of the crawled pages. async requestHandler({ request, page, enqueueLinks, log }) { const title = await page.title(); log.info(`Title of ${request.loadedUrl} is '${title}'`); // Save results as JSON to ./storage/datasets/default await Dataset.pushData({ title, url: request.loadedUrl }); // Extract links from the current page // and add them to the crawling queue. await enqueueLinks(); }, // Uncomment this option to see the browser window. 
// headless: false, }); // Add first URL to the queue and start the crawl. await crawler.run(['https://crawlee.dev']); ``` By default, Crawlee stores data to `./storage` in the current working directory. You can override this directory via Crawlee configuration. For details, see [Configuration guide](https://crawlee.dev/js/docs/guides/configuration), [Request storage](https://crawlee.dev/js/docs/guides/request-storage) and [Result storage](https://crawlee.dev/js/docs/guides/result-storage). ### Installing pre-release versions We provide automated beta builds for every merged code change in Crawlee. You can find them in the npm [list of releases](https://www.npmjs.com/package/crawlee?activeTab=versions). If you want to test new features or bug fixes before we release them, feel free to install a beta build like this: ```bash npm install crawlee@next ``` If you also use the [Apify SDK](https://github.com/apify/apify-sdk-js), you need to specify dependency overrides in your `package.json` file so that you don't end up with multiple versions of Crawlee installed: ```json { "overrides": { "apify": { "@crawlee/core": "$crawlee", "@crawlee/types": "$crawlee", "@crawlee/utils": "$crawlee" } } } ``` ## 🛠 Features - Single interface for **HTTP and headless browser** crawling - Persistent **queue** for URLs to crawl (breadth & depth first) - Pluggable **storage** of both tabular data and files - Automatic **scaling** with available system resources - Integrated **proxy rotation** and session management - Lifecycles customizable with **hooks** - **CLI** to bootstrap your projects - Configurable **routing**, **error handling** and **retries** - **Dockerfiles** ready to deploy - Written in **TypeScript** with generics ### 👾 HTTP crawling - Zero config **HTTP2 support**, even for proxies - Automatic generation of **browser-like headers** - Replication of browser **TLS fingerprints** - Integrated fast **HTML parsers**. 
Cheerio and JSDOM - Yes, you can scrape **JSON APIs** as well ### 💻 Real browser crawling - JavaScript **rendering** and **screenshots** - **Headless** and **headful** support - Zero-config generation of **human-like fingerprints** - Automatic **browser management** - Use **Playwright** and **Puppeteer** with the same interface - **Chrome**, **Firefox**, **Webkit** and many others ## Usage on the Apify platform Crawlee is open-source and runs anywhere, but since it's developed by [Apify](https://apify.com), it's easy to set up on the Apify platform and run in the cloud. Visit the [Apify SDK website](https://sdk.apify.com) to learn more about deploying Crawlee to the Apify platform. ## Support If you find any bug or issue with Crawlee, please [submit an issue on GitHub](https://github.com/apify/crawlee/issues). For questions, you can ask on [Stack Overflow](https://stackoverflow.com/questions/tagged/apify), in GitHub Discussions or you can join our [Discord server](https://discord.com/invite/jyEM2PRvMU). ## Contributing Your code contributions are welcome, and you'll be praised to eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see [CONTRIBUTING.md](https://github.com/apify/crawlee/blob/master/CONTRIBUTING.md). ## License This project is licensed under the Apache License 2.0 - see the [LICENSE.md](https://github.com/apify/crawlee/blob/master/LICENSE.md) file for details.

Browser Automation
23.7K Github Stars
crawlee-python
Open Source

crawlee-python

<h1 align="center"> <a href="https://crawlee.dev"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/apify/crawlee-python/master/website/static/img/crawlee-dark.svg?sanitize=true"> <img alt="Crawlee" src="https://raw.githubusercontent.com/apify/crawlee-python/master/website/static/img/crawlee-light.svg?sanitize=true" width="500"> </picture> </a> <br> <small>A web scraping and browser automation library</small> </h1> <p align=center> <a href="https://trendshift.io/repositories/11169" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11169" alt="apify%2Fcrawlee-python | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a> </p> <p align="center"> <a href="https://badge.fury.io/py/crawlee" rel="nofollow"><img src="https://badge.fury.io/py/crawlee.svg" alt="PyPI package version"></a> <a href="https://pypi.org/project/crawlee/" rel="nofollow"><img src="https://img.shields.io/pypi/dm/crawlee" alt="PyPI package downloads"></a> <a href="https://codecov.io/gh/apify/crawlee-python"><img src="https://codecov.io/gh/apify/crawlee-python/graph/badge.svg?token=cCju61iPQG" alt="Codecov report"></a> <a href="https://pypi.org/project/crawlee/" rel="nofollow"><img src="https://img.shields.io/pypi/pyversions/crawlee" alt="PyPI Python version"></a> <a href="https://discord.gg/jyEM2PRvMU" rel="nofollow"><img src="https://img.shields.io/discord/801163717915574323?label=discord" alt="Chat on Discord"></a> </p> Crawlee covers your crawling and scraping end-to-end and **helps you build reliable scrapers. Fast.** Your crawlers will appear almost human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. 
And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it.

> 👉 **View full documentation, guides and examples on the [Crawlee project website](https://crawlee.dev/python/)** 👈

We also have a TypeScript implementation of Crawlee, which you can explore and use in your projects. For more information, visit [Crawlee for JS/TS on GitHub](https://github.com/apify/crawlee).

## Installation

We recommend visiting the [Introduction tutorial](https://crawlee.dev/python/docs/introduction) in the Crawlee documentation for more information.

Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) package on PyPI. This package includes the core functionality, while additional features are available as optional extras to keep dependencies and package size minimal.

To install Crawlee with all features, run the following command:

```sh
python -m pip install 'crawlee[all]'
```

Then, install the [Playwright](https://playwright.dev/) dependencies:

```sh
playwright install
```

Verify that Crawlee is successfully installed:

```sh
python -c 'import crawlee; print(crawlee.__version__)'
```

For detailed installation instructions, see the [Setting up](https://crawlee.dev/python/docs/introduction/setting-up) documentation page.

### With Crawlee CLI

The quickest way to get started with Crawlee is by using the Crawlee CLI and selecting one of the prepared templates.

First, ensure you have [uv](https://pypi.org/project/uv/) installed:

```sh
uv --help
```

If [uv](https://pypi.org/project/uv/) is not installed, follow the official [installation guide](https://docs.astral.sh/uv/getting-started/installation/).
Then, run the CLI and choose from the available templates: ```sh uvx 'crawlee[cli]' create my-crawler ``` If you already have `crawlee` installed, you can spin it up by running: ```sh crawlee create my-crawler ``` ## Examples Here are some practical examples to help you get started with different types of crawlers in Crawlee. Each example demonstrates how to set up and run a crawler for specific use cases, whether you need to handle simple HTML pages or interact with JavaScript-heavy sites. A crawler run will create a `storage/` directory in your current working directory. ### BeautifulSoupCrawler The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) downloads web pages using an HTTP library and provides HTML-parsed content to the user. By default it uses [`HttpxHttpClient`](https://crawlee.dev/python/api/class/HttpxHttpClient) for HTTP communication and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) for parsing HTML. It is ideal for projects that require efficient extraction of data from HTML content. This crawler has very good performance since it does not use a browser. However, if you need to execute client-side JavaScript, to get your content, this is not going to be enough and you will need to use [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler). Also if you want to use this crawler, make sure you install `crawlee` with `beautifulsoup` extra. ```python import asyncio from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext async def main() -> None: crawler = BeautifulSoupCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=10, ) # Define the default request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: BeautifulSoupCrawlingContext) -> None: context.log.info(f'Processing {context.request.url} ...') # Extract data from the page. 
data = { 'url': context.request.url, 'title': context.soup.title.string if context.soup.title else None, } # Push the extracted data to the default dataset. await context.push_data(data) # Enqueue all links found on the page. await context.enqueue_links() # Run the crawler with the initial list of URLs. await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ``` ### PlaywrightCrawler The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) uses a headless browser to download web pages and provides an API for data extraction. It is built on [Playwright](https://playwright.dev/), an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript for content generation, or tasks requiring interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). Also if you want to use this crawler, make sure you install `crawlee` with `playwright` extra. ```python import asyncio from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext async def main() -> None: crawler = PlaywrightCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=10, ) # Define the default request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: PlaywrightCrawlingContext) -> None: context.log.info(f'Processing {context.request.url} ...') # Extract data from the page. data = { 'url': context.request.url, 'title': await context.page.title(), } # Push the extracted data to the default dataset. await context.push_data(data) # Enqueue all links found on the page. await context.enqueue_links() # Run the crawler with the initial list of requests. 
await crawler.run(['https://crawlee.dev']) if __name__ == '__main__': asyncio.run(main()) ```

### More examples

Explore our [Examples](https://crawlee.dev/python/docs/examples) page in the Crawlee documentation for a wide range of additional use cases and demonstrations.

## Features

Why is Crawlee the preferred choice for web scraping and crawling?

### Why use Crawlee instead of just a random HTTP library with an HTML parser?

- Unified interface for **HTTP & headless browser** crawling.
- Automatic **parallel crawling** based on available system resources.
- Written in Python with **type hints** - enhances DX (IDE autocompletion) and reduces bugs (static type checking).
- Automatic **retries** on errors or when you’re getting blocked.
- Integrated **proxy rotation** and session management.
- Configurable **request routing** - direct URLs to the appropriate handlers.
- Persistent **queue for URLs** to crawl.
- Pluggable **storage** of both tabular data and files.
- Robust **error handling**.

### Why use Crawlee rather than Scrapy?

- **Asyncio-based** – Leveraging the standard [Asyncio](https://docs.python.org/3/library/asyncio.html) library, Crawlee delivers better performance and seamless compatibility with other modern asynchronous libraries.
- **Type hints** – Newer project built with modern Python, and complete type hint coverage for a better developer experience.
- **Simple integration** – Crawlee crawlers are regular Python scripts, requiring no additional launcher executor. This flexibility allows you to integrate a crawler directly into other applications.
- **State persistence** – Supports state persistence during interruptions, saving time and costs by avoiding the need to restart scraping pipelines from scratch after an issue.
- **Organized data storages** – Allows saving of multiple types of results in a single scraping run.
Offers several storing options (see [datasets](https://crawlee.dev/python/api/class/Dataset) & [key-value stores](https://crawlee.dev/python/api/class/KeyValueStore)). ## Running on the Apify platform Crawlee is open-source and runs anywhere, but since it's developed by [Apify](https://apify.com), it's easy to set up on the Apify platform and run in the cloud. Visit the [Apify SDK website](https://docs.apify.com/sdk/python/) to learn more about deploying Crawlee to the Apify platform. ## Support If you find any bug or issue with Crawlee, please [submit an issue on GitHub](https://github.com/apify/crawlee-python/issues). For questions, you can ask on [Stack Overflow](https://stackoverflow.com/questions/tagged/apify), in GitHub Discussions or you can join our [Discord server](https://discord.com/invite/jyEM2PRvMU). ## Contributing Your code contributions are welcome, and you'll be praised for eternity! If you have any ideas for improvements, either submit an issue or create a pull request. For contribution guidelines and the code of conduct, see [CONTRIBUTING.md](https://github.com/apify/crawlee-python/blob/master/CONTRIBUTING.md). ## License This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/apify/crawlee-python/blob/master/LICENSE) file for details.

Browser Automation
9.1K Github Stars
apify-sdk-python
Open Source

apify-sdk-python

<h1 align=center>Apify SDK for Python</h1>

<p align="center">
    <a href="https://badge.fury.io/py/apify" rel="nofollow"><img src="https://badge.fury.io/py/apify.svg" alt="PyPI package version"></a>
    <a href="https://pypi.org/project/apify/" rel="nofollow"><img src="https://img.shields.io/pypi/dm/apify" alt="PyPI package downloads"></a>
    <a href="https://codecov.io/gh/apify/apify-sdk-python"><img src="https://codecov.io/gh/apify/apify-sdk-python/graph/badge.svg?token=Y6JBIZQFT6" alt="Codecov report"></a>
    <a href="https://pypi.org/project/apify/" rel="nofollow"><img src="https://img.shields.io/pypi/pyversions/apify" alt="PyPI Python version"></a>
    <a href="https://discord.gg/jyEM2PRvMU" rel="nofollow"><img src="https://img.shields.io/discord/801163717915574323?label=discord" alt="Chat on Discord"></a>
</p>

The Apify SDK for Python is the official library to create [Apify Actors](https://docs.apify.com/platform/actors) in Python. It provides useful features like Actor lifecycle management, local storage emulation, and Actor event handling.

If you just need to access the [Apify API](https://docs.apify.com/api/v2) from your Python applications, check out the [Apify Client for Python](https://docs.apify.com/api/client/python) instead.

## Installation

The Apify SDK for Python is available on PyPI as the `apify` package. For a default installation using pip, run the following:

```bash
pip install apify
```

For users interested in integrating Apify with Scrapy, we provide a package extra called `scrapy`. To install Apify with the `scrapy` extra, use the following command:

```bash
pip install apify[scrapy]
```

## Documentation

For usage instructions, check the documentation on [Apify Docs](https://docs.apify.com/sdk/python/).

## Examples

Below are a few examples demonstrating how to use the Apify SDK with some web scraping-related libraries.
### Apify SDK with HTTPX and BeautifulSoup This example illustrates how to integrate the Apify SDK with [HTTPX](https://www.python-httpx.org/) and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) to scrape data from web pages. ```python from bs4 import BeautifulSoup from httpx import AsyncClient from apify import Actor async def main() -> None: async with Actor: # Retrieve the Actor input, and use default values if not provided. actor_input = await Actor.get_input() or {} start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}]) # Open the default request queue for handling URLs to be processed. request_queue = await Actor.open_request_queue() # Enqueue the start URLs. for start_url in start_urls: url = start_url.get('url') await request_queue.add_request(url) # Process the URLs from the request queue. while request := await request_queue.fetch_next_request(): Actor.log.info(f'Scraping {request.url} ...') # Fetch the HTTP response from the specified URL using HTTPX. async with AsyncClient() as client: response = await client.get(request.url) # Parse the HTML content using Beautiful Soup. soup = BeautifulSoup(response.content, 'html.parser') # Extract the desired data. data = { 'url': request.url, 'title': soup.title.string, 'h1s': [h1.text for h1 in soup.find_all('h1')], 'h2s': [h2.text for h2 in soup.find_all('h2')], 'h3s': [h3.text for h3 in soup.find_all('h3')], } # Store the extracted data to the default dataset. await Actor.push_data(data) ``` ### Apify SDK with PlaywrightCrawler from Crawlee This example demonstrates how to use the Apify SDK alongside `PlaywrightCrawler` from [Crawlee](https://crawlee.dev/python) to perform web scraping. ```python from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext from apify import Actor async def main() -> None: async with Actor: # Retrieve the Actor input, and use default values if not provided. 
actor_input = await Actor.get_input() or {} start_urls = [url.get('url') for url in actor_input.get('start_urls', [{'url': 'https://apify.com'}])] # Exit if no start URLs are provided. if not start_urls: Actor.log.info('No start URLs specified in Actor input, exiting...') await Actor.exit() # Create a crawler. crawler = PlaywrightCrawler( # Limit the crawl to max requests. Remove or increase it for crawling all links. max_requests_per_crawl=50, headless=True, ) # Define a request handler, which will be called for every request. @crawler.router.default_handler async def request_handler(context: PlaywrightCrawlingContext) -> None: url = context.request.url Actor.log.info(f'Scraping {url}...') # Extract the desired data. data = { 'url': context.request.url, 'title': await context.page.title(), 'h1s': [await h1.text_content() for h1 in await context.page.locator('h1').all()], 'h2s': [await h2.text_content() for h2 in await context.page.locator('h2').all()], 'h3s': [await h3.text_content() for h3 in await context.page.locator('h3').all()], } # Store the extracted data to the default dataset. await context.push_data(data) # Enqueue additional links found on the current page. await context.enqueue_links() # Run the crawler with the starting URLs. await crawler.run(start_urls) ``` ## What are Actors? Actors are serverless cloud programs that can do almost anything a human can do in a web browser. They can do anything from small tasks such as filling in forms or unsubscribing from online services, all the way up to scraping and processing vast numbers of web pages. They can be run either locally, or on the [Apify platform](https://docs.apify.com/platform/), where you can run them at scale, monitor them, schedule them, or publish and monetize them. If you're new to Apify, learn [what is Apify](https://docs.apify.com/platform/about) in the Apify platform documentation. 
## Creating Actors To create and run Actors through Apify Console, see the [Console documentation](https://docs.apify.com/academy/getting-started/creating-actors#choose-your-template). To create and run Python Actors locally, check the documentation for [how to create and run Python Actors locally](https://docs.apify.com/sdk/python/docs/quick-start). ## Guides To see how you can use the Apify SDK with other popular libraries used for web scraping, check out our guides for using [BeautifulSoup with HTTPX](https://docs.apify.com/sdk/python/docs/guides/beautifulsoup-httpx), [Parsel with Impit](https://docs.apify.com/sdk/python/docs/guides/parsel-impit), [Playwright](https://docs.apify.com/sdk/python/docs/guides/playwright), [Selenium](https://docs.apify.com/sdk/python/docs/guides/selenium), [Crawlee](https://docs.apify.com/sdk/python/docs/guides/crawlee), or [Scrapy](https://docs.apify.com/sdk/python/docs/guides/scrapy). ## Usage concepts To learn more about the features of the Apify SDK and how to use them, check out the Usage Concepts section in the sidebar, particularly the guides for the [Actor lifecycle](https://docs.apify.com/sdk/python/docs/concepts/actor-lifecycle), [working with storages](https://docs.apify.com/sdk/python/docs/concepts/storages), [handling Actor events](https://docs.apify.com/sdk/python/docs/concepts/actor-events) or [how to use proxies](https://docs.apify.com/sdk/python/docs/concepts/proxy-management).

Developer Tools Browser Automation
170 Github Stars
actor-page-analyzer
Open Source

actor-page-analyzer

# Page analyzer

This Apify actor analyzes a web page on a specific URL. It extracts HTML and JavaScript variables from the main response and HTML/JSON data from XHR requests. Then it analyzes the loaded data:

1) It performs analysis of the initial HTML (HTML loaded directly from the response):
    - Looks for Schema.org data and, if it finds anything, saves it to the output as the ``schemaOrgData`` variable.
    - Looks for JSON-LD link tags and parses the found JSON; if it finds anything, it outputs it as the ``jsonLDData`` variable.
    - Looks for ``meta`` and ``title`` tags and outputs the found content as the ``metaData`` variable.
2) Loads all XHR requests -> discards requests that do not contain HTML or JSON -> parses HTML and JSON into objects.
3) When all XHR requests are finished, it loads HTML from the rendered page (it might have changed thanks to JS manipulation) and does the work from step 1 again, because JavaScript might have changed the HTML of the website.
4) Loads all window variables and discards common global variables (console, innerHeight, navigator, ...), cleans the output (removes all functions and circular paths) and outputs it as the ``allWindowProperties`` variable.

When the analysis is finished, it checks the INPUT parameters for strings to search for. If there are any, it attempts to find them in all found content.

The actor ends when all output is parsed and searched. If the connection to the URL fails or if any part of the actor crashes, the actor ends with an error in the output and log.

Input to the actor is provided from the INPUT file. If the actor is run through Apify, then INPUT comes from the key-value store. If you want to start the actor locally, then call
```
npm run start-local
```
and provide input as a file in the ``kv-store-dev`` directory.
**INPUT**

```javascript
{
    // URL of the website that is supposed to be analyzed
    "url": "http://example.com",
    // array of strings to look for on the website; if empty, search is skipped during analysis
    "searchFor": ["About us"]
}
```

During the actor run, it saves output into the OUTPUT file, which is saved in the key-value store if the actor is run through Apify, or in the ``kv-store-dev`` folder if the actor is run locally.

**OUTPUT**

```javascript
{
    // Initial response headers
    "initialResponse": {
        "url": "https://www.flywire.com/",
        "headers": {...}
    },
    // True if window variables were parsed after XHR requests finished
    "windowPropertiesParsed": true,
    // True if meta tags were parsed from initial response
    "metaDataParsed": true,
    // True if Schema.org was loaded and parsed from initial response
    "schemaOrgDataParsed": true,
    // True if JSON-LD was loaded and parsed from initial response
    "jsonLDDataParsed": true,
    // True if HTML was loaded and parsed from initial response
    "htmlParsed": true,
    // True if HTML was loaded and parsed after XHR requests finished
    "htmlFullyParsed": true,
    // True if XHR requests were all parsed
    "xhrRequestsParsed": true,
    // Filtered window properties by search strings
    "windowProperties": {},
    // Object containing cleaned up window object properties
    "allWindowProperties": {...},
    // Array of properties which contain searched strings (at least one) with path to variable from root
    "windowPropertiesFound": [],
    // Schema.org data filtered by search strings.
    "schemaOrgData": {},
    // Array of schema org properties which contain searched strings (at least one) with path to variable from root
    "schemaOrgDataFound": [],
    // Complete output of found schema.org data
    "allSchemaOrgData": [],
    // Complete output of all found meta tags
    "metaData": {
        "viewport": "width=device-width, initial-scale=1",
        "og:title": "International Payments Solution",
        ...
    },
    // List of meta tags matching the searched strings
    "metaDataFound": [],
    // JSON-LD Data filtered by search strings.
    "jsonLDData": {},
    // Array of JSON-LD data properties which contain searched strings (at least one) with path to variable from root
    "jsonLDDataFound": [],
    // Complete output of found JSON-LD
    "allJsonLDData": [],
    // Array of selectors to HTML elements that contain the searched values
    "htmlFound": [],
    // Array of parsed XHR requests with content type of JSON or HTML
    "xhrRequests": [
        {
            "url": "https://www.flywire.com/destinations",
            "method": "GET",
            "responseStatus": 200,
            "responseHeaders": {...},
            "responseBody": {
                // Valid provides information whether JSON was parsed successfully
                "valid": true/false,
                // Data contains the parsed JSON
                "data": [...],
            }
        },
        {
            "url": "https://www.flywire.com/asdasd",
            "method": "GET",
            "responseStatus": 200,
            "responseHeaders": {...},
            // For HTML requests responseBody contains HTML as string
            "responseBody": "<html>...."
        },
    ],
    // same list as above, but filtered by search strings
    "xhrRequestsFound": [...],
    // contains error if actor failed outside of page function
    "error": null,
    // contains error if actor failed in page.evaluate
    "pageError": null,
    "outputFinished": true,
    // timestamps for debugging
    "analysisStarted": "2018-02-09T12:34:49.938Z",
    "scrappingStarted": "2018-02-09T12:34:50.050Z",
    "pageNavigated": "2018-02-09T12:34:53.495Z",
    "windowPropertiesSearched": "2018-02-09T12:34:53.810Z",
    "metadataSearched": "2018-02-09T12:34:51.624Z",
    "schemaOrgSearched": "2018-02-09T12:34:51.627Z",
    "jsonLDSearched": "2018-02-09T12:34:51.625Z",
    "htmlSearched": "2018-02-09T12:34:53.746Z",
    "xhrRequestsSearched": "2018-02-09T12:34:53.517Z",
    "analysisEnded": "2018-02-09T12:34:53.810Z",
}
```

Browser Automation
152 Github Stars
actor-scraper
Open Source

actor-scraper

House of Apify Scrapers. Generic scraping actors with a simple UI to handle complex web crawling and scraping use cases.

Mobile Development Browser Automation
138 Github Stars
apify-client-python
Open Source

apify-client-python

<h1 align="center">Apify API client for Python</h1> <p align="center"> <strong>The official Python client for the <a href="https://docs.apify.com/api/v2">Apify REST API</a>.</strong> </p> <p align="center"> <a href="https://pypi.org/project/apify-client/"><img src="https://badge.fury.io/py/apify-client.svg" alt="PyPI version"></a> <a href="https://pypi.org/project/apify-client/"><img src="https://img.shields.io/pypi/dm/apify-client" alt="PyPI downloads"></a> <a href="https://pypi.org/project/apify-client/"><img src="https://img.shields.io/pypi/pyversions/apify-client" alt="Python versions"></a> <a href="https://codecov.io/gh/apify/apify-client-python"><img src="https://codecov.io/gh/apify/apify-client-python/graph/badge.svg?token=TYQQWYYZ7A" alt="Coverage"></a> <a href="https://github.com/apify/apify-client-python/blob/master/LICENSE"><img src="https://img.shields.io/pypi/l/apify-client" alt="License"></a> <a href="https://discord.gg/jyEM2PRvMU"><img src="https://img.shields.io/discord/801163717915574323?label=discord" alt="Chat on Discord"></a> </p> `apify-client` lets you talk to the [Apify platform](https://apify.com) from Python — run [Actors](https://docs.apify.com/platform/actors), manage [storages](https://docs.apify.com/platform/storage) (datasets, key-value stores, request queues), schedule tasks, configure webhooks, and use everything else exposed by the [Apify API](https://docs.apify.com/api/v2). It ships both synchronous and asynchronous clients, fully typed responses, automatic retries with exponential backoff, tiered timeouts, pagination helpers, streaming, and a pluggable HTTP layer. > If you want to **build** Apify Actors in Python rather than consume the API, use the [Apify SDK for Python](https://docs.apify.com/sdk/python) instead — it bundles this client and adds Actor-side primitives. 
## Table of contents - [Installation](#installation) - [Quick start](#quick-start) - [Features](#features) - [Usage examples](#usage-examples) - [Documentation](#documentation) - [Related projects](#related-projects) - [Support and community](#support-and-community) - [Contributing](#contributing) - [License](#license) ## Installation `apify-client` requires **Python 3.11 or higher**. It is published on [PyPI](https://pypi.org/project/apify-client/) and can be installed for example with [pip](https://pip.pypa.io/): ```bash pip install apify-client ``` or with [uv](https://docs.astral.sh/uv/): ```bash uv add apify-client ``` or any other Python package manager that consumes PyPI. ## Quick start You'll need an Apify API token — find yours in the [Integrations section of Apify Console](https://console.apify.com/account/integrations). Pass it to the client and you're ready to go. ### Synchronous client ```python from apify_client import ApifyClient client = ApifyClient('MY-APIFY-TOKEN') # Start an Actor and wait for it to finish. run = client.actor('apify/hello-world').call( run_input={'message': 'Hello, Apify!'}, ) # Iterate items from the run's default dataset. for item in client.dataset(run.default_dataset_id).iterate_items(): print(item) ``` ### Asynchronous client ```python import asyncio from apify_client import ApifyClientAsync async def main() -> None: client = ApifyClientAsync('MY-APIFY-TOKEN') run = await client.actor('apify/hello-world').call( run_input={'message': 'Hello, Apify!'}, ) # Iterate items from the run's default dataset. async for item in client.dataset(run.default_dataset_id).iterate_items(): print(item) asyncio.run(main()) ``` > **Keep your token secret.** It authorizes requests on your behalf and can incur usage costs. Never commit it to source control or expose it to client-side code. 
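One common way to keep the token out of source control is to read it from an environment variable at startup. The sketch below assumes a variable named `APIFY_TOKEN` and a hypothetical helper `load_apify_token`; neither is part of `apify-client`, and the returned string is what you would pass to `ApifyClient(...)` as in the quick start above.

```python
import os


def load_apify_token(env=None, var_name="APIFY_TOKEN"):
    """Hypothetical helper: read the API token from the environment.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    The variable name APIFY_TOKEN is an assumption for this sketch.
    """
    source = os.environ if env is None else env
    token = source.get(var_name, "").strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return token


# Demonstration with an explicit mapping instead of the real environment.
token = load_apify_token({"APIFY_TOKEN": "example-token"})
print(len(token))

# In real code, the token would then be handed to the client:
# client = ApifyClient(load_apify_token())
```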
For a guided walkthrough — authenticating, running an Actor, and reading its results — see the [Quick start guide](https://docs.apify.com/api/client/python/docs/quick-start). ## Features - **Synchronous and asynchronous clients** — pick [`ApifyClient`](https://docs.apify.com/api/client/python/reference/class/ApifyClient) or [`ApifyClientAsync`](https://docs.apify.com/api/client/python/reference/class/ApifyClientAsync) to match your codebase; both expose the same API ([Asyncio support](https://docs.apify.com/api/client/python/docs/concepts/asyncio-support)). - **Fully typed responses** — every method returns a [Pydantic](https://docs.pydantic.dev/) model generated from the Apify OpenAPI spec, with IDE autocomplete and runtime validation ([Typed models](https://docs.apify.com/api/client/python/docs/concepts/typed-models)). - **Automatic retries** — exponential backoff for network errors, HTTP 429, and 5xx responses, configurable per client ([Retries](https://docs.apify.com/api/client/python/docs/concepts/retries)). - **Tiered timeouts** — short / medium / long tiers picked per endpoint, overridable per call ([Timeouts](https://docs.apify.com/api/client/python/docs/concepts/timeouts)). - **Pagination and streaming** — iterate datasets, key-value store keys, or live logs without manual paging or buffering ([Pagination](https://docs.apify.com/api/client/python/docs/concepts/pagination), [Streaming](https://docs.apify.com/api/client/python/docs/concepts/streaming-resources)). - **Convenience methods** — `call()`, `wait_for_finish()`, nested resource access, and other shortcuts that hide platform quirks ([Convenience methods](https://docs.apify.com/api/client/python/docs/concepts/convenience-methods)). - **Pluggable HTTP layer** — swap the default [Impit](https://github.com/apify/impit)-based HTTP client for `httpx`, `requests`, `aiohttp`, or any custom implementation ([Custom HTTP clients](https://docs.apify.com/api/client/python/docs/concepts/custom-http-clients)). 
- **Structured errors** — every API error surfaces as an [`ApifyApiError`](https://docs.apify.com/api/client/python/reference/class/ApifyApiError) with HTTP-specific subclasses for precise handling ([Error handling](https://docs.apify.com/api/client/python/docs/concepts/error-handling)). - **Debug logging** — opt-in structured logging on the `apify_client` logger captures request URLs, status codes, retry attempts, and more ([Logging](https://docs.apify.com/api/client/python/docs/concepts/logging)). ## Usage examples The client mirrors the platform's resource model. Each entry point returns either a **single-resource client** for an individual item or a **collection client** for listing and creating items ([Single and collection clients](https://docs.apify.com/api/client/python/docs/concepts/single-and-collection-clients)). ### List Actors and create one ```python actors = client.actors() print(actors.list(limit=10).items) new_actor = actors.create(name='my-actor') ``` ### Stream live logs while a run is in progress ```python run = client.actor('apify/web-scraper').start(run_input={...}) with client.run(run.id).log().stream() as log_stream: for chunk in log_stream.iter_bytes(): print(chunk.decode(), end='') ``` ### Read and write key-value store records ```python store = client.key_value_store('STORE-ID') store.set_record('greeting', {'message': 'Hello!'}) record = store.get_record('greeting') ``` ### Iterate dataset items with automatic pagination ```python for item in client.dataset('DATASET-ID').iterate_items(fields='title,url'): process(item) ``` ### Tune retries and timeouts ```python from datetime import timedelta from apify_client import ApifyClient client = ApifyClient( token='MY-APIFY-TOKEN', max_retries=8, min_delay_between_retries=timedelta(milliseconds=500), timeout_long=timedelta(minutes=10), ) ``` For end-to-end recipes — passing input, managing tasks for reusable input, retrieving and merging Actor data, integrating with Pandas, plugging in a custom 
HTTP client — see the [Guides](https://docs.apify.com/api/client/python/docs/guides/passing-input-to-actor). ## Documentation The full documentation lives at **[docs.apify.com/api/client/python](https://docs.apify.com/api/client/python)**. | Section | What you'll find | |---|---| | [Introduction](https://docs.apify.com/api/client/python/docs) | Overview, prerequisites, and a tour of the client. | | [Quick start](https://docs.apify.com/api/client/python/docs/quick-start) | Authenticate, run an Actor, and fetch its results step by step. | | [Concepts](https://docs.apify.com/api/client/python/docs/concepts/asyncio-support) | Asyncio, single vs. collection clients, nested clients, error handling, retries, logging, convenience methods, pagination, streaming, custom HTTP clients, timeouts. | | [Guides](https://docs.apify.com/api/client/python/docs/guides/passing-input-to-actor) | Pass input to an Actor, manage tasks for reusable input, retrieve Actor data, integrate with data libraries (e.g. Pandas), use HTTPX as the HTTP client. | | [Upgrading](https://docs.apify.com/api/client/python/docs/upgrading/upgrading-to-v3) | Migrating between major versions. | | [API reference](https://docs.apify.com/api/client/python/reference) | Generated reference for every class, method, and model. | | [Changelog](https://docs.apify.com/api/client/python/docs/changelog) | Release history and breaking changes. | ## Related projects - **[Apify SDK for Python](https://docs.apify.com/sdk/python)** — toolkit for **building** Apify Actors in Python (this client is bundled with it). - **[Crawlee for Python](https://crawlee.dev/python)** — high-level web scraping and browser automation framework that powers many Actors. - **[Apify API client for JavaScript / TypeScript](https://docs.apify.com/api/client/js)** — equivalent Apify API client for Node.js. - **[Apify SDK for JavaScript / TypeScript](https://docs.apify.com/sdk/js)** — equivalent Apify SDK for Node.js. 
- **[Crawlee for JavaScript / TypeScript](https://crawlee.dev)** — original Node.js implementation of the Crawlee framework. - **[Apify CLI](https://docs.apify.com/cli)** — command-line tool for interacting with the Apify platform: managing Actors, runs, storages, local development, and deployment. ## Support and community - **Discord** — chat with the team and other users on the [Apify Discord server](https://discord.gg/jyEM2PRvMU). - **GitHub issues** — report a bug or request a feature in the repository's [issue tracker](https://github.com/apify/apify-client-python/issues). ## Contributing Bug reports, fixes, and improvements are welcome! See [CONTRIBUTING.md](./CONTRIBUTING.md) for the development setup, coding standards, testing, and the release process. The repo uses [uv](https://docs.astral.sh/uv/) for project management and [Poe the Poet](https://poethepoet.natn.io/) as a task runner; the typical loop is: ```bash uv run poe install-dev # install dev deps and git hooks uv run poe check-code # lint, type-check, unit tests, docstring check ``` ## License Released under the [Apache License 2.0](./LICENSE).

Browser Automation
93 Github Stars
super-scraper
Open Source

super-scraper

# SuperScraper API SuperScraper API is an Actor that provides a REST API for scraping websites. Just pass the URL of a web page and get back the fully rendered HTML content. SuperScraper API is compatible with [ScrapingBee](https://www.scrapingbee.com/), [ScrapingAnt](https://scrapingant.com/), and [ScraperAPI](https://scraperapi.com/) interfaces. Main features: - Extract HTML from arbitrary URLs with a headless browser for dynamic content rendering. - Circumvent blocking using datacenter or residential proxies, as well as browser fingerprinting. - Seamlessly scale to a large number of web pages as needed. - Capture screenshots of the web pages. Note that SuperScraper API uses the new experimental Actor Standby mode, so it's not started the traditional way from Apify Console. Instead, it's invoked via the HTTP REST API provided directly by the Actor. See the examples below. ## Usage examples To run these examples, you need an Apify API token, which you can find under [Settings > Integrations](https://console.apify.com/account/integrations) in Apify Console. You can create an Apify account free of charge. ### Node.js ```ts import axios from 'axios'; const resp = await axios.get('https://super-scraper-api.apify.actor/', { params: { url: 'https://apify.com/store', wait_for: '.ActorStoreItem-title', json_response: true, screenshot: true, }, headers: { Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>', }, }); console.log(resp.data); ``` ### curl ```shell curl -X GET \ 'https://super-scraper-api.apify.actor/?url=https://apify.com/store&wait_for=.ActorStoreItem-title&screenshot=true&json_response=true' \ --header 'Authorization: Bearer <YOUR_APIFY_API_TOKEN>' ``` ## Authentication The best way to authenticate is to pass your Apify API token using the `Authorization` HTTP header. Alternatively, you can pass the API token via the `token` query parameter to authenticate the requests, which is more convenient for testing in a web browser. 
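The two authentication options can also be sketched in Python with only the standard library. The helper name `build_request` is hypothetical; the endpoint, the `Authorization: Bearer` header, and the `token` query parameter are taken from the text above. The sketch only builds the request object and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://super-scraper-api.apify.actor/"


def build_request(target_url, token, use_header=True, **params):
    """Hypothetical helper: build a GET request for SuperScraper API.

    With use_header=True the token goes into the Authorization header;
    otherwise it is appended as the `token` query parameter.
    """
    query = {"url": target_url, **params}
    headers = {}
    if use_header:
        headers["Authorization"] = f"Bearer {token}"
    else:
        query["token"] = token
    # urlencode percent-encodes the target URL so it survives as one parameter.
    return Request(f"{BASE_URL}?{urlencode(query)}", headers=headers, method="GET")


header_req = build_request("https://apify.com/store", "<YOUR_APIFY_API_TOKEN>")
query_req = build_request(
    "https://apify.com/store", "<YOUR_APIFY_API_TOKEN>", use_header=False
)
print(header_req.full_url)
print(query_req.full_url)

# To actually send a request, pass it to urllib.request.urlopen(header_req).
```

The header variant keeps the token out of URLs, which tend to end up in logs and browser history.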
### Node.js ```ts const resp = await axios.get('https://super-scraper-api.apify.actor/', { params: { url: 'https://apify.com/store', token: '<YOUR_APIFY_API_TOKEN>' }, }); ``` ### curl ```shell curl -X GET 'https://super-scraper-api.apify.actor/?url=https://apify.com/store&wait_for=.ActorStoreItem-title&json_response=true&token=<YOUR_APIFY_API_TOKEN>' ``` ## Pricing When using SuperScraper API, you're charged based on your actual usage of the Apify platform's computing, storage, and networking resources. Cost depends on the target sites, your settings and API parameters, the load of your requests, and random network and target site conditions. The best way to see your price is to conduct a real-world test. An example cost on a free account (the pricing is cheaper on higher plans) for 30 one-by-one requests plus 50 batched requests test: | parameters | cost estimate | ------------- |-----------------------------------| | no `render_js` + basic proxy | $1/1000 requests | no `render_js` + premium (residential) proxy | $2/1000 requests | `render_js` + basic proxy | $4/1000 requests | `render_js` + premium (residential) proxy | $5/1000 requests ## API parameters ### ScrapingBee API parameters SuperScraper API supports most of the API parameters of [ScrapingBee](https://www.scrapingbee.com/documentation/): | parameter | description | | -------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `url` | URL of the webpage to be scraped. **This parameter is required.** | | `json_response` | Return a verbose JSON response with additional details about the webpage. Can be either `true` or `false`, default is `false`. 
| | `extract_rules` | A stringified JSON containing custom rules for extracting data from the webpage. | | `render_js` | Indicates that the webpage should be scraped using a headless browser, with dynamic content rendered. Can be `true` or `false`, default is `true`. This is equivalent to ScrapingAnt's `browser`. | | `screenshot` | Get a screenshot of the browser's current viewport. If `json_response` is set to `true`, the screenshot will be returned in Base64 encoding. Can be `true` or `false`, default is `false`. | | `screenshot_full_page` | Get a screenshot of the full page. If `json_response` is set to `true`, the screenshot will be returned in Base64 encoding. Can be `true` or `false`, default is `false`. | | `screenshot_selector` | Get a screenshot of the element specified by the selector. If `json_response` is set to `true`, the screenshot will be returned in Base64 encoding. Must be a non-empty string. | | `js_scenario` | JavaScript instructions that will be executed after loading the webpage. | | `wait` | Specify a duration that the browser will wait after loading the page, in milliseconds. | | `wait_for` | Specify a CSS selector of an element for which the browser will wait after loading the page. | | `wait_browser` | Specify a browser event to wait for. Can be either `load`, `domcontentloaded`, or `networkidle`. | | `block_resources` | Specify that you want to block images and CSS. Can be `true` or `false`, default is `true`. | | `window_width` | Specify the width of the browser's viewport, in pixels. | | `window_height` | Specify the height of the browser's viewport, in pixels. | | `cookies` | Custom cookies to use to fetch the web pages. This is useful for fetching webpages behind a login. The cookies must be specified in a string format: `cookie_name_1=cookie_value1;cookie_name_2=cookie_value_2`. | | `own_proxy` | A custom proxy to be used for scraping, in the format `<protocol><username>:<password>@<host>:<port>`. 
| | `premium_proxy` | Use residential proxies to fetch the web content, in order to reduce the probability of being blocked. Can be either `true` or `false`, default is `false`. | | `stealth_proxy` | Works the same as `premium_proxy`. | | `country_code` | Use IP addresses that are geolocated in the specified country by specifying its [2-letter ISO code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements). When using a code other than `US`, `premium_proxy` must be set to `true`. This is equivalent to setting ScrapingAnt's `proxy_country`. | | `custom_google` | Use this option if you want to scrape Google-related websites (such as Google Search or Google Shopping). Can be `true` or `false`, default is `false`. | | `return_page_source` | Return HTML of the webpage from the response before any dynamic JavaScript rendering. Can be `true` or `false`, default is `false`. | | `transparent_status_code` | By default, if the target webpage responds with an HTTP status code other than 200-299 or 404, the API will return HTTP status code 500. Set this parameter to `true` to disable this behavior and return the status code of the actual response. | | `timeout` | Set maximum timeout for the response from this Actor, in milliseconds. The default is 140 000 ms. | | `forward_headers` | If set to `true`, HTTP headers starting with prefix `Spb-` or `Ant-` will be forwarded to the target webpage alongside headers generated by us (the prefix will be trimmed). | | `forward_headers_pure` | If set to `true`, only headers starting with prefix `Spb-` or `Ant-` will be forwarded to the target webpage (prefix will be trimmed), without any other HTTP headers from our side. | | `device` | Can be either `desktop` (default) or `mobile`. | ScrapingBee's API parameters `block_ads` and `session_id` are currently not supported. 
### ScrapingAnt API parameters SuperScraper API supports most of the API parameters of [ScrapingAnt](https://docs.scrapingant.com/request-response-format#available-parameters): | parameter | description | | -------- |----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `url` | URL of the webpage to be scraped. **This parameter is required.** | | `browser` | Indicates that the webpage should be scraped using a headless browser, with dynamic content rendered. Can be `true` or `false`, default is `true`. This is equivalent as ScrapingBee's `render_js`. | (Same as `render_js`.) | | `cookies` | Use custom cookies, must be in a string format: `cookie_name_1=cookie_value1;cookie_name_2=cookie_value_2`. | | `js_snippet` | A Base64-encoded JavaScript code to be executed on the webpage. Will be treated as the [evaluate](#evaluate) instruction. | | `proxy_type` | Specify the type of proxies, which can be either `datacenter` (default) or `residential`. This is equivalent to setting ScrapingBee's `premium_proxy` or `steath_proxy` to `true`. | | `wait_for_selector` | Specify a CSS selector of an element for which the browser will wait after loading the page. This is equivalent to setting ScrapingBee's `wait_for`. | | `block_resource` | Specify one or more resources types you want to block from being downloaded. The parameter can be repeated in the URL (e.g. `block_resource=image&block_resource=media`). Available options are: `document`, `stylesheet`, `image`, `media`, `font`, `script`, `texttrack`, `xhr`, `fetch`, `eventsource`, `websocket`, `manifest`, `other`. | | `return_page_source` | Return HTML of the webpage from the response before any dynamic JavaSript rendering. 
Can be `true` or `false`, default is `false`. | | `proxy_country` | Use IP addresses that are geolocated in the specified country by specifying its [2-letter ISO code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements). When using code other than `US`, `premium_proxy` must be set to `true`. This is equivalent to setting ScrapingBee's `country_code`. | ScrapingAnt's API parameter `x-api-key` is not supported. Note that HTTP headers in a request to this Actor beginning with prefix `Ant-` will be forwarded (without the prefix) to the target webpage alongside headers generated by the Actor. This behavior can be changed using ScrapingBee's `forward_headers` or `forward_headers_pure` parameters. ### ScraperAPI API parameters SuperScraper API supports most of the API parameters of [ScraperAPI](https://docs.scraperapi.com/making-requests/customizing-requests): | parameter | description | | -------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `url` | URL of the webpage to be scraped. **This parameter is required.** | | `render` | Specify, if you want to scrape the webpage with or without using a headless browser, can be `true` or `false`, default `true`. (Same as `render_js`.) | | `wait_for_selector` | Specify a CSS selector of an element for which the browser will wait after loading the page. This is equivalent to setting ScrapingBee's `wait_for`. | | `premium` | Use residential proxies to fetch the web content, in order to reduce the probability of being blocked. Can be either `true` or `false`, default is `false`. This is equivalent to setting ScrapingBee's `premium_proxy`. | | `ultra_premium` | Same as `premium`. 
| | `country_code` | Use IP addresses that are geolocated in the specified country by specifying its [2-letter ISO code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements). When using a code other than `US`, `premium_proxy` must be set to `true`. This is equivalent to setting ScrapingAnt's `proxy_country`. | | `keep_headers` | If `true`, then all headers sent to this Actor will be forwarded to the target website. The `Authorization` header will be removed. | | `device_type` | Can be either `desktop` (default) or `mobile`. This is equivalent to setting ScrapingBee's `device`. | | `binary_target` | Specify whether the target is a file. Can be `true` or `false`, default is `false`. Currently only supported when JS rendering is set to `false` via the `render_js`, `browser`, or `render` parameters. | ScraperAPI's API parameters `session_number` and `autoparse` are currently not supported, and they are ignored. ### Custom extraction rules Using ScrapingBee's `extract_rules` parameter, you can specify a set of rules to extract specific data from the target web pages. You can create an extraction rule in one of two ways: with shortened options, or with full options. #### Shortened options - value for the given key serves as a `selector` - using `@`, we can access an attribute of the selected element ##### Example: ```json { "title": "h1", "link": "a@href" } ``` #### Full options - `selector` is required - `type` can be either `item` (default) or `list` - `output` indicates what the result for these element(s) will look like. 
It can be: - `text` (default option when `output` is omitted) - text of the element - `html` - HTML of the element - attribute name (starts with `@`, for example `@href`) - object with other extract rules for the given item (key + shortened or full options) - `table_json` or `table_array` to scrape a table in a json or array format - `clean` - relevant when having `text` as `output`, specifies whether the text of the element should be trimmed of whitespaces (can be `true` or `false`, default `true`) ##### Example: ```json { "custom key for links": { "selector": "a", "type": "list", "output": { "linkName" : { "selector": "a", "clean": "false" }, "href": { "selector": "a", "output": "@href" } } } } ``` #### Example This example extracts all links from [Apify Blog](https://blog.apify.com/) along with their titles. ```ts const extractRules = { title: 'h1', allLinks: { selector: 'a', type: 'list', output: { title: 'a', link: 'a@href', }, }, }; const resp = await axios.get('https://super-scraper-api.apify.actor/', { params: { url: 'https://blog.apify.com/', extract_rules: JSON.stringify(extractRules), // verbose: true, }, headers: { Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>', }, }); console.log(resp.data); ``` The results look like this: ```json { "title": "Apify Blog", "allLinks": [ { "title": "Data for generative AI & LLM", "link": "https://apify.com/data-for-generative-ai" }, { "title": "Product matching AI", "link": "https://apify.com/product-matching-ai" }, { "title": "Universal web scrapers", "link": "https://apify.com/store/scrapers/universal-web-scrapers" } ] } ``` ### Custom JavaScript code Use ScrapingBee's `js_scenario` parameter to specify instructions in order to be executed one by one after opening the page. Set `json_response` to `true` to get a full report of the executed instructions, the results of `evaluate` instructions will be added to the `evaluate_results` field. 
Example of clicking a button: ```ts const instructions = { instructions: [ { click: '#button' }, ], }; const resp = await axios.get('https://super-scraper-api.apify.actor/', { params: { url: 'https://www.example.com', js_scenario: JSON.stringify(instructions), }, headers: { Authorization: 'Bearer <YOUR_APIFY_API_TOKEN>', }, }); console.log(resp.data); ``` #### Strict mode If one instruction fails, then the subsequent instructions will not be executed. To disable this behavior, you can optionally set `strict` to `false` (by default it's `true`): ```json { "instructions": [ { "click": "#button1" }, { "click": "#button2" } ], "strict": false } ``` #### Supported instructions ##### `wait` - wait for some time specified in ms - example: `{"wait": 10000}` ##### `wait_for` - wait for an element specified by the selector - example `{"wait_for": "#element"}` ##### `click` - click on an element specified by the selector - example `{"click": "#button"}` ##### `wait_for_and_click` - combination of previous two - example `{"wait_for_and_click": "#button"}` ##### `scroll_x` and `scroll_y` - scroll a specified number of pixels horizontally or vertically - example `{"scroll_y": 1000}` or `{"scroll_x": 1000}` ##### `fill` - specify a selector of the input element and the value you want to fill - example `{"fill": ["input_1", "value_1"]}` ##### `evaluate` - evaluate custom javascript on the webpage - text/number/object results will be saved in the `evaluate_results` field - example `{"evaluate":"document.querySelectorAll('a').length"}`
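For callers not using Node.js, the same `extract_rules` and `js_scenario` parameters can be assembled in Python. The sketch below is one possible way to do it with the standard library: the helper name `build_scrape_url` is hypothetical, and turning on `json_response` whenever instructions are supplied is a choice made for this example so that `evaluate_results` is returned. It builds the URL only; authentication and sending are left out.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://super-scraper-api.apify.actor/"


def build_scrape_url(target_url, extract_rules=None, instructions=None, strict=True):
    """Hypothetical helper: encode extraction rules and a JS scenario as query parameters."""
    params = {"url": target_url}
    if extract_rules is not None:
        # extract_rules is passed as stringified JSON.
        params["extract_rules"] = json.dumps(extract_rules)
    if instructions is not None:
        # js_scenario wraps the instruction list together with the strict flag.
        scenario = {"instructions": instructions, "strict": strict}
        params["js_scenario"] = json.dumps(scenario)
        # Ask for the verbose JSON response so instruction results are included.
        params["json_response"] = "true"
    return f"{BASE_URL}?{urlencode(params)}"


url = build_scrape_url(
    "https://www.example.com",
    extract_rules={"title": "h1", "link": "a@href"},
    instructions=[
        {"wait_for_and_click": "#button"},
        {"scroll_y": 1000},
        {"evaluate": "document.querySelectorAll('a').length"},
    ],
    strict=False,
)

# Decode the query string again to check that the JSON survived URL encoding.
decoded = parse_qs(urlsplit(url).query)
print(json.loads(decoded["js_scenario"][0])["strict"])
```

Building the JSON with `json.dumps` and letting `urlencode` handle escaping avoids hand-quoting selectors such as `a@href` or `#button` in the URL.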

Browser Automation API Tools
40 Github Stars