
oxylabs

Professional software vendor delivering innovative solutions on the Softono platform. Specialized in both open-source and proprietary software development.

Total Products
27

Software by oxylabs

browser-agent-py
Open Source


# Browser Agent

[![AI-Scraper Header](https://github.com/oxylabs/browser-agent-py/blob/main/Github-AI-Studio-1200x628px-Browser-Agent.png)](https://aistudio.oxylabs.io/apps/browser_agent?utm_source=877&utm_medium=affiliate&utm_campaign=ai_studio&utm_content=browser-agent-py&groupid=877&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

**[Browser Agent](https://aistudio.oxylabs.io/apps/browser_agent)** is an AI browser automation tool from **[Oxylabs AI Studio](https://aistudio.oxylabs.io/)**. It simulates real user browsing by executing multi-step actions such as clicking links, filling forms, scrolling, and capturing screenshots, and then extracts structured data, all controlled through natural language prompts.

Unlike traditional automation frameworks (e.g., Puppeteer or Selenium), Browser Agent requires no static scraping rules or manual scripting. Users can describe tasks in plain English or provide a sequence of steps, and the AI will carry them out just like a human would.

## Key features

- **Full control through browser AI** – execute clicks, inputs, navigation, and scrolling.
- **Multi-step task execution** – define browsing flows in natural language.
- **Multiple outputs** – get results in JSON, Markdown, HTML, or PNG screenshots.
- **Dynamic content support** – interact with JavaScript-rendered pages.
- **Schema-based extraction** – request structured JSON after the browsing sequence completes.

## How it works

To run tasks with Browser Agent, follow these steps:

1. **Enter the target URL.**
2. **Describe the browsing process as:**
   - **Natural language prompt** (e.g. “Open the pricing page, accept cookies, and extract all product names with prices.”)
   - **Structured step list** – provide an array of AI browser actions (`click`, `type`, `navigate`, `wait`, `extract`).
3. **Select output format:** JSON, Markdown, HTML, or PNG screenshot.
4. **(Optional) If JSON is selected**, define or auto-generate a schema to structure the gathered data.

### Installation

To begin, make sure you have access to an API key (or get a [free trial](https://aistudio.oxylabs.io/register) with 1,000 credits) and Python 3.10 or above installed. You can install the `oxylabs-ai-studio` package using pip:

```bash
pip install oxylabs-ai-studio
```

### Code examples (Python)

The following examples show how to use Browser Agent to perform browsing and data extraction.

```python
from oxylabs_ai_studio.apps.browser_agent import BrowserAgent

browser_agent = BrowserAgent(api_key="<API_KEY>")

# Generate a schema from a natural language description of the fields
schema = browser_agent.generate_schema(
    prompt="game name, platform, review stars and price"
)
print("schema: ", schema)

prompt = "Find if there is game 'super mario odyssey' in the store. If there is, find the price. Use search bar to find the game."
url = "https://sandbox.oxylabs.io/"

# Run the browsing task and request structured JSON
result = browser_agent.run(
    url=url,
    user_prompt=prompt,
    output_format="json",
    schema=schema,
)
print(result.data)
```

The example below captures a PNG screenshot while using Browser Agent.

```python
import base64

from oxylabs_ai_studio.apps.browser_agent import BrowserAgent

browser_agent = BrowserAgent(api_key="<API_KEY>")

result = browser_agent.run(
    url="https://sandbox.oxylabs.io/",
    user_prompt="Go to the website and take a screenshot of the home page",
    output_format="screenshot",
)

# The screenshot arrives base64-encoded; decode it before writing the PNG
with open("screenshot.png", "wb") as f:
    f.write(base64.b64decode(result.data.content["data"]))
```

Learn more about Browser Agent and the Oxylabs AI Studio Python SDK in our [PyPI repository](https://pypi.org/project/oxylabs-ai-studio/). You can also check out our [AI Studio JavaScript SDK](https://github.com/oxylabs/oxylabs-ai-studio-js?tab=readme-ov-file#oxylabs-ai-studio-javascript-sdk) guide for JS users.

### Request parameters

| Parameter       | Description                                                   | Default Value |
|-----------------|---------------------------------------------------------------|---------------|
| `url`*          | Starting URL to browse                                        | –             |
| `user_prompt`*  | Natural language prompt for extraction                        | –             |
| `output_format` | Output format (`json`, `markdown`, `html`, `screenshot`)      | `markdown`    |
| `schema`        | OpenAPI schema for structured extraction (mandatory for JSON) | –             |
| `geo_location`  | Proxy location in ISO2 format                                 | –             |

\* – mandatory parameters

### Output samples

Browser Agent can return parsed results or screenshots that are easy to integrate into your applications. This is a direct output example of our first request code:

```json
Results:
{
  "type": "json",
  "content": {
    "games": [
      {
        "game_name": "Super Mario Odyssey",
        "platform": "Nintendo Switch",
        "review_stars": null,
        "price": 89.99
      }
    ]
  }
}
```

Here is a screenshot output of our second request:

![Browser Agent screenshot output](./screenshot.png)

Browser Agent supports multiple output formats (`output_format="YOUR_FORMAT"`):

- `json` – structured data using schema-based parsing.
- `markdown` – easy-to-read data, perfect for AI and automation workflows.
- `html` – raw HTML data of the webpage.
- `screenshot` – PNG image of the browser content.

## Practical use cases

You can use AI Browser Agent in various ways, including:

1. **E-commerce checkout simulation** – add items to cart, apply a coupon, confirm the checkout flow.
2. **Travel search automation** – enter destinations, apply filters, and extract flight or hotel prices.
3. **Job search scraping** – search for a role, click through postings, extract job details.
4. **Event & ticket discovery** – navigate event sites, retrieve titles, dates, and prices.
5. **And many more…**

## FAQ

### How is Browser Agent different from Puppeteer or Selenium?

Traditional tools rely on writing selectors and scripts for every action. AI browser agents replace that with natural language instructions and add organic browsing, making automation much faster, easier, and less fragile.

### Can Browser Agent log in to websites or fill forms?

Yes, you can instruct the agent to enter text, submit forms, or click buttons. Keep in mind that sites with advanced bot detection may require additional setup.

### Can I use Browser Agent on any website?

Browser Agent works on most public websites, including ones that rely on JavaScript or interactive flows. However, you should always make sure your use case complies with the target website’s Terms of Service and applicable laws.

### Is Browser Agent free to use?

Oxylabs AI Studio Browser Agent is free to try by signing up for a free trial that includes 1,000 credits. After the trial, the [monthly plans](https://aistudio.oxylabs.io/pricing) start at $12/month with 3,000 credits and 1 request/s, with higher plans offering more credits and higher request rates.

## Learn more

For a deeper dive into available parameters, advanced integrations, and additional examples, check out the [AI Studio documentation](https://aistudio.oxylabs.io/apps/browser_agent).

## Contact us

If you have questions or need support, reach out to us at [email protected], through live chat accessible via the [Oxylabs Dashboard](https://dashboard.oxylabs.io/en/), or join our [Discord community](https://discord.gg/Pds3gBmKMH). For enterprise-related inquiries, contact your dedicated account manager.
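As a follow-up to the output samples above, here is a minimal sketch of how a caller might branch on the returned payload. It assumes the payload is a plain dictionary shaped like the JSON sample (`"type"` plus `"content"`), and that a screenshot payload keeps its base64 string under `content["data"]` as in the screenshot example. The helper name `handle_output` and the `"screenshot"` type label are illustrative assumptions, not part of the SDK.

```python
import base64
from typing import Union


def handle_output(payload: dict) -> Union[dict, bytes]:
    """Return parsed content for JSON output, or PNG bytes for a screenshot.

    Assumed payload shapes (modelled on the samples in this README):
      {"type": "json", "content": {...}}
      {"type": "screenshot", "content": {"data": "<base64 PNG>"}}
    """
    content = payload["content"]
    if payload.get("type") == "json":
        return content
    # Anything else is accepted only if it carries a base64 "data" field
    if isinstance(content, dict) and "data" in content:
        return base64.b64decode(content["data"])
    raise ValueError(f"unsupported output type: {payload.get('type')!r}")


sample = {
    "type": "json",
    "content": {"games": [{"game_name": "Super Mario Odyssey", "price": 89.99}]},
}
games = handle_output(sample)["games"]
print(games[0]["game_name"], games[0]["price"])
```

Keeping the branching in one place means the rest of an application only ever sees a dictionary or raw image bytes, whichever output format was requested.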

AI Agents Browser Automation
1.2K Github Stars
ai-scraper-py
Open Source


# AI-Scraper

[![AI-Scraper Header](https://github.com/oxylabs/ai-crawler-py/blob/main/Ai-studio%20.png)](https://aistudio.oxylabs.io/apps/crawl?utm_source=877&utm_medium=affiliate&utm_campaign=ai_studio&groupid=877&utm_content=ai-crawler-py-github&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

The [**AI-Scraper**](https://aistudio.oxylabs.io/apps/scrape) is an experimental scraping tool by [**Oxylabs AI Studio**](https://aistudio.oxylabs.io/) that extracts data from a single webpage using AI. It identifies and parses relevant information based on a natural language prompt, then delivers results in either **structured JSON** (for automation and APIs) or **Markdown format** (best for readable outputs and AI workflows).

This AI scraper removes the need for CSS/XPath selectors or custom parsers, so it integrates easily with automation pipelines. **Automatic schema generation** and flexible output formats give users an easy way to extract clean, structured data without maintaining parsing logic.

## Key features

- **Natural language prompt-based extraction** – Define your needs in plain English, and the scrape agent will retrieve the relevant information.
- **Multiple output formats** – Choose JSON for structured workflows or Markdown for human-readable results and AI workflows.
- **Automatic schema generation** – Generate a schema automatically from a prompt or define it manually for precise JSON parsing.
- **Works on any public webpage** – Extract from e-commerce, news, blogs, or any other accessible source.

## How it works

To scrape a webpage with AI-Scraper, follow these steps:

1. **Provide the webpage URL** you want to scrape.
2. **Describe the data to extract** in natural language (e.g. “Get all product names and prices”).
3. **Select the output format** – structured JSON or Markdown.
4. **(Optional) Define a schema** – Let AI-Scraper generate one automatically, or provide your own OpenAPI schema for the exact structure you desire.

### Installation

To begin, make sure you have access to an AI Studio API key (or [get a free trial](https://aistudio.oxylabs.io/register) with 1,000 credits) and Python 3.10 or above installed. You can install the `oxylabs-ai-studio` package using pip:

```bash
pip install oxylabs-ai-studio
```

### Code examples (Python)

The following example shows how to use `AiScraper` to extract data from a sample page.

```python
from oxylabs_ai_studio.apps.ai_scraper import AiScraper
import json

# Initialize the AI Scraper with your API key
scraper = AiScraper(api_key="YOUR_API_KEY")

# Generate a schema automatically from natural language
schema = scraper.generate_schema(prompt="want to parse developer, platform, type, price game title, and genre (array)")
print(f"Generated schema: {schema}")

# Scrape a webpage and extract structured data
url = "https://sandbox.oxylabs.io/products/3"
result = scraper.scrape(
    url=url,
    output_format="json",
    schema=schema,
    render_javascript=False,
    geo_location="US",
)

# Print the scrape output as JSON
print("Results:")
print(json.dumps(result.data, indent=2))
```

Learn more about AI-Scraper and the Oxylabs AI Studio Python SDK in our [PyPI repository](https://pypi.org/project/oxylabs-ai-studio/). You can also check out our [AI Studio JavaScript SDK](https://github.com/oxylabs/oxylabs-ai-studio-js) guide for JS users.

### Request parameters

| Parameter           | Description                                                   | Default Value |
|---------------------|---------------------------------------------------------------|---------------|
| `url`*              | Target URL to scrape                                          | –             |
| `output_format`     | Output format (`json`, `markdown`)                            | `markdown`    |
| `schema`            | OpenAPI schema for structured extraction (mandatory for JSON) | –             |
| `render_javascript` | Enable JavaScript rendering                                   | `False`       |
| `geo_location`      | Proxy location in ISO2 format                                 | –             |

`*` – mandatory parameters

### Output samples

The AI-Scraper can return parsed, ready-to-use output that is easy to integrate into your applications. This is the structured JSON output of the request above:

```json
Results:
{
  "games": [
    {
      "developer": "Nintendo EAD Tokyo",
      "platform": "wii",
      "type": "singleplayer",
      "price": 91.99,
      "title": "Super Mario Galaxy 2",
      "genre": [
        "Action",
        "Platformer"
      ]
    },
    {
      "developer": "Eidos Interactive",
      "platform": "wii",
      "type": null,
      "price": 80.99,
      "title": "Death Jr.: Root of Evil",
      "genre": [
        "Action",
        "Platformer",
        "3D"
      ]
    }...
```

Alternatively, you can use `output_format="markdown"` to receive Markdown results instead of parsed JSON.

## Practical use cases

Oxylabs AI-Scraper can be applied to a wide variety of data collection tasks:

1. **Extract product details** – Gather product names, descriptions, and prices from e-commerce sites.
2. **Parse news articles** – Retrieve article titles, dates, authors, and body text.
3. **Scrape pricing pages** – Collect structured pricing information for competitor or market research.
4. **Extract job postings** – Capture job titles, locations, salaries, and posting dates from recruitment portals.

## FAQ

### How does AI-Scraper differ from normal scrapers?

AI-Scraper doesn’t rely on CSS/XPath selectors or custom parsing logic. Instead, it uses natural language prompts and AI-powered extraction, making it more adaptable to layout changes and much easier to set up.

### Can I scrape any webpage?

Yes, as long as the page is publicly accessible. AI-Scraper also supports JavaScript rendering for dynamic pages. Private or login-protected content isn’t supported out of the box.

### Is schema mandatory for AI-Scraper?

A schema is required only for structured JSON output; Markdown output works without one. You don’t have to write the schema by hand, because AI-Scraper can generate one automatically from your prompt.

### What happens if the page structure changes?

Unlike traditional scrapers, AI-Scraper is more resilient to layout changes because it interprets content with AI. However, major changes may require you to adjust either your prompt or the schema.

### Is AI-Scraper free?

Oxylabs AI Studio AI-Scraper is free to try by signing up for a free trial that includes 1,000 credits. After the trial, the [monthly plans](https://aistudio.oxylabs.io/pricing) start at $12/month with 3,000 credits and 1 request/s, with higher plans offering more credits and higher request rates.

## Learn more

For a deeper dive into available parameters, advanced integrations, and additional examples, check out the [AI Studio documentation](https://aistudio.oxylabs.io/apps/scrape).

## Contact us

If you have questions or need support, reach out to us at [email protected], through live chat accessible via the [Oxylabs Dashboard](https://dashboard.oxylabs.io/en/), or join our [Discord community](https://discord.gg/Pds3gBmKMH). For enterprise-related inquiries, contact your dedicated account manager.
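The README mentions that a schema can be written by hand instead of generated. The sketch below shows what such a schema might look like for the games sample above, using common JSON Schema keywords. The exact dialect the service accepts is not confirmed here, so treat `game_schema` and the `missing_fields` helper as illustrative assumptions rather than SDK features.

```python
from typing import List

# Hand-written schema mirroring the fields in the output sample.
# Assumption: the service accepts JSON Schema style keywords like these.
game_schema = {
    "type": "object",
    "properties": {
        "games": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "developer": {"type": ["string", "null"]},
                    "platform": {"type": ["string", "null"]},
                    "type": {"type": ["string", "null"]},
                    "price": {"type": ["number", "null"]},
                    "title": {"type": "string"},
                    "genre": {"type": "array", "items": {"type": "string"}},
                },
                "required": ["title"],
            },
        }
    },
    "required": ["games"],
}


def missing_fields(result: dict, schema: dict) -> List[str]:
    """List declared item properties that are absent from each scraped game."""
    item_props = schema["properties"]["games"]["items"]["properties"]
    problems = []
    for index, game in enumerate(result.get("games", [])):
        for name in item_props:
            if name not in game:
                problems.append(f"games[{index}].{name}")
    return problems


sample = {
    "games": [
        {
            "developer": "Nintendo EAD Tokyo",
            "platform": "wii",
            "type": "singleplayer",
            "price": 91.99,
            "title": "Super Mario Galaxy 2",
            "genre": ["Action", "Platformer"],
        }
    ]
}
print(missing_fields(sample, game_schema))  # prints [] because every field is present
```

A quick check like this can flag results that silently drop a field after a page layout change, which is the situation the FAQ describes as needing a prompt or schema adjustment.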

AI Agents Browser Automation
685 Github Stars
amazon-scraper
Open Source


A simple web scraper to extract Product Data and Pricing from Amazon

Browser Automation E-commerce Platforms
435 Github Stars
oxylabs-ai-studio-py
Open Source


# Oxylabs AI Studio Python SDK

[![AI-Studio Python (1)](https://github.com/oxylabs/oxylabs-ai-studio-py/blob/main/Ai-Studio2.png)](https://aistudio.oxylabs.io/?utm_source=877&utm_medium=affiliate&utm_campaign=ai_studio&groupid=877&utm_content=ai-studio-js-github&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

A simple Python SDK for interacting with [Oxylabs AI Studio API](https://aistudio.oxylabs.io/) services, including AI-Scraper, AI-Crawler, AI-Browser-Agent and other data extraction tools.

## Requirements

- Python 3.10 or above
- An API key

## Installation

```bash
pip install oxylabs-ai-studio
```

## Usage

### Crawl (`AiCrawler.crawl`)

```python
from oxylabs_ai_studio.apps.ai_crawler import AiCrawler

crawler = AiCrawler(api_key="<API_KEY>")

url = "https://oxylabs.io"
result = crawler.crawl(
    url=url,
    user_prompt="Find all pages with proxy products pricing",
    output_format="markdown",
    render_javascript=False,
    return_sources_limit=3,
    geo_location="United States",
)
print("Results:")
for item in result.data:
    print(item, "\n")
```

**Parameters:**

- `url` (str): Starting URL to crawl (**required**)
- `user_prompt` (str): Natural language prompt to guide extraction (**required**)
- `output_format` (Literal["json", "markdown", "csv", "toon"]): Output format (default: "markdown")
- `schema` (dict | None): JSON schema for structured extraction (required if output_format is "json", "csv" or "toon")
- `render_javascript` (bool): Render JavaScript (default: False)
- `return_sources_limit` (int): Max number of sources to return (default: 25)
- `geo_location` (str): Proxy location in ISO2 format or country canonical name. See [docs](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/localization/proxy-location#list-of-supported-geo_location-values)
- `max_credits` (int | None): Maximum number of credits to use (optional)

### Scrape (`AiScraper.scrape`)

```python
from oxylabs_ai_studio.apps.ai_scraper import AiScraper

scraper = AiScraper(api_key="<API_KEY>")

schema = scraper.generate_schema(prompt="want to parse developer, platform, type, price game title, genre (array) and description")
print(f"Generated schema: {schema}")

url = "https://sandbox.oxylabs.io/products/3"
result = scraper.scrape(
    url=url,
    output_format="json",
    schema=schema,
    render_javascript=False,
)
print(result)
```

**Parameters:**

- `url` (str): Target URL to scrape (**required**)
- `output_format` (Literal["json", "markdown", "csv", "screenshot", "toon"]): Output format (default: "markdown")
- `schema` (dict | None): JSON schema for structured extraction (required if output_format is "json", "csv" or "toon")
- `render_javascript` (bool | string): Render JavaScript. Can be set to "auto", meaning the service will detect if rendering is needed (default: False)
- `geo_location` (str): Proxy location in ISO2 format or country canonical name. See [docs](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/localization/proxy-location#list-of-supported-geo_location-values)
- `user_agent` (str): User-Agent request header. See more at https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/http-context-and-job-management/user-agent-type.

### Browser Agent (`BrowserAgent.run`)

```python
from oxylabs_ai_studio.apps.browser_agent import BrowserAgent

browser_agent = BrowserAgent(api_key="<API_KEY>")

schema = browser_agent.generate_schema(
    prompt="game name, platform, review stars and price"
)
print("schema: ", schema)

prompt = "Find if there is game 'super mario odyssey' in the store. If there is, find the price. Use search bar to find the game."
url = "https://sandbox.oxylabs.io/"
result = browser_agent.run(
    url=url,
    user_prompt=prompt,
    output_format="json",
    schema=schema,
)
print(result.data)
```

**Parameters:**

- `url` (str): Starting URL to browse (**required**)
- `user_prompt` (str): Natural language prompt for extraction (**required**)
- `output_format` (Literal["json", "markdown", "html", "screenshot", "csv", "toon"]): Output format (default: "markdown")
- `schema` (dict | None): JSON schema for structured extraction (required if output_format is "json", "csv" or "toon")
- `geo_location` (str): Proxy location in ISO2 format or country canonical name. For example 'Germany' (capitalized).

### Search (`AiSearch.search`)

```python
from oxylabs_ai_studio.apps.ai_search import AiSearch

search = AiSearch(api_key="<API_KEY>")

query = "lasagna recipe"
result = search.search(
    query=query,
    limit=5,
    render_javascript=False,
    return_content=True,
)
print(result.data)

# Or for fast search
result = search.instant_search(
    query=query,
    limit=10,
)
print(result.data)
```

**Parameters:**

- `query` (str): What to search for (**required**)
- `limit` (int): Maximum number of results to return (default: 10, maximum: 50)
- `render_javascript` (bool): Render JavaScript (default: False)
- `return_content` (bool): Whether to return markdown contents in results (default: True)
- `geo_location` (*string*): ISO 2-letter format, country name, and coordinate formats are supported. See more at [SERP Localization](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/localization/serp-localization).

> **Note:** When `limit <= 10` and `return_content=False`, the search automatically uses the instant endpoint (`/search/instant`), which returns results immediately without polling and so responds faster.

Instant search supported parameters:

- `query` (*string*): The search query.
- `limit` (*integer*): The maximum number of search results to return. Maximum: 10.
- `geo_location` (*string*): Google's canonical name of the location. See more at [Google Ads GeoTargets](https://developers.google.com/google-ads/api/data/geotargets).

### Map (`AiMap.map`)

```python
from oxylabs_ai_studio.apps.ai_map import AiMap

ai_map = AiMap(api_key="<API_KEY>")

payload = {
    "url": "https://career.oxylabs.io",
    "search_keywords": ["career", "jobs", "vacancy"],
    "user_prompt": "job ad pages",
    "max_crawl_depth": 2,
    "limit": 10,
    "geo_location": "Germany",
    "render_javascript": False,
    "include_sitemap": True,
    "max_credits": None,
    "allow_subdomains": False,
    "allow_external_domains": False,
}
result = ai_map.map(**payload)
print(result.data)
```

**Parameters:**

- `url` (str): Starting URL or domain to map (**required**)
- `search_keywords` (list[str]): Keywords for filtering URL paths (default: None)
- `user_prompt` (str | None): Natural language prompt for keyword search. Can be used together with `search_keywords` or standalone (optional)
- `max_crawl_depth` (int): Max crawl depth (1..5, default: 1)
- `limit` (int): Max number of URLs to return (default: 25)
- `geo_location` (str): Proxy location in ISO2 format or country canonical name. See [docs](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/localization/proxy-location#list-of-supported-geo_location-values)
- `render_javascript` (bool): JavaScript rendering (default: False)
- `include_sitemap` (bool): Whether to include the sitemap as a seed (default: True)
- `max_credits` (int | None): Maximum number of credits to use (optional)
- `allow_subdomains` (bool): Include subdomains (default: False)
- `allow_external_domains` (bool): Include external domains (default: False)

---

See the [examples](https://github.com/oxylabs/oxylabs-ai-studio-py/tree/main/examples) folder for usage examples of each method. Each method has a corresponding async version.
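Two rules stated in the parameter lists above lend themselves to a client-side check before a request is sent: structured formats need a schema, and a small search without content takes the instant route. The sketch below encodes both. The helper names `check_schema` and `uses_instant_endpoint` are hypothetical and not part of the SDK; the SDK may well perform its own validation.

```python
from typing import Optional

# Formats that the parameter lists above say require a schema.
SCHEMA_FORMATS = {"json", "csv", "toon"}


def check_schema(output_format: str = "markdown", schema: Optional[dict] = None) -> None:
    """Raise early if a structured format is requested without a schema."""
    if output_format in SCHEMA_FORMATS and schema is None:
        raise ValueError(f"output_format={output_format!r} requires a schema")


def uses_instant_endpoint(limit: int = 10, return_content: bool = True) -> bool:
    """Mirror the Search note: limit <= 10 and return_content=False."""
    return limit <= 10 and not return_content


check_schema("markdown")                  # passes: no schema needed
check_schema("json", {"type": "object"})  # passes: schema supplied
print(uses_instant_endpoint(limit=5, return_content=False))
```

Failing fast on a missing schema avoids spending a request on a call that the documented rules say cannot succeed.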

AI Agents Browser Automation
2.9K Github Stars
how-to-scrape-amazon-product-data
Open Source


[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/amazon-scraper/refs/heads/main/Scrape%20Amazon%20data%20with%20Web%20Scraper%20API.png)](https://oxylabs.io/products/scraper-api/ecommerce/amazon?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=how-to-scrape-amazon-product-data-github&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

# Scraping Amazon Product Data With Python

You can find an extended version of this guide on our [blog](https://oxylabs.io/blog/scrape-amazon-product-data).

This guide uses Python to scrape the following data points from Amazon:

- Product name
- Product rating
- Product price
- Product images
- Product description

## Contents

- [Setting up](#setting-up)
  + [Installing packages](#installing-packages)
- [Scraping product data](#scraping-product-data)
  + [1. Sending a GET request with custom headers](#1.-sending-a-get-request-with-custom-headers)
  + [2. Locating and scraping product name](#2.-locating-and-scraping-product-name)
  + [3. Locating and scraping product rating](#3.-locating-and-scraping-product-rating)
  + [4. Locating and scraping product price](#4.-locating-and-scraping-product-price)
  + [5. Locating and scraping product image](#5.-locating-and-scraping-product-image)
  + [6. Locating and scraping product description](#6.-locating-and-scraping-product-description)
  + [7. Handling product listing](#7.-handling-product-listing)
  + [8. Exporting scraped product data to a CSV file](#8.-exporting-scraped-product-data-to-a-CSV-file)
- [Reviewing the final script](#reviewing-the-final-script)
- [An easier solution to extract Amazon data](#an-easier-solution-to-extract-Amazon-data)
  + [Scraping products from search results](#scraping-products-from-search-results)
  + [Extracting product details](#extracting-product-details)
  + [Scraping products by ASIN](#scraping-products-by-ASIN)

## Setting up

Create a folder to save your code files. Creating a virtual environment is also generally good practice.

The following commands work on macOS and Linux. They create a virtual environment and activate it:

```
python3 -m venv .env
source .env/bin/activate
```

On Windows, the commands vary a little:

```
python -m venv .env
.env\scripts\activate
```

### Installing packages

```
python3 -m pip install requests beautifulsoup4 lxml pandas
```

On Windows, use `python` instead of `python3`:

```
python -m pip install requests beautifulsoup4 lxml pandas
```

To try the Requests library, create a new file named `amazon.py` and enter the following:

```
import requests

url = 'https://www.amazon.com/Bose-QuietComfort-45-Bluetooth-Canceling-Headphones/dp/B098FKXT8L'
response = requests.get(url)
print(response.text)
```

Save the file and run it from the terminal:

```
python3 amazon.py
```

In most cases, you will not get the desired HTML. Amazon will block this request, and you will see the following text in the response:

```
To discuss automated access to Amazon data, please contact [email protected].
```

If you print `response.status_code`, you will see that instead of 200, which means success, you may get 503, which means an error.

Amazon detects that this request did not come from a browser and blocks it, a practice many websites employ. A blocked request typically returns a 5xx status code, or sometimes a 4xx one.

The solution is simple in most cases. You can send HTTP headers along with your request just like an actual browser. Sometimes, sending only the `user-agent` is enough. At other times, you may need to send more headers, such as `accept-language`.

To identify the user-agent sent by your browser, press F12 and open the Network tab. Reload the page. Select the first request and examine Request Headers.

![](https://oxylabs.io/_next/image?url=https%3A%2F%2Foxylabs.io%2Foxylabs-web%2FZpBeQh5LeNNTxEWk_ZmK9sZm069VX1icx_Amazon-1-.jpg%3Fauto%3Dformat%2Ccompress&w=1200&q=75)

You can copy this user-agent and create a dictionary for the headers. The following shows a dictionary with the `user-agent` and `accept-language` headers:

```
custom_headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
    'accept-language': 'en-GB,en;q=0.9',
}
```

You can pass this dictionary to the optional `headers` parameter of the `get` method as follows:

```
response = requests.get(url, headers=custom_headers)
```

Executing the code with these changes may show the expected HTML with the product details.

Sending a fuller set of browser-like headers usually makes JavaScript rendering unnecessary. If you do need rendering, you will have to use tools like Playwright or Selenium.
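The two checks described so far (a non-200 status code and the block notice in the body) can be wrapped in small helpers. This is a minimal sketch: `build_headers` and `looks_blocked` are hypothetical names introduced for illustration, and the block marker is simply the opening of the notice quoted earlier in this guide.

```python
# Hypothetical helpers for illustration; they are not part of Requests.
BLOCK_MARKER = "To discuss automated access to Amazon data"


def build_headers(user_agent: str, accept_language: str = "en-GB,en;q=0.9") -> dict:
    """Return the two headers discussed above as a dictionary."""
    return {"user-agent": user_agent, "accept-language": accept_language}


def looks_blocked(status_code: int, body: str) -> bool:
    """Treat any non-200 status or the block notice as a blocked response."""
    return status_code != 200 or BLOCK_MARKER in body


headers = build_headers("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)")
print(headers["accept-language"])
print(looks_blocked(503, ""))
print(looks_blocked(200, "<html>product page</html>"))
```

With Requests, the check would be called as `looks_blocked(response.status_code, response.text)` right after `requests.get(url, headers=headers)`.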
If the `User-Agent` and `Accept-Language` headers still result in the `503` error, you can try the following headers:

```
custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
    'Accept-Language': 'da, en-gb, en',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Referer': 'https://www.google.com/'
}
```

It’s also a good idea to rotate different `User-Agent` strings and retry your requests to overcome the `503` error.

## Scraping product data

When scraping Amazon products, you typically work with two categories of pages: the category page and the product details page.

For example, open [this](https://www.amazon.com/b?node=12097479011) or search for Over-Ear Headphones on Amazon. The page that shows the search results is the category page.

The category page displays the product title, product image, product rating, product price, and, most importantly, the URL of each product page. If you want more details, such as product descriptions, you will get them only from the product details page.

Let's examine the structure of the product details page.

Open a product URL, such as [this](https://www.amazon.com/Bose-QuietComfort-45-Bluetooth-Canceling-Headphones/dp/B098FKXT8L), in Chrome or any other modern browser, right-click the product title, and select Inspect. You will see that the HTML markup of the product title is highlighted.

![](https://oxylabs.io/_next/image?url=https%3A%2F%2Foxylabs.io%2Foxylabs-web%2FZpBeQx5LeNNTxEWl_ZmK9xZm069VX1ic0_Amazon-2-.jpg%3Fauto%3Dformat%2Ccompress&w=1200&q=75)

You will see that it is a span tag with its id attribute set to `productTitle`.

Similarly, if you right-click the price and select Inspect, you will see the HTML markup of the price.

![](https://oxylabs.io/_next/image?url=https%3A%2F%2Foxylabs.io%2Foxylabs-web%2FZpBeRR5LeNNTxEWm_ZmK905m069VX1ic4_Amazon-3-.jpg%3Fauto%3Dformat%2Ccompress&w=1200&q=75)

You can see that the dollar component of the price is in a span tag with the class `a-price-whole`, and the cents component is in another span tag with the class `a-price-fraction`.

Similarly, you can locate the rating, image, and description.

### 1. Sending a GET request with custom headers

```
from bs4 import BeautifulSoup

response = requests.get(url, headers=custom_headers)
soup = BeautifulSoup(response.text, 'lxml')
```

This guide uses CSS selectors. You can now use the `soup` object to query for specific information.

### 2. Locating and scraping product name

The product name or title is located in a `span` element with the id `productTitle`. It's easy to select elements using a unique ID.

```
title_element = soup.select_one('#productTitle')
```

Pass the CSS selector to the `select_one` method, which returns an element instance.

You can extract the element's text using the `text` attribute.

```
title = title_element.text
```

Upon printing, you will see that the text has leading and trailing whitespace. To remove it, add a `.strip()` call as follows:

```
title = title_element.text.strip()
```

### 3. Locating and scraping product rating

Create a selector for the rating:

```
#acrPopover
```

The following statement selects the element that contains the rating:

```
rating_element = soup.select_one('#acrPopover')
```

Note that the rating value is actually in the title attribute:

```
rating_text = rating_element.attrs.get('title')
print(rating_text)
# prints '4.6 out of 5 stars'
```

Lastly, use the `replace` method to keep only the number, and `strip` to drop the trailing space:

```
rating = rating_text.replace('out of 5 stars', '').strip()
```

### 4. Locating and scraping product price

The product price is located in two places: below the product title and in the Buy Now box.

You can use either of these tags.

Create a CSS selector for the price:

```
span.a-offscreen
```

The CSS selector can be passed to the `select_one` method of BeautifulSoup as follows:

```
price_element = soup.select_one('span.a-offscreen')
```

You can now print the price:

```
print(price_element.text)
```

### 5. Locating and scraping product image

Let's scrape the default image. Its CSS selector is `#landingImage`. Write the following to get the image URL from the `src` attribute:

```
image_element = soup.select_one('#landingImage')
image = image_element.attrs.get('src')
```

### 6. Locating and scraping product description

The methodology remains the same: create a CSS selector and use the `select_one` method.

```
#productDescription
```

You can extract the description text as follows:

```
description_element = soup.select_one('#productDescription').text.strip()
print(description_element)
```

### 7. Handling product listing

To reach the product information, begin with product listing or category pages.

For example, [here](https://www.amazon.com/b?node=12097479011) is the category page for over-ear headphones.

Notice that all the products are contained in a `div` with the special attribute `[data-asin]`. In the `div`, all the product links are in an `h2` tag.

The CSS selector is as follows:

```
[data-asin] h2 a
```

You can read the `href` attribute of each matched link in a loop. However, note that the links will be relative. You need the `urljoin` method to resolve them into absolute URLs.
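To make the role of `urljoin` concrete, here is a short standalone sketch. The listing URL is the category page linked above, and the relative `href` is the path of the sample product page used earlier in this guide; real listing links may carry additional path segments and query strings.

```python
from urllib.parse import urljoin

listing_url = "https://www.amazon.com/b?node=12097479011"
relative_href = "/Bose-QuietComfort-45-Bluetooth-Canceling-Headphones/dp/B098FKXT8L"

# A root-relative href replaces the path and query of the base URL
full_url = urljoin(listing_url, relative_href)
print(full_url)
# prints https://www.amazon.com/Bose-QuietComfort-45-Bluetooth-Canceling-Headphones/dp/B098FKXT8L

# An href that is already absolute is returned unchanged
print(urljoin(listing_url, "https://www.amazon.com/dp/B098FKXT8L"))
```

The listing parser that follows applies the same call to every product link it finds.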
```
from urllib.parse import urljoin

def parse_listing(listing_url):
    global visited_urls
    response = requests.get(listing_url, headers=custom_headers)
    print(response.status_code)
    soup_search = BeautifulSoup(response.text, "lxml")
    link_elements = soup_search.select("[data-asin] h2 a")
    page_data = []

    for link in link_elements:
        # Convert the relative link into an absolute URL
        full_url = urljoin(listing_url, link.attrs.get("href"))
        if full_url not in visited_urls:
            visited_urls.add(full_url)
            print(f"Scraping product from {full_url[:100]}", flush=True)
            product_info = get_product_info(full_url)
            if product_info:
                page_data.append(product_info)
```

#### Handling pagination

The link to the next page carries the class `s-pagination-next`. Select it with a CSS class selector and, if it exists, call `parse_listing` again on the next page. The following code continues inside `parse_listing`:

```
    next_page_el = soup_search.select_one('a.s-pagination-next')
    if next_page_el:
        next_page_url = next_page_el.attrs.get('href')
        next_page_url = urljoin(listing_url, next_page_url)
        print(f'Scraping next page: {next_page_url}', flush=True)
        page_data += parse_listing(next_page_url)

    return page_data
```

### 8. Exporting scraped product data to a CSV file

The scraped data for each product is returned as a dictionary. This is intentional, because a list of dictionaries converts directly into a Pandas `DataFrame`.
You can create a list that contains all the scraped products:

```
def main():
    data = []
    search_url = "https://www.amazon.com/s?k=bose&rh=n%3A12097479011&ref=nb_sb_noss"
    data = parse_listing(search_url)
```

This `data` list can then be used to create a Pandas `DataFrame` object:

```
    df = pd.DataFrame(data)
    df.to_csv("headphones.csv", index=False)
```

## Reviewing the final script

Putting everything together, here is the final script:

```
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import pandas as pd

custom_headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 13_1) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15',
    'Accept-Language': 'da, en-gb, en',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Referer': 'https://www.google.com/'
}

# Product URLs that have already been scraped
visited_urls = set()

def get_product_info(url):
    response = requests.get(url, headers=custom_headers)
    if response.status_code != 200:
        print(f"Error in getting webpage: {url}")
        return None

    soup = BeautifulSoup(response.text, "lxml")

    title_element = soup.select_one("#productTitle")
    title = title_element.text.strip() if title_element else None

    price_element = soup.select_one('span.a-offscreen')
    price = price_element.text if price_element else None

    rating_element = soup.select_one("#acrPopover")
    rating_text = rating_element.attrs.get("title") if rating_element else None
    rating = rating_text.replace("out of 5 stars", "").strip() if rating_text else None

    image_element = soup.select_one("#landingImage")
    image = image_element.attrs.get("src") if image_element else None

    description_element = soup.select_one("#productDescription")
    description = description_element.text.strip() if description_element else None

    return {
        "title": title,
        "price": price,
        "rating": rating,
        "image": image,
        "description": description,
        "url": url
    }

def parse_listing(listing_url):
    global visited_urls
    response = requests.get(listing_url, headers=custom_headers)
    print(response.status_code)
    soup_search = BeautifulSoup(response.text, "lxml")
    link_elements = soup_search.select("[data-asin] h2 a")
    page_data = []

    for link in link_elements:
        full_url = urljoin(listing_url, link.attrs.get("href"))
        if full_url not in visited_urls:
            visited_urls.add(full_url)
            print(f"Scraping product from {full_url[:100]}", flush=True)
            product_info = get_product_info(full_url)
            if product_info:
                page_data.append(product_info)

    # Follow the "next page" link, if there is one
    next_page_el = soup_search.select_one('a.s-pagination-next')
    if next_page_el:
        next_page_url = next_page_el.attrs.get('href')
        next_page_url = urljoin(listing_url, next_page_url)
        print(f'Scraping next page: {next_page_url}', flush=True)
        page_data += parse_listing(next_page_url)

    return page_data

def main():
    data = []
    search_url = "https://www.amazon.com/s?k=bose&rh=n%3A12097479011&ref=nb_sb_noss"
    data = parse_listing(search_url)
    df = pd.DataFrame(data)
    # to_csv has no `orient` parameter; index=False leaves out the row index
    df.to_csv("headphones.csv", index=False)

if __name__ == '__main__':
    main()
```

## An easier solution to extract Amazon data

You can simplify the whole process with Oxylabs [Amazon Scraper](https://oxylabs.io/products/scraper-api/ecommerce/amazon) (a free trial is available).

### Scraping products from search results

Extract product data with the following code:

```
import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'amazon_search',
    'query': 'bose',  # Search for "bose"
    'start_page': 1,
    'pages': 10,
    'parse': True,
    'context': [
        {'key': 'category_id', 'value': 12097479011}  # category id for headphones
    ],
}

# Get response
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('USERNAME', 'PASSWORD'),
    json=payload,
)

# Print prettified response to stdout.
pprint(response.json())
```

Notice that it requests 10 pages, beginning with page 1. The search is also limited to category ID 12097479011, which is Amazon's category ID for headphones.
You’ll get the data in JSON format:

![](https://oxylabs.io/_next/image?url=https%3A%2F%2Foxylabs.io%2Foxylabs-web%2FZpBeRh5LeNNTxEWn_0dcb25ef-f532-49c2-8ef5-5960d9773bd3_amazon_product_search.png%3Fauto%3Dformat%2Ccompress&w=1200&q=75)

### Extracting product details

You only need the product URL, regardless of the country where the Amazon store is located. The only code change is the payload. The following payload extracts details, such as name, price, stock availability, description, and more, for the Bose QC 45:

```
payload = {
    'source': 'amazon',
    'url': 'https://www.amazon.com/dp/B098FKXT8L',
    'parse': True
}
```

The output:

![](https://oxylabs.io/_next/image?url=https%3A%2F%2Foxylabs.io%2Foxylabs-web%2FZpBeRx5LeNNTxEWo_fddcfa94-6d5c-4a61-b9ff-7035108bf36d_amazon_product_details.png%3Fauto%3Dformat%2Ccompress&w=1200&q=75)

### Scraping products by ASIN

Another way to get data is by the ASIN of a product. You need to modify the payload:

```
payload = {
    'source': 'amazon_product',
    'domain': 'co.uk',
    'query': 'B098FKXT8L',
    'parse': True,
    'context': [
        {'key': 'autoselect_variant', 'value': True}
    ]
}
```

Note the optional parameter `domain`. Use this parameter to get Amazon data from any domain, such as amazon.co.uk.

Looking to scrape other Amazon data? [Amazon Review Scraper](https://github.com/oxylabs/amazon-review-scraper), [Amazon ASIN Scraper](https://github.com/oxylabs/amazon-asin-scraper), [Handle Amazon CAPTCHA](https://github.com/oxylabs/how-to-handle-amazon-captcha), [How to Scrape Amazon Prices](https://github.com/oxylabs/how-to-scrape-amazon-prices)

Browser Automation Code Editors & IDEs
2.9K Github Stars
how-to-scrape-google-trends
Open Source

how-to-scrape-google-trends

# How to Scrape Google Trends Data With Python [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/how-to-scrape-google-scholar/refs/heads/main/Google-Scraper-API-1090x275.png)](https://oxylabs.io/products/scraper-api/serp/google?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=how-to-scrape-google-trends-github&transaction_id=102c8d36f7f0d0e5797b8f26152160) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) - [Why scrape Google Trends data?](#why-scrape-google-trends-data) - [1. Install libraries](#1-install-libraries) - [2. Send a request](#2-send-a-request) - [3. Save results to CSV](#3-save-results-to-csv) - [4. Create a result comparison](#4-create-a-result-comparison) This article provides step-by-step instructions on how to get Google Trends data with Python and [SERP Scraper API](https://developers.oxylabs.io/scraper-apis/web-scraper-api) (a part of Web Scraper API), which requires a **paid subscription** or a **free trial**. ## Why scrape Google Trends data? Here are some of the uses for scraped Google Trends data: - **Keyword research:** Google Trends is widely used among SEO specialists and content marketers. Since it provides insights into the past and present popularity of search terms, these professionals can tailor their marketing strategies to gain more website traffic. - **Market research:** Google Trends data can be used for market research, helping businesses understand consumer interests and preferences over time. For example, e-commerce businesses can use Google Trends search insights for product development. - **Societal research:** Google Trends website is a valuable resource for journalists and researchers, offering a glimpse into societal trends and public interest in specific topics. 
These are just a few examples. Google Trends data can also help with investment decisions, brand reputation monitoring, and other cases. ## 1. Install libraries For this guide, you'll need the following: - Credentials for [SERP Scraper API](https://oxylabs.io/products/scraper-api/serp) – you can claim a **7-day free trial** by registering on the [dashboard](https://dashboard.oxylabs.io/en/); - [Python](https://www.python.org/downloads/); - [Requests](https://requests.readthedocs.io/en/latest/) library to make requests; - [Pandas](https://pandas.pydata.org/docs/index.html) library to manipulate received data. Open your terminal and run the following `pip` command: ```bash pip install requests pandas ``` Then, import these libraries in a new Python file: ```python import requests import pandas as pd ``` ## 2. Send a request Let’s begin with building an initial request to the API: ```python import requests from pprint import pprint USERNAME = "YourUsername" PASSWORD = "YourPassword" query = "persian cat" print(f"Getting data from Google Trends for {query} keyword..") url = "https://realtime.oxylabs.io/v1/queries" auth = (USERNAME, PASSWORD) payload = { "source": "google_trends_explore", "query": query, } try: response = requests.request("POST", url, auth=auth, json=payload, timeout=180) except requests.exceptions.RequestException as e: print("Caught exception while getting trend data") raise e data = response.json() content = data["results"][0]["content"] pprint(content) ``` For more information about possible parameters, check our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/google/trends-explore). If everything’s in order, when you run the code, you should see the raw results of the query in the terminal window like this: ![](images/trends_data.png) ## 3. Save results to CSV Now that you have the results, adjust the formatting and save in the CSV format – this way, it’ll be easier to analyze the data. 
All this can be done with the help of the `pandas` Python library.

The response you get from the API provides you with four categories of information: `interest_over_time`, `breakdown_by_region`, `related_topics`, and `related_queries`. Let’s split each category into its own separate CSV file.

Begin by converting each into a `pandas` dataframe:

```python
import json
from typing import List

import pandas as pd


def flatten_topic_data(topics_data: List[dict]) -> List[dict]:
    """Flattens related_topic data"""
    topics_items = []
    for item in topics_data[0]["items"]:
        item_dict = {
            "mid": item["topic"]["mid"],
            "title": item["topic"]["title"],
            "type": item["topic"]["type"],
            "value": item["value"],
            "formatted_value": item["formatted_value"],
            "link": item["link"],
            "keyword": topics_data[0]["keyword"],
        }
        topics_items.append(item_dict)

    return topics_items


# `content` is the JSON string returned by the request in the previous step
trend_data = json.loads(content)

print("Creating dataframes..")

# Interest over time
iot_df = pd.DataFrame(trend_data["interest_over_time"][0]["items"])
iot_df["keyword"] = trend_data["interest_over_time"][0]["keyword"]

# Breakdown by region
bbr_df = pd.DataFrame(trend_data["breakdown_by_region"][0]["items"])
bbr_df["keyword"] = trend_data["breakdown_by_region"][0]["keyword"]

# Related topics
rt_data = flatten_topic_data(trend_data["related_topics"])
rt_df = pd.DataFrame(rt_data)

# Related queries
rq_df = pd.DataFrame(trend_data["related_queries"][0]["items"])
rq_df["keyword"] = trend_data["related_queries"][0]["keyword"]
```

As the data for `related_topics` is multi-leveled, you have to flatten the structure into a single-leveled one. The function `flatten_topic_data` does this.
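To make the flattening step concrete, here is a small sketch that applies the same idea to a made-up `related_topics` sample. The sample values and the `flatten_items` helper are invented for illustration; they only mirror the field names that `flatten_topic_data` reads, and a real API response may contain more fields.

```python
from typing import List

# Invented sample mirroring the nested shape that flatten_topic_data handles
sample_topics: List[dict] = [
    {
        "keyword": "persian cat",
        "items": [
            {
                "topic": {"mid": "/m/example1", "title": "Cat", "type": "Animal"},
                "value": 100,
                "formatted_value": "100",
                "link": "/trends/explore?q=example1",
            },
            {
                "topic": {"mid": "/m/example2", "title": "Kitten", "type": "Topic"},
                "value": 40,
                "formatted_value": "40",
                "link": "/trends/explore?q=example2",
            },
        ],
    }
]


def flatten_items(topics_data: List[dict]) -> List[dict]:
    """Lift the nested "topic" fields up to the top level of each row."""
    keyword = topics_data[0]["keyword"]
    rows = []
    for item in topics_data[0]["items"]:
        row = dict(item["topic"])  # copies mid, title, type
        # Copy every other field of the item as-is
        row.update({k: v for k, v in item.items() if k != "topic"})
        row["keyword"] = keyword
        rows.append(row)
    return rows


flat_rows = flatten_items(sample_topics)
for row in flat_rows:
    print(row)
```

Each row is now a flat dictionary, which is the shape `pd.DataFrame` turns into one column per key.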
The only thing left is to save the data to a file: ```python CSV_FILE_DIR = "./csv/" keyword = trend_data["interest_over_time"][0]["keyword"] keyword_path = os.path.join(CSV_FILE_DIR, keyword) try: os.makedirs(keyword_path, exist_ok=True) except OSError as e: print("Caught exception while creating directories") raise e print("Dumping to csv..") iot_df.to_csv(f"{keyword_path}/interest_over_time.csv", index=False) bbr_df.to_csv(f"{keyword_path}/breakdown_by_region.csv", index=False) rt_df.to_csv(f"{keyword_path}/related_topics.csv", index=False) rq_df.to_csv(f"{keyword_path}/related_queries.csv", index=False) ``` You’ve now created a folder structure to hold all of your separate CSV files grouped by keyword: ![](images/trends_data_csv.png) ## 4. Create a result comparison Let’s begin with multiple keyword handling. To make the code iterable, split it into reusable functions. First, extract the code for the request to the API into a function that takes a query as an argument and returns you the response: ```python def get_trend_data(query: str) -> dict: """Gets a dictionary of trends based on given query string from Google Trends via SERP Scraper API""" print(f"Getting data from Google Trends for {query} keyword..") url = "https://realtime.oxylabs.io/v1/queries" auth = (USERNAME, PASSWORD) payload = { "source": "google_trends_explore", "query": query, } try: response = requests.request("POST", url, auth=auth, json=payload) except requests.exceptions.RequestException as e: print("Caught exception while getting trend data") raise e data = response.json() content = data["results"][0]["content"] return json.loads(content) ``` Next, you need a function that would transform a raw response into `pandas` dataframes, save said dataframes as CSV files, and return them: ```python def dump_trend_data_to_csv(trend_data: dict) -> dict: """Dumps given trend data to generated CSV file""" CSV_FILE_DIR = "./csv/" # Interest over time print("Creating dataframes..") iot_df = 
pd.DataFrame(trend_data["interest_over_time"][0]["items"]) iot_df["keyword"] = trend_data["interest_over_time"][0]["keyword"] # Breakdown by region bbr_df = pd.DataFrame(trend_data["breakdown_by_region"][0]["items"]) bbr_df["keyword"] = trend_data["breakdown_by_region"][0]["keyword"] # Related topics rt_data = flatten_topic_data(trend_data["related_topics"]) rt_df = pd.DataFrame(rt_data) # Related queries rq_df = pd.DataFrame(trend_data["related_queries"][0]["items"]) rq_df["keyword"] = trend_data["related_queries"][0]["keyword"] keyword = trend_data["interest_over_time"][0]["keyword"] keyword_path = os.path.join(CSV_FILE_DIR, keyword) try: os.makedirs(keyword_path, exist_ok=True) except OSError as e: print("Caught exception while creating directories") raise e print("Dumping to csv..") iot_df.to_csv(f"{keyword_path}/interest_over_time.csv", index=False) bbr_df.to_csv(f"{keyword_path}/breakdown_by_region.csv", index=False) rt_df.to_csv(f"{keyword_path}/related_topics.csv", index=False) rq_df.to_csv(f"{keyword_path}/related_queries.csv", index=False) result_set = {} result_set["iot"] = iot_df result_set["bbr"] = bbr_df result_set["rt"] = rt_df result_set["rq"] = rq_df return result_set ``` Now that the request and dataframe creation is covered, you can create comparisons: ```python def create_comparison(trend_dataframes : dict) -> None: comparison = trend_dataframes[0] i = 1 for df in trend_dataframes[1:]: comparison["iot"] = pd.merge(comparison["iot"], df["iot"], on="time", suffixes=("", f"_{i}")) comparison["bbr"] = pd.merge(comparison["bbr"], df["bbr"], on="geo_code", suffixes=("", f"_{i}")) comparison["rt"] = pd.merge(comparison["rt"], df["rt"], on="title", how="inner", suffixes=("", f"_{i}")) comparison["rq"] = pd.merge(comparison["rq"], df["rq"], on="query", how="inner", suffixes=("", f"_{i}")) i = i + 1 comparison["iot"].to_csv("comparison_interest_over_time.csv", index=False) comparison["bbr"].to_csv("comparison_breakdown_by_region.csv", index=False) 
comparison["rt"].to_csv("comparison_related_topics.csv", index=False) comparison["rq"].to_csv("comparison_related_queries.csv", index=False) ``` This function will accept the dataframes for all the queries you have created, go over them, and merge them for comparison on key metrics. The last thing to do is to create the core logic of your application. Adding it all together, the final version of the code should look like this: ```python import json import os from typing import List import pandas as pd import requests def get_trend_data(query: str) -> dict: """Gets a dictionary of trends based on given query string from Google Trends via SERP Scraper API""" USERNAME = "yourUsername" PASSWORD = "yourPassword" print(f"Getting data from Google Trends for {query} keyword..") url = "https://realtime.oxylabs.io/v1/queries" auth = (USERNAME, PASSWORD) payload = { "source": "google_trends_explore", "query": query, } try: response = requests.request("POST", url, auth=auth, json=payload) except requests.exceptions.RequestException as e: print("Caught exception while getting trend data") raise e data = response.json() content = data["results"][0]["content"] return json.loads(content) def flatten_topic_data(topics_data: List[dict]) -> List[dict]: """Flattens related_topic data""" topics_items = [] for item in topics_data[0]["items"]: item_dict = { "mid": item["topic"]["mid"], "title": item["topic"]["title"], "type": item["topic"]["type"], "value": item["value"], "formatted_value": item["formatted_value"], "link": item["link"], "keyword": topics_data[0]["keyword"], } topics_items.append(item_dict) return topics_items def dump_trend_data_to_csv(trend_data: dict) -> dict: """Dumps given trend data to generated CSV file""" CSV_FILE_DIR = "./csv/" # Interest over time print("Creating dataframes..") iot_df = pd.DataFrame(trend_data["interest_over_time"][0]["items"]) iot_df["keyword"] = trend_data["interest_over_time"][0]["keyword"] # Breakdown by region bbr_df = 
pd.DataFrame(trend_data["breakdown_by_region"][0]["items"]) bbr_df["keyword"] = trend_data["breakdown_by_region"][0]["keyword"] # Related topics rt_data = flatten_topic_data(trend_data["related_topics"]) rt_df = pd.DataFrame(rt_data) # Related queries rq_df = pd.DataFrame(trend_data["related_queries"][0]["items"]) rq_df["keyword"] = trend_data["related_queries"][0]["keyword"] keyword = trend_data["interest_over_time"][0]["keyword"] keyword_path = os.path.join(CSV_FILE_DIR, keyword) try: os.makedirs(keyword_path, exist_ok=True) except OSError as e: print("Caught exception while creating directories") raise e print("Dumping to csv..") iot_df.to_csv(f"{keyword_path}/interest_over_time.csv", index=False) bbr_df.to_csv(f"{keyword_path}/breakdown_by_region.csv", index=False) rt_df.to_csv(f"{keyword_path}/related_topics.csv", index=False) rq_df.to_csv(f"{keyword_path}/related_queries.csv", index=False) result_set = {} result_set["iot"] = iot_df result_set["bbr"] = bbr_df result_set["rt"] = rt_df result_set["rq"] = rq_df return result_set def create_comparison(trend_dataframes : dict) -> None: comparison = trend_dataframes[0] i = 1 for df in trend_dataframes[1:]: comparison["iot"] = pd.merge(comparison["iot"], df["iot"], on="time", suffixes=("", f"_{i}")) comparison["bbr"] = pd.merge(comparison["bbr"], df["bbr"], on="geo_code", suffixes=("", f"_{i}")) comparison["rt"] = pd.merge(comparison["rt"], df["rt"], on="title", how="inner", suffixes=("", f"_{i}")) comparison["rq"] = pd.merge(comparison["rq"], df["rq"], on="query", how="inner", suffixes=("", f"_{i}")) i = i + 1 comparison["iot"].to_csv("comparison_interest_over_time.csv", index=False) comparison["bbr"].to_csv("comparison_breakdown_by_region.csv", index=False) comparison["rt"].to_csv("comparison_related_topics.csv", index=False) comparison["rq"].to_csv("comparison_related_queries.csv", index=False) def main(): keywords = ["cat", "cats"] results = [] for keyword in keywords: trend_data = get_trend_data(keyword) df_set 
= dump_trend_data_to_csv(trend_data) results.append(df_set) create_comparison(results) if __name__ == "__main__": main() ```

Running the code will create comparison CSV files that combine the information for the supplied keywords in each of the categories:

- `interest_over_time`
- `breakdown_by_region`
- `related_topics`
- `related_queries`

![](images/trends_data_comparison.png)

Looking to scrape data from other Google sources? [Google Sheets for Basic Web Scraping](https://github.com/oxylabs/web-scraping-google-sheets), [Google Play Scraper](https://github.com/oxylabs/google-play-scraper), [How To Scrape Google Jobs](https://github.com/oxylabs/how-to-scrape-google-jobs), [Google News Scraper](https://github.com/oxylabs/google-news-scraper), [How to Scrape Google Scholar](https://github.com/oxylabs/how-to-scrape-google-scholar), [How to Scrape Google Flights with Python](https://github.com/oxylabs/how-to-scrape-google-flights), [Scrape Google Search Results](https://github.com/oxylabs/scrape-google-python)

Analytics & BI Browser Automation
2.6K Github Stars
how-to-scrape-amazon-prices
Open Source

how-to-scrape-amazon-prices

# How to Scrape Amazon Prices [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/amazon-scraper/refs/heads/main/Scrape%20Amazon%20data%20with%20Web%20Scraper%20API.png)](https://oxylabs.io/products/scraper-api/ecommerce/amazon?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=how-to-scrape-amazon-prices-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) * [Free Amazon Prices Scraper](#free-amazon-prices-scraper) * [Prerequisites](#prerequisites) * [Installation](#installation) * [Retrieving the URL of an Amazon page to scrape prices from](#retrieving-the-url-of-an-amazon-page-to-scrape-prices-from) * [Scraping Amazon prices](#scraping-amazon-prices) * [Retrieved data](#retrieved-data) * [Notes](#notes) * [Scraping with Oxylabs API](#scraping-with-oxylabs-api) * [The complete code](#the-complete-code) Follow this quick and easy tutorial to scrape public Amazon pricing data; in the first part of the tutorial, we'll show how you can do it for free. If you need the pricing data on a bigger scale, please refer to the second part of the tutorial; there, we demonstrate how to do it with [Oxylabs API](https://oxylabs.io/products/scraper-api/web). ### Free Amazon Prices Scraper A free tool used to get Amazon product prices for a provided Amazon department page. ### Prerequisites To run this tool, you need to have Python 3.11 installed in your system. ### Installation Open up a terminal window, navigate to this repository and run this command: ```make install``` ### Retrieving the URL of an Amazon page to scrape prices from First off, open up Amazon and select a department from which you want to scrape prices for products. 
For this example, we'll be using the `Camera & Photo` department. <img width="993" alt="image" src="https://github.com/user-attachments/assets/0ee1f37b-d99a-4cb7-b699-9b6f6c95bfff" /> After the page loads, simply copy the URL in the browser and save it. We'll need it for scraping price data. ### Scraping Amazon prices To get prices from products listed on the department page you chose, simply run this command in your terminal: ```make scrape URL="<amazon_department_page_url>"``` With the URL we retrieved earlier, the command would look like this: ```make scrape URL="https://www.amazon.com/s?i=specialty-aps&bbn=16225009011&rh=n%3A%2116225009011%2Cn%3A502394&ref=nav_em__nav_desktop_sa_intl_camera_and_photo_0_2_6_3"``` Make sure to surround the URL with quotation marks, otherwise the tool might have trouble parsing it. After running the command, your terminal should look something like this: <img width="1163" alt="image" src="https://github.com/user-attachments/assets/eb0e83f9-d995-4c48-8cc8-db7635c55feb" /> If a listed product is out of stock, the tool will notify you with a dedicated message. The product will be skipped if that is the case. ### Retrieved data After the tool has finished running, you should see a file named `amazon_prices.csv` in your directory. The generated CSV file contains data with these columns inside it: - `title` - The title of the product. - `url` - The URL pointing to the product's Amazon page. - `price` - The price of the product. - `currency` - The currency that the product is sold in. The data should look something like this: <img width="717" alt="image" src="https://github.com/user-attachments/assets/4e00fecc-8176-4248-9b63-0bc6dd382905" /> ### Notes In case the code doesn't work or your project is of bigger scale, please refer to the second part of the tutorial. There, we showcase how to scrape public data with Oxylabs Scraper API. 
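Going back to the generated `amazon_prices.csv`: the file can be post-processed with a few lines of Python. The sketch below assumes the four columns listed under "Retrieved data" and that `price` holds a plain number. The sample rows and the `cheapest_products` helper are invented for illustration and are not part of the tool.

```python
import csv
import io
from typing import List

# Invented rows standing in for the contents of amazon_prices.csv
SAMPLE_CSV = """title,url,price,currency
Example Camera A,https://www.amazon.com/dp/EXAMPLE0001,249.99,USD
Example Tripod B,https://www.amazon.com/dp/EXAMPLE0002,35.50,USD
Example Lens C,https://www.amazon.com/dp/EXAMPLE0003,129.00,USD
"""


def cheapest_products(csv_file, limit: int = 2) -> List[dict]:
    """Return the `limit` cheapest rows, skipping rows with a non-numeric price."""
    rows = []
    for row in csv.DictReader(csv_file):
        try:
            row["price"] = float(row["price"])
        except (TypeError, ValueError):
            continue  # price missing or not a number
        rows.append(row)
    return sorted(rows, key=lambda r: r["price"])[:limit]


cheapest = cheapest_products(io.StringIO(SAMPLE_CSV))
for product in cheapest:
    print(f'{product["title"]}: {product["price"]:.2f} {product["currency"]}')
```

To run it against the real file, pass `open("amazon_prices.csv", newline="", encoding="utf-8")` instead of the `StringIO` object.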
### Scraping with Oxylabs API Here's the process of scraping best-selling items, search results, and currently available deals from Amazon using Python and Oxylabs [E-Commerce Scraper API](https://oxylabs.io/products/scraper-api/ecommerce) (a part of Web Scraper API). You can claim a **1-week free trial** by registering on the [dashboard](https://dashboard.oxylabs.io/). For a detailed walkthrough with explanations and visuals, check our [blog post](https://oxylabs.io/blog/scraping-amazon-prices). ## The complete code ```python import requests import pandas as pd USERNAME = "USERNAME" PASSWORD = "PASSWORD" def parse_price_results(results): return [ { "price": result["price"], "title": result["title"], "currency": result["currency"], } for result in results ] def get_best_seller_results(category_id): payload = { "source": "amazon_bestsellers", "domain": "com", "query": category_id, "start_page": 1, "parse": True, } response = requests.post( "https://realtime.oxylabs.io/v1/queries", auth=(USERNAME, PASSWORD), json=payload, ) response.raise_for_status() results = response.json()["results"][0]["content"]["results"] return parse_price_results(results) def get_search_results(query): payload = { "source": "amazon_search", "domain": "com", "query": query, "start_page": 1, "parse": True, } response = requests.post( "https://realtime.oxylabs.io/v1/queries", auth=(USERNAME, PASSWORD), json=payload, ) response.raise_for_status() results = response.json()["results"][0]["content"]["results"]["organic"] return parse_price_results(results) def get_deals_results(url): payload = { "source": "amazon", "url": url, "parse": True, } response = requests.post( "https://realtime.oxylabs.io/v1/queries", auth=(USERNAME, PASSWORD), json=payload, ) response.raise_for_status() results = response.json()["results"][0]["content"]["results"]["organic"] return parse_price_results(results) dog_food_category_id = "2975359011" best_seller_results = get_best_seller_results(dog_food_category_id) 
best_seller_df = pd.DataFrame(best_seller_results) best_seller_df.to_csv("best_seller.csv") search_results = get_search_results("couch") search_df = pd.DataFrame(search_results) search_df.to_csv("search.csv") deal_url = "https://www.amazon.com/s?i=sporting&rh=n%3A3400371%2Cp_n_deal_type%3A23566064011&s=exact-aware-popularity-rank&pf_rd_i=10805321&pf_rd_m=ATVPDKIKX0DER&pf_rd_p=bf702ff1-4bf6-4c17-ab26-f4867bf293a9&pf_rd_r=ER3N9MGTCESZPZ0KRV8R&pf_rd_s=merchandised-search-3&pf_rd_t=101&ref=s9_acss_bw_cg_SODeals_3e1_w" deal_results = get_deals_results(deal_url) deal_df = pd.DataFrame(deal_results) deal_df.to_csv("deals.csv") ```

## Final word

Check our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/amazon) for all of the API parameters found in this guide.

If you have any questions, feel free to contact us at [email protected] or via the live chat on our [homepage](https://oxylabs.io/).

Looking to scrape other Amazon data? [Amazon Review Scraper](https://github.com/oxylabs/amazon-review-scraper), [Amazon ASIN Scraper](https://github.com/oxylabs/amazon-asin-scraper), [Handle Amazon CAPTCHA](https://github.com/oxylabs/how-to-handle-amazon-captcha), [Scraping Amazon Product Data](https://github.com/oxylabs/how-to-scrape-amazon-product-data)

Browser Automation Code Editors & IDEs
1.8K Github Stars
how-to-scrape-google-scholar
Open Source

how-to-scrape-google-scholar

# How to Scrape Google Scholar [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/how-to-scrape-google-scholar/refs/heads/main/Google-Scraper-API-1090x275.png)](https://oxylabs.io/products/scraper-api/serp/google?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=how-to-scrape-google-scholar-github&transaction_id=102c8d36f7f0d0e5797b8f26152160) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) Take a look at the process of getting titles, authors, and citations from [Google Scholar](https://scholar.google.com/) using Oxylabs [SERP Scraper API](https://oxylabs.io/products/scraper-api/serp) (a part of Web Scraper API) and Python. You can get a **1-week free trial** by registering on the [dashboard](https://dashboard.oxylabs.io/). For a detailed walkthrough with explanations and visuals, check our [blog post](https://oxylabs.io/blog/how-to-scrape-google-scholar). 
Also, do not hesitate to check this [Best SERP APIs](https://medium.com/@oxylabs.io/the-10-best-serp-apis-in-2025-22bf7f91f8f0) list ## The complete code ```python import requests from bs4 import BeautifulSoup USERNAME = "USERNAME" PASSWORD = "PASSWORD" def get_html_for_page(url): payload = { "url": url, "source": "google", } response = requests.post( "https://realtime.oxylabs.io/v1/queries", auth=(USERNAME, PASSWORD), json=payload, ) response.raise_for_status() return response.json()["results"][0]["content"] def get_citations(article_id): url = f"https://scholar.google.com/scholar?q=info:{article_id}:scholar.google.com&output=cite" html = get_html_for_page(url) soup = BeautifulSoup(html, "html.parser") data = [] for citation in soup.find_all("tr"): title = citation.find("th", {"class": "gs_cith"}).get_text(strip=True) content = citation.find("div", {"class": "gs_citr"}).get_text(strip=True) entry = { "title": title, "content": content, } data.append(entry) return data def parse_data_from_article(article): title_elem = article.find("h3", {"class": "gs_rt"}) title = title_elem.get_text() title_anchor_elem = article.select("a")[0] url = title_anchor_elem["href"] article_id = title_anchor_elem["id"] authors = article.find("div", {"class": "gs_a"}).get_text() return { "title": title, "authors": authors, "url": url, "citations": get_citations(article_id), } def get_url_for_page(url, page_index): return url + f"&start={page_index}" def get_data_from_page(url): html = get_html_for_page(url) soup = BeautifulSoup(html, "html.parser") articles = soup.find_all("div", {"class": "gs_ri"}) return [parse_data_from_article(article) for article in articles] data = [] url = "https://scholar.google.com/scholar?q=global+warming+&hl=en&as_sdt=0,5" NUM_OF_PAGES = 1 page_index = 0 for _ in range(NUM_OF_PAGES): page_url = get_url_for_page(url, page_index) entries = get_data_from_page(page_url) data.extend(entries) page_index += 10 print(data) ``` ## Final word Check our 
[documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/google) for more API parameters and variables found in this tutorial.

If you have any questions, feel free to contact us at [email protected].

Read more Google scraping related repositories: [Google Sheets for Basic Web Scraping](https://github.com/oxylabs/web-scraping-google-sheets), [Google Play Scraper](https://github.com/oxylabs/google-play-scraper), [How To Scrape Google Jobs](https://github.com/oxylabs/how-to-scrape-google-jobs), [Google News Scraper](https://github.com/oxylabs/google-news-scraper), [How to Scrape Google Flights with Python](https://github.com/oxylabs/how-to-scrape-google-flights), [How To Scrape Google Images](https://github.com/oxylabs/how-to-scrape-google-images), [Scrape Google Search Results](https://github.com/oxylabs/scrape-google-python), [Scrape Google Trends](https://github.com/oxylabs/how-to-scrape-google-trends)

Browser Automation Code Editors & IDEs
1.7K Github Stars
how-to-scrape-google-jobs
Open Source

how-to-scrape-google-jobs

# How To Scrape Google Jobs [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/how-to-scrape-google-scholar/refs/heads/main/Google-Scraper-API-1090x275.png)](https://oxylabs.io/products/scraper-api/serp/google?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=how-to-scrape-google-jobs-github&transaction_id=102c8d36f7f0d0e5797b8f26152160) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) In this two-part tutorial, we're going to show how to scrape Google Jobs data. First, we'll show how to do it for free but if you need data at scale, please refer to the second part of the tutorial. There, we'll demonstrate how to gather large-scale data with Oxylabs API. * [Free Google Jobs Scraper](#free-google-jobs-scraper) + [Prerequisites](#prerequisites) + [Installation](#installation) + [Scraping Google Jobs](#scraping-google-jobs) + [Notes](#notes) * [Scraping Google Jobs with Oxylabs API](#scraping-google-jobs-with-oxylabs-api) * [Google Jobs website overview](#google-jobs-website-overview) * [1. Get a free trial and send a request](#1-get-a-free-trial-and-send-a-request) + [Install Python](#install-python) + [Send a request for testing](#send-a-request-for-testing) * [2. Install and import libraries](#2-install-and-import-libraries) * [3. Add your API user credentials](#3-add-your-api-user-credentials) * [4. Set up queries and locations](#4-set-up-queries-and-locations) + [URL parameters](#url-parameters) + [Locations](#locations) * [5. Prepare the API payload with parsing instructions](#5-prepare-the-api-payload-with-parsing-instructions) + [Job title](#job-title) + [Company name](#company-name) + [Location](#location) + [Date](#date) + [Salary](#salary) + [Job posted via](#job-posted-via) + [URL](#url) * [6. 
Define functions](#6-define-functions) + [Submit job](#submit-job) + [Check job status](#check-job-status) + [Get job results](#get-job-results) + [Save data to a CSV file](#save-data-to-a-csv-file) + [Scrape Google Jobs](#scrape-google-jobs) * [7. Create the main() function](#7-create-the-main-function) * [8. Run the complete code](#8-run-the-complete-code) ## Free Google Jobs Scraper A free tool used to get data about jobs from Google Jobs for a provided search query. ### Prerequisites To run this tool, you need to have Python 3.11 installed in your system. ### Installation Open up a terminal window, navigate to this repository and run this command: ```make install``` ### Scraping Google Jobs To scrape jobs from Google Jobs, simply run this command in your terminal with a search query for a job that you need data for: ```make scrape QUERY="<job_query>"``` For this example, let's try scraping for designer jobs. The command should look something like this: ```make scrape QUERY="designer"``` Make sure to enclose your query in quotation marks, otherwise the tool might have trouble parsing it. After running the command, your terminal should look something like this: <img width="701" alt="image" src="https://github.com/user-attachments/assets/7958807d-c30b-44fc-bb0a-cd3140ceb7d8"> After the tool has finished running, you should see a file named `jobs.csv` in your current directory. The CSV file contains available jobs in your area for the position you entered in the query. The jobs are listed with these attributes: - `title` - The title of position. - `company` - The name of the company. - `location` - The location of the position. - `url` - The URL of the Google Jobs page for that job. 
Here's an example of what the data can look like:

<img width="854" alt="image" src="https://github.com/user-attachments/assets/cfdb3e6e-c57f-46ba-943c-6f58c0e8e9c2">

### Notes

In case the code doesn't work or your project is of a bigger scale, please refer to the second part of the tutorial. There, we showcase how to scrape public data with Oxylabs API.

## Scraping Google Jobs with Oxylabs API

## Google Jobs website overview

Once you visit the Google Jobs page, you'll see that all job listings for a query are displayed on the left side. Looking at the HTML structure, you can see that each listing is enclosed in the ```<li>``` tag and collectively wrapped within the ```<ul>``` tag.

In this guide, let’s scrape Google Jobs results asynchronously and extract the following publicly available data:

1. Job title
2. Company name
3. Job location
4. Job posted via *[platform]*
5. Job listing date
6. Salary

If you want to extract even more public data, such as job highlights, job description, and similar jobs, expand the code shown in this article to make additional API calls to the scraped job URLs.

## 1. Get a free trial and send a request

Visit the Oxylabs dashboard and create an account to claim your 1-week free trial for Google Jobs API, part of the Oxylabs Web Scraper API. It’s equipped with proxy servers, Custom Browser Instructions, Custom Parser, and other advanced features that’ll help you overcome blocks and fingerprinting. See this short guide that shows how to navigate the dashboard and get the free trial.

### Install Python

If you don’t have Python installed yet, you can download it from the official Python website. This tutorial is written with Python 3.12.0, so ensure that you have a compatible version.

### Send a request for testing

After creating an API user, copy and save your API user credentials, which you’ll use for authentication.
Next, open your terminal and install the requests library:

```bash
pip install requests
```

Then run the following code that scrapes Google Jobs results and retrieves the entire HTML file:

```python
import requests

payload = {
    "source": "google",
    "url": "https://www.google.com/search?q=developer&ibp=htl;jobs&hl=en&gl=us",
    "render": "html"
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=("USERNAME", "PASSWORD"),  # Replace with your API user credentials
    json=payload
)

print(response.json())
print(response.status_code)
```

Once it finishes running, you should see a JSON response with HTML results and the status code of your request. If everything works correctly, the status code should be ```200```.

## 2. Install and import libraries

For this project, let’s use the ```asyncio``` and ```aiohttp``` libraries to make asynchronous requests to the API. Additionally, the ```json``` and ```pandas``` libraries will help you deal with JSON and CSV files. Since ```asyncio``` and ```json``` are part of the Python standard library, only ```aiohttp``` and ```pandas``` need to be installed. Open your terminal and run the following command:

```bash
pip install aiohttp pandas
```

Then, import them into your Python file:

```python
import asyncio, aiohttp, json, pandas as pd
from aiohttp import ClientSession, BasicAuth
```

## 3. Add your API user credentials

Create the API user ```credentials``` variable and use ```BasicAuth```, as ```aiohttp``` requires this for authentication:

```python
credentials = BasicAuth("USERNAME", "PASSWORD")  # Replace with your API user credentials
```

## 4. Set up queries and locations

You can easily form Google Jobs URLs for different queries by manipulating the ```q=``` parameter:

```
https://www.google.com/search?q=developer&ibp=htl;jobs&hl=en&gl=us
```

This enables you to scrape job listings for as many search queries as you want. **Note** that the ```q=```, ```ibp=htl;jobs```, ```hl=```, and ```gl=``` parameters are mandatory for the URL to work.
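The URL format above can also be assembled programmatically, which handles queries that contain spaces or special characters. The sketch below is only an illustration: `build_jobs_url` is a hypothetical helper that is not part of the tutorial's code, and it relies solely on Python's standard `urllib.parse` module.

```python
from urllib.parse import urlencode


def build_jobs_url(query: str, country_code: str, language: str = "en") -> str:
    """Return a Google Jobs search URL for the given query and country code."""
    params = {
        "q": query,
        "ibp": "htl;jobs",
        "hl": language,
        "gl": country_code,
    }
    # safe=";" leaves the semicolon in "htl;jobs" unescaped,
    # so the result matches the URL format shown above.
    return "https://www.google.com/search?" + urlencode(params, safe=";")


print(build_jobs_url("developer", "us"))
print(build_jobs_url("software developer", "us"))
```

Spaces in the query are encoded as `+`, so multi-word queries stay valid without manual escaping.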
Additionally, you could set the UULE parameter for geo-location targeting yourself, but that’s unnecessary since the ```geo_location``` parameter of Google Jobs Scraper API does that by default. ### URL parameters Create the URL_parameters list to store your search queries: ```python URL_parameters = ["developer", "chef", "manager"] ``` ### Locations Then, create the ```locations``` dictionary where the key refers to the country, and the value is a list of geo-location parameters. This dictionary will be used to dynamically form the API payload and localize Google Jobs results for the specified location. The two-letter country code will be used to modify the ```gl=``` parameter in the Google Jobs URL: ```python locations = { "US": ["California,United States", "Virginia,United States", "New York,United States"], "GB": ["United Kingdom"], "DE": ["Germany"] } ``` Visit our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/geo-location#google) for more details about geo-locations. ## 5. Prepare the API payload with parsing instructions Google Jobs Scraper API takes web scraping instructions from a ```payload``` dictionary, making it the most important configuration to fine-tune. The ```url``` and ```geo_location``` keys are set to ```None```, as the scraper will pass these values dynamically for each search query and location. The ```"render": "html"``` parameter enables JavaScript rendering and returns the rendered HTML file: ```python payload = { "source": "google", "url": None, "geo_location": None, "user_agent_type": "desktop", "render": "html" } ``` Next, use [Custom Parser](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/custom-parser) to define your own parsing logic with ```xPath``` or ```CSS``` selectors and retrieve only the data you need. Remember that you can create as many functions as you want and extract even more data points than shown in this guide. 
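As noted above, the ```url``` and ```geo_location``` keys start out as ```None``` and are filled in for each query and location. One possible way to do that without mutating a shared dictionary is sketched below. `build_payloads` is a hypothetical helper introduced here for illustration (it is not part of the tutorial's code or of the Oxylabs API), and the shortened `locations` dictionary is sample data.

```python
import copy


def build_payloads(base_payload, queries, locations):
    """Yield one independent payload per (query, location) combination."""
    for country_code, location_list in locations.items():
        for location in location_list:
            for query in queries:
                # Deep-copy the template so every request gets its own dictionary.
                payload = copy.deepcopy(base_payload)
                payload["url"] = (
                    f"https://www.google.com/search?q={query}"
                    f"&ibp=htl;jobs&hl=en&gl={country_code}"
                )
                payload["geo_location"] = location
                yield payload


base_payload = {
    "source": "google",
    "url": None,
    "geo_location": None,
    "user_agent_type": "desktop",
    "render": "html",
}
locations = {"US": ["California,United States"], "DE": ["Germany"]}

payloads = list(build_payloads(base_payload, ["developer", "chef"], locations))
print(len(payloads))
print(payloads[0]["url"], payloads[0]["geo_location"])
```

Copying the template keeps concurrent tasks from overwriting each other's ```url``` and ```geo_location``` values, at the cost of a little extra memory per request.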
Head to this Google Jobs URL in your browser and open Developer Tools by pressing Ctrl+Shift+I (Windows) or Option + Command + I (macOS). Use Ctrl+F or Command+F to open a search bar and test selector expressions. As mentioned previously, the job listings are within the <li> tags, which are wrapped with the <ul> tag. As there is more than one ```<ul>``` list on the Google Jobs page, you can form an xPath selector by specifying the ```div``` element that contains the targeted list. ```//div[@class='nJXhWc']//ul/li``` You can use this selector to specify the location of all job listings in the HTML file. In the ```payload``` dictionary, set the ```parse``` key to ```True``` and create the ```parsing_instructions``` parameter with the jobs function: ```python payload = { "source": "google", "url": None, "geo_location": None, "user_agent_type": "desktop", "render": "html", "parse": True, "parsing_instructions": { "jobs": { "_fns": [ { "_fn": "xpath", "_args": ["//div[@class='nJXhWc']//ul/li"] } ], } } } ``` Next, create the ```_items``` iterator that will loop over the jobs list and extract details for each listing: ```python payload = { "source": "google", "url": None, "geo_location": None, "user_agent_type": "desktop", "render": "html", "parse": True, "parsing_instructions": { "jobs": { "_fns": [ { "_fn": "xpath", # You can use CSS or xPath "_args": ["//div[@class='nJXhWc']//ul/li"] } ], "_items": { "data_point_1": { "_fns": [ { "_fn": "selector_type", # You can use CSS or xPath "_args": ["selector"] } ] }, "data_point_2": { "_fns": [ { "_fn": "selector_type", "_args": ["selector"] } ] }, } } } } ``` For each data point, you can create a separate function within the ```_items``` iterator. 
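Because every entry in ```_items``` follows the same shape, the structure can also be generated from a plain mapping of field names to selectors. The following is a minimal sketch under that assumption: `build_items` is a hypothetical helper, not something provided by Custom Parser, and the two selectors are taken from the per-field selectors this guide lists for Google Jobs.

```python
def build_items(selectors, fn="xpath_one"):
    """Turn {field_name: selector} into the `_items` structure used in the payload."""
    return {
        name: {"_fns": [{"_fn": fn, "_args": [selector]}]}
        for name, selector in selectors.items()
    }


items = build_items({
    "job_title": ".//div[@class='BjJfJf PUpOsf']/text()",
    "company_name": ".//div[@class='vNEEBe']/text()",
})

parsing_instructions = {
    "jobs": {
        "_fns": [{"_fn": "xpath", "_args": ["//div[@class='nJXhWc']//ul/li"]}],
        "_items": items,
    }
}
print(parsing_instructions["jobs"]["_items"]["job_title"])
```

Keeping the selectors in one flat mapping makes them easier to update when Google changes its class names.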
Let’s see how ```xPath``` selectors should look like for each Google Jobs data point: ### Job title ``` .//div[@class='BjJfJf PUpOsf']/text() ``` ### Company name ``` .//div[@class='vNEEBe']/text() ``` ### Location ``` .//div[@class='Qk80Jf'][1]/text() ``` ### Date ``` .//div[@class='PuiEXc']//span[@class='LL4CDc' and contains(@aria-label, 'Posted')]/span/text() ``` ### Salary ``` .//div[@class='PuiEXc']//div[@class='I2Cbhb bSuYSc']//span[@aria-hidden='true']/text() ``` ### Job posted via ``` .//div[@class='Qk80Jf'][2]/text() ``` ### URL ``` .//div[@data-share-url]/@data-share-url ``` **Please be aware** that you can only access this job listing URL in your browser with an IP address from the same country used during web scraping. If you’ve used a United States proxy, make sure to use a US IP address in your browser. In the end, you should have a ```payload``` that looks like shown below. Save it to a separate JSON file and ensure that the ```None``` and ```True``` parameter values are converted to respective JSON values: ```null``` and ```true```: ```python import json payload = { "source": "google", "url": None, "geo_location": None, "user_agent_type": "desktop", "render": "html", "parse": True, "parsing_instructions": { "jobs": { "_fns": [ { "_fn": "xpath", "_args": ["//div[@class='nJXhWc']//ul/li"] } ], "_items": { "job_title": { "_fns": [ { "_fn": "xpath_one", "_args": [".//div[@class='BjJfJf PUpOsf']/text()"] } ] }, "company_name": { "_fns": [ { "_fn": "xpath_one", "_args": [".//div[@class='vNEEBe']/text()"] } ] }, "location": { "_fns": [ { "_fn": "xpath_one", "_args": [".//div[@class='Qk80Jf'][1]/text()"] } ] }, "date": { "_fns": [ { "_fn": "xpath_one", "_args": [".//div[@class='PuiEXc']//span[@class='LL4CDc' and contains(@aria-label, 'Posted')]/span/text()"] } ] }, "salary": { "_fns": [ { "_fn": "xpath_one", "_args": [".//div[@class='PuiEXc']//div[@class='I2Cbhb bSuYSc']//span[@aria-hidden='true']/text()"] } ] }, "posted_via": { "_fns": [ { "_fn": 
"xpath_one", "_args": [".//div[@class='Qk80Jf'][2]/text()"] } ] }, "URL": { "_fns": [ { "_fn": "xpath_one", "_args": [".//div[@data-share-url]/@data-share-url"] } ] } } } } } with open("payload.json", "w") as f: json.dump(payload, f, indent=4) ``` This allows you to import the payload and make the scraper code much shorter: ```python payload = {} with open("payload.json", "r") as f: payload = json.load(f) ``` ## 6. Define functions There are several ways you can [integrate](https://developers.oxylabs.io/scraper-apis/web-scraper-api/integration-methods) Oxylabs API, namely Realtime, [Push-Pull](https://developers.oxylabs.io/scraper-apis/web-scraper-api/integration-methods/push-pull) (supports [batch requests](https://developers.oxylabs.io/scraper-apis/web-scraper-api/integration-methods/push-pull-batch)), and Proxy endpoint. For this guide, let’s use Push-Pull with batch requests, as you won’t have to keep your connection open after submitting a scraping job to the API. The API endpoint to use in this scenario is https://data.oxylabs.io/v1/queries. You could also use another endpoint to submit batches of up to 5000 URLs or queries. Keep in mind that making this choice will require you to modify the code shown in this tutorial. Read up about batch queries in our documentation. ### Submit job Define an ```async``` function called ```submit_job``` and pass the ```session: ClientSession``` together with the ```payload``` to submit a web scraping job to the Oxylabs API using the ```POST``` method. 
This will return the ID number of the submitted job: ```python async def submit_job(session: ClientSession, payload): async with session.post( "https://data.oxylabs.io/v1/queries", auth=credentials, json=payload ) as response: return (await response.json())["id"] ``` ### Check job status Then, create another ```async``` function that passes the ```job_id``` (this will be defined later) and returns the ```status``` of the scraping job from the response: ```python async def check_job_status(session: ClientSession, job_id): async with session.get(f"https://data.oxylabs.io/v1/queries/{job_id}", auth=credentials) as response: return (await response.json())["status"] ``` ### Get job results Next, create an ```async``` function that retrieves the scraped and parsed jobs results. Note that the response is a JSON string that contains the API job details and the scraped content that you can access by parsing nested JSON properties: ```python async def get_job_results(session: ClientSession, job_id): async with session.get(f"https://data.oxylabs.io/v1/queries/{job_id}/results", auth=credentials) as response: return (await response.json())["results"][0]["content"]["jobs"] ``` ### Save data to a CSV file Define another ```async``` function that saves the scraped and parsed data to a CSV file. Later on, we’ll create the four parameters that are passed to the function. 
As the ```pandas``` library is synchronous, you must use ```asyncio.to_thread()``` to run the ```df.to_csv``` asynchronously in a separate thread: ```python async def save_to_csv(job_id, query, location, results): print(f"Saving data for {job_id}") data = [] for job in results: data.append({ "Job title": job["job_title"], "Company name": job["company_name"], "Location": job["location"], "Date": job["date"], "Salary": job["salary"], "Posted via": job["posted_via"], "URL": job["URL"] }) df = pd.DataFrame(data) filename = f"{query}_jobs_{location.replace(',', '_').replace(' ', '_')}.csv" await asyncio.to_thread(df.to_csv, filename, index=False) ``` ### Scrape Google Jobs Make another ```async``` function that passes parameters to form the Google Jobs ```URL``` and the payload dynamically. Create a variable ```job_id``` and then call the ```submit_job``` function to submit the request to the API and create a ```while True``` loop by calling the ```check_job_status``` function to keep checking whether the API has finished web scraping. At the end, initiate the ```get_job_results``` and ```save_to_csv``` functions: ```python async def scrape_jobs(session: ClientSession, query, country_code, location): URL = f"https://www.google.com/search?q={query}&ibp=htl;jobs&hl=en&gl={country_code}" payload["url"] = URL payload["geo_location"] = location job_id = await submit_job(session, payload) await asyncio.sleep(15) print(f"Checking status for {job_id}") while True: status = await check_job_status(session, job_id) if status == "done": print(f"Job {job_id} done. Retrieving {query} jobs in {location}.") break elif status == "failed": print(f"Job {job_id} encountered an issue. Status: {status}") return await asyncio.sleep(5) results = await get_job_results(session, job_id) await save_to_csv(job_id, query, location, results) ``` ## 7. 
Create the main() function You’ve written most of the code, what’s left is to pull everything together by defining an ```async``` function called ```main()``` that creates an ```aiohttp session```. It makes a list of tasks to scrape jobs for each combination of ```location``` and ```query``` and executes each task concurrently using ```asyncio.gather()```: ```python async def main(): async with aiohttp.ClientSession() as session: tasks = [] for country_code, location_list in locations.items(): for location in location_list: for query in URL_parameters: task = asyncio.ensure_future(scrape_jobs(session, query, country_code, location)) tasks.append(task) await asyncio.gather(*tasks) ``` If you run into an SSL certificate verification failure, you can manage SSL by using the ```TCPConnector```: ```python async def main(): async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False)) as session: # Remaining code... ``` Lastly, initialize the event loop and call the ```main()``` function: ```python if __name__ == "__main__": loop = asyncio.new_event_loop() asyncio.set_event_loop(loop) loop.run_until_complete(main()) print("Completed!") ``` ## 8. 
Run the complete code Here’s the full Python code that scrapes Google Jobs listings for each query and location asynchronously: ```python import asyncio, aiohttp, json, pandas as pd from aiohttp import ClientSession, BasicAuth credentials = BasicAuth("USERNAME", "PASSWORD") # Replace with your API user credentials URL_parameters = ["developer", "chef", "manager"] locations = { "US": ["California,United States", "Virginia,United States", "New York,United States"], "GB": ["United Kingdom"], "DE": ["Germany"] } payload = {} with open("payload.json", "r") as f: payload = json.load(f) async def submit_job(session: ClientSession, payload): async with session.post( "https://data.oxylabs.io/v1/queries", auth=credentials, json=payload ) as response: return (await response.json())["id"] async def check_job_status(session: ClientSession, job_id): async with session.get(f"https://data.oxylabs.io/v1/queries/{job_id}", auth=credentials) as response: return (await response.json())["status"] async def get_job_results(session: ClientSession, job_id): async with session.get(f"https://data.oxylabs.io/v1/queries/{job_id}/results", auth=credentials) as response: return (await response.json())["results"][0]["content"]["jobs"] async def save_to_csv(job_id, query, location, results): print(f"Saving data for {job_id}") data = [] for job in results: data.append({ "Job title": job["job_title"], "Company name": job["company_name"], "Location": job["location"], "Date": job["date"], "Salary": job["salary"], "Posted via": job["posted_via"], "URL": job["URL"] }) df = pd.DataFrame(data) filename = f"{query}_jobs_{location.replace(',', '_').replace(' ', '_')}.csv" await asyncio.to_thread(df.to_csv, filename, index=False) async def scrape_jobs(session: ClientSession, query, country_code, location): URL = f"https://www.google.com/search?q={query}&ibp=htl;jobs&hl=en&gl={country_code}" payload["url"] = URL payload["geo_location"] = location job_id = await submit_job(session, payload) await 
asyncio.sleep(15) print(f"Checking status for {job_id}") while True: status = await check_job_status(session, job_id) if status == "done": print(f"Job {job_id} done. Retrieving {query} jobs in {location}.") break elif status == "failed": print(f"Job {job_id} encountered an issue. Status: {status}") return await asyncio.sleep(5) results = await get_job_results(session, job_id) await save_to_csv(job_id, query, location, results) async def main(): async with aiohttp.ClientSession() as session: tasks = [] for country_code, location_list in locations.items(): for location in location_list: for query in URL_parameters: task = asyncio.ensure_future(scrape_jobs(session, query, country_code, location)) tasks.append(task) await asyncio.gather(*tasks) if __name__ == "__main__": loop = asyncio.new_event_loop() asyncio.set_event_loop(loop) loop.run_until_complete(main()) print("Completed!") ``` After the scraper finishes running, you’ll see all the CSV files saved in your local directory. Refer to this [blog post](https://oxylabs.io/blog/how-to-scrape-google-jobs) for a full guide featuring detailed insights and images.

Browser Automation Code Editors & IDEs
1.4K Github Stars
scrape-google-python
Open Source

scrape-google-python

# How to Scrape Google Search Results: Python Tutorial [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/how-to-scrape-google-scholar/refs/heads/main/Google-Scraper-API-1090x275.png)](https://oxylabs.io/products/scraper-api/serp/google?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=scrape-google-python-github&transaction_id=102c8d36f7f0d0e5797b8f26152160) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) In this tutorial, we showcase how to scrape public Google data with Python and Oxylabs [SERP Scraper API](https://oxylabs.io/products/scraper-api/serp) (a part of Web Scraper API), which requires a subscription or **a free trial**. * [What is a Google SERP?](#what-is-a-google-serp) * [Is it legal to scrape Google results?](#is-it-legal-to-scrape-google-results) * [Scraping public Google data with Python and Oxylabs Scraper API](#scraping-public-google-data-with-python-and-oxylabs-scraper-api) * [Set up a payload and send a POST request](#set-up-a-payload-and-send-a-post-request) + [Customizing query parameters](#customizing-query-parameters) + [Basic parameters](#basic-parameters) * [Location query parameters](#location-query-parameters) + [Controlling the number of results](#controlling-the-number-of-results) + [Python code for scraping Google search data](#python-code-for-scraping-google-search-data) + [Export scraped data to a CSV](#export-scraped-data-to-a-csv) + [Handling errors and exceptions](#handling-errors-and-exceptions) ## What is a Google SERP? Upon any discussion of scraping Google search results, you’ll likely run into the “SERP” abbreviation. SERP stands for Search Engine Results Page; it’s the page you get after entering a query into the search bar. SERPs contain various features and elements, such as: 1. 
Featured snippets 2. Paid ads 3. Video carousel 4. People also ask 5. Local pack 6. Related searches

## Is it legal to scrape Google results?

The legality of scraping Google search data is widely discussed in the scraping field. Scraping publicly available data on the internet – including Google SERP data – is generally considered legal. However, this may vary from one situation to another, so it’s best to seek legal advice about your specific case.

## Scraping public Google data with Python and Oxylabs Scraper API

1. Install required Python libraries

To follow this guide on scraping Google search results, you’ll need the following:

- Credentials for Oxylabs' [SERP Scraper API](https://oxylabs.io/products/scraper-api/serp) – you can get a 7-day free trial by registering on the [dashboard](https://dashboard.oxylabs.io/);
- Python;
- Requests library.

First, sign up for Oxylabs' Google Search Results API and save your `username` and `password`. Then, download and install Python 3.8 or above from the [python.org](https://www.python.org/) website. Finally, install the [Requests library](https://pypi.org/project/requests/) by using the following command:

`$ python3 -m pip install requests`

If you’re using Windows, use `python` instead of `python3`.
The rest of the command remains the same:

`d:\amazon>python -m pip install requests`

## Set up a payload and send a POST request

Create a new file and enter the following code:

```
import requests
from pprint import pprint

payload = {
    'source': 'google',
    'url': 'https://www.google.com/search?hl=en&q=newton'  # search for newton
}

response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('USERNAME', 'PASSWORD'),
    json=payload,
)

pprint(response.json())
```

Here’s what the result should look like:

```
{
    "results": [
        {
            "content": "<!doctype html><html>...</html>",
            "created_at": "YYYY-DD-MM HH:MM:SS",
            "updated_at": "YYYY-DD-MM HH:MM:SS",
            "page": 1,
            "url": "https://www.google.com/search?hl=en&q=newton",
            "job_id": "1234567890123456789",
            "status_code": 200
        }
    ]
}
```

Notice how the `url` in the payload dictionary is a Google search results page. In this example, the keyword is `newton`. As you can see, the query is executed and the resulting HTML page is returned in the `content` key of the response.

### Customizing query parameters

Let's review the payload dictionary from the above example for scraping Google search data.

```
payload = {
    'source': 'google',
    'url': 'https://www.google.com/search?hl=en&q=newton'
}
```

The dictionary keys are parameters used to inform Google Scraper API about required customization. The first parameter is the `source`, which is really important because it sets the scraper we’re going to use. The default value is `google` – when you use it, you can set the `url` as any Google search page, and all the other parameters will be extracted from the URL.

Although in this guide we’ll be using the `google_search` source, there are many others: `google_ads`, `google_travel_hotels`, `google_lens`, and more (full list [here](https://developers.oxylabs.io/scraper-apis/web-scraper-api/google)). Keep in mind that if you set the source as `google_search`, you cannot use the `url` parameter.
Luckily, you can use several different parameters for acquiring public Google SERP data without having to create multiple URLs (more on that in the next section).

### Basic parameters

We’ll build the payload by adding the parameters one by one. First, begin with setting the source as `google_search`.

```
payload = {
    'source': 'google_search',
}
```

Now, let’s add `query` – a crucial parameter that determines what search results you’ll be retrieving. In our example, we’ll use `newton` as our search query. At this stage, the payload dictionary looks like this:

```
payload = {
    'source': 'google_search',
    'query': 'newton',
}
```

That said, `source` and `query` are the two essential parameters for scraping public Google search data. If you want the API to return Google search results at this stage, you can already send this `payload`. Now, let’s move to the next parameter.

## Location query parameters

If you want to see the results from Germany, use the `geo_location` parameter: `'geo_location': 'Germany'`. See the [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/features/geo-location#google) for the `geo_location` parameter to learn more about the correct values.

Also, here’s what changing the `locale` parameter looks like:

```
payload = {
    'source': 'google_search',
    'query': 'newton',
    'geo_location': 'Germany',
    'locale': 'en-us'
}
```

To learn more about the potential values of the `locale` parameter, check the [documentation](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/localization/domain-locale-results-language) as well. If you send the above payload, you’ll receive search results in American English from google.de, just like anyone physically located in Germany would.

### Controlling the number of results

By default, you’ll see the first ten results from the first page. If you want to customize this, you can use these parameters: `start_page`, `pages`, and `limit`.
The `start_page` parameter determines which page of search results to return. The `pages` parameter specifies the number of pages. Finally, the `limit` parameter sets the number of results on each page. For example, the following set of parameters fetches results from pages 11 and 12 of the search engine results, with 20 results on each page:

```
payload = {
    'start_page': 11,
    'pages': 2,
    'limit': 20,
    # ... other parameters
}
```

Apart from the search parameters we’ve covered so far, there are a few more you can use to fine-tune your results – see our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/google/search#request-parameter-values) on collecting public Google Search data.

### Python code for scraping Google search data

Now, let’s put together everything we’ve learned so far – here’s what the final script with the `shoes` keyword looks like:

```
import requests
from pprint import pprint

payload = {
    'source': 'google_search',
    'query': 'shoes',
    'geo_location': 'Germany',
    'locale': 'en-us',
    'parse': True,
    'start_page': 1,
    'pages': 5,
    'limit': 10,
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('USERNAME', 'PASSWORD'),
    json=payload,
)

if response.status_code != 200:
    print("Error - ", response.json())
    exit(-1)

pprint(response.json())
```

### Export scraped data to a CSV

One of the best Google Scraper API features is the ability to parse an HTML page into JSON. For that, you don't need to use BeautifulSoup or any other library – just send the `parse` parameter as `True`. Here is a sample payload:

```
payload = {
    'source': 'google_search',
    'query': 'adidas',
    'parse': True,
}
```

When sent to the Google Scraper API, this payload will return the results in JSON. To see a detailed JSON data structure, see our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api/google/search#structured-data). The key highlights:

- The results are in the dedicated `results` list. Here, each page gets a new entry.
- Each result contains the content in a dictionary key named `content`.
- The actual results are in the `results` key.

Note that there’s a `job_id` in the results.

The easiest way to save the data is by using the Pandas library, since it can normalize JSON quite effectively.

```
import pandas as pd
...
data = response.json()
df = pd.json_normalize(data['results'])
df.to_csv('export.csv', index=False)
```

Alternatively, you can also take note of the `job_id` and send a GET request to the following URL, along with your credentials.

```
https://data.oxylabs.io/v1/queries/{job_id}/results/normalized?format=csv
```

In case you need multiple data formats, you can utilize the [multi-format output](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/result-processing-and-storage/output-types/multi-format-output) feature to get all result types in a single API response. For example, the following will fetch parsed data, a screenshot of the page, and the raw HTML document:

```
https://data.oxylabs.io/v1/queries/{job_id}/results?type=parsed,png,raw
```

### Handling errors and exceptions

When scraping Google, you can run into several challenges: network issues, invalid query parameters, or API quota limitations. To handle these, you can use try-except blocks in your code. For example, if an error occurs when sending the API request, you can catch the exception and print an error message:

```
try:
    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=('USERNAME', 'PASSWORD'),
        json=payload,
    )
except requests.exceptions.RequestException as e:
    print("Error:", e)
```

If you send an invalid parameter, Google Scraper API will return the 400 response code. To catch these errors, check the status code:

```
if response.status_code != 200:
    print("Error - ", response.json())
```

Looking to scrape data from other Google sources?
[Google Sheets for Basic Web Scraping](https://github.com/oxylabs/web-scraping-google-sheets), [Google Play Scraper](https://github.com/oxylabs/google-play-scraper), [How To Scrape Google Jobs](https://github.com/oxylabs/how-to-scrape-google-jobs), [Google News Scraper](https://github.com/oxylabs/google-news-scraper), [How to Scrape Google Scholar](https://github.com/oxylabs/how-to-scrape-google-scholar), [How to Scrape Google Flights with Python](https://github.com/oxylabs/how-to-scrape-google-flights), [Scrape Google Search Results](https://github.com/oxylabs/scrape-google-python), [Scrape Google Trends](https://github.com/oxylabs/how-to-scrape-google-trends)
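
As a rough illustration of how the `start_page`, `pages`, and `limit` parameters described earlier interact, the sketch below computes which result pages a payload covers and the maximum number of results it could return. The helper names `pages_covered` and `max_results` are hypothetical and not part of the Oxylabs API, and the fallback values used when a key is missing are assumptions made for this example only.

```python
def pages_covered(payload):
    """Return the search result page numbers requested by a payload."""
    start = payload.get('start_page', 1)  # assumed fallback for this sketch
    count = payload.get('pages', 1)       # assumed fallback for this sketch
    return list(range(start, start + count))


def max_results(payload):
    """Return the upper bound on results: number of pages times results per page."""
    limit = payload.get('limit', 10)      # assumed fallback for this sketch
    return len(pages_covered(payload)) * limit


payload = {'start_page': 11, 'pages': 2, 'limit': 20}
print(pages_covered(payload))
print(max_results(payload))
```

Running such a check before sending a request can help confirm that the payload asks for the pages you intended.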

Customer Engagement
1.3K Github Stars
ai-map-py
Open Source

ai-map-py

# AI-Map [![AI-Map header](https://github.com/oxylabs/ai-map-py/blob/main/Github-AI-Studio-1200x628px-Map.png)](https://aistudio.oxylabs.io/apps/map?utm_source=877&utm_medium=affiliate&utm_campaign=ai_studio&utm_content=ai-map-py&groupid=877&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) [**AI-Map**](https://aistudio.oxylabs.io/apps/map) is a website map search agent designed by [**Oxylabs AI Studio**](https://aistudio.oxylabs.io) to intelligently explore and enumerate relevant pages online based on prompts. Unlike traditional crawlers or sitemaps, **AI-Map** uses natural language instructions to guide a map search agent and return structured lists of URLs that match your request. With AI-Map, you can quickly map entire websites or any content area on a domain without writing any complex crawling logic. ## Key features - **Flexible domain or URL mapping** – Start from any URL or domain root. - **Prompt-based instructions** – Describe what kind of pages you want in natural language. - **Adjustable parameters** – Customize the number of sources, rendering, and geo-location. - **Structured outputs** – Get structured lists of all relevant URLs. ## How it works To start mapping target websites with AI-Map: 1. **Enter a URL or domain** you want to map. 2. **Describe your target pages** with a natural language prompt. 3. **Configure mapping parameters** such as source limit, mapping depth, and geo-location. 4. **Run the AI mapping task** to collect URLs of your target pages from the domain. ### Installation To begin, make sure you have access to an API key (or [get a free trial](https://aistudio.oxylabs.io/register) with 1000 credits) and `Python 3.10` or above installed. 
You can install the `oxylabs-ai-studio` package using pip:

```bash
pip install oxylabs-ai-studio
```

### Code example (Python)

The following example shows how to use an AI-Map search agent to perform a simple website mapping task.

```python
from oxylabs_ai_studio.apps.ai_map import AiMap
import json

# Initialize authorization
ai_map = AiMap(api_key="<API_KEY>")

# Define the payload for request
payload = {
    "url": "https://career.oxylabs.io",
    "user_prompt": "job ad pages",
    "return_sources_limit": 10,
    "geo_location": None,
    "render_javascript": False,
}

# Start the mapping
result = ai_map.map(**payload)

# Output the results in JSON
print("URLs:")
print(json.dumps(result.data, indent=2))
```

Learn more about AI-Map and the Oxylabs AI Studio Python SDK in our [PyPI repository](https://pypi.org/project/oxylabs-ai-studio/). You can also check out our [AI Studio JavaScript SDK](https://github.com/oxylabs/oxylabs-ai-studio-js?tab=readme-ov-file#oxylabs-ai-studio-javascript-sdk) guide for JS users.

### Request parameters

| Parameter | Description | Default Value |
|------------|-------------|----------------|
| `url`* | Starting URL or domain for mapping | – |
| `user_prompt`* | Natural language prompt for pages to find | – |
| `return_sources_limit` | Max number of sources to return | `25` |
| `render_javascript` | Enable JavaScript rendering for dynamic content | `False` |
| `geo_location` | Proxy location in ISO2 format | – |

\* – mandatory parameters

### Output sample

AI-Map search returns results that are easy to integrate into various data collection workflows.
This is a direct JSON output example from the earlier request: ```json URLs: [ "https://career.oxylabs.io/job/247ac098/head-of-marketing", "https://career.oxylabs.io/job/2d9ae321/product-owner", "https://career.oxylabs.io/job/496841ed/php-developer-golang", "https://career.oxylabs.io/job/736ef009/talent-sourcer", "https://career.oxylabs.io/job/7a4a4415/senior-python-engineer-webshare-product", "https://career.oxylabs.io/job/88f7340c/devops-squad-lead", "https://career.oxylabs.io/job/9946db47/junior-project-manager", "https://career.oxylabs.io/job/b2e1397e/account-executive-north-america", "https://career.oxylabs.io/job/d72db15b/senior-event-marketing-manager", "https://career.oxylabs.io/job/e1efecf0/seo-and-growth-marketing-lead" ] ``` ## Practical use cases You can use the AI-Map search agent in various ways, including: - **Mapping product categories** on e-commerce sites. - **Listing all blog posts** from a company’s domain. - **Documenting API** endpoints. - **Discovering all documentation pages** or help articles. - **Finding job postings** on career portals. - **And many more…** ## FAQ ### What is website mapping? Website mapping involves discovering and listing all relevant URLs within a website or domain. AI-Map automates this process using AI to interpret your request, intelligently explore the target domain, and return structured results — such as all blog posts, documentation pages, or product categories. ### How does AI-Map differ from AI-Crawler? AI-Map focuses on **finding** relevant pages within a website, while AI-Crawler **extracts** structured data from those pages. In short, AI-Map helps you find *where* to collect data, and AI-Crawler helps you get the *data itself*. ### Does AI-Map work on any website? AI-Map search works with most public websites, including those with JavaScript-rendered content. However, pages that require authentication, have paywalls, or exist on private networks are not accessible out of the box. ### Is AI-Map free to use? 
Oxylabs AI Studio AI-Map is free to try by signing up for a free trial that includes 1,000 credits. After the trial, the [monthly plans](https://aistudio.oxylabs.io/pricing) start at just $12/month with 3,000 credits and 1 request/s, with higher plans offering more credits and higher request rates. ## Learn more For a deeper dive into available parameters, advanced integrations, and additional examples, check out the [AI Studio documentation](https://aistudio.oxylabs.io/apps/map). ## Contact us If you have questions or need support, reach out to us at [email protected], or through live chat, accessible via [Oxylabs Dashboard](https://dashboard.oxylabs.io/en/), or join our [Discord community](https://discord.gg/Pds3gBmKMH). For enterprise-related inquiries, contact your dedicated account manager.
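
Since AI-Map returns a plain list of URLs, a small amount of post-processing is often enough to feed them into another workflow. The sketch below splits job URLs like the ones in the output sample into an ID and a slug. The `/job/<id>/<slug>` path layout is inferred from that sample rather than documented, and `split_job_url` is a hypothetical helper, not part of the SDK.

```python
from urllib.parse import urlparse


def split_job_url(url):
    """Split a URL whose path looks like /job/<id>/<slug>; return None otherwise."""
    parts = [part for part in urlparse(url).path.split('/') if part]
    if len(parts) == 3 and parts[0] == 'job':
        return {'id': parts[1], 'slug': parts[2]}
    return None


# Two URLs copied from the output sample above.
urls = [
    "https://career.oxylabs.io/job/247ac098/head-of-marketing",
    "https://career.oxylabs.io/job/2d9ae321/product-owner",
]

for url in urls:
    print(split_job_url(url))
```

URLs that don't match the assumed layout return `None`, so they can be filtered out or logged instead of raising an error.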

AI Agents
1.3K Github Stars
how-to-scrape-google-finance
Open Source

how-to-scrape-google-finance

# How to Scrape Google Finance with Python [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/google-news-scraper/refs/heads/main/Scrape%20Google%20data%20with%20Web%20Scraper%20API.png)](https://oxylabs.io/products/scraper-api/serp/google?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=how-to-scrape-google-finance-github&transaction_id=102c8d36f7f0d0e5797b8f26152160) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) Follow this Python tutorial to learn how to scrape public data from Google Finance, such as **stock titles**, **pricing**, and **price changes in percentages**. We'll show how you can use Oxylabs' [Web Scraper API](https://oxylabs.io/products/scraper-api/web) for this task, which requires a subscription or a free trial. You can claim a **1-week free trial** by registering on the [Oxylabs dashboard](https://dashboard.oxylabs.io/). - [Step 1: Install prerequisite libraries](#step-1-install-prerequisite-libraries) - [Step 2: Build the core structure](#step-2-build-the-core-structure) - [Step 3: Create a parsing logic](#step-3-create-a-parsing-logic) * [1) Collect prices](#1-collect-prices) * [2) Get the stock price change in %](#2-get-the-stock-price-change-in-percentages) * [3) Retrieve the stock title](#3-retrieve-the-stock-title) - [Complete code sample](#complete-code-sample) ## Step 1: Install prerequisite libraries In your terminal, run this `pip` command: ```bash pip install requests bs4 ``` You may skip Beautiful Soup altogether and instead use [Custom Parser](https://oxylabs.io/features/custom-parser), which is built into the API. ## Step 2: Build the core structure Next, let's define the general logic for the finance data scraper. 
We’ll create functionality for defining multiple Google Finance URLs that we’d like to scrape. Afterwards, we’ll take these URLs one by one, collect the information we need and save it as a JSON file.

The following function will return the scraped Google Finance HTML page:

```python
import requests
from bs4 import BeautifulSoup


def get_finance_html(url):
    payload = {
        'source': 'google',
        'render': 'html',
        'url': url,
    }

    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=('username', 'password'),  # Use your API credentials here.
        json=payload,
    )

    response_json = response.json()
    html = response_json['results'][0]['content']

    return html
```

> [!NOTE]
> Don’t forget to replace the `username` and `password` placeholders with your own Oxylabs API credentials.

For the next step, we’ll be creating a function that accepts a `BeautifulSoup` object created from the HTML of the whole page. This function will create and return an object containing stock information. Let’s try to form the function in a way that makes it easy to extend (in case we need to).

```python
def extract_finance_information_from_soup(soup_of_the_whole_page):
    # Put data extraction here.
    listing = {}

    return listing
```

Since we can now get the HTML and have a function to hold our information extraction, we can combine both of those into one:

```python
def extract_finance_data_from_urls(urls):
    constructed_finance_results = []

    for url in urls:
        html = get_finance_html(url)
        soup = BeautifulSoup(html, 'html.parser')
        finance = extract_finance_information_from_soup(soup)

        constructed_finance_results.append({
            'url': url,
            'data': finance
        })

    return constructed_finance_results
```

This function will take a list of URLs as a parameter and return a list of dictionaries with the extracted financial data.
Last but not least, we need a function that takes this data and saves it as a file:

```python
import json


def save_results(results, filepath):
    with open(filepath, 'w', encoding='utf-8') as file:
        json.dump(results, file, ensure_ascii=False, indent=4)

    return
```

To wrap this up, we’ll create a simple `main()` function that invokes all that we’ve built so far:

```python
def main():
    results_file = 'data.json'

    urls = [
        'https://www.google.com/finance/quote/BNP:EPA?hl=en',
        'https://www.google.com/finance/quote/.DJI:INDEXDJX?hl=en',
        'https://www.google.com/finance/quote/.INX:INDEXSP?hl=en'
    ]

    constructed_finance_results = extract_finance_data_from_urls(urls)

    save_results(constructed_finance_results, results_file)
```

We’ve successfully built the core of the application. Now, let’s move on to creating functions for extracting specific data from Google Finance.

## Step 3: Create a parsing logic

### 1) Collect prices

First on the list is the pricing data. Navigating the HTML of Google Finance can get tricky (it seems to be quite dynamic), so let’s see how we can pinpoint the price.

We can see that most of the information about the stock is located inside a container named `main`.

![](/images/image_1.png)

Then, we’ll specify the `div` with the price itself – `AHmHk`.

![](/images/image_2.png)

Now that we’ve gathered everything, let’s write the function itself:

```python
def get_price(soup_element):
    price = soup_element.find('main').find('div', 'AHmHk').get_text()

    return price
```

### 2) Get the stock price change in percentages

Another important piece of information is the price change expressed as a percentage. We’ll begin with the same `main` container that we found earlier and specify an inner `div` that will contain only the price change – `JwB6zf`.

![](/images/image_3.png)

We’ve got all of the needed CSS information, so let’s extract the stock price change.
```python
def get_change(soup_element):
    change = soup_element.find('main').find('div', 'JwB6zf').get_text()

    return change
```

### 3) Retrieve the stock title

For the last piece of information, we’ll need the **name of the stock**. Again, we begin with the same `main` container. Then, we can specify an inner `div` that contains the name, which is `zzDege`.

![](/images/image_4.png)

The final step is to put this into a function for extraction.

```python
def get_name(soup_element):
    name = soup_element.find('main').find('div', 'zzDege').get_text()

    return name
```

Having all of these functions for financial data extraction, we just need to add them to the place we designated earlier to finish up our code:

```python
def extract_finance_information_from_soup(soup_of_the_whole_page):
    price = get_price(soup_of_the_whole_page)
    change = get_change(soup_of_the_whole_page)
    name = get_name(soup_of_the_whole_page)

    listing = {
        "name": name,
        "change": change,
        "price": price
    }

    return listing
```

## Complete code sample

```python
from bs4 import BeautifulSoup
import requests
import json


def get_price(soup_element):
    price = soup_element.find('main').find('div', 'AHmHk').get_text()

    return price


def get_change(soup_element):
    change = soup_element.find('main').find('div', 'JwB6zf').get_text()

    return change


def get_name(soup_element):
    name = soup_element.find('main').find('div', 'zzDege').get_text()

    return name


def save_results(results, filepath):
    with open(filepath, 'w', encoding='utf-8') as file:
        json.dump(results, file, ensure_ascii=False, indent=4)

    return


def get_finance_html(url):
    payload = {
        'source': 'google',
        'render': 'html',
        'url': url,
    }

    response = requests.request(
        'POST',
        'https://realtime.oxylabs.io/v1/queries',
        auth=('username', 'password'),
        json=payload,
    )

    response_json = response.json()
    html = response_json['results'][0]['content']

    return html


def extract_finance_information_from_soup(soup_of_the_whole_page):
    price = get_price(soup_of_the_whole_page)
    change = get_change(soup_of_the_whole_page)
    name = get_name(soup_of_the_whole_page)

    listing = {
        "name": name,
        "change": change,
        "price": price
    }

    return listing


def extract_finance_data_from_urls(urls):
    constructed_finance_results = []

    for url in urls:
        html = get_finance_html(url)
        soup = BeautifulSoup(html, 'html.parser')
        finance = extract_finance_information_from_soup(soup)

        constructed_finance_results.append({
            'url': url,
            'data': finance
        })

    return constructed_finance_results


def main():
    results_file = 'data.json'

    urls = [
        'https://www.google.com/finance/quote/BNP:EPA?hl=en',
        'https://www.google.com/finance/quote/.DJI:INDEXDJX?hl=en',
        'https://www.google.com/finance/quote/.INX:INDEXSP?hl=en'
    ]

    constructed_finance_results = extract_finance_data_from_urls(urls)

    save_results(constructed_finance_results, results_file)


if __name__ == "__main__":
    main()
```

Looking to scrape data from other Google sources? [Google Search Results](https://github.com/oxylabs/scrape-google-python), [Google Sheets for Basic Web Scraping](https://github.com/oxylabs/web-scraping-google-sheets), [How to Scrape Google Shopping Results](https://github.com/oxylabs/scrape-google-shopping), [Google Play Scraper](https://github.com/oxylabs/google-play-scraper), [How To Scrape Google Jobs](https://github.com/oxylabs/how-to-scrape-google-jobs), [Google News Scraper](https://github.com/oxylabs/google-news-scraper), [How to Scrape Google Scholar](https://github.com/oxylabs/how-to-scrape-google-scholar), [How to Scrape Google Flights with Python](https://github.com/oxylabs/how-to-scrape-google-flights), [Scrape Google Trends](https://github.com/oxylabs/how-to-scrape-google-trends)
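
The scraper above stores the price and the change as raw text. If you need numbers for further analysis, a conversion step along the following lines may help. This is a minimal sketch: `parse_number` is a hypothetical helper, and it assumes the `hl=en` number format (comma as thousands separator, dot as decimal separator), which may not match what the page returns for other locales.

```python
import re


def parse_number(text):
    """Convert strings such as '$1,234.56' or '-0.42%' to a float, or None on failure."""
    # Normalize the Unicode minus sign, then drop everything except digits, dot, and minus.
    cleaned = re.sub(r'[^0-9.\-]', '', text.replace('\u2212', '-'))
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_number('$1,234.56'))
print(parse_number('-0.42%'))
print(parse_number('n/a'))
```

Returning `None` instead of raising keeps a single unparseable value from stopping the whole run; the caller can decide whether to skip or log it.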

Customer Engagement
1.1K Github Stars
quick-start-guide
Open Source

quick-start-guide

# Quick Start Guide

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=quick-start-guide-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

# Introduction

- [Search Engines](./Guides/Search%20Engines.md)
- [E-Commerce Websites](./Guides/E-Commerce%20Websites.md)
- [Other Websites](./Guides/Other%20Websites.md)

These quick start guides for Oxylabs' [Web Scraper API](https://oxylabs.io/products/scraper-api/web) will show how to get the most use out of a **one-week free trial** and help you decide whether this scraping solution serves your business’s needs.

Each guide will walk you through setting up the API from scratch. Configuring it with a website of your choice will give you a good grasp of the data quality, ease of integration, and whether this product is right for you.

We'll show you basic code samples using Python, but you can also find more programming languages and more complex examples in our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api).

## Interface for Testing Web Scraper API

Alternatively, you may want to use our intuitive web-based interface to test the capabilities of Web Scraper API. After registering a free account on the Oxylabs [dashboard](https://dashboard.oxylabs.io/), visit the **Web Scraper API Playground**. Here, you can also try out our **AI-powered OxyCopilot** feature, which creates scraping and parsing code from prompts written in plain English.

Flash Cards & Spaced Repetition Documentation
514 Github Stars
web-scraping-google-sheets
Open Source

web-scraping-google-sheets

# Guide to Using Google Sheets for Basic Web Scraping

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/products/scraper-api/serp/google?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=web-scraping-google-sheets-github&transaction_id=102c8d36f7f0d0e5797b8f26152160) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

* [What is IMPORTXML](#WhatisIMPORTXML)
* [How to extract data from a website to Google Sheets](#HowtoextractdatafromawebsitetoGoogleSheets)
* [Other related functions](#Otherrelatedfunctions)
* [Import table from website to Google Sheets](#ImporttablefromwebsitetoGoogleSheets)
* [Import data from XML Feeds to Google Sheets](#ImportdatafromXMLFeedstoGoogleSheets)
* [Customizing data imported by IMPORTFEED](#CustomizingdataimportedbyIMPORTFEED)
* [Importing Data from CSV to Google Sheets](#ImportingDatafromCSVtoGoogleSheets)
* [Does the data stay fresh?](#Doesthedatastayfresh)
* [Common Errors](#common-errors)
* [Errors related to volatile functions](#Errorsrelatedtovolatilefunctions)

Google Sheets can be a very effective tool for web scraping. While most ways of web scraping require you to write code, web scraping with Google Sheets needs no coding or add-ons. All you need to do is use a built-in function of Google Sheets.

This guide gives an overview of how to scrape website data with Google Sheets. If you want to learn more, see our blog post.

## <a name='WhatisIMPORTXML'></a>What is IMPORTXML

IMPORTXML is a function that can import data from various structured data types.
If you want to extract the title element from the Quotes to Scrape web page, the formula would be as follows:

```
=IMPORTXML("https://quotes.toscrape.com/","//title")
```

As evident here, the first parameter is the web page URL, and the second parameter is the XPath query.

If you want to extract the first quote from the webpage, the formula will be as follows:

```
=IMPORTXML("https://quotes.toscrape.com/","(//*[@class='text']/text())[1]")
```

If this XPath query seems like something you are not comfortable with, we recommend reading the XPath section on [our blog](https://oxylabs.io/blog/xpath-vs-css) to learn more about writing XPath queries.

Alternatively, you can enter the URL and the XPath query in two cells, for example, A1 and A2, and reference them in the formula:

```
=IMPORTXML(A1,A2)
```

## <a name='HowtoextractdatafromawebsitetoGoogleSheets'></a>How to extract data from a website to Google Sheets

### <a name='Step1:FindXPathforselectingElements'></a>Step 1: Find XPath for selecting Elements

In this example, we will work with https://books.toscrape.com/, and we want to get all the book titles. These requirements mean that we need to write a custom XPath. This XPath is as follows: `//h3/a/@title`

### <a name='Step2:CreateanewGoogleSheet'></a>Step 2: Create a new Google Sheet

Navigate to [Google Sheets](https://docs.google.com/spreadsheets/u/0/) and create a new sheet. This step requires you to log in to your Google account if you haven't done so already.

### <a name='Step3:EntertheURLandXPathintwocells'></a>Step 3: Enter the URL and XPath in two cells

Enter the URL of the webpage and the XPath in two cells, for example, B1 and B2.

### <a name='Step4:ExtractWebsiteDataWithGoogleSheets'></a>Step 4: Extract Website Data With Google Sheets

In a new cell, for example, A2, enter the following formula:

```
=IMPORTXML(B1,B2)
```

This formula effectively calls the following function:

```
=IMPORTXML("https://books.toscrape.com/","//h3/a/@title")
```

If you want to extract the book prices, the first step is to create the XPath for prices.
This XPath would be as follows: ``` //*[@class="price_color"]/text() ``` Enter this XPath in a Cell, let's say, B3. After that, enter the following formula in the cell B4: ``` =IMPORTXML(B1, B3) ``` ## <a name='Otherrelatedfunctions'></a>Other related functions Apart from IMPORTXML, a few other functions can be used for web scraping directly from Google Sheets: - IMPORTHTML - IMPORTFEED - IMPORTDATA ### <a name='ImporttablefromwebsitetoGoogleSheets'></a>Import table from website to Google Sheets This function expects three parameters: - URL - Either "table" or "list" - The index of the table or the list you want to scrape. For example, see [List of highest-grossing films - Wikipedia](https://en.wikipedia.org/wiki/List_of_highest-grossing_films). This page contains the list in a table. ``` =IMPORTHTML(B1,"table",1) ``` For example, if we wanted only the movie titles, which are in column number 3, our formula would be as follows: ``` =INDEX(IMPORTHTML("https://en.wikipedia.org/wiki/List_of_highest-grossing_films","table",1),,3) ``` ## <a name='ImportdatafromXMLFeedstoGoogleSheets'></a>Import data from XML Feeds to Google Sheets Let's take the example of the [New York Times Technology feeds](https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml) to see this function in action.  Create a new sheet and enter the url of the feed in cell B1: https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml Now in the cell A2, enter the following formula: ``` =IMPORTFEED(B1) ``` ## <a name='CustomizingdataimportedbyIMPORTFEED'></a>Customizing data imported by IMPORTFEED The IMPORTFEED function has the following optional parameters: - Query - You can use this to specify which information you want to import. More on this just in a bit. - Headers - As you can see from the above image, there are no headers in the imported data. If you want to see column headers, then set this parameter to TRUE. - num_items - You can also control how many items are fetched. 
If you want only five items to be imported, set this parameter to 5. Update the function call to the following: ``` =IMPORTFEED(B1,,TRUE,5) ``` If you want only the information about the feed, enter the following formula: ``` =IMPORTFEED(B1,"feed") ``` If you want to get only the titles, enter the following formula: ``` =IMPORTFEED(B1,"items title") ``` ## <a name='ImportingDatafromCSVtoGoogleSheets'></a>Importing Data from CSV to Google Sheets If you have a URL that contains a CSV file, you can use the IMPORTDATA function to get the data. For example, create a new sheet and enter the following URL in the cell B1: https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment.csv In the cell A2, enter the following formula: ``` =IMPORTDATA(B1) ``` ## <a name='Doesthedatastayfresh'></a>Does the data stay fresh? If you keep your google sheet open, these functions check for updated data every hour. Data will also be refreshed if you delete and add the same cell. Note that data will not be refreshed if you refresh your sheet. Data will also not be refreshed if you copy-paste a cell with these functions. # Common Errors The following are some of the common errors you may face while creating your web scraping Google Sheet: ## <a name='Error:Arrayresultwasnotexpanded'></a>Error: Array result was not expanded Array result was not expanded because it would overwrite data in A36. This error means you need to make room by adding more cells for the results. ## <a name='Error:Resulttoolarge'></a>Error: Result too large The solution is to update the XPath query so that a smaller amount of data is returned.  ## <a name='Errorsrelatedtovolatilefunctions'></a>Errors related to volatile functions If you see the following error: Error: This function is not allowed to reference a cell with NOW(), RAND(), or RANDBETWEEN() It means that you are trying to reference one of the volatile functions, such as NOW, RAND, or RANDBETWEEN, in one of the parameters. 
These references may be indirect or direct.

Read More Google Scraping Related Repositories: [Google Play Scraper](https://github.com/oxylabs/google-play-scraper), [How To Scrape Google Jobs](https://github.com/oxylabs/how-to-scrape-google-jobs), [Google News Scraper](https://github.com/oxylabs/google-news-scraper), [How to Scrape Google Scholar](https://github.com/oxylabs/how-to-scrape-google-scholar), [How to Scrape Google Flights with Python](https://github.com/oxylabs/how-to-scrape-google-flights), [How To Scrape Google Images](https://github.com/oxylabs/how-to-scrape-google-images), [Scrape Google Search Results](https://github.com/oxylabs/scrape-google-python), [Scrape Google Trends](https://github.com/oxylabs/how-to-scrape-google-trends)
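
For readers who later want to move from spreadsheet formulas to code, the sketch below is a rough Python analog of `=IMPORTFEED(B1,"items title",,5)`: it reads item titles from an RSS document and optionally limits their number. It is not how Google Sheets implements the function. The feed is a small inline sample invented for this example, and `item_titles` is a hypothetical helper built on Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# A tiny made-up RSS document, so the example runs without network access.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First post</title><link>https://example.com/1</link></item>
    <item><title>Second post</title><link>https://example.com/2</link></item>
    <item><title>Third post</title><link>https://example.com/3</link></item>
  </channel>
</rss>"""


def item_titles(feed_xml, num_items=None):
    """Return the titles of the feed items, optionally capped at num_items."""
    root = ET.fromstring(feed_xml)
    titles = [element.text for element in root.findall('./channel/item/title')]
    return titles if num_items is None else titles[:num_items]


print(item_titles(SAMPLE_FEED, num_items=2))
```

The path `./channel/item/title` plays the same role as the `"items title"` query, and `num_items` mirrors the parameter of the same name in IMPORTFEED.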

Workflow Automation Flash Cards & Spaced Repetition
454 Github Stars
Python-Web-Scraping-Tutorial
Open Source

Python-Web-Scraping-Tutorial

# Python Web Scraping Tutorial: Step-By-Step [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=python-web-scraping-tutorial-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) ## Table of Contents - [Web Scraping in 5 Lines of Code](#Web-Scraping-in-5-Lines-of-Code) - [Components of a Web Scraping with Python Code](#Components-of-a-Web-Scraping-with-Python-Code) - [Python Libraries](#Python-Libraries) - [Python Web Scraping: Working with Requests](#Python-Web-Scraping-Working-with-Requests) - [BeautifulSoup](#BeautifulSoup) - [Find Methods in BeautifulSoup4](#Find-Methods-in-BeautifulSoup4) - [Finding Multiple Elements](#Finding-Multiple-Elements) - [Finding Nested Elements](#Finding-Nested-Elements) - [Exporting the data](#Exporting-the-data) - [Other Tools](#Other-Tools) In this Python Web Scraping Tutorial, we will outline everything needed to get started with web scraping. We will begin with simple examples and move on to relatively more complex. Python is arguably the most suitable programming language for web scraping because of its ease and a plethora of open source libraries. Some libraries make it easy to extract the data and to transform the data into any format needed, be it a simple CSV, to a more programmer-friendly JSON, or even save directly to the database. Web scraping with Python is so easy that it can be done in as little as 5 lines of code. ## Web Scraping in 5 Lines of Code Write these five lines in any text editor, save as a `.py` file, and run with Python. 
Note that this code assumes that you have the libraries installed. More on this later.

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://en.wikipedia.org/wiki/Web_scraping")
bs = BeautifulSoup(response.text, "lxml")
print(bs.find("p").text)
```

This will go to the Wikipedia page for web scraping and print the first paragraph in the terminal. This code shows the simplicity and power of Python. You will find this code in the `webscraping_5lines.py` file.

## Components of a Web Scraping with Python Code

The main building blocks of any web scraping code are as follows:

1. Get HTML
2. Parse HTML into Python object
3. Save the data extracted

In most cases, there is no need to use a browser to get the HTML. While HTML contains the data, the other files that the browser loads, like images, CSS, JavaScript, etc., just make the website pretty and functional. Web scraping is focused on data. Thus in most cases, there is no need to get these helper files.

There will be some cases when you do need to open the browser. Python makes that easy too.

## Python Libraries

Web scraping with Python is easy due to the many useful libraries available. A barebones installation of Python isn’t enough for web scraping. One of the [Python advantages](https://oxy.yt/RrXa) is a large selection of libraries for web scraping. For this Python web scraping tutorial, we’ll be using three important libraries – requests, BeautifulSoup, and CSV.

- The [Requests](https://docs.python-requests.org/en/master/) library is used to get the HTML files, bypassing the need to use a browser.
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) is used to convert the raw HTML into a Python object, also called parsing. We will be working with Version 4 of this library, also known as `bs4` or `BeautifulSoup4`.
- The [CSV](https://docs.python.org/3/library/csv.html) library is part of the standard Python installation. No separate installation is required.
- Typically, a [virtual environment](https://docs.python.org/3/tutorial/venv.html) is used to install these libraries. If you don't know about virtual environments, you can install these libraries in the user folder.

To install these libraries, start the terminal or command prompt of your OS and type in:

```sh
pip install requests BeautifulSoup4 lxml
```

Depending on your OS and settings, you may need to use `pip3` instead of `pip`. You may also need to use the `--user` switch, depending on your settings.

## Python Web Scraping: Working with Requests

The requests library eliminates the need to launch a browser, which would load the web page and all the supporting files that make the website pretty. The data that we need to extract is in the HTML. The Requests library allows us to send a request to a webpage and get the response HTML.

Open a text editor of your choice: Visual Studio Code, PyCharm, Sublime Text, Jupyter Notebooks, or even Notepad. Use the one you are familiar with.

Type in these lines:

```python
import requests

url_to_parse = "https://en.wikipedia.org/wiki/Python_(programming_language)"
response = requests.get(url_to_parse)
print(response)
```

Save this file as a Python file with the `.py` extension and run it from your terminal. The output should be something like this:

```
<Response [200]>
```

It means that the response has been received and the status code is 200. The HTTP response code 200 means a successful response. Response codes in the 400 and 500 ranges indicate errors. You can read more about the response codes [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).

To get the HTML from the response object, we can simply use the `.text` attribute.

```python
print(response.text)
```

This will print the HTML in the terminal. The first few characters will be something like this:

```html
<!DOCTYPE html>\n<html class="client-nojs" lang=" ...
```

If we check the data type of this, it will be a string.
The next step is to convert this string into something that can be queried to find the specific information.

Meet BeautifulSoup!

## BeautifulSoup

Beautiful Soup provides simple methods for navigating, searching, and modifying the HTML. It takes care of encoding by automatically converting into UTF-8. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib. It is possible to [use lxml directly to query documents](https://oxy.yt/ZrZd), but BeautifulSoup allows you to try out different parsing strategies without changing the code.

The first step is to decide which parser you want to use. Usually, `lxml` is the most commonly used. This will need a separate install.

```sh
pip install lxml
```

Once `beautifulsoup4` and `lxml` are installed, we can create a BeautifulSoup object:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'lxml')
```

Now we have access to several methods to query the HTML elements. For example, to get the title of the page, all we need to do is access the tag name like an attribute:

```python
print(soup.title)
# OUTPUT:
# <title>Python (programming language) - Wikipedia</title>

print(soup.title.text)
# OUTPUT:
# Python (programming language) - Wikipedia
```

Note that to get the text inside the element, we simply used the `text` attribute.

Similarly, `soup.h1` will return the **first** `h1` tag it finds:

```python
print(soup.h1)
# OUTPUT:
# <h1 class="firstHeading" id="firstHeading">Python (programming language)</h1>
```

## Find Methods in BeautifulSoup4

Perhaps the most commonly used methods are `find()` and `find_all()`. Let’s open the Wikipedia page and get the table of contents.

The signature of find looks something like this:

```python
find(name=None, attrs={}, recursive=True, text=None, **kwargs)
```

As is evident, the find method can be used to find elements based on `name`, `attributes`, or `text`. This should cover most of the scenarios.
For scenarios like finding by `class`, there is `**kwargs` that can take other filters.

Moving on to the Wikipedia example, the first step is to look at the HTML markup for the table of contents to be extracted. Right-click on the div that contains the table of contents and examine its markup. It is clear that the whole table of contents is in a div tag with the class attribute set to toc:

```html
<div id="toc" class="toc">
```

If we simply run `soup.find("div")`, it will return the first div it finds - similar to writing `soup.div`. This needs filtering as we need a specific div. We are lucky in this case as it has an `id` attribute. The following line of code can extract the div element:

```python
soup.find("div", id="toc")
```

Note the second parameter here - `id="toc"`. The find method does not have a named parameter `id`, but this still works because the filter is implemented through `**kwargs`.

Be careful with the CSS class, though. `class` is a reserved keyword in Python. It cannot be used as a parameter name directly. There are two workarounds – first, just use `class_` instead of `class`. The second workaround is to use a dictionary as the second argument.

This means that the following two statements are equivalent:

```python
soup.find("div", class_="toc")  # note the underscore
soup.find("div", {"class": "toc"})
```

The advantage of using a dictionary is that more than one attribute can be specified. For example, if you need to specify both class and id, you can use the find method in the following manner:

```python
soup.find("div", {"class": "toc", "id": "toc"})
```

What if we need to find multiple elements?

## Finding Multiple Elements

Consider this scenario - the objective is to create a CSV file, which has two columns. The first column contains the heading number and the second column contains the heading text.

To find multiple elements, we can use the `find_all` method.
This method works the same way as the find method, except that instead of one element, it returns a list of all the elements that match the criteria. If we look at the source code, we can see that all the heading text is inside a `span`, with `toctext` as the class. We can use the find_all method to extract all these:

```python
soup.find_all("span", class_="toctext")
```

This will return a list of elements:

```shell
[<span class="toctext">History</span>,
<span class="toctext">Design philosophy and features</span>,
<span class="toctext">Syntax and semantics</span>,
<span class="toctext">Indentation</span>,
.....]
```

Similarly, the heading numbers can be extracted using this statement:

```python
soup.find_all("span", class_="tocnumber")
```

This will return a list of elements:

```shell
[<span class="tocnumber">1</span>,
<span class="tocnumber">2</span>,
<span class="tocnumber">3</span>,
<span class="tocnumber">3.1</span>,
...]
```

However, we need to have one list containing both the number and text.

## Finding Nested Elements

We need to take one step back and look at the markup. The whole table of contents can be selected with this statement:

```python
table_of_contents = soup.find("div", id="toc")
```

If we look at the markup, we can see that each heading number and text is inside an `li` tag.

One of the great features of BeautifulSoup is that the `find` and `find_all` methods can be used on the elements it returns (`Tag` objects) too. In the above example, `table_of_contents` is a `Tag`. We can find all the li tags inside this element.

```python
headings = table_of_contents.find_all("li")
```

Now we have a list of elements. Each of these elements contains both the heading text and the heading number. A simple for loop can be used to create a dictionary for each heading, which can be added to a list.
```python
data = []
for heading in headings:
    heading_text = heading.find("span", class_="toctext").text
    heading_number = heading.find("span", class_="tocnumber").text
    data.append({
        'heading_number': heading_number,
        'heading_text': heading_text,
    })
```

If this data is printed, it is a list of dictionaries.

```shell
[{'heading_number': '1', 'heading_text': 'History'},
 {'heading_number': '2', 'heading_text': 'Design philosophy and features'},
 {'heading_number': '3', 'heading_text': 'Syntax and semantics'},
 {'heading_number': '3.1', 'heading_text': 'Indentation'},
 {'heading_number': '3.2', 'heading_text': 'Statements and control flow'},
 .....]
```

This data can now be exported easily using the CSV module.

## Exporting the data

The data can be easily exported to a CSV file using the csv module. The first step is to open a file in write mode. Note that the `newline` parameter should be set to an empty string. If this is not done, you will see unwanted newline characters in your CSV file.

```python
import csv

file = open("toc.csv", "w", newline="")
```

After that, create an instance of the DictWriter object. This needs a list of headers. In our case, these are simply going to be the dictionary keys in the data.

```python
writer = csv.DictWriter(file, fieldnames=['heading_number', 'heading_text'])
```

Optionally, write the header and then call the `writer.writerows()` method to write the `data`. To write one row, use the method `writerow()`. To write all rows, use the method `writerows()`.

```python
writer.writeheader()
writer.writerows(data)
file.close()  # flush the rows to disk
```

That's it! We have the data ready in a CSV.

You can find the complete code in the `wiki_toc.py` file.

Also, check this tutorial on [pypi](https://pypi.org/project/python-web-scraping-tutorial-step-by-step/).

## Other Tools

Some websites do not include the data in the HTML; instead, it is loaded from other files using JavaScript. In such cases, you would need a solution that uses a browser. A good example would be Selenium.
We have a [detailed guide on Selenium here](https://en.wikipedia.org/wiki/Web_scraping).
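
Returning to the export step, the three CSV snippets can be combined into one function. The following is a minimal sketch, not the contents of `wiki_toc.py`: the function name `export_headings`, the sample rows, and the temporary output path are assumptions made for illustration.

```python
import csv
import os
import tempfile


def export_headings(data, path):
    """Write a list of heading dictionaries to a CSV file at `path`."""
    # newline="" prevents the csv module from emitting extra blank lines
    with open(path, "w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["heading_number", "heading_text"])
        writer.writeheader()
        writer.writerows(data)


# Sample rows in the same shape as the scraped table of contents
sample = [
    {"heading_number": "1", "heading_text": "History"},
    {"heading_number": "3.1", "heading_text": "Indentation"},
]
csv_path = os.path.join(tempfile.gettempdir(), "toc_sketch.csv")
export_headings(sample, csv_path)
```

Using a `with` block closes the file automatically, so the rows are flushed to disk even if writing raises an error.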

Education & Learning Browser Automation
305 Github Stars
playwright-web-scraping
Open Source

playwright-web-scraping

# Web Scraping With Playwright

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=playwright-web-scraping-github&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

[<img src="https://img.shields.io/static/v1?label=&message=Playwright&color=brightgreen" />](https://github.com/topics/playwright) [<img src="https://img.shields.io/static/v1?label=&message=Web%20Scraping&color=important" />](https://github.com/topics/web-scraping)

- [Web Scraping With Playwright](#web-scraping-with-playwright)
  - [Support for proxies in Playwright](#support-for-proxies-in-playwright)
    - [Node.js](#nodejs)
    - [Python](#python)
    - [Node JS](#node-js)
    - [Python Code](#python-code)
  - [Basic scraping with Playwright](#basic-scraping-with-playwright)
  - [Locating elements](#locating-elements)
  - [Scraping text](#scraping-text)
  - [Scraping Images](#scraping-images)
    - [Node JS](#node-js-1)
    - [Python](#python-1)
  - [Intercepting HTTP Requests with Playwright](#intercepting-http-requests-with-playwright)
    - [Python](#python-2)
    - [Node JS](#node-js-2)

This article covers web scraping with Playwright in Node.js and Python: using proxies, locating elements, scraping text and images, and intercepting HTTP requests.

For a detailed explanation, see our [blog post](https://oxy.yt/erHw).

## Support for proxies in Playwright

Playwright supports the use of proxies.
Before exploring this subject further, here is a quick code snippet showing how to launch Chromium, which is the starting point for adding a proxy:

### Node.js

```javascript
const { chromium } = require('playwright')
const browser = await chromium.launch() // run this inside an async function
```

### Python

```python
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
```

This code needs only slight modifications to fully utilize proxies.

In the case of Node.js, the launch function can accept an optional parameter of launch options. This `launchOptions` object can, in turn, carry several other parameters, e.g., headless. The other parameter needed is proxy. This proxy is another object with properties such as server, username, password, etc. The first step is to create an object where these parameters can be specified, and then pass it to the launch method as in the example below:

### Node JS

```javascript
const playwright = require("playwright");

(async () => {
    for (const browserType of ['chromium', 'firefox', 'webkit']) {
        const launchOptions = {
            headless: false,
            proxy: {
                server: "http://pr.oxylabs.io:7777",
                username: "USERNAME",
                password: "PASSWORD"
            }
        }
        const browser = await playwright[browserType].launch(launchOptions)
    }
})()
```

In the case of Python, it’s slightly different. There’s no need to create an object of LaunchOptions. Instead, all the values can be sent as separate parameters. Here’s how the proxy dictionary will be sent:

### Python Code

```python
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )
```

When deciding on which proxy to use, residential proxies are usually the best choice, as they are less likely to be flagged as automated traffic.
Oxylabs’ Residential Proxies can help you with an extensive and stable proxy network. You can access proxies in a specific country, state, or even a city. Importantly, you can integrate them easily with Playwright as well.

## Basic scraping with Playwright

Let’s move on to how to get started with Playwright using Node.js and Python.

If you’re using Node.js, create a new project and install the Playwright library. This can be done using these two simple commands:

```shell
npm init -y
npm install playwright
```

A basic script that opens a dynamic page is as follows:

```javascript
const playwright = require("playwright");

(async () => {
    for (const browserType of ['chromium', 'firefox', 'webkit']) {
        const browser = await playwright[browserType].launch()
        const context = await browser.newContext()
        const page = await context.newPage()
        await page.goto("https://amazon.com")
        await page.waitForTimeout(1000)
        await browser.close()
    }
})()
```

Let’s look at the above code – the first line of the code imports Playwright. Then, multiple browsers are launched, one after another. This allows the script to automate Chromium, Firefox, and WebKit. Then, a new browser page is opened. Afterward, the `page.goto()` function navigates to the Amazon web page. After that, there’s a wait of 1 second to show the page to the end user. Finally, the browser is closed.

The same code can be written in Python easily. First, install the Playwright Python library using the pip command and also install the necessary browsers afterward using the install command:

```shell
python -m pip install playwright
playwright install
```

Note that Playwright supports two variations – synchronous and asynchronous.
The following example uses the asynchronous API:

```python
from playwright.async_api import async_playwright
import asyncio

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        await page.goto('https://amazon.com')
        await page.wait_for_timeout(1000)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
```

This code is similar to the Node.js code. The biggest difference is the use of the `asyncio` library. Another difference is that the function names change from camelCase to snake_case.

In Node.js, if you want to create more than one browser context or if you want to have finer control, you can create a context object and create multiple pages in that context. This would open pages in new tabs:

```javascript
const context = await browser.newContext()
const page1 = await context.newPage()
const page2 = await context.newPage()
```

You may also want to handle page context in your code. It’s possible to get the browser context that the page belongs to using the `page.context()` function.

## Locating elements

To extract information from any element or to click any element, the first step is to locate the element. Playwright supports both CSS and XPath selectors.

This can be understood better with a practical example. Open the following Amazon link:

<https://www.amazon.com/b?node=17938598011>

You can see that all the items are under the International Best Seller category, which has div elements with the class name "a-spacing-base".

To select all the div elements, you need to run a loop over all these elements.
These div elements can be selected using the CSS selector:

```css
.a-spacing-base
```

Similarly, the XPath selector would be as follows:

```text
//*[@class="a-spacing-base"]
```

To use these selectors, the most common functions are as follows:

- `$eval(selector, function)` – selects the first element, sends the element to the function, and the result of the function is returned;
- `$$eval(selector, function)` – same as above, except that it selects all elements;
- `querySelector(selector)` – returns the first element;
- `querySelectorAll(selector)` – returns all the elements.

Playwright's selector-based methods accept both CSS and XPath selectors.

## Scraping text

Continuing with the example of Amazon, after the page has been loaded, you can use a selector to extract all products using the `$$eval` function.

```javascript
const products = await page.$$eval('.a-spacing-base', all_products => {
    // run a loop here
})
```

Now all the elements that contain product data can be extracted in a loop:

```javascript
all_products.forEach(product => {
    const title = product.querySelector('span.a-size-base-plus').innerText
})
```

Finally, the `innerText` attribute can be used to extract the data from each data point.
Here’s the complete code in Node.js:

```javascript
const playwright = require("playwright");

(async () => {
    for (const browserType of ['chromium', 'firefox', 'webkit']) {
        const launchOptions = {
            headless: false,
            proxy: {
                server: "http://pr.oxylabs.io:7777",
                username: "USERNAME",
                password: "PASSWORD"
            }
        }
        const browser = await playwright[browserType].launch(launchOptions)
        const context = await browser.newContext()
        const page = await context.newPage()
        await page.goto('https://www.amazon.com/b?node=17938598011');
        const products = await page.$$eval('.a-spacing-base', all_products => {
            const data = []
            all_products.forEach(product => {
                const title = product.querySelector('span.a-size-base-plus').innerText
                const price = product.querySelector('span.a-price').innerText
                const rating = product.querySelector('span.a-icon-alt').innerText
                data.push({ title, price, rating })
            });
            return data
        })
        console.log(products)
        await browser.close()
    }
})()
```

The Python code will be a bit different. Python has a function `eval_on_selector`, which is similar to the `$eval` of Node.js, but it’s not suitable for this scenario. The reason is that the second parameter still needs to be JavaScript. This can be good in certain scenarios, but in this case, it will be much better to write the entire code in Python.

It would be better to use `query_selector` and `query_selector_all`, which return an element and a list of elements respectively.
```python
from playwright.async_api import async_playwright
import asyncio


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )
        page = await browser.new_page()
        await page.goto('https://www.amazon.com/b?node=17938598011')
        await page.wait_for_timeout(5000)

        all_products = await page.query_selector_all('.a-spacing-base')
        data = []
        for product in all_products:
            result = dict()
            title_el = await product.query_selector('span.a-size-base-plus')
            result['title'] = await title_el.inner_text()
            price_el = await product.query_selector('span.a-price')
            result['price'] = await price_el.inner_text()
            rating_el = await product.query_selector('span.a-icon-alt')
            result['rating'] = await rating_el.inner_text()
            data.append(result)
        print(data)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
```

The output of both the Node.js and the Python code will be the same.

## Scraping Images

Next, we will learn how to scrape images using Playwright. For this example, we will be using the Oxylabs official website as an image source. If you visit the website <https://oxylabs.io>, you will notice there are many images. We will extract all these images and save them in our current directory.

First, let’s explore how we can accomplish this using Node JS.

### Node JS

The code will be similar to the one that we’ve written earlier. There are multiple ways to extract images using Playwright's JavaScript API. In this example, we will be using two additional built-in Node.js modules, `https` and `fs`. These modules will help us make network requests to download the images and store them in the current directory.
Take a look at the full source code below:

```javascript
const playwright = require("playwright");
const https = require('https');
const fs = require('fs');

(async () => {
    const launchOptions = {
        headless: false,
        proxy: {
            server: "http://pr.oxylabs.io:7777",
            username: "USERNAME",
            password: "PASSWORD"
        }
    }
    const browser = await playwright["chromium"].launch(launchOptions)
    const context = await browser.newContext()
    const page = await context.newPage()
    await page.goto('https://oxylabs.io');
    // $$eval runs inside the page, so only collect the image URLs there
    const images = await page.$$eval('img', all_images =>
        all_images.map(image => image.src).filter(src => src.startsWith('https://'))
    )
    // download in Node.js, where the fs and https modules are available
    images.forEach((src, index) => {
        const path = `image_${index}.svg`
        const file = fs.createWriteStream(path)
        https.get(src, function(response) {
            response.pipe(file);
        })
    })
    console.log(images)
    await browser.close()
})()
```

As you can see, we are initializing a Chromium browser instance with the Oxylabs Residential proxy just like in the previous example. After navigating to the website, we use `$$eval` to collect the `src` URL of every image element. The `$$eval` callback runs inside the page, where Node.js modules are not available, so the downloads happen afterwards in a `forEach` loop over the collected URLs.

```javascript
images.forEach((src, index) => {
    const path = `image_${index}.svg`
    const file = fs.createWriteStream(path)
    https.get(src, function(response) {
        response.pipe(file);
    })
})
```

Inside this `forEach` loop, we are constructing the image name using the index and also the path of the image. We are using a relative path so that the images will be stored in the current directory.

We then create a `file` object by calling the `createWriteStream` method of the fs library. Finally, we use the https library to send a `GET` request to download the image using the image src URL. We also pipe the response that we receive directly to the file stream, which will write it in the current directory.
Once we execute this code, the script will loop through each of the images available on the oxylabs.io website and download them to our current directory.

### Python

Python’s built-in support for file I/O operations makes this task much easier than in Node JS. Similar to the Node JS code, we will first extract the images using Playwright. Just like in our Amazon example, we can use the `query_selector_all` method to extract all the image elements.

After extracting the image elements, we will send a GET request to each image source URL and store the response content in the current directory. The full source code is given below:

```python
from playwright.async_api import async_playwright
import asyncio
import requests


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )
        page = await browser.new_page()
        await page.goto('https://www.oxylabs.io')
        await page.wait_for_timeout(5000)

        all_images = await page.query_selector_all('img')
        images = []
        for i, img in enumerate(all_images):
            image_url = await img.get_attribute("src")
            content = requests.get(image_url).content
            with open("image_{}.svg".format(i), "wb") as f:
                f.write(content)
            images.append(image_url)
        print(images)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
```

## Intercepting HTTP Requests with Playwright

Now, we will explore how to intercept HTTP requests with Playwright. It can be used for advanced web scraping, debugging, testing, and performance optimization. For example, using Playwright we can intercept HTTP requests to abort loading images, customize headers, modify response output, etc. Let’s take a look at the examples below:

### Python

We will define a new function named `handle_route`, which Playwright will invoke to intercept the HTTP requests.
The function will be simple: we will fetch the response, update the title in the HTML, and replace the header to make the `content-type: text/html`.

We will also write a lambda function that prevents images from loading. So, if we execute the script, the website will load without any images, and with both the title and header modified. The code is given below:

```python
from playwright.async_api import async_playwright
import asyncio


async def handle_route(route) -> None:
    # fetch the original response, then modify its title and header
    response = await route.fetch()
    body = await response.text()
    body = body.replace("<title>", "<title>Modified Response")
    await route.fulfill(
        response=response,
        body=body,
        headers={**response.headers, "content-type": "text/html"},
    )


async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(
            proxy={
                'server': "http://pr.oxylabs.io:7777",
                "username": "USERNAME",
                "password": "PASSWORD"
            },
            headless=False
        )
        page = await browser.new_page()
        await page.route("**/*", handle_route)
        # abort image loading; registered last because Playwright runs
        # matching handlers in reverse order of registration
        await page.route("**/*.{png,jpg,jpeg,svg}", lambda route: route.abort())
        await page.goto('https://www.oxylabs.io')
        await page.wait_for_timeout(5000)
        await browser.close()

if __name__ == '__main__':
    asyncio.run(main())
```

Notice that we are using the `route()` method to let Playwright know which function to call when intercepting the requests. It takes two parameters. The first parameter is a glob pattern or regular expression that is matched against the request URL. The second parameter is the handler, given as a function name or a lambda. With the `"**/*.{png,jpg,jpeg,svg}"` glob pattern, we are telling Playwright to match all the URLs that end with the given extensions, i.e., PNG, JPG, JPEG, and SVG.

### Node JS

The same thing can be achieved using Node JS as well. The code is also quite similar to Python.
```javascript
const playwright = require("playwright");

(async () => {
    const launchOptions = {
        headless: false,
        proxy: {
            server: "http://pr.oxylabs.io:7777",
            username: "USERNAME",
            password: "PASSWORD"
        }
    }
    const browser = await playwright["chromium"].launch(launchOptions)
    const context = await browser.newContext()
    const page = await context.newPage()
    await page.route('**/*', async route => {
        const response = await route.fetch();
        let body = await response.text();
        body = body.replace('<title>', '<title>Modified Response: ');
        await route.fulfill({
            response,
            body,
            headers: {
                ...response.headers(),
                'content-type': 'text/html'
            }
        })
    })
    // registered last so that it takes precedence over the handler above
    await page.route(/(png|jpeg|jpg|svg)$/, route => route.abort())
    await page.goto('https://oxylabs.io');
    await browser.close()
})()
```

We are using the `page.route` method to intercept the HTTP requests and modify the response’s title and headers. We are also blocking any images from loading. This can be a handy trick to speed up page loading and improve scraping performance.
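
As an alternative to two separate routes, the image check can live inside a single handler. The sketch below is an assumption-laden illustration rather than code from this repository: `is_image_url` and `block_images` are hypothetical names, and the handler relies only on the `route.request.url`, `route.abort()`, and `route.continue_()` members of Playwright's async Python API.

```python
from urllib.parse import urlparse

# Extensions treated as images, mirroring the glob pattern used above
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".svg")


def is_image_url(url: str) -> bool:
    """Return True when the URL path ends with a known image extension."""
    path = urlparse(url).path.lower()  # ignores any query string or fragment
    return path.endswith(IMAGE_EXTENSIONS)


async def block_images(route) -> None:
    """Route handler: abort image requests and let everything else through."""
    if is_image_url(route.request.url):
        await route.abort()
    else:
        await route.continue_()
```

It would be registered with `await page.route("**/*", block_images)`. Because only the URL path is inspected, image URLs that carry a query string after the extension are still recognized.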

Browser Automation Terminal & CLI Tools
155 Github Stars
how-to-scrape-indeed
Open Source

how-to-scrape-indeed

# How to Scrape Indeed

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=how-to-scrape-indeed-github&transaction_id=102f49063ab94276ae8f116d224b67)

[![](https://dcbadge.vercel.app/api/server/eWsVUJrnG5)](https://discord.gg/Pds3gBmKMH)

Here's the process of extracting job postings from [Indeed](https://www.indeed.com/) with the help of Oxylabs [Web Scraper API](https://oxylabs.io/products/scraper-api/web) (**1-week free trial**) and Python. For the complete guide with in-depth explanations and visuals, check our [blog post](https://oxylabs.io/blog/how-to-scrape-indeed).

## Project setup

### Creating a virtual environment

```sh
python -m venv indeed_env   # Windows
python3 -m venv indeed_env  # Mac and Linux
```

### Activating the virtual environment

```sh
.\indeed_env\Scripts\Activate    # Windows
source indeed_env/bin/activate   # Mac and Linux
```

### Installing libraries

```sh
pip install requests
```

## Overview of Web Scraper API

The following is an example that shows how Web Scraper API works.

```python
# scraper_api_demo.py
import requests

payload = {
    "source": "universal",
    "url": "https://www.indeed.com"
}

response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=("username", "password"),  # replace with your API credentials
)

print(response.json())
```

## Web Scraper API parameters

### Parsing the page title and retrieving results in JSON

```python
"title": {
    "_fns": [
        {
            "_fn": "xpath_one",
            "_args": ["//title/text()"]
        }
    ]
},
```

If you send this as `parsing_instructions`, the output would be the following JSON.

```json
{ "title": "Job Search | Indeed", "parse_status_code": 12000 }
```

Note that a `parse_status_code` of `12000` means a successful response.
The following code prints the title of the Indeed page.

```python
# indeed_title.py
import requests

payload = {
    "source": "universal",
    "url": "https://www.indeed.com",
    "parse": True,
    "parsing_instructions": {
        "title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": ["//title/text()"]
                }
            ]
        }
    },
}

response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=('username', 'password'),
)

print(response.json()['results'][0]['content'])
```

## Scraping Indeed job postings

### Selecting a job listing

```css
.job_seen_beacon
```

### Creating the placeholder for a job listing

```
"job_listings": {
    "_fns": [
        {
            "_fn": "css",
            "_args": [".job_seen_beacon"]
        }
    ],
    "_items": {
        "job_title": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"]
                }
            ]
        },
        "company_name": {
            "_fns": [
                {
                    "_fn": "xpath_one",
                    "_args": [".//span[@data-testid='company-name']/text()"]
                }
            ]
        }
    }
}
```

### Adding other selectors

```json
{
    "source": "universal",
    "url": "https://www.indeed.com/jobs?q=work+from+home&l=San+Francisco%2C+CA",
    "parse": true,
    "parsing_instructions": {
        "job_listings": {
            "_fns": [
                {
                    "_fn": "css",
                    "_args": [".job_seen_beacon"]
                }
            ],
            "_items": {
                "job_title": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [".//h2[contains(@class,'jobTitle')]/a/span/text()"]
                        }
                    ]
                },
                "company_name": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [".//span[@data-testid='company-name']/text()"]
                        }
                    ]
                }
            }
        }
    }
}
```

For other data points, see
the file [here](src/job_search_payload.json).

### Saving the payload as a separate JSON file

```python
# parse_jobs.py
import requests
import json

payload = {}
with open("job_search_payload.json") as f:
    payload = json.load(f)

response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=("username", "password"),
)

print(response.status_code)

with open("result.json", "w") as f:
    json.dump(response.json(), f, indent=4)
```

## Exporting to JSON and CSV

```python
# parse_jobs.py (continued)
import pandas as pd  # requires: pip install pandas

data = response.json()
with open("results.json", "w") as f:
    json.dump(data, f, indent=4)

df = pd.DataFrame(data["results"][0]["content"]["job_listings"])
df.to_csv("job_search_results.csv", index=False)
```

## Final word

Check our [documentation](https://developers.oxylabs.io/scraper-apis/web-scraper-api) for more API parameters and variables found in this tutorial.

If you have any questions, feel free to contact us at [email protected].
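
If pandas is not available, the CSV export can also be sketched with the standard library. The example below is an illustration under stated assumptions: the helper names `extract_job_listings` and `listings_to_csv` are hypothetical, and `sample_response` is made-up data that only mimics the `results[0].content.job_listings` structure used above.

```python
import csv
import io


def extract_job_listings(api_response: dict) -> list:
    """Pull the parsed job listings out of an API response dictionary."""
    return api_response["results"][0]["content"]["job_listings"]


def listings_to_csv(listings: list) -> str:
    """Serialize a list of listing dictionaries to CSV text."""
    fieldnames = []
    for listing in listings:  # collect column names in first-seen order
        for key in listing:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(listings)  # missing keys are written as empty cells
    return buffer.getvalue()


# Made-up response in the same shape as the parsed API output
sample_response = {
    "results": [
        {
            "content": {
                "job_listings": [
                    {"job_title": "Support Agent", "company_name": "Example Co"},
                    {"job_title": "Data Analyst", "company_name": "Sample Inc"},
                ],
                "parse_status_code": 12000,
            }
        }
    ]
}
csv_text = listings_to_csv(extract_job_listings(sample_response))
print(csv_text)
```

The returned text can be written to a file such as `job_search_results.csv` with a regular `open(..., "w", newline="")` call.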

Education & Learning iPaaS & API Integration
148 Github Stars
headless-browser
Open Source

headless-browser

# Headless Browser

[![Headless Browser](https://github.com/oxylabs/headless-browser/blob/main/Github-banner-HB-1532x354.png)](https://oxylabs.io/products/headless-browser?utm_content=oxylabs-headless-browser&groupid=877)

[![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

- [What is Headless Browser?](#what-is-headless-browser)
- [Key Features](#key-features)
  + [Handling CAPTCHA Events](#handling-captcha-events)
  + [Dynamic CAPTCHA Solving](#dynamic-captcha-solving)
  + [Geolocation Targeting](#geolocation-targeting)
  + [Device Type](#device-type)
  + [Session Inspection](#session-inspection)
- [How does Headless Browser Work?](#how-does-headless-browser-work)
  + [Connection Details](#connection-details)
- [How to Optimize Headless Browser Traffic](#how-to-optimize-headless-browser-traffic)
- [Headless Browser vs. Web Scraper API vs. Web Unblocker](#headless-browser-vs-web-scraper-api-vs-web-unblocker)
  + [When to Choose Each](#when-to-choose-each)
- [Common Use Cases](#common-use-cases)
- [Fair Usage Policy](#fair-usage-policy)
- [Learn More](#learn-more)
- [Contact Us](#contact-us)

Welcome to the official repository overview for Oxylabs' [Headless Browser](https://oxylabs.io/products/headless-browser). This guide provides a technical overview of how Headless Browser works, its key features, some common use cases, and how to use it to optimize your web scraping infrastructure.

## What is Headless Browser?

Headless Browser is a cloud-based solution from Oxylabs that lets you run and control remote browser sessions without the complexity of managing them locally or on your own infrastructure.
It provides a seamless way to execute browser-based automation, testing, and web scraping without dealing with browser setup, resource constraints, or detection challenges. Unlike running a local headless browser instance (such as a self-managed Puppeteer or Playwright setup), Oxylabs' Headless Browser runs on dedicated remote servers with integrated residential proxies and built-in stealth features. This means you get a production-ready browsing environment that handles anti-bot detection, CAPTCHA solving, and proxy rotation out of the box. ## Browser Environments Headless Browser offers two specialized browser environments: - [Chrome-based browser](https://developers.oxylabs.io/scraping-solutions/headless-browser/chrome) (`ubc.oxylabs.io`) – high-performance remote browsers with advanced stealth running on dedicated servers with integrated proxies. - [Firefox-based browser (Legacy)](https://developers.oxylabs.io/scraping-solutions/headless-browser/firefox) (`ubs.oxylabs.io`) – Firefox implementation with built-in anti-detection features and proxy integration. (_**Note:** this environment will be deprecated soon._) It works with any library that supports the [Chrome DevTools Protocol (CDP)](https://developer.chrome.com/docs/devtools), including: - [Playwright](https://pypi.org/project/playwright/) (for Firefox, supported Playwright versions are 1.51 and 1.56) - [Puppeteer](https://pypi.org/project/pyppeteer/) - Other CDP-compatible automation frameworks For more information, check out how to use Oxylabs Headless Browser for [Playwright web scraping](https://github.com/oxylabs/playwright-web-scraping), [Puppeteer web scraping](https://github.com/oxylabs/puppeteer-tutorial), and [MCP integration](https://github.com/oxylabs/oxylabs-hb-mcp). 
## Key Features Headless Browser comes packed with features designed to simplify large-scale web scraping and browser automation, such as – ### Handling CAPTCHA Events One of the most valuable features of Headless Browser is its built-in CAPTCHA handling. By default, it automatically detects and solves CAPTCHAs when a page loads. However, for sites that present CAPTCHAs during later interactions (such as form submissions or popup windows), you can monitor the solving process using an event-based approach. The browser extension sends messages to the `window` object that your script can listen for: | Event | Description | |--------|-------------| | `oxylabs-captcha-solve-start` | The CAPTCHA is detected and solving has begun. | | `oxylabs-captcha-solve-end` | The CAPTCHA was successfully solved. | | `oxylabs-captcha-solve-error` | The auto-solver failed to bypass the CAPTCHA. | You subscribe to these events before navigation, then pause your automation until the CAPTCHA is resolved. CAPTCHA solving typically takes up to 30 seconds, depending on the type and complexity. For more information about how to handle CAPTCHA events, see our [documentation](https://developers.oxylabs.io/scraping-solutions/headless-browser/features/handling-captcha-events). ### Dynamic CAPTCHA Solving Some websites display CAPTCHAs not on page load but at later stages – for example, after clicking a submit button or during a specific user interaction. Headless Browser lets you trigger CAPTCHA detection and solving manually at any point during your session. To manually trigger solving, send a message to the window object: ```bash window.postMessage({action: 'solve_captcha', type: '<captcha_type>'}, '*') ``` Supported CAPTCHA types: - `hcaptcha` - `recaptcha` - `turnstile` (Cloudflare CAPTCHA) To trigger reCAPTCHA solving after a form submission, visit our [documentation](https://developers.oxylabs.io/scraping-solutions/headless-browser/features/dynamic-captcha-solving). 
_**Important:** Cloudflare `turnstile` CAPTCHAs require a different approach. You must initiate CAPTCHA detection before performing the action that triggers it, since `turnstile` must be intercepted before it appears on screen._ ### Geolocation Targeting You can specify a geographic location for your Headless Browser session by adding parameters to your connection URL. This is essential for scraping location-specific content, verifying geo-targeted ads, or accessing region-locked pages. Available parameters: - `p_cc` – selects the country using a 2-letter ISO country code (e.g., `US`, `DE`, `FR`). If no country is specified, the system auto-assigns one based on availability. - `p_city` – selects a specific city in lowercase (e.g., `berlin`, `los_angeles`). Requires `p_cc` or `p_state` to also be specified. - `p_state` – selects a US state in lowercase (e.g., `texas`, `ohio`). If both `p_state` and `p_cc` are specified, `p_state` takes priority. To learn more about geolocation targeting in Headless Browser, check out our [documentation](https://developers.oxylabs.io/scraping-solutions/headless-browser/features/geolocation-targeting). ### Device Type Headless Browser can emulate different device types using the `p_device` parameter. This is useful for scraping responsive layouts, mobile-specific content, or device-dependent behavior such as different CAPTCHAs or UI elements. Available values: | Value | Description | |--------|-------------| | Desktop (default) | Emulates a standard desktop browser with a full-size viewport and desktop user-agent string. | | Mobile | Simulates a smartphone with smaller screen resolution, touch capabilities, and a mobile user-agent. | | Tablet | Emulates a tablet device for medium-sized layouts and hybrid interfaces (Chrome only). | **Usage example:** ```bash # Connect with mobile device emulation. 
wss://username:[email protected]?p_device=mobile ``` You can combine device type with geolocation targeting: ```bash # Mobile device targeting Germany. wss://username:[email protected]?p_device=mobile&p_cc=DE ``` ### Session Inspection The Session Inspection tool is a powerful debugging feature that uses VNC (Virtual Network Computing) technology to provide real-time visual access to your browser automation sessions. It creates a secure bridge between your headless browser instance and a graphical interface you can observe and even control. **When to use it:** - Diagnose complex issues that aren't evident from logs or error messages. - Understand page rendering exactly as the browser sees it. - Verify proper execution of your automation workflows. - Troubleshoot unexpected behavior in real time. For more information, check out our [documentation](https://developers.oxylabs.io/scraping-solutions/headless-browser/features/session-inspection) about Session Inspection. Some other features of Headless Browser include: - **Integrated residential proxies –** every session routes through Oxylabs' residential proxy network, eliminating the need for separate proxy management. - **Browser arguments (Chrome) –** customize browser behavior with arguments like `hide-scrollbars`, `disable-notifications`, `force-color-profile`, and more. - **MCP integration –** connect AI systems like Claude Desktop or Cursor to Headless Browser via Model Context Protocol for AI-driven web automation. - **Traffic optimization –** block unnecessary resources (images, stylesheets, fonts) to reduce bandwidth and speed up scraping. ## How does Headless Browser Work? Headless Browser operates through a secure WebSocket (WSS) connection. Your automation script connects to Oxylabs' remote browser infrastructure using a standard WebSocket endpoint, just as you would connect to a local browser instance. 
The key difference is that the browser runs on Oxylabs' dedicated servers with residential proxies, stealth features, and CAPTCHA solving integrated at the infrastructure level. The workflow is straightforward: - **Connect –** your script establishes a WebSocket connection to Oxylabs' endpoint using your credentials. - **Automate –** use Playwright, Puppeteer, or any CDP-compatible library to control the browser exactly as you would a local instance. - **Collect –** extract HTML, take screenshots, capture network logs, or retrieve any data your automation script produces. - **Disconnect –** close the browser session when done. Billing is based on traffic (GB) used. ### Connection details: | Browser | WebSocket Endpoint | US-Optimized Endpoint | |----------------------|---------------------------------|-----------------------| | Chrome | `wss://ubc.oxylabs.io` | `wss://ubc-us.oxylabs.io` | | Firefox (Legacy) | `wss://ubs.oxylabs.io` | `wss://ubs-us.oxylabs.io` | **Authentication:** Credentials are passed directly in the WebSocket URL in the format `wss://username:password@endpoint`. **Rate limits:** Each account is limited to `100` concurrent sessions and can launch up to `10` sessions per second per browser type. Contact [Oxylabs support](mailto:[email protected]) to request higher limits. ## How to Optimize Headless Browser Traffic Since Headless Browser is billed based on traffic (GB), optimizing bandwidth usage directly reduces costs. Automation scripts often download unnecessary resources like images, stylesheets, fonts, and media files that you don't need for data extraction. The most effective optimization is intercepting network requests and blocking non-essential resource types before they download. **Additional optimization tips:** - Block third-party tracking scripts and analytics that don't contribute to your data extraction. - Use the US-optimized endpoints (`ubc-us.oxylabs.io`) if you're based in the US to reduce latency. 
- Keep sessions short – connect, scrape, disconnect. Idle sessions still consume resources. - Combine geolocation targeting with resource blocking for maximum efficiency. ## Headless Browser vs. Web Scraper API vs. Web Unblocker Oxylabs offers three distinct scraping products, each designed for different use cases. Choosing the right one depends on the level of control you need and the complexity of your target websites. | Feature | Web Scraper API | Web Unblocker | Headless Browser | |---------|----------------------|--------------------|-------------| | **Main purpose** | Scrape and parse web data; get structured results (JSON, etc.) | Access and retrieve raw web content while handling anti-bot challenges | Full browser automation; interaction for the toughest targets | | **Request input** | URL (with optional headers and parameters) | URL (with optional headers and parameters) | Automation script (Puppeteer/Playwright commands) | | **Output** | Structured JSON, full HTML, XHR, or Markdown | Full HTML | HTML, screenshots, network logs, any script-extracted data | | **JavaScript rendering** | Yes | Yes | Yes (native – runs JS as a real browser) | | **Ease of use** | Easiest – minimal code, focus on results | Intermediate – you parse/handle HTML yourself | Advanced – you write full automation scripts | | **CAPTCHA handling** | Automatic | Automatic | Automatic + manual trigger support | | **Billing** | Per successful result | Per successful request traffic | Per traffic (GB) used | ### When to choose each: - **[Web Scraper API](https://oxylabs.io/products/scraper-api/web) –** you want structured, ready-to-use data with a single API call. Best for e-commerce, search, and common targets. - **[Web Unblocker](https://oxylabs.io/products/web-unblocker) –** you have an existing scraping pipeline and need a drop-in proxy replacement that handles unblocking automatically. 
- **[Headless Browser](https://oxylabs.io/products/headless-browser) –** you need full browser control for JavaScript-heavy sites, complex interactions, AI-driven automation, or when you need to interact with dynamic page elements (clicks, form fills, scrolling). ## Common Use Cases Headless Browser is the right choice when your task requires real browser interaction rather than simple HTTP requests. Common scenarios include: 1. **JavaScript-heavy websites –** single-page applications (SPAs) and sites that load content dynamically via JavaScript require a real browser engine to render properly. 2. **Complex user interactions –** scraping workflows that involve clicking buttons, filling forms, navigating multi-step processes, or handling infinite scroll. 3. **CAPTCHA-protected sites –** targets that present CAPTCHAs on load or during interaction, including hCaptcha, reCAPTCHA, and Cloudflare Turnstile. 4. **AI-driven web automation –** connecting AI systems (Claude Desktop, Cursor) via MCP to perform web navigation and data retrieval tasks autonomously. 5. **Ad verification and brand safety –** monitoring how ads render across different geolocations and device types, verifying placement and content. 6. **Visual regression testing –** taking screenshots from different geographies and devices to ensure consistent rendering. 7. **Competitive intelligence –** monitoring competitor pricing, product listings, and content changes on dynamic websites. 8. **Session-based workflows –** tasks that require maintaining state across multiple pages (login flows, shopping carts, multi-page forms). ## Fair Usage Policy To ensure service stability and consistent performance for all users, Oxylabs applies the following limits to Headless Browser: - **Concurrent sessions:** `100` per account (per browser type). - **Session launch rate:** Up to `10` sessions per second per browser type. - **Billing:** Based on traffic (GB) consumed during your sessions. 
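
One way to stay inside the limits listed above is to guard session launches on the client side. The following is a minimal sketch, not part of the Oxylabs tooling: the `LaunchLimiter` class and its method names are hypothetical, and the default values simply mirror the `100` concurrent sessions and `10` launches per second quoted in this section. The service applies its own limits regardless of what the client does.

```python
import time
from collections import deque


class LaunchLimiter:
    """Hypothetical client-side guard for the session limits listed above."""

    def __init__(self, max_concurrent=100, per_second=10, clock=time.monotonic):
        self.max_concurrent = max_concurrent
        self.per_second = per_second
        self.clock = clock     # injectable, so the logic can be tested without sleeping
        self.active = 0        # sessions currently open
        self.recent = deque()  # timestamps of launches made within the last second

    def try_launch(self):
        """Record a launch and return True if both limits allow it, else return False."""
        now = self.clock()
        # Forget launches that are at least one second old.
        while self.recent and now - self.recent[0] >= 1.0:
            self.recent.popleft()
        if self.active >= self.max_concurrent or len(self.recent) >= self.per_second:
            return False
        self.recent.append(now)
        self.active += 1
        return True

    def release(self):
        """Free a concurrency slot once a session has been closed."""
        if self.active > 0:
            self.active -= 1


if __name__ == "__main__":
    limiter = LaunchLimiter()
    if limiter.try_launch():
        try:
            pass  # connect to the remote browser and run the automation here
        finally:
            limiter.release()
```

When `try_launch()` returns `False`, the caller can wait briefly and retry instead of opening a session that would exceed the limits.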
To keep your usage efficient and your costs predictable, follow the traffic optimization techniques described earlier in this guide – blocking unnecessary resources alone can reduce bandwidth consumption significantly. ## Learn more For detailed configuration, advanced usage, and multi-language code examples, check these official pages: - [Get started with Headless Browser](https://oxylabs.io/products/headless-browser) - [Headless Browser documentation](https://developers.oxylabs.io/scraping-solutions/headless-browser) - [Troubleshooting guide](https://developers.oxylabs.io/scraping-solutions/headless-browser/troubleshooting-guide) ## Contact us If you have questions or need support, reach out to us at [[email protected]](mailto:[email protected]), or through live chat, accessible via [Oxylabs Dashboard](https://dashboard.oxylabs.io/en/). For enterprise-related inquiries, contact your dedicated account manager.

AI Agents Browser Automation
145 Github Stars
curl-post-requests
Open Source

curl-post-requests

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=curl-post-requests-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) # How to Send Post Requests With cURL - [How to Send Post Requests With cURL](#how-to-send-post-requests-with-curl) * [Sending POST requests](#sending-post-requests) * [Specifying the Content-Type](#specifying-the-content-type) * [Posting JSON](#posting-json) * [Posting XML](#posting-xml) * [Sending a file or multiple files via POST](#sending-a-file-or-multiple-files-via-post) * [Sending authentication credentials](#sending-authentication-credentials) cURL is a powerful command-line tool for transferring data over various network protocols, including HTTP, HTTPS, FTP, and more. Since POST is a request method of HTTP & HTTPS protocols, cURL makes sending POST requests a one-line command that you can [run in your terminal](https://oxylabs.io/resources/integrations/terminal) easily. Find the [full article](https://oxylabs.io/blog/curl-post-requests) on our website. ## Sending POST requests First, install cURL if you haven’t installed it already. You can find the installation instructions in our [How to Use cURL With Proxy](https://oxylabs.io/blog/curl-with-proxy) blog post. 
The basic syntax for sending a POST request using cURL is as follows: ``` curl -X POST -d "Hello" https://example.com/api ``` Notice the ```-X``` flag followed by ```POST```: it tells cURL to make a request using the HTTP protocol’s POST method, while the ```-d``` flag sets the request data to ```Hello``` and sends it to https://example.com/api. The ```-X``` flag is the short form of the command line option ```--request```. ## Specifying the Content-Type Like any HTTP request, POST requests created using cURL can also have custom headers. To specify the ```Content-Type``` header, set it using the header flag. ``` curl -X POST -H "Content-Type: text/plain" -d "Hello" https://example.com/api ``` In the above command, there’s an additional ```-H``` flag which lets users [send custom HTTP request headers via cURL](https://oxylabs.io/blog/curl-send-headers). In this case, by specifying the ```Content-Type``` header as ```text/plain``` you’re letting the web server know that the request body is plain text. ## Posting JSON It’s also possible to send JSON data in the request body. All you need to do is set the appropriate ```Content-Type``` header and pass the JSON data with the ```-d``` flag. cURL will make a POST request with the JSON data specified in the argument. ``` curl -X POST -H "Content-Type: application/json" -d '{"key":"value"}' https://example.com/api ``` If you need a straightforward way to generate JSON from a cURL command, use this [cURL to JSON converter](https://oxylabs.io/tools/curl-converter/json). ## Posting XML Similar to JSON, you can also send XML in the request body. You’ll have to change the ```Content-Type``` request header and set it to ```application/xml```. 
``` curl -X POST -H "Content-Type: application/xml" -d '<?xml version="1.0" encoding="UTF-8"?><root><name>John Doe</name><age>30</age></root>' https://example.com/api ``` ## Sending a file or multiple files via POST To send a file via cURL POST, you’ll have to use the ```-F``` flag. Pay attention to the capitalization of the letter “F”. All of the cURL flags or command line options are case-sensitive. ``` curl -X POST -F "file=@/path/to/img.png" https://example.com/api/upload ``` As you can see, the above command is uploading an image file. Right after the ```-F``` the file path of the image was given. You can also use multiple ```-F``` flags to send multiple files to the server as below: ``` curl -X POST -F "file=@/path/to/img1.png" -F "file=@/path/to/img2.png" https://example.com/api/upload ``` ## Sending authentication credentials You can use the ```-u``` flag or the ```--user``` option to specify the username & password for basic authentication. cURL will automatically create the Authorization header based on your input. ``` curl -u username:password https://example.com/login ``` You’ll have to replace ```username``` and ```password``` with the actual authentication credentials. Also, don’t forget to replace the example URL with your own.
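
To make the last step more concrete, here is a minimal Python sketch of the header value that HTTP basic authentication uses: the username and password are joined with a colon, Base64-encoded, and prefixed with `Basic`. The helper name `basic_auth_header` is an assumption made for illustration and isn't part of cURL.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the Authorization header value for HTTP basic authentication."""
    # Join the credentials with a colon, then Base64-encode the resulting bytes.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Prints: Basic dXNlcm5hbWU6cGFzc3dvcmQ=
    print(basic_auth_header("username", "password"))
```

Base64 is an encoding, not encryption, so basic authentication should only be sent over HTTPS.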

Utilities & System Terminal & CLI Tools
136 Github Stars
automated-web-scraper-autoscraper
Open Source

automated-web-scraper-autoscraper

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=automated-web-scraper-autoscraper-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) # Automated Web Scraper With Python AutoScraper This tutorial will show you how to automate your web scraping processes using AutoScraper – one of the several Python web scraping libraries available. Check out a more detailed tutorial on [our blog](https://oxylabs.io/blog/automated-web-scraper-autoscraper). * [Methods to install AutoScraper](#methods-to-install-autoscraper) * [Scraping products with AutoScraper](#scraping-products-with-autoscraper) + [Scraping product category URLs](#scraping-product-category-urls) + [Scraping product information from a single webpage](#scraping-product-information-from-a-single-webpage) + [Scraping all the products on a specific category](#scraping-all-the-products-on-a-specific-category) * [How to use AutoScraper with proxies](#how-to-use-autoscraper-with-proxies) * [Saving and loading an AutoScraper model](#saving-and-loading-an-autoscraper-model) ## Methods to install AutoScraper First things first, let’s install the AutoScraper library. 
There are actually several ways to install and use this library, but for this tutorial, we’re going to use the Python package index (PyPI) repository using the following pip command: ```pip install autoscraper``` ## Scraping products with AutoScraper This section showcases an example to auto scrape public data with the AutoScraper module in Python using the [Oxylabs Scraping Sandbox](https://sandbox.oxylabs.io/products?_gl=1*10z7xfg*_gcl_au*NzUxMjYzNDQ4LjE3MjY1NzMwNjc.) website as a target. The target website has three thousand products in different categories. ### Scraping product category URLs Now, if you want to scrape the links to the category pages, you can do it with the following trivial code: ``` from autoscraper import AutoScraper UrlToScrape = "https://sandbox.oxylabs.io/products" WantedList = [ "https://sandbox.oxylabs.io/products/category/nintendo", "https://sandbox.oxylabs.io/products/category/dreamcast" ] Scraper = AutoScraper() data = Scraper.build(UrlToScrape, wanted_list=WantedList) print(data) ``` Note that the Oxylabs Sandbox uses JavaScript to load some elements dynamically, such as the category buttons. Since AutoScraper doesn’t support JavaScript rendering, you won’t be able to scrape all category links. For instance, you can access the “Xbox platform” category but not the subcategories inside it. With that in mind, the code above first imports AutoScraper from the autoscraper library. Then, we provide the URL from which we want to scrape the information in the ```UrlToScrape```. The ```WantedList``` assigns sample data that we want to scrape from the given subject URL. To get the category page links from the target page, you need to provide two example URLs to the ```WantedList```. One link is a data sample of a JavaScript-rendered category button, while another link is a data sample of a static category button that doesn’t have any subcategories. Try running the code with only one category URL in the ```WantedList``` to see the difference. 
The AutoScraper() creates an AutoScraper object to initiate different functions of the autoscraper library. The Scraper.build() method scrapes the data similar to the ```WantedList``` from the target URL. After executing the Python script above, the ```data``` list will have the category page links available at https://sandbox.oxylabs.io/products. The output of the script should look like this: ```['https://sandbox.oxylabs.io/products/category/nintendo', 'https://sandbox.oxylabs.io/products/category/xbox-platform', 'https://sandbox.oxylabs.io/products/category/playstation-platform', 'https://sandbox.oxylabs.io/products/category/dreamcast', 'https://sandbox.oxylabs.io/products/category/pc', 'https://sandbox.oxylabs.io/products/category/stadia']``` ### Scraping product information from a single webpage Say that we want to get the title of the product along with its price; we can train and build an AutoScraper model as follows: ``` from autoscraper import AutoScraper UrlToScrape = "https://sandbox.oxylabs.io/products/3" WantedList = ["Super Mario Galaxy 2", "91,99 €"] InfoScraper = AutoScraper() InfoScraper.build(UrlToScrape, wanted_list=WantedList) ``` The script above feeds a URL of a product page and a sample of required information from that page to the AutoScraper model. The ```build()``` method learns the rules for scraping information and preparing our ```InfoScraper``` for future use. Now, let’s apply this ```InfoScraper``` tactic to a different product’s URL and see if it returns the desired information. ``` another_product_url = "https://sandbox.oxylabs.io/products/39" data = InfoScraper.get_result_similar(another_product_url) print(data) ``` Output: ```['Super Mario 64', '91,99 €']``` The script above applies ```InfoScraper``` to ```another_product_url``` and prints the ```data```. Depending on the target website you want to scrape, you may want to use the ```get_result_exact()``` function instead of ```get_result_similar()```. 
This should ensure that AutoScraper returns an accurate product title and price as defined by the ```WantedList```. Additionally, it’s very important to provide a ```UrlToScrape``` that doesn’t have duplicate data that may match some unwanted elements. Consider this example: ``` from autoscraper import AutoScraper UrlToScrape = "https://sandbox.oxylabs.io/products/1" WantedList = ["The Legend of Zelda: Ocarina of Time", "91,99 €"] InfoScraper = AutoScraper() InfoScraper.build(UrlToScrape, wanted_list=WantedList) another_product_url = "https://sandbox.oxylabs.io/products/39" data = InfoScraper.get_result_exact(another_product_url) print(data) ``` Here, the ```UrlToScrape``` has the price of ```91,99 €``` twice on the page, as highlighted in the screenshot: ![](screenshot_1.png) Hence, the code also matches the unwanted element with the price of ```91,99 €``` and additionally returns the price of a related product like this: ```['Super Mario 64', '87,99 €', '91,99 €']``` One way to solve this problem is to use the ```grouped=True``` parameter to return the data points with their corresponding AutoScraper rule names. Next, use the ```keep_rules()``` function and pass the rules you want to keep. 
In our code, we have to turn the data ```dictionary``` into a ```list``` to access and pass over the first and the last rule, containing accurate product title and price: ``` from autoscraper import AutoScraper UrlToScrape = "https://sandbox.oxylabs.io/products/1" WantedList = ["The Legend of Zelda: Ocarina of Time", "91,99 €"] InfoScraper = AutoScraper() InfoScraper.build(UrlToScrape, wanted_list=WantedList) another_product_url = "https://sandbox.oxylabs.io/products/39" data = InfoScraper.get_result_exact(another_product_url, grouped=True) print(data) print() InfoScraper.keep_rules([list(data)[0], list(data)[-1]]) filtered_data = InfoScraper.get_result_exact(another_product_url) print(filtered_data) ``` Please note that this method may return incorrect results if the actual price of the product isn’t the last item returned by the ```data``` object. ### Scraping all the products on a specific category Install the [pandas](https://pypi.org/project/pandas/) and [openpyxl](https://pypi.org/project/openpyxl/) libraries via the terminal, which we’ll use to save the data to an Excel file: ```pip install pandas openpyxl``` Then, use the following Python script: ``` #ProductsByCategoryScraper.py from autoscraper import AutoScraper import pandas as pd #ProductUrlScraper section Playstation_5_Category = "https://sandbox.oxylabs.io/products/category/playstation-platform/playstation-5" WantedList = ["https://sandbox.oxylabs.io/products/246"] Product_Url_Scraper = AutoScraper() Product_Url_Scraper.build(Playstation_5_Category, wanted_list=WantedList) #ProductInfoScraper section Product_Page_Url = "https://sandbox.oxylabs.io/products/246" WantedList = ["Ratchet & Clank: Rift Apart", "87,99 €"] Product_Info_Scraper = AutoScraper() Product_Info_Scraper.build(Product_Page_Url, wanted_list=WantedList) #Scraping info of each product and storing into an Excel file Products_Url_List = Product_Url_Scraper.get_result_similar(Playstation_5_Category) Products_Info_List = [] for Url in 
Products_Url_List: product_info = Product_Info_Scraper.get_result_exact(Url) Products_Info_List.append(product_info) df = pd.DataFrame(Products_Info_List, columns =["Title", "Price"]) df.to_excel("products_playstation_5.xlsx", index=False) ``` The script above has three main constituents: two sections for building the scrapers and the third one to scrape data from all the products in the Playstation 5 category and save it as an Excel file. For this step, we’ve built ```Product_Url_Scraper``` to scrape all the similar product links on the [Playstation 5 Category](https://sandbox.oxylabs.io/products/category/playstation-platform/playstation-5?_gl=1*1frsqtt*_gcl_au*NzUxMjYzNDQ4LjE3MjY1NzMwNjc.) page. These thirteen links are stored in the ```Products_Url_List```. Now, for each URL in the ```Products_Url_List```, we apply the ```Product_Info_Scraper``` and append the scraped information to the ```Products_Info_List```. Finally, the ```Products_Info_List``` is converted to a data frame and then exported as an Excel file for future use. Output: ![](screenshot_2.png) The output reflects achieving the initial goal – scraping titles and prices of all the products in the Playstation 5 category. Now, we know how to use a combination of multiple AutoScraper models to scrape data in bulk. You can re-formulate the script above to scrape all the products from all the categories and save them in different Excel files for each category. ## How to use AutoScraper with proxies The ```build```, ```get_result_similar```, and ```get_result_exact``` functions of AutoScraper accept request-related arguments in the request_args parameter. 
Here’s what testing and using AutoScraper with proxy IPs looks like: ``` from autoscraper import AutoScraper UrlToScrape = "https://ip.oxylabs.io/" WantedList = ["YOUR_REAL_IP_ADDRESS"] proxy = { "http": "proxy_endpoint", "https": "proxy_endpoint", } InfoScraper = AutoScraper() InfoScraper.build(UrlToScrape, wanted_list=WantedList) data = InfoScraper.get_result_similar(UrlToScrape, request_args={"proxies": proxy}) print(data) ``` Visit https://ip.oxylabs.io/, copy the displayed IP address, and paste it instead of YOUR_REAL_IP_ADDRESS in the ```WantedList```. This information will be used to tell AutoScraper what kind of data to look for. The proxy_endpoint refers to the address of a [proxy server](https://oxylabs.io/products/paid-proxy-servers) in the correct format (e.g., ```http://customer-USERNAME:[email protected]:7777```). The script above should work fine when proper proxy endpoints are added to the proxy dictionary. Every time you run the code, it outputs a proxy IP address that should be different from your actual IP. After you’ve successfully tested the proxy server connection, you can then use the [proxies](https://oxylabs.io/products/residential-proxy-pool) with the initial request like so: ``` from autoscraper import AutoScraper UrlToScrape = "https://sandbox.oxylabs.io/products/3" WantedList = ["Super Mario Galaxy 2", "91,99 €"] proxy = { "http": "proxy_endpoint", "https": "proxy_endpoint", } InfoScraper = AutoScraper() InfoScraper.build(UrlToScrape, wanted_list=WantedList, request_args={"proxies": proxy}) # Remaining code... ``` ## Saving and loading an AutoScraper model AutoScraper provides the ability to save and load a pre-trained scraper. We can use the following script to save the InfoScraper object to a file: ```InfoScraper.save("file_name")``` Similarly, we can load a scraper using: ``` SavedScraper = AutoScraper() SavedScraper.load("file_name") ```

Utilities & System Browser Automation
121 Github Stars
bypass-captcha-puppeteer
Open Source

bypass-captcha-puppeteer

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=bypass-captcha-puppeteer-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

# How to Bypass CAPTCHA With Puppeteer

- [How to Bypass CAPTCHA With Puppeteer](#how-to-bypass-captcha-with-puppeteer)
  * [Using Puppeteer-stealth to bypass CAPTCHA](#using-puppeteer-stealth-to-bypass-captcha)
  * [Using Web Unblocker with Node.JS](#using-web-unblocker-with-nodejs)

To access protected websites, you must bypass CAPTCHA. Puppeteer, a Node.js library with a user-friendly API for managing Chrome/Chromium via the DevTools Protocol, can help. It can run in full-browser mode instead of headless mode.

Well, why isn’t Puppeteer enough? Automated access using Puppeteer often triggers CAPTCHA or blocks as websites detect the automation. Let’s validate it using the following steps:

### 1. You must have Node.JS installed on your system.

Create a new Node.JS project and install Puppeteer using the following `npm` command:

```npm i puppeteer```

### 2. Import the Puppeteer library in your Node.JS file.

```const puppeteer = require('puppeteer');```

### 3. Create a new browser instance in headless mode and a new page using the following code:

```
(async () => {
  // Create a browser instance
  const browserObj = await puppeteer.launch();

  // Create a new page
  const newpage = await browserObj.newPage();
```

### 4. Since we need to take the screenshot on the desktop device, we can set the viewport size using the following code:

```
  // Set the width and height of viewport
  await newpage.setViewport({ width: 1920, height: 1080 });
```

The `setViewport()` method sets the size of the page's viewport. You can change it according to your device requirements.

### 5. After that, navigate to a page URL (that you think is a CAPTCHA-protected page) and take a screenshot.

For demonstration purposes, the code uses Oxylabs [scraping sandbox](https://sandbox.oxylabs.io/products). Remember to close the browser object at the end.

```
  const url = 'https://sandbox.oxylabs.io/products';

  // Open the required URL in the newpage object
  await newpage.goto(url);
  await newpage.waitForNetworkIdle(); // Wait for network resources to fully load

  // Capture screenshot
  await newpage.screenshot({
    path: 'screenshot.png',
  });

  // Close the browser object
  await browserObj.close();
})();
```

This is what the complete code looks like:

```
const puppeteer = require('puppeteer');

(async () => {
  const browserObj = await puppeteer.launch();
  const newpage = await browserObj.newPage();

  await newpage.setViewport({ width: 1920, height: 1080 });

  const url = 'https://sandbox.oxylabs.io/products';

  await newpage.goto(url);
  await newpage.waitForNetworkIdle();

  await newpage.screenshot({
    path: 'screenshot.png',
  });

  await browserObj.close();
})();
```

## Using Puppeteer-stealth to bypass CAPTCHA

Here is the step-by-step procedure to implement this CAPTCHA bypass:

### 1. To start, you need to install the `puppeteer-extra` and `puppeteer-extra-plugin-stealth` packages.

```
npm install puppeteer-extra-plugin-stealth puppeteer-extra
```

### 2. After that, import the following required libraries in your Node.JS file:

```
const puppeteerExtra = require('puppeteer-extra');
const Stealth = require('puppeteer-extra-plugin-stealth');

puppeteerExtra.use(Stealth());
```

### 3. The next step is to create the browser object in headless mode, navigate to the URL and take a screenshot.

```
(async () => {
  const browserObj = await puppeteerExtra.launch();
  const newpage = await browserObj.newPage();

  await newpage.setViewport({ width: 1920, height: 1080 });

  await newpage.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
  );

  await newpage.goto('https://sandbox.oxylabs.io/products');
  await newpage.waitForNetworkIdle(); // Wait for network resources to fully load

  await newpage.screenshot({ path: 'screenshot_stealth.png' });

  await browserObj.close();
})();
```

The `setUserAgent` method makes our requests imitate a real browser's User-Agent, making our automated headless browsers appear more like regular users. Setting one of the common User-Agent strings helps evade detection and bypass anti-bot mechanisms that analyze the User-Agent header.

Here is what our complete script looks like:

```
const puppeteerExtra = require('puppeteer-extra');
const Stealth = require('puppeteer-extra-plugin-stealth');

puppeteerExtra.use(Stealth());

(async () => {
  const browserObj = await puppeteerExtra.launch();
  const newpage = await browserObj.newPage();

  await newpage.setViewport({ width: 1920, height: 1080 });

  await newpage.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
  );

  await newpage.goto('https://sandbox.oxylabs.io/products');
  await newpage.waitForNetworkIdle(); // Wait for network resources to fully load

  await newpage.screenshot({ path: 'screenshot_stealth.png' });

  await browserObj.close();
})();
```

## Using Web Unblocker with Node.JS

Web Unblocker uses AI to help users avoid CAPTCHAs and gain access to public data from websites with advanced anti-bot systems. To begin, you can send a basic query without any special options – the Web Unblocker tool will select the fastest CAPTCHA proxy, add all necessary headers, and provide you with the response body.

### 1. Install the node-fetch and HttpsProxyAgent using the following command:

```npm install node-fetch https-proxy-agent```

### 2. [Sign up to Oxylabs](https://dashboard.oxylabs.io/en/) and get your credentials for using the API.

### 3. Before importing the libraries, open the package.json file and add the line `"type": "module"`, for example:

```
{
  "type": "module",
  "dependencies": {
    "https-proxy-agent": "^7.0.4",
    "node-fetch": "^3.3.2",
    "puppeteer": "^22.6.5",
    "puppeteer-extra": "^3.3.6",
    "puppeteer-extra-plugin-stealth": "^2.11.2"
  }
}
```

Since the newest version of `node-fetch` is an ESM-only module, you can’t import it using the `require()` function. Learn more about it [here](https://www.npmjs.com/package/node-fetch#installation).

Next, import the required modules in your JS file using the `import-from` syntax:

```
import fetch from 'node-fetch';
import HttpsProxyAgent from 'https-proxy-agent';
import fs from 'fs';
```

The `fs` library can help save the response in an HTML file.

### 4. Provide your user credentials and set up a proxy using `HttpsProxyAgent`.

```
const username = '<Your-username>';
const password = '<Your-password>';

(async () => {
  const agent = new HttpsProxyAgent.HttpsProxyAgent(
    `http://${username}:${password}@unblock.oxylabs.io:60000`
  );
```

### 5. Next, set the URL and issue a fetch request.

```
  // Ignore the certificate
  process.env['NODE_TLS_REJECT_UNAUTHORIZED'] = 0;

  const response = await fetch('https://ip.oxylabs.io/', {
    method: 'get',
    agent: agent,
  });
```

The environment variable `NODE_TLS_REJECT_UNAUTHORIZED` is set to zero so that Node.JS doesn't verify the SSL/TLS certificates. This is a required setting if you’re using Oxylabs’ Web Unblocker.

### 6. In the end, you can convert the response into text and save it in an HTML file.

```
  const resp = await response.text();

  fs.writeFile('result.html', resp.toString(), (err) => {
    if (err) throw err;
    console.log('Result saved to result.html');
  });
})();
```

Here is the complete script:

```
import fetch from 'node-fetch';
import HttpsProxyAgent from 'https-proxy-agent';
import fs from 'fs';

const username = '<Your-username>';
const password = '<Your-password>';

(async () => {
  const agent = new HttpsProxyAgent.HttpsProxyAgent(
    `http://${username}:${password}@unblock.oxylabs.io:60000`
  );

  // Ignore the certificate
  process.env['NODE_TLS_REJECT_UNAUTHORIZED'] = 0;

  const response = await fetch('https://ip.oxylabs.io/', {
    method: 'get',
    agent: agent,
  });

  const resp = await response.text();

  fs.writeFile('result.html', resp.toString(), (err) => {
    if (err) throw err;
    console.log('Result saved to result.html');
  });
})();
```

Browser Automation
117 Github Stars
chatgpt-web-scraping
Open Source

chatgpt-web-scraping

# How to Use ChatGPT for Web Scraping in 2025

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=chatgpt-web-scraping-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

- [1. Create a ChatGPT Account](#1-create-a-chatgpt-account)
- [2. Locate the elements to scrape](#2-locate-the-elements-to-scrape)
- [3. Prepare the ChatGPT prompt](#3-prepare-the-chatgpt-prompt)
- [4. Review the code](#4-review-the-code)
- [5. Execute and test](#5-execute-and-test)
- [Tips and tricks for using ChatGPT](#tips-and-tricks-for-using-chatgpt)
  * [1. Get code editing assistance](#1-get-code-editing-assistance)
  * [2. Check for errors](#2-check-for-errors)
  * [3. Code Optimization Assistance](#3-code-optimization-assistance)
  * [4. Handle dynamic content](#4-handle-dynamic-content)
- [Overcome web scraping blocks with a dedicated API](#overcome-web-scraping-blocks-with-a-dedicated-api)

Follow this article to learn how to use [ChatGPT](https://chat.openai.com/) for developing fully functional Python web scrapers. You'll also find some important tips and tricks to improve the quality of a scraper’s code.

Before moving to the actual topic, let’s briefly introduce our demo target for this tutorial. We'll extract data from the [Oxylabs Scraping Sandbox](https://sandbox.oxylabs.io/products), a dummy e-commerce store that maintains video game listings in several categories. Here's what the landing page of the store looks like:

![](/images/sandbox.png)

Now, let’s delve into the steps required to scrape data from this webpage using ChatGPT.

## 1. Create a ChatGPT Account

Visit ChatGPT’s [login page](https://chat.openai.com/auth/login) and hit Sign-up. You also have the option to sign up using your Google account. On successful sign-up, you will be redirected to the chat window. You can initiate a chat by entering your query in the text field.

## 2. Locate the elements to scrape

Before prompting ChatGPT, let’s first locate the elements we need to extract from the target page. Assume that we need only the video game **titles** and **prices**.

- Right-click one of the game titles and select `Inspect`. This will open the HTML code for this element in the Developer Tools window.
- In Developer Tools, right-click the element that contains the game title and select `Copy selector`.

The following figure explains it all:

![](/images/sandbox_dev_tools.png)

Write down the selector and repeat the same steps to find the selector for the price element.

## 3. Prepare the ChatGPT prompt

The prompt should be well-explained, specifying the code’s programming language, tools and libraries to be used, element selectors, output, and any special instructions the code must comply with.

Here's a sample prompt that you can use to create a web scraper using Python and BeautifulSoup:

```markdown
Write a web scraper using Python and BeautifulSoup.

Sample Target: https://sandbox.oxylabs.io/products

Rationale: Scrape the video game titles and prices of all the games on the target page.

CSS selectors are as follows:

1. Title: #__next > main > div > div > div > div:nth-child(2) > div > div:nth-child(1) > a.card-header.css-o171kl.eag3qlw2 > h4
2. Price: #__next > main > div > div > div > div:nth-child(2) > div > div:nth-child(1) > div.price-wrapper.css-li4v8k.eag3qlw4

Output: Save all the Titles and Prices for all the video games in a CSV file

Additional Instructions: Handle character encoding and remove undesirable symbols in the output CSV.
```

Notice we have provided CSS selectors for prices and titles that we copied in the earlier step.

> [!TIP]
> The scraped data might contain a few odd characters if not handled properly due to encoding issues. It happens when the web page and the Python script interpret the character encoding differently. Therefore, we can add relevant Additional Instructions to avoid these encoding issues.

## 4. Review the code

Once ChatGPT replies with code, it’s always a good idea to review it first. Skim the code to check that it does what you want it to do. Double-check whether it uses any libraries or packages you may wish to exclude. If you find any glitches, write back to ChatGPT and ask it to generate the code again.

## 5. Execute and test

Copy the ChatGPT-generated code and check that it executes correctly. Here's the code that ChatGPT generated for us:

```python
import requests
from bs4 import BeautifulSoup
import csv

# URL of the target page
url = "https://sandbox.oxylabs.io/products"

# Send a GET request to the URL
response = requests.get(url)

# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")

# CSS selectors for title and price
title_selector = "a.card-header h4"
price_selector = "div.price-wrapper"

# Find all game titles and prices
titles = soup.select(title_selector)
prices = soup.select(price_selector)

# Create a list to store the data
data = []

# Iterate through the titles and prices
for title, price in zip(titles, prices):
    game_title = title.get_text(strip=True)
    game_price = price.get_text(strip=True)
    data.append((game_title, game_price))

# Define the CSV filename
filename = "game_data.csv"

# Write the data to a CSV file
with open(filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price"])  # Write header
    for game in data:
        writer.writerow(game)

print(f"Data scraped successfully and saved to '{filename}'.")
```

> [!NOTE]
> Remember to install the `BeautifulSoup` library and `requests` library before running the code. This can be done by opening the terminal and running:

```bash
pip install requests beautifulsoup4
```

Here's the snippet of the output CSV file generated after executing the code:

![](/images/scraped_csv.png)

Congratulations! You've just scraped the target website. For your convenience, we also prepared this tutorial in a [video format](https://www.youtube.com/watch?v=AUEjBzLJlE4).

## Tips and tricks for using ChatGPT

### 1. Get code editing assistance

Specify the changes you want to make, such as modifying the scraped elements, boosting the effectiveness of the code, or modifying the data extraction procedure. ChatGPT can offer you additional code options or suggest modifications to improve the web scraping process.

### 2. Check for errors

To adhere to coding standards and practices, you can ask ChatGPT to review the code and provide recommendations. You can even paste your code and ask ChatGPT to lint it. You can do so by adding the “lint the code” phrase in the additional instructions of the prompt.

### 3. Code Optimization Assistance

When it comes to web scraping, efficiency is critical, especially when working with large datasets or challenging web scraping tasks. ChatGPT can provide tips on how to increase the performance of your code. You can ask for advice on how to use frameworks and packages that speed up web scraping, use caching techniques, exploit concurrency or parallel processing, and minimize unnecessary network calls.

### 4. Handle dynamic content

Certain websites produce dynamic content using JavaScript libraries or use AJAX requests to produce the content. ChatGPT can help you navigate such complex web content. You can ask ChatGPT about techniques for getting the dynamic content from such JavaScript-rendered pages. ChatGPT can offer suggestions on using headless browsers, parsing dynamic HTML, or even automating interactions using simulated user actions.

## Overcome web scraping blocks with a dedicated API

Be aware that there are some limitations to using ChatGPT for web scraping. Many websites have implemented strong security measures to block automated scrapers from accessing the sites. Commonly, sites use CAPTCHAs and request rate-limiting to prevent automated scraping. As a result, simple **ChatGPT-generated scrapers may fail** at these sites.

However, [Web Unblocker](https://oxylabs.io/products/web-unblocker) by Oxylabs can help in these scenarios. It's a **paid proxy solution** which you can test using a **1-week free trial** by registering a free account on the [dashboard](https://dashboard.oxylabs.io/). Web Unblocker provides features such as rotating proxies, bypassing CAPTCHAs, managing requests, built-in Custom Browser Instructions, etc. Such measures can help minimize the chances of triggering automated bot detection.
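As a follow-up to the "Additional Instructions" in the sample prompt (handling character encoding and removing undesirable symbols), here is a minimal sketch of the kind of cleanup you might ask ChatGPT to add before writing rows to the CSV. The helper names `clean_text` and `parse_price` are hypothetical, and the price parser assumes a simple format such as `91,99 €` with no thousands separators.

```python
import re
import unicodedata


def clean_text(value):
    """Normalize Unicode and collapse whitespace in a scraped string."""
    # NFKC maps non-breaking spaces and similar compatibility characters
    # to their plain equivalents.
    normalized = unicodedata.normalize("NFKC", value)
    return re.sub(r"\s+", " ", normalized).strip()


def parse_price(value):
    """Convert a price such as '91,99 €' to a float, or return None."""
    match = re.search(r"\d+(?:[.,]\d+)?", clean_text(value))
    if match is None:
        return None
    # Assumption: a comma is a decimal separator, not a thousands separator.
    return float(match.group().replace(",", "."))


print(clean_text("  Super\u00a0Mario Galaxy 2\n"))
print(parse_price("91,99\u00a0€"))
```

Storing prices as numbers rather than raw strings also makes the CSV easier to sort and filter later.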

Browser Automation
113 Github Stars
web-scraping-tutorials
Open Source

web-scraping-tutorials

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=web-scraping-tutorials-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

Education & Learning
83 Github Stars
Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup
Open Source

Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup

# Scraping Dynamic JavaScript / Ajax Websites With BeautifulSoup: A Complete Tutorial

[![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=scraping-dynamic-javascript-ajax-websites-with-beautifulsoup-github&transaction_id=102f49063ab94276ae8f116d224b67)

[<img src="https://img.shields.io/static/v1?label=&message=Python&color=blueviolet" />](https://github.com/topics/python) [<img src="https://img.shields.io/static/v1?label=&message=JavaScript&color=blue" />](https://github.com/topics/javascript) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs)

## Table of contents

- [Revisiting BeautifulSoup and Requests](#revisiting-beautifulsoup-and-requests)
- [Is This Website Dynamic or Static?](#is-this-website-dynamic-or-static)
- [Can BeautifulSoup Render `JavaScript`?](#can-beautifulsoup-render-javascript)
- [Scraping Dynamic Web Pages With Selenium](#scraping-dynamic-web-pages-with-selenium)
- [Finding Elements Using Selenium](#finding-elements-using-selenium)
- [Finding Elements Using BeautifulSoup](#finding-elements-using-beautifulsoup)
- [Headless browser](#headless-browser)
- [Web Scraping Dynamic Sites by Locating AJAX Calls](#web-scraping-dynamic-sites-by-locating-ajax-calls)
- [Data Embedded In the Same Page](#data-embedded-in-the-same-page)
- [Data In Other Pages](#data-in-other-pages)

Scraping most websites is comparatively easy. This topic is already covered at length in [this tutorial](https://github.com/oxylabs/Python-Web-Scraping-Tutorial). There are many sites, however, that cannot be scraped using the same method. The reason is that these sites load the content dynamically using JavaScript.

This technique is also known as AJAX (Asynchronous JavaScript and XML). Historically, it involved creating an `XMLHttpRequest` object to retrieve XML from a web server without reloading the whole page. These days, this object is rarely used directly. Usually, a wrapper like jQuery is used to retrieve content such as JSON, partial HTML, or even images.

## Revisiting BeautifulSoup and Requests

To scrape a regular web page, at least two libraries are required. The `requests` library downloads the page. Once this page is available as an HTML string, the next step is parsing this as a BeautifulSoup object. This BeautifulSoup object can then be used to find specific data.

Here is a simple example script that prints the text inside the first `small` element with `class` set to `author`.

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://quotes.toscrape.com/")
bs = BeautifulSoup(response.text, "lxml")

author = bs.find("small", class_="author")
if author:
    print(author.text)

## OUTPUT
# Albert Einstein
```

Note that we are working with version 4 of the Beautiful Soup library. Earlier versions are discontinued. You may see Beautiful Soup 4 written as just Beautiful Soup, BeautifulSoup, or even bs4. They all refer to the same Beautiful Soup 4 library.

The same code will not work if the site is dynamic. For example, the same site has a dynamic version at `https://quotes.toscrape.com/js/` (note *js* at the end of this URL).

```python
response = requests.get("https://quotes.toscrape.com/js")  # dynamic web page
bs = BeautifulSoup(response.text, "lxml")

author = bs.find("small", class_="author")
if author:
    print(author.text)

## No output
```

The reason is that the second site is dynamic: its data is generated using `JavaScript`.

There are two ways to handle sites like this.

- Using a tool like Selenium or Puppeteer to open a real browser to render the dynamic web page
- Identifying the AJAX links that contain the data, and working with those directly.

These two approaches are covered at length in this tutorial. However, first, we need to understand how to determine if a site is dynamic.

## Is This Website Dynamic or Static?

Here is the easiest way to determine if a website is dynamic using Chrome or Edge. (Both of these browsers use Chromium under the hood).

Open Developer Tools by pressing the `F12` key. Ensure that the focus is on Developer tools and press the `CTRL+SHIFT+P` key combination to open Command Menu.

![Command Menu](images/command_menu.png)

It will show a lot of commands. Start typing `disable` and the commands will be filtered to show `Disable JavaScript`. Select this option to disable `JavaScript`.

Now reload this page by pressing `Ctrl+R` or `F5`. The page will reload. If this is a dynamic site, a lot of the content will disappear:

![Example of Dynamic Site with No JavaScript](images/dynamic_site_no_js.png)

In some cases, the sites will still show the data but will fall back to basic functionality. For example, this site has an infinite scroll. If JavaScript is disabled, it shows regular pagination.

| ![With JavaScript](images/infinite_scroll.png) | ![Without JavaScript](images/infinite_scroll_no_js.png) |
| ---------------------------------------------- | ------------------------------------------------------- |
| JavaScript Enabled                             | JavaScript Disabled                                     |

The next question that needs to be answered is what BeautifulSoup is capable of.

## Can BeautifulSoup Render `JavaScript`?

The short answer is no.

It is important to understand terms like parsing and rendering. Parsing is simply converting a string representation into an actual object, here an HTML string into a Python object.

So what is rendering? Rendering is essentially interpreting HTML, JavaScript, CSS, and images into something that we see in the browser.

Beautiful Soup is a Python library for pulling data out of HTML files. This involves parsing an HTML string into the BeautifulSoup object. For parsing, we first need the HTML as a string. Dynamic websites do not have the data in the HTML directly. It means that BeautifulSoup cannot work with dynamic websites on its own.

The Selenium library can automate loading and rendering websites in a browser like Chrome or Firefox. Even though Selenium supports pulling data out of HTML, it is possible to extract the complete HTML and use Beautiful Soup instead to extract the data.

Let's begin dynamic web scraping with Python using Selenium first.

## Scraping Dynamic Web Pages With Selenium

Installing Selenium involves installing three things:

1. The browser of your choice (which you already have):
   - Chrome, Firefox, Edge, Internet Explorer, Safari, and Opera browsers are supported. In this tutorial, we will be using Chrome.
2. The driver for your browser:
   - The driver for Chrome can be downloaded from [this page](https://chromedriver.chromium.org/downloads). Download the zip file containing the driver and unzip it. Make a note of this path.
   - Visit [this link](https://www.selenium.dev/documentation/en/webdriver/driver_requirements/#quick-reference) for information about drivers for other browsers.
3. Python Selenium Package:
   - This package can be installed using the pip command:

     ```shell
     pip install selenium
     ```

   - If you are using Anaconda, this can be installed from the `conda-forge` channel.

     ```shell
     conda install -c conda-forge selenium
     ```

The scripts below also use the `webdriver-manager` package, which downloads a matching driver automatically; install it with `pip install webdriver-manager`.

The basic skeleton of the Python script to launch a browser, load the page, and then close the browser is simple:

```python
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 expects the driver path to be wrapped in a Service object
driver = Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://quotes.toscrape.com/js/')
#
# Code to read data from HTML here
#
driver.quit()
```

Now that we can load the page in the browser, let's look into extracting specific elements. There are two ways to extract elements: Selenium and Beautiful Soup.

### Finding Elements Using Selenium

Our objective in this example is to find the author element.

Load the site `https://quotes.toscrape.com/js/` in Chrome, right-click the author name, and click Inspect. This should load Developer Tools with the author element highlighted as follows:

![](images/author_markup.png)

This is a `small` element with its `class` attribute set to `author`.

```html
<small class="author">Albert Einstein</small>
```

Selenium offers various ways to locate HTML elements. These methods are part of the driver object. Some of the locators that can be useful here are as follows:

```python
from selenium.webdriver.common.by import By

element = driver.find_element(By.CLASS_NAME, "author")
element = driver.find_element(By.TAG_NAME, "small")
```

There are a few other locators that may be useful in other scenarios:

```python
element = driver.find_element(By.ID, "abc")
element = driver.find_element(By.LINK_TEXT, "abc")
element = driver.find_element(By.XPATH, "//abc")
element = driver.find_element(By.CSS_SELECTOR, ".abc")
```

Perhaps the most useful locators are `find_element(By.CSS_SELECTOR)` and `find_element(By.XPATH)`. Either of these two should cover most scenarios.

Let's modify the code so that the first author can be printed.

```python
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://quotes.toscrape.com/js/')

element = driver.find_element(By.CLASS_NAME, "author")
print(element.text)

driver.quit()
```

What if you want to print all the authors?

The `find_element` method has a counterpart, `find_elements`. Note the pluralization. To find all the authors, simply change one line:

```python
elements = driver.find_elements(By.CLASS_NAME, "author")
```

This returns a list of elements. We can simply run a loop to print all the authors:

```python
for element in elements:
    print(element.text)
```

*Note: The complete code is in [selenium_example.py](https://github.com/oxylabs/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/blob/main/selenium_example.py) code file.*

However, if you are already comfortable with BeautifulSoup, you can create the Beautiful Soup object.

### Finding Elements Using BeautifulSoup

As we saw in the first example, the Beautiful Soup object needs HTML. For web scraping static sites, the HTML can be retrieved using the `requests` library. The next step is parsing this HTML string into the BeautifulSoup object.

```python
response = requests.get("https://quotes.toscrape.com/")
bs = BeautifulSoup(response.text, "lxml")
```

Let's find out how to scrape a dynamic website with BeautifulSoup.

The following part remains unchanged from the previous example.

```python
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup

driver = Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://quotes.toscrape.com/js/')
```

The rendered HTML of the page is available in the attribute `page_source`.

```python
soup = BeautifulSoup(driver.page_source, "lxml")
```

Once the soup object is available, all Beautiful Soup methods can be used as usual.

```python
author_element = soup.find("small", class_="author")
print(author_element.text)
```

*Note: The complete source code is in [selenium_bs4.py](https://github.com/oxylabs/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/blob/main/selenium_bs4.py)*

### Headless browser

Once the script is ready, there is no need for the browser to be visible when the script is running. The browser can be hidden, and the script will still run fine. A browser running this way is known as a headless browser.

To make the browser headless, import `ChromeOptions`. For other browsers, their own Options classes are available.

```python
from selenium.webdriver import ChromeOptions
```

Now, create an object of this class and add the `--headless=new` argument (the older `headless` attribute has been removed from recent Selenium releases).

```python
options = ChromeOptions()
options.add_argument("--headless=new")
```

Finally, pass this object while creating the Chrome instance.

```python
driver = Chrome(service=Service(ChromeDriverManager().install()), options=options)
```

Now when you run the script, the browser will not be visible. See [selenium_bs4_headless.py](https://github.com/oxylabs/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/blob/main/selenium_bs4_headless.py) file for the complete implementation.

## Web Scraping Dynamic Sites by Locating AJAX Calls

Loading the browser is expensive: it takes up CPU, RAM, and bandwidth which are not really needed. When a website is being scraped, it's the data that is important. All those CSS, images, and rendering are not really needed.

The fastest and most efficient way of scraping dynamic web pages with Python is to locate the actual place where the data is located.

There are two places where this data can be located:

- The main page itself, in JSON format, embedded in a `<script>` tag
- Other files which are loaded asynchronously. The data can be in JSON format or as partial HTML.

Let's look at a few examples.

### Data Embedded In the Same Page

Open <https://quotes.toscrape.com/js> in Chrome. Once the page is loaded, press <u>Ctrl+U</u> to view source. Press Ctrl+F to bring up the search box, and search for Albert.

![JSON in Page](images/json_embedded.png)

We can immediately see that data is embedded as a JSON object on the page. Also, note that this is part of a script where this data is being assigned to a variable `data`.

In this case, we can use the Requests library to get the page and use Beautiful Soup to parse the page and get the script element.

```python
response = requests.get('https://quotes.toscrape.com/js/')
soup = BeautifulSoup(response.text, "lxml")
```

Note that there are multiple `<script>` elements. The one which contains the data that we need does not have a `src` attribute. Let's use this to extract the script element.

```python
script_tag = soup.find("script", src=None)
```

Remember that this script contains other JavaScript code apart from the data that we are interested in. For this reason, we are going to use a regular expression to extract this data.

```python
import re

pattern = "var data =(.+?);\n"
raw_data = re.findall(pattern, script_tag.string, re.S)
```

The `raw_data` variable is a list containing one item. Now we can use the JSON library to convert this string data into a Python object.

```python
import json

if raw_data:
    data = json.loads(raw_data[0])
    print(data)
```

The output will be the Python object:

```python
[{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': {'name': 'Albert Einstein', 'goodreads_link': '/author/show/9810.Albert_Einstein', 'slug': 'Albert-Einstein'}, 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}, {'tags': ['abilities', 'choices'], 'author': {'name': 'J.K. Rowling', .....................
```

This list can now be converted to any format as required. Also, note that each item contains a link to the author page. It means that you can read these links and create a spider to get data from all these pages.
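To try the regex-plus-JSON step without hitting the network, the sketch below runs the same idea against a small inline HTML sample. The sample markup and the helper name `extract_embedded_json` are made up for illustration. The pattern also assumes the assignment ends with `;` followed by a newline, as on the page described above; a `;` at the end of a line inside a string value would cut the match short.

```python
import json
import re

# Stand-in for the page source; the real page embeds a much longer list.
SAMPLE_HTML = """
<script src="/static/jquery.js"></script>
<script>
    var data = [
        {"author": {"name": "Albert Einstein"}, "text": "Sample quote one"},
        {"author": {"name": "J.K. Rowling"}, "text": "Sample quote two"}
    ];
    for (var i in data) { /* rendering code */ }
</script>
"""


def extract_embedded_json(html, variable="data"):
    """Return the JSON value assigned to `var <variable> = ...;`, or None."""
    # Lazy match up to the first ';' that is followed by a line break.
    pattern = rf"var\s+{re.escape(variable)}\s*=\s*(.+?);\s*\n"
    match = re.search(pattern, html, re.S)
    return json.loads(match.group(1)) if match else None


quotes = extract_embedded_json(SAMPLE_HTML)
print([quote["author"]["name"] for quote in quotes])
```

Returning `None` when nothing matches makes it easy to detect that the page layout has changed, instead of failing with an index error.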
The complete script for extracting the embedded `data` variable is included in [data_in_same_page.py](https://github.com/oxylabs/Scraping-Dynamic-JavaScript-Ajax-Websites-With-BeautifulSoup/blob/main/data_in_same_page.py).

### Data In Other Pages

Web scraping dynamic sites can follow a completely different path. Sometimes the data is loaded on a separate page altogether. One such example is [Librivox](https://librivox.org/search/?q=The%20Time%20Machine&search_form=advanced).

Open Developer Tools, go to the Network tab, and filter by XHR. Now open this link or search for any book. You will see that the data is HTML embedded in JSON.

![Librivox](images/libribox.png)

Note a few things:

- The URL displayed by the browser is `https://librivox.org/search/?q=...`
- The data is in `https://librivox.org/advanced_search?....`
- If you look at the headers, you will find that the advanced_search page is sent a special header `X-Requested-With: XMLHttpRequest`

Here is a snippet to extract this data:

```python
import requests
from bs4 import BeautifulSoup

headers = {
    'X-Requested-With': 'XMLHttpRequest'
}
url = 'https://librivox.org/advanced_search?title=&author=&reader=&keywords=&genre_id=0&status=all&project_type=either&recorded_language=&sort_order=alpha&search_page=1&search_form=advanced&q=The%20Time%20Machine'

response = requests.get(url, headers=headers)
data = response.json()

# The 'results' key holds an HTML fragment, so parse it like any other page
soup = BeautifulSoup(data['results'], 'lxml')
book_titles = soup.select('h3 > a')

for item in book_titles:
    print(item.text)
```

The complete code is included in [librivox.py](code/librivox.py) file.
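The "HTML embedded in JSON" pattern can also be explored offline. The following sketch uses only the standard library, swapping BeautifulSoup for `html.parser` so that it runs without extra packages. The sample response body, its `status` key, and the names `TitleParser` and `titles_from_response` are assumptions for illustration, not the actual Librivox payload.

```python
import json
from html.parser import HTMLParser

# Stand-in for the JSON body returned by an AJAX endpoint (structure assumed).
SAMPLE_RESPONSE = json.dumps({
    "status": "SUCCESS",
    "results": '<li><h3><a href="/a">The Time Machine</a></h3></li>'
               '<li><h3><a href="/b">The Time Machine (Version 2)</a></h3></li>',
})


class TitleParser(HTMLParser):
    """Collect the text of every <a> element that sits inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def titles_from_response(body):
    """Decode the JSON body, then parse the HTML fragment under 'results'."""
    fragment = json.loads(body)["results"]
    parser = TitleParser()
    parser.feed(fragment)
    return [title.strip() for title in parser.titles]


print(titles_from_response(SAMPLE_RESPONSE))
```

The two-stage decode (JSON first, HTML second) is the key point; with BeautifulSoup installed, the parser class collapses to the single `soup.select('h3 > a')` call used in the snippet above.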

Developer Tools Browser Automation
51 Github Stars
oxylabs-ai-studio-js
Open Source

oxylabs-ai-studio-js

# Oxylabs AI Studio JavaScript SDK [![AI-Studio Java (1)](https://raw.githubusercontent.com/oxylabs/oxylabs-ai-studio-js/refs/heads/main/images/Github-AI-Studio-1262x525px%20new.png)](https://aistudio.oxylabs.io/?utm_source=877&utm_medium=affiliate&utm_campaign=ai_studio&groupid=877&utm_content=ai-studio-js-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) A JavaScript SDK for seamlessly interacting with [Oxylabs AI Studio API](https://aistudio.oxylabs.io/) services, including AI-Scraper, AI-Crawler, AI-Browser-Agent and other data extraction tools. ## Installation ```bash npm install oxylabs-ai-studio ``` ## Quick Start ### 1. Environment Setup Either add `OXYLABS_AI_STUDIO_API_URL` and `OXYLABS_AI_STUDIO_API_KEY` values to the `.env` file, or as your environment variables: ```bash export OXYLABS_AI_STUDIO_API_KEY=your_api_key_here ``` ## AI-Scraper ### Generate Schema ```javascript import { OxylabsAIStudioSDK } from 'oxylabs-ai-studio'; const sdk = new OxylabsAIStudioSDK({ apiKey: 'your_api_key_here', timeout: 120000, retryAttempts: 3, }); async function testGenerateSchema() { try { console.log('Testing schema generation...'); const schema = await sdk.aiScraper.generateSchema({ user_prompt: 'Extract the title of the page' }); console.log('Schema:', schema); } catch (error) { console.error('Schema generation error:', error.message); } } testGenerateSchema(); ``` ### Basic usage ```javascript import { OxylabsAIStudioSDK, OutputFormat } from 'oxylabs-ai-studio'; const sdk = new OxylabsAIStudioSDK({ apiKey: 'your_api_key_here', timeout: 120000, retryAttempts: 3, }); async function testScrapeOutputJson() { try { console.log('Testing synchronous scraping with JSON output...'); const options = { url: 
'https://www.freelancer.com', output_format: OutputFormat.JSON, geo_location: "US", schema: { type: 'object', properties: { links: { type: 'array', items: { type: 'string' } } } } }; const results = await sdk.aiScraper.scrape(options); console.log('Sync scraping results:', results); } catch (error) { console.error('Sync scraping error:', error.message); } } testScrapeOutputJson(); ``` ### Available Parameters - `url` (*string*): The target URL to process. - `output_format` (*string*): The desired format for the output. Can be either `markdown`, `json`, `screenshot`, `csv` or `toon`. Defaults to `markdown`. - `render_javascript` (*boolean | "auto"*): Whether to render JavaScript before extraction. Can be `"auto"` to auto-detect if rendering is needed. Defaults to `false`. - `schema` (*Record<string, any>*): A JSON Schema object that defines the structure of the output data. This is required when `output_format` is `json`, `csv` or `toon`. - `geo_location` (*string*): Specifies the geographic location (ISO2 format) or country canonical name from which the request should be simulated. - `user_agent` (*string*): User-Agent request header. See [available values](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/http-context-and-job-management/user-agent-type). 
## AI-Crawler ### Basic usage ```javascript import { OxylabsAIStudioSDK, OutputFormat } from 'oxylabs-ai-studio'; const sdk = new OxylabsAIStudioSDK({ apiKey: 'your_api_key_here', timeout: 120000, retryAttempts: 3, }); async function testCrawlOutputJson() { try { console.log('Testing crawling with JSON output...'); const options = { url: 'https://www.freelancer.com', output_format: OutputFormat.JSON, user_prompt: 'Get job ad pages', return_sources_limit: 3, geo_location: "Germany", schema: { type: "object", properties: { jobAd: { type: "object", properties: { position_title: { type: "string" }, salary: { type: "string" } } } } } }; const results = await sdk.aiCrawler.crawl(options); console.log('Crawling results:', JSON.stringify(results, null, 2)); } catch (error) { console.error('Crawling error:', error.message); } } testCrawlOutputJson(); ``` ### Available Parameters - `url` (*string*): The starting URL for the crawl. - `user_prompt` (*string*): Crawling instructions. - For auto-schema flow, use `parse_prompt` in `crawlWithAutoSchema()` to generate the `schema`. - `output_format` (*string*): The desired format for the output. Can be either `markdown`, `json`, `csv` or `toon`. Defaults to `markdown`. - `return_sources_limit` (*integer*): The maximum number of pages/sources to return. Defaults to `25`. - `render_javascript` (*boolean*): Whether to render JavaScript on pages before extraction. Defaults to `false`. - `schema` (*Record<string, any>*): A JSON Schema object that defines the structure of the output data. Required when `output_format` is `json`, `csv` or `toon`. - `geo_location` (*string*): Specifies the geographic location (ISO2 format) or country canonical name from which the request should be simulated. - `max_credits` (*integer | null*): Optional cap on credits to spend for the run. 
## Browser-Agent ### Basic usage ```javascript import { OxylabsAIStudioSDK, OutputFormat } from 'oxylabs-ai-studio'; const sdk = new OxylabsAIStudioSDK({ apiKey: 'your_api_key_here', timeout: 120000, retryAttempts: 3, }); async function testBrowseOutputJson() { try { console.log('Testing synchronous browsing with JSON output...'); const options = { url: 'https://www.freelancer.com', output_format: OutputFormat.JSON, user_prompt: 'Navigate to the first job ad you can find.', geo_location: "US", schema: { type: 'object', properties: { job_title: { type: 'string' } } } }; const results = await sdk.browserAgent.browse(options); console.log('Sync browsing results:', JSON.stringify(results, null, 2)); } catch (error) { console.error('Sync browsing error:', error.message); } } testBrowseOutputJson(); ``` ### Available Parameters - `url` (*string*): The target URL for the browser agent to start at. - `user_prompt` (*string*): Instructions describing what actions to perform and data to extract. - For auto-schema flow, use `parse_prompt` in `browseWithAutoSchema()` to generate the `schema`. - `output_format` (*string*): The desired format for the output. Can be `markdown`, `html`, `json`, `csv`, `toon`, or `screenshot`. Defaults to `markdown`. - `schema` (*Record<string, any>*): A JSON Schema object that defines the structure of the output data. This is required when `output_format` is `json`, `csv` or `toon`. - `geo_location` (*string*): Specifies the geographic location (ISO2 format) or country canonical name from which the request should be simulated. 
## AI-Search ### Basic usage ```javascript import { OxylabsAIStudioSDK, } from 'oxylabs-ai-studio'; const sdk = new OxylabsAIStudioSDK({ apiKey: 'your_api_key_here', timeout: 120000, retryAttempts: 3, }); async function testSearch() { try { console.log('Testing search...'); const options = { query: 'weather in London', limit: 3, return_content: true, render_javascript: false, geo_location: "IT", }; const results = await sdk.aiSearch.search(options); console.log('Search results:', JSON.stringify(results, null, 2)); } catch (error) { console.error('Search error:', error.message); } } testSearch(); ``` ### Available Parameters - `query` (*string*): The search query. - `limit` (*integer*): The maximum number of search results to return. Maximum: 50. - `geo_location` (*string*): ISO 2-letter format, country name, coordinate formats are supported. See more at [SERP Localization](https://developers.oxylabs.io/scraping-solutions/web-scraper-api/features/localization/serp-localization). ### Performance Optimization The `search()` method automatically optimizes performance by intelligently choosing between two endpoints: **Instant Endpoint** (used automatically when): - `limit` is ≤ 10 (or undefined) - AND `return_content` is `false` (or undefined) This provides real-time results without polling for faster response times. **Polling-based Endpoint** (used when): - `limit` > 10 - OR `return_content` is `true` You can also directly use `searchInstant()` method if you want to explicitly use the instant endpoint: ```javascript const results = await sdk.aiSearch.searchInstant({ query: 'weather today', geo_location: 'United States' }); ``` - `query` (*string*): The search query. - `limit` (*integer*): The maximum number of search results to return. Maximum: 10. - `geo_location` (*string*): Google's canonical name of the location. See more at [Google Ads GeoTargets](https://developers.google.com/google-ads/api/data/geotargets). 
## AI-Map ### Basic usage ```javascript import { OxylabsAIStudioSDK } from 'oxylabs-ai-studio'; const sdk = new OxylabsAIStudioSDK({ apiKey: 'your_api_key_here', timeout: 120000, retryAttempts: 3, }); async function testMap() { try { console.log('Testing map...'); const options = { url: 'https://www.freelancer.com/jobs', user_prompt: 'Extract tech job ads', return_sources_limit: 10, geo_location: 'US', render_javascript: false }; const results = await sdk.aiMap.map(options); console.log('Map results:', JSON.stringify(results, null, 2)); } catch (error) { console.error('Map error:', error.message); } } testMap(); ``` ### Available Parameters - `url` (*string*): The target URL to map and discover URLs from. - `search_keywords` (*string[]*): Optional keywords to bias the mapping. - `user_prompt` (*string | null*): Optional instructions to focus the mapping on relevant areas. - `limit` (*integer*): The maximum number of URLs to return (default `50`, max `10000`). - `max_crawl_depth` (*integer*): Maximum depth to crawl within the site (default `1`, max `5`). - `include_sitemap` (*boolean*): Whether to include sitemap URLs (default `true`). - `allow_subdomains` (*boolean*): Whether to include subdomains (default `false`). - `allow_external_domains` (*boolean*): Whether to include external domains (default `false`). - `geo_location` (*string | null*): The location to use (ISO2 or country canonical name). - `render_javascript` (*boolean*): Whether to render JavaScript when mapping (default `false`). - `max_credits` (*integer | null*): Optional cap on credits to spend for the run. 
## Running Examples You can find more examples of each application here: - [Browser-agent Example](https://github.com/oxylabs/oxylabs-ai-studio-js/blob/main/examples/browser-agent.js) - [AI-Crawler Example](https://github.com/oxylabs/oxylabs-ai-studio-js/blob/main/examples/ai-crawler.js) - [AI-Scraper Example](https://github.com/oxylabs/oxylabs-ai-studio-js/blob/main/examples/ai-scraper.js) - [AI-Search Example](https://github.com/oxylabs/oxylabs-ai-studio-js/blob/main/examples/ai-search.js) - [AI-Map Example](https://github.com/oxylabs/oxylabs-ai-studio-js/blob/main/examples/ai-map.js)

AI Agents Browser Automation
46 Github Stars
Rotating-Proxies-With-Python
Open Source

Rotating-Proxies-With-Python

# Rotating Proxies With Python [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=rotating-proxies-with-python-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) [<img src="https://img.shields.io/static/v1?label=&message=Python&color=brightgreen" />](https://github.com/topics/python) [<img src="https://img.shields.io/static/v1?label=&message=Web%20Scraping&color=important" />](https://github.com/topics/web-scraping) [<img src="https://img.shields.io/static/v1?label=&message=Rotating%20Proxies&color=blueviolet" />](https://github.com/topics/rotating-proxies) ## Table of Contents - [Finding Current IP Address](#finding-your-current-ip-address) - [Using A Single Proxy](#using-a-single-proxy) - [Rotating Multiple Proxies](#rotating-multiple-proxies) - [Rotating Multiple Proxies Using Async](#rotating-multiple-proxies-using-async) ## Prerequisites This article uses the python `requests` module. In order to install it, you can use `virtualenv`. `virtualenv` is a tool to create isolated Python environments. Start by creating a virtual environment in your project folder by running ```bash $ virtualenv venv ``` This will install python, pip and common libraries in your project folder. Next, invoke the source command to activate the environment. 
```bash
$ source venv/bin/activate
```

Lastly, install the `requests` module in the current virtual environment:

```bash
$ pip install requests
```

Alternatively, you can install the dependencies from the included [requirements.txt](requirements.txt) file by running:

```bash
$ pip install -r requirements.txt
```

Congratulations, you have successfully installed the `requests` module. Now, it's time to find out your current IP address!

## Finding Your Current IP Address

Create a file with the `.py` extension with the following contents (or just copy [no_proxy.py](src/no_proxy.py)):

```python
import requests

response = requests.get('https://ip.oxylabs.io/location')
print(response.text)
```

Now, run it from a terminal:

```bash
$ python no_proxy.py
128.90.50.100
```

The output of this script will show your current IP address, which uniquely identifies you on the network. Instead of exposing it directly when requesting pages, we will use a proxy server.

Let's start by using a single proxy.

## Using A Single Proxy

Your first step is to [find a free proxy server](https://www.google.com/search?q=free+proxy+server+list).

**Important Note**: free proxies are unreliable, slow and can collect data about the pages you access. If you're looking for a reliable paid option, we highly recommend using [oxylabs.io](https://oxy.yt/GrVD).

To use a proxy, you will need its:

* scheme (e.g. `http`)
* ip (e.g. `2.56.215.247`)
* port (e.g. `3128`)
* username and password that are used to connect to the proxy (optional)

Once you have them, you need to set the proxy up in the following format:

```
SCHEME://USERNAME:PASSWORD@YOUR_PROXY_IP:YOUR_PROXY_PORT
```

Here are a few examples of the proxy formats you may encounter:

```text
http://2.56.215.247:3128
https://2.56.215.247:8091
https://my-user:[email protected]:8044
```

Once you have the proxy information, assign it to a constant.
```python
PROXY = 'http://2.56.215.247:3128'
```

Next, define a timeout in seconds. It is always a good idea to avoid waiting indefinitely for a response that may never be returned (due to network issues, server issues or problems with the proxy server).

```python
TIMEOUT_IN_SECONDS = 10
```

The requests module [needs to know](https://docs.python-requests.org/en/master/user/advanced/#proxies) when to actually use the proxy. For that, consider the website you are attempting to access. Does it use http or https? Since we're trying to access **https**://ip.oxylabs.io/location, we can define this configuration as follows:

```python
scheme_proxy_map = {
    'https': PROXY,
}
```

**Note**: you can specify multiple protocols, and even define specific domains for which a different proxy will be used:

```python
scheme_proxy_map = {
    'http': PROXY1,
    'https': PROXY2,
    'https://example.org': PROXY3,
}
```

Finally, we make the request by calling `requests.get` and passing all the variables we defined earlier. We also handle the exceptions and show the error when a network issue occurs.

```python
import requests
from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout

try:
    response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
except (ProxyError, ReadTimeout, ConnectTimeout) as error:
    print('Unable to connect to the proxy: ', error)
else:
    print(response.text)
```

The output of this script should show you the IP of your proxy:

```bash
$ python single_proxy.py
2.56.215.247
```

You are now hidden behind a proxy when making your requests through the Python script. You can find the complete code in the file [single_proxy.py](src/single_proxy.py).

Now we're ready to rotate through a list of proxies, instead of using a single one!

## Rotating Multiple Proxies

If you're using unreliable proxies, it could prove beneficial to save a bunch of them into a csv file and run a loop to determine whether they are still available.
For that purpose, first create a file `proxies.csv` with the following content:

```text
http://2.56.215.247:3128
https://88.198.24.108:8080
http://50.206.25.108:80
http://68.188.59.198:80
... any other proxy servers, each of them on a separate line
```

Then, create a Python file and define both the filename and how long you are willing to wait for a single proxy to respond:

```python
TIMEOUT_IN_SECONDS = 10
CSV_FILENAME = 'proxies.csv'
```

Next, write the code that opens the csv file, reads every proxy server line by line into a `csv_row` variable and builds the `scheme_proxy_map` configuration needed by the requests module.

```python
import csv

with open(CSV_FILENAME) as open_file:
    reader = csv.reader(open_file)
    for csv_row in reader:
        scheme_proxy_map = {
            'https': csv_row[0],
        }
```

And finally, we use the same scraping code from the previous section to access the website via the proxy:

```python
import csv

import requests
from requests.exceptions import ProxyError, ReadTimeout, ConnectTimeout

with open(CSV_FILENAME) as open_file:
    reader = csv.reader(open_file)
    for csv_row in reader:
        scheme_proxy_map = {
            'https': csv_row[0],
        }

        # Access the website via proxy
        try:
            response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
        except (ProxyError, ReadTimeout, ConnectTimeout) as error:
            pass
        else:
            print(response.text)
```

**Note**: if you are only interested in scraping the content using *any* working proxy from the list, then add a break after the print to stop going through the proxies in the csv file:

```python
        try:
            response = requests.get('https://ip.oxylabs.io/location', proxies=scheme_proxy_map, timeout=TIMEOUT_IN_SECONDS)
        except (ProxyError, ReadTimeout, ConnectTimeout) as error:
            pass
        else:
            print(response.text)
            break  # notice the break here
```

The complete code is available in [rotating_multiple_proxies.py](src/rotating_multiple_proxies.py).

The only thing that is preventing us from reaching our full potential is speed. It's time to tackle that in the next section!
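The read-and-try loop can also be split into two small functions so that the selection logic is testable without touching the network. This is only a sketch: `read_proxies`, `first_working_proxy` and the `is_alive` callback are hypothetical names introduced for illustration, and the liveness results below are faked rather than measured.

```python
import csv
import io


def read_proxies(csv_text):
    """Return the first column of every non-empty CSV row."""
    reader = csv.reader(io.StringIO(csv_text))
    return [row[0].strip() for row in reader if row and row[0].strip()]


def first_working_proxy(proxies, is_alive):
    """Return the first proxy for which is_alive(proxy) is true, else None."""
    for proxy in proxies:
        if is_alive(proxy):
            return proxy
    return None


# Hypothetical data: in a real run, is_alive would wrap a requests.get call
# and return False on ProxyError, ReadTimeout or ConnectTimeout.
SAMPLE_CSV = (
    "http://2.56.215.247:3128\n"
    "https://88.198.24.108:8080\n"
    "\n"
    "http://50.206.25.108:80\n"
)
FAKE_ALIVE = {"https://88.198.24.108:8080", "http://50.206.25.108:80"}

proxies = read_proxies(SAMPLE_CSV)
chosen = first_working_proxy(proxies, lambda proxy: proxy in FAKE_ALIVE)
print(chosen)
```

Passing the check in as a callback mirrors the `break` variant above: the loop stops at the first proxy that answers, and the caller decides what "answers" means.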
## Rotating Multiple Proxies Using Async

Checking all the proxies in the list one by one may be an option for some, but it has one significant downside: this approach is painfully slow. This is because we are using a synchronous approach. We tackle requests one at a time and only move to the next once the previous one is completed.

A better option would be to make requests and wait for responses in a non-blocking way, which would speed up the script significantly.

In order to do that we use the `aiohttp` module. You can install it using the following CLI command:

```bash
$ pip install aiohttp
```

Then, create a Python file where you define:

* the csv filename that contains the proxy list
* the url that you wish to use to check the proxies
* how long you are willing to wait for each proxy (the timeout setting)

```python
CSV_FILENAME = 'proxies.csv'
URL_TO_CHECK = 'https://ip.oxylabs.io/location'
TIMEOUT_IN_SECONDS = 10
```

Next, we define an async function that will later be run using the asyncio module. It accepts two parameters:

* the url it needs to request
* the proxy to use to access it

We then print the response. If the script receives an error when attempting to access the url via the proxy, it will print it as well.

```python
import aiohttp


async def check_proxy(url, proxy):
    try:
        session_timeout = aiohttp.ClientTimeout(total=None,
                                                sock_connect=TIMEOUT_IN_SECONDS,
                                                sock_read=TIMEOUT_IN_SECONDS)
        async with aiohttp.ClientSession(timeout=session_timeout) as session:
            async with session.get(url, proxy=proxy, timeout=TIMEOUT_IN_SECONDS) as resp:
                print(await resp.text())
    except Exception as error:
        # you can comment out this line to only see valid proxies printed out in the command line
        print('Proxy responded with an error: ', error)
        return
```

Then, we define a main function that reads the csv file and creates an asynchronous task to check the proxy for every single record in the csv file.
```python
import asyncio
import csv


async def main():
    tasks = []
    with open(CSV_FILENAME) as open_file:
        reader = csv.reader(open_file)
        for csv_row in reader:
            task = asyncio.create_task(check_proxy(URL_TO_CHECK, csv_row[0]))
            tasks.append(task)

    await asyncio.gather(*tasks)
```

Finally, we run the main function and wait until all the async tasks complete:

```python
asyncio.run(main())
```

The complete code is available in [rotating_multiple_proxies_async.py](src/rotating_multiple_proxies_async.py).

Because the proxies are now checked concurrently rather than one after another, this code runs much faster than the synchronous version.

# We are open to contribution!

Be sure to play around with it and create a pull request with any improvements you may find. Also, check this [Best rotating proxy service](https://medium.com/@oxylabs.io/10-best-rotating-proxy-services-for-2025-853d840af1a4) list.

Happy coding!

Networking & Hosting Browser Automation
44 Github Stars
crawlee-web-scraping-tutorial
Open Source

crawlee-web-scraping-tutorial

# Crawlee Tutorial: Easy Web Scraping and Browser Automation [![Oxylabs promo code](https://raw.githubusercontent.com/oxylabs/product-integrations/refs/heads/master/Affiliate-Universal-1090x275.png)](https://oxylabs.io/pages/gitoxy?utm_source=877&utm_medium=affiliate&groupid=877&utm_content=crawlee-web-scraping-tutorial-github&transaction_id=102f49063ab94276ae8f116d224b67) [![](https://dcbadge.limes.pink/api/server/Pds3gBmKMH?style=for-the-badge&theme=discord)](https://discord.gg/Pds3gBmKMH) [![YouTube](https://img.shields.io/badge/YouTube-Oxylabs-red?style=for-the-badge&logo=youtube&logoColor=white)](https://www.youtube.com/@oxylabs) Web scraping and browser automation have emerged as essential tools for businesses looking to stay competitive in the digital marketplace. This easy tutorial covers everything you need to get started with Crawlee a tool for web scraping and browser automation. ## What is Crawlee? Crawlee is a Node.JS package that offers a straightforward and adaptable interface for web scraping and browser automation. Users can retrieve web pages, apply CSS selectors to extract data from them, and navigate the DOM tree to follow links and scrape several sites. Crawlee is a versatile tool that provides a uniform interface for web crawling via HTTP and headless browser approaches. It has an integrated persistent queue for handling URLs to crawl in either breadth-first or depth-first order. Users can benefit from integrated proxy rotation and session management, pluggable storage solutions for files and tabular data, and other features. Moreover, Crawlee offers hook-based customized lifecycles, programmable routing, error handling, and retries. For speedy project setup, a CLI is accessible, and Dockerfiles are included to streamline deployment. Crawlee is a robust and effective tool for web scraping and crawling written in TypeScript using generics. 
## The Benefits of Using Crawlee for Web Scraping and Browser Automation The following are some of the most common pros of using Crawlee for browser automation and web scraping: **Single interface:** Crawlee offers a single interface for headless browser crawling as well as HTTP crawling, making it simple to switch between the two based on your needs. **Customizable lifecycles:** Crawlee allows developers to alter their crawlers' lifecycles using hooks. These hooks can be used to carry out operations before or after specific events, such as before a request is made or after data is collected. **Pluggable storage:** Crawlee supports pluggable storage methods for both tabular data and files, making storing and managing the data you extract simple. **Proxy rotation and session management:** Crawlee has built-in support for these features, which can be used to manage complex web interactions and avoid IP blocking. **Configurable request routing, error handling, and retries:** Crawlee enables developers to control request routing, deal with errors, and retry requests as needed, making it simpler to handle edge cases and unexpected issues. **Docker-capable:** Crawlee comes with Dockerfiles that let developers rapidly and easily deploy their crawlers to production environments. ## Crawlee Web Scraping Tutorial This section will discuss the installation steps of Crawlee and how Crawlee works. The section concludes with a working example of scraping a website using Crawlee. ### Installation To use Crawlee on your system, you must install Node.JS version 16.0 or above. Along with it, NPM should also be installed. The [Crawlee CLI](https://www.npmjs.com/package/@crawlee/cli) is the quickest and most efficient way to build new projects with Crawlee. The following command will create a new Crawlee project inside the “my-crawler” directory. 
```Bash npx crawlee create my-crawler ``` The `npx` CLI tool runs the `crawlee` package locally without installing it globally on your machine. Running the above command will show you a prompt message to choose a template, as shown in the following snippet: ![npx create Crawlee project](images/npx-crawlee-create.png) Since we are doing Node.JS development, we can select “Getting Started Example (JavaScript)”. This option will install all required dependencies and create a new directory named “my-crawler” in your current working directory. It will also add a “package.json” to this folder. Additionally, it will include example source code that you may use immediately. The image below shows the completed installation message: ![npx create complete](images/npx-create-complete.png) Remember, the Crawlee project was created inside the “my-crawler” folder. You first need to change your current directory to this folder. ```bash cd my-crawler ``` Next, run the following command to start the Crawlee project: ```bash npm start ``` The `start` command will start the default crawler of the project. This crawler crawls the Crawlee website and outputs the titles of all the links on the website. ![npm start command](images/npm-start-command.png) Congratulations! You have successfully installed Crawlee and run one of the crawlers with it. ### Crawlee working Crawlee has three types of crawlers: **CheerioCrawler, PuppeteerCrawler**, and **PlaywrightCrawler**. They all share some basic characteristics. Every crawler is made to visit a webpage, carry out specific tasks, save the results, navigate to the next page, and repeat this cycle until the job is finished. Each crawler must therefore respond to two queries: Where should I go? What should I do there, then? The crawler can start working when these queries are resolved since most other settings are pre-configured for the crawlers. 
### Crawlee web scraping example

Our target for this web scraping demonstration is the [Books to Scrape](https://books.toscrape.com/) page. We will be scraping the titles of the books listed on the website.

Open the **main.js** file from the **src** folder of your project and replace its contents with the following code:

```javascript
//crawlee_scraper.js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        // Waiting for book titles to load
        await page.waitForSelector('h3');
        // Execute a function in the browser that targets
        // the book title elements and allows their manipulation
        const bookTitles = await page.$$eval('h3', (els) => {
            // Extract text content from the titles list
            return els.map((el) => el.textContent);
        });
        bookTitles.forEach((text, i) => {
            console.log(`Book_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://books.toscrape.com/']);
```

This code illustrates how to scrape data from a website using Crawlee's **PlaywrightCrawler** class. The code first imports the `PlaywrightCrawler` class from the `crawlee` package. Then it creates a new crawler of the `PlaywrightCrawler` type.

Instantiating the `PlaywrightCrawler` class requires an options object, which includes a `requestHandler` function. The `requestHandler` function is an asynchronous function executed for every page visited by the crawler.

The `requestHandler` function first waits for the page's `<h3>` elements, representing the book titles, to be rendered using the `page.waitForSelector()` call. The callback passed to `page.$$eval()` is then executed in the browser's context, extracting the text content from all of the `<h3>` elements on the page. Lastly, the crawler code logs the book titles to the console.

The last line of the code initiates the crawling operation using the `crawler.run()` method.
The following command runs the project:

```cmd
npm start
```

Here is the output once you run the Crawlee project:

![books to scrape output](images/books-to-scrape-output.png)

## Using Headless Browsers With Crawlee

Crawlee supports headless control over browsers such as Chromium, Firefox, and WebKit. You can combine headless browsers with Crawlee's `PuppeteerCrawler` and `PlaywrightCrawler` classes to perform true browser crawling and extract valuable data from complex websites.

You just need to set up a few options, such as the browser type, launch options, and context options, to use headless browsers with Crawlee. Here is a simple code example that launches a headless Firefox instance and scrapes the titles of the books (as we did in the earlier section):

```javascript
//crawlee_scraper_headless.js
import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    launchContext: {
        // Set the Firefox browser to be used by the crawler.
        launcher: firefox,
    },
    requestHandler: async ({ page }) => {
        // Wait for the book titles to render.
        await page.waitForSelector('h3');
        // Execute a function in the browser which targets
        // the book title elements and allows their manipulation.
        const bookTitles = await page.$$eval('h3', (els) => {
            // Extract text content from the book titles
            return els.map((el) => el.textContent);
        });
        bookTitles.forEach((text, i) => {
            console.log(`Book_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://books.toscrape.com/']);
```

Crawlee offers rich [features for HTTP and real browser crawling](https://www.i-programmer.info/news/90-tools/15803-scrape-the-web-with-crawlee.html).
These features include:

### HTTP crawling

- Automatic configuration of browser-like headers
- Replication of browser TLS fingerprints
- Integrated fast HTML parsers
- Zero config HTTP2 support, even for proxies
- Scraping JSON APIs

### Real browser crawling

- JavaScript rendering and screenshots
- Headless and headful browser support
- Zero-config generation of human-like fingerprints
- Automatic browser management
- Use Playwright and Puppeteer with the same interface

These qualities make Crawlee an effective solution for site scraping and browser automation.

## How To Manage Proxies With Crawlee

Crawlee includes built-in support for managing proxies. It lets you quickly choose between a list of proxies to avoid IP-based restrictions or website blocking. You can construct a new instance of a crawler and pass in a `ProxyConfiguration` object with the list of proxies. You can optionally specify the rotation technique. For example, you can set proxies to rotate every request or after a specified number of requests.

Moreover, you can also use a third-party solution such as [Oxylabs' Web Unblocker](https://oxylabs.io/products/web-unblocker) to ensure that your Crawlee web scraping is not blocked or restricted.

To integrate the proxy endpoints with Crawlee, you can use the `ProxyConfiguration` class. You can create an instance of this class using the constructor and provide the necessary options. You can visit the [ProxyConfigurationOptions](https://crawlee.dev/api/core/interface/ProxyConfigurationOptions) page to learn about proxy configuration options.

The following code demonstrates setting a proxy list in Crawlee. You can get a list of residential proxy endpoints by registering an account on the [Oxylabs' Web Unblocker](https://oxylabs.io/products/web-unblocker) page.
```javascript
// crawlee_scraper_proxies.js
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://username:[email protected]:60000',
        'https://username:[email protected]:60000',
    ],
});
// Treat the proxy as a Man-in-the-Middle (MITM) proxy.
proxyConfiguration.isManInTheMiddle = true;

const crawler = new PlaywrightCrawler({
    proxyConfiguration,
    requestHandler: async ({ page }) => {
        // Wait for the book titles to load.
        await page.waitForSelector('h3');
        // Execute a function in the browser that targets
        // the book title elements and allows their manipulation.
        const bookTitles = await page.$$eval('h3', (els) => {
            // Extract the text content of each title.
            return els.map((el) => el.textContent);
        });
        bookTitles.forEach((text, i) => {
            console.log(`Book_${i + 1}: ${text}\n`);
        });
    },
    navigationTimeoutSecs: 120,
});

await crawler.run(['https://books.toscrape.com/']);
```

The above code snippet creates a new instance of the `ProxyConfiguration` class by passing a list of Oxylabs' Web Unblocker proxy endpoints in the `proxyUrls` option. Then it sets the `isManInTheMiddle` property to `true`. This property indicates that the proxy server will be used as a Man-in-the-Middle (MITM) proxy. After that, it uses this `ProxyConfiguration` object (stored in the `proxyConfiguration` variable) to initialize a Playwright crawler instance. Make sure to replace the `username` and `password` with your account credentials. The rest of the crawler code remains the same as in the previous example.

## Conclusion

Crawlee is a powerful web scraping and browser automation solution with a unified interface for HTTP and headless browser crawling. It supports pluggable storage, headless browsing, automatic scaling, integrated proxy rotation and session management, customized lifecycles, and much more.
Crawlee is an effective solution for developers and data analysts who want to automate browser actions and retrieve and extract data effectively.
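
As a follow-up to the proxy section, the rotation idea mentioned there (switching proxies on every request or after a specified number of requests) can be sketched in plain JavaScript. This is a minimal illustration, not Crawlee's own implementation: `makeProxyRotator` is a hypothetical helper, and the `proxy-a.example` and `proxy-b.example` endpoints are placeholders.

```javascript
// Hypothetical helper (not part of Crawlee): returns a function that hands out
// proxy URLs in round-robin order, moving to the next proxy after
// `requestsPerProxy` calls.
function makeProxyRotator(proxyUrls, requestsPerProxy = 1) {
    if (!Array.isArray(proxyUrls) || proxyUrls.length === 0) {
        throw new Error('proxyUrls must be a non-empty array');
    }
    if (!Number.isInteger(requestsPerProxy) || requestsPerProxy < 1) {
        throw new Error('requestsPerProxy must be a positive integer');
    }
    let calls = 0;
    return () => {
        // Every block of `requestsPerProxy` calls maps to one proxy index.
        const index = Math.floor(calls / requestsPerProxy) % proxyUrls.length;
        calls += 1;
        return proxyUrls[index];
    };
}

// Placeholder endpoints, for illustration only.
const nextProxyUrl = makeProxyRotator(
    ['http://proxy-a.example:8000', 'http://proxy-b.example:8000'],
    2,
);

const picked = [];
for (let i = 0; i < 5; i++) {
    picked.push(nextProxyUrl());
}
console.log(picked);
```

In a Crawlee project, a function along these lines could presumably be supplied through the `newUrlFunction` option of `ProxyConfiguration` in place of `proxyUrls`. Check the ProxyConfigurationOptions page for the exact signature before relying on it.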
