# Educative.io Scraper & Downloader

This tool scrapes and saves Educative.io courses, paths, projects, and cloud labs for offline use. It extracts course data directly via the Educative API and stores it in a local database for rendering.

> **Disclaimer:** I am not accountable for any inappropriate use of this scraper. I developed it solely for research purposes and take no responsibility for its misuse.

- Repository Version: v4.0.36 (Recommended)
- Master Branch: v4-master

---

## 🚀 Getting Started

### Prerequisites

- Python 3.12 or higher
- Supported OS: Windows / macOS / Linux

### Installation

Clone the repository and install dependencies using the automated setup script.

```bash
git clone https://github.com/anilabhadatta/educative.io_scraper.git
cd educative.io_scraper

# For Windows:
python setup.py --install
python setup.py --run

# For macOS/Linux:
python3 setup.py --install
python3 setup.py --run
```

*Note: `--install` creates a virtual environment and installs dependencies. `--run` starts the scraper GUI.*

<div align="center">
  <figure>
    <img src="https://github.com/user-attachments/assets/c3c3168f-88c4-432d-94c5-2f9b9c919466" style="width: 50%; height:50%;">
    <br>
    <figcaption>Recommended GUI Settings</figcaption>
  </figure>
</div>

---

## 🛠️ How to Use (Recommended Workflow)

The scraper is optimized to use the **API-JSON-Scraper**, which is significantly faster, cleaner, and more reliable than traditional browser automation.

### Step 1: Generate the Course URLs Excel File

To quickly get the exact URLs of all available courses, paths, and projects:

1. Open the Scraper GUI (`python EducativeScraper.py`).
2. Select **`All-Course-Urls-Text-File-Generator`** as the Scraper Type.
3. Click **Start Scraper**.
4. The script will fetch the latest sitemaps and API data. Once finished, a file named `educative_sitemap_analysis_updated.xlsx` will be generated in your project folder, containing the links organized by category.

### Step 2: Prepare Your Download List

1. Create a plain text file (e.g., `urls.txt`).
2. Open the generated `educative_sitemap_analysis_updated.xlsx` file.
3. Copy the **Topic Link** URLs for the items you want to download and paste them into your text file.
   - *Note: For Projects, just use the main Project Link from the spreadsheet, not the topic link.*

### Step 3: Run the API Scraper

1. In the Scraper GUI, select **`API-JSON-Scraper`** as the Scraper Type.
2. Select the **Text File** you created in Step 2.
3. Select a **Save Directory** where you want the database to be stored.
4. Click **Login Account** to authenticate your Educative session. A browser will open. Log in, and once authenticated, close the browser window.
5. Click **Start Scraper**. The scraper will use internal APIs to download the courses into a local database.

### Step 4: Scraping Public Content (Answers, Blog, Newsletter)

If you want to download free public content from Educative (e.g., Blog posts, Edpresso Answers, Newsletters):

1. Ensure your text file contains the desired public URLs (these are also categorized in the Excel file generated in Step 1).
2. In the Scraper GUI, select **`Public-Content-Scraper`** as the Scraper Type.
3. Select your Text File and Save Directory.
4. Click **Start Scraper**. The tool will use internal JSON APIs to fetch public pages and store them in the database in the same way as standard courses.

### Step 5: Extract and Download Assets

Once the courses or public content are scraped, use the GUI to run **Extract Assets**, followed by **Download Assets**. This step will fetch all embedded images, SVGs, and files referenced in the content.

---

## 📖 Viewing the Courses

To view your downloaded courses, use the **[Educative-Viewer V5](https://github.com/Biraj2004/EducativeViewer)**. The viewer reads the database generated by this scraper and dynamically renders the courses in an interface very close to the native Educative.io experience.

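
Because the download list from Step 2 is assembled by copy and paste, it can pick up blank lines, duplicates, or stray non-Educative links. The helper below is a minimal sketch for tidying such a list before handing it to the scraper. It is not part of this repository: the function name `clean_url_list` is hypothetical, and it assumes the text file holds one URL per line, which the scraper's own parser may or may not require.

```python
from urllib.parse import urlsplit


def clean_url_list(lines):
    """Return unique educative.io URLs in first-seen order.

    Hypothetical helper, not part of the scraper. Assumes one URL per line.
    """
    seen = set()
    cleaned = []
    for raw in lines:
        url = raw.strip()
        if not url:
            continue  # skip blank lines left over from pasting
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # malformed URL (e.g. a broken IPv6 bracket)
        if parts.scheme not in ("http", "https"):
            continue  # not a web link
        host = (parts.hostname or "").lower()
        # Keep educative.io itself and any of its subdomains.
        if host != "educative.io" and not host.endswith(".educative.io"):
            continue
        if url in seen:
            continue  # drop exact duplicates
        seen.add(url)
        cleaned.append(url)
    return cleaned
```

To use it, read `urls.txt` into a list of lines, pass the list to `clean_url_list`, and write the returned URLs back one per line. Keep a backup of the original file, since entries that are not recognized as Educative links are dropped.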

---

## ⚠️ Important Notes & Tips

- **Overwrite Option:** If you enable "Overwrite" to force redownloads, note that scraping will begin *exactly* from the topic link provided in your text file. Because the links generated in the Excel file are not guaranteed to be the very first lesson of a course, using them with Overwrite may overwrite the course only from that mid-point onward. To overwrite an entire course, manually provide its true first topic link.
- **Auto Resume:** Automatically restarts the scraper process up to 3 times if it crashes due to errors, reading the log to resume from the exact failed URL.
- **DB Skipping:** Independently of Auto Resume, the database tracks the status of each topic. If you restart the scraper, it skips any already-downloaded topics and continues with the remaining ones.
- **Auto Fix URL:** Automatically updates your text file to remove URLs that have been fully scraped, preventing unnecessary rescraping.
- **SeleniumBase(uc mode):** Runs the browser in an undetected mode to bypass anti-bot challenges (like Cloudflare) during login. Enable this if you get blocked.
- **Retry Failed URLs:** Used by the Asset Downloader to retry downloading any images or static assets that failed during previous runs.
- **Proxies:** You can configure IP proxies directly in the GUI if needed.
- **Headless Mode:** If you do not want to see the browser window during operations, choose the `headless` option.
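
The Auto Resume note above describes a bounded restart loop: after a crash, work picks up at the URL that failed, and the process gives up after 3 restarts. The sketch below illustrates that idea only. It is not the scraper's actual code: `run_with_auto_resume` and the `scrape` callable are hypothetical names, and it tracks the failed position in memory rather than reading it back from a log file as the real tool is described as doing.

```python
def run_with_auto_resume(urls, scrape, max_restarts=3):
    """Process urls in order, retrying from the failed URL after an error.

    Hypothetical illustration of the Auto Resume idea. Returns the URLs
    that completed; re-raises the last error once max_restarts is used up.
    """
    done = []
    position = 0   # index of the next URL to process
    restarts = 0   # restarts consumed so far, shared across the whole run
    while position < len(urls):
        try:
            scrape(urls[position])
        except Exception:
            if restarts >= max_restarts:
                raise  # restart budget exhausted: surface the error
            restarts += 1
            continue   # resume at the same URL that just failed
        done.append(urls[position])
        position += 1
    return done
```

With the default budget, a URL that keeps failing is attempted four times in total (one initial attempt plus three restarts) before the error propagates. URLs that already completed are never retried, which is the same effect the DB Skipping note describes for full restarts.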