About scrape-linkedin-selenium

scrape-linkedin-selenium is a Python package that extracts structured data from public LinkedIn profiles and company pages and converts it to JSON. Built on the Selenium web automation framework, it navigates LinkedIn pages to collect details such as skills, work history, and company information. It can be used from the command line via the scrapeli interface or imported as a library. Users can target specific attributes, choose between Chrome and Firefox, and process either offline HTML files or live URLs. Because of LinkedIn's anti-scraping measures, the scraper requires the li_at authentication cookie, which can be set through an environment variable or passed directly to the scraper instance. It also includes a utility for scraping in parallel. LinkedIn enforces strict policies against unauthorized data extraction and may blacklist IPs that make unauthenticated or unusual requests.


Published by

austinoboyle

scrape_linkedin

Introduction

scrape_linkedin is a Python package that scrapes the details of public LinkedIn profiles and turns the data into structured JSON. It can scrape both company pages and user profiles.

Warning: LinkedIn has strong anti-scraping policies and may blacklist IPs that make unauthenticated or unusual requests.

Installation

Install with pip

Run pip install git+https://github.com/austinoboyle/scrape-linkedin-selenium.git

Install from source

git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git

Run python setup.py install

Tests

Tests are (so far) only run on static HTML files. One is a LinkedIn profile; the other is used only to test some utility functions.

Getting & Setting LI_AT

Because of LinkedIn's anti-scraping measures, you must make your Selenium browser look like an actual user. To do this, add the li_at cookie to the Selenium session.

Getting LI_AT

1. Navigate to www.linkedin.com and log in
2. Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
3. Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
4. Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
5. Find and copy the li_at value

Setting LI_AT

There are two ways to set your li_at cookie:

1. Set the LI_AT environment variable
- $ export LI_AT=YOUR_LI_AT_VALUE
- On Windows: C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE
2. Pass the cookie as a parameter to the Scraper object.

>>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:
...     profile = scraper.scrape(user='austinoboyle')

A cookie value passed directly to the Scraper will override your environment variable if both are set.
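
The precedence rule can be pictured with a small sketch. The helper name resolve_li_at is hypothetical and not part of scrape_linkedin; it only illustrates an explicitly passed value winning over the LI_AT environment variable.

```python
import os


def resolve_li_at(cookie=None, environ=None):
    """Return the li_at value to use: explicit argument first, then LI_AT.

    Hypothetical helper for illustration only; not part of scrape_linkedin.
    """
    environ = os.environ if environ is None else environ
    if cookie:
        # A value passed directly takes priority over the environment
        return cookie
    value = environ.get("LI_AT")
    if not value:
        raise ValueError("No li_at cookie: pass cookie=... or set LI_AT")
    return value
```

Failing early when neither source is set is a design choice of this sketch; it avoids starting a browser session that LinkedIn would treat as unauthenticated.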

Examples

See /examples

Usage

Command Line

scrape_linkedin comes with a command-line tool, scrapeli, built with click.

Note: CLI only works with Personal Profiles as of now.

Options:

--url : Full URL of the profile you want to scrape
--user : Username of the profile, i.e. USER in www.linkedin.com/in/USER
--driver : Browser type to use (Chrome/Firefox), default: Chrome
-a --attribute : Return only a specific attribute (default: return all attributes)
-i --input_file : Raw path to the HTML file of the profile you want to scrape
-o --output_file : Raw path to the output file for the structured JSON profile (just prints results by default)
-h --help : Show this screen.

Examples:

Get Austin O'Boyle's profile info: $ scrapeli --user=austinoboyle
Get only the skills of Austin O'Boyle: $ scrapeli --user=austinoboyle -a skills
Parse stored html profile and save json output: $ scrapeli -i /path/file.html -o output.json
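
If you want to drive the CLI from a script, one possible approach is to assemble the argument list and hand it to subprocess. This is a minimal sketch: build_scrapeli_args and run_scrapeli are hypothetical helpers, and they use only the flags listed above.

```python
import subprocess


def build_scrapeli_args(user=None, url=None, attribute=None,
                        input_file=None, output_file=None):
    """Assemble an argv list for scrapeli (hypothetical helper)."""
    args = ["scrapeli"]
    if user:
        args.append(f"--user={user}")
    if url:
        args.append(f"--url={url}")
    if input_file:
        args += ["-i", input_file]
    if attribute:
        args += ["-a", attribute]
    if output_file:
        args += ["-o", output_file]
    return args


def run_scrapeli(**options):
    """Run scrapeli; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_scrapeli_args(**options), check=True)
```

Passing a list rather than a shell string avoids quoting problems with paths that contain spaces.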

Python Package

Profiles

Use the ProfileScraper component to scrape profiles.

from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())

Profile - the class whose properties give access to all information pulled from a profile. It also has a to_dict() method that returns all of the data as a dict.

from scrape_linkedin import Profile

# Parse a saved profile page instead of scraping a live URL
with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())

print(profile.skills)
# [{...}, {...}, ...]
print(profile.experiences)
# {jobs: [...], volunteering: [...], ...}
print(profile.to_dict())
# {personal_info: {...}, experiences: {...}, ...}

Structure of the fields scraped

personal_info
- name
- company
- school
- headline
- followers
- summary
- websites
- email
- phone
- connected
- image
skills
experiences
- volunteering
- jobs
- education
interests
accomplishments
- publications
- certifications
- patents
- courses
- projects
- honors
- test scores
- languages
- organizations
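
Since not every profile fills in every field, it can help to read the to_dict() result defensively. The sketch below assumes the nested layout listed above; get_field is a hypothetical helper and the sample values are made up.

```python
def get_field(data, *path, default=None):
    """Follow path through nested dicts; return default if a key is missing."""
    current = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Made-up sample shaped like the structure above;
# real data would come from profile.to_dict()
sample = {
    "personal_info": {"name": "Jane Doe", "headline": "Engineer"},
    "skills": [],
    "experiences": {"jobs": [], "education": [], "volunteering": []},
}

print(get_field(sample, "personal_info", "name"))  # Jane Doe
print(get_field(sample, "accomplishments", "patents", default=[]))  # []
```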

Companies

Use the CompanyScraper component to scrape companies.

from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())

Company - the class that has properties to access all information pulled from a company profile. There will be three properties: overview, jobs, and life. Overview is the only one currently implemented.

from scrape_linkedin import Company

# Backslashes continue the with statement across lines
with open('overview.html', 'r') as overview, \
        open('jobs.html', 'r') as jobs, \
        open('life.html', 'r') as life:
    company = Company(overview, jobs, life)

print(company.overview)
# {...}

Structure of the fields scraped

overview
- name
- company_size
- specialties
- headquarters
- founded
- website
- description
- industry
- num_employees
- type
- image
jobs NOT YET IMPLEMENTED
life NOT YET IMPLEMENTED

config

Pass these keyword arguments into the constructor of your Scraper to override default values. You may (for example) want to decrease/increase the timeout if your internet is very fast/slow.

cookie {str}: li_at cookie value (overrides env variable)
- default: None
driver {selenium.webdriver}: driver type to use
- default: selenium.webdriver.Chrome
driver_options {dict}: kwargs to pass to driver constructor
- default: {}
scroll_pause {float}: time(s) to pause during scroll increments
- default: 0.1
scroll_increment {int}: number of pixels to scroll down each time
- default: 300
timeout {float}: default time to wait for async content to load
- default: 10
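
As a sketch of overriding these defaults for a slow connection: the keyword names timeout, scroll_pause, and scroll_increment are assumed to be the snake_case forms of the options listed above, the values are illustrative rather than recommended, and scrape_profile is a hypothetical wrapper.

```python
# Illustrative overrides (assumed keyword names, made-up values)
SLOW_CONNECTION = {"timeout": 30, "scroll_pause": 0.5, "scroll_increment": 200}


def scrape_profile(user, **config):
    """Scrape one profile, passing config to the ProfileScraper constructor.

    Needs scrape_linkedin, a browser driver, and a valid li_at cookie
    when it is actually called.
    """
    # Imported lazily so this module loads even without the package installed
    from scrape_linkedin import ProfileScraper

    with ProfileScraper(**config) as scraper:
        return scraper.scrape(user=user).to_dict()


# Example call (requires a working Selenium setup):
# data = scrape_profile("austinoboyle", **SLOW_CONNECTION)
```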

Scraping in Parallel

New in version 0.2: built-in parallel scraping functionality. Note that the up-front cost of starting a browser session is high, so for this to be beneficial you will want to be scraping many (> 15) profiles.

Example

from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft', ...]

#Scrape all companies, output to 'companies.json' file, use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)

Configuration

Parameters:

scraper_type {scrape_linkedin.Scraper}: Scraper to use
items {list}: List of items to be scraped
output_file {str}: path to output file
num_instances {int}: number of parallel instances of selenium to run
temp_dir {str}: name of temporary directory to use to store data from intermediate steps
- default: 'tmp_data'
driver {selenium.webdriver}: driver to use for scraping
- default: selenium.webdriver.Chrome
driver_options {dict}: dict of keyword arguments to pass to the driver function.
- default: scrape_linkedin.utils.HEADLESS_OPTIONS
- default: scrape_linkedin.utils.HEADLESS_OPTIONS
**kwargs {any}: extra keyword arguments to pass to the scraper_type constructor for each job
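
To get a feel for how a list of items could be divided among several browser instances, here is a minimal sketch. It is not necessarily how scrape_in_parallel distributes work internally; split_evenly is a hypothetical helper.

```python
def split_evenly(items, num_instances):
    """Split items into at most num_instances contiguous, near-equal chunks."""
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    # Never create more chunks than there are items (but always at least one)
    n = min(num_instances, len(items)) or 1
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks each take one additional item
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


print(split_evenly(["facebook", "google", "amazon", "microsoft"], 3))
# [['facebook', 'google'], ['amazon'], ['microsoft']]
```

With only a handful of items, each instance gets one or two of them, which shows why the browser start-up cost dominates for small jobs.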

Issues

Report bugs and feature requests here.
