Top Github Scraper
Scrape top GitHub repositories and users based on keywords.
I used this tool to analyze the top 1k machine learning users and create an interactive map to search for users based on their location.

Setup
Installation
pip install top-github-scraper
Add Credentials
To make sure you can scrape many repositories and users, add your GitHub credentials to a .env file.
touch .env
Add your username and token to the .env file:
GITHUB_USERNAME=yourusername
GITHUB_TOKEN=yourtoken
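Before a long scrape, it can help to confirm that both entries are present in the .env file. The following is a minimal sketch using only the standard library; `parse_env` and `missing_credentials` are hypothetical helpers written for this illustration and are not part of top-github-scraper, which may load the file differently.

```python
from pathlib import Path

# The two keys named in the setup instructions above.
REQUIRED_KEYS = ("GITHUB_USERNAME", "GITHUB_TOKEN")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_credentials(text: str) -> list[str]:
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(text)
    return [key for key in REQUIRED_KEYS if not values.get(key)]


if __name__ == "__main__":
    env_file = Path(".env")
    text = env_file.read_text() if env_file.exists() else ""
    print(missing_credentials(text) or "Both credentials are set")
```

An empty list from `missing_credentials` means both entries were found.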
Usage
View full documentation here.
Get Top Github Repositories' URLs
from top_github_scraper import get_top_repo_urls
get_top_repo_urls(keyword="machine learning", stop_page=10)
Output at top_repo_urls_<keyword>_<sort_by>_<start_page>_<end_page>.json:
[
  "/josephmisiti/awesome-machine-learning",
  "/wepe/MachineLearning",
  "/udacity/machine-learning",
  "/Jack-Cherish/Machine-Learning",
  "/ZuzooVn/machine-learning-for-software-engineers",
  "/rasbt/python-machine-learning-book",
  "/lawlite19/MachineLearning_Python",
  "/lazyprogrammer/machine_learning_examples",
  "/trekhleb/homemade-machine-learning",
  "/ujjwalkarn/Machine-Learning-Tutorials"
]
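The entries in this file are paths relative to the GitHub site rather than full links. As a small sketch, assuming the file holds a JSON list of such paths as shown above, they can be turned into browsable URLs like this; `to_full_urls` is a hypothetical helper, not a function of the package.

```python
import json

GITHUB_BASE = "https://github.com"


def to_full_urls(paths):
    """Prefix each relative repository path with the GitHub base URL."""
    return [GITHUB_BASE + path for path in paths]


# Two entries copied from the sample output above.
sample = json.loads(
    '["/josephmisiti/awesome-machine-learning", "/wepe/MachineLearning"]'
)
for url in to_full_urls(sample):
    print(url)
```

For a real run, replace `sample` with `json.load(open(path))` on the generated file.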
Get Top Github Repositories' Information
from top_github_scraper import get_top_repos
get_top_repos("machine learning", stop_page=10)
Output for one repository at top_repo_info_<keyword>_<sort_by>_<start_page>_<end_page>.json:
{
  "stargazers_count": 48620,
  "forks_count": 12155,
  "contributors": {
    "login": [
      "josephmisiti",
      "josephmmisiti",
      "hslatman",
      "0asa",
      "ajkl",
      "ipcenas",
      "cogmission",
      "spekulatius",
      "basickarl",
      "NathanEpstein"
    ],
    "url": [
      "https://api.github.com/users/josephmisiti",
      "https://api.github.com/users/josephmmisiti",
      "https://api.github.com/users/hslatman",
      "https://api.github.com/users/0asa",
      "https://api.github.com/users/ajkl",
      "https://api.github.com/users/ipcenas",
      "https://api.github.com/users/cogmission",
      "https://api.github.com/users/spekulatius",
      "https://api.github.com/users/basickarl",
      "https://api.github.com/users/NathanEpstein"
    ],
    "contributions": [
      671,
      105,
      21,
      12,
      11,
      9,
      8,
      7,
      7,
      7
    ]
  }
}
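The three lists under "contributors" are parallel: position i in "login", "url", and "contributions" describes the same person. A minimal sketch of using that layout, assuming a single repository record shaped like the example above, computes each contributor's share of the listed contributions. `contribution_shares` is a hypothetical helper for illustration.

```python
def contribution_shares(repo: dict) -> dict[str, float]:
    """Map each contributor login to its fraction of the listed contributions."""
    contributors = repo["contributors"]
    logins = contributors["login"]
    counts = contributors["contributions"]
    total = sum(counts)
    if total == 0:
        # Avoid dividing by zero when no contributions are recorded.
        return {login: 0.0 for login in logins}
    return {login: count / total for login, count in zip(logins, counts)}


# A shortened record using the first three contributors from the sample output.
sample_repo = {
    "stargazers_count": 48620,
    "forks_count": 12155,
    "contributors": {
        "login": ["josephmisiti", "josephmmisiti", "hslatman"],
        "contributions": [671, 105, 21],
    },
}
for login, share in contribution_shares(sample_repo).items():
    print(f"{login}: {share:.1%}")
```

Note that the shares are relative to the contributors present in the record, not to every commit in the repository.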
Get Top Github Contributors' Profiles
from top_github_scraper import get_top_contributors
get_top_contributors("machine learning", stop_page=10)
Output at top_contributor_info_<keyword>_<sort_by>_<start_page>_<end_page>.csv:
| | login | url | type | name | company | location | email | hireable | bio | public_repos | public_gists | followers | following |
| ------------- |:-------------:|:-------------:|:-----:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| 0 | josephmisiti | https://api.github.com/users/josephmisiti | User | Joseph Misiti | Math & Pencil | "Brooklyn, NY" | | True | Mathematician & Co-founder of Math & Pencil | 229 | 142 | 2705 | 275 |
| 1 | josephmmisiti | https://api.github.com/users/josephmmisiti | User | | | | | | | 0 | 0 | 2 | 0 |
| 2 | hslatman | https://api.github.com/users/hslatman | User | Herman Slatman | DistributIT | | | | | 133 | 20 | 469 | 67 |
| 3 | 0asa | https://api.github.com/users/0asa | User | Vincent Botta | | Belgium | | | "Innovation Engineer @evs-broadcast, previously Data Scientist @kensuio, E-Marketing Tools Manager @Diagenode, cofounder @Antibody-Adviser and photographer" | 35 | 15 | 25 | 16 |
| 4 | ajkl | https://api.github.com/users/ajkl | User | Ajinkya Kale | | | [email protected] | | | 58 | 1 | 29 | 4 |
| 5 | ipcenas | https://api.github.com/users/ipcenas | User | | | | | | | 79 | 0 | 1 | 0 |
| 6 | cogmission | https://api.github.com/users/cogmission | User | David Ray | | Third planet from the sun... | [email protected] | | Humanity's freedom and abundance through the pursuit of technological innovation in the area of cognitive applications - Cognition Mission | 30 | 19 | 54 | 44 |
| 7 | spekulatius | https://api.github.com/users/spekulatius | User | Peter Thaleikis | @bringyourownideas | 127.0.0.1 | | True | Software engineer focused on solutions using open source and simply filling in the gaps to fulfill the requirements. | 42 | 1 | 232 | 920 |
| 8 | basickarl | https://api.github.com/users/basickarl | User | Karl Morrison | | "Malmö, Sweden" | [email protected] | | The question is: Will you take me seriously | 5 | 1 | 12 | 6 |
| 9 | NathanEpstein | https://api.github.com/users/NathanEpstein | User | Nathan Epstein | | "New York, NY" | [email protected] | True | | 23 | 12 | 208 | 0 |
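Once the contributor CSV exists, it can be filtered with the standard library alone. The sketch below lists the logins of contributors marked as hireable. It assumes a comma-delimited file whose header row uses the column names shown above and whose hireable column holds the text "True" or is empty; `hireable_contributors` is a hypothetical helper, not part of the package.

```python
import csv
import io


def hireable_contributors(csv_file):
    """Return the logins of rows whose hireable column is the string 'True'."""
    reader = csv.DictReader(csv_file)
    return [row["login"] for row in reader if row.get("hireable") == "True"]


# A reduced stand-in for the generated CSV, keeping only three columns.
sample = io.StringIO(
    ",login,hireable\n"
    "0,josephmisiti,True\n"
    "1,hslatman,\n"
    "2,spekulatius,True\n"
)
print(hireable_contributors(sample))
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the in-memory sample.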
Get Top Github Users' Profiles
from top_github_scraper import get_top_users
get_top_users("machine learning", stop_page=10)
Output at top_user_info_<keyword>_<start_page>_<end_page>.csv:
| | login | url | type | name | company | location | email | hireable | bio | public_repos | public_gists | followers | following |
| ------------- |:-------------:|:-------------:|:-----:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| 0 | rasbt | https://api.github.com/users/rasbt | User | Sebastian Raschka | UW-Madison | "Madison, WI" | | | "Machine Learning researcher & open source contributor. Author of ""Python Machine Learning."" Asst. Prof. of Statistics @ UW-Madison." | 71 | 5 | 13888 | 35 |
| 1 | tqchen | https://api.github.com/users/tqchen | User | Tianqi Chen | "CMU, OctoML" | | | | Large scale Machine Learning | 28 | 1 | 8611 | 126 |
| 2 | halfrost | https://api.github.com/users/halfrost | User | halfrost | @Alibaba | Shanghai China | [email protected] | | 💪天道酬勤,勤能补拙。博观而约取,厚积而薄发。Gopher / Rustacean / iOS Dev. / Machine Learning / Retired acmer / Math / Philosophy / Technical Writer. | 22 | 0 | 8566 | 314 |
| 3 | ageron | https://api.github.com/users/ageron | User | Aurélien Geron | | Paris | | | Author of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. Former PM of YouTube video classification and founder & CTO of a telco operator. | 43 | 16 | 8383 | 2 |
| 4 | chiphuyen | https://api.github.com/users/chiphuyen | User | Chip Huyen | https://snorkel.ai | "Mountain View, CA" | | True | Developing tools and best practices for machine learning production. | 19 | 1 | 7839 | 15 |
| 5 | rhiever | https://api.github.com/users/rhiever | User | Randy Olson | FOXO BioScience | "Vancouver, WA" | [email protected] | | "Chief Data Scientist, @FOXOBioScience. AI, Machine Learning, and Data Visualization specialist. Community leader for /r/DataIsBeautiful." | 77 | 17 | 5363 | 13 |
| 6 | lexfridman | https://api.github.com/users/lexfridman | User | Lex Fridman | MIT | "Cambridge, MA" | | | "AI researcher working on autonomous vehicles, human-robot interaction, and machine learning at MIT and beyond." | 2 | 0 | 5031 | 0 |
| 7 | eriklindernoren | https://api.github.com/users/eriklindernoren | User | Erik Linder-Norén | | "Stockholm, Sweden" | [email protected] | | "ML engineer at Apple. Excited about machine learning, basketball and building things." | 24 | 0 | 3764 | 11 |
| 8 | roboticcam | https://api.github.com/users/roboticcam | User | A/Prof Richard Xu 徐亦达教授 | University of Technology Sydney | Sydney Australia | | | "I am an A/Professor in Machine Learning at UTS. manage a large research team of postdoc, PhD students close to 30 people" | 10 | 0 | 3561 | 0 |
| 9 | ogrisel | https://api.github.com/users/ogrisel | User | Olivier Grisel | Inria | "Paris, France" | [email protected] | | Machine Learning Engineer a Inria Saclay (Parietal team). | 174 | 93 | 3237 | 116 |
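The location column is what makes a location-based search like the map mentioned at the top possible. As a rough sketch of a first step toward that, the code below counts users per location string. It assumes a comma-delimited CSV with a `location` header as shown above and does no geocoding or normalization, so "Paris" and "Paris, France" are counted separately. `users_per_location` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def users_per_location(csv_file):
    """Count rows per non-empty location string, exactly as written."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        location = (row.get("location") or "").strip()
        if location:
            counts[location] += 1
    return counts


# A reduced stand-in for the generated CSV; quoted fields may contain commas.
sample = io.StringIO(
    ",login,location\n"
    '0,rasbt,"Madison, WI"\n'
    "1,tqchen,\n"
    "2,ageron,Paris\n"
)
for location, count in users_per_location(sample).most_common():
    print(location, count)
```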
Parameters
View a full list of parameters here.
How the Data is Scraped
top-github-scraper scrapes the owners and the contributors of the top repositories that GitHub returns when you search for a specific keyword.
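To make the owner and contributor relationship concrete, the sketch below takes a repository path in the form returned by get_top_repo_urls, extracts the owner's login, and builds the GitHub REST API address that lists a repository's contributors. Whether top-github-scraper calls this endpoint internally is an assumption; `split_repo_path` and `contributors_endpoint` are hypothetical helpers written for this illustration.

```python
API_BASE = "https://api.github.com"


def split_repo_path(path):
    """Split '/owner/repo' into its owner login and repository name."""
    owner, repo = path.strip("/").split("/", 1)
    return owner, repo


def contributors_endpoint(path):
    """Build the REST API URL that lists contributors for a repository path."""
    owner, repo = split_repo_path(path)
    return f"{API_BASE}/repos/{owner}/{repo}/contributors"


path = "/josephmisiti/awesome-machine-learning"
owner, _ = split_repo_path(path)
print(owner)
print(contributors_endpoint(path))
```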

For each user, top-github-scraper scrapes 13 data points:
- login: Username
- url: URL of the user
- type: Whether this account is a user or an organization
- name: Name of the user
- company: User's company
- location: User's location
- email: User's email
- hireable: Whether the user is hireable
- bio: Short description of the user
- public_repos: Number of public repositories the user has (including forked repositories)
- public_gists: Number of public gists the user has (including forked gists)
- followers: Number of followers the user has
- following: Number of people the user is following
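The fields listed above share their names with keys in the JSON that the GitHub REST API returns for a user (the api.github.com/users/<login> URLs shown in the sample output). A minimal sketch of reducing such a response to just these columns follows; that the package does it this way is an assumption, and `extract_profile` is a hypothetical helper.

```python
# Column names taken from the CSV header in the sample output.
PROFILE_FIELDS = (
    "login", "url", "type", "name", "company", "location", "email",
    "hireable", "bio", "public_repos", "public_gists", "followers",
    "following",
)


def extract_profile(api_user: dict) -> dict:
    """Keep only the listed fields; fields missing from the input become None."""
    return {field: api_user.get(field) for field in PROFILE_FIELDS}


# A hand-written stand-in for an API response, with one extra key to drop.
response = {
    "login": "josephmisiti",
    "id": 0,  # placeholder value; not one of the listed fields
    "type": "User",
    "followers": 2705,
}
profile = extract_profile(response)
print(profile["login"], profile["followers"], profile["name"])
```

Using `dict.get` means users who leave optional fields such as company or bio blank still produce a complete row.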