Just a simple script to pull down metadata for every public GitHub repository. It stores the results in a CSV, which is not lookup-efficient. It should be easy to change to something like a SQL database, but YMMV; CSV is good enough for my needs.
The script grabs all of the properties available to PyGitHub's Repository objects. Each repository is stored as a new row in the CSV, which is meant to be read back in with pandas.
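For example, loading the output back is a one-liner (assuming the default filename, repos.csv):

import pandas as pd

# Each row is one repository; the columns are the Repository properties.
repos = pd.read_csv("repos.csv")
print(repos.shape)
print(repos.columns.tolist())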
If you want all of the repositories, this will take several weeks at the authenticated user rate limit (5,000 requests per hour), and the resulting CSV takes up roughly 500 GB of disk space.
The script talks to the public GitHub API through PyGitHub, which you can install with pip using the included requirements.txt file:
pip3 install -r requirements.txt
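For reference, enumerating every public repository boils down to PyGitHub's Github.get_repos(), which wraps the GET /repositories endpoint. A minimal sketch of the idea (not the actual script; the token placeholder is yours to fill in):

from itertools import islice
from github import Github

# Anonymous access gets 60 requests/hour; pass a token for 5,000/hour.
g = Github("<my-token>")

# get_repos() pages through every public repository in ascending id order;
# `since` is the repository id to start after.
for repo in islice(g.get_repos(since=0), 5):
    # raw_data is the full JSON payload -- all of the properties that
    # end up as one CSV row.
    print(repo.id, repo.full_name, len(repo.raw_data))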
The script accepts two optional parameters:
- --token: an optional argument to specify your API token. If no token is set, the rate limit is 60 requests per hour. You can obtain an API token under your user settings.
- --filename: an optional argument to specify the filename of the CSV to write to. If no filename is given, "repos.csv" will be used. If the file already exists, it'll try to pick back up where a previous run left off (see the sketch below). I haven't tested this fully. Go ahead and fuzz it.
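How resuming might look: scan the existing CSV for the highest repository id seen so far and feed it back in as since. This is only a sketch under the assumption that the dumped properties include an id column (last_seen_id is a hypothetical helper, not part of the script):

import csv
import os

def last_seen_id(filename):
    # Return the highest repository id already written to the CSV,
    # so a rerun can continue with get_repos(since=last_seen_id(...)).
    if not os.path.exists(filename):
        return 0
    with open(filename, newline="") as f:
        reader = csv.DictReader(f)
        return max((int(row["id"]) for row in reader), default=0)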
Example invocations:
python3 ./get-repos.py
python3 ./get-repos.py --token <my-token>
python3 ./get-repos.py --filename repos.csv
python3 ./get-repos.py --token <my-token> --filename repos.csv
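Because the rate limit dominates the runtime, it's handy to check how much quota you have left. A quick check with PyGitHub (same token placeholder as above):

from github import Github

g = Github("<my-token>")
core = g.get_rate_limit().core
print(f"{core.remaining}/{core.limit} requests remaining; resets at {core.reset}")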
Why not use GH Archive?
I wanted to do it myself and learn the API. You probably want the GH Archive, not my messy script.