github-repo-scraper's Introduction

GitHub Public Repo Scanner

Description

A simple script that pulls down all public GitHub repositories. It stores the results in a CSV file, which is not efficient for lookups. Switching to something like a SQL database should be easy, but YMMV; CSV is good enough for my needs.

The script grabs all of the properties available to Repository objects. Each repository is stored as a new row in the CSV. The CSV is meant to be read in with pandas.
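
Since a full dump can be far larger than memory, one way to read the CSV with pandas is in chunks. The following is a minimal sketch, not part of the script: the helper name count_by_column is hypothetical, and it assumes the CSV has a header row containing the column you ask for (the "language" column in the usage note is an assumption about the output).

```python
from collections import Counter

import pandas as pd


def count_by_column(csv_path, column, chunksize=100_000):
    """Count the values of one column without loading the whole CSV into memory."""
    counts = Counter()
    # Passing chunksize makes read_csv return an iterator of DataFrames.
    for chunk in pd.read_csv(csv_path, usecols=[column], chunksize=chunksize):
        # value_counts() skips missing values by default.
        counts.update(chunk[column].value_counts().to_dict())
    return counts
```

For example, count_by_column("repos.csv", "language").most_common(10) would list the ten most frequent values, assuming such a column exists.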

If you want all of the repositories, expect a run to take several weeks at the authenticated user rate limit (5,000 requests per hour) and to use roughly 500 GB of disk space.
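
To get a feel for how the rate limit drives the run time, here is a small back-of-the-envelope helper. It is only a sketch: the function name is hypothetical, and the number of requests a full run needs is not stated in this README, so the figure in the usage note is purely illustrative.

```python
def estimated_days(total_requests, requests_per_hour):
    """Return how many days a run takes if the rate limit is the only bottleneck."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    hours = total_requests / requests_per_hour
    return hours / 24
```

With an illustrative 1,000,000 requests, estimated_days(1_000_000, 5000) is a little over 8 days, while the same workload at the unauthenticated limit of 60 requests per hour would take almost two years.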

Dependencies

The script accesses the public GitHub API through PyGithub. You can install it with pip using the included requirements.txt file:

pip3 install -r requirements.txt
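
For orientation, this is roughly how public repositories can be enumerated with PyGithub. It is a sketch under assumptions, not the script's actual code: make_client and iter_public_repos are hypothetical helper names, while Github(...) and get_repos(since=...) are PyGithub calls.

```python
def make_client(token=None):
    """Build a PyGithub client; without a token, requests are unauthenticated."""
    # Imported lazily so iter_public_repos can be used with any compatible client.
    from github import Github

    # Newer PyGithub releases prefer Github(auth=github.Auth.Token(token)).
    return Github(token) if token else Github()


def iter_public_repos(client, since=0):
    """Yield (id, full_name) for public repositories with an ID greater than `since`."""
    # get_repos returns a lazily paginated list; pages are fetched on demand.
    for repo in client.get_repos(since=since):
        yield repo.id, repo.full_name
```

A caller would loop over iter_public_repos(make_client(token)) and write one row per repository. Each page fetched counts against the rate limit.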

Usage

The script accepts two optional parameters:

  • --token: your GitHub API token.
    • If no token is set, the rate limit is 60 requests per hour. You can obtain an API token under your user settings.
  • --filename: the name of the CSV file to write to.
    • If no filename is given, "repos.csv" is used. If the file already exists, the script tries to pick up where a previous run left off. I haven't tested this fully. Go ahead and fuzz it. 🐛
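
The two flags above could be declared with argparse along these lines. This is a sketch for illustration; the script's real argument handling may differ, and parse_args is a hypothetical helper name.

```python
import argparse


def parse_args(argv=None):
    """Parse the two optional flags; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(
        description="Scrape public GitHub repositories into a CSV."
    )
    parser.add_argument("--token", default=None, help="GitHub API token (optional)")
    parser.add_argument("--filename", default="repos.csv", help="CSV file to write to")
    return parser.parse_args(argv)
```

Calling parse_args([]) yields token=None and filename="repos.csv", matching the defaults described above.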

Examples

python3 ./get-repos.py
python3 ./get-repos.py --token <my-token>
python3 ./get-repos.py --filename repos.csv
python3 ./get-repos.py --token <my-token> --filename repos.csv
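
Re-running any of these commands against an existing CSV is meant to resume a previous run. One plausible way to find the resume point is sketched below. It is an assumption, not the script's confirmed behavior: last_repo_id is a hypothetical helper, and it assumes the CSV has a header row with an "id" column holding the repository ID.

```python
import csv
import os


def last_repo_id(csv_path, id_column="id"):
    """Return the largest repository ID already in the CSV, or 0 if there is none."""
    if not os.path.exists(csv_path):
        return 0
    last_id = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            value = row.get(id_column)
            # Skip blank or truncated rows, e.g. from an interrupted write.
            if value and value.isdigit():
                last_id = max(last_id, int(value))
    return last_id
```

The returned value could then be used as the starting repository ID for the next run. Scanning a very large CSV this way is slow, so storing the last ID in a small side file would be a reasonable alternative.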

FAQ

Why not use GH Archive?

I wanted to do it myself and learn the API. You probably want the GH Archive, not my messy script.

github-repo-scraper's People

Contributors

whatthefuzz
