gitcollector

gitcollector collects and stores git repositories.

gitcollector is the source{d} tool to download and update git repositories at large scale. To that end, it uses a custom repository storage file format called siva optimized for saving storage space and keeping repositories up-to-date.

Status

The project is in a preliminary stable stage and under active development.

Storing repositories using rooted repositories

A rooted repository is a bare Git repository that stores all objects from all repositories that share a common history, that is, they have the same initial commit. It is stored using the Siva file format.

[Figure: Root Repository explanatory diagram]

Rooted repositories have a few particularities that you should know to work with them effectively:

  • They have no HEAD reference.
  • All references are of the following form: {REFERENCE_NAME}/{REMOTE_NAME}. For example, the reference refs/heads/master of the remote foo would be refs/heads/master/foo.
  • Each remote represents a repository that shares the common history of the rooted repository. A remote can have multiple endpoints.
  • A rooted repository is simply a repository with all the objects from all the repositories which share the same root commit.
  • The root commit of a repository is obtained by following the first parent of each commit, starting from HEAD.
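The reference naming scheme and the root commit lookup described above can be illustrated with a minimal Go sketch. It is not gitcollector's actual code: the function names (RootedRefName, SplitRootedRefName, RootCommit) and the in-memory commit graph are hypothetical, and the split assumes that remote names contain no slash.

```go
package main

import (
	"fmt"
	"strings"
)

// RootedRefName builds the name a reference takes inside a rooted
// repository, following the {REFERENCE_NAME}/{REMOTE_NAME} form.
func RootedRefName(refName, remote string) string {
	return refName + "/" + remote
}

// SplitRootedRefName reverses RootedRefName by splitting at the last slash.
// Assumption: the remote name itself contains no slash.
func SplitRootedRefName(rooted string) (refName, remote string, ok bool) {
	i := strings.LastIndex(rooted, "/")
	if i <= 0 || i == len(rooted)-1 {
		return "", "", false
	}
	return rooted[:i], rooted[i+1:], true
}

// RootCommit follows the first parent of each commit, starting at head,
// until it reaches a commit without parents. The parents map is a
// hypothetical in-memory stand-in for a real commit graph.
func RootCommit(parents map[string][]string, head string) string {
	current := head
	for {
		p := parents[current]
		if len(p) == 0 {
			return current
		}
		current = p[0] // only the first parent is followed
	}
}

func main() {
	fmt.Println(RootedRefName("refs/heads/master", "foo"))

	graph := map[string][]string{
		"c3": {"c2", "x1"}, // merge commit: first parent is c2
		"c2": {"c1"},
		"x1": {"x0"},
		"c1": nil, // root reached through first parents
		"x0": nil, // root of the merged-in history, never visited
	}
	fmt.Println(RootCommit(graph, "c3"))
}
```

In this toy graph the merge commit c3 has two parents, but only the first one is followed, so the lookup ends at c1 rather than x0.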

Getting started

Plain command

gitcollector is used through the download subcommand (currently the only subcommand):

Usage:
  gitcollector [OPTIONS] download [download-OPTIONS]

Help Options:
  -h, --help                                     Show this help message

[download command options]
          --library=                             path where download to [$GITCOLLECTOR_LIBRARY]
          --bucket=                              library bucketization level (default: 2) [$GITCOLLECTOR_LIBRARY_BUCKET]
          --tmp=                                 directory to place generated temporal files (default: /tmp) [$GITCOLLECTOR_TMP]
          --workers=                             number of workers, default to GOMAXPROCS [$GITCOLLECTOR_WORKERS]
          --half-cpu                             set the number of workers to half of the set workers [$GITCOLLECTOR_HALF_CPU]
          --no-updates                           don't allow updates on already downloaded repositories [$GITCOLLECTOR_NO_UPDATES]
          --no-forks                             github forked repositories will not be downloaded [$GITCOLLECTOR_NO_FORKS]
          --orgs=                                list of github organization names separated by comma [$GITHUB_ORGANIZATIONS]
          --excluded-repos=                      list of repos to exclude separated by comma [$GITCOLLECTOR_EXCLUDED_REPOS]
          --token=                               github token [$GITHUB_TOKEN]
          --metrics-db=                          uri to a database where metrics will be sent [$GITCOLLECTOR_METRICS_DB_URI]
          --metrics-db-table=                    table name where the metrics will be added (default: gitcollector_metrics) [$GITCOLLECTOR_METRICS_DB_TABLE]
          --metrics-sync-timeout=                timeout in seconds to send metrics (default: 30) [$GITCOLLECTOR_METRICS_SYNC]

    Log Options:
          --log-level=[info|debug|warning|error] Logging level (default: info) [$LOG_LEVEL]
          --log-format=[text|json]               log format, defaults to text on a terminal and json otherwise [$LOG_FORMAT]
          --log-fields=                          default fields for the logger, specified in json [$LOG_FIELDS]
          --log-force-format                     ignore if it is running on a terminal or not [$LOG_FORCE_FORMAT]

Usage example (--library and --orgs are always required):

gitcollector download --library=/path/to/repos/directory --orgs=src-d

To collect repositories from several github organizations:

gitcollector download --library=/path/to/repos/directory --orgs=src-d,bblfsh

Note that all the download command options are also configurable with environment variables.
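As an illustration of configuring the command through environment variables, the following Go sketch prepares a gitcollector download invocation from another program. The variable names GITCOLLECTOR_LIBRARY and GITHUB_ORGANIZATIONS come from the help output above; the helper downloadCmd is hypothetical, and the sketch assumes a gitcollector binary is available on the PATH when the command is eventually run.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// downloadCmd prepares, but does not run, a "gitcollector download"
// invocation that is configured through environment variables
// instead of command-line flags. This helper is illustrative only.
func downloadCmd(library string, orgs []string) *exec.Cmd {
	cmd := exec.Command("gitcollector", "download")
	// Start from the current environment and add the two required options.
	cmd.Env = append(os.Environ(),
		"GITCOLLECTOR_LIBRARY="+library,
		"GITHUB_ORGANIZATIONS="+strings.Join(orgs, ","),
	)
	return cmd
}

func main() {
	cmd := downloadCmd("/path/to/repos/directory", []string{"src-d", "bblfsh"})
	fmt.Println(strings.Join(cmd.Args, " "))
	// Calling cmd.Run() here would start the actual download.
}
```

The command is only built here, not executed, so the sketch can be run without gitcollector installed.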

Docker

gitcollector publishes a new Docker image to Docker Hub on each release. To use it:

docker run --rm --name gitcollector_1 \
-e "GITHUB_ORGANIZATIONS=src-d,bblfsh" \
-e "GITHUB_TOKEN=foo" \
-v /path/to/repos/directory:/library \
srcd/gitcollector:latest

Note that you must mount a local directory at the container path /library, as shown in -v /path/to/repos/directory:/library. The repositories are downloaded into this directory and stored as rooted repositories in the siva file format.

License

GPL v3.0, see LICENSE
