
github-mirror's Introduction

ghtorrent: Mirror and index data from the Github API

A library and a collection of scripts used to retrieve data from the Github API and extract metadata into an SQL database, in a modular and scalable manner. The scripts are distributed as a Gem (ghtorrent), but they can also be run by checking out this repository.

GHTorrent can be used for a variety of purposes, such as:

  • Mirror the Github API event stream and follow links from events to actual data to gradually build a Github index
  • Create a queriable metadata database for a specific repository
  • Construct a data source for extracting process analytics (see for example those) for one or more repositories

Components

GHTorrent's components (which can be used individually) are:

  • APIClient: Knows how to query the Github API (both single entities and pages) and respect the API request limit. Can be configured to override the default IP address, in case of multihomed hosts.
  • Retriever: Knows how to retrieve specific Github entities (users, repositories, watchers) by name. Uses an optional persister to avoid retrieving data that have not changed.
  • Persister: A key/value store, which can be backed by a real key/value store, to store Github JSON replies and query them on request. The backing key/value store must support arbitrary queries to the stored JSON objects.
  • GHTorrent: Knows how to extract information from the data retrieved by the retriever in order to update an SQL database (see schema) with metadata.
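
The request-limit handling described for APIClient can be illustrated with a small pure function. This is a sketch, not GHTorrent's implementation: the function name `seconds_until_next_request` and the `safety_margin` parameter are hypothetical. It assumes the caller has already read the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that the GitHub REST API returns with each reply.

```ruby
# Hypothetical helper, not part of GHTorrent: decide how long to pause
# before the next API request, given GitHub's rate-limit headers.
#
# remaining     - Integer value of the X-RateLimit-Remaining header
# reset_epoch   - Integer value of X-RateLimit-Reset (UTC epoch seconds)
# now           - current Time (injected to keep the function testable)
# safety_margin - requests to keep in reserve (an assumption of this sketch)
def seconds_until_next_request(remaining, reset_epoch, now: Time.now, safety_margin: 0)
  return 0 if remaining > safety_margin

  wait = reset_epoch - now.to_i
  wait.positive? ? wait : 0
end

now = Time.at(1_000_000)
puts seconds_until_next_request(4367, 1_000_600, now: now) # budget left, no pause
puts seconds_until_next_request(0, 1_000_600, now: now)    # pause until the window resets
```

Passing the current time as an argument keeps the decision separate from the actual sleep call, which makes it easy to test.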

Component Configuration

The Persister and GHTorrent components have configurable back ends:

  • Persister: Either uses MongoDB > 3.0 (mongo driver) or no persister (noop driver)
  • GHTorrent: GHTorrent is tested mainly with MySQL and SQLite, but can theoretically be used with any SQL database compatible with Sequel. Your mileage may vary.

For distributed mirroring you also need RabbitMQ >= 3.3

Installation

1. Install GHTorrent

GHTorrent is written in Ruby (tested with Ruby > 2.0). To install it as a Gem do:

sudo gem install ghtorrent

2. Install Your Preferred Database

Depending on which SQL database you want to use, install the appropriate dependency gem.

sudo gem install mysql2 # or sqlite3

Configuration

Copy config.yaml.tmpl to a file in your home directory.

All provided scripts accept the -c option, which accepts the location of the configuration file as a parameter.

You can find more information on how to set up a mirroring cluster of machines to retrieve data in parallel on the Wiki.

Using GHTorrent

To mirror the event stream and capture all data:

  • ght-mirror-events.rb periodically polls Github's event queue (https://api.github.com/events), stores all new events in the configured persister, and posts them to the github exchange in RabbitMQ.

  • ght-data-retrieval.rb creates queues that route posted events to processor functions. The functions use the appropriate Github API call to retrieve the linked contents, extract metadata (for database storage), and store the retrieved data in the appropriate collection in the persister, to avoid duplicate API calls. Data in the SQL database contain pointers (the ext_ref_id field) to the "raw" data in the persister.
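
Storing only "new" events implies that each poll is checked against what was seen before, because consecutive polls of the event queue can overlap. The sketch below shows that step in isolation. The class name `EventDeduplicator` is hypothetical, and an in-memory Set stands in for the persister lookup that a real deployment would use. It assumes each parsed event is a Hash with an "id" key, as in Github's event JSON.

```ruby
require 'set'

# Hypothetical sketch: remembers which event ids were already seen so that
# repeated polls only yield new events. A real deployment would ask the
# persister instead of keeping ids in memory.
class EventDeduplicator
  def initialize
    @seen = Set.new
  end

  # events - Array of Hashes parsed from the /events JSON reply, each
  #          carrying an "id" key.
  # Returns only the events that no earlier call has returned.
  def fresh(events)
    # Set#add? returns nil when the id is already present.
    events.select { |event| @seen.add?(event['id']) }
  end
end

dedup = EventDeduplicator.new
first_poll  = [{ 'id' => '1', 'type' => 'PushEvent' }, { 'id' => '2', 'type' => 'ForkEvent' }]
second_poll = [{ 'id' => '2', 'type' => 'ForkEvent' }, { 'id' => '3', 'type' => 'WatchEvent' }]

puts dedup.fresh(first_poll).map { |e| e['id'] }.inspect
puts dedup.fresh(second_poll).map { |e| e['id'] }.inspect
```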

To retrieve data for a repository or user:

  • ght-retrieve-repo retrieves all data for a specific repository
  • ght-retrieve-user retrieves all data for a specific user

To perform maintenance:

  • ght-load loads selected events from the persister to the queue in order for the ght-data-retrieval script to reprocess them

Data

The code in this repository is used to power the data collection process of the GHTorrent.org project. You can find all data collected by the project on the Downloads page.

There are two sets of data:

  • Raw events: Github's event stream. These are the roots for mirroring operations. The ght-data-retrieval crawler starts from an event and goes deep into the rabbit hole.
  • SQL dumps + Linked data: Data dumps from the SQL database and the corresponding MongoDB entities.

Bugs & Feature Requests

Please tell us about features you'd like or bugs you've discovered on our Issue Tracker.

Patches, bug fixes, etc are welcome. Please fork the repository and create a pull request when done fixing/implementing the new feature.

Citing GHTorrent in your Research

If you find GHTorrent and the accompanying datasets useful in your research, please consider citing the following paper:

Georgios Gousios and Diomidis Spinellis, "GHTorrent: GitHub’s data from a firehose," in MSR '12: Proceedings of the 9th Working Conference on Mining Software Repositories, June 2–3, 2012. Zurich, Switzerland.

Authors

License

2-clause BSD


github-mirror's Issues

MongoDB dumps from tudelft.nl unreachable

It seems like the two initial MongoDB dumps (hosted at TU Delft) have been unreachable for the past two weeks:

http://dutihr.st.ewi.tudelft.nl/downloads/commits-dump.2015-08-03.tar.gz
http://dutihr.st.ewi.tudelft.nl/downloads/commits-1-dump.2015-08-04.tar.gz

We encountered timeouts (no error response) using either wget or Firefox from both university and private IPs, although the host responds to ping. We had previously (a few months ago) downloaded both files successfully.

Pull request history "merged" problem.

It seems that the "merged" actions are not associated with the correct user. Sometimes they are associated with the issue opener, sometimes with the closer. The correct author login can be obtained using the github /user/repo/issues api.

For example the following issues are all merged by "juditacs" user (based on the github api), and in the sql dump:

Incorrect opened_at in table pull_request_history

I found this problem when I compared the opened_at of a pull request in GHTorrent with the created_at of the same pull request accessed via the official GH API.

Here is an example:

  1. data in table pull_request_history
id pull_request_id created_at action actor_id
20920677 345020 2012-08-30 17:41:41 opened 654469

the PR is the #1266 in repository cocos2d/cocos2d-x

  2. data fetched by the official GH API
    https://api.github.com/repos/cocos2d/cocos2d-x/pulls/1266
    "created_at": "2012-08-30T19:41:41Z"

2012-08-30 17:41:41 in GHTorrent is two hours earlier than 2012-08-30T19:41:41Z in GH.

At first, I thought this inconsistency was caused by a timezone difference. However, the created_at of a repository in GHTorrent equals the value fetched by the official GH API.
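
To make the size of the discrepancy explicit, the two timestamps quoted above can be compared directly. This sketch assumes the value stored in pull_request_history is meant to be read as UTC; the helper name `offset_hours` is hypothetical. A whole-hour offset is consistent with a timezone conversion somewhere in the pipeline, but it does not prove one.

```ruby
require 'time'

# Timestamps quoted in the report above.
ghtorrent_created_at = Time.utc(2012, 8, 30, 17, 41, 41)    # pull_request_history, read as UTC (assumption)
github_created_at    = Time.iso8601('2012-08-30T19:41:41Z') # created_at from the GitHub API

# Difference between the two sources, rounded to whole hours.
def offset_hours(api_time, stored_time)
  ((api_time - stored_time) / 3600).round
end

puts offset_hours(github_created_at, ghtorrent_created_at)
```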

Last update too recent?

If a spider fails to get a response from the GitHub API because of an unstable network or a problem with the GitHub API server itself, the program exits early with no error or exception. But when I run it again with the same config, it says:

WARN, 2016-05-04T00:11:20+08:00, ghtorrent -- full_repo_retriever.rb: Last update too recent

Any way to resume the polling? I found the watcher data is pretty far from the real (watchers + starrers).

Incorrect datetime value error when running ght-restore-mysql

I got this error message on an Ubuntu 16.04 system (MySQL 5.7.16):

....
Thu Jan 12 00:43:27 CET 2017 Creating indexes
Thu Jan 12 00:43:27 CET 2017 CREATE UNIQUE INDEX `login` ON `ghtorrent17_1`.`users` (`login` ASC)  COMMENT '';
Thu Jan 12 01:00:49 CET 2017 CREATE UNIQUE INDEX `sha` ON `ghtorrent17_1`.`commits` (`sha` ASC)  COMMENT '';
ERROR 1292 (22007) at line 1: Incorrect datetime value: '0000-00-00 00:00:00' for column 'created_at' at row 490174

This fixed the problem for me:

UPDATE commits SET created_at = NULL WHERE CAST(created_at AS CHAR(20)) = '0000-00-00 00:00:00';
UPDATE projects SET created_at = NULL WHERE CAST(created_at AS CHAR(20)) = '0000-00-00 00:00:00';
UPDATE projects SET updated_at = NULL WHERE CAST(updated_at AS CHAR(20)) = '0000-00-00 00:00:00';
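
If more tables turn out to contain zero dates, the same statement can be generated instead of typed by hand. The sketch below only builds the SQL strings for the table/column pairs listed in the workaround above; the constant and function names are hypothetical, and whether other tables are affected is not stated in the report.

```ruby
# Table/column pairs taken from the workaround above; extend as needed.
ZERO_DATE_COLUMNS = {
  'commits'  => %w[created_at],
  'projects' => %w[created_at updated_at]
}.freeze

# Builds one UPDATE statement per table/column pair.
def zero_date_fixes(columns = ZERO_DATE_COLUMNS)
  columns.flat_map do |table, cols|
    cols.map do |col|
      "UPDATE #{table} SET #{col} = NULL " \
        "WHERE CAST(#{col} AS CHAR(20)) = '0000-00-00 00:00:00';"
    end
  end
end

puts zero_date_fixes
```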

issue_events action check constraint causes transaction to fail

During normal operation, ensure_issue_events fails with the following error:

ERROR, 2017-10-27T13:40:21-04:00, ghtorrent -- ghtorrent.rb: PG::CheckViolation: ERROR: new row for relation "issue_events" violates check constraint "issue_events_action_check" DETAIL: Failing row contains (831153782, 49, 27, head_ref_restored, null, 2016-10-20 20:01:36).

This error does not crash the app, but causes all later transactions to fail:

WARN, 2017-10-27T13:40:21-04:00, ghtorrent -- ghtorrent.rb: Transaction failed (3245 ms) ERROR, 2017-10-27T13:40:21-04:00, ghtorrent -- ghtorrent.rb: PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block

There are many different types of issue events: https://developer.github.com/v3/issues/events/#events-1

However, the type constraint only declares a few actions, but enforces a not null constraint regardless: https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/migrations/011_add_issues.rb#L30

The code itself does nothing to try to filter the "action" field: https://github.com/gousiosg/github-mirror/blob/master/lib/ghtorrent/ghtorrent.rb#L1597

My question:

What is preferable
a) expand the constraint, or drop it altogether, so that all issue events are allowed to be written?
or b) alter ensure_issue_events so that it won't attempt to write any events that would violate the constraint?
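
Option b) could look roughly like the following. This is a sketch only: the allow-list below is a hypothetical subset, and the authoritative list is whatever the check constraint in 011_add_issues.rb declares. It assumes each event is a Hash whose "event" key holds the action name, as in GitHub's issue events JSON.

```ruby
# Hypothetical allow-list. The real list lives in the check constraint of
# migration 011_add_issues.rb and should be copied from there.
ALLOWED_ISSUE_ACTIONS = %w[subscribed mentioned closed referenced assigned reopened].freeze

# Splits events into those that satisfy the constraint and those that
# would violate it, so the latter can be logged and skipped instead of
# aborting the surrounding transaction.
def partition_issue_events(events, allowed = ALLOWED_ISSUE_ACTIONS)
  events.partition { |event| allowed.include?(event['event']) }
end

events = [
  { 'id' => 1, 'event' => 'closed' },
  { 'id' => 831_153_782, 'event' => 'head_ref_restored' }
]

writable, skipped = partition_issue_events(events)
puts "writable: #{writable.map { |e| e['event'] }.inspect}"
puts "skipped:  #{skipped.map { |e| e['event'] }.inspect}"
```

Filtering keeps the schema untouched but silently drops data, whereas option a) preserves every event at the cost of a migration.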

Need better api error handling

def api_request_raw(url, media_type = '') in api_client sometimes returns nil and sometimes throws an exception. This ripples up to pretty much all of the retrieve* and ensure* methods and beyond. Some of these are built to handle nil and others are not. We are seeing many secondary exceptions as a result of nil return values. This typically happens when a 403 Forbidden is returned from GitHub. While I suspect that is a throttling-related problem (subsequent calls appear to work), it is completely realistic that this will happen, as there are a raft of different REST call statuses that will cause nil to be returned.

I started by putting .nil? checks in the appropriate places but:

  • there are quite a few places
  • makes the code look yucky
  • generally these just exit the method

The bonus of checking in the caller is that you can provide a somewhat more targeted error. Rather than "could not retrieve user XXX", the code could give "Failed ensure_commit because user XXX could not be retrieved".

If an exception is to be thrown, there are a number of "send() loops" that will need to be augmented with rescues.

I'm happy to help make the related changes but need to know

  • exceptions or nil? checks?
  • fix/change/remove existing nil? checks if exceptions are the chosen path?
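
The exception-based option can be sketched as follows. None of these names exist in GHTorrent: `ApiRequestError`, `fetch_or_raise` and `ensure_commit_author` are hypothetical, and the HTTP call is replaced by a block that returns a status and a body so the example runs without network access.

```ruby
# Hypothetical exception type carrying the URL and HTTP status.
class ApiRequestError < StandardError
  attr_reader :url, :status

  def initialize(url, status)
    @url = url
    @status = status
    super("Request to #{url} failed with status #{status}")
  end
end

# Stand-in for an HTTP call: the block returns [status, body].
# Raises on any non-200 status instead of returning nil.
def fetch_or_raise(url)
  status, body = yield(url)
  raise ApiRequestError.new(url, status) unless status == 200

  body
end

# The caller adds context and re-raises, which gives the more targeted
# message described above.
def ensure_commit_author(login, &http)
  fetch_or_raise("https://api.github.com/users/#{login}", &http)
rescue ApiRequestError => e
  raise "Failed ensure_commit because user #{login} could not be retrieved (HTTP #{e.status})"
end

begin
  ensure_commit_author('XXX') { |_url| [403, nil] }
rescue RuntimeError => e
  puts e.message
end
```

With this shape the nil checks disappear from the callers, but every loop that dispatches to these methods needs a rescue, as noted above.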

Metadata appears out of date on several repos in MongoDB

GHTorrent Team,

Black Duck Software has been hoping to use GHTorrent to keep up to date on all github metadata. When accessing using the MongoDB instance, we have noticed that several major repos appear to be somewhat out of date.

Using a few repos from owner 'google' as an example, but we've seen several major repos other than google being quite out of date:

  1. google/material-design-lite

https://api.github.com/repos/google/material-design-lite

"stargazers_count": 28104,
"watchers_count": 28104,

db.repos.find({ name: 'material-design-lite', 'owner.login': 'google' })

   "stargazers_count": 12936,
   "watchers_count": 12936,

  2. google/incremental-dom

https://api.github.com/repos/google/incremental-dom

"stargazers_count": 2684,
"watchers_count": 2684,

db.repos.find({ name: 'incremental-dom', 'owner.login': 'google' })

"stargazers_count":1314,
"watchers_count":1314,

  3. google/binnavi

https://api.github.com/repos/google/binnavi

"stargazers_count": 2187,
"watchers_count": 2187,

db.repos.find({ name: 'binnavi', 'owner.login': 'google' })

   "stargazers_count":1183,
   "watchers_count":1183,

We at Black Duck were wondering if there is a way to identify repositories that have not been updated recently, and would like to help the GHTorrent team in any way possible to stay as up to date as possible. Please reach out to us here or email me directly at [email protected] so we can try to find a solution that can benefit both of our organizations.

Best,
Andrew Colello
Software Engineer, Knowledgebase Team
Black Duck Software
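
One way to flag candidates for a refresh is to compare a live count with the mirrored one and apply a threshold. The sketch below uses the stargazers_count values quoted in the report above. The function name `stale?` and the 10 percent tolerance are arbitrary choices for illustration, not something GHTorrent provides.

```ruby
# Flags a mirrored record as stale when the live count differs from the
# mirrored count by more than `tolerance` (a fraction; 0.1 means 10 percent).
# The threshold is an arbitrary choice for this sketch.
def stale?(live_count, mirrored_count, tolerance: 0.1)
  return live_count.positive? if mirrored_count.zero?

  (live_count - mirrored_count).abs.to_f / mirrored_count > tolerance
end

# stargazers_count values quoted in the report above: [live API, MongoDB].
samples = {
  'google/material-design-lite' => [28_104, 12_936],
  'google/incremental-dom'      => [2_684, 1_314],
  'google/binnavi'              => [2_187, 1_183]
}

samples.each do |repo, (live, mirrored)|
  puts "#{repo}: stale=#{stale?(live, mirrored)}"
end
```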

Schema and database fields of pull_requests do not agree

The field user_id listed in the schema is not part of the pull_requests table.

mysql> describe pull_requests;
+----------------+------------+------+-----+---------+----------------+
| Field          | Type       | Null | Key | Default | Extra          |
+----------------+------------+------+-----+---------+----------------+
| id             | int(11)    | NO   | PRI | NULL    | auto_increment |
| head_repo_id   | int(11)    | YES  | MUL | NULL    |                |
| base_repo_id   | int(11)    | NO   | MUL | NULL    |                |
| head_commit_id | int(11)    | YES  | MUL | NULL    |                |
| base_commit_id | int(11)    | NO   | MUL | NULL    |                |
| pullreq_id     | int(11)    | NO   | MUL | NULL    |                |
| intra_branch   | tinyint(1) | NO   |     | NULL    |                |
+----------------+------------+------+-----+---------+----------------+
7 rows in set (0.00 sec)

Permission problem when restoring the backup

Hi,

I'm in the process of restoring the last backup (20160419), I was following the indications described here and I think there is something missing to make it work.

The execution of the command ght-restore-mysql fails when running the LOAD DATA INFILE statements as the user created for the new schemata does not include the corresponding permission.

I solved it with something like:

GRANT FILE ON *.* TO 'ghtorrentuser'@'localhost';

Maybe the problem appears only with specific database configurations. In my case, I'm using MySQL Community Server 5.6.25 on Linux.

Error or misunderstanding

Hi there,

I'm finishing my Phd on computer science and I will use GHTorrent in my thesis.

I'm having trouble understanding why there are some differences between the data coming from github and the data coming from ghtorrent.

Let's give an example:

1 - Clone https://github.com/vaadin/framework
2 - Take the first commit by doing git rev-list --max-parents=0 HEAD : d0b04c7fb28acc39ceeb63ea0c22f8568e7ca81d
3 - Search for this sha in the project_commit table
4 - It is not associated with the https://github.com/vaadin/framework project in ghtorrent. It is associated with 2 other projects that are copies of the vaadin project.

Could you please help me understand why the commit d0b04c7fb28acc39ceeb63ea0c22f8568e7ca81d is not associated with the project at https://github.com/vaadin/framework inside ghtorrent?

Thank you very much.

Data in the "User" table is not up-to-date

I've downloaded and restored the mysql dump from "mysql-2018-02-01".

When I query certain known users in the "User" table, I noticed that the information contained in the table is not up-to-date. For example, the "location" field for my record (i.e., login=shehan) is Null. This was correct when I created my profile, but I updated this profile attribute in 2016. So, the latest "User" table should not show a Null value.

Is this a known issue with the GHTorrent data? Can you let me know how I can obtain a User table with the most recent data?

Thanks!

Labels for pull requests.

Pull requests are a special class of issue. However some repositories seem to be missing labels attached with pull requests. I would propose that we ensure that labels are attached to pull requests the same way they are attached to regular issues.

ght-retrieve-repo breaks while fetching no longer existing user/fork

WARN, 2018-03-02T15:22:53-08:00, ghtorrent -- api_client.rb: Failed request. URL: https://api.github.com/repos/Crockchartering/twitter.github.com, Status code: 404, Status: Not Found, Access: 336331c85fc, IP: 0.0.0.0, Remaining: 4367
ERROR, 2018-03-02T15:22:53-08:00, ghtorrent -- full_repo_retriever.rb: Error in stage: ensure_forks, Repo: twitter/twitter.github.com, Message: no implicit conversion of String into Integer
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:694:in `[]='
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:694:in `block (3 levels) in repo_bound_items'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:683:in `each'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:683:in `block (2 levels) in repo_bound_items'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:679:in `each'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:679:in `block in repo_bound_items'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:670:in `each'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:670:in `repo_bound_items'
/home/gordonl/github-mirror/lib/ghtorrent/retriever.rb:354:in `retrieve_forks'
/home/gordonl/github-mirror/lib/ghtorrent/ghtorrent.rb:1388:in `ensure_forks'
/home/gordonl/github-mirror/lib/ghtorrent/commands/full_repo_retriever.rb:87:in `block in retrieve_full_repo'
/home/gordonl/github-mirror/lib/ghtorrent/commands/full_repo_retriever.rb:84:in `each'
/home/gordonl/github-mirror/lib/ghtorrent/commands/full_repo_retriever.rb:84:in `retrieve_full_repo'
/home/gordonl/github-mirror/lib/ghtorrent/commands/ght_retrieve_repo.rb:31:in `go'
/home/gordonl/github-mirror/lib/ghtorrent/command.rb:66:in `run'
bin/ght-retrieve-repo:6:in `<main>'
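
The source of retriever.rb is not shown here, so the following is only a guess at the failure mode. In Ruby, "no implicit conversion of String into Integer" raised from `[]=` is what happens when a String key is assigned on an Array rather than a Hash, which could occur if the failed request for the deleted fork left a non-Hash item in the list. The function names `tag_with_repo` and `tag_all` are hypothetical.

```ruby
# Assigns String keys on the item, which only works for a Hash.
def tag_with_repo(item, owner, repo)
  item['repo'] = repo
  item['owner'] = owner
  item
end

# Hypothetical guard: only Hash items are tagged; anything else (for
# example an empty Array standing in for a failed request) is skipped.
def tag_all(items, owner, repo)
  items.select { |item| item.is_a?(Hash) }
       .map { |item| tag_with_repo(item, owner, repo) }
end

# Without the guard, a non-Hash item reproduces the message from the log.
begin
  tag_with_repo([], 'twitter', 'twitter.github.com')
rescue TypeError => e
  puts e.message
end

# With the guard, the bad item is dropped and the rest is processed.
puts tag_all([{ 'id' => 1 }, []], 'twitter', 'twitter.github.com').length
```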

ght-restore-mysql: Errcode: 13 "Permission denied" (MariaDB 10.1)

Trying to restore a dump in MariaDB 10.1 I get the error:

acs@~/devel/ghtorrent/dump $ ./ght-restore-mysql -u root -d ghtorrent -p '' .
jue sep 20 22:39:06 CEST 2018 Creating the DB schema
jue sep 20 22:39:09 CEST 2018 Restoring table commit_comments
ERROR 13 (HY000) at line 1: Can't get stat of '/home/acs/devel/ghtorrent/dump/commit_comments.csv' (Errcode: 13 "Permission denied")

I have fixed it with the change:

LOAD DATA LOCAL INFILE inside the ght-restore-mysql script.

It has a performance issue: "When using LOCAL with LOAD DATA, a copy of the file is created in the directory where the MySQL server stores temporary files" (mysql doc) but it works.

trying to extract mysql dump gives an error: gzip: stdin: unexpected end of file

tar zxfv mysql-2017-04-01.tar.gz              
mysql-2017-04-01/                             
mysql-2017-04-01/commit_comments.csv          
mysql-2017-04-01/pull_requests.csv            
mysql-2017-04-01/followers.csv                
mysql-2017-04-01/watchers.csv                 
mysql-2017-04-01/pull_request_comments.csv    
                                              
gzip: stdin: unexpected end of file           
tar: Unexpected EOF in archive                
tar: Unexpected EOF in archive                
tar: Error is not recoverable: exiting now  

Geocoded GitHub Data

Hello,
My organization just utilized the GitHub Mirror (more specifically the GHTorrent SQL Data from 4/2) to generate a dataset detailing programming language popularity by country. In order to do this, I developed a tool to geocode the entirety of the GitHub users database (affix an ISO-2-character country code to each user). I think that geo-tagged information could be a valuable addition to the datasets provided by the GHTorrent Site.

I would like to donate both the geocoded user database as well as the tool I developed to the github mirror community so that others can benefit. If you are interested, let me know the best way to approach providing the data.

Thanks!

Derek

Download error

Hi!
I'm trying to download the 50 GB mysql dump of February 2017, but I'm getting an error while reading bytes. I tried downloading from different networks. Is the server kicking me out?

Not working when gem mongo upgraded to (2.2.1, 1.12.5)

$ ruby -Ilib bin/ght-retrieve-repo -c config.yaml gousiosg github-mirror
Overriding configuration mirror_history_pages_back=5 with new value 1000
WARN, 2016-01-25T09:46:19+00:00, ghtorrent -- ghtorrent.rb: Transaction failed (1 ms)
uninitialized constant Mongo::ConnectionFailure
/home/legend/github-mirror/lib/ghtorrent/adapters/mongo_persister.rb:199:in `rescue in rescue_connection_failure'

It seems that no version was specified for the mongo gem?

SQL error for missing fork parents

When ensure_repo runs on a fork, it looks to ensure the parent repo is also present. If it is not, or is otherwise not available, then this line breaks with an error trying to relate foreign keys.

This scenario can happen if the parent cannot be loaded. For example, the key in use may not have permissions to that repo or there may be a transient error.

What is the right fix to do here? I have not looked at the database enough to grok all the relationships

Access denied on import

Importing data with the documented procedure produces an error, such as the following:

ERROR 1045 (28000) at line 1: Access denied for user 'ghtorrent'@'localhost' (using password: YES)

Extremely high mongodb load

It seems the insert conditions put a very high load and some locks on mongodb. Is there any suitable way to address this?

Errors when using MySQL web

I encounter the following errors when trying to access MySQL web (http://ghtorrent.org/dblite/)

I login as a 'Guest' and encounter the below error when I try to expand a table from the Database Explorer pane/window:

SQLSTATE[HY000]: General error: 1021 Disk full (/mnt/#sql_1c7d9_0.MAI); waiting for someone to free some space... (errno: 28 "No space left on device

When I try to execute a select query (e.g., select * from users limit 5;) I encounter the following error:

SQLSTATE[42000]: Syntax error or access violation: 1142 SELECT command denied to user 'ghtro'@'web' for table 'users'

Thanks!

Query a repository's license

Hi,

I'm currently doing research into developers' locations and the types of licenses they use. GHTorrent is a valuable source of information for me, but I noticed in the MySQL web interface that repositories' licenses are not currently retrieved / stored. Is this correct?

I can probably work around this by writing some code of my own, but I think it will be a valuable addition to GHTorrent. The only change that seems to be required is to provide a custom media type in the Accept header (application/vnd.github.drax-preview+json), see GitHub's documentation. I noticed that just recently you added support for custom media types, so I hope this wouldn't be too much work.

Thank you for providing GHTorrent, it is already a valuable resource for me!
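
The custom media type mentioned above would be sent in the Accept header of the repository request. The sketch below builds such a request with Ruby's standard Net::HTTP classes without sending it. The function name `repo_request` is hypothetical, and the media type is the one quoted in the request; it was a preview type and may no longer be needed or accepted by the current API.

```ruby
require 'net/http'
require 'uri'

# Media type quoted in the request above (a preview type at the time).
LICENSE_MEDIA_TYPE = 'application/vnd.github.drax-preview+json'

# Builds (but does not send) a GET request for a repository with a custom
# Accept header.
def repo_request(owner, repo, media_type = LICENSE_MEDIA_TYPE)
  uri = URI("https://api.github.com/repos/#{owner}/#{repo}")
  request = Net::HTTP::Get.new(uri)
  request['Accept'] = media_type
  request
end

req = repo_request('gousiosg', 'github-mirror')
puts req.path
puts req['Accept']
```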

Why project_commits?

Hi!
Why does project_commits exist if there is a project_id in each row of commits?

Confusion in the schema of RDBMS

Hello,

This is regarding the schema given here:
http://ghtorrent.org/files/schema.pdf

The id key is used in many tables and I am unable to understand it clearly. Consider the following cases:

  1. Is the primary key "id" used in the projects table the same field used in repo_labels or repo_milestones? I am unable to understand how projects and repos are related in the schema. Semantically, a project can consist of many repositories in github.
  2. The repo_labels table has two fields: id and repo_id. The project members table has repo_id as part of its primary key, while the projects table does not refer to repo_id at all.
  3. Watchers and issues are associated with repo_id while commits are associated with project_id. The projects table does not have project_id in its attribute list.

I have imported the sql dump and had hoped that firing a few sql queries would resolve the confusion, but it hasn't. Please help!

Thanks and Regards,
Ayushi

mislabelled user type

There seems to be a great number of individual users that are labeled as 'Organization'. e.g., SureShinde

Inconsistency on tables

I'm using GHTorrent on Google BigQuery (https://bigquery.cloud.google.com/table/ghtorrent-bq:ght_2017_04_01.project_languages?pli=1)

I've found an inconsistency between tables. I queried the projects of a specific language with the most commits, using the table project_languages. But when I query the table projects, the column "language" sometimes shows another language. Example: I queried the top Java projects ordered by number of commits. When I look up the project_id from that query in the table projects, the column "language" shows another language, like C.

Now I'm lost. Which field is more reliable? Likewise, there are a lot of commits from 1994. Are they real?

Non-commit entities not stored in MySQL database

When running ght-retrieve-repo, while commits are successfully stored in the database, issues, pull_requests, etc. are fetched but not stored, even when providing the -y option. I notice in the logs that while ghtorrent.rb is being used to add commits to the database when retrieving them, this is not the case with the other entities.

commits:

...
INFO, 2018-02-27T16:13:45-08:00, ghtorrent -- api_client.rb: Successful request. URL: https://api.github.com/repos/twitter/twemoji/commits/72b5e44e092d910629547cbc6886127901fb81d8?per_page=100, Remaining: 3098, Total: 278 ms
INFO, 2018-02-27T16:13:45-08:00, ghtorrent -- retriever.rb: Added commit twitter/twemoji -> 72b5e44e092d910629547cbc6886127901fb81d8
INFO, 2018-02-27T16:13:45-08:00, ghtorrent -- ghtorrent.rb: Added commit twitter/twemoji -> 72b5e44e092d910629547cbc6886127901fb81d8 
INFO, 2018-02-27T16:13:45-08:00, ghtorrent -- ghtorrent.rb: Added commit_parent 72b5e44e092d910629547cbc6886127901fb81d8 to commit 2d8c1a7e7243c76aa53db8f018dcbdb994d22024
...

pull requests:

...
INFO, 2018-02-27T16:09:54-08:00, ghtorrent -- api_client.rb: Successful request. URL: https://api.github.com/repos/twitter/twemoji/pulls/225, Remaining: 3284, Total: 733 ms
INFO, 2018-02-27T16:09:54-08:00, ghtorrent -- retriever.rb: Added pull_requests twitter/twemoji -> 225
INFO, 2018-02-27T16:09:55-08:00, ghtorrent -- api_client.rb: Successful request. URL: https://api.github.com/repos/twitter/twemoji/pulls/219, Remaining: 3283, Total: 870 ms
INFO, 2018-02-27T16:09:55-08:00, ghtorrent -- retriever.rb: Added pull_requests twitter/twemoji -> 219
...

PTY allocation request

@gousiosg Hello,

after using the service for a while I get following response when running
ssh -L 3306:web.ghtorrent.org:3306 [email protected]
response:
PTY allocation request failed on channel 2

Therefore I can't connect to mysql database.
MongoDB works fine.

Thanks

SQLite3::SQLException: AUTOINCREMENT is only allowed on an INTEGER PRIMARY KEY

Hi, while running the command ght-retrieve-repo user repo I received the following error:

Overriding configuration mirror_history_pages_back=5 with new value 1000
Database empty, running migrations from /var/lib/gems/2.1.0/gems/ghtorrent-0.11.1/lib/ghtorrent/migrations
Creating table users
Creating table projects
Creating table commits
Creating table commit_parents
Creating table followers
Adding organization descriminator field to table users
Updating users with default values
Creating table organization-members
Adding table commit comments
Adding table project members
Adding table watchers
Adding table pull requests
Adding table pull request history
Adding table pull request commits
Adding table pull request comments
Adding unique(name, owner) constraint to table projects
Create table project_commits
Migrating data from commits to project_commits
Adding table forks
Adding table issues
Adding issue history
SQLite3::SQLException: AUTOINCREMENT is only allowed on an INTEGER PRIMARY KEY
/var/lib/gems/2.1.0/gems/sqlite3-1.3.11/lib/sqlite3/database.rb:91:in `initialize'

Can I resolve this problem?

Incorrect number of Pull Requests in table pull_request_history

Using the query

SELECT action, COUNT(*) as freq, YEAR(created_at) as pr_yr, MONTH(created_at) as pr_mt
FROM
  [ghtorrent-bq:ght_2017_09_01.pull_request_history] 
GROUP BY pr_yr, pr_mt, action

I get a monthly digest of opened and merged PRs events on GitHub. Unfortunately, plotting this does not yield a very plausible graph:

[Plot: rplot-1]

I was able to narrow down when things seem to have gone wrong for the number of merged pull requests: the first month of 2014.

[Plot: rplot05-1]

Later, the sharp drop in mid-2016 also seems questionable.

I hope this helps you debug the issue.

(Possibly related to #19.)

Issues importing mysql data -- Importing instructions

Hi,

Thanks for the awesome project!

I am having a hard time importing the massive mysql file. It starts out fast but after 20% or so the import speed drops significantly. It seems like it will never complete because the rate of import drops faster than the progress.

I am wondering if you can include instructions in the readme/website on how to import the data. Are there mysql config changes that I need to make? What are the minimum system requirements?

Thanks!

Cannot find user [email protected]:gousiosg/github-mirror.git.

ght-retrieve-repo -c ~/config.yaml -s 'username' -p 'password' [email protected]:gousiosg/github-mirror.git

fail to connect with

g/github-mirror.git
Overriding configuration github_username=username with new value username
Overriding configuration github_passwd=password with new value password
Overriding configuration mirror_history_pages_back=5 with new value 1000
WARN, 2017-07-19T14:23:37+02:00, ghtorrent -- ghtorrent.rb: Not a valid email address: [email protected]:gousiosg/github-mirror.git
Error: Cannot find user [email protected]:gousiosg/github-mirror.git.
Try --help for help.
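
Other invocations in this document (for example `ght-retrieve-repo -c config.yaml gousiosg github-mirror`) pass the owner and the repository name as two separate arguments, so a full remote string is treated as a user name. A wrapper could split a remote into those two parts first. The sketch below is hypothetical: the names `REMOTE_PATTERN` and `owner_and_repo` are not part of GHTorrent, and it assumes the obfuscated argument above was a standard SSH remote of the form git@github.com:owner/repo.git.

```ruby
# Matches "owner/repo", an SSH remote or an HTTPS remote, with an
# optional ".git" suffix. Captures the owner and the repository name.
REMOTE_PATTERN = %r{\A(?:git@github\.com:|https://github\.com/)?([^/\s]+)/([^/\s]+?)(?:\.git)?\z}

# Returns [owner, repo], or nil when the input matches none of the forms.
def owner_and_repo(remote)
  match = REMOTE_PATTERN.match(remote)
  match && [match[1], match[2]]
end

puts owner_and_repo('git@github.com:gousiosg/github-mirror.git').inspect
puts owner_and_repo('gousiosg/github-mirror').inspect
```

The two returned values can then be passed to ght-retrieve-repo as separate arguments.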

restore script doesn't allow change of storage engine

When I try to use InnoDB I get an error.

(venv)greg@Ithilien:~/Documents/ecs260/dump/mod_dump$ ./ght-restore-mysql -uroot -proot -eInnoDB .
./ght-restore-mysql: illegal option -- e
Invalid option: -
Usage: ./ght-restore-mysql [-u dbuser ] [-p dbpasswd ] [-h dbhost] [-d database ] dump_dir

Am I doing something wrong here?

Missing torrents

Hi,

Under 'Available Downloads', it says 'List of available torrents (Last dump date: 2014-11-29)' but there are no torrents listed.

Can you update the site with a list of available .torrent files?

Thanks!

How to check if my schema creation (through ght_restore_mysql script) is complete

I ran the 'ght-restore-mysql' script on my remote server and then shifted the process to run in the background and disowned it. ( I pressed ctrl+Z, then command 'bg', 'disown' ). The problem is I think my schema creation is completed but the bash script still gets shown as a running process with a status S. And as I didn't use any nohup command or such, I don't have an output log where I can check any exit code.

So, I was thinking if there's any way to check if my schema creation is completed or else I'd have to run the script again to be completely sure.

I used the latest June dump (mysql-2018-06-01) and after running the script,
my ghtorrent database shows a size of 357.33 GB.

+-----------------------+---------------+
| DB Name               | DB size in MB |
+-----------------------+---------------+
| ghtorrent_june01_2018 |      365905.6 |

Can anyone confirm if the size I get is accurate?

Or is there any other way to check whether the schema creation is complete?

SQLite3::SQLException in the first attempt

Have I done something wrong?

E:\github\github-mirror>ght-retrieve-repo -c config.yaml gousiosg github-mirror
Overriding configuration mirror_history_pages_back=5 with new value 1000
Database empty, running migrations from D:/RailsInstaller/Ruby2.1.0/lib/ruby/gems/2.1.0/gems/ghtorrent-0.11.1/lib/ghtorrent/migrations
Creating table users
Creating table projects
Creating table commits
Creating table commit_parents
Creating table followers
Adding organization descriminator field to table users
Updating users with default values
Creating table organization-members
Adding table commit comments
Adding table project members
Adding table watchers
Adding table pull requests
Adding table pull request history
Adding table pull request commits
Adding table pull request comments
Adding unique(name, owner) constraint to table projects
Create table project_commits
Migrating data from commits to project_commits
Adding table forks
Adding table issues
Adding issue history
SQLite3::SQLException: AUTOINCREMENT is only allowed on an INTEGER PRIMARY KEY
D:/RailsInstaller/Ruby2.1.0/lib/ruby/gems/2.1.0/gems/sqlite3-1.3.11-x86-mingw32/lib/sqlite3/database.rb:91:in `initialize'

The result is that only user appears in the MongoDB.
