Replication package for our work on "Taxing Collaborative Software Engineering"
This replication package requires Python 3.10 or higher. Install the dependencies via
python3 -m pip install -r requirements.txt
For faster loading, we recommend optionally installing orjson via pip:
python3 -m pip install orjson
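The optional dependency can be handled with a guarded import. The following is a minimal sketch of that pattern, not necessarily how this package implements it; the helper name `load_json` and the sample record are hypothetical.

```python
import json

try:
    import orjson  # optional, faster JSON parser
except ImportError:
    orjson = None  # fall back to the standard library


def load_json(raw: bytes):
    """Parse JSON bytes with orjson when it is installed, else with json."""
    if orjson is not None:
        return orjson.loads(raw)
    return json.loads(raw)


# Hypothetical sample record, only for illustration.
record = load_json(b'{"event": "commented", "id": 1}')
print(record["event"])
```

Both parsers return the same Python objects for this input, so the rest of the code does not need to know which one is in use.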
First, we collect the timelines of all pull requests from a GitHub instance. The script crawler.py requires an <api_token> for your GitHub instance and an <out_dir> in which the results are stored:
python3 crawl.py <api_token> <out_dir>
crawler.py also provides the following optional command-line arguments:
- --api_url for the GitHub instance URL (default: https://api.github.com)
- --disable_cache to disable caching (not recommended for larger instances)
- --num_workers for the number of parallel processes (default: 1)
- --organization to limit the crawl to one organization (helpful for organizations hosted on github.com)
To list all options in detail, run
python3 crawl.py --h
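To illustrate how the arguments described above fit together, here is a hypothetical sketch of such a command-line interface using argparse. It assumes that --disable_cache is a boolean flag and that --num_workers takes an integer; the actual script may declare its options differently, and the values passed at the end are placeholders.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments."""
    parser = argparse.ArgumentParser(description="Sketch of the crawler CLI")
    parser.add_argument("api_token", help="API token for the GitHub instance")
    parser.add_argument("out_dir", help="directory in which the results are stored")
    parser.add_argument("--api_url", default="https://api.github.com",
                        help="GitHub instance URL")
    parser.add_argument("--disable_cache", action="store_true",
                        help="disable caching")
    parser.add_argument("--num_workers", type=int, default=1,
                        help="number of parallel processes")
    parser.add_argument("--organization", default=None,
                        help="limit the crawl to one organization")
    return parser


# Placeholder values, only for illustration.
args = build_parser().parse_args(["TOKEN", "out", "--num_workers", "4"])
print(args.api_url, args.num_workers, args.disable_cache, args.organization)
```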
For this step, you will need
- the directory of the previously collected data and
- a mapping of users to countries. This can be either a dict for a static mapping (does not capture changes in the users' location over time) or a monthly sampled data frame for a time-dependent mapping (captures changes in the users' location over time).
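The two kinds of mapping could look roughly as follows. This is a sketch under assumptions: the user names, country codes, and the data frame layout (one row per month, one column per user) are hypothetical, and the inline comments in the notebook define the layout that is actually expected.

```python
import pandas as pd

# Static mapping: one country per user, constant over time.
static_mapping = {"alice": "SE", "bob": "DE"}

# Time-dependent mapping: one row per month, one column per user (assumed layout).
months = pd.period_range("2023-01", "2023-03", freq="M")
time_dependent_mapping = pd.DataFrame(
    {"alice": ["SE", "SE", "US"], "bob": ["DE", "DE", "DE"]},
    index=months,
)


def country_of(user: str, timestamp: str) -> str:
    """Look up the user's country in the month that contains the timestamp."""
    month = pd.Timestamp(timestamp).to_period("M")
    return time_dependent_mapping.loc[month, user]


print(static_mapping["alice"])
print(country_of("alice", "2023-03-15"))
```

In this example the static mapping always reports the same country for a user, whereas the time-dependent mapping reflects a relocation from one month to the next.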
Run notebook.ipynb and follow the instructions given as inline comments.
Copyright © 2023 Michael Dorner
This work is licensed under the MIT license.