OSD status dashboard (wp2.2_dev)

Demonstrator data-mining backend for an open source development status dashboard

Targeted at hosters of version control platforms (such as Wikifactory, GitLab, or GitHub), this Python backend program mines open source hardware repositories for metadata and calculates metrics based on it. The backend exposes a representational state transfer (REST) application programming interface (API) to which requests for those metrics can be made.

This software is not for general consumers to just "double click" on and install on their devices.

Please see the Install and Usage sections to get up and running with this tool.

Background

Today’s industrial product creation is expensive, risky and unsustainable. At the same time, the process is highly inaccessible to consumers who have very little input in the design and distribution of the finished product. Presently, SMEs and maker communities across Europe are coming together to fundamentally change the way we create, produce, and distribute products.

OPENNEXT is a collaboration between 19 industry and academic partners across Europe. Funded by the European Union's Horizon 2020 programme, this project seeks to enable small and medium enterprises (SMEs) to work with consumers, makers, and other communities in rethinking how products are designed and produced. Open source hardware is a key enabler of this goal, where the design of a physical product is released with the freedoms for anyone to study, modify, share, and redistribute copies. These essential freedoms are based on those of open source software, which is itself derived from free software, where the word free refers to freedom, not free-of-charge. Put into practice, these freedoms could not only reduce proprietary vendor lock-in, planned obsolescence, and waste, but also stimulate novel, even disruptive, business models. The SME partners in OPENNEXT are experimenting with producing open source hardware and even opening up the development process to wider community participation. They produce diverse products, ranging from desks and cargo bike modules to a digital scientific instrument platform (and more).

Work package 2 (WP2) of OPENNEXT is gathering theoretical and practical insights on best practices for company-community collaboration when developing open source hardware. This includes running Delphi studies to develop a maturity model to describe the collaboration, and developing a precise definition for what the "source" is in open source hardware. In particular, task 2.2 in this work package is developing a demonstration project status dashboard with "health" indicators showing the evolution of a project within the maturity model, its design activities, or its progress towards success based on project goals. Details of the dashboard's technical architecture are described in the deliverable 2.5 (D2.5) report.

This repository contains the backend code for D2.5. To be clear, this deliverable is: designed to be deployed on a server operated by version control platforms such as Wikifactory or GitHub.

This deliverable is not: For general end-users to install on consumer devices and "double click" to open.

In addition, this repository aims to follow international standards and good practices in open source development, such as the standard-readme specification and the REUSE specification for licensing metadata.

Install

This section assumes knowledge of Python, Git, and using a GNU/Linux-based server including installing software from package managers and running a terminal session.

Note: This software is designed to be deployed on a server by system administrators or developers, not on generic consumer devices.

This project requires Python version 3.10 or later on your server; running it in a Python virtual environment is optional but recommended. Detailed external library dependencies are listed in the standard-conformant requirements.txt file.

In addition to Python and the dependencies listed above, the following programs must be installed and accessible from the command line:

  • git (version 2.7.4 or later)
  • pip (version 19.3.1 or later)
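
As noted above, running in a Python virtual environment is optional but recommended. A minimal sketch of creating and activating one before installing dependencies (assuming python3 on your server is version 3.10 or later):

python3 -m venv .venv
source .venv/bin/activate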

A GitHub personal access token is required to be available as the environment variable GITHUB_TOKEN, because the Python scripts use it for GitHub API queries. This token is an alphanumeric string in the form of "ghp_2D5TYFikFsQ4U9KPfzHyvigMycePCPqkPgWc".
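
For example, the token (the placeholder value below is the example string from this README, not a real token) can be exported in the shell session that will run the server:

export GITHUB_TOKEN="ghp_2D5TYFikFsQ4U9KPfzHyvigMycePCPqkPgWc"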

Running from source

The code can be run from source and has been tested on updated versions of GNU/Linux server operating systems, including Red Hat Enterprise Linux 8.7. While effort has been made to keep the Python scripts platform-agnostic, they have not been tested under other operating systems such as BSD derivatives, Apple macOS, or Microsoft Windows, since they (especially the latter two) are rarely used for hosting code such as this.

On your server, with the tools git and pip installed, run the following commands in a terminal session to retrieve the latest version of this repository and prepare it for development and running locally (usually for testing):

git clone https://github.com/OPEN-NEXT/wp2.2_dev.git
cd wp2.2_dev
pip install --user -r requirements.txt

The git command downloads the files in this repository onto your server into a directory named wp2.2_dev, cd changes into that directory, and pip installs the Python dependencies listed in requirements.txt.

In a terminal window at the root directory of this repository, start the server with the uvicorn Asynchronous Server Gateway Interface (ASGI) server by running this command:

uvicorn oshminer.main:app --reload

There will be some commandline output which ends with something like the following line:

INFO:     Application startup complete.

This means the server API is up and running, and should be accessible on your local machine at 127.0.0.1 on port 8000.

Deploy as container

There is a Dockerfile in this repository that defines a container within which this code can run.

To build and use the container, you need to have programs like Podman or Docker installed.

With the repository cloned by git onto your system, navigate to it and build the container with this command:

podman build -t wp22dev ./ --format=docker

Replace the command podman with docker depending on which one is available (this project has been tested with Podman 4.0.2), and wp22dev can be replaced with any other name. --format=docker is needed to explicitly build this as a Docker-formatted container that will be accepted by cloud services like Heroku.

Then run the container on port 8000 at 127.0.0.1 with this command:

podman run --env PORT=8000 --env GITHUB_TOKEN=[token] -p 127.0.0.1:8000:8000 -d wp22dev

Where [token] is the 40-character alphanumeric string of your GitHub API personal access token. It is in the form of "ghp_2D5TYFikFsQ4U9KPfzHyvigMycePCPqkPgWc".

Heroku deployment example

The image built this way can be pushed to cloud hosting providers such as Heroku. With Heroku as an example:

  1. Set up an empty app from your Heroku dashboard.

  2. In the Settings page for your Heroku app, set a Config Var with Key "GITHUB_TOKEN" and Value being your GitHub API personal access token.

  3. With the Heroku commandline interface installed, first log in from your terminal:

heroku container:login

  4. Push the container image built above to your Heroku app:

podman push wp22dev registry.heroku.com/[your app name]/web

  5. Release the pushed container into production:

heroku container:release web --app=[your app name]

Fly.io example

Similar to Heroku, the container image created above can be deployed to an app on Fly.io. Assuming a Fly.io account has already been created:

  1. Log in to Fly.io in a terminal session:

flyctl auth login

  2. Launch a new app. Run the following command, which will ask for an app name. Enter [your app name], replacing it with whatever name you'd like:

flyctl launch

  3. Authorise pushing a container image to the Fly.io image registry:

flyctl auth docker

  4. Push the locally built image to the remote Fly.io image registry:

podman push wp22dev registry.fly.io/[your app name]

  5. Deploy the app:

flyctl deploy --image registry.fly.io/[your app name]

  6. Set the GitHub API personal access token as an environment variable:

flyctl secrets set GITHUB_TOKEN=[token]

Where [token] is the 40-character alphanumeric string of your GitHub API personal access token. It is in the form of "ghp_2D5TYFikFsQ4U9KPfzHyvigMycePCPqkPgWc".

A demo of this is hosted on Fly.io with this API endpoint:

https://wp22dev.fly.dev/data

This demo instance will go into a sleep state after a period of inactivity (approximately 30 minutes at the time of writing). If your API calls to this endpoint are taking more than a few seconds, it might be the demo waking from that state.

Usage

The backend server listens to requests for information about a list of open source hardware (and software) repositories hosted on Wikifactory or GitHub.

Making requests to the REST API

Requests to the API are GET requests with a JSON payload, sent to the /data endpoint.

There are two components to each request:

  1. repo_urls: An array of strings of repository URLs, such as https://wikifactory.com/+elektricworks/pikon-telescope. Currently, metadata retrieval is implemented for Wikifactory project and GitHub repository URLs. Each Wikifactory URL is composed of the Wikifactory domain (wikifactory.com), space (e.g. +elektricworks), and project (e.g. pikon-telescope).

  2. requested_data: An array of strings representing the types of repository metrics desired for each repository. Currently, the following are implemented for Wikifactory projects:

    1. files_info: The numbers and proportions of mechanical and electronic computer-assisted design (CAD), image, data, document, and other file types in the repository.
    2. files_editability: Basic information about how "editable" the CAD files are in this repository.
    3. license: The license for the repository.
    4. tags: Aggregated tags for the repository and any associated with the maintainers of that repository.
    5. commits_level: The hash identifier (contribution id for Wikifactory projects) and timestamp of each commit to the repository. This can be used to graph the commit activity level in a frontend visualisation. Note: This will be based on commits from the first three detected branches in the repository, including the default branch, because requesting commits across many branches takes a long time and API calls might time out. Also note that Wikifactory does not implement branches, so it will behave as if there is only one branch.
    6. issues_level: Similar to commits_level, but for all issues in the repository.

The following is an example request that could be sent to the API for three Wikifactory projects:

{
    "repo_urls": [
        "https://wikifactory.com/+dronecoria/dronecoria-frame", 
        "https://wikifactory.com/@luzleanne/community-composter", 
        "https://wikifactory.com/+elektricworks/pikon-telescope"
    ], 
    "requested_data": [
        "files_info", 
        "files_editability", 
        "license", 
        "tags",
        "commits_level", 
        "issues_level"
    ]
}
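
For illustration, if the request above is saved to a file named request.json (a hypothetical filename), it can be sent to a locally running instance with curl. Note that this API expects the JSON body on a GET request, which curl allows when the method is forced with -X GET:

curl -X GET -H "Content-Type: application/json" -d @request.json http://127.0.0.1:8000/data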

API response format

The API will respond with a JSON array containing the requested_data for each repository in repo_urls.

Specifically, for each repository, the response will include:

  • repository: String containing the repository URL.
  • platform: String, only Wikifactory for now.
  • requested_data: Object containing the following:
    • files_editability: Object containing the following:
      • files_count: Integer number of (presumed to be) CAD files that are not text documents or data files (like CSV).
      • files_openness: Object containing the following:
        • open: Integer number of files using open formats.
        • closed: Integer number of files using closed/proprietary formats.
        • other: Integer number of files not categorised in either of the above.
      • files_encoding: Object containing the following:
        • binary: Integer number of files using binary formats.
        • text: Integer number of files using text-based formats.
        • other: Integer number of files not categorised in either of the above.
    • files_info: Object containing the following:
      • total_files: Integer of total number of files in the repository.
      • ecad_files: Integer number of electronic CAD files.
      • mcad_files: Integer number of mechanical CAD files.
      • image_files: Integer number of image files.
      • data_files: Integer number of data files.
      • document_files: Integer number of documentation files.
      • other_files: Integer number of other types of files.
      • ecad_proportion: Floating point proportion of electronic CAD files.
      • mcad_proportion: Floating point proportion of mechanical CAD files.
      • image_proportion: Floating point proportion of image files.
      • data_proportion: Floating point proportion of data files.
      • document_proportion: Floating point proportion of documentation files.
      • other_proportion: Floating point proportion of other types of files.
    • license: Object containing license information:
      • key: String of license identifier. Currently the same as spdx_id.
      • name: Full name of license.
      • spdx_id: String of the SPDX license identifier.
      • url: URL to license text.
      • node_id: For some licenses, this will be an identifier in GitHub's license list.
      • html_url: URL to license information.
      • permissions: Array of strings containing the permissions given by the license, which could include:
        • commercial-use: This work and derivatives may be used for commercial purposes.
        • modifications: This work may be modified.
        • distribution: This work may be distributed.
        • private-use: This work may be used and modified in private.
        • patent-use: This license provides an express grant of patent rights from contributors.
      • conditions: Array of strings expressing the conditions under which the work could be used, which could include a combination of:
        • include-copyright: A copy of the license and copyright notice must be included with the work.
        • include-copyright--source: A copy of the license and copyright notice must be included with the work when distributed in source form.
        • document-changes: Changes made to the source/documentation must be documented.
        • disclose-source: Source code/documentation must be made available when the work is distributed.
        • network-use-disclose: Users who interact with software via network are given the right to receive a copy of the source code.
        • same-license: Modifications must be released under the same license when distributing the work. In some cases a similar or related license may be used.
        • same-license--file: Modifications of existing files must be released under the same license when distributing the work. In some cases a similar or related license may be used.
        • same-license--library: Modifications must be released under the same license when distributing software. In some cases a similar or related license may be used, or this condition may not apply to works that use the software as a library.
      • limitations: Limitations of the license, which could include a combination of:
        • trademark-use: This license explicitly states that it does NOT grant trademark rights, even though licenses without such a statement probably do not grant any implicit trademark rights.
        • liability: This license includes a limitation of liability.
        • patent-use: This license explicitly states that it does NOT grant any rights in the patents of contributors.
        • warranty: The license explicitly states that it does NOT provide any warranty.
    • tags: Aggregated array of strings representing the tags associated with the repository, and tags associated with users who are maintainers/owners of the repository. The implementation of this might change as Wikifactory implements their skill-based matchmaking features.
      • Examples: open-source, raspberry-pi, space, 3d-printing
    • commits_level: Array of objects representing commits (contributions in Wikifactory), where each one would contain:
      • hash: A string: for Git-based repositories, the unique hash identifier of the commit; for Wikifactory, the id field of the contribution.
      • committed: String containing the timestamp for the commit in ISO 8601 format, e.g. 2018-04-25T20:35:59.614973+00:00.
    • issues_level: Array of objects representing issues, where each one would contain:
      • id: String containing the URL to the issue.
      • published: String containing the creation date of the issue in ISO 8601 format, e.g. 2018-04-25T20:35:59.614973+00:00.
      • isResolved: Boolean (true or false) of whether the issue has been marked as closed or resolved.
      • resolved: String containing an ISO 8601 formatted timestamp representing the last time there was activity in the issue (such as comments) or, if the issue isResolved, the time it was resolved.
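
For illustration, an abridged response for a single repository might be shaped as follows (all values are invented, and most fields are omitted for brevity; the full field list is given above):

[
    {
        "repository": "https://wikifactory.com/+elektricworks/pikon-telescope",
        "platform": "Wikifactory",
        "requested_data": {
            "files_info": {
                "total_files": 10,
                "mcad_files": 4,
                "mcad_proportion": 0.4
            },
            "license": {
                "key": "CERN-OHL-S-2.0",
                "spdx_id": "CERN-OHL-S-2.0"
            },
            "commits_level": [
                {
                    "hash": "f06ab81",
                    "committed": "2018-04-25T20:35:59.614973+00:00"
                }
            ]
        }
    }
]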

Notes:

  • For files_editability above, filetypes are identified by file extensions. The categories and mapping are documented in oshminer/filetypes.py, and can be traced to the osh-file-types list by Open Source Ecology Germany.
  • For files_info above, filetypes are identified by file extensions. The categories and mapping are located in oshminer/filetypes.py.
  • The license information and formatting is largely based on that from the GitHub-managed choosealicense.com repository, with the exception of some open source hardware licenses which were manually added.

Custom Wikifactory URLs

By default, this tool will:

  1. Identify a provided repository URL in the JSON request body as a Wikifactory project if it is under the domain wikifactory.com
  2. Use the public Wikifactory GraphQL API endpoint at https://wikifactory.com/api/graphql

Both can be customised with the following environment variables during deployment:

  1. WIF_BASE_URL - (default: wikifactory.com) The base domain used for pattern-matching and identifying Wikifactory project URLs in the JSON request body, in the form of example.com. If this is customised, then the requested Wikifactory project URLs passed to this tool should also use that domain instead of wikifactory.com. Otherwise, a "Repository URL domain not supported" error will be returned.
  2. WIF_API_URL - (default: https://wikifactory.com/api/graphql) The full URL of the GraphQL API endpoint for queries about Wikifactory projects, in the form of https://example.com[:port]/foo/bar.
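
For example, to run the container described above against a hypothetical self-hosted Wikifactory instance at example.com, these variables can be passed alongside the others:

podman run --env PORT=8000 --env GITHUB_TOKEN=[token] --env WIF_BASE_URL=example.com --env WIF_API_URL=https://example.com/api/graphql -p 127.0.0.1:8000:8000 -d wp22dev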

Maintainers

Dr Pen-Yuan Hsing (@penyuan) is the current maintainer.

Dr Jérémy Bonvoisin (@jbon) was a previous maintainer who contributed greatly to this repository during the first year of the OPENNEXT project and is now an external advisor.

Contributing

Thank you in advance for your contribution. Please open an issue or submit a GitHub pull request. For more details, please look at CONTRIBUTING.md.

This project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by the Contributor Covenant Code of Conduct 2.0.

Acknowledgements

The maintainer would like to gratefully acknowledge:

  • Dr Jérémy Bonvoisin (@jbon) not only for the initial contributions to this work, but also for continued practical and theoretical insight, generosity, and guidance.
  • Dr Elies Dekoninck (@elies30) and Rafaella Antoniou (@rafaellaantoniou) for valuable feedback and support.
  • Max Kampik (@mkampik), Diego Vaquero, and Andrés Barreiro from Wikifactory for close collaboration, design insights, and technical support throughout the project.
  • OPENNEXT internal reviewers Dr Jean-François Boujut (@boujut) and Martin Häuer (@moedn) for constructive criticism.
  • OPENNEXT project researchers Robert Mies (@MIE5R0), Mehera Hassan (@meherrahassan), and Sonika Gogineni (@GoSFhg) for useful feedback and extensive administrative support.
  • The Linux Foundation CHAOSS group for insights on open source community health metrics.
  • The following people for their valuable feedback via a survey (see D2.5 report for details) (in alphabetical order of last name): Jean-François Boujut (@boujut), Martin Häuer (@moedn), James Jones (CubeSpawn), Max Kampik (@mkampik), Johannes Střelka-Petz.

EU flag

The work in this repository is supported by a European Union Horizon 2020 programme grant (agreement ID 869984).

License

The Python code in this repository is licensed under the GNU AGPLv3 or any later version, © 2022 Pen-Yuan Hsing.

This README is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0), © 2022 Pen-Yuan Hsing.

Details on other files are in the REUSE specification dep5 file.


wp2.2_dev's Issues

Add event time stamps in committer graphs

In order to analyse the chronological evolution of the committer graphs, we don't only need to know how many events two committers have in common, we also need to know when each of these events happened. That way, we'll be able to filter the graph through a time window, instead of computing it for the whole time frame of the project.

I think the best option is to first create the full graph over the entire project timeframe, with some kind of list object (or maybe a pandas dataframe?) as an attribute of each node. From this, with a filtering function, we would be able to prune the graph, excluding events outside a selected time window.

Pull metadata for GitHub issues

In addition to pulling commit histories from GitHub repositories, also get data from their issue trackers (as mentioned in issue #17). This should be useful for investigating organisational architecture and collaboration patterns.

Badges for dashboard

In recent meetings such as this one on 2020-09-16, we discussed "achievement" badges that will be part of the Open!Next month-18 dashboard minimum viable product (MVP).

This issue is to discuss the badges and develop an initial list to be part of the month-18 MVP.

There will be two kinds of badges:

  1. Self-awarded/declared - If you (as a repository/project owner) decide that you meet the criteria for a badge, you can tick a box indicating so yourself. Question: What would be the technical mechanism for creating self-awarded badges, and how would the dashboard detect them?

  2. Automated badges - The backend of the dashboard will decide whether to award a badge based on whether the mined metadata meets certain criteria.

One example of a badge implementation:

As we know from gamification, there can be multiple levels of the same badge. @moedn suggested a DIN SPEC 3105 badge. The three levels might be:

  1. Bronze - You self-declare that your project meets the DIN SPEC.
  2. Silver - You self-declare you've started the certification process. (maybe this can be automated if the DIN SPEC application/certification process will produce metadata for the dashboard's backend to detect???)
  3. Gold - Automatically awarded once DIN SPEC 3105 conformance is confirmed.

Let's discuss the mechanisms for how the badge system will work and come up with an initial set of badges for the month-18 dashboard.

I'll try to post a few of my ideas soon, but please chime in with your thoughts on any of the above at any time.

Switch configuration file format

NOTE: This is probably not very urgent. I am just putting it here so we don't forget to do it one day.

Currently, start.py uses an input JSON configuration file, which doesn't allow comments. To make the configuration file easier to understand, I suggest using a format that allows comments, such as one supported by Python 3's built-in ConfigParser module.

Implement CSV reader

Use Python's standard libraries to load the CSV list of repositories to mine, then prepare the loaded data for the data-mining script.

Prepare git-mining paper by 2020-07-01

Per the email from Jean-François on 2020-04-23, they expect a paper submitted to the Design Science journal by 2020-07-01 for review.

At some point we should discuss and work on this.

Implement logic for mining only new data

Make sure that the data-mining script is only asked to mine data after timestamp x where x is the last timestamp for any given repository that has already been mined and data saved into the JSON file.

The logic needs to (1) consider whether there are repositories listed in the input CSV that have never been mined before; (2) treat repositories separately, because they might have been previously mined at different times; and (3) keep track of which OSH project each repository belongs to.

Refactor checking of required directories into its own function

As stated in the comments in start.py, the code for checking whether the required directories exist (e.g. __DATA__ or those for JSON and GraphML output) is repeated several times. Let's wrap this into one function call, probably at the start of start.py where it checks everything.

As a bonus, if the needed directories don't exist, either create them using sensible defaults or stop execution and ask the user to create and specify them then re-run.

Review dashboard documentation

Jean-François's group generously offered to review the dashboard documentation for M18 because of their work on design reuse.

KeyError for `branch` in build_commit_history.py bug

How to reproduce: Run start.py from the master branch against this repository, i.e. wp2.2_dev.

Symptom: There is a KeyError for the key branch in build_commit_history.py. Here is the traceback:

Finding forks of OPEN-NEXT/wp2.2_dev page 1
There are 1 forks of OPEN-NEXT/wp2.2_dev
retrieving commits in OPEN-NEXT/wp2.2_dev
fetching info at https://github.com/OPEN-NEXT/wp2.2_dev
Traceback (most recent call last):
  File "/home/penyuan/.vscode-oss/extensions/ms-python.python-2020.5.86806/pythonFiles/ptvsd_launcher.py", line 48, in <module>
    main(ptvsdArgs)
  File "/home/penyuan/.vscode-oss/extensions/ms-python.python-2020.5.86806/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/penyuan/.vscode-oss/extensions/ms-python.python-2020.5.86806/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/home/penyuan/.pyenv/versions/3.6.10/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/penyuan/.pyenv/versions/3.6.10/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/penyuan/.pyenv/versions/3.6.10/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/penyuan/devel/git/wp2.2_dev/src/start.py", line 174, in <module>
    main()
  File "/home/penyuan/devel/git/wp2.2_dev/src/start.py", line 116, in main
    build_commit_history(known_commits, commit_history)
  File "/home/penyuan/devel/git/wp2.2_dev/src/build_commit_history.py", line 115, in build_commit_history
    colour = palette_html[branch_names.index(commit_history.nodes[commit]['branch'])]
KeyError: 'branch'
fish: “env PYTHONIOENCODING=UTF-8 PYTH…” terminated by signal SIGTERM (Polite quit request)

Optimize HTML network visualisation and consider alternative libraries

  • Alternative 1: https://towardsdatascience.com/python-interactive-network-visualization-using-networkx-plotly-and-dash-e44749161ed7
  • Alternative 2: https://graph-tool.skewed.de/
  • Alternative 3: vis.js from scratch, https://github.com/marcin-kolda/py-graph-vis
  • Alternative 4: D3.js

What counts as an Issue "participant"?

The GitHub v4 GraphQL API can give you a list of "participants" for each GitHub issue, but there is no published definition of who counts as a participant or what actions make a person a participant, and there's no metadata on the nature and timing of those actions.

Per our recent meeting with @penyuan and @jbon, I think we simply have to manually identify activities surrounding an issue (such as comments/replies, reactions, linked commits, etc.), get that metadata, and identify participants. This should not be overly difficult anyway, because I've already implemented something like this in the GitHub_issues branch.

I'm opening this issue as a place for us to enumerate exactly which interactions we want to use in our community analyses and graph building.

To start things off, here are some obvious examples:

  1. If user B comments on an issue opened by user A, then we create an edge from B to A in the interaction graph.
  2. If user C reacts to user B's comment, create an edge from C to B.
  3. If C authors a change that is committed by A, that also counts as an interaction. But what is the directionality of this edge?!?!
  4. etc.

P.S. Sometimes we might have a user interacting with themselves, such as user A commenting on their own issue. That's fine, but we don't need to add an edge from A to A in this case.

Need place to put files/info related to process facilitation dashboard

I think there should be a place to aggregate information we have regarding the development of the process facilitation dashboard.

So far I can think of four candidates:

  1. The TUB Cloud hosted by TU Berlin (in the appropriate folder)
  2. The Wiki for this git repository
  3. A directory containing Markdown and other files inside this git repository
  4. A project hosted on OSF.io, meaning that if and when we choose to do so, we can easily publish our results there with a citable DOI

I am slightly inclined toward option 3 because it can be version-controlled. Option 4 is also good but I recognise that it requires most of us to learn a new (albeit powerful) tool. Either way I am not super opinionated on this.

Come up with some research queries for ontology

The WP3.3 ontology is relying at least partially on us to come up with some research-focused queries for its Wikibase instance. This includes how the dashboard may query its data. This issue is for brainstorming what those queries are.

One example would be a query for all activities (e.g. commit, issue activity, etc.) in a version control repository recorded in the database, filtered by timestamp. The timestamp filter could be after/before a time, within a time range, or the latest or earliest recorded activity.

Investigate using University of Bath storage to keep Perceval-downloaded data in sync

Our current data mining scripts use the Perceval library to download git repository metadata, which takes up lots of space. @jbon suggested we minimize re-downloading this data by keeping it in a shared space, such as the 1 TB of mountable space provided by the University of Bath.

What if we create a git repository in that disk space and use it to version-control the data we download via Perceval in our data mining scripts?

Understand file change actions codes

For some reason, while most of the dict elements in the "files" list of dicts in a commit contain the keys "action", "added" and "removed", some of them don't.

The meaning of the keys "added" and "removed" is pretty clear, but I don't understand why they would be missing for some commits.

The meaning of the key "action" is clear in most cases, when the value is "A", "D" or "M" (I guess added, deleted, modified). But there are other codes whose meaning is not clear to me. We occasionally see the codes R052, R074, R066, C085, DD, MM.

There are also some other attributes we haven't been using so far ("indexes", "modes") whose meaning is not clear to me.

We should investigate the meaning of all these codes and attributes and why they are sometimes present and sometimes not.


Some examples:

in commit 10ed9335c327415e31b8432b752d5c5d1bb0a4d1 in (a fork of) jbon/github-mining:

       "files": [
           {
               "added": "9",
               "file": "README.md",
               "removed": "4"
           },

The key "action" is missing.

in commit 83872323d9d5b65b723871b6da76e78ff3c73de0 in jurgenwesterhoff/bGeigieNanoKit:

            {
                "action": "R066",
                "file": "hardware/bGeigieNanoKit.brd",
                "indexes": [
                    "634a725",
                    "b787bbd"
                ],
                "modes": [
                    "100755",
                    "100755"
                ],
                "newfile": "hardware/bGeigieNanoKit/bGeigieNanoKit.brd"
            },

The keys "added" and "deleted" are missing.

Author committer distinction in commits

From what I've learned, Git commits include metadata on their authors and committers. They are distinct but previously we have only considered the committer. Most of the time they are the same, but in cases where they are different, we should decide how to handle it.

Per my recent meeting with @jbon, I propose that when analysing commit history, we can build it with the committer information for now. The key is to count commits where author != committer as an interaction when looking at community structure.

Investigate Wikibase as data backend

During the recent meeting on 2020-08-12, @moedn made the interesting suggestion that our data mining scripts could use their upcoming Wikibase instance to store data. This could be useful for us, and also make our resultant data more accessible.

To make this work, we need to provide a clearer picture of what kinds of data we're mining, questions to answer, and metrics so that those setting up the Wikibase instance can build them into the data ontology.

I am creating this issue to help remember this point and investigate further.

build_commit_history.py fails on line 78 for certain repos

I.e. len(refs)!=len(parents).

When I use our mining scripts (e.g. start.py & build_commit_history.py) on certain OSH repositories, build_commit_history.py fails on line 78.

I'll post exactly which repository caused the problem plus more details in my next comment in this issue thread.

Implement identity management

SortingHat might not work in our case per discussion in this issue. So, after our latest meeting, we'd like to try to incorporate @jbon's existing identity management code into this repository.

Update requirements.txt

New Python libraries, such as pyvis and seaborn, are now used in the scripts; they should be added to requirements.txt.

Generate DSMs and File co-edition graphs

related to #18

  • test the script on more repositories
  • feed the script a list of repositories instead of one at a time (new config parameter; issue #19)
  • Identity management (issue #15)
  • Pull github issues (issue #20)
  • Generate interaction graphs based on file co-edition and issue participation (#23)
  • Generate DSMs based on file co-edition and issue participation
    • first: test with considering a link between files when they are showing up in the same commit.
    • second: experiment with weighting between files depending on the frequency of their co-editions within commits.

Create demo repository to test our scripts against?

A major part of our efforts is to develop a robust set of Python 3 scripts for mining GitHub repositories.

I've been testing parts of our script (e.g. get_commits, etc.) on repositories such as Safecast/bGeigieNanoKit, but they have so many forks, each with so many commits, that downloading so much data takes up lots of time and storage space.

To make testing those scripts easier, what if we made a demo repository to test the scripts against? The demo repository would contain all cases that we might run into including commits, branches, merges, and forks (meaning there will be demo forks, too).

Alternatively, if we can identify an existing project that has all the cases but is also small, then we can use their repository.

Data reader for dashboard frontend

For M18, it should at least read from the mined JSON file in this repository and prepare it for downstream wrangling and visualisation by Dash.

Allow using commandline options AND/OR config file when running start.py

Right now, our start.py needs both commandline options and a .config file to run. Let's update it to be more flexible, so that start.py can get everything it needs from commandline options, an arbitrarily named configuration file at any location, or a combination of both. In the end, the configuration file would be optional but useful.

Also remember to define what happens by default if no options and configuration file are provided.

I suspect this is not the highest priority item right now, but I'm putting this issue here so I will remember to do it someday...

Allow passing in a list of repositories to mine

Currently our script allows specifying one GitHub repository to retrieve data from. Expand this to allow passing in a list of repositories to mine (related to issue #17).

Ideally, this list won't be limited to GitHub repositories, but we can start with accepting just GitHub repositories as a MVP (minimum viable product) for this issue.

Optimize error handling for non-existent repository/fork

For now, @penyuan has added rudimentary error handling to deal with non-existent repositories or forks when running the mining scripts, but:

  1. The script would ask for input from the user
  2. The script doesn't currently keep a record of which repositories were problematic

Is there a way to tell perceval to download metadata only?

It seems that the command git = Git(repo_URL, data_dump_path) in get_commits.py:51 downloads all the contents of a repo: not only the metadata, but also the files. It packs everything into *.pack files in the directory /objects/pack. For hardware projects, this is a lot of data. And since we fetch info not only from root repos but also from forks, it is really a lot of data. It is more than my system can handle and makes the whole process really slow, way too slow for the reactivity that would be required from a management dashboard. Because of this, I can't test our algorithms on the really interesting projects that have a lot of data. This is a real handicap.

Can we simply tell perceval to skip downloading/generating the *.pack files?

As an illustration, here is the result of the du command in the folder __DATA__/grimoire_dumps on my machine. It contains gigabytes of data I don't use. Just for the repo AngelLM/Thor, this is 2.6 GB, and there are 150 forks of this repo, so processing it would mean downloading 360 GB, which is more than the total capacity of my disk (and an order of magnitude more than the space allocated to my Linux distribution)!

jb2971@VirtualBox:~/GitRepos/wp2.2_dev/__DATA__/grimoire_dumps$ du
4	./rafaellaantoniou-wp2.2_reference/branches
4	./rafaellaantoniou-wp2.2_reference/refs/heads
4	./rafaellaantoniou-wp2.2_reference/refs/tags
12	./rafaellaantoniou-wp2.2_reference/refs
8	./rafaellaantoniou-wp2.2_reference/info
56	./rafaellaantoniou-wp2.2_reference/hooks
8	./rafaellaantoniou-wp2.2_reference/objects/d4
8	./rafaellaantoniou-wp2.2_reference/objects/8f
8	./rafaellaantoniou-wp2.2_reference/objects/56
8	./rafaellaantoniou-wp2.2_reference/objects/5a
12	./rafaellaantoniou-wp2.2_reference/objects/ec
4	./rafaellaantoniou-wp2.2_reference/objects/info
12	./rafaellaantoniou-wp2.2_reference/objects/06
8	./rafaellaantoniou-wp2.2_reference/objects/66
8	./rafaellaantoniou-wp2.2_reference/objects/f9
8	./rafaellaantoniou-wp2.2_reference/objects/1b
8	./rafaellaantoniou-wp2.2_reference/objects/ea
8	./rafaellaantoniou-wp2.2_reference/objects/1f
8	./rafaellaantoniou-wp2.2_reference/objects/a3
8	./rafaellaantoniou-wp2.2_reference/objects/4e
8	./rafaellaantoniou-wp2.2_reference/objects/da
8	./rafaellaantoniou-wp2.2_reference/objects/d3
8	./rafaellaantoniou-wp2.2_reference/objects/5d
8	./rafaellaantoniou-wp2.2_reference/objects/b5
8	./rafaellaantoniou-wp2.2_reference/objects/0a
8	./rafaellaantoniou-wp2.2_reference/objects/53
8	./rafaellaantoniou-wp2.2_reference/objects/02
8	./rafaellaantoniou-wp2.2_reference/objects/27
8	./rafaellaantoniou-wp2.2_reference/objects/9d
8	./rafaellaantoniou-wp2.2_reference/objects/3d
8	./rafaellaantoniou-wp2.2_reference/objects/79
8	./rafaellaantoniou-wp2.2_reference/objects/12
8	./rafaellaantoniou-wp2.2_reference/objects/4f
8	./rafaellaantoniou-wp2.2_reference/objects/5f
12	./rafaellaantoniou-wp2.2_reference/objects/65
12	./rafaellaantoniou-wp2.2_reference/objects/b2
8	./rafaellaantoniou-wp2.2_reference/objects/a8
8	./rafaellaantoniou-wp2.2_reference/objects/a1
8	./rafaellaantoniou-wp2.2_reference/objects/18
8	./rafaellaantoniou-wp2.2_reference/objects/1d
8	./rafaellaantoniou-wp2.2_reference/objects/34
8	./rafaellaantoniou-wp2.2_reference/objects/74
8	./rafaellaantoniou-wp2.2_reference/objects/4a
8	./rafaellaantoniou-wp2.2_reference/objects/0f
8	./rafaellaantoniou-wp2.2_reference/objects/f5
8	./rafaellaantoniou-wp2.2_reference/objects/86
8	./rafaellaantoniou-wp2.2_reference/objects/9c
12	./rafaellaantoniou-wp2.2_reference/objects/3c
8	./rafaellaantoniou-wp2.2_reference/objects/3b
8	./rafaellaantoniou-wp2.2_reference/objects/ff
8	./rafaellaantoniou-wp2.2_reference/objects/38
12	./rafaellaantoniou-wp2.2_reference/objects/81
8	./rafaellaantoniou-wp2.2_reference/objects/67
12	./rafaellaantoniou-wp2.2_reference/objects/4d
8	./rafaellaantoniou-wp2.2_reference/objects/b6
12	./rafaellaantoniou-wp2.2_reference/objects/5c
8	./rafaellaantoniou-wp2.2_reference/objects/8e
8	./rafaellaantoniou-wp2.2_reference/objects/32
12	./rafaellaantoniou-wp2.2_reference/objects/92
12	./rafaellaantoniou-wp2.2_reference/objects/51
8	./rafaellaantoniou-wp2.2_reference/objects/ac
20	./rafaellaantoniou-wp2.2_reference/objects/f2
8	./rafaellaantoniou-wp2.2_reference/objects/28
12	./rafaellaantoniou-wp2.2_reference/objects/e9
8	./rafaellaantoniou-wp2.2_reference/objects/7f
8	./rafaellaantoniou-wp2.2_reference/objects/52
4	./rafaellaantoniou-wp2.2_reference/objects/pack
8	./rafaellaantoniou-wp2.2_reference/objects/bf
8	./rafaellaantoniou-wp2.2_reference/objects/a0
8	./rafaellaantoniou-wp2.2_reference/objects/dc
8	./rafaellaantoniou-wp2.2_reference/objects/80
8	./rafaellaantoniou-wp2.2_reference/objects/a2
8	./rafaellaantoniou-wp2.2_reference/objects/d1
8	./rafaellaantoniou-wp2.2_reference/objects/47
12	./rafaellaantoniou-wp2.2_reference/objects/21
8	./rafaellaantoniou-wp2.2_reference/objects/4c
8	./rafaellaantoniou-wp2.2_reference/objects/99
8	./rafaellaantoniou-wp2.2_reference/objects/48
8	./rafaellaantoniou-wp2.2_reference/objects/91
12	./rafaellaantoniou-wp2.2_reference/objects/46
8	./rafaellaantoniou-wp2.2_reference/objects/0e
8	./rafaellaantoniou-wp2.2_reference/objects/35
8	./rafaellaantoniou-wp2.2_reference/objects/2a
8	./rafaellaantoniou-wp2.2_reference/objects/7a
8	./rafaellaantoniou-wp2.2_reference/objects/af
8	./rafaellaantoniou-wp2.2_reference/objects/6b
8	./rafaellaantoniou-wp2.2_reference/objects/7d
708	./rafaellaantoniou-wp2.2_reference/objects
812	./rafaellaantoniou-wp2.2_reference
4	./jurgenwesterhoff-bGeigieNanoKit/branches
4	./jurgenwesterhoff-bGeigieNanoKit/refs/heads
4	./jurgenwesterhoff-bGeigieNanoKit/refs/tags
12	./jurgenwesterhoff-bGeigieNanoKit/refs
8	./jurgenwesterhoff-bGeigieNanoKit/info
56	./jurgenwesterhoff-bGeigieNanoKit/hooks
4	./jurgenwesterhoff-bGeigieNanoKit/objects/info
64380	./jurgenwesterhoff-bGeigieNanoKit/objects/pack
64388	./jurgenwesterhoff-bGeigieNanoKit/objects
64492	./jurgenwesterhoff-bGeigieNanoKit
4	./OPEN-NEXT-wp2.2_reference/branches
4	./OPEN-NEXT-wp2.2_reference/refs/heads
4	./OPEN-NEXT-wp2.2_reference/refs/tags
12	./OPEN-NEXT-wp2.2_reference/refs
8	./OPEN-NEXT-wp2.2_reference/info
56	./OPEN-NEXT-wp2.2_reference/hooks
8	./OPEN-NEXT-wp2.2_reference/objects/d4
8	./OPEN-NEXT-wp2.2_reference/objects/8f
8	./OPEN-NEXT-wp2.2_reference/objects/56
8	./OPEN-NEXT-wp2.2_reference/objects/5a
12	./OPEN-NEXT-wp2.2_reference/objects/ec
4	./OPEN-NEXT-wp2.2_reference/objects/info
12	./OPEN-NEXT-wp2.2_reference/objects/06
8	./OPEN-NEXT-wp2.2_reference/objects/66
8	./OPEN-NEXT-wp2.2_reference/objects/f9
8	./OPEN-NEXT-wp2.2_reference/objects/1b
8	./OPEN-NEXT-wp2.2_reference/objects/ea
8	./OPEN-NEXT-wp2.2_reference/objects/1f
8	./OPEN-NEXT-wp2.2_reference/objects/a3
8	./OPEN-NEXT-wp2.2_reference/objects/4e
8	./OPEN-NEXT-wp2.2_reference/objects/da
8	./OPEN-NEXT-wp2.2_reference/objects/d3
8	./OPEN-NEXT-wp2.2_reference/objects/5d
8	./OPEN-NEXT-wp2.2_reference/objects/b5
8	./OPEN-NEXT-wp2.2_reference/objects/0a
8	./OPEN-NEXT-wp2.2_reference/objects/53
8	./OPEN-NEXT-wp2.2_reference/objects/02
8	./OPEN-NEXT-wp2.2_reference/objects/27
8	./OPEN-NEXT-wp2.2_reference/objects/9d
8	./OPEN-NEXT-wp2.2_reference/objects/3d
8	./OPEN-NEXT-wp2.2_reference/objects/79
8	./OPEN-NEXT-wp2.2_reference/objects/12
8	./OPEN-NEXT-wp2.2_reference/objects/4f
8	./OPEN-NEXT-wp2.2_reference/objects/5f
12	./OPEN-NEXT-wp2.2_reference/objects/65
12	./OPEN-NEXT-wp2.2_reference/objects/b2
8	./OPEN-NEXT-wp2.2_reference/objects/a8
8	./OPEN-NEXT-wp2.2_reference/objects/a1
8	./OPEN-NEXT-wp2.2_reference/objects/18
8	./OPEN-NEXT-wp2.2_reference/objects/1d
8	./OPEN-NEXT-wp2.2_reference/objects/34
8	./OPEN-NEXT-wp2.2_reference/objects/74
8	./OPEN-NEXT-wp2.2_reference/objects/4a
8	./OPEN-NEXT-wp2.2_reference/objects/0f
8	./OPEN-NEXT-wp2.2_reference/objects/f5
8	./OPEN-NEXT-wp2.2_reference/objects/86
8	./OPEN-NEXT-wp2.2_reference/objects/9c
12	./OPEN-NEXT-wp2.2_reference/objects/3c
8	./OPEN-NEXT-wp2.2_reference/objects/61
8	./OPEN-NEXT-wp2.2_reference/objects/3b
8	./OPEN-NEXT-wp2.2_reference/objects/ff
8	./OPEN-NEXT-wp2.2_reference/objects/38
12	./OPEN-NEXT-wp2.2_reference/objects/81
8	./OPEN-NEXT-wp2.2_reference/objects/67
12	./OPEN-NEXT-wp2.2_reference/objects/4d
8	./OPEN-NEXT-wp2.2_reference/objects/b6
12	./OPEN-NEXT-wp2.2_reference/objects/5c
8	./OPEN-NEXT-wp2.2_reference/objects/8e
8	./OPEN-NEXT-wp2.2_reference/objects/32
12	./OPEN-NEXT-wp2.2_reference/objects/92
12	./OPEN-NEXT-wp2.2_reference/objects/51
8	./OPEN-NEXT-wp2.2_reference/objects/ac
20	./OPEN-NEXT-wp2.2_reference/objects/f2
8	./OPEN-NEXT-wp2.2_reference/objects/28
12	./OPEN-NEXT-wp2.2_reference/objects/e9
8	./OPEN-NEXT-wp2.2_reference/objects/7f
8	./OPEN-NEXT-wp2.2_reference/objects/52
4	./OPEN-NEXT-wp2.2_reference/objects/pack
8	./OPEN-NEXT-wp2.2_reference/objects/bf
8	./OPEN-NEXT-wp2.2_reference/objects/a0
8	./OPEN-NEXT-wp2.2_reference/objects/dc
8	./OPEN-NEXT-wp2.2_reference/objects/80
8	./OPEN-NEXT-wp2.2_reference/objects/a2
8	./OPEN-NEXT-wp2.2_reference/objects/d1
8	./OPEN-NEXT-wp2.2_reference/objects/47
12	./OPEN-NEXT-wp2.2_reference/objects/21
8	./OPEN-NEXT-wp2.2_reference/objects/4c
8	./OPEN-NEXT-wp2.2_reference/objects/99
8	./OPEN-NEXT-wp2.2_reference/objects/48
8	./OPEN-NEXT-wp2.2_reference/objects/91
12	./OPEN-NEXT-wp2.2_reference/objects/46
8	./OPEN-NEXT-wp2.2_reference/objects/0e
8	./OPEN-NEXT-wp2.2_reference/objects/35
8	./OPEN-NEXT-wp2.2_reference/objects/2a
8	./OPEN-NEXT-wp2.2_reference/objects/7a
8	./OPEN-NEXT-wp2.2_reference/objects/af
8	./OPEN-NEXT-wp2.2_reference/objects/6b
8	./OPEN-NEXT-wp2.2_reference/objects/7d
716	./OPEN-NEXT-wp2.2_reference/objects
820	./OPEN-NEXT-wp2.2_reference
4	./OPEN-NEXT-WP2.2_dev/branches
4	./OPEN-NEXT-WP2.2_dev/refs/heads
4	./OPEN-NEXT-WP2.2_dev/refs/tags
12	./OPEN-NEXT-WP2.2_dev/refs
8	./OPEN-NEXT-WP2.2_dev/info
56	./OPEN-NEXT-WP2.2_dev/hooks
4	./OPEN-NEXT-WP2.2_dev/objects/info
552	./OPEN-NEXT-WP2.2_dev/objects/pack
560	./OPEN-NEXT-WP2.2_dev/objects
664	./OPEN-NEXT-WP2.2_dev
4	./penyuan-github-mining/branches
4	./penyuan-github-mining/refs/heads
4	./penyuan-github-mining/refs/tags
12	./penyuan-github-mining/refs
8	./penyuan-github-mining/info
56	./penyuan-github-mining/hooks
4	./penyuan-github-mining/objects/info
8868	./penyuan-github-mining/objects/pack
8876	./penyuan-github-mining/objects
8980	./penyuan-github-mining
4	./massjona-github-mining/branches
4	./massjona-github-mining/refs/heads
4	./massjona-github-mining/refs/tags
12	./massjona-github-mining/refs
8	./massjona-github-mining/info
56	./massjona-github-mining/hooks
8	./massjona-github-mining/objects/8f
4	./massjona-github-mining/objects/info
8	./massjona-github-mining/objects/66
8	./massjona-github-mining/objects/90
8	./massjona-github-mining/objects/31
8	./massjona-github-mining/objects/a3
12	./massjona-github-mining/objects/eb
8	./massjona-github-mining/objects/78
8	./massjona-github-mining/objects/d3
8	./massjona-github-mining/objects/be
8	./massjona-github-mining/objects/f6
12	./massjona-github-mining/objects/3d
8	./massjona-github-mining/objects/c5
8	./massjona-github-mining/objects/65
8	./massjona-github-mining/objects/b1
20	./massjona-github-mining/objects/1e
8	./massjona-github-mining/objects/2b
8	./massjona-github-mining/objects/e5
8	./massjona-github-mining/objects/61
12	./massjona-github-mining/objects/f4
8	./massjona-github-mining/objects/85
8	./massjona-github-mining/objects/81
8	./massjona-github-mining/objects/43
8	./massjona-github-mining/objects/01
12	./massjona-github-mining/objects/b6
8	./massjona-github-mining/objects/98
12	./massjona-github-mining/objects/03
20	./massjona-github-mining/objects/94
8	./massjona-github-mining/objects/92
8	./massjona-github-mining/objects/2f
8	./massjona-github-mining/objects/28
12	./massjona-github-mining/objects/52
4	./massjona-github-mining/objects/pack
8	./massjona-github-mining/objects/04
8	./massjona-github-mining/objects/e4
8	./massjona-github-mining/objects/29
8	./massjona-github-mining/objects/88
8	./massjona-github-mining/objects/80
8	./massjona-github-mining/objects/25
12	./massjona-github-mining/objects/e3
8	./massjona-github-mining/objects/1a
8	./massjona-github-mining/objects/c9
8	./massjona-github-mining/objects/7e
8	./massjona-github-mining/objects/48
12	./massjona-github-mining/objects/db
8	./massjona-github-mining/objects/46
8	./massjona-github-mining/objects/e0
8	./massjona-github-mining/objects/14
8	./massjona-github-mining/objects/6b
8	./massjona-github-mining/objects/fe
8	./massjona-github-mining/objects/75
8	./massjona-github-mining/objects/ba
8	./massjona-github-mining/objects/9a
476	./massjona-github-mining/objects
580	./massjona-github-mining
4	./AngelLM-Thor/branches
4	./AngelLM-Thor/refs/heads
4	./AngelLM-Thor/refs/tags
12	./AngelLM-Thor/refs
8	./AngelLM-Thor/info
56	./AngelLM-Thor/hooks
4	./AngelLM-Thor/objects/info
2605124	./AngelLM-Thor/objects/pack
2605132	./AngelLM-Thor/objects
2605232	./AngelLM-Thor
4	./OPEN-NEXT-wp2.2_dev/branches
28	./OPEN-NEXT-wp2.2_dev/refs/heads
4	./OPEN-NEXT-wp2.2_dev/refs/tags
36	./OPEN-NEXT-wp2.2_dev/refs
8	./OPEN-NEXT-wp2.2_dev/info
56	./OPEN-NEXT-wp2.2_dev/hooks
8	./OPEN-NEXT-wp2.2_dev/objects/24
8	./OPEN-NEXT-wp2.2_dev/objects/cb
4	./OPEN-NEXT-wp2.2_dev/objects/info
8	./OPEN-NEXT-wp2.2_dev/objects/06
8	./OPEN-NEXT-wp2.2_dev/objects/41
8	./OPEN-NEXT-wp2.2_dev/objects/5d
8	./OPEN-NEXT-wp2.2_dev/objects/2b
8	./OPEN-NEXT-wp2.2_dev/objects/13
8	./OPEN-NEXT-wp2.2_dev/objects/7c
8	./OPEN-NEXT-wp2.2_dev/objects/01
8	./OPEN-NEXT-wp2.2_dev/objects/c6
8	./OPEN-NEXT-wp2.2_dev/objects/10
8	./OPEN-NEXT-wp2.2_dev/objects/67
8	./OPEN-NEXT-wp2.2_dev/objects/de
8	./OPEN-NEXT-wp2.2_dev/objects/5b
576	./OPEN-NEXT-wp2.2_dev/objects/pack
8	./OPEN-NEXT-wp2.2_dev/objects/25
8	./OPEN-NEXT-wp2.2_dev/objects/99
8	./OPEN-NEXT-wp2.2_dev/objects/26
8	./OPEN-NEXT-wp2.2_dev/objects/fc
728	./OPEN-NEXT-wp2.2_dev/objects
856	./OPEN-NEXT-wp2.2_dev
4	./rafaellaantoniou-github-mining/branches
4	./rafaellaantoniou-github-mining/refs/heads
4	./rafaellaantoniou-github-mining/refs/tags
12	./rafaellaantoniou-github-mining/refs
8	./rafaellaantoniou-github-mining/info
56	./rafaellaantoniou-github-mining/hooks
4	./rafaellaantoniou-github-mining/objects/info
8484	./rafaellaantoniou-github-mining/objects/pack
8492	./rafaellaantoniou-github-mining/objects
8596	./rafaellaantoniou-github-mining
4	./jbon-wp2.2_reference/branches
4	./jbon-wp2.2_reference/refs/heads
4	./jbon-wp2.2_reference/refs/tags
12	./jbon-wp2.2_reference/refs
8	./jbon-wp2.2_reference/info
56	./jbon-wp2.2_reference/hooks
4	./jbon-wp2.2_reference/objects/info
36	./jbon-wp2.2_reference/objects/pack
44	./jbon-wp2.2_reference/objects
148	./jbon-wp2.2_reference
4	./Safecast-bGeigieNanoKit/branches
4	./Safecast-bGeigieNanoKit/refs/heads
4	./Safecast-bGeigieNanoKit/refs/tags
12	./Safecast-bGeigieNanoKit/refs
8	./Safecast-bGeigieNanoKit/info
56	./Safecast-bGeigieNanoKit/hooks
4	./Safecast-bGeigieNanoKit/objects/info
**128428**	./Safecast-bGeigieNanoKit/objects/pack
128436	./Safecast-bGeigieNanoKit/objects
128540	./Safecast-bGeigieNanoKit
4	./OPEN-NEXT-WP2.2_reference/branches
4	./OPEN-NEXT-WP2.2_reference/refs/heads
4	./OPEN-NEXT-WP2.2_reference/refs/tags
12	./OPEN-NEXT-WP2.2_reference/refs
8	./OPEN-NEXT-WP2.2_reference/info
56	./OPEN-NEXT-WP2.2_reference/hooks
8	./OPEN-NEXT-WP2.2_reference/objects/d4
8	./OPEN-NEXT-WP2.2_reference/objects/8f
8	./OPEN-NEXT-WP2.2_reference/objects/56
8	./OPEN-NEXT-WP2.2_reference/objects/5a
12	./OPEN-NEXT-WP2.2_reference/objects/ec
4	./OPEN-NEXT-WP2.2_reference/objects/info
12	./OPEN-NEXT-WP2.2_reference/objects/06
8	./OPEN-NEXT-WP2.2_reference/objects/66
8	./OPEN-NEXT-WP2.2_reference/objects/f9
8	./OPEN-NEXT-WP2.2_reference/objects/1b
8	./OPEN-NEXT-WP2.2_reference/objects/ea
8	./OPEN-NEXT-WP2.2_reference/objects/1f
8	./OPEN-NEXT-WP2.2_reference/objects/a3
8	./OPEN-NEXT-WP2.2_reference/objects/4e
8	./OPEN-NEXT-WP2.2_reference/objects/da
8	./OPEN-NEXT-WP2.2_reference/objects/d3
8	./OPEN-NEXT-WP2.2_reference/objects/5d
8	./OPEN-NEXT-WP2.2_reference/objects/b5
8	./OPEN-NEXT-WP2.2_reference/objects/0a
8	./OPEN-NEXT-WP2.2_reference/objects/53
8	./OPEN-NEXT-WP2.2_reference/objects/02
8	./OPEN-NEXT-WP2.2_reference/objects/27
8	./OPEN-NEXT-WP2.2_reference/objects/9d
8	./OPEN-NEXT-WP2.2_reference/objects/3d
8	./OPEN-NEXT-WP2.2_reference/objects/79
8	./OPEN-NEXT-WP2.2_reference/objects/12
8	./OPEN-NEXT-WP2.2_reference/objects/4f
8	./OPEN-NEXT-WP2.2_reference/objects/5f
12	./OPEN-NEXT-WP2.2_reference/objects/65
12	./OPEN-NEXT-WP2.2_reference/objects/b2
8	./OPEN-NEXT-WP2.2_reference/objects/a8
8	./OPEN-NEXT-WP2.2_reference/objects/a1
8	./OPEN-NEXT-WP2.2_reference/objects/18
8	./OPEN-NEXT-WP2.2_reference/objects/1d
8	./OPEN-NEXT-WP2.2_reference/objects/34
8	./OPEN-NEXT-WP2.2_reference/objects/74
8	./OPEN-NEXT-WP2.2_reference/objects/4a
8	./OPEN-NEXT-WP2.2_reference/objects/0f
8	./OPEN-NEXT-WP2.2_reference/objects/f5
8	./OPEN-NEXT-WP2.2_reference/objects/86
8	./OPEN-NEXT-WP2.2_reference/objects/9c
12	./OPEN-NEXT-WP2.2_reference/objects/3c
8	./OPEN-NEXT-WP2.2_reference/objects/61
8	./OPEN-NEXT-WP2.2_reference/objects/3b
8	./OPEN-NEXT-WP2.2_reference/objects/ff
8	./OPEN-NEXT-WP2.2_reference/objects/38
12	./OPEN-NEXT-WP2.2_reference/objects/81
8	./OPEN-NEXT-WP2.2_reference/objects/67
12	./OPEN-NEXT-WP2.2_reference/objects/4d
8	./OPEN-NEXT-WP2.2_reference/objects/b6
12	./OPEN-NEXT-WP2.2_reference/objects/5c
8	./OPEN-NEXT-WP2.2_reference/objects/8e
8	./OPEN-NEXT-WP2.2_reference/objects/32
12	./OPEN-NEXT-WP2.2_reference/objects/92
12	./OPEN-NEXT-WP2.2_reference/objects/51
8	./OPEN-NEXT-WP2.2_reference/objects/ac
20	./OPEN-NEXT-WP2.2_reference/objects/f2
8	./OPEN-NEXT-WP2.2_reference/objects/28
12	./OPEN-NEXT-WP2.2_reference/objects/e9
8	./OPEN-NEXT-WP2.2_reference/objects/7f
8	./OPEN-NEXT-WP2.2_reference/objects/52
4	./OPEN-NEXT-WP2.2_reference/objects/pack
8	./OPEN-NEXT-WP2.2_reference/objects/bf
8	./OPEN-NEXT-WP2.2_reference/objects/a0
8	./OPEN-NEXT-WP2.2_reference/objects/dc
8	./OPEN-NEXT-WP2.2_reference/objects/80
8	./OPEN-NEXT-WP2.2_reference/objects/a2
8	./OPEN-NEXT-WP2.2_reference/objects/d1
8	./OPEN-NEXT-WP2.2_reference/objects/47
12	./OPEN-NEXT-WP2.2_reference/objects/21
8	./OPEN-NEXT-WP2.2_reference/objects/4c
8	./OPEN-NEXT-WP2.2_reference/objects/99
8	./OPEN-NEXT-WP2.2_reference/objects/48
8	./OPEN-NEXT-WP2.2_reference/objects/91
12	./OPEN-NEXT-WP2.2_reference/objects/46
8	./OPEN-NEXT-WP2.2_reference/objects/0e
8	./OPEN-NEXT-WP2.2_reference/objects/35
8	./OPEN-NEXT-WP2.2_reference/objects/2a
8	./OPEN-NEXT-WP2.2_reference/objects/7a
8	./OPEN-NEXT-WP2.2_reference/objects/af
8	./OPEN-NEXT-WP2.2_reference/objects/6b
8	./OPEN-NEXT-WP2.2_reference/objects/7d
716	./OPEN-NEXT-WP2.2_reference/objects
820	./OPEN-NEXT-WP2.2_reference
4	./Warukira-bGeigieNanoKit/branches
4	./Warukira-bGeigieNanoKit/refs/heads
4	./Warukira-bGeigieNanoKit/refs/tags
12	./Warukira-bGeigieNanoKit/refs
8	./Warukira-bGeigieNanoKit/info
56	./Warukira-bGeigieNanoKit/hooks
4	./Warukira-bGeigieNanoKit/objects/info
128428	./Warukira-bGeigieNanoKit/objects/pack
128436	./Warukira-bGeigieNanoKit/objects
128536	./Warukira-bGeigieNanoKit
4	./jbon-github-mining/branches
4	./jbon-github-mining/refs/heads
4	./jbon-github-mining/refs/tags
12	./jbon-github-mining/refs
8	./jbon-github-mining/info
56	./jbon-github-mining/hooks
4	./jbon-github-mining/objects/info
8864	./jbon-github-mining/objects/pack
8872	./jbon-github-mining/objects
8976	./jbon-github-mining
2958056	.

Load CSV list of repositories

CSV is probably easier to manage with common desktop spreadsheet software, so have the data-mining script read from this a list of repositories to mine.

KeyError for `added` in build_file_change_history.py

How to reproduce: Run current master branch start.py with the repository jurgenwesterhoff/bGeigieNanoKit.

Symptom: Run fails with this traceback:

retrieving commits in jurgenwesterhoff/bGeigieNanoKit
fetching info at https://github.com/jurgenwesterhoff/bGeigieNanoKit
Traceback (most recent call last):
  File "/home/penyuan/.vscode-oss/extensions/ms-python.python-2020.5.86806/pythonFiles/ptvsd_launcher.py", line 48, in <module>
    main(ptvsdArgs)
  File "/home/penyuan/.vscode-oss/extensions/ms-python.python-2020.5.86806/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 432, in main
    run()
  File "/home/penyuan/.vscode-oss/extensions/ms-python.python-2020.5.86806/pythonFiles/lib/python/old_ptvsd/ptvsd/__main__.py", line 316, in run_file
    runpy.run_path(target, run_name='__main__')
  File "/home/penyuan/.pyenv/versions/3.6.10/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/penyuan/.pyenv/versions/3.6.10/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/penyuan/.pyenv/versions/3.6.10/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/penyuan/devel/git/wp2.2_dev/src/start.py", line 174, in <module>
    main()
  File "/home/penyuan/devel/git/wp2.2_dev/src/start.py", line 139, in main
    build_file_change_history(known_commits, file_change_history)
  File "/home/penyuan/devel/git/wp2.2_dev/src/build_file_change_history.py", line 47, in build_file_change_history
    added = filechange["added"],
KeyError: 'added'
fish: “env PYTHONIOENCODING=UTF-8 PYTH…” terminated by signal SIGTERM (Polite quit request)

Looks like the metadata retrieved with Perceval sometimes does not include the "added" item.

Include JSON to JSON-LD conversion

Per recent discussions with @moedn et al., e.g. on 2020-11-03, I'm inclined to hook the dashboard up to the Open!Next Wikibase instance by converting the dashboard's JSON-formatted data to a JSON-LD payload, which is accepted by WMDE's upcoming API. I.e.:

"dashboard data in JSON format" <----> "JSON-LD" <----> "WMDE's API"

Add bi-directionality in committer graphs

So far we use a networkx.Graph, which can only record one link between two nodes. As a result, we mix up interactions going from one committer to another and those going the other way round.

An alternative is to use a networkx.MultiGraph, which allows parallel edges. This would be compatible with a vis.js visualisation. I couldn't tell from the networkx doc whether it is compatible with GraphML export, though.

Alternatively, an undirected graph with two weight attributes, from and to, would do the job.
