bonnyci / mateys-ahoy Goto Github PK



Individual contributor analysis (see shuffleboard)

License: Apache License 2.0

Shell 1.59% R 98.41%

mateys-ahoy's People

Contributors

Watchers

Forkers

aaronschneider1 sarahzuk

mateys-ahoy's Issues

Social networking trends (eg, Twitter)

Identify what people are posting about using the usual means, hashtags etc. Not sure how useful this is tbh but worth exploring.

Research how open source foundations are defining community participation metrics

TL;DR: How do others currently define individual participation metrics?

This is part of a larger effort to understand how individuals are contributing to open source projects. Ultimately we need to quantify this activity. Open source projects are often run by non-profit open source foundations that need to justify their existence in order to procure funding. One thing many open source foundations care about is tracking contributor activity. This might include looking at the rate of people joining the community (new contributors) and the rate of people leaving the community, as well as how long someone has been involved in the community. They might also provide some way of ranking or rating the contributors. The goal of this issue is to discover how open source communities are currently quantifying contributor activity.
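As a starting point for comparison with whatever the foundations use, the three quantities mentioned above (joining rate, leaving rate, and length of involvement) can be sketched from a plain list of activity records. This is only an illustration: the `(contributor, date)` input shape, the function name, and the 180-day inactivity cutoff are assumptions, not definitions taken from any foundation.

```python
from collections import Counter
from datetime import date


def contributor_metrics(activity, as_of, inactive_after_days=180):
    """Summarize join rate, tenure, and apparent departures.

    activity: iterable of (contributor, date) pairs (hypothetical input shape).
    inactive_after_days: assumed cutoff for treating someone as having left.
    """
    first, last = {}, {}
    for who, day in activity:
        first[who] = min(first.get(who, day), day)
        last[who] = max(last.get(who, day), day)

    # New contributors per calendar month, keyed by first recorded activity.
    joined_per_month = Counter(d.strftime("%Y-%m") for d in first.values())
    # Tenure: days between first and last recorded activity.
    tenure_days = {who: (last[who] - first[who]).days for who in first}
    # Departed: no recorded activity within the cutoff window before as_of.
    departed = sorted(
        who for who, d in last.items() if (as_of - d).days > inactive_after_days
    )
    return joined_per_month, tenure_days, departed
```

The cutoff is the main design choice here; a different definition of "leaving" changes the leaving rate directly, so it is worth noting how each source defines it.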

Process:

Create a wiki page to capture research summaries, ideas, etc. for this ticket (https://github.com/BonnyCI/mateys-ahoy/wiki -> New Page)
For each article you read, put a bullet point with a link in the wiki, along with:
a summary of what you took to be the main point relevant to us (if it's worth it)
any ideas it gave you
ways you might improve on what they suggested (if you have any)
questions ("What are they even talking about?", "Why is this important to them?", "What question are they really trying to answer here?"), with links to any relevant information that inspired them

Some suggestions:

TREC Entity Training Data

http://trec.nist.gov/data/entity2011.html

Improve commit log identification methodology

When getting commit histories from an individual's commit log, improve the sort order of returned repositories to see if it results in a better identification.

per Issue #6

How many identified have hits on Google Scholar?

Entries:
https://scholar.google.com/scholar?hl=en&q=karen+simonyan

Profile example:
https://scholar.google.com/citations?user=L7lMQkQAAAAJ

EFF AI Progress Measurement Experiment

Contribute ideas for metrics here? Or possibly use things from here for our own purposes unrelated to their efforts?

https://www.eff.org/ai/metrics

Create new contributor directory

Define metrics to be collected from Github Event data

Given the metrics identified in Issue #1, determine what data we need to gather from the Github Event data to compute them.

Find Github repos associated with Language communities

Of particular interest are R and Python

Identify company or university association

Propose contributor participation categories

Rather than just considering "top" or "most active" develop categories of contributors based on their activity profiles.

One idea: take random samples of contributors to the project in a small enough size for manual analysis (n=10), manually identify which ones appear to be the most "valuable" to a given project and then come up with a list of parameters that defines their "value". Then we can do further analysis using the parameters that are most easily available given the data we have to see if we can find any obvious correlations that could help us build event data queries.

What types of event activity are most strongly correlated with higher valued contributors (if any)? We could iterate on this to incorporate the event payload field (which has to be parsed). The goal here would be to maximize the probability of identifying key contributors while attempting to minimize the amount of "crud" we have to analyze to find them.

Set up TravisCI to check for Signed Off By
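One way the check itself might look, independent of how it is wired into TravisCI: a small function that tests whether a commit message carries a `Signed-off-by:` trailer of the kind `git commit -s` appends. The function name and the exact pattern are assumptions for illustration; the CI configuration is not shown.

```python
import re

# Trailer as written by `git commit -s`: "Signed-off-by: Name <email>".
# The pattern is a deliberately loose assumption, not a full email validator.
SIGNOFF = re.compile(r"^Signed-off-by: .+ <[^<>\s]+@[^<>\s]+>$", re.MULTILINE)


def has_signoff(message):
    """Return True if the commit message contains a Signed-off-by trailer."""
    return bool(SIGNOFF.search(message))
```

A CI job could call this for every commit message in the pull request range and fail if any returns False.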

Define metrics to be collected from Git Commit History

Add mxnet exploration to repo

New Contributor Setup for Sarah

@SarahZuk

Github account info (username)
Google account email/username (for Google BigQuery)
Install RStudio

@missaugustina

add sarah to BonnyCI org
add sarah to Google BQ

Nature Index

https://www.natureindex.com/

Manually build a social networking profile for a small sample of contributors

Initially use a known set, contributors that have already been identified and see where they have profiles and what identifying information is available. Once that's done, take a small sample of unknowns and try to identify them as well. Ideally this should be automated even if just in one-off scripts to document the paths taken.

CRAN packages by SourceRank

https://libraries.io/search?order=desc&platforms=CRAN&sort=rank

Commit Authors <-> Github users

Twitter Bot or Not

More interesting than Issue #18 would be a way to determine if Twitter accounts are bots and to perform demographic analysis on the Twitter bots themselves to see how they are evolving.

Find Existing "Innovation Trackers"

This issue is to collect information about existing innovation trackers. What metrics are they tracking and how are they collecting data?

Areas of Expertise demographics

In addition to considering what companies are represented, look for areas of expertise (contributors can have more than one). How does this correlate with other "innovation" factors?

Questions:

What fields of expertise are represented for each project among top contributors?
What is the typical diversity of expertise per top contributor?
Is there any correlation between projects identified as highly innovative/active and expertise diversity?

CRAN package downloads

https://github.com/metacran/cranlogs.app

Automate web searching by name and email address

Explore options for automatically building profiles.

DuckDuckGo API search (try different combinations of terms, need a way to rank match likelihood)
LinkedIn API (apply to do email searches)
Explore Social Network services

Automated web searches

Once areas of interest are identified, automate web searches for different topics. If a particularly valuable data source is found then register an additional alert for it.

Additional automation tools for tracking content updates:

Define metrics to be collected through Github API

Given the metrics identified in Issue #1, determine what data we need to gather from the Github API to compute them.

What companies are engaged the most in R code contributions?

R core language source code analysis will not yield results (uses SVN, small group of contributors)
Commit history on packages will yield better results, see other issues in this milestone for CRAN analysis
also ROpenSci and Bioconductor

Commit Authors Lookup

Develop a better method for identifying who a commit author is. Keep track of their email addresses for given time frames and determine a way to "rank" them. If there is a choice between a hosted email address and a company one for the same time period, the company one gets priority.

author -> min_date,max_date -> authoritative email_address

Given an email address + date, find the author
email_address + date -> author
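The two lookups above could be sketched as a small in-memory index. Everything named here (`AuthorIndex`, `email_rank`, the `HOSTED_DOMAINS` list) is hypothetical, and the ranking is the simplest reading of the rule in this issue: a company address beats a hosted one for the same period.

```python
from datetime import date

# Assumed examples of "hosted" email providers; extend as needed.
HOSTED_DOMAINS = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def email_rank(email):
    """Rank an address: company domains (1) outrank hosted ones (0)."""
    domain = email.rsplit("@", 1)[-1].lower()
    return 0 if domain in HOSTED_DOMAINS else 1


class AuthorIndex:
    def __init__(self):
        # Each record: (author, email, min_date, max_date)
        self._spans = []

    def add(self, author, email, min_date, max_date):
        self._spans.append((author, email.lower(), min_date, max_date))

    def authoritative_email(self, author, day):
        """author + date -> highest-ranked email in use on that date."""
        candidates = [
            e for a, e, lo, hi in self._spans if a == author and lo <= day <= hi
        ]
        return max(candidates, key=email_rank, default=None)

    def author_for(self, email, day):
        """email_address + date -> author, or None if no span matches."""
        email = email.lower()
        for a, e, lo, hi in self._spans:
            if e == email and lo <= day <= hi:
                return a
        return None
```

Keeping the date range on every record is what allows the same address to map to different people over time, for example a reused role address.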

Investigate Libraries.io SourceRank as an alternative to "stars"

https://github.com/librariesio/libraries.io/pull/1020/files#diff-3524a86e01a2c00d6b1d818b4dd43e1aR22

Link committers to Github Accounts (via GBQ Github repos dataset)

GBQ has a Github repos dataset that has commit SHAs. For repos of interest, try cross-referencing commits to identify authors so event activity can also be identified.
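A rough sketch of the cross-referencing step, assuming both sides have already been exported to plain Python records. The field names (`sha`, `email`, `login`, `shas`) are hypothetical stand-ins rather than the actual GBQ column names, and the account that pushed a commit is not necessarily its author, so the result is a set of candidate logins per email address rather than a confirmed identity.

```python
def link_authors_to_logins(commits, push_events):
    """Map commit author emails to candidate Github logins via shared SHAs.

    commits: iterable of {"sha": ..., "email": ...} (hypothetical shape)
    push_events: iterable of {"login": ..., "shas": [...]} (hypothetical shape)
    """
    # Index each SHA by the first login seen pushing it.
    sha_to_login = {}
    for event in push_events:
        for sha in event["shas"]:
            sha_to_login.setdefault(sha, event["login"])

    # Collect every login that pushed a commit authored under each email.
    links = {}
    for commit in commits:
        login = sha_to_login.get(commit["sha"])
        if login is not None:
            links.setdefault(commit["email"].lower(), set()).add(login)
    return links
```

An email that links to a single login across many commits is a stronger match than one that links to several.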

Normalize company names based on domain lookups

Given a domain extracted from an email address, identify the company.
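A minimal sketch of the lookup, assuming a hand-maintained table from domain to a normalized company name. The table entries and function names are placeholders; a real table would have to be built from the domains actually observed in the commit data.

```python
# Hypothetical lookup table; entries are placeholders, not real mappings.
COMPANY_BY_DOMAIN = {
    "example.com": "Example Corp",
    "example.org": "Example Foundation",
}


def email_domain(email):
    """Extract the lowercased domain from an email address."""
    return email.rsplit("@", 1)[-1].strip().lower()


def company_for(email, table=COMPANY_BY_DOMAIN):
    """Return the normalized company name for an email, or None if unknown."""
    parts = email_domain(email).split(".")
    # Try the full domain first, then parent domains
    # (mail.example.com -> example.com), stopping before the bare TLD.
    for i in range(len(parts) - 1):
        candidate = ".".join(parts[i:])
        if candidate in table:
            return table[candidate]
    return None
```

Falling back to parent domains lets subdomains such as regional or departmental mail hosts resolve to the same company name.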

Automate metric collection

Data should be stored in GBQ dataset

Extract email addresses from commit log

How many contributors with minimal github info are able to be identified this way? Do the email addresses improve other identification results?

See: countering-bean-counting/bonnyci_shuffleboard#85
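The extraction step might be sketched as follows, assuming the log is produced with `git log --format=%an%x09%ae%x09%aI` (author name, author email, and ISO 8601 author date, separated by tabs). The function name and output shape are assumptions; the sketch parses text that has already been captured rather than invoking git itself.

```python
# Format string to pass to `git log --format=...`: name, email, ISO date,
# tab-separated (%x09 is a literal tab).
LOG_FORMAT = "%an%x09%ae%x09%aI"


def parse_commit_log(text):
    """Group commit log lines by author email.

    Returns {email: {"names": set of author names, "commits": count}}.
    Lines that do not have exactly three tab-separated fields are skipped.
    """
    by_email = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        name, email, _when = parts
        entry = by_email.setdefault(email.lower(), {"names": set(), "commits": 0})
        entry["names"].add(name)
        entry["commits"] += 1
    return by_email
```

Collecting every name seen for an address helps with the identification question above, since one person often commits under several name spellings.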

Use DuckDuckGo for Exploration

This came from Issue #9 which was originally going to just use DuckDuckGo. On further consideration, it makes more sense to write a script that takes a list of inputs and a query argument (or an argument indicating some pre-determined query pattern).

This would just provide some insight on a) what combinations of search terms yield the best results and b) what our match rate is.

These results would need to be manually analyzed until a better method is determined. Right now, other than manually searching one at a time, it's hard to get a sense of what proportion of the contributors can be matched.
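The script described above could start from something like this sketch: one function that expands a query pattern over a list of contributors, and one that computes the match rate from manual judgements. The search call itself is deliberately left out, and the field names (`name`, `email`) and function names are assumptions.

```python
def build_queries(contributors, pattern):
    """Expand a query pattern over a list of contributor records.

    contributors: list of dicts, e.g. {"name": ..., "email": ...} (assumed shape)
    pattern: a str.format template such as '"{name}" {email}'
    """
    return [pattern.format(**c) for c in contributors]


def match_rate(judgements):
    """Fraction of queries judged (manually) to have found the right person.

    judgements: dict mapping query string -> bool
    """
    if not judgements:
        return 0.0
    return sum(1 for ok in judgements.values() if ok) / len(judgements)
```

Running the same contributor list through several patterns and comparing the resulting match rates addresses both (a) and (b).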
