This repository holds data, experimental setups, and code closely related to the paper "Exploiting Relations, Sojourn-Times and Joint Conditional Probabilities for Automated Commit Classification".
While the name implies hidden Markov models, those were just one type of model tested here. We also attempt to fit dependent mixture models and joint conditional density models, and we try traditional machine learning. We reverse-engineer rules for manually labeling commits and create labels for a few hundred commits. Then, we produce datasets of adjacent, labeled commits that can be used with the mentioned models:
commits_t-0.csv
: Approx. 300 newly labeled commits. Some of these were previously contained in Levin's dataset, and we labeled them again here to see whether we would reach the same consensus. All commits have size properties from Git-Density attached and also come with the usual information, such as author, committer, email, timestamps, messages, and hashes.
commits_t-1.csv, commits_t-2.csv, commits_t-3.csv
: These are the "interesting" datasets. They have the same feature names as commits_t-0.csv, but each commit also comes with its 1/2/3 direct predecessor commits, so they have 2/3/4 times the features of commits_t-0.csv. The predecessor features keep the same names, but suffixed by _t_{1,2,3}.
The latter type of commit chains can be exploited for sequential learning. For example, is there any value in knowing the activity that was carried out in the previous commit(s)?
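As a first, rough way to look at that question, the sketch below groups the columns of a chained dataset by lag (using the _t_{1,2,3} suffix) and tabulates how often each label follows each previous label. It is only an illustration under assumptions: the column names `label` and `density`, the label values, and the sample rows are all made up here and may not match the actual CSV files. The helpers `split_by_lag` and `conditional_label_frequencies` are hypothetical and not part of this repository.

```python
import csv
import io
import re
from collections import Counter, defaultdict

# Hypothetical excerpt in the shape of commits_t-1.csv. The column names
# "label" and "density" and all values are invented for illustration.
SAMPLE = """label,density,label_t_1,density_t_1
a,0.4,c,0.9
c,0.7,c,0.8
a,0.5,c,0.6
p,0.2,a,0.3
"""

# Matches feature names that carry a predecessor suffix such as "_t_1".
SUFFIX = re.compile(r"^(?P<name>.+)_t_(?P<lag>[123])$")


def split_by_lag(columns):
    """Group column names by lag: 0 for the current commit, k for suffix _t_k."""
    groups = defaultdict(list)
    for col in columns:
        match = SUFFIX.match(col)
        lag = int(match.group("lag")) if match else 0
        groups[lag].append(col)
    return dict(groups)


def conditional_label_frequencies(rows, current="label", previous="label_t_1"):
    """Estimate P(current label | previous label) as relative frequencies."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[previous]][row[current]] += 1
    return {
        prev: {cur: n / sum(c.values()) for cur, n in c.items()}
        for prev, c in counts.items()
    }


reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
lag_groups = split_by_lag(reader.fieldnames)
freqs = conditional_label_frequencies(rows)
```

To try this on one of the real files, replace `io.StringIO(SAMPLE)` with an opened file handle and adjust `current` and `previous` to the actual label column names. If the rows of `freqs` differ clearly from the overall label distribution, the previous commit's activity carries some information about the current one. This is only a descriptive check, not a substitute for the models mentioned above.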