This repository is the codebase for our project, Mutation Testing Based on Mining GitHub Commits. We are Team 1 of the CS453 Automated Software Testing course at KAIST.
For our final presentation, see `final-presentation.pdf`.
For our final report, see `final-presentation.pdf`.
This repository contains code and data for data mining/crawling, preprocessing, and clustering.
- We collected commits from open-source Python projects on GitHub. To do so, we used Google BigQuery to mine the `bigquery-public-data.github_repos` dataset for commits.
- The mined commits are available under the `/mined_commits/` directory, in the form of gzipped JSON Lines files. These have been split into chunks to facilitate parallel processing.
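As an illustration, the gzipped JSON Lines chunks can be read with the standard library alone. This is a sketch, not code from the repository, and the record fields shown in the docstring (e.g. commit SHA, repository name) are assumptions about the mined schema:

```python
import gzip
import json
from pathlib import Path

def iter_commits(chunk_dir):
    """Yield one commit record (a dict, e.g. with SHA and repo name)
    per line from every gzipped JSON Lines chunk in chunk_dir."""
    for path in sorted(Path(chunk_dir).glob("*.jsonl.gz")):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)
```

Because each chunk is an independent file, several chunks can be processed by separate worker processes in parallel.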
- For each commit, we downloaded the contents of every Python source file affected by the commit, both before and after the change.
- We wrote a Scrapy spider (`scrapy_github_files.py`) to perform this task. To run the spider, use `scrapy runspider scrapy_github_files.py`.
- The spider creates a directory outside the repository at `../github_file_changes/`. It stores the crawled data in large JSON Lines files named `file_changes_chunk<num>.jsonl`, where `<num>` is the chunk number.
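To give an idea of what "before and after" crawling involves: for a given commit, the pre-change file contents live at the parent commit's SHA and the post-change contents at the commit's own SHA. The `raw.githubusercontent.com` URL scheme below is GitHub's real raw-content endpoint, but the helper names are our own and this is only an illustration of the idea, not the spider's actual code:

```python
def raw_url(repo, sha, path):
    """Build the raw.githubusercontent.com URL for one file at one commit.
    repo is "owner/name", sha is a commit SHA, path is the file path."""
    return f"https://raw.githubusercontent.com/{repo}/{sha}/{path}"

def before_after_urls(repo, commit_sha, parent_sha, path):
    """Return URLs for a file's contents before (parent commit) and
    after (the commit itself) a change."""
    return raw_url(repo, parent_sha, path), raw_url(repo, commit_sha, path)
```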
Also check out the original setup instructions added by our teammate Adil:
- Install requirements through `requirements-dev.txt`
- Create a service account for the Google BigQuery API
- Create a GitHub access token
- Export the variables as follows:
  `export GOOGLE_APPLICATION_CREDENTIALS=<Path to your key file>`
  `export GITHUB_TOKEN=<Your GitHub Access Token>`
- Create a `data` folder in the root directory of this project
- You are ready to use `extractor.py`
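A small sanity check can catch a missing credential before any mining starts. This helper is our suggestion, not part of the project's code:

```python
import os

# The two variables the setup instructions above ask you to export.
REQUIRED_VARS = ("GOOGLE_APPLICATION_CREDENTIALS", "GITHUB_TOKEN")

def missing_credentials(environ=None):
    """Return the names of required credential variables that are unset
    or empty, so callers can fail fast with a clear message."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```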
For each changed file, we extracted pairs of functions changed by the commit. We used GumTree to derive a sequence of edit actions that transforms each function from its "before" version to its "after" version. We also normalized each function's source code so that it fits on a single line, which is needed for processing with seq2seq.
- This preprocessing is done by `transform_change_entries.py`. It reads file changes stored under `../github_file_changes/` and saves the preprocessed results to `../dataset/`.
- We used JPype1 to invoke GumTree from Python.
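One common way to flatten a Python function onto a single line for seq2seq is to re-join its token stream, replacing layout with explicit markers. The sketch below uses the standard `tokenize` module; the marker names are our own convention, and the actual rules in `transform_change_entries.py` may differ:

```python
import io
import tokenize

def normalize_function(source):
    """Flatten Python source onto one line by re-joining its tokens.
    Indentation and line breaks become explicit markers so the block
    structure stays recoverable after flattening."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            out.append("<INDENT>")
        elif tok.type == tokenize.DEDENT:
            out.append("<DEDENT>")
        elif tok.type in (tokenize.NEWLINE, tokenize.NL):
            out.append("<NEWLINE>")
        elif tok.type in (tokenize.COMMENT, tokenize.ENDMARKER):
            continue  # comments and end-of-input carry no code content
        else:
            out.append(tok.string)
    return " ".join(out)
```

The resulting single-line string can be paired with its "after" counterpart to form one training example per changed function.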
(Code has been added; a description still needs to be written.)