
Codez

This repository is the codebase for our project, Mutation Testing Based on Mining GitHub Commits. We are Team 1 of the CS453 Automated Software Testing course at KAIST.

For our final presentation, see final-presentation.pdf. For our final report, see final-report.pdf.

Organization

This repository contains code and data for data mining/crawling, preprocessing, and clustering.

Data preparation

  1. We collected commits from open-source Python projects on GitHub. To do so, we used Google BigQuery to mine the bigquery-public-data.github_repos dataset for commits.
    • The mined commits are available under the /mined_commits/ directory, in the form of gzipped JSON Lines files. These have been split into chunks to facilitate parallel processing.
  2. For each mined commit, we downloaded the contents of every Python source file affected by the commit, both before and after the change.
    • We wrote a Scrapy spider (scrapy_github_files.py) to perform the task. To run the spider, use scrapy runspider scrapy_github_files.py.
    • The spider creates a directory outside the repository at ../github_file_changes/. It stores the crawled data in JSON Lines files named file_changes_chunk<num>.jsonl, where <num> is the chunk number.
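The gzipped JSON Lines chunks can be consumed with nothing but the standard library. A minimal sketch for iterating over one chunk (the file name and the field name shown are illustrative assumptions, not the actual schema):

```python
import gzip
import json


def iter_commits(path):
    """Yield one commit record per line from a gzipped JSON Lines chunk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Example (chunk file name is illustrative):
# total = sum(1 for _ in iter_commits("mined_commits/chunk_000.jsonl.gz"))
```

Because each chunk is independent, this generator also makes it easy to fan the chunks out to worker processes for parallel processing.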

Also check out the original instructions added by our teammate Adil:

How to extract raw data:

  1. Install the requirements from requirements-dev.txt
  2. Create a service account for the Google BigQuery API
  3. Create a GitHub access token
  4. Export the environment variables:
    export GOOGLE_APPLICATION_CREDENTIALS=<path to your key file>
    export GITHUB_TOKEN=<your GitHub access token>
  5. Create a data folder in the root directory of this project
  6. You are ready to use extractor.py
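Put together, the setup looks like the following (a sketch only: the key file path and token are placeholders you must replace with your own values):

```shell
# Placeholder path to the service-account key downloaded from Google Cloud.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/bigquery-service-account.json"
# Placeholder; substitute the personal access token generated on GitHub.
export GITHUB_TOKEN="<your GitHub access token>"
# The data folder extractor.py expects in the project root.
mkdir -p data
# You are now ready to run: python extractor.py
```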

Preprocessing

For each changed file, we extracted pairs of functions changed by the commit. We used GumTree to derive a series of edit actions that transform the "before" version of a function into the "after" version. We also normalized each function's source code so that it fits on a single line, which is required for processing with seq2seq.

  • This preprocessing is done by transform_change_entries.py. It reads file changes stored under ../github_file_changes/, and saves the preprocessed result to ../dataset/.
  • We used JPype1 to invoke GumTree from Python.

Training / Clustering

(The code has been added; a description is still to be written.)

Contributors

adilb99, pastelmind, tiendatnguyen-vision, wulanfrom

