
Codez

This repository is the codebase for our project, Mutation Testing Based on Mining GitHub Commits. We are Team 1 of the CS453 Automated Software Testing course at KAIST.

For our final presentation, see final-presentation.pdf. For our final report, see final-report.pdf.

Organization

This repository contains code and data for data mining/crawling, preprocessing, and clustering.

Data preparation

  1. We collected commits from open-source Python projects on GitHub. To do so, we used Google BigQuery to mine the bigquery-public-data.github_repos dataset for commits.
    • The mined commits are available under the /mined_commits/ directory, in the form of gzipped JSON Lines files. These have been split into chunks to facilitate parallel processing.
  2. For each mined commit, we downloaded the contents of every Python source file affected by the commit, both before and after the change.
    • We wrote a Scrapy spider (scrapy_github_files.py) to perform the task. To run the spider, use scrapy runspider scrapy_github_files.py.
    • The spider creates a directory outside the repository at ../github_file_changes/. It stores the crawled data in JSON Lines files named file_changes_chunk<num>.jsonl, where <num> is the chunk number.
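The gzipped JSON Lines chunks can be consumed with nothing but the standard library. A minimal sketch for iterating over one chunk (the file name and the field name shown are illustrative assumptions, not the actual schema):

```python
import gzip
import json


def iter_commits(path):
    """Yield one commit record per line from a gzipped JSON Lines chunk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Example (chunk file name is illustrative):
# total = sum(1 for _ in iter_commits("mined_commits/chunk_000.jsonl.gz"))
```

Because each chunk is independent, this generator also makes it easy to fan the chunks out to worker processes for parallel processing.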

Also check out the original instructions added by our teammate Adil:

How to extract raw data:

  1. Install the requirements from requirements-dev.txt
  2. Create a service account for the Google BigQuery API
  3. Create a GitHub access token
  4. Export the environment variables:
    export GOOGLE_APPLICATION_CREDENTIALS=<path to your key file>
    export GITHUB_TOKEN=<your GitHub access token>
  5. Create a data folder in the root directory of this project
  6. You are ready to use extractor.py
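Put together, the setup looks like the following (a sketch only: the key file path and token are placeholders you must replace with your own values):

```shell
# Placeholder path to the service-account key downloaded from Google Cloud.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/keys/bigquery-service-account.json"
# Placeholder; substitute the personal access token generated on GitHub.
export GITHUB_TOKEN="<your GitHub access token>"
# The data folder extractor.py expects in the project root.
mkdir -p data
# You are now ready to run: python extractor.py
```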

Preprocessing

For each changed file, we extracted pairs of functions changed by the commit. We used GumTree to derive a series of edit actions that transform the "before" version of a function into the "after" version. We also normalized each function's source code so that it fits on a single line, which is required for processing with seq2seq.

  • This preprocessing is done by transform_change_entries.py. It reads file changes stored under ../github_file_changes/, and saves the preprocessed result to ../dataset/.
  • We used JPype1 to invoke GumTree from Python.

Training / Clustering

(The code has been added; a description is still to be written.)

Contributors

adilb99, pastelmind, tiendatnguyen-vision, wulanfrom

