
Hoppity

Hoppity is a learning-based approach to detecting and fixing bugs in JavaScript programs.

Hoppity is trained on a dataset of (buggy, fixed) pairs from GitHub commits. Refer to our gh-crawler repo for scripts to download and generate a dataset. Alternatively, we provide a cooked dataset of pairs with a single AST difference here: https://drive.google.com/file/d/1kEJBCH1weMioTcnmG6fmqz6VP-9KjH7x/view?usp=sharing (the JSON format of the cooked code graphs is also available there).

We also provide trained models for each of the following datasets:

  • One Diff Model
  • Zero and One Diff Model
  • Zero, One, and Two Diff Model

INSTALL

  • Install Python packages:
    pip install torch==1.3.1
    pip install numpy
    pip install -r requirements.txt
  • Install other dependencies:
    cd deps/torchext
    pip install -e .
  • Install the current package (run from the hoppity root directory):
    pip install -e .
  • Install JS packages:
    npm install shift-parser
    npm install ts-morph
    npm install shift-spec-consumer

Data Preprocessing

If you would like to use your own dataset, you will need to "cook" the folder as a preprocessing step to generate graphs. You can use the data_process/run_build_main.sh script to create the cooked dataset. Set the variables data_root, data_name and ast_fmt accordingly.

This builds the AST of each file in our graph format and saves it in a pkl file. Additionally, it creates a graph edit text file for each pair of (buggy, fixed) JS ASTs. This file is in JSON format: a list in which each edit is an object.
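As a rough illustration of consuming such an edit file, the sketch below parses a JSON list of edit objects and counts them per operator. The key names (`op`, `loc`, `value`, `type`) and the sample contents are assumptions made for this example only; check a real cooked edit file for the actual schema.

```python
import json
from collections import Counter

# Hypothetical edit list: the real key names in Hoppity's edit files may differ.
SAMPLE_EDITS = """
[
  {"op": "replace_val", "loc": 12, "value": "i"},
  {"op": "add", "loc": 7, "type": "IdentifierExpression", "value": "x"}
]
"""


def load_edits(text):
    """Parse a JSON document that must be a list of edit objects."""
    edits = json.loads(text)
    if not isinstance(edits, list) or not all(isinstance(e, dict) for e in edits):
        raise ValueError("expected a JSON list of edit objects")
    return edits


def count_ops(edits, op_key="op"):
    """Count edits per operator; op_key is an assumed field name."""
    return Counter(e.get(op_key, "<missing>") for e in edits)


edits = load_edits(SAMPLE_EDITS)
print(len(edits), dict(count_ops(edits)))
```

Validating the top-level shape up front makes a schema mismatch fail early instead of surfacing later during training.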

Data Split - Train, Validate, and Test

If you're using the cooked dataset we provided, this portion is already done for you. Once you've downloaded the compressed file, unzip it by running tar xzf cooked-one-diff.gz. If you do not specify an output directory, the files will be placed in ~/cooked-full-fmt-shift_node/ by default. Extraction takes around an hour. After the files are extracted, you can move on to the next step to begin training.

Otherwise, run data_process/run_split.sh to partition your cooked dataset. The script needs the raw JavaScript source files to filter out duplicates. Set the raw_src variable in the script accordingly.

run_split.sh calls split_train_test.py to load triples from the save_dir and partition according to the percentage arguments specified in config.py. The default split is 80% train, 10% validation, and 10% test. It saves three files: test.txt, val.txt, and train.txt in the save_dir with the cooked data. Each sample name in the cooked dataset is written in one of the three files.
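The partitioning described above can be sketched as follows. This is not the code in split_train_test.py: the function names, the seed, and the shuffling strategy are assumptions for illustration, and the duplicate filtering step is omitted. Only the 80/10/10 default and the three output file names come from the description above.

```python
import os
import random
import tempfile


def split_samples(names, train_pct=0.8, val_pct=0.1, seed=0):
    """Shuffle sample names and partition them into train/val/test lists."""
    names = sorted(names)  # fixed order first, so the seed fully determines the split
    random.Random(seed).shuffle(names)
    n_train = int(len(names) * train_pct)
    n_val = int(len(names) * val_pct)
    return {
        "train": names[:n_train],
        "val": names[n_train:n_train + n_val],
        "test": names[n_train + n_val:],  # the remainder goes to test
    }


def write_splits(splits, save_dir):
    """Write one sample name per line to train.txt, val.txt and test.txt."""
    for part, part_names in splits.items():
        with open(os.path.join(save_dir, part + ".txt"), "w") as f:
            for name in part_names:
                f.write(name + "\n")


# Hypothetical sample names; real names come from the cooked dataset.
names = ["sample_%03d" % i for i in range(100)]
splits = split_samples(names)
with tempfile.TemporaryDirectory() as save_dir:
    write_splits(splits, save_dir)
    with open(os.path.join(save_dir, "train.txt")) as f:
        train_lines = f.read().splitlines()
print(len(splits["train"]), len(splits["val"]), len(splits["test"]))
```

Assigning the remainder to the test set guarantees every sample name lands in exactly one file even when the percentages do not divide the dataset evenly.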

Training

Now, run run_main.sh to train on our pre-processed dataset. Set the variables in the script accordingly. Hyperparameters can be changed in common/config.py. The training runs indefinitely. Kill the script manually to end training.

Finding the Best Model

To find the "Best Model", we've provided a script that evaluates each epoch's model dump on the validation set. Run find_best_model.sh to start the evaluation. Set the variables accordingly. The loss of each epoch's model will be recorded in the LOSS_FILE.
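Once the losses are recorded, picking the best epoch amounts to taking the minimum. The sketch below assumes each line of the loss file holds an epoch number and a loss separated by whitespace; that layout is a guess for illustration, so adapt the parsing to whatever find_best_model.sh actually writes.

```python
def best_epoch(lines):
    """Return the (epoch, loss) pair with the lowest loss, or None if no line parses."""
    best = None
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        epoch, loss = int(parts[0]), float(parts[1])
        if best is None or loss < best[1]:
            best = (epoch, loss)
    return best


# Hypothetical loss records in the assumed "epoch loss" layout.
sample = ["0 2.31", "1 1.87", "", "2 1.92"]
print(best_epoch(sample))
```

To run it on a real file, pass the open file object, for example `best_epoch(open(path))`, where `path` points at your LOSS_FILE.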

Evaluation

We provide an evaluation script that can evaluate a particular model on a number of metrics:

  • Total End-to-End Accuracy - A sample is considered accurate if the model detects the bug and predicts the entire given fix.
  • Location Accuracy - Bug detection accuracy, i.e. whether the model identifies the buggy location.
  • Operator Accuracy - Since there are only 4 operators (ADD, REMOVE, REPLACE_VAL, REPLACE_TYPE), we always report top-1 accuracy.
  • Value Accuracy - If the sample is a REPLACE_VAL or ADD, it is considered accurate if the value is predicted correctly. We also include an UNKNOWN value for literals not included in the vocabulary. If the model predicts UNKNOWN for a value not in the vocabulary, it is considered correct.
  • Type Accuracy - If the sample is a REPLACE_TYPE or ADD, it is considered accurate if the node type is predicted correctly.

We also include an option for accuracy breakdown per operation type. Lastly, if you would like an exhaustive evaluation of all metrics, we provide the output_all option.
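To make the metric definitions concrete, here is a minimal sketch of how they could be computed for single-edit samples. It is not the provided evaluation script: the dict keys (`loc`, `op`, `value`, `type`) and the sample data are hypothetical, only top-1 predictions are considered, and target values outside the vocabulary are assumed to have been mapped to UNKNOWN before comparison.

```python
VALUE_OPS = {"ADD", "REPLACE_VAL"}  # operators whose value is scored
TYPE_OPS = {"ADD", "REPLACE_TYPE"}  # operators whose node type is scored


def _ratio(hits, total):
    return hits / total if total else None


def evaluate(pairs):
    """pairs: list of (target, predicted) dicts with assumed keys loc, op, value, type."""
    loc = op = val = typ = end_to_end = 0
    n_val = n_typ = 0
    for tgt, pred in pairs:
        loc_ok = pred["loc"] == tgt["loc"]
        op_ok = pred["op"] == tgt["op"]
        val_ok = typ_ok = True  # not applicable counts as correct for end-to-end
        if tgt["op"] in VALUE_OPS:
            n_val += 1
            val_ok = pred.get("value") == tgt.get("value")
            val += val_ok
        if tgt["op"] in TYPE_OPS:
            n_typ += 1
            typ_ok = pred.get("type") == tgt.get("type")
            typ += typ_ok
        loc += loc_ok
        op += op_ok
        end_to_end += loc_ok and op_ok and val_ok and typ_ok
    n = len(pairs)
    return {
        "end_to_end": _ratio(end_to_end, n),
        "location": _ratio(loc, n),
        "operator": _ratio(op, n),
        "value": _ratio(val, n_val),
        "type": _ratio(typ, n_typ),
    }


# Hypothetical (target, predicted) pairs.
pairs = [
    ({"loc": 3, "op": "REPLACE_VAL", "value": "i"},
     {"loc": 3, "op": "REPLACE_VAL", "value": "i"}),
    ({"loc": 5, "op": "ADD", "type": "Id", "value": "x"},
     {"loc": 5, "op": "ADD", "type": "Id", "value": "UNKNOWN"}),
    ({"loc": 9, "op": "REMOVE"},
     {"loc": 8, "op": "REMOVE"}),
    ({"loc": 2, "op": "REPLACE_TYPE", "type": "A"},
     {"loc": 2, "op": "REPLACE_VAL", "value": "z"}),
]
metrics = evaluate(pairs)
print(metrics)
```

Value and type accuracy are normalized by the number of samples whose target operator uses that field, so a dataset with no ADD or REPLACE_TYPE samples reports `None` for type accuracy rather than dividing by zero.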

Contributors

elizabethdinella, hanjun-dai
