DATASET folder contains all 12 datasets used to train and test RoBERTa models. The files are zipped to preserve the space.
For each dataset there are 7 files:
- training_mask: mask tokens for the training set (with a special token at the end)
- training_masked_code: training method code with a special token instead of the mask tokens
- eval_mask: mask tokens for the evaluation set (with a special token at the end)
- eval_masked_code: evaluation method code with a special token instead of the mask tokens
- test_mask: mask tokens for the test set (with a special token at the end)
- test_masked_code: test method code with a special token instead of the mask tokens
- tokenizer_training: list of methods used to train the tokenizer
For each dataset we have two different files:
- predictions.txt: RoBERTa predictions for each masked method in the test set
- raw_data.csv: reported metrics (BLEU scores and Levenshtein distance) for each prediction (same record as the previous file)
For each dataset we've reported a summary file to compare RoBERTa model with n-gram model.
In result_comparison.csv you can find, for each record processed without errors by both RoBERTa and n-gram, if each model has correctly predicted all the token (perfect prediction)
torch==1.4.0
transformers==3.0.2
tokenizers==0.8.1rc1
wandb==0.9.1
python3 tokenizer-training.py --vocab_size [vocabulary size] --train_data_file [Path of the tokenizer training file (has the name matching *_tokenizer_training.txt)] --output_dir [folder where to store the tokenizer]
python3 run_training.py --train_data_file [Path of the training file (has the name matching *_masked_code_training.txt)] --eval_data_file [Path of the evaluation file (has the name matching *_masked_code_eval.txt)] --output_root [folder where to store the model] --tokenizer_name [folder with the trained tokenizer] --vocab_size [vocabulary size]
python3 run_on_test_set.py --model_path [path of the trained model] --test_set_inputs_path [Path of the test file (has the name matching *_masked_code_test.txt)] --predictions_path [Path of the textual file where predictions will be written (the file is created by the script)]