UCSC IBM Capstone dedicated to probing large language models for code style.
- Code Style Transfer & Probing
python extract_metrics.py py150k
The script tokenizes the full dataset directly.
- train: the training set, named bq_data_outlier.csv
- eval: the evaluation set, named evaluation_set.csv
python tokenize_raw_script.py [train|eval]
Generate all individual features of parallel corpora at once.
python parallel_corpora_gen_script.py \
INPUT_CSV_PATH \
OUTPUT_CSV_PATH
INPUT_CSV_PATH: the path to the input CSV file, which contains the raw scripts. evaluation_set.csv shows the expected column names.
OUTPUT_CSV_PATH: the path to the output file, a CSV containing the scripts with each individual feature transferred.
python combined_parallel_gen_script.py [OPTIONS] TARGET_FEAT
Arguments:
TARGET_FEAT [required]
Options:
--csv-name TEXT
--output-dir TEXT
--is-short TEXT [default: False]
Example:
python combined_parallel_gen_script.py comment+docstring \
--csv-name /data/curated_eval_set/eval_set_short_individual_feat.csv \
--output-dir /data/curated_eval_set \
--is-short True
- TARGET_FEAT: any combination of the features to be transferred, joined with +, e.g. comment+docstring
- csv-name: the input CSV file, which should be the file with all individual feature transfer scripts
- output-dir: the output location
- is-short: whether the input scripts are the shortened version (within the max sequence length)
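The `+`-joined TARGET_FEAT format can be illustrated with a small parser. This is only a sketch: `parse_target_feat` and `KNOWN_FEATURES` are hypothetical names, and the feature set is assumed from the examples in this README rather than taken from the scripts themselves.

```python
# Feature names as they appear in the examples of this README (assumed set).
KNOWN_FEATURES = {"casing", "class", "list_comp", "comment", "docstring"}


def parse_target_feat(target_feat: str) -> list:
    """Split a '+'-joined feature string into a list of feature names."""
    names = [name.strip() for name in target_feat.split("+")]
    unknown = [name for name in names if name not in KNOWN_FEATURES]
    if unknown:
        raise ValueError(f"unknown feature(s): {unknown}")
    return names


print(parse_target_feat("comment+docstring"))
```

Rejecting unknown names early gives a clearer error than a failure deep inside the generation step.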
To keep the dataset within the maximum sequence length, use this script to filter out the short sequences and to extract function- and class-level code from the long sequences. It outputs two files: a short dataset and a long dataset.
python split_long_data.py [train|eval]
You will need to configure the file path inside the script.
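The short/long split can be sketched roughly as follows. This is not the code in split_long_data.py: `count_tokens` is a whitespace stand-in for the real tokenizer, the helper names are hypothetical, and it is an assumption that the long dataset holds the extracted top-level functions and classes.

```python
import ast


def count_tokens(source: str) -> int:
    """Stand-in length measure; the real script presumably uses the model tokenizer."""
    return len(source.split())


def extract_definitions(source: str) -> list:
    """Return the source text of each top-level function or class."""
    tree = ast.parse(source)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return [ast.get_source_segment(source, node)
            for node in tree.body if isinstance(node, wanted)]


def split_by_length(scripts, max_len):
    """Keep short scripts whole; break long ones into function/class snippets."""
    short, long_parts = [], []
    for script in scripts:
        if count_tokens(script) <= max_len:
            short.append(script)
        else:
            long_parts.extend(extract_definitions(script))
    return short, long_parts


scripts = ["x = 1", "import os\n\ndef f(x):\n    return x + 1\n\nclass A:\n    pass\n"]
short, long_parts = split_by_length(scripts, max_len=5)
```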
Modify the setup in config.py before starting the training. Training may not work well yet; the training process needs closer inspection. gpu.py selects the freest GPU for training.
export CUDA_VISIBLE_DEVICES=$(python gpu.py | tail -n 1); python train.py
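For readers without gpu.py at hand, the selection step might look like the sketch below. It assumes that "freest" means the most free memory and that the information comes from `nvidia-smi`; the function names are hypothetical and the actual gpu.py may work differently.

```python
import subprocess


def query_free_memory() -> str:
    """Ask nvidia-smi for 'index, free MiB' per GPU (requires an NVIDIA driver)."""
    return subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,memory.free",
         "--format=csv,noheader,nounits"],
        text=True,
    )


def pick_freest_gpu(csv_text: str):
    """Return the index of the GPU with the most free memory, or None if empty."""
    best_index, best_free = None, -1
    for line in csv_text.strip().splitlines():
        index_str, free_str = line.split(",")
        free = int(free_str)
        if free > best_free:
            best_index, best_free = int(index_str), free
    return best_index


# Parsing a captured sample instead of calling nvidia-smi directly:
print(pick_freest_gpu("0, 1024\n1, 8192\n2, 4096\n"))
```

Printing the chosen index as the last line of output is what lets the shell command above pick it up with `tail -n 1`.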
Tokenizes the data and filters out all examples with NULL values.
Usage: parallel_preprocessing_script.py [OPTIONS] TARGET_FEAT CSV_FNAME
OUTPUT_DIR
Arguments:
TARGET_FEAT [required]
CSV_FNAME [required]
OUTPUT_DIR [required]
Options:
--is-short / --no-is-short [default: no-is-short]
Example:
python parallel_preprocessing_script.py \
casing+class \
/data/curated_eval_set/downsized_eval_set_short_class_casing.csv \
/data/curated_eval_set/downsized_eval_set_short_class_casing_dataset.hf \
--is-short
- TARGET_FEAT: any combination of the features to be transferred, joined with +, e.g. casing+class+list_comp+comment+docstring
- CSV_FNAME: the CSV file that contains all individual features
  - train set - casing: bq_data_uncased.csv
  - train set - class: bq_data_outlier_no_class.csv
  - train set - list comp: bq_data_uncomp_fixed_outlier.csv
  - train set - comment: bq_uncommented_outlier.csv
  - train set - docstring: bq_updated_docstring_outlier.csv
  - eval set: eval_set_individual_feat.csv
    - the eval set contains separate labels for each individual transformation
- OUTPUT_DIR: the output dataset path (any name you like), which will be a .hf file
- is-short: whether the input scripts are the shortened version (within the max sequence length)
For the docstring transfer, preprocessing also removes very long sequences.
# individual
## i.e. class
### train
python parallel_preprocessing_script.py \
class \
bq_data_outlier_no_class.csv \
train_class_dataset.hf
### eval
python parallel_preprocessing_script.py \
class \
eval_set_individual_feat.csv \
eval_class_dataset.hf
# combined - eval only
## i.e. class+list_comp
python parallel_preprocessing_script.py \
class+list_comp \
eval_set_individual_feat.csv \
eval_class_list_comp_dataset.hf
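The NULL filtering mentioned for this preprocessing step can be pictured with the sketch below. The column names `content` and `class` are hypothetical placeholders, and the real script may detect missing values differently.

```python
import csv
import io


def drop_null_rows(rows, columns):
    """Keep only rows where every listed column has a non-empty value."""
    return [row for row in rows
            if all((row.get(col) or "").strip() for col in columns)]


# Two rows; the second has an empty 'class' cell and is dropped.
sample_csv = 'content,class\n"x = 1","y = 1"\n"a = 2",\n'
rows = list(csv.DictReader(io.StringIO(sample_csv)))
kept = drop_null_rows(rows, ["content", "class"])
print(len(rows), len(kept))
```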
Input: tokenized .hf file.
Output: tokenized .hf file containing control tokens.
You can use this to finetune or to make predictions with the combined model.
python add_control_code_script.py \
INPUT_DATASET_PATH \
OUTPUT_DATASET_PATH \
FEATURES \
CONTROL_TOKEN_TYPE
INPUT_DATASET_PATH: the input dataset file path, which should be a .hf file.
OUTPUT_DATASET_PATH: the output dataset file path, which should be a .hf file.
FEATURES: any combination of the features to be transferred, joined with +, e.g. casing+class+list_comp+comment+docstring.
CONTROL_TOKEN_TYPE: there are 4 types of control tokens
1. # (comp|case|comment|docstring|class), appended at the end of the sequence.
2. Use natural language to describe the transfer, such as "change for loop to list comprehension", and wrap it with <nl>prompt</nl>
3. Same as 2, but treating <nl> and </nl> as special tokens (new tokens to train, which they were not in 2).
4. Same as 2, but with simplified prompt sentences.
Type 4 performed the best.
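As a rough illustration of the control-token formats, the sketch below builds a type 1 input and a natural-language-prompt input. The function names are hypothetical, and both the exact separator and the placement of the prompt before the code are assumptions; add_control_code_script.py may format its inputs differently.

```python
NL_OPEN, NL_CLOSE = "<nl>", "</nl>"


def with_control_code(source: str, feature: str) -> str:
    """Type 1: append a '# <feature>' control code at the end of the sequence."""
    return f"{source} # {feature}"


def with_nl_prompt(source: str, prompt: str) -> str:
    """Types 2-4: wrap a natural-language prompt in <nl>...</nl>.

    Putting the prompt in front of the code is an assumption of this sketch.
    """
    return f"{NL_OPEN}{prompt}{NL_CLOSE}{source}"


code = "out = []\nfor i in xs:\n    out.append(i)"
print(with_control_code(code, "comp"))
print(with_nl_prompt(code, "change for loop to list comprehension"))
```

For type 3, the two tags would additionally be registered with the tokenizer; with Hugging Face tokenizers this is typically done through `add_special_tokens` followed by `resize_token_embeddings` on the model.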
# individual features (finetune only)
## i.e. class
python add_control_code_script.py \
train_class_dataset.hf \
train_class_dataset_with_tokens.hf \
class \
4
# multiple features (inference only)
## i.e. class+list comp
python add_control_code_script.py \
eval_class_list_comp_dataset.hf \
eval_class_list_comp_dataset_with_tokens.hf \
class+list_comp \
4
The Seq2Seq Generation finetuning with CodeT5.
# individual
export CUDA_VISIBLE_DEVICES=$(python gpu.py | tail -n 1); python seq2seq_train.py
# combined
export CUDA_VISIBLE_DEVICES=$(python gpu.py | tail -n 1); python combined_seq2seq_train.py
You will need to configure the training in the script:
seq2seq_train.py
- fname_prefix: your repo directory, e.g. /home/you/code-style-probing/
- train_dataset_hf_name: the train set, e.g. train_class_dataset.hf. In the script we downsized it because of the training time constraint.
- test_dataset_hf_name: the test set, e.g. test_class_dataset.hf
- output_dir_name: the checkpoint folder, e.g. codet5-class-checkpoints/
- model_checkpoint: the checkpoint name, either a folder or a Hugging Face checkpoint, e.g. Salesforce/codet5-small
- inference_only: whether to run inference only on the test set, e.g. False
- down_size_test_set: whether to downsize the test set to save time, e.g. True
- is_baseline: if set, CodeT5 is trained from scratch, e.g. False
- batch_size: e.g. 16
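Put together, the seq2seq_train.py settings could look like the fragment below. The values are the example values listed above; how the script actually lays out or combines these variables is an assumption.

```python
# Example configuration for seq2seq_train.py (layout assumed, values from this README).
fname_prefix = "/home/you/code-style-probing/"   # your repo directory
train_dataset_hf_name = "train_class_dataset.hf"  # train set
test_dataset_hf_name = "test_class_dataset.hf"    # test set
output_dir_name = "codet5-class-checkpoints/"     # checkpoint folder
model_checkpoint = "Salesforce/codet5-small"      # folder or Hugging Face checkpoint
inference_only = False       # True: skip training, only predict on the test set
down_size_test_set = True    # True: shrink the test set to save time
is_baseline = False          # True: train CodeT5 from scratch
batch_size = 16
```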
combined_seq2seq_train.py
- batch_size: 16
- output_model_name: the output model folder name, e.g. combined_nl_prompt_base_features_contd_codet5small
- checkpoint: the checkpoint directory, e.g. seq2seq_results/combined_nl_prompt_base_features_contd_codet5small/epoch 2/checkpoint-85000
- train_comp_dataset: train set, control tokens need to be added.
- train_casing_dataset: train set, control tokens need to be added.
- train_docstring_dataset: train set, control tokens need to be added.
- train_comment_dataset: train set, control tokens need to be added.
- train_class_dataset: train set, control tokens need to be added.
- test_comp_dataset: another split for validation, control tokens need to be added.
- test_casing_dataset: another split for validation, control tokens need to be added.
- test_docstring_dataset: another split for validation, control tokens need to be added.
- test_comment_dataset: another split for validation, control tokens need to be added.
- test_class_dataset: another split for validation, control tokens need to be added.
Usage: seq2seq_inference.py [OPTIONS] INFERENCE_DATASET MODEL_CKPT_PATH
OUTPUT_CSV_FILENAME
Arguments:
INFERENCE_DATASET [required]
MODEL_CKPT_PATH [required]
OUTPUT_CSV_FILENAME [required]
Options:
--batch-size INTEGER [default: 8]
--is-nl / --no-is-nl [default: no-is-nl]
--is-downsize / --no-is-downsize
[default: no-is-downsize]
Example:
rm -rf codestylist ; \
export CUDA_VISIBLE_DEVICES=1; \
python seq2seq_inference.py \
/data/code/curated_eval_set/curated_docstring_dataset_with_prompt.hf \
codestylist/combined_code_style_transformer \
combined_model_results/docstring.non_downsized.output.csv \
--batch-size 64 \
--is-nl ;
- INFERENCE_DATASET: the path of the test set (.hf)
- MODEL_CKPT_PATH: the model checkpoint path
- OUTPUT_CSV_FILENAME: the path of the prediction output
- --is-nl / --no-is-nl: whether to use the control tokens
- --is-downsize / --no-is-downsize: whether to downsize the test set; if set, it is downsized to 2000 examples
The output will be a prediction file that contains input/prediction/label.
The codestylist folder is removed because the trainer creates that folder automatically; loading the model from the hub would then pick up the empty local folder instead and fail. Always remove the folder first, whether or not it exists.
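The same cleanup can be done from Python before loading the model. This is a minimal sketch using only the standard library; `remove_stale_dir` is a hypothetical helper, and the demo runs in a temporary directory rather than in the repo.

```python
import shutil
import tempfile
from pathlib import Path


def remove_stale_dir(path) -> None:
    """Delete a directory tree if present; do nothing if it is missing (like rm -rf)."""
    shutil.rmtree(path, ignore_errors=True)


# Demo in a temporary location: create an empty 'codestylist' folder, then remove it.
demo = Path(tempfile.mkdtemp()) / "codestylist"
demo.mkdir()
remove_stale_dir(demo)
remove_stale_dir(demo)  # second call is a no-op, the folder is already gone
print(demo.exists())
```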
See seq2seq_eval.ipynb (individual) and combined_seq2seq_eval.ipynb (combined) for evaluation.
The evaluate_score.py script also runs the evaluation:
Usage: evaluate_score.py [OPTIONS] PRED_DIR OUTPUT_DIR TARGET_FEAT
Arguments:
PRED_DIR [required]
OUTPUT_DIR [required]
TARGET_FEAT [required]
Options:
--is-nl-tokens-added / --no-is-nl-tokens-added
[default: no-is-nl-tokens-added]
--clean-diff / --no-clean-diff [default: clean-diff]
Example:
python evaluate_score.py \
/data/ken/data/code/decorator.output_post_process.csv \
./test.json decorator \
--clean-diff
- PRED_DIR: your prediction CSV file
- OUTPUT_DIR: the file name of your score output JSON
- is-nl-tokens-added: if set, runs preprocessing to remove the NL prompt (combined model only)
- clean-diff: cleans up inconsistent characters introduced by AST parse and unparse before computing DiffBLEU
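To see what kind of inconsistency an AST round trip introduces, consider the sketch below. It is not the clean-diff implementation in evaluate_score.py; it only shows that `ast.parse` followed by `ast.unparse` (Python 3.9+) rewrites quoting and spacing, so passing both prediction and label through the same round trip is one way to keep such differences out of a diff-based score. `normalize` is a hypothetical helper.

```python
import ast


def normalize(code: str) -> str:
    """Round-trip code through the AST; return it unchanged if it does not parse."""
    try:
        return ast.unparse(ast.parse(code))
    except SyntaxError:
        return code


# Quote style and spacing differ in the raw text but not after the round trip.
print(normalize('x = "a"') == normalize("x = 'a'"))
print(normalize("x=1"))
```

Falling back to the raw string matters here because model predictions are not guaranteed to be valid Python.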