Comments (6)

brozi commented on September 13, 2024

It seems that some tokens can make RoBERTa's detokenizer crash. I pushed a fix that makes it output a null token for that character instead of crashing.

Note: RoBERTa's tokenizer was created for natural languages. The advantage over our tokenizer is that you can easily feed it code in any programming language. The disadvantage is that it won't remove tokens that don't change the semantics of the code (e.g. newlines in Java and C++), and its BPE codes and vocabulary won't be tailored to programming languages. If you are training only on languages we support, or languages supported by tree-sitter, you might want to use a tokenizer made for source code.
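
For illustration only (this is not the repo's preprocessing pipeline, and it assumes the Hugging Face transformers package is installed), a quick check shows how a natural-language byte-level BPE such as roberta-base treats source code: newlines and indentation survive as explicit tokens, and identifiers are split into subwords learned from English text.

python3 -c "
from transformers import RobertaTokenizer

# roberta-base's byte-level BPE was trained on English text: spaces appear as
# 'Ġ'-prefixed pieces, newlines as 'Ċ', and identifiers are split into
# English-centric subwords rather than code-aware units.
tok = RobertaTokenizer.from_pretrained('roberta-base')
code = 'int maxValue(int a, int b) {\n    return a > b ? a : b;\n}'
print(tok.tokenize(code))
"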

yazdanbakhsh commented on September 13, 2024

Thanks for the reply and suggestion.

If I understand correctly, I already used the tokenizer that you folks developed (fast instead of roberta), as follows. Are you suggesting that I use a different detokenizer during training?

python3 -m codegen_sources.preprocessing.preprocess /gcs/transcoder_data/train_data_small \
--langs java cpp python \
--mode monolingual_functions \
--bpe_mode fast \
--local true \
--train_splits 8

yazdanbakhsh commented on September 13, 2024

For training, we use the following script given in the README. If I understand correctly, I just need to set roberta_mode to false to use the program-specific tokenizers.

python3 -m torch.distributed.launch --nproc_per_node=$NGPU codegen_sources/model/train.py \
--exp_name mlm \
--dump_path '/gcs/transcoder_data/train_data_small_dump' \
--data_path '/gcs/transcoder_data/train_data_small/XLM-syml' \
--mlm_steps 'java_monolingual,python_monolingual' \
--add_eof_to_stream true \
--word_mask_keep_rand '0.8,0.1,0.1' \
--word_pred '0.15' \
--encoder_only true \
--n_layers 12  \
--emb_dim 768  \
--n_heads 12  \
--lgs 'java_monolingual-python_monolingual' \
--max_vocab 64000 \
--gelu_activation true \
--roberta_mode true \
--amp 2  \
--fp16 true  \
--batch_size 32 \
--bptt 512 \
--epoch_size 100000 \
--max_epoch 100000 \
--split_data_accross_gpu local \
--optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
--save_periodic 0 \
--validation_metrics _valid_mlm_ppl \
--stopping_criterion '_valid_mlm_ppl,10'

yazdanbakhsh commented on September 13, 2024

Setting roberta_mode = false gives a CUDA memory error. I guess this is because the number of heads in the original TransCoder repo is 8 whereas in CodeGen it is set to 12, and the number of layers was 6 compared to 12 in CodeGen. We use 16GB V100 GPUs.

Tried to allocate 96.00 MiB (GPU 6; 15.78 GiB total capacity; 12.64 GiB already allocated; 36.25 MiB free; 14.02 GiB reserved in total by PyTorch

After changing the number of layers and heads, the model trains for a while, then prints the following warnings, and then runs into the CUDA memory issue again.

Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8192.0
...
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4096.0
...
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2048.0
...

brozi commented on September 13, 2024

You made me realize that the example parameters we were showing in transcoder.md are those we used to fine-tune DOBF with a model that uses the same tokenizer and architecture as RoBERTa, in order to compare it with CodeBERT and GraphCodeBERT. That didn't really make sense here, so we updated the parameters to correspond to TransCoder's architecture. Here are the differences:

--n_layers 6  \ (was 12)
--emb_dim 1024  \  (was 768)
--n_heads 8  \ (was 12)
--gelu_activation false \ (was true)
--roberta_mode false \ (was true)
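
Put together, a sketch of the corrected MLM command (reusing the paths and language pairs from the command quoted earlier in this thread; double-check the flags against the current transcoder.md before copying) would be:

python3 -m torch.distributed.launch --nproc_per_node=$NGPU codegen_sources/model/train.py \
--exp_name mlm \
--dump_path '/gcs/transcoder_data/train_data_small_dump' \
--data_path '/gcs/transcoder_data/train_data_small/XLM-syml' \
--mlm_steps 'java_monolingual,python_monolingual' \
--add_eof_to_stream true \
--word_mask_keep_rand '0.8,0.1,0.1' \
--word_pred '0.15' \
--encoder_only true \
--n_layers 6 \
--emb_dim 1024 \
--n_heads 8 \
--lgs 'java_monolingual-python_monolingual' \
--max_vocab 64000 \
--gelu_activation false \
--roberta_mode false \
--amp 2 \
--fp16 true \
--batch_size 32 \
--bptt 512 \
--epoch_size 100000 \
--max_epoch 100000 \
--split_data_accross_gpu local \
--optimizer 'adam_inverse_sqrt,warmup_updates=10000,lr=0.0001,weight_decay=0.01' \
--save_periodic 0 \
--validation_metrics _valid_mlm_ppl \
--stopping_criterion '_valid_mlm_ppl,10'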

We also updated it to train on Python, Java, and C++ instead of only Python and Java as in the original TransCoder paper.

About your memory issues: we train our models on V100 GPUs with 32GB. If you only have 16GB, you will need to either decrease the batch size (the batch_size parameter for MLM, or the tokens_per_batch parameter for DOBF, DAE, and BT) or decrease the size of the model, for instance emb_dim (but in that case you won't be able to reload our pre-trained models). The gradient overflow messages sometimes appear at the beginning of training and are generally not a problem.
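
As a rough starting point on a 16GB card (the values below are only illustrative, not tuned recommendations from the maintainers), the flags mentioned above would be adjusted along these lines:

# MLM: halve the batch and keep lowering it until the model fits (32 -> 16 -> 8 ...)
--batch_size 16
# DOBF / DAE / BT: lower tokens_per_batch instead of batch_size
--tokens_per_batch 2000
# or shrink the model, e.g. --emb_dim 512, at the cost of no longer being able
# to reload the released pre-trained checkpoints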

yazdanbakhsh commented on September 13, 2024

Thanks, Baptiste, for the detailed answers. I can confirm that training is now working on our end.
