Note:
- Since the code was run on a cluster computation node (single-GPU version), some parts of the code can be ignored when running on your own server or on Kaggle.
- Some model paths should be modified according to your local environment.
## Base Tasks
### 4.1 (A)
- Run `prepare_sft_dataset.py` to generate two files: `sft_train.json` and `sft_test.json`.
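To sanity-check the output, one can peek at the two generated files (a minimal sketch; the record layout shown is an assumption, not the script's actual schema):

```python
import json

# Load the generated splits (assumed to be JSON lists of SFT records).
with open("sft_train.json") as f:
    train = json.load(f)
with open("sft_test.json") as f:
    test = json.load(f)

print(f"train examples: {len(train)}, test examples: {len(test)}")
print(train[0])  # inspect one record to confirm the actual field names
```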
- Train gpt2-medium
| Setting | Value |
|---|---|
| Model | gpt2-medium |
| Train iteration | 160000 |
| Batch size | 8 |
| Optimizer | AdamW (weight decay) |
| Test step | 200 steps |
| Test data | 240 |
| Other Hyper-parameter | Here |
| Training Record | Here |
| Train error | Here |
| Test error | Here |
Note: Train 20000 steps (train iterations / batch size), and test every 200 steps.
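The step budget follows directly from the table above; a small sketch of the arithmetic:

```python
# Step budget implied by the settings table above.
train_iterations = 160_000   # total training examples seen
batch_size = 8
test_every = 200             # optimizer steps between evaluations

total_steps = train_iterations // batch_size   # 20000 optimizer steps
num_evals = total_steps // test_every          # 100 evaluation points
print(total_steps, num_evals)
```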
### 4.1(B)
- Run `eval.py` to evaluate the performance of vanilla gpt2-medium and SFT gpt2-medium (result detail).
Note: `eval.py` (modified from `evaluate.py`) uses the reward model OpenAssistant/reward-model-deberta-v3-large-v2 to evaluate performance instead of using an OpenAI API key.
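A rough sketch of how such reward-model scoring works, following the model's standard sequence-classification usage on Hugging Face (the question/answer strings below are placeholders, and this is not the exact code in `eval.py`):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

question = "Explain nuclear fusion like I am five."          # placeholder prompt
answer = "Fusion is when two tiny atoms stick together and release energy."  # placeholder response

# The reward model takes (prompt, response) as a sentence pair and returns a
# single scalar logit; a higher score means the response is judged better.
inputs = tokenizer(question, answer, return_tensors="pt")
with torch.no_grad():
    score = reward_model(**inputs).logits[0].item()
print(score)
```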
### 4.1(C)
- Insights: Summarize what you find based on the results or settings from 4.1(A) and 4.1(B).
## Explorations
### 4.2(C)
- Run `train_sft.py`. The LoRA code has already been integrated by the TA; to add LoRA to gpt2-medium, simply change `cfg = get_configs("gpt2-medium")` to `cfg = get_configs("gpt2-medium/lora")` in `train_sft.py`.
- Comparison
| Setting | Figure Link |
|---|---|
| gpt2-medium train error | Here |
| gpt2-medium test error | Here |
| gpt2-medium/lora train error | Here |
| gpt2-medium/lora test error | Here |
Note: `gpt2-medium` and `gpt2-medium/lora` are trained with the same hyper-parameter settings; optimizer: AdamW (weight decay).
| LoRA Rank | Figure Link | Dialogue Quality |
|---|---|---|
| Full parameters | Here | |
| lora rank = 1 | Here | |
| lora rank = 10 | Here | |
| lora rank = 100 | Here | |
Note: Runs with different LoRA ranks use the same hyper-parameter settings; optimizer: AdamW (weight decay).
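For intuition on what the rank changes, here is a minimal, self-contained LoRA linear layer (an illustrative sketch, not the TA's integrated implementation); the rank directly sets how many extra parameters are trained per wrapped layer:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = base(x) + (x A^T) B^T * scaling, with the base frozen."""
    def __init__(self, in_features, out_features, rank=1, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling

# Trainable parameters per layer grow linearly with the rank.
for r in (1, 10, 100):
    layer = LoRALinear(1024, 1024, rank=r)   # gpt2-medium hidden size is 1024
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"rank={r}: {trainable} trainable parameters")
```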
### 4.2(A)
- Run `train_sft.py`. You only need to switch the optimizer from AdamW to the others (already included in the `fit()` function in `trainers.py`) and then test with the different optimizers.
- Comparison
| Optimizer | Figure Link | GPU Memory |
|---|---|---|
| SGD | Here | $1663992832$ bytes |
| SGD with Momentum (momentum=0.9) | Here | $1877104128$ bytes |
| SGD with Nesterov (momentum=0.9) | Here | $1877104128$ bytes |
| AdamW ($\beta_1=0.9$, $\beta_2=0.95$) | Here | $2090215424$ bytes |
Note: Models with different optimizers are trained with the same weight decay and hyper-parameter settings.
Note: The 4.2(A) experiments are conducted on gpt2-medium/lora with rank
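The optimizers in the table correspond to standard `torch.optim` constructors. Below is a sketch of how one might switch between them; the learning rate and weight decay are placeholder values, and the actual `fit()` implementation in `trainers.py` may differ:

```python
import torch

def make_optimizer(name, params, lr=1e-4, weight_decay=0.1):
    """Illustrative optimizer switch mirroring the comparison table above."""
    if name == "sgd":
        return torch.optim.SGD(params, lr=lr, weight_decay=weight_decay)
    if name == "sgd_momentum":
        return torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=weight_decay)
    if name == "sgd_nesterov":
        return torch.optim.SGD(params, lr=lr, momentum=0.9, nesterov=True,
                               weight_decay=weight_decay)
    if name == "adamw":
        return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95),
                                 weight_decay=weight_decay)
    raise ValueError(f"unknown optimizer: {name}")

# Peak GPU memory in bytes (as reported in the table) can be read after training with:
# torch.cuda.max_memory_allocated()
```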