Comments (4)
Thanks for your interest!
Below are some questions to help me understand what is going on:
- What codebase are you using for SFT?
- Can you tell me your exact training settings?
- Exactly how large is the performance gap?
- I don't think we report results for llama2 fine-tuned on cherry_data_v1, so which set of results are you comparing your models against?
- Previously another user was indeed using incorrect scripts. Can you send me the scripts you used in lm-evaluation-harness for evaluation?
As for "data splits in cherry_data_v1 are not exactly 5%, 10%, or 15%": yes, that is correct. It is caused by our earlier filtering mechanism, which is why the paper says "approximately 5% data" and reports the exact sample counts.
from cherry_llm.
Thx for your quick reply.
- I used LLaMA-Factory for SFT.
- I trained the three models under the same settings as yours: batch_size 128, learning_rate 2e-5, num_train_epochs 3, warmup_ratio 0.03, max_length 2048. Due to hardware limitations, I used fp16 rather than bf16 in training (these settings map onto a LLaMA-Factory command roughly as sketched after this list).
- As for the performance gap, please check the attached files below.
cherry_5percent.json
cherry_10percent.json
cherry_15percent.json
- I'm comparing mine with the results you report here.
- I used the following command in lm-evaluation-harness for evaluation:
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29999 -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks mmlu,ai2_arc,hellaswag,truthfulqa --batch_size 1 --output_path {log_path}
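For reference, a sketch of how these settings could map onto a LLaMA-Factory full-parameter SFT run (not my exact command; the flag names follow LLaMA-Factory's HF-style arguments and may differ across versions, and the brace-wrapped values are placeholders):

# Sketch only; verify flags against your LLaMA-Factory version.
# Effective batch size: 8 per device x 4 accumulation steps x 4 GPUs = 128.
# cutoff_len corresponds to max_length 2048 above; the vicuna template matches
# the prompt the authors say they used for llama2 training.
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node 4 src/train_bash.py \
    --stage sft \
    --do_train \
    --model_name_or_path {base_model_path} \
    --dataset {cherry_data_subset} \
    --template vicuna \
    --finetuning_type full \
    --output_dir {output_dir} \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --num_train_epochs 3 \
    --warmup_ratio 0.03 \
    --cutoff_len 2048 \
    --fp16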
from cherry_llm.
Thanks for your reply~
- Since you are using a different codebase, there is a good chance that different prompts lead to different performance in lm-evaluation-harness, as it doesn't support customized prompts. As mentioned, for the llama2 models we used the Vicuna prompt for training.
- I think the settings are the same as ours.
- N/A
- We are sorry for the misunderstanding. The results in that table are not based on cherry_data_v1 (whose IFD scores are calculated with llama1); instead, their IFD scores are calculated on llama2-7b or llama2-13b. So there will be gaps if you use the cherry_data_v1 data.
Also, the data with IFD scores calculated on llama2-7b or llama2-13b was released recently, please check: Alpaca llama2 7b, Alpaca llama2 13b, WizardLM70k llama2 7b, WizardLM70k llama2 13b. You might need to sort the data by yourself (see the sketch after this list).
- It seems that all of the testing scripts you are using are zero-shot; however, according to the open_llm_leaderboard site, most tasks are evaluated few-shot. So I think this is the main reason.
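For the sorting step, a minimal sketch: sort the released records by their IFD score, highest first, and keep the top 5%. The "ifd_score" key is hypothetical and {released_data_path} is a placeholder; check the actual field name in the released JSON.

# Sort by IFD score, highest first, then keep the top 5% of records.
# "ifd_score" is a hypothetical key name; verify it in the released file.
N=$(jq 'length' {released_data_path})
jq --argjson k $(( N * 5 / 100 )) 'sort_by(-.ifd_score) | .[0:$k]' {released_data_path} > cherry_5percent_llama2.json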
To conclude, the main reason is that you are not using the few-shot settings from the open_llm_leaderboard; see the example commands below. Besides, it is better to use the data whose IFD scores are calculated on llama2-7b or llama2-13b.
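For example, matching the leaderboard's per-task shot counts (ARC-Challenge 25-shot, HellaSwag 10-shot, MMLU 5-shot, TruthfulQA 0-shot) would look roughly like the commands below. Since --num_fewshot applies to every task in one invocation, each benchmark is run separately; task names may vary across harness versions.

# One run per benchmark, using the Open LLM Leaderboard shot counts.
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29999 -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks arc_challenge --num_fewshot 25 --batch_size 1 --output_path {log_path}
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29999 -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks hellaswag --num_fewshot 10 --batch_size 1 --output_path {log_path}
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29999 -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks mmlu --num_fewshot 5 --batch_size 1 --output_path {log_path}
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --main_process_port 29999 -m lm_eval --model hf --model_args pretrained={sft_model_path} --tasks truthfulqa --num_fewshot 0 --batch_size 1 --output_path {log_path}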
from cherry_llm.
I see, I'll try this later. Thx again.
from cherry_llm.
Related Issues (19)
- Need help: the loss curve is strange. HOT 3
- How to filter code SFT data? HOT 2
- Questions related to training HOT 5
- Could the Pre-Experienced Model be used in other different dataset? HOT 1
- Any report of time consuming? HOT 1
- Chinese SFT data cannot be displayed. HOT 3
- 'The training of pre-experienced models is discarded for more efficient usage': that means we can only use base model to do cherry analysis and selection? HOT 1
- batch? HOT 1
- About the Direct Answer Score sθ(A) HOT 2
- how many epochs to train on cherry data? HOT 2
- a confusion about Instruction-Following Difficulty (IFD) scores HOT 2
- a confusion about data_by_IFD HOT 3
- Logic behind IFD score HOT 1
- I plan to apply this method on Llama2, which part of this project needs to be changed to adapt to Llama2? HOT 1
- May I ask if this project is suitable for other large models, such as the Baichuan model, to filter high-quality datasets from other fields HOT 4
- about the paper HOT 1
- Multi-round conversation data set HOT 3
- GPT-4/ChatGPT Evaluation Code HOT 1