tianyi-lab / cherry_llm

[NAACL'24] Self-data filtering of LLM instruction-tuning data using a novel perplexity-based difficulty score, without using any other models
According to the paper and lines 103-105 of data_by_IFD.py, my understanding is that mean_1 = loss_list_1.mean() is the Direct Answer Score, mean_2 = loss_list_2.mean() is the Conditioned Answer Score, and mean_rate = mean_2/mean_1 is the Instruction-Following Difficulty (IFD) score. Is that right?
So a large mean_rate (IFD) should mean the sample is useful for the model.
But in the code I see `if mean_rate > 1: continue` at line 106 of data_by_IFD.py. Is that correct, or have I misunderstood?
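To make the computation in the question concrete, here is a minimal pure-Python sketch of the IFD score as described above. The per-token probabilities are entirely hypothetical; in the repo they come from the model's losses over the answer tokens, scored with and without the instruction in context.

```python
import math

def avg_nll(token_probs):
    """Average negative log-likelihood (cross-entropy) over answer tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical probabilities for the same answer tokens, scored
# without (direct) and with (conditioned) the instruction as context.
direct_probs = [0.20, 0.30, 0.25]        # P(w_i | answer prefix only)
conditioned_probs = [0.40, 0.50, 0.45]   # P(w_i | instruction + answer prefix)

mean_1 = avg_nll(direct_probs)       # Direct Answer Score
mean_2 = avg_nll(conditioned_probs)  # Conditioned Answer Score
ifd = mean_2 / mean_1                # Instruction-Following Difficulty

# Samples with IFD > 1 are skipped: the instruction made the answer
# *harder* to predict, which usually signals a misaligned pair.
keep = ifd <= 1
```

So high IFD (but still below 1) marks hard, informative samples, while IFD > 1 is treated as a sign the instruction and answer do not match.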
Very impressive work. I would like to ask a question: the paper says that IFD is ineffective for code SFT data. Is there an improved method specifically for screening code SFT data? Thanks.
Nice work!
I'm trying to fine-tune Llama 2 on cherry_data_v1 under the same settings as this repo. Following the settings in your original paper, I use the lm-evaluation-harness framework to evaluate the fine-tuned models on four benchmarks: mmlu, ai2_arc, hellaswag, and truthfulqa. But there is a slight performance gap between my results and what you report in this repo.
What might cause this? Or could you share the lm-evaluation-harness scripts used for evaluation?
By the way, the data splits in cherry_data_v1 are not exactly 5%, 10%, or 15% of the original Stanford-Alpaca data.
Very interesting work. I plan to apply this method to Llama 2; which parts of this project need to be changed to adapt it? I noticed that one item in your future work is 'Implement our method on llama 2 models.' Have you made any progress? Thanks a lot!
Hi authors, this project is great! I have some confusion and need your help.
Can the Pre-Experienced Model (stage 3) fine-tuned on one dataset be used to filter other datasets? For example, I used the selected pre-experienced samples (stage 2) from alpaca_data to fine-tune my pretrained model and obtained a Pre-Experienced Model, then used it to select cherry data from alpaca_data. But could I also use this Pre-Experienced Model to filter cherry data from other datasets (such as firefly)? In other words, for another dataset (such as firefly), must I select its own pre-experienced samples and fine-tune my pretrained model again to obtain a new Pre-Experienced Model?
Sorry if my description is unclear. Thanks a lot!
It takes 100 hours to select from 50,000 multi-turn samples.
I'm running the code on Chinese SFT data. After executing "pre_experience_selection.sh", I obtain the "alpaca_data_pre.json" file, but all Chinese characters in it are escaped to \uXXXX sequences, so the "Train Pre-Experienced Model" step cannot be executed.
Can you check whether "data_by_cluster" and "data_analysis" support Chinese?
Thank you.
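The \uXXXX sequences described above are the default behavior of Python's json module, which escapes non-ASCII characters when serializing. A minimal sketch of the usual fix, passing ensure_ascii=False when dumping (the data and file name here are illustrative, not the repo's actual code):

```python
import json

data = [{"instruction": "你喜欢吃什么?", "output": "我喜欢吃苹果。"}]

# Default: non-ASCII characters are escaped to \uXXXX sequences.
escaped = json.dumps(data)

# ensure_ascii=False keeps the Chinese characters literal.
readable = json.dumps(data, ensure_ascii=False)

# When writing to a file, combine it with an explicit UTF-8 encoding:
# with open("alpaca_data_pre.json", "w", encoding="utf-8") as f:
#     json.dump(data, f, ensure_ascii=False, indent=2)
```

Note that \uXXXX-escaped JSON is still valid and json.load recovers the original characters, so the escaping alone should not break a downstream loader that parses JSON properly.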
I'd like to ask the following question; thank you very much for your answer :)
Why does a higher DAS mean the sample is more challenging for the model? Doesn't a higher DAS indicate that the model predicts with higher probability, i.e., that it has mastered the sample better? (From the paper: "A higher direct answer score may suggest that the answer is inherently more challenging or intricate for the model to generate.")
My understanding of the autoregressive computation of DAS:
For the data {"instruction": "what do you like to eat?", "answer": "I like eating apples."},
DAS measures the model's difficulty in generating the answer itself, computed autoregressively over the prefixes:
I
I like
I like eating
I like eating apples
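For reference, per my reading of the paper's definition, the Direct Answer Score is the average negative log-likelihood (i.e., the cross-entropy loss) over the $N$ answer tokens, not the probability itself, so a higher score corresponds to a lower assigned probability:

```latex
s_\theta(A) = \frac{1}{N} \sum_{i=1}^{N} -\log P\!\left(w_i^A \,\middle|\, w_1^A, \ldots, w_{i-1}^A\right)
```

Under this convention, the autoregressive prefixes listed above each contribute one $-\log P$ term, and answers the model finds hard to generate receive large scores.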
Hi, I don't understand the purpose of `labels[0, :start_token] = -100` at line 68 of cherry_seletion/data_analysis.py.
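A minimal sketch of what that line achieves, assuming the PyTorch convention that label -100 is ignored by CrossEntropyLoss; the token ids and start index below are purely illustrative:

```python
IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss ignores this label by default

# Hypothetical token ids: 4 instruction tokens followed by 3 answer tokens.
input_ids = [101, 205, 309, 412, 500, 601, 702]
start_token = 4  # index where the answer begins

# Copy input_ids as labels, then mask the instruction part,
# mirroring labels[0, :start_token] = -100 in data_analysis.py.
labels = list(input_ids)
labels[:start_token] = [IGNORE_INDEX] * start_token

# The loss is now computed only over answer tokens, which is exactly
# what the Conditioned Answer Score needs.
answer_positions = [i for i, label in enumerate(labels) if label != IGNORE_INDEX]
```

In short, the line masks the instruction tokens so the averaged loss measures only how well the model predicts the answer given the instruction.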
Thanks for your work! I am currently trying to understand the logic behind the IFD score and may be misunderstanding something. Regarding Equation 4, the direct answer score, the paper says "A higher direct answer score may suggest that the answer is inherently more challenging or intricate for the model to generate." However, doesn't a higher value of Equation 4 (the probability of the sentence) indicate that the sentence makes more sense, so it would be more natural for the LM to generate? Kindly looking forward to your reply. Thanks!
How can this be batched?
Hello, I observed that the alpaca_data.json dataset used here consists of single-turn dialogues. Have you considered IFD screening for multi-turn dialogue datasets?
I noticed your update in the README. Does that mean we can only use the base model for cherry analysis and selection? Which performs better: using the base model directly, or using a pre-experienced model?
I've already moved the tensors to the GPU, and I'm sure the GPU is being used.
However, running cherry_seletion/data_analysis.py takes nearly a month on a ~50k dataset.
Is that normal?
I saw in the paper that you trained the pre-experienced model for 1 epoch. But after data selection, the paper is unclear about how many epochs you trained on the cherry data; is it 3? Looking forward to your quick response.
I have read your paper in detail. It's great work.
I still have a question: when calculating IFD scores, why not use cross_entropy or ppl directly instead of log_softmax + nll_loss? As I understand it, log_softmax + nll_loss is identical to cross_entropy, so using cross_entropy or ppl should also work. Is my understanding correct? Thanks for taking time out of your busy schedule to reply.
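The equivalence asked about above can be checked numerically. A minimal pure-Python sketch (the logits and target are illustrative; in PyTorch the same identity holds between F.cross_entropy and F.log_softmax followed by F.nll_loss):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross-entropy computed directly from logits in one step."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]

logits = [2.0, 0.5, -1.0]
target = 0

a = nll_loss(log_softmax(logits), target)  # two-step version
b = cross_entropy(logits, target)          # one-step version

# Perplexity is exp of the mean per-token loss, so ranking samples by
# mean loss or by ppl gives the same order (exp is monotonic).
ppl = math.exp(a)
```

So the two formulations agree up to floating-point error, and since the IFD score is a ratio of two mean losses, using cross_entropy directly (or comparing perplexities) yields the same selection.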
Hi authors. I am just wondering if it is possible to share the eval_after_wrap.py code. Thanks!
May I ask whether this project is suitable for other large models, such as Baichuan, for filtering high-quality datasets in other domains?
Thank you for sharing.
I'm trying to train models on my own Chinese SFT data and have a few questions:
My first step is to run "pre_experience_analysis.sh", but it seems to process all of my JSON data. Is that expected? It takes a long time. The "start_idx" and "end_idx" arguments of "data_analysis.py" are not set in your code.
Do I need to modify the code for my Chinese SFT data, or can I use it as-is?