
From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning (NAACL'24)



This is the repo for the Cherry Data Selection project, which introduces a self-guided methodology for LLMs to autonomously discern and select cherry samples from vast open-source datasets, effectively minimizing manual curation and the potential cost of instruction tuning an LLM.

The repo contains:

  • The cherry data used for fine-tuning the models; cherry_data_v1 is the cherry data selected based on the llama-1 model.
  • The model checkpoints that were trained using our cherry data.
  • The code for selecting cherry data from the existing instruction-tuning dataset.

(Feel free to email [email protected] for any questions or feedback.)

News

  • [2024/03] Our paper has been accepted to the NAACL 2024 main conference!
  • [2024/02] We released Superfiltering, which reveals the strong consistency between small and large LLMs in perceiving and evaluating the difficulty of instruction-tuning data, and utilizes a small LM, e.g., GPT-2 (124M), to effectively and efficiently select data for instruction tuning.
  • [2023/12] We released updated code for calculating the statistics of IFD scores; please check the Reflection-Tuning Code for Selection.
  • [2023/12] The statistics necessary for calculating IFD scores on Alpaca and WizardLM with llama2-7b and llama2-13b were released; please check: Alpaca llama2 7b, Alpaca llama2 13b, WizardLM70k llama2 7b, WizardLM70k llama2 13b.
  • [2023/11] We added some results on llama2-7b and llama2-13b, further showing the generalizability of our method.
  • [2023/09] We partially reconstructed the repo structure and added some results on llama2.
  • [2023/09] We released codes for evaluating the performance between two LLMs by using GPT4 or chatGPT.
  • [2023/09] We released codes for this project.

Overview

Our study puts forth a method for autonomously sifting through expansive open-source datasets to discover the most impactful training samples. We coin these samples "cherry data", designating those data fragments that hold the potential to substantially enhance LLM instruction tuning. At the heart of our research is the hypothesis that during their preliminary training stages with carefully chosen instruction data, LLMs can develop an intrinsic capability to discern instructions. This foundational understanding equips them with the discernment to assess the quality of broader datasets, thus making it possible to estimate the instruction-following difficulty in a self-guided manner.

[Overview figure: the three-phase cherry data selection pipeline]

Initially, the model is familiarized with a fraction of the target dataset during the "Learning from Brief Experience" phase. This preliminary knowledge paves the way for the subsequent "Evaluating Based on Experience" phase, where we meticulously evaluate the model's response generation. To estimate the difficulty of a given example, we propose a novel metric, the Instruction-Following Difficulty (IFD) score, which measures and compares the model's capability to generate a response given the instruction against its capability to generate that response directly. By calculating IFD scores, we quantify the challenge each sample presents to the model. Harnessing these insights, the "Retraining from Self-Guided Experience" phase utilizes the cherry data with standout IFD scores to hone the model, culminating in our superior cherry models. The net result is a model that aligns more adeptly with instructions, ensuring enhanced performance.
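To make the metric concrete, below is a minimal sketch of an IFD-style computation with Hugging Face transformers. It is illustrative only (the actual implementation lives in cherry_seletion/data_analysis.py and cherry_seletion/data_by_IFD.py); the model path and prompt layout are placeholders. The same answer is scored once conditioned on its instruction and once on its own, and the ratio of the two average losses is the IFD score.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal IFD sketch (illustrative, not the repo's implementation). Placeholder paths.
tok = AutoTokenizer.from_pretrained("path/to/hf_converted_llama")
model = AutoModelForCausalLM.from_pretrained("path/to/hf_converted_llama").eval()

def answer_loss(context: str, answer: str) -> float:
    """Average cross-entropy of the answer tokens, optionally conditioned on a context."""
    ans_ids = tok(answer, return_tensors="pt", add_special_tokens=False).input_ids
    if context:
        ctx_ids = tok(context, return_tensors="pt").input_ids
        input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : ctx_ids.shape[1]] = -100  # do not score the context tokens
    else:
        input_ids = ans_ids
        labels = input_ids.clone()
    with torch.no_grad():
        return model(input_ids=input_ids, labels=labels).loss.item()

def ifd_score(instruction: str, answer: str) -> float:
    conditioned = answer_loss(instruction, answer)  # loss of the answer given the instruction
    direct = answer_loss("", answer)                # loss of the answer on its own
    return conditioned / direct                     # higher ratio = harder to follow
```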

Highlights

  • The selection of cherry data in this project is entirely self-guided and does not need ANY extra outside models, from BERT to chatGPT.
  • Using approximately 5% or 10% of the data, our models achieve performance comparable to models trained on the full data, as demonstrated on the Alpaca and WizardLM datasets.
  • The IFD score we propose can separate samples into better and worse ones, which may provide insight into what kinds of data are good for instruction tuning.
  • (Selective Reflection-Tuning) The IFD scores and their reversed version can be utilized to construct better data! In the Reflection-Tuning Code for Selection, we proposed the Teacher-Student Collaboration pipeline to construct a training set customized for the student.
  • (Superfiltering) The IFD scores calculated by LLMs of different sizes are strongly consistent! Thus you can utilize a really small language model like GPT-2 to select the data for instruction tuning, which is super fast and efficient! Please see Superfiltering for details.

Install

Install the dependencies with pip install -r requirements.txt

Note: This requirements.txt originates from Stanford Alpaca. If you are using a different code base with PyTorch installed, you can simply install the packages below instead of installing from requirements.txt:

pip install tqdm

pip install scikit-learn

Run Code

  1. Select Pre-Experienced Data
python cherry_seletion/data_analysis.py \
    --data_path data/alpaca_data.json \
    --save_path alpaca_data_pre.pt \
    --model_name_or_path <your_path_to_hf_converted_llama_ckpt_and_tokenizer> \
    --max_length 512 \
    --prompt alpaca \
    --mod pre

--data_path: The targeted dataset in the Alpaca format
--save_path: The path to save the .pt file containing embeddings or scores
--prompt: The prompt type used for training and selecting data; choose between alpaca or wiz
--mod: pre computes the embeddings or scores needed for selecting pre-experienced samples; cherry computes them for selecting cherry samples
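For reference, the Alpaca format expected by --data_path is a JSON list of instruction/input/output records. A minimal, made-up example written with Python's json module (file name and contents are illustrative only):

```python
import json

# A made-up example of Alpaca-format records: each has "instruction",
# "input" (possibly empty), and "output" fields.
alpaca_style_data = [
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
    },
    {
        "instruction": "Translate the sentence into French.",
        "input": "Good morning.",
        "output": "Bonjour.",
    },
]

with open("data/my_dataset.json", "w", encoding="utf-8") as f:
    json.dump(alpaca_style_data, f, ensure_ascii=False, indent=2)
```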

python cherry_seletion/data_by_cluster.py \
    --pt_data_path alpaca_data_pre.pt \
    --json_data_path data/alpaca_data.json \
    --json_save_path alpaca_data_pre.json \
    --sample_num 10 \
    --kmeans_num_clusters 100 \
    --low_th 25 \
    --up_th 75

--pt_data_path: The .pt file from the previous step, containing the needed embeddings or scores
--json_data_path: The targeted dataset in the Alpaca format
--json_save_path: The path to save the selected pre-experienced samples
--sample_num: How many samples will be selected in each cluster
--kmeans_num_clusters: How many clusters will be generated by K-Means
--low_th and --up_th: The lower and upper thresholds for selecting samples within each cluster
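The actual logic lives in cherry_seletion/data_by_cluster.py; as a rough sketch of the idea only, assuming the .pt file holds one embedding and one score per sample (as NumPy arrays of shape (N, d) and (N,)) and that --low_th/--up_th act as percentile bounds on the scores within each cluster, the selection could look like this:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_pre_experienced(embeddings, scores, sample_num=10,
                           num_clusters=100, low_th=25, up_th=75, seed=0):
    """Illustrative cluster-based selection: K-Means over sample embeddings, then keep a few
    samples per cluster whose score lies between the low/up percentiles of that cluster."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=num_clusters, n_init=10, random_state=seed).fit_predict(embeddings)

    selected = []
    for c in range(num_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        lo, hi = np.percentile(scores[idx], [low_th, up_th])
        middle = idx[(scores[idx] >= lo) & (scores[idx] <= hi)]
        take = min(sample_num, len(middle))
        if take > 0:
            selected.extend(rng.choice(middle, size=take, replace=False).tolist())
    return selected  # indices into the original dataset
```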

  2. Train Pre-Experienced Model

  3. Select Cherry Data

python cherry_seletion/data_analysis.py \
    --data_path data/alpaca_data.json \
    --save_path alpaca_data_cherry.pt \
    --model_name_or_path <your_path_pre_experienced_model> \
    --max_length 512 \
    --prompt alpaca \
    --mod cherry
python cherry_seletion/data_by_IFD.py \
    --pt_data_path alpaca_data_cherry.pt \
    --json_data_path data/alpaca_data.json \
    --json_save_path alpaca_data_cherry.json \
    --max_length 512 \
    --sample_rate 0.06 \
    --prompt alpaca

--sample_rate: The fraction of cherry samples to select. You can also use --sample_number to set the exact number of samples instead.
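The actual selection is implemented in cherry_seletion/data_by_IFD.py; conceptually, samples are ranked by their IFD score and the highest-scoring fraction is kept, with samples whose IFD exceeds 1 discarded (as also discussed in the repo's issues). A rough sketch, assuming ifd_scores is a list of per-sample IFD values:

```python
def select_cherry(ifd_scores, sample_rate=0.06, sample_number=None):
    """Illustrative cherry selection: discard samples with IFD > 1, rank the rest by IFD
    (descending), and keep the top fraction (or an exact number) of the dataset."""
    valid = [(i, s) for i, s in enumerate(ifd_scores) if s <= 1.0]
    valid.sort(key=lambda pair: pair[1], reverse=True)
    k = sample_number if sample_number is not None else int(len(ifd_scores) * sample_rate)
    return [i for i, _ in valid[:k]]
```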

  4. Train Cherry Model

Data and Model Weights V1

llama 1 models

The following table provides a comparison between our cherry models and baseline models on the Huggingface Open LLM Leaderboard and AlpacaEval Leaderboard.
These results are based on cherry_data_v1. The prompt and training hyperparameters can be found in the Hyperparameters section. These results verify the effectiveness of our method, which can be used to select the most valuable data samples for instruction tuning.

| | Avg | ARC | HellaSwag | MMLU | TruthfulQA | AlpacaEval | Data | Model |
|---|---:|---:|---:|---:|---:|---:|:---:|:---:|
| Alpaca | 50.21 | 42.65 | 76.91 | 41.73 | 39.55 | 26.46 | / | / |
| 5% Alpaca | 52.06 | 53.92 | 79.49 | 36.51 | 38.33 | 34.74 | [Link] | [hf-Link] |
| 10% Alpaca | / | / | / | / | / | / | [Link] | [hf-Link] |
| 15% Alpaca | / | / | / | / | / | / | [Link] | [hf-Link] |
| WizardLM | 54.18 | 51.60 | 77.70 | 42.70 | 44.70 | 67.64 | / | / |
| WizardLM* | 52.79 | 53.07 | 77.44 | 37.75 | 42.90 | 61.99 | [hf-Link] | [hf-Link] |
| 10% WizardLM | 51.59 | 52.90 | 78.95 | 33.08 | 41.41 | 61.44 | [Link] | [hf-Link] |
| 20% WizardLM | / | / | / | / | / | / | [Link] | [hf-Link] |
| 20% WizardLM | / | / | / | / | / | / | [Link] | [hf-Link] |
| 40% WizardLM | 52.83 | 53.07 | 77.79 | 35.29 | 45.17 | 65.09 | [Link] | [hf-Link] |

Also, the WizardLM filter script is provided here: [Link]

llama 2 models

Thanks to FastChat and flash-attention, we are able to run our experiments with a longer context length.
The following results use cherry_data_v1 directly to fine-tune the llama-2-7B model, with a max length of 2048 and the original Vicuna prompts.

| | Avg | ARC | HellaSwag | MMLU | TruthfulQA | AlpacaEval | Data | Model |
|---|---:|---:|---:|---:|---:|---:|:---:|:---:|
| WizardLM | 57.09 | 54.18 | 79.25 | 46.92 | 48.01 | 66.08 | / | [Link] |
| 10% WizardLM | 57.57 | 54.86 | 80.46 | 45.74 | 49.20 | 71.36 | [Link] | [Link] |
| 20% WizardLM | / | / | / | / | / | / | [Link] | [Link] |
| 20% WizardLM | 58.50 | 55.97 | 80.40 | 46.87 | 50.76 | 72.57 | [Link] | [Link] |
| 40% WizardLM | 58.00 | 56.23 | 80.22 | 46.15 | 49.37 | 70.52 | [Link] | [Link] |

Note: WizardLM in the above table is our implementation using the FastChat code, prompt, and configuration.
Note: Due to hardware limits, all our models here use the 7B version.
Note: For these llama2 models, we still use cherry_data_v1 to ensure the effectiveness of our data. We will soon make cherry_data_v2, which is based on llama2, available.

Data and Model Weights V2

In this section, all the IFD scores are calculated on llama2-7b or llama2-13b models using Vicuna's prompt. Training a pre-experienced model is discarded for more efficient usage. The performance is promising on llama2 models even without a pre-experienced model, indicating the effectiveness of our proposed IFD scores.

| | Avg | ARC | HellaSwag | MMLU | TruthfulQA | AlpacaEval | Data | Model |
|---|---:|---:|---:|---:|---:|---:|:---:|:---:|
| Alpaca-7b (llama2) | 55.25 | 54.35 | 78.65 | 47.02 | 40.98 | 27.75 | / | / |
| 5% Alpaca-7b (llama2) | 55.78 | 57.94 | 80.37 | 44.91 | 40.62 | 36.78 | / | / |
| 10% Alpaca-7b (llama2) | 56.31 | 58.02 | 80.42 | 46.64 | 40.18 | / | / | / |
| 15% Alpaca-7b (llama2) | 56.37 | 57.42 | 80.68 | 46.40 | 40.95 | / | / | / |
| Alpaca-13b (llama2) | 58.78 | 57.59 | 81.98 | 54.05 | 41.49 | 35.00 | / | / |
| 5% Alpaca-13b (llama2) | 61.21 | 62.37 | 84.00 | 55.65 | 42.82 | 46.82 | / | / |
| 10% Alpaca-13b (llama2) | 61.02 | 62.97 | 83.88 | 55.29 | 41.93 | / | / | / |
| 15% Alpaca-13b (llama2) | 61.23 | 62.37 | 83.48 | 55.56 | 43.42 | / | / | / |

All the above models are trained using FastChat code and prompt.
Data with IFD scores will be released soon.

Evaluation

We release the code and data for using GPT4 or chatGPT to evaluate and compare the performance between two LLMs. This method greatly reduces the potential position bias of GPT4 and chatGPT. For details, please see AlpaGasus or our paper. We thank @Lichang-Chen and the AlpaGasus repo for sharing the evaluation code.

To use this code, please follow the below scripts:

bash scripts/do_eval_generation.sh: The model automatically generates the responses for a given instruction in test datasets.
bash scripts/do_eval_generation_wrap.sh: Wrap the response files of LLMs being compared.
bash scripts/do_eval.sh: Use GPT4 or chatGPT for the evaluation.
bash scripts/do_review_eval_score.sh: Parse the results and draw the figure.

More detailed instructions will be added. Feel free to drop me an email if you need them urgently.

Performance Comparison

Comparing our models trained on selected data with models trained on full data. (a) Comparison between our model with 5% Alpaca data and the official Alpaca model. (b) Comparison between our model with 10% WizardLM data and the reimplemented WizardLM model. (c) Comparison between our model with 40% WizardLM data and the official WizardLM model. All these experiments use GPT4 as the judge. Each horizontal bar represents a comparison in a specific test set.

[Figure: pairwise comparison results judged by GPT4]

Prompt

We used the following prompts for fine-tuning the cherry models with Alpaca data:

  • for examples with a non-empty input field:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
  • for examples with an empty input field:
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:

We used the following prompts for fine-tuning the cherry models with Wizard data:

{instruction}

### Response:
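For convenience, a small illustrative helper (not part of the repo) that renders an Alpaca-format record with the templates above:

```python
ALPACA_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that provides "
    "further context. Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
ALPACA_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that appropriately "
    "completes the request.\n\n### Instruction:\n{instruction}\n\n### Response:"
)
WIZARD = "{instruction}\n\n### Response:"

def build_prompt(example: dict, prompt_type: str = "alpaca") -> str:
    """Render one Alpaca-format record into the corresponding training prompt."""
    if prompt_type == "wiz":
        return WIZARD.format(instruction=example["instruction"])
    if example.get("input", ""):
        return ALPACA_WITH_INPUT.format(**example)
    return ALPACA_NO_INPUT.format(instruction=example["instruction"])
```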

Hyperparameters

| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay | Warmup Rate |
|---|---:|---:|---:|---:|---:|---:|
| Cherry Models V1 (Alpaca) | 128 | 2e-5 | 3 | 512 | 0 | 0.03 |
| Cherry Models V1 (WizardLM) | 128 | 2e-5 | 3 | 1024 | 0 | 0.03 |
| Cherry Models V2 7B | 128 | 2e-5 | 3 | 2048 | 0 | 0.03 |
| Cherry Models V2 13B | 128 | 1e-5 | 5 | 2048 | 0 | 0.03 |
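For reference, these settings map directly onto Hugging Face TrainingArguments. Below is a hedged sketch for the Cherry Models V1 (Alpaca) row; the per-device batch size, gradient-accumulation steps, and GPU count are assumptions chosen only so that the global batch size works out to 128, and the max length is handled by the tokenizer/data pipeline rather than TrainingArguments.

```python
from transformers import TrainingArguments

# Hedged mapping of the Cherry Models V1 (Alpaca) row onto TrainingArguments.
# 4 per-device x 8 accumulation steps x 4 GPUs = 128 global batch size (assumed split).
training_args = TrainingArguments(
    output_dir="cherry_alpaca_v1",
    num_train_epochs=3,
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_ratio=0.03,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,  # assumes Ampere-class GPUs; use fp16 otherwise
)
# The 512-token max length is applied when tokenizing the data (e.g., model_max_length).
```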

ToDo

  • Release the code, data, and models.
  • Release the evaluation code for comparison.
  • Train Cherry WizardLM with the length of 2048.
  • Implement our method on llama 2 models.
  • Modify the paper

Citation

Please consider citing our paper if you think our codes, data, or models are useful. Thank you!

@inproceedings{li-etal-2024-quantity,
    title = "From Quantity to Quality: Boosting {LLM} Performance with Self-Guided Data Selection for Instruction Tuning",
    author = "Li, Ming  and
      Zhang, Yong  and
      Li, Zhitao  and
      Chen, Jiuhai  and
      Chen, Lichang  and
      Cheng, Ning  and
      Wang, Jianzong  and
      Zhou, Tianyi  and
      Xiao, Jing",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.421",
    pages = "7595--7628",
}

@article{Li2024SuperfilteringWD,
  title={Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning},
  author={Ming Li and Yong Zhang and Shwai He and Zhitao Li and Hongyu Zhao and Jianzong Wang and Ning Cheng and Tianyi Zhou},
  journal={ArXiv},
  year={2024},
  volume={abs/2402.00530},
  url={https://api.semanticscholar.org/CorpusID:267365346}
}

@article{Li2024SelectiveRS,
  title={Selective Reflection-Tuning: Student-Selected Data Recycling for LLM Instruction-Tuning},
  author={Ming Li and Lichang Chen and Jiuhai Chen and Shwai He and Jiuxiang Gu and Tianyi Zhou},
  journal={ArXiv},
  year={2024},
  volume={abs/2402.10110},
  url={https://api.semanticscholar.org/CorpusID:267682220}
}

@inproceedings{li2023reflectiontuning,
  title={Reflection-Tuning: Recycling Data for Better Instruction-Tuning},
  author={Ming Li and Lichang Chen and Jiuhai Chen and Shwai He and Tianyi Zhou},
  booktitle={NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following},
  year={2023},
  url={https://openreview.net/forum?id=xaqoZZqkPU}
}

Our Related Works

If you are interested in Data Selection for Instruction Tuning, please see Cherry_LLM and Superfiltering.
If you are interested in human/LLM-free Data Augmentation for Instruction Tuning, please see Mosaic-IT and RuleR.
If you are interested in Data Improvement for Instruction Tuning, please see Reflection_Tuning.
If you are interested in Knowledge Distillation in the LLM era, please see this Survey.


cherry_llm's Issues

Chinese SFT data cannot be displayed.

I'm using Chinese SFT data for code execution. After running the "pre_experience_selection.sh" file, I obtain the "alpaca_data_pre.json" file, but all Chinese characters in the file are changed to \uxxxx. Therefore, the "Train Pre-Experienced Model" step cannot be executed.

Can you check whether "data_by_cluster" and "data_analysis" support Chinese?

Thank you.
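For context, \uxxxx escaping like this typically comes from Python's json module rather than from the selection logic itself: json.dump escapes non-ASCII characters by default, and the escaped file is still valid JSON that json.load restores to readable Chinese. A minimal sketch, assuming the script writes its output with json.dump:

```python
import json

# Example only: Chinese text survives a json.dump / json.load round trip either way.
data = [{"instruction": "你喜欢吃什么?", "output": "我喜欢吃苹果。"}]

# The default (ensure_ascii=True) writes \uXXXX escapes; ensure_ascii=False keeps readable Chinese.
with open("alpaca_data_pre.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)

with open("alpaca_data_pre.json", encoding="utf-8") as f:
    print(json.load(f)[0]["instruction"])  # 你喜欢吃什么?
```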

a confusion about Instruction-Following Difficulty (IFD) scores

According to the paper and lines 103-105 of data_by_IFD.py, my understanding is that mean_1 = loss_list_1.mean() is the Direct Answer Score, mean_2 = loss_list_2.mean() is the Conditioned Answer Score, and mean_rate = mean_2/mean_1 should be the Instruction-Following Difficulty (IFD) score, right?
So a large mean_rate (IFD) should mean that the sample is useful for the model.
But in the code I see if mean_rate > 1: continue at line 106 of data_by_IFD.py. Is that right, or have I misunderstood?

About the Direct Answer Score sθ(A)

I'd like to ask the question below; thank you very much for your answer :)

Why does a higher DAS mean the sample is more challenging for the model? Doesn't a higher DAS indicate that the model's predicted probability is larger, i.e., that it has mastered the answer better? (from paper: A higher direct answer score may suggest that the answer is inherently more challenging or intricate for the model to generate.)

My understanding of the autoregressive computation of DAS:

For the data: {"instruction": "what do you like to eat?", "answer": "I like eating apples."}
DAS measures how difficult it is for the model to generate the answer itself, and its autoregressive computation proceeds as:
I
I like
I like eating
I like eating apples

Evaluation reproducibility on benchmarks

Nice work!

I'm trying to SFT Llama 2 on cherry_data_v1 under the same settings as this repo. Also, following the settings in your original paper, I use the lm-evaluation-harness framework to evaluate the fine-tuned models on the four benchmarks mmlu, ai2_arc, hellaswag, and truthfulqa. But there is a slight performance gap between my results and what you reported in this repo.

What might cause this? Or can you share the scripts used in lm-evaluation-harness for evaluation?

BTW, the data splits in cherry_data_v1 are not exactly 5%, 10%, or 15% of the amount of the original Stanford Alpaca data.

Logic behind IFD score

Thanks for your work! I am currently trying to understand the logic behind the IFD score, and maybe I am misunderstanding something. For equation 4, the direct answer score, the paper mentions "A higher direct answer score may suggest that the answer is inherently more challenging or intricate for the model to generate". However, doesn't a higher value of equation 4 (the probability of the sentence) indicate that the sentence makes more sense, so it would be more natural for the LM to generate? Kindly looking forward to your reply. Thanks!

Multi-round conversation data set

Hello, I observed that the alpaca_data.json dataset used here is in single-round dialogue form. Have you considered IFD screening for datasets with multiple rounds of dialogue?

How to filter code SFT data?

Very impressive work. I would like to ask a question. The paper says that IFD is ineffective for code SFT data. Is there any improved method specifically for screening code SFT data? Thanks

how many epochs to train on cherry data?

I saw from the paper that you trained the pre-experienced model for 1 epoch. But after data selection, the paper is unclear about how many epochs you trained on the cherry data; is it 3? Looking forward to your quick response.

about the paper

[Screenshot of a formula from the paper]

I think there's something wrong with this formula. Shouldn't there be a minus sign?

a confusion about data_by_IFD

I have read your paper in detail. It's great work.
I still have a question: why not use cross_entropy or ppl directly when calculating IFD scores, instead of log_softmax + nll_loss? I understand that log_softmax + nll_loss is the same as cross_entropy, so using cross_entropy or ppl should also be possible; I don't know if my understanding is correct. Thanks for taking the time out of your busy schedule to reply.
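For reference, the equivalence mentioned above is easy to check numerically: F.cross_entropy on logits matches F.nll_loss on log_softmax outputs, and perplexity is the exponential of that loss. A small self-contained check (not code from the repo):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, 32000)           # 5 answer tokens, vocabulary of 32000
targets = torch.randint(0, 32000, (5,))  # their reference token ids

ce = F.cross_entropy(logits, targets)
nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets)
print(torch.allclose(ce, nll))  # True: the two formulations are identical
print(torch.exp(ce))            # perplexity is exp of the mean cross-entropy
```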

Could the Pre-Experienced Model be used on other datasets?

Hi authors, this project is great! I have some confusion and need your help.
Can a Pre-Experienced Model (stage 3) fine-tuned on one dataset be used to filter other datasets? For example, I used the selected pre-experienced samples (stage 2) from alpaca_data to fine-tune my pretrained model and obtained a Pre-Experienced Model, and then used this model to select cherry data from alpaca_data. But could I use this Pre-Experienced Model to filter cherry data from other datasets (such as firefly)? In other words, do I have to use the selected pre-experienced samples from the other dataset (such as firefly) to fine-tune my pretrained model and obtain a new Pre-Experienced Model?
My English is poor, so I don't know whether my description is clear. Thanks a lot!

Questions related to training

Thank you for sharing

I'm trying to train models using my Chinese SFT data. I have some questions as follows:

  1. My first step is to run "pre_experience_analysis.sh", but it seems to run over all of my json data. Is that reasonable? It takes a long time. The "start_idx" and "end_idx" of "data_analysis.py" are not set in your code.

  2. Do I need to modify the code for my own Chinese SFT data, or can I just use it as is?

Any report of time consumption?

I've already moved the tensors to the GPU, and I'm pretty sure the GPU is being used.
However, running cherry_seletion/data_analysis.py takes nearly a month on a dataset of about 50k samples.
Is that normal?
