princeton-nlp / LESS


[ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning

License: MIT License

Jupyter Notebook 56.81% Python 39.93% Shell 3.27%

data data-selection influence instruction-tuning llama llm mistral

less's Introduction

LESS: Selecting Influential Data for Targeted Instruction Tuning

🌟 ArXiv Preprint

This repo hosts the code for the paper "LESS: Selecting Influential Data for Targeted Instruction Tuning". In this work, we propose a data selection method to select influential data to induce a target capability.


Install Requirements

Step 1: To get started with this repository, follow the installation steps below. Before proceeding, make sure you have PyTorch installed.

pip3 install torch==2.1.2 torchvision torchaudio

Step 2: Then install the rest of the required packages:

cd LESS
pip install -r requirement.txt

Step 3: Finally, install the less package in editable mode to make it accessible for your development environment:

pip install -e .

Data Preparation

We follow the open-instruct repo to prepare four instruction tuning datasets. In our project, we use a combination of four training datasets: Flan v2, COT, Dolly, and Open Assistant. For evaluation, we use three additional datasets: MMLU, TydiQA, and BBH. A processed version of these files is available here.

Data Selection Pipeline

Step 1: Warmup training

To improve downstream performance from data selection, it's crucial to start with a warmup training step: select a small portion of the full dataset and train on it with LoRA. An example script is shown below.

DATA_DIR=../data
MODEL_PATH=meta-llama/Llama-2-7b-hf
PERCENTAGE=0.05 # percentage of the full data to train, you can specify the training file you want to use in the script
DATA_SEED=3
JOB_NAME=llama2-7b-p${PERCENTAGE}-lora-seed${DATA_SEED}

./less/scripts/train/warmup_lora_train.sh "$DATA_DIR" "$MODEL_PATH" "$PERCENTAGE" "$DATA_SEED" "$JOB_NAME"
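The role of PERCENTAGE and DATA_SEED can be illustrated with a minimal sketch of seeded subsampling. The helper name `sample_warmup_subset` is hypothetical and is not taken from the repository; the training script may subsample differently.

```python
import random


def sample_warmup_subset(examples, percentage, seed):
    """Return a reproducible random subset holding `percentage` of `examples`.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # private RNG, so the global random state is untouched
    k = max(1, int(len(examples) * percentage))  # keep at least one example (a choice made here)
    indices = sorted(rng.sample(range(len(examples)), k))  # sample without replacement
    return [examples[i] for i in indices]


# Toy data standing in for the instruction-tuning examples.
examples = [{"id": i} for i in range(1000)]
subset = sample_warmup_subset(examples, percentage=0.05, seed=3)
print(len(subset))
```

Fixing the seed means that rerunning the warmup step picks the same examples, which is why the seed appears in the job name.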

Step 2: Building the gradient datastore

Once the initial warmup training stage is completed, we will collect gradients for the entire training dataset. For each checkpoint, our goal is to obtain the gradients of all the training data that we would like to select from. An example script is shown below.

CKPT=105

TRAINING_DATA_NAME=dolly
TRAINING_DATA_FILE=../data/train/processed/dolly/dolly_data.jsonl # when changing data name, change the data path accordingly
GRADIENT_TYPE="adam"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TRAINING_DATA_NAME}-ckpt${CKPT}-${GRADIENT_TYPE}
DIMS="8192"

./less/scripts/get_info/get_train_lora_grads.sh "$TRAINING_DATA_FILE" "$MODEL_PATH" "$OUTPUT_PATH" "$DIMS" "$GRADIENT_TYPE"

Ideally, the datastore should contain gradients for every checkpoint and every training dataset you wish to select from.
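DIMS sets the dimension that each gradient is projected down to before it is stored. As a rough illustration of why a random projection is a reasonable way to compress gradients, here is a minimal pure-Python sketch using a dense Rademacher (±1) matrix. This is an assumption made for illustration, not the repository's implementation, and the function names are hypothetical.

```python
import math
import random


def make_rademacher_projection(input_dim, output_dim, seed=0):
    """Build an output_dim x input_dim matrix with entries +/- 1/sqrt(output_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(output_dim)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(input_dim)]
            for _ in range(output_dim)]


def project(matrix, vector):
    """Multiply the projection matrix by a flattened gradient vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy "gradient" of dimension 64 projected down to 8 dimensions.
matrix = make_rademacher_projection(input_dim=64, output_dim=8, seed=0)
gradient = [math.sin(i) for i in range(64)]
projected = project(matrix, gradient)
print(len(projected))
```

Because the projection is linear and the same matrix is reused for every example, inner products between projected gradients approximate inner products between the original gradients, with the approximation improving as the output dimension grows.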

Step 3: Selecting data for a task

To select data for a particular downstream task, it's necessary to first prepare data specific to that task, using the same instruction-tuning prompt format as was employed during training. We have set up data loading modules for three evaluation datasets featured in our work: BBH, TydiQA, and MMLU. If you're interested in data selection for additional tasks, you can expand the less/data_selection/get_validation_dataset.py script to accommodate those tasks. Similar to obtaining gradients for training data, run the following script. The primary difference is that this process will yield SGD gradients for the validation data, following the formulation of the influence estimation.

CKPT=105
TASK=tydiqa
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TASK}-ckpt${CKPT}-sgd # for validation data, we always use sgd
DATA_DIR=../data
DIMS="4096 8192" # We use 8192 as our default projection dimension 

./less/scripts/get_info/get_eval_lora_grads.sh "$TASK" "$DATA_DIR" "$MODEL_PATH" "$OUTPUT_PATH" "$DIMS"

You should obtain the gradients of the validation data for every checkpoint you used when building the gradient datastore in the previous step. After obtaining the gradients for the validation data, we can select data for the task. The following script calculates the influence score for each training data point.

DIM=8192 # decide which dimension to use
GRADIENT_PATH=../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-adam/dim${DIM}
TRAIN_FILE_NAMES="flan_v2 cot dolly oasst1"
CKPTS="105 211 317 420" # checkpoint index
CHECKPOINT_WEIGHTS="1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06" # average lr of the epoch

VALIDATION_GRADIENT_PATH=../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-sgd/dim${DIM}
TARGET_TASK_NAMES="tydiqa"
SELECTED_DATA_OUTPUT_PATH="../selected_data"

./less/scripts/data_selection/matching.sh "$GRADIENT_PATH" "$TRAIN_FILE_NAMES" "$CKPTS" "$CHECKPOINT_WEIGHTS" "$VALIDATION_GRADIENT_PATH" "$TARGET_TASK_NAMES" "$SELECTED_DATA_OUTPUT_PATH"
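A minimal sketch of how a per-example influence score could be assembled from these inputs, assuming the formulation described in the LESS paper: a checkpoint-weighted sum of cosine similarities between a training example's gradient and the validation gradient. The function names are hypothetical, and whether the weights are normalized first is not specified here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def influence_score(train_grads_by_ckpt, val_grads_by_ckpt, checkpoint_weights):
    """Sum over checkpoints of weight * cos(train gradient, validation gradient)."""
    return sum(
        weight * cosine(train_grad, val_grad)
        for weight, train_grad, val_grad in zip(
            checkpoint_weights, train_grads_by_ckpt, val_grads_by_ckpt
        )
    )


# One training example seen at two checkpoints, with toy 2-dimensional gradients.
train_grads = [[1.0, 0.0], [0.6, 0.8]]
val_grads = [[1.0, 0.0], [0.0, 1.0]]
weights = [2.0, 1.0]  # stand-ins for the per-checkpoint average learning rates
score = influence_score(train_grads, val_grads, weights)
print(score)
```

In this sketch, a checkpoint with a larger weight contributes more to the final score, which mirrors the role of CHECKPOINT_WEIGHTS above.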

The influence score for each training data point will be saved in the output directory passed to the script (SELECTED_DATA_OUTPUT_PATH above). You can then use the following script to select the top-k data points with the highest influence scores.

python3 -m less.data_selection.write_selected_data \
--target_task_names ${TARGET_TASK_NAMES} \
--train_file_names ${TRAIN_FILE_NAMES} \
--train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
--output_path $SELECTED_DATA_OUTPUT_PATH \
--percentage 0.05
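The selection itself amounts to ranking examples by score and keeping the top fraction. A minimal sketch, assuming the scores are available as a plain list; `select_top_fraction` is a hypothetical name, not the module's API.

```python
def select_top_fraction(scores, percentage):
    """Return indices of the highest-scoring examples, best first.

    `percentage` is a fraction in (0, 1]; at least one index is always returned,
    which is a choice made for this sketch.
    """
    k = max(1, int(len(scores) * percentage))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


scores = [0.1, 0.9, 0.3, 0.7]
selected = select_top_fraction(scores, percentage=0.5)
print(selected)  # [1, 3]
```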

Step 4: Train with your selected data

After selecting the data, you can use the following script to train the model with the selected data.

TARGET_TASK_NAME="tydiqa"
PERCENTAGE=0.05
TRAIN_FILES=../selected_data/${TARGET_TASK_NAME}/top_p${PERCENTAGE}.jsonl
MODEL_PATH=meta-llama/Llama-2-7b-hf
JOB_NAME=llama2-7b-less-p${PERCENTAGE}-lora

./less/scripts/train/lora_train.sh "$TRAIN_FILES" "$MODEL_PATH" "$JOB_NAME"

Note that you can also perform full-parameter finetuning by removing the LoRA training parameters.

Evaluation

Please follow the instructions in the evaluation folder to evaluate the performance of the model trained on the selected data.

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Mengzhou ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

Please cite our paper if you find the repo helpful in your work:

@article{xia2024less,
  title={{LESS}: Selecting Influential Data for Targeted Instruction Tuning},
  author={Xia, Mengzhou and Malladi, Sadhika and Gururangan, Suchin and Arora, Sanjeev and Chen, Danqi},
  year={2024}
}


less's Issues

No optimizer.bin in Step 2

Hi,
When I run "Step 2: Building the gradient datastore", I get:
FileNotFoundError: [Errno 2] No such file or directory: '../out/llama2-7b-p0.05-lora-seed3/checkpoint-1688/optimizer.bin'
I checked the folder "llama2-7b-p0.05-lora-seed3" generated by Step 1, and checkpoint-1688 contains only optimizer.pt, not optimizer.bin.

I noticed that in other issues, some people had problems generating optimizer.pt. I think my problem is different from those.
May I kindly ask for your help?

Question about running stage 3 script

Following the script provided in the second step of "Selecting data for a task" in your readme, I have a command line that needs to be run as shown below:

./less/scripts/data_selection/matching.sh ../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-adam/dim8192 "flan_v2 cot dolly oasst1" "422 845 1268 1688" "1.681734880879658e-05 1.4844950172237425e-05 1.2464498996972609e-05 9.999999999999999e-06" ../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-sgd/dim8192 "tydiqa" "../selected_data"

However, after running it, I encountered the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/export/project/jzhanggr/qyx/LESS/less/data_selection/matching.py", line 58, in <module>
    validation_info = torch.load(validation_path)
  File "/export/project/jzhanggr/miniconda3/lib/python3.11/site-packages/torch/serialization.py", line 986, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/export/project/jzhanggr/miniconda3/lib/python3.11/site-packages/torch/serialization.py", line 435, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/export/project/jzhanggr/miniconda3/lib/python3.11/site-packages/torch/serialization.py", line 416, in __init__
    super().__init__(open(name, mode))
IsADirectoryError: [Errno 21] Is a directory: '../grads/llama2-7b-p0.05-lora-seed3/tydiqa-ckpt422-sgd/dim8192'

The script expects this path to be a file, but it is a directory containing three files: all_orig.pt, all_unormalized.pt, and grads-9.pt. Should I be loading grads-9.pt at this point? Your assistance would be greatly appreciated.

FileNotFoundError: optimizer.bin and KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'

I ran Step 2:

CKPT=422

TRAINING_DATA_NAME=dolly
TRAINING_DATA_FILE=./data/train/processed/dolly/dolly_data.jsonl # when changing data name, change the data path accordingly
GRADIENT_TYPE="adam"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TRAINING_DATA_NAME}-ckpt${CKPT}-${GRADIENT_TYPE}
DIMS="8192"

./less/scripts/get_info/grad/get_train_lora_grads.sh "$TRAINING_DATA_FILE" "$MODEL_PATH" "$OUTPUT_PATH" "$DIMS" "$GRADIENT_TYPE"

It throws this error:

FileNotFoundError: [Errno 2] No such file or directory: '../out/llama2-7b-p0.05-lora-seed3/checkpoint-422/optimizer.bin'

So I copied optimizer.pt to optimizer.bin.
Rerunning Step 2 then throws this error:

trainable params: 134,217,728 || all params: 6,872,641,536 || trainable%: 1.9529278123549145
Generating train split: 15011 examples [00:00, 33217.42 examples/s]
Tokenizing and reformatting instruction data (num_proc=10): 100%|██████████| 15011/15011 [00:02<00:00, 7119.25 examples/s]
Traceback (most recent call last):
  File "/maindata/data/shared/Security-SFT/common_tools/mambaforge/envs/less/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/maindata/data/shared/group/common_tools/mambaforge/envs/less/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/maindata/data/shared/group/xxxx/git/LESS/less/data_selection/get_info.py", line 156, in <module>
    collect_grads(dataloader,
  File "/maindata/data/shared/group/xxxx/git/LESS/less/data_selection/collect_grad_reps.py", line 195, in collect_grads
    m, v = prepare_optimizer_state(model, adam_optimizer_state, device)
  File "/maindata/data/shared/group/xxxx/git/LESS/less/data_selection/collect_grad_reps.py", line 132, in prepare_optimizer_state
    avg = torch.cat([optimizer_state[n]["exp_avg"].view(-1) for n in names])
  File "/maindata/data/shared/group/xxxx/git/LESS/less/data_selection/collect_grad_reps.py", line 132, in <listcomp>
    avg = torch.cat([optimizer_state[n]["exp_avg"].view(-1) for n in names])
KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'
There are 15011 examples in the dataset

OOM Errors in Step 3.1

Hello, I encounter an Out of Memory error at step 3.1. My configuration includes a single A100 GPU with 80GB of memory. Reducing the max_length to 128 allows me to avoid the OOM error. I would like to know if this is a reasonable approach and if there are any other methods to resolve this issue.

Here is the command I am using:
CUDA_VISIBLE_DEVICES=0 python3 -m less.data_selection.get_info \
--task $task \
--info_type grads \
--model_path $model \
--output_path $output_path \
--gradient_projection_dimension $dims \
--gradient_type sgd \
--data_dir $data_dir \
--max_length 128

Using Multiple GPUs

How can we utilize multiple GPUs for the gradient feature collection step? The current implementation only works with a single GPU.

Are samples used for warmup training and gradient calculation the same?

Hi,

I'm trying to run experiments following the instructions in the README.

I find that in Step 1 (warmup training), 5% of the samples are randomly selected to train $M_S$. But in Step 2 (building the gradient datastore), the samples used to calculate gradients seem to be fixed as the first 200 samples of each dataset.

This leaves me unsure whether the samples used for warmup training and for gradient calculation should be the same. Could you kindly clarify?

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:6 and cuda:0!

Hi!

I am trying to run Step 2 on device cuda:6 because cuda:0 is in use, so I moved the batches and the model to cuda:6. To confirm, I print the device of the batch and the model in the obtain_gradients_with_adam function.

But the following error occurs:

Traceback (most recent call last):
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/u2019000171/cjy/LESS/less/data_selection/get_info.py", line 156, in <module>
    collect_grads(dataloader,
  File "/home/u2019000171/cjy/LESS/less/data_selection/collect_grad_reps.py", line 263, in collect_grads
    vectorized_grads = obtain_gradients_with_adam(model, batch, m, v)
  File "/home/u2019000171/cjy/LESS/less/data_selection/collect_grad_reps.py", line 121, in obtain_gradients_with_adam
    loss = model(**batch,).loss
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/peft/peft_model.py", line 1081, in forward
    return self.base_model(
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 103, in forward
    return self.model.forward(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1183, in forward
    outputs = self.model(
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1026, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/accelerate/hooks.py", line 166, in new_forward
    output = module._old_forward(*args, **kwargs)
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/home/u2019000171/.conda/envs/less/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:6 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
Using Adam gradients
cuda:6
cuda:6

I haven't been able to figure this out.

Step 2: Building the gradient datastore: FileNotFoundError: [Errno 2] No such file or directory: '../out/llama2-7b-p0.05-lora-seed3/checkpoint-1688/optimizer.bin'

Hi,
I get an error when I run "Step 2: Building the gradient datastore":
FileNotFoundError: [Errno 2] No such file or directory: '../out/llama2-7b-p0.05-lora-seed3/checkpoint-1688/optimizer.bin'
I checked the folder "llama2-7b-p0.05-lora-seed3" generated by Step 1, and checkpoint-1688 contains only optimizer.pt.

Thanks

A small mistake in the code

Thanks for sharing your code.
https://github.com/princeton-nlp/LESS/blob/main/less/data_selection/write_selected_data.py#L76
In this version of the code, a small mistake causes sorted.csv to come out wrong. To fix it, lines 76 and 77 should be swapped.

Error with fast_jl

CUDA error: the provided PTX was compiled with an unsupported toolchain.

What might be causing this error?

Questions about the influence value

I'm wondering what range of influence values is expected in the LESS setting. I'm reproducing the method, and for some samples the maximum influence value across all subtasks is about 0.1-0.4. Is this correct, or should the similarity for the most similar subtask be higher or lower than 0.1-0.4?

What if we have the optimizer states of specific tasks at the beginning?

If we already have optimizer states at the beginning, can we use the optimizer states of the original large model instead of warming up with LoRA?

Error while deserializing header: InvalidHeaderDeserialization

Hi,

Thanks for sharing the code for this great work. I have a problem at Step 2: the model cannot be loaded, and the error is safetensors_rust.SafetensorError: Error while deserializing header: InvalidHeaderDeserialization.

Could you help me with this?

A question about calculating the score

N_SUBTASKS = {"mmlu": 57, "bbh": 27, "tydiqa": 9}
influence_score = influence_score.reshape(
    influence_score.shape[0], N_SUBTASKS[target_task_name], -1).mean(-1).max(-1)[0]

What does N_SUBTASKS mean, and why is this done? Can I change it to "influence_score = influence_score.mean(-1)[0]"?
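For readers following this question, the effect of the reshape, mean, and max chain can be sketched in plain Python for a single training example. The sketch assumes that the validation scores are laid out contiguously by subtask with the same number of examples per subtask, which is what the reshape requires; the function name is hypothetical. In PyTorch, `.max(-1)` returns a (values, indices) pair, so the trailing `[0]` keeps the values, whereas `[0]` after `.mean(-1)` would index the first training example.

```python
def aggregate_by_subtask(scores_row, n_subtasks):
    """Average one training example's scores within each subtask, then take the best subtask."""
    per_subtask = len(scores_row) // n_subtasks
    subtask_means = [
        sum(scores_row[s * per_subtask:(s + 1) * per_subtask]) / per_subtask
        for s in range(n_subtasks)
    ]
    return max(subtask_means)


# One training example scored against 6 validation examples from 3 subtasks (2 each).
row = [0.1, 0.3, 0.5, 0.7, 0.0, 0.2]
best_subtask_score = aggregate_by_subtask(row, n_subtasks=3)  # per-subtask means are about 0.2, 0.6, 0.1
plain_mean = sum(row) / len(row)
print(best_subtask_score, plain_mean)
```

Taking the maximum over subtasks keeps an example that is highly relevant to one subtask, while a plain mean over all validation examples would dilute that signal across unrelated subtasks.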

Question about accuracy

Hello, I have some questions about the accuracy of llama2-7b.

In Table 5, the accuracy of llama2-7b-base on MMLU/TydiQA/BBH is 46.7/52.1/39.8, but when we test llama2-7b from "https://huggingface.co/meta-llama/Llama-2-7b-hf/tree/main" we get 46.0/42.5/40.4. Why is it so different from the table?

Also, training on the provided selected data gives 50.0/54.8/41.1 on MMLU/TydiQA/BBH, and our own reproduction gives 49.3/54.0/42.3. These are mostly lower than the 50.2/56.2/41.5 in the table.

Can you kindly explain it to me? Thanks!

Step 2: error loading optimizer.pt when running "/get_train_lora_grads.sh"

When optimizer.pt is loaded, the keys are different from what the code expects:
KeyError: 'base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight'

The keys in the optimizer.pt state are 0~255.

data selection results

Thank you very much for your work!
I hope it's not too forward to ask whether you could share the datasets obtained in the third step (data selection). I believe it could help reduce costs.

Does it support baichuan or chatglm?

Also, I found that I can't run pip install -e . successfully. I think the repo needs a setup.py file.

A question about adding a new baseline

Great work!
I have a question that I hope can be answered. From reading the paper, I understand that model capability is measured on the test set of each dataset. What if we directly select the training samples that are most similar to the test set and use that as a baseline?
For example, since they are all text, we could embed them into vectors and then find the training samples most similar to the test set.
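The baseline described above can be sketched as follows. To keep the example free of dependencies, bag-of-words counts stand in for a real text embedding model; that substitution, the function names, and the toy texts are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Stand-in for an embedding model: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_by_similarity(train_texts, target_texts):
    """Rank training indices by their best similarity to any target text, most similar first."""
    target_vecs = [bag_of_words(t) for t in target_texts]
    scored = []
    for idx, text in enumerate(train_texts):
        vec = bag_of_words(text)
        scored.append((max(cosine(vec, tv) for tv in target_vecs), idx))
    return [idx for _, idx in sorted(scored, key=lambda pair: pair[0], reverse=True)]


train_texts = [
    "translate this sentence to french",
    "what is the capital of peru",
    "sum two numbers",
]
target_texts = ["what is the capital of france"]
ranking = rank_by_similarity(train_texts, target_texts)
print(ranking[0])  # 1
```

A baseline like this measures surface or representation similarity only, whereas the pipeline above scores examples by gradient similarity.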