
lm-bff's Introduction

LM-BFF (Better Few-shot Fine-tuning of Language Models)

This is the implementation of the paper Making Pre-trained Language Models Better Few-shot Learners. LM-BFF is short for better few-shot fine-tuning of language models.

Quick links

Overview

In this work we present LM-BFF, a suite of simple and complementary techniques for fine-tuning pre-trained language models on a small number of training examples. Our approach includes:

  1. Prompt-based fine-tuning together with a novel pipeline for automating prompt generation.
  2. A refined strategy for incorporating demonstrations into context.

You can find more details of this work in our paper.

Requirements

To run our code, please install all the dependency packages by using the following command:

pip install -r requirements.txt

NOTE: Different versions of packages (like pytorch, transformers, etc.) may lead to different results from the paper. However, the trend should still hold no matter what versions of packages you use.

Prepare the data

We pack the original datasets (SST-2, SST-5, MR, CR, MPQA, Subj, TREC, CoLA, MNLI, SNLI, QNLI, RTE, MRPC, QQP, STS-B) here. Please download it and extract the files to ./data/original, or run the following commands:

cd data
bash download_dataset.sh

Then use the following command (in the root directory) to generate the few-shot data we need:

python tools/generate_k_shot_data.py

See tools/generate_k_shot_data.py for more options. For the results in the paper, we use the default options: K=16 and 5 different seeds (13, 21, 42, 87, 100). The few-shot data will be generated in data/k-shot. In each dataset's directory, there will be folders named $K-$SEED indicating the different data splits. You can use the following command to check whether the generated data are exactly the same as ours:

cd data/k-shot
md5sum -c checksum

NOTE: During training, the model will generate/load cache files in the data folder. If your data have changed, make sure to clean all the cache files (starting with "cache").
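
For example, here is a minimal sketch (not part of the repo) of how you could clear those cache files for one data split before re-running:

import glob
import os

# Delete the cached feature files ("cache...") inside one few-shot data folder
# so that they are regenerated from the updated data. The path is an example.
data_dir = "data/k-shot/SST-2/16-42"
for path in glob.glob(os.path.join(data_dir, "cache*")):
    os.remove(path)
    print("removed", path)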

Run LM-BFF

Quick start

Our code is built on transformers and we use its 3.4.0 version. Other versions of transformers might cause unexpected errors.

Before running any experiments, create the result folder by mkdir result to save checkpoints. Then you can run our code with the following example:

python run.py \
    --task_name SST-2 \
    --data_dir data/k-shot/SST-2/16-42 \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --do_predict \
    --evaluate_during_training \
    --model_name_or_path roberta-large \
    --few_shot_type prompt-demo \
    --num_k 16 \
    --max_steps 1000 \
    --eval_steps 100 \
    --per_device_train_batch_size 2 \
    --learning_rate 1e-5 \
    --num_train_epochs 0 \
    --output_dir result/tmp \
    --seed 42 \
    --template "*cls**sent_0*_It_was*mask*.*sep+*" \
    --mapping "{'0':'terrible','1':'great'}" \
    --num_sample 16

Most arguments are inherited from transformers and are easy to understand. We further explain some of LM-BFF's arguments:

  • few_shot_type: There are three modes
    • finetune: Standard fine-tuning
    • prompt: Prompt-based fine-tuning.
    • prompt-demo: Prompt-based fine-tuning with demonstrations.
  • num_k: Number of training instances for each class. We take num_k=16 in our paper. This argument is mainly used for indexing logs afterwards (because the training example numbers are actually decided by the data split you use).
  • template: Template for prompt-based fine-tuning. We will introduce the template format later.
  • mapping: Label word mapping for prompt-based fine-tuning. It is a string of dictionary indicating the mapping from label names to label words. NOTE: For RoBERTa, the model will automatically add space before the word. See the paper appendix for details.
  • num_sample: When using demonstrations during inference, the number of demonstration sets sampled for each input query. Say num_sample=16: we sample 16 different sets of demonstrations for one input, do the forward pass separately for each, and average the logits over all 16 samples as the final prediction (see the sketch after this list).
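
As a simplified sketch (not the repo's actual inference code), the per-query aggregation for num_sample looks roughly like this:

import torch

num_sample, num_classes = 16, 2
# Stand-in for the logits of 16 forward passes, one per sampled demonstration set,
# all for the same input query.
logits_per_sample = torch.randn(num_sample, num_classes)
final_logits = logits_per_sample.mean(dim=0)   # average over the 16 demonstration sets
prediction = final_logits.argmax().item()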

Also, this codebase supports BERT-series and RoBERTa-series pre-trained models in Huggingface's transformers. You can check Huggingface's website for available models and pass models with a "bert" or "roberta" in their names to --model_name_or_path. Some examples would be bert-base-uncased, bert-large-uncased, roberta-base, roberta-large, etc.

To easily run our experiments, you can also use run_experiment.sh (this command runs prompt-based fine-tuning with demonstrations, no filtering, manual prompt):

TAG=exp TYPE=prompt-demo TASK=SST-2 BS=2 LR=1e-5 SEED=42 MODEL=roberta-large bash run_experiment.sh

We have already defined the templates and label word mappings in it, so you only need to manipulate a few hyper-parameters and TAG (you can use whatever tag you want; it just makes finding results easier). See run_experiment.sh for more options for these environment variables. Besides, you can add extra arguments by:

TAG=exp TYPE=prompt-demo TASK=SST-2 BS=2 LR=1e-5 SEED=42 MODEL=roberta-large bash run_experiment.sh "--output_dir result/exp --max_seq_length 512"

Experiments with multiple runs

To carry out experiments with multiple data splits, following the evaluation protocol detailed in Section 3.3 of our paper (grid search for each seed and aggregation of the results over 5 different seeds), you can use the following script:

for seed in 13 21 42 87 100
do
    for bs in 2 4 8
    do
        for lr in 1e-5 2e-5 5e-5
        do
            TAG=exp \
            TYPE=prompt-demo \
            TASK=SST-2 \
            BS=$bs \
            LR=$lr \
            SEED=$seed \
            MODEL=roberta-large \
            bash run_experiment.sh
        done
    done
done

All the results will be stored in ./log. To gather all the results, run the following command:

python tools/gather_result.py --condition "{'tag': 'exp', 'task_name': 'sst-2', 'few_shot_type': 'prompt-demo'}"

Then the program will find all the trials that satisfy the condition in ./log, and print the mean/std of the final results. Note that the task names are all lower-cased and if the task has more than one metric, you need to specify the major metric (used for taking the best validation trial) in the name (e.g., mnli, mnli-mm, mrpc/acc, mrpc/f1, qqp/acc, qqp/f1, sts-b/pearson, sts-b/spearman).
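
Conceptually, the gathering step filters the logged runs by the condition and aggregates their final metric. The sketch below only illustrates the idea; the log format and the metric key ("eval_acc") are assumptions, and tools/gather_result.py is the authoritative implementation:

import ast
import numpy as np

condition = {'tag': 'exp', 'task_name': 'sst-2', 'few_shot_type': 'prompt-demo'}

results = []
with open("log") as f:                     # assuming one result record (a dict literal) per line
    for line in f:
        record = ast.literal_eval(line.strip())
        if all(record.get(k) == v for k, v in condition.items()):
            results.append(record["eval_acc"])   # assumed metric key

print("mean: %.4f, std: %.4f" % (np.mean(results), np.std(results)))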

Using demonstrations with filtering

To use the filtering mechanism when using demonstrations, we need to first generate Sentence-BERT embeddings. To generate embeddings for datasets in our paper, you can directly run

bash tools/get_sbert_embedding.sh roberta-large

roberta-large can also be replaced by bert-base, bert-large, roberta-base and distilbert-base (see Sentence Transformers for details). See tools/get_sbert_embedding.sh and tools/get_sbert_embedding.py if you want to add more datasets.
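
For intuition, here is a minimal sketch of what the embedding step produces (the real logic lives in tools/get_sbert_embedding.py; the output file name here is illustrative, not necessarily what the training code expects):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")
# Assuming the GLUE-style SST-2 format: a header row, with the sentence in the first column.
with open("data/k-shot/SST-2/16-42/train.tsv") as f:
    sentences = [line.split("\t")[0] for line in f.readlines()[1:]]
embeddings = model.encode(sentences)        # shape: [num_sentences, hidden_dim]
np.save("data/k-shot/SST-2/16-42/train_sbert.npy", np.asarray(embeddings))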

After generating the embeddings (embeddings are saved as numpy files in the data folders), we can run the following commands to do prompt-based fine-tuning with demonstrations with filtering:

TAG=exp TYPE=prompt-demo TASK=SST-2 BS=2 LR=1e-5 SEED=42 MODEL=roberta-large bash run_experiment.sh "--demo_filter --demo_filter_model sbert-roberta-large"

Automatically searched prompt

We provide our automatic search results in auto_template and auto_label_mapping. There are three types of files:

  • SST-2/16-42.txt: Initial search results for SST-2 dataset, K=16 and SEED=42.
  • SST-2/16-42.sort.txt: The initial candidates, sorted by dev set performance after prompt-based fine-tuning with each of them.
  • SST-2/16-42.score.txt: Same as above, but with dev set scores.

To use the best automatic template (auto-T in the paper), use the following command:

TAG=exp TYPE=prompt-demo TASK=SST-2 BS=2 LR=1e-5 SEED=42 MODEL=roberta-large bash run_experiment.sh "--template_path auto_template/SST-2/16-42.sort.txt --template_id 0"

You can also use the i-th automatic result by specifying different template_id.

Similarly, to use automatic label (auto-L in the paper), use the following command:

TAG=exp TYPE=prompt-demo TASK=SST-2 BS=2 LR=1e-5 SEED=42 MODEL=roberta-large bash run_experiment.sh "--mapping_path auto_label_mapping/SST-2/16-42.sort.txt --mapping_id 0"

NOTE: Make sure to use the corresponding automatic search results with different data split seeds.

Our final results (LM-BFF) use prompt-based fine-tuning with demonstrations, filtering, and the automatic template; for example:

for seed in 13 21 42 87 100
do
    for bs in 2 4 8
    do
        for lr in 1e-5 2e-5 5e-5
        do
            TAG=LM-BFF \
            TYPE=prompt-demo \
            TASK=SST-2 \
            BS=$bs \
            LR=$lr \
            SEED=$seed \
            MODEL=roberta-large \
            bash run_experiment.sh "--template_path auto_template/SST-2/16-$seed.sort.txt --template_id 0 --demo_filter --demo_filter_model sbert-roberta-large"
        done
    done
done

python tools/gather_result.py --condition "{'tag': 'LM-BFF', 'task_name': 'sst-2', 'few_shot_type': 'prompt-demo'}"

Search for automatic templates

If you want to try automatically generating templates by yourself, here are the instructions. Note that it is an extremely long process :)

To get automatic templates, we first generate template candidates by using T5:

python tools/generate_template.py \
    --output_dir my_auto_template \
    --task_name SST-2 \
    --seed 13 21 42 87 100 \
    --t5_model t5-3b \
    --beam 100

Here --t5_model specifies the pre-trained T5 checkpoint to use and --beam specifies the beam search width. Note that the t5-3b model takes approximately 15GB of GPU memory; if your GPU cannot fit it, you can try smaller T5 models (e.g., t5-base).
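
For intuition, here is a small, self-contained sketch of the T5 span-infilling idea that template generation builds on (this is not tools/generate_template.py; the sentence, label word, and model size are illustrative):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")            # the paper uses t5-3b
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# "<extra_id_0>" and "<extra_id_1>" mark the spans T5 should fill in;
# "great" plays the role of the label word for a positive SST-2 example.
text = "A fun ride. <extra_id_0> great <extra_id_1>"
input_ids = tokenizer(text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, num_beams=10, num_return_sequences=5, max_length=20)
for o in outputs:
    print(tokenizer.decode(o, skip_special_tokens=False))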

Then we do prompt-based fine-tuning with all the templates:

for template_id in {0..99}
do
    for seed in 13 21 42 87 100
    do
        # To save time, we fix these hyper-parameters
        bs=8
        lr=1e-5

        # Since we only use dev performance here, use --no_predict to skip testing
        TAG=exp-template \
        TYPE=prompt \
        TASK=SST-2 \
        BS=$bs \
        LR=$lr \
        SEED=$seed \
        MODEL=roberta-large \
        bash run_experiment.sh "--template_path my_auto_template/SST-2/16-$seed.txt --template_id $template_id --no_predict"
    done
done

... and sort them based on dev set performance:

python tools/sort_template.py --condition "{'tag': 'exp-template', 'task_name': 'sst-2'}" --template_dir my_auto_template

The sorted results will be saved in my_auto_template, with the same format as described in Automatically searched prompt.

Search for automatic label word mappings

Similar to the process of automatic template search, we first generate candidate label word mappings by running:

bash tools/run_generate_labels.sh

You can modify the options in tools/run_generate_labels.sh to run this for different datasets or save mappings to different directories. After running the generation, the candidate label mappings will be saved in my_auto_label_mapping/manual_template.

Then we do prompt-based fine-tuning of all the mappings by:

for mapping_id in {0..99}
do
    for seed in 13 21 42 87 100
    do
        # To save time, we fix these hyper-parameters
        bs=8
        lr=1e-5

        # Since we only use dev performance here, use --no_predict to skip testing
        TAG=exp-mapping \
        TYPE=prompt \
        TASK=SST-2 \
        BS=$bs \
        LR=$lr \
        SEED=$seed \
        MODEL=roberta-large \
        bash run_experiment.sh "--mapping_path my_auto_label_mapping/manual_template/SST-2/16-$seed.txt --mapping_id $mapping_id --no_predict"
    done
done

... and sort them based on dev set performance:

python tools/sort_mapping.py --condition "{'tag': 'exp-mapping', 'task_name': 'sst-2'}" --mapping_dir my_auto_label_mapping/manual_template

The sorted results will be saved in my_auto_label_mapping/manual_template, with the same format as described in Automatically searched prompt.

Auto T + L: We can also do a joint search of templates and label word mappings following these steps:

  1. First, do the automatic template search following Search for automatic templates.
  2. The following steps are similar to automatic label mapping except a few arguments. When running tools/run_generate_labels.sh, change LOAD_TEMPLATES to true in it and the template + mapping candidates will be written in my_auto_label_mapping/auto_template
  3. For the following fine-tuning, change --mapping_path and --mapping_id to --prompt_path and --prompt_id.
  4. In the end, for re-ranking all the prompts, change tools/sort_mapping.py to tools/sort_prompt.py to get the final lists.

Ensemble model

First we need to train models with different templates:

mkdir ensemble_predict_results
for template_id in {0..19} # Use top 20 templates
do
    array_id=0
    for seed in 13 21 42 87 100
    do
        for bs in 2 4 8
        do
            for lr in 1e-5 2e-5 5e-5
            do
                TAG=exp-ensemble \
                TYPE=prompt-demo \
                TASK=SST-2 \
                BS=$bs \
                LR=$lr \
                SEED=$seed \
                MODEL=roberta-large \
                bash run_experiment.sh "--template_path auto_template/SST-2/16-$seed.sort.txt --template_id $template_id --model_id $template_id --array_id $array_id --save_logit --save_logit_dir ensemble_predict_results"

                array_id=$(expr $array_id + 1)
            done
        done
    done
done

Looks a little complicated? It's actually pretty easy to understand: --model_id and --array_id are used to distinguish different runs, and --save_logit tells the program to save the prediction results for ensembling.

After finishing the experiments, use the following command to get the ensemble results:

python tools/ensemble.py --condition "{'tag': 'exp-ensemble', 'task_name': 'sst-2', 'few_shot_type': 'prompt-demo'}" --n_models 20

where --n_models specifies how many models you want to ensemble (it should equal the number of templates used in the experiments).
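
Conceptually, the ensembling step averages the saved logits across models before taking the argmax. A minimal sketch (the file layout under ensemble_predict_results is an assumption here; see tools/ensemble.py for the real logic):

import glob
import numpy as np

# Assume one .npy array of test-set logits per saved run, shaped [n_examples, n_classes].
logit_files = sorted(glob.glob("ensemble_predict_results/*.npy"))[:20]   # top-20 templates
all_logits = np.stack([np.load(f) for f in logit_files])                 # [n_models, n_examples, n_classes]
ensemble_logits = all_logits.mean(axis=0)
predictions = ensemble_logits.argmax(axis=-1)
print(predictions[:10])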

Zero-shot experiments

It's easy to run zero-shot experiments: just add the --no_train argument:

TAG=zero-shot TYPE=prompt TASK=SST-2 BS=2 LR=1e-5 SEED=42 MODEL=roberta-large bash run_experiment.sh "--no_train"

To do "GPT-3 style" in-context learning:

TAG=gpt3-in-context TYPE=prompt-demo TASK=SST-2 BS=2 LR=1e-5 SEED=42 MODEL=roberta-large bash run_experiment.sh "--no_train --num_sample 1 --gpt3_in_context_head --gpt3_in_context_num 32 --truncate_head --use_full_length"

How to design your own templates

Here are two template examples:

For SST-2: *cls**sent_0*_It_was*mask*.*sep+* => [CLS] {S0} It was [MASK]. [SEP]

For MNLI: *cls**sent-_0*?*mask*,*+sentl_1**sep+* => [CLS] {S0}? [MASK], {S1} [SEP]

The template is composed of special tokens, variables (surrounded by *), and plain text (e.g., It_was, where spaces are replaced by _). The special tokens and variables include:

  • *cls*, *sep*, *sep+* and *mask*: Special tokens of CLS, SEP and MASK (different for different pre-trained models and tokenizers). *sep+* means the contents before and after this token have different segment embeddings (only for BERT).
  • *sent_i*: The i-th sentence.
  • *sent-_i*: The i-th sentence, discarding the last character.
  • *sentl_i*: The i-th sentence, lower-casing the first letter.
  • *sentl-_i*: The i-th sentence, discarding the last character and lower-casing the first letter.
  • *+sent_i*: The i-th sentence, adding an extra space at the beginning.
  • *+sentl_i*: The i-th sentence, adding an extra space at the beginning and lower-casing the first letter.
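
To make the format concrete, here is a small sketch (not the repo's tokenization code) that renders the SST-2 template above into readable text for a single example:

def render_template(template, sents, mask="[MASK]", cls="[CLS]", sep="[SEP]"):
    # Very simplified: it only handles the pieces used in the SST-2 example above.
    out = []
    for part in template.split("*"):
        if part == "":
            continue
        elif part == "cls":
            out.append(cls)
        elif part in ("sep", "sep+"):
            out.append(sep)
        elif part == "mask":
            out.append(mask)
        elif part.startswith("sent_"):
            out.append(sents[int(part.split("_")[1])])
        else:
            out.append(part.replace("_", " ").strip())
    return " ".join(x for x in out if x).replace(" .", ".")

print(render_template("*cls**sent_0*_It_was*mask*.*sep+*", ["no reason to watch"]))
# -> [CLS] no reason to watch It was [MASK]. [SEP]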

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Tianyu ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

Please cite our paper if you use LM-BFF in your work:

@inproceedings{gao2021making,
   title={Making Pre-trained Language Models Better Few-shot Learners},
   author={Gao, Tianyu and Fisch, Adam and Chen, Danqi},
   booktitle={Association for Computational Linguistics (ACL)},
   year={2021}
}

lm-bff's Issues

A little question about the evaluation

Hi! Thanks for your wonderful work!
I got a little question about the zero-shot evaluation on text classification. When doing argmax to get the predicted label, do we take the argmax over the whole vocabulary or over the specific label set (e.g., pos/neg in sentiment analysis)?

Thanks a LOT

How should the interdependent parameters "inspired_templates" and "replace_token_map_list" in lm_bff.json be set?

Shown below are the "inspired_templates" and "replace_token_map_list" designed for the CoLA task and dataset in the source code. I want to run the MNLI task; how should I design the corresponding templates and parameters?

{
  "model_dir": "./models/t5-3b",
  "end_token": "",
  "beam": 100,
  "inspired_templates": ["*cls**sentu_0**<extra_id_0>**label**<extra_id_1>**sep+*", "*cls*.*<extra_id_0>**label**<extra_id_1>**+sentu_0**sep+*"],
  "target_number": 2,
  "batch_size": 32,
  "gen_max_len": 20,
  "truncates": ["head", "tail"],
  "first_mask_token": "<extra_id_0>",
  "forbid_tokens": [3, 19794, 22354],
  "forbid_continuous_token": [5],
  "replace_token_map_list": [{
    "<extra_id_0>": "*cls**sent_0*",
    "<extra_id_1>": "*mask*",
    "<extra_id_2>": "*sep+*",
    "": "*sep+*",
    "▁": ""
  }, {
    "<extra_id_0>": "*cls*",
    "<extra_id_1>": "*mask*",
    "<extra_id_2>": "*+sent_0**sep+*",
    "": "*+sent_0**sep+*",
    "▁": ""
  }]
}
If the authors could provide the parameter settings for all 9 GLUE tasks, readers would be able to understand the method better. Thanks!

Load and predict

Hi, thanks for the great research and codebase, and sorry for the basic question, but how would you recommend loading a trained model and predicting on new texts for a given task?
Thanks

Is there a way to deal with label words with multiple tokens?

Hi,

It seems like the model mainly deals with English and most labels contain only 1 token. However in Chinese tasks it's quite common that labels contain multiple tokens.

I found in https://github.com/princeton-nlp/LM-BFF/blob/main/src/models.py#L75, the code says,

sequence_output, pooled_output = outputs[:2]
sequence_mask_output = sequence_output[torch.arange(sequence_output.size(0)), mask_pos]

Here mask_pos has shape [batch_size,]. Is there a way I can make mask_pos have shape [batch_size, label_word_length] and use it to calculate the loss for multi-token labels?

The issue for the loss of regression tasks

loss_fct = nn.KLDivLoss(log_target=True)

Hi Tianyu,
Thank you for releasing the code.
I found one problem with the loss of regression tasks.

"logits" is through the operation "logsoftmax" and is in [-infinite, 0], while "labels" is not through that operation and is always greater than 0. They are not in the same space, so I think here log_target cannot be True. Or the "labels" should be operated by "torch.log(labels+very small number)".

What do you think of it?

Look forward to your reply.

Best wishes,
Chuan

cannot import name 'default_hp_search_backend' from 'transformers.integrations'

Now, when I try the command TAG=exp TYPE=prompt-demo TASK=SST-2 BS=2 LR=1e-5 SEED=42 MODEL=roberta-large bash run_experiment.sh from the README, I get this error:
Traceback (most recent call last):
  File "/LM-BFF/run.py", line 23, in <module>
    from src.trainer import Trainer
  File "/LM-BFF/src/trainer.py", line 43, in <module>
    from transformers.integrations import (
ImportError: cannot import name 'default_hp_search_backend' from 'transformers.integrations' (/LM-BFF/venv/lib/python3.10/site-packages/transformers/integrations/__init__.py)
I've already found that it has been a problem in another project but I don't know how to fix it.

Bug in dataloader ?

Hi guys, I am trying to reproduce your work. In the dataloader, I found this code:

for sample_idx in range(self.num_sample):
    for query_idx in range(len(self.query_examples)):
        # If training, exclude the current example. Else keep all.
        if self.use_demo and args.demo_filter:
            # Demonstration filtering
            candidate = [support_idx for support_idx in support_indices
                           if support_idx != query_idx or mode != "train"]
            sim_score = []
            for support_idx in candidate:
                sim_score.append((support_idx, util.pytorch_cos_sim(self.support_emb[support_idx], self.query_emb[query_idx])))
            sim_score.sort(key=lambda x: x[1], reverse=True)
            if self.num_labels == 1:
                # Regression task
                limit_each_label = int(len(sim_score) // 2 * args.demo_filter_rate)
                count_each_label = {'0': 0, '1': 0}
                context_indices = []

                if args.debug_mode:
                    print("Query %s: %s" % (self.query_examples[query_idx].label, self.query_examples[query_idx].text_a)) # debug
                for support_idx, score in sim_score:
                    if count_each_label['0' if float(self.support_examples[support_idx].label) <= median_mapping[args.task_name] else '1'] < limit_each_label:
                        count_each_label['0' if float(self.support_examples[support_idx].label) <= median_mapping[args.task_name] else '1'] += 1
                        context_indices.append(support_idx)
                        if args.debug_mode:
                            print("    %.4f %s | %s" % (score, self.support_examples[support_idx].label, self.support_examples[support_idx].text_a)) # debug
            else:
                limit_each_label = int(len(sim_score) // self.num_labels * args.demo_filter_rate)
                count_each_label = {label: 0 for label in self.label_list}
                context_indices = []

                if args.debug_mode:
                    print("Query %s: %s" % (self.query_examples[query_idx].label, self.query_examples[query_idx].text_a)) # debug
                for support_idx, score in sim_score:
                    if count_each_label[self.support_examples[support_idx].label] < limit_each_label:
                        count_each_label[self.support_examples[support_idx].label] += 1
                        context_indices.append(support_idx)
                        if args.debug_mode:
                            print("    %.4f %s | %s" % (score, self.support_examples[support_idx].label, self.support_examples[support_idx].text_a)) # debug
        else:
            # Using demonstrations without filtering
            context_indices = [support_idx for support_idx in support_indices
                       if support_idx != query_idx or mode != "train"]

        # We'll subsample context_indices further later.
        self.example_idx.append((query_idx, context_indices, sample_idx))

Here it is calculating the similarity.
But I don't know why you use the loop for sample_idx in range(self.num_sample) at the outermost level: sample_idx is only used when you append the result to self.example_idx.

This code is really slow, since num_sample is set to 16.

I think you can remove for sample_idx in range(self.num_sample) and change the last lines to:

for query_idx in range(len(self.query_examples)):
    ....
    # We'll subsample context_indices further later.
    for sample_idx in range(self.num_sample):
        self.example_idx.append((query_idx, context_indices, sample_idx))

I don't know whether I am right.

Weird SST2 dataset size

I've just reproduced this work, starting with the SST-2 dataset.

I found that the size of SST-2 in this paper is 6.9k, but the size of SST-2 in the GLUE paper is 69k.

I double checked this information, and the SST2 dataset distributed in this repository has 6.9k, but the SST2 dataset distributed in the huggingface datasets has 69k.

I think some kind of filtering is applied to this data.

Could you clarify what it is?

Thank you

Which pretrained models can be used with this codebase?

First of all - thanks for this work! It is super nice and very helpful.

I would like to finetune a bunch of different pretrained model bases and collect results. The description in the README states
"roberta-large can also be replaced by bert-base, bert-large, roberta-base and distilbert-base". However, out of the four, only roberta-base seems to work out of the box. For the other three, I get some error that looks like this:

OSError: Can't load config for 'bert-large'. Make sure that:
- 'bert-large' is a correct model identifier listed on 'https://huggingface.co/models'

And indeed, bert-large isn't listed on the site. So I tried the closest thing I could find, bert-large-cased, which gives the following error:

Traceback (most recent call last):
  File "run.py", line 623, in <module>
    main()
  File "run.py", line 476, in main
    resize_token_type_embeddings(model, new_num_types=10, random_segment=model_args.random_segment)
AttributeError: 'ModelArguments' object has no attribute 'random_segment'

(Same story holds for bert-base-cased, but distilbert-base-cased seems to work).

Do you have a list of a few models which I can simply plug in as a command line option and expect them to work? I am not particularly set on the list of models in this issue - any other pretrained models would work as well.

Thanks for the help in advance,
Rohan

Error while calling dataset.py: ValueError: 50264 is not in list

Taken from: https://github.com/shi-kejian
Hi,
Thanks again for the great work.

Today I actually encountered the same error as issue #7, when testing a model prompt-tuned on SST-2 directly on the imdb movie review dataset by replacing the dev.tsv in /original with the imdb dataset, as mentioned in issue #14.

What I did:

  1. Prompt-tune a model checkpoint on SST-2 and save the model.
  2. Replace data/original/SST-2/dev.tsv with my own imdb dataset, formatted correctly.
  3. Run tools/generate_k_shot.py again. The data/k-shot/SST-2/test.tsv now comes from imdb.
  4. Load the model from step 1 and pass --no_train, --do_predict, --overwrite_cache, and the other necessary flags to zero-shot on the imdb dataset. I also cleared the cache before running it.
  5. The error occurs:
Traceback (most recent call last):
  File "run.py", line 628, in <module>
    main()
  File "run.py", line 466, in main
    if training_args.do_predict
  File "/home/yb1025/Research/ML_2/robustness/LM-BFF/src/dataset.py", line 465, in __init__
    verbose=True if _ == 0 else False,
  File "/home/yb1025/Research/ML_2/robustness/LM-BFF/src/dataset.py", line 585, in convert_fn
    other_sent_limit=self.args.other_sent_limit,
  File "/home/yb1025/Research/ML_2/robustness/LM-BFF/src/dataset.py", line 243, in tokenize_multipart_input
    mask_pos = [input_ids.index(tokenizer.mask_token_id)]
ValueError: 50264 is not in list
This "50264" is the same error as in issue Index not in list error when evaluating models zero-shot #7
Sorry for the inconvenience but do you happen to know what might went wrong?
Many thanks.

Questions about template generation

Hi Tianyu and Danqi!
Around line 182 of generate_template.py, shouldn't aggr_output[i][word_id] be passed through a log/exp, since aggr_output contains logits?
Looking forward to your reply, thanks!
(screenshot attached)

Some questions regarding details of your paper results

Hi, thanks for your outstanding work. I have a few questions regarding the details of your paper. Your insight would be highly valuable.

  1. In your paper, you use the pre-trained RoBERTa model as-is for label generation. I observed that in the code, a lm_head is initialized for converting hidden states into vocabulary space. However, this lm_head does not seem to be included in the pre-trained checkpoint. Could this randomly initialized lm_head potentially lead to the generation of meaningless or wrong labels?

  2. In my experiments using the SST-2 dataset, I achieved similar results to those reported in your paper for prompt-based fine-tuning. However, for zero-shot experiments, my results hovered around guessing levels (50%~55%), which is significantly lower than the results reported in your paper (80%~85%). Are there specific details or considerations in conducting zero-shot experiments that I should be aware of to improve these outcomes?

Looking forward to your response.

Bug in get_sbert_embedding.py?

Hi,

When I run tools/get_sbert_embedding.sh, I get the error below:

urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/sentence-transformers/roberta-large-nli-stsb-mean-tokens (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:1131)')))

Is this a problem with your code or with sentence-transformers? Thanks a lot!

Adapting LM-BFF to WiC dataset

I am attempting to identify the most suitable prompt for a word sense disambiguation task using the WiC dataset. A single data point from this dataset could look like this:

head N 3-6 They shot 20 head of quail. A reduction in the assessment per head of sheep. yes

Primarily, we aim to incorporate not only the two given sentences but also the sense that the system needs to disambiguate. Is there a feasible approach to include the sense as a variable alongside the two sentences? Another challenge is the lack of a suitable processor for the WiC dataset to handle its processing. Do you have any recommendations for a processor to use or suggestions on how to develop a new one?

I apologize for the spam. I am relatively inexperienced in this field.

About the SNLI datasets

Hi Tianyu,

Thanks for your code. I found that the downloaded SNLI dataset is slightly different from the original one. The statistics are shown below:
yours:
9843 dev.tsv
9825 test.tsv
549368 train.tsv
569036 total

original:
10001 snli_1.0_dev.txt
10001 snli_1.0_test.txt
550153 snli_1.0_train.txt
570246 total

May I ask what changes have you made?

Regards,
Yiming

code release

👍👍Your work is very enlightening and I am looking forward to the code release.

Question about the Ensemble results

Hi, I'm confused about the ensemble results in Section 7.2.
I know that PET trained multiple models on different prompts.
Did LM-BFF also follow this and train multiple models, or did it train only one model but with multiple prompts, which is like multi-task learning?
Thanks.

Question about prompt-based finetuning and automatic selection of label words

In the paper, it mentions "Let M: Y → V be a mapping from the task label space to individual words in the vocabulary V of L." Here, V is the set of "individual words" or "individual sub-words"?

I noticed that many auto-generated label words, such as "unforgettable/extraordinary/good/better/terrible" in SST-5 (Table E.1), are very long and should not be a single sub-word (from the view of a Roberta tokenizer). Then it seems that each label may contain multiple sub-words. In this case, the following sentence is confusing:
"Then for each xin, let the manipulation xprompt = T (xin) be a masked language modeling (MLM) input which contains one [MASK] token."
I'm not sure how one [MASK] token can reconstruct multiple tokens (sub-words), like "unforgettable".

This issue is also related to the automatic selection of label words, to determine whether we are searching over all the sub-words or all the words.

Could the authors clarify this detail?

Average the logits of 16 demonstrations

Hi!
Thanks a lot for your work. I have a question about prediction at inference time with demonstrations.
As the paper mentions, the final prediction logits are obtained by averaging the results over 16 demonstration sets.
But I cannot find such an averaging operation; I only found that you augment the dataset to 16 times its size, which means every input is paired with 16 demonstration sets, so there should be an operation that gathers these 16 concatenated texts (input + demo) to get the final prediction logits. The pseudo code is like below:

Require: the prediction logits of an input text_a
Input: 16 concated inputs (text_a, demo1) .... (text_a, demo16)
predictions = [16, num_classes]
Output: torch.mean(predictions, dim=0) # [1, num_classes]

Can you help me find this operation in the code? Or maybe I have some misunderstandings...
Thanks!

ImportError: cannot import name 'BertOnlyMLMHead' from 'transformers'

I installed the environment using
pip install -r requirements.txt

However, when I ran the example code,

python run.py \
    --task_name SST-2 \
    --data_dir data/k-shot/SST-2/16-42 \
    --overwrite_output_dir \
    --do_train \
    --do_eval \
    --do_predict \
    --evaluate_during_training \
    --model_name_or_path roberta-large \
    --few_shot_type prompt-demo \
    --num_k 16 \
    --max_steps 1000 \
    --eval_steps 100 \
    --per_device_train_batch_size 2 \
    --learning_rate 1e-5 \
    --num_train_epochs 0 \
    --output_dir result/tmp \
    --seed 42 \
    --template "*cls**sent_0*_It_was*mask*.*sep+*" \
    --mapping "{'0':'terrible','1':'great'}" \
    --num_sample 16 \

I encountered the following error:

Traceback (most recent call last):
  File "run.py", line 19, in <module>
    from src.models import BertForPromptFinetuning, RobertaForPromptFinetuning, resize_token_type_embeddings
  File "/home/xuhui/project/LM-BFF/src/models.py", line 6, in <module>
    from transformers import BertPreTrainedModel, BertForSequenceClassification, BertModel, BertOnlyMLMHead
ImportError: cannot import name 'BertOnlyMLMHead' from 'transformers'

I made sure that transformers==3.4.0 is installed.
Any chance you may have more insight on this?

ValueError: 50264 is not in list

Hi @gaotianyu1350 !

I think I met the same error as described in #10:

  Traceback (most recent call last):
    File "run.py", line 628, in <module>
      main()
    File "run.py", line 461, in main
      if training_args.do_eval
    File "/Users/yfhuang/Documents/GitHub/LM-BFF/src/dataset.py", line 465, in __init__
      verbose=True if _ == 0 else False,
    File "/Users/yfhuang/Documents/GitHub/LM-BFF/src/dataset.py", line 585, in convert_fn
      other_sent_limit=self.args.other_sent_limit,
    File "/Users/yfhuang/Documents/GitHub/LM-BFF/src/dataset.py", line 244, in tokenize_multipart_input
      mask_pos = [input_ids.index(tokenizer.mask_token_id)]
  ValueError: 50264 is not in list

I run the code on my own sentiment analysis dataset, which is similar to sst-5. The language model I use is RoBERTa-base. The version of Transformers is 3.4.0. The difference is that I only have 3 labels (negative/neutral/positive) instead of 5 as in sst-5. Therefore, I modified "src/processors.py" and changed the number of labels from 5 to 3. Then I ran the code on my own dataset with the task name "sst-5".

I'm not sure if it is a good way. The example for sst-5 works well, but my own test case cannot run properly. Can you please help me out? Thank you!

Index not in list error when evaluating models zero-shot

Hi,

I would like to evaluate a few different BERT models zero-shot on MNLI, SNLI, QNLI, and RTE. As such, I am running this following command, for example:

TYPE=prompt TASK=MNLI BS=2 LR=1e-5 SEED=42 K=16 MODEL=roberta-large bash run_experiment.sh "--no_train"

Unfortunately, this gives me a somewhat cryptic error message:

Traceback (most recent call last):
  File "run.py", line 632, in <module>
    main()
  File "run.py", line 465, in main
    FewShotDataset(data_args, tokenizer=tokenizer, mode="test", use_demo=("demo" in model_args.few_shot_type))
  File "/sailhome/rtaori/zero-shot-generalization/LM-BFF/src/dataset.py", line 456, in __init__
    self.features.append(self.convert_fn(
  File "/sailhome/rtaori/zero-shot-generalization/LM-BFF/src/dataset.py", line 575, in convert_fn
    inputs = tokenize_multipart_input(
  File "/sailhome/rtaori/zero-shot-generalization/LM-BFF/src/dataset.py", line 243, in tokenize_multipart_input
    mask_pos = [input_ids.index(tokenizer.mask_token_id)]
ValueError: 50264 is not in list

A similar error occurs for SNLI, QNLI, and RTE.

Do you have any advice on how to deal with this? Any help would be much appreciated!

Thanks in advance,
Rohan

Specify Model Training Epochs

Hi,

I want to run a fast experiment without training for too many epochs. I notice that there's an argument --num_train_epochs in run_experiment.sh. I change the number from 0 to 1 or 2, but the number of epochs is always 500 (32 samples, bs=8).

It seems like in trainer.py there's an argument self.args.max_steps specifying the training steps,

if self.args.max_steps > 0:
    t_total = self.args.max_steps
    num_train_epochs = self.args.max_steps // num_update_steps_per_epoch + int(
        self.args.max_steps % num_update_steps_per_epoch > 0
    )

But I'm not sure how to modify it. Is there any way could modify the number of epochs?

Thank you.

Clarification re: --num_samples value

Hi!

I enjoyed reading your paper, and thanks for releasing this nice codebase. I had a quick question: In appendix B, it's mentioned that When finetuning with demonstrations, we sample 16 different sets of demonstrations for each input and average the predicted log probability for each class during inference. However, I noticed that QQP, MNLI, and SNLI seem to use --num_sample 4 by default in the run_experiment.sh script (e.g., https://github.com/princeton-nlp/LM-BFF/blob/main/run_experiment.sh#L41 ). If I wanted to faithfully reproduce the results of the paper, should I set num_sample to 16 for these tasks?

Thanks!

Negative CoLA MCC - Prompt-Based Finetuning error?

Hi, currently I am getting negative values for CoLA prompt-tuning tests.

python run.py \
    --task_name CoLA \
    --overwrite_cache \
    --data_dir data/k-shot/CoLA/16-13 \
    --do_train \
    --do_eval \
    --do_predict \
    --evaluate_during_training \
    --model_name_or_path roberta-base \
    --few_shot_type prompt \
    --num_k 16 \
    --max_steps 1000 \
    --eval_steps 100 \
    --per_device_train_batch_size 2 \
    --learning_rate 1e-5 \
    --num_train_epochs 0 \
    --output_dir result/BERT-LARGE-13-CoLA \
    --seed 13 \
    --template "*cls**sent_0*_This_is*mask*.*sep+*" \
    --mapping "{'0':'incorrect','1':'correct'}" \
    --num_sample 16

My result was: -0.01883200893

I am just a little bit concerned if I am training the model correctly.

Any help would be greatly appreciated. Thank you for your assistance so far.

ValueError Some specified arguments are not used by the HfArgumentParser

When running run.py in the Quick Start section under Run LM-BFF, I get

raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining args}"

I believe it is complaining about --evaluate_during_training. Is this deprecated? Should this simply be removed, or is there a replacement? I have had to fix the paths to Bert and Roberta in the imports so I think this just has to do with the evolution of transformers since this was originally published.

Using trained models in a separate process for inference

Hi, thank you for this great paper and implementation. I have a question about loading a trained model in a separate process and using it for inference.

I have been able to successfully train a BertForPromptFinetuning model using a BERT MLM, and it has been saved to a directory. I am attempting to load it for inference using the .from_pretrained function, like so:

# load models, config and tokenizers
config = AutoConfig.from_pretrained("../LM-BFF/result/tmp/config.json")
model = BertForPromptFinetuning.from_pretrained('../LM-BFF/result/tmp/', config=config)
tokenizer = AutoTokenizer.from_pretrained('../LM-BFF/result/tmp/')

This required loading the custom class BertForPromptFinetuning from your repo, along with some other configuration, e.g. initialising label_word_list to the list of vocabulary IDs used during training, which in my case was:

However, when attempting inference like this:

# predict
inputs = tokenizer("employment agreement", return_tensors="pt")
model(**inputs)

I get an error relating to prediction_mask_scores: IndexError: index 413 is out of bounds for dimension 1 with size 1

Now, I am unsure if what I attempting is even possible. I am pretty new to using transformers, so I could be doing something silly. But I wondered if you had any advice on using the model in this way.

ValueError: 103 is not in list / 50264 not in list

Hi!

I noticed that sometimes, if I run the model on MNLI and set the max_length to a value that's too low (e.g., 128), I'll get this error:

  File "run.py", line 672, in <module>
    main()
  File "run.py", line 469, in main
    if training_args.do_eval
  File "/home/nfliu/git/LM-BFF/src/dataset.py", line 476, in __init__
    verbose=True if _ == 0 else False,
  File "/home/nfliu/git/LM-BFF/src/dataset.py", line 596, in convert_fn
    other_sent_limit=self.args.other_sent_limit,
  File "/home/nfliu/git/LM-BFF/src/dataset.py", line 244, in tokenize_multipart_input
    mask_pos = [input_ids.index(tokenizer.mask_token_id)]
ValueError: 103 is not in list

Where you get 103 for BERT models and 50264 for RoBERTa models.

Looking at the code, it seems like the culprit is that when the template produces more input ids than max_length, the input is truncated down ( https://github.com/princeton-nlp/LM-BFF/blob/main/src/dataset.py#L237 ). However, sometimes this truncation also removes the MASK token, which leads to the ValueError above (I'm seeing this when running on a new dataset that has slightly longer sequences). In this case, would you recommend setting first_sent_limit and other_sent_limit?

Bug in tools/ensemble.py/get_labels

The function returns label_ids. However, when print_name is in ['sst-5', 'mr', 'cr', 'mpqa', 'subj', 'trec'], label_ids is not defined. I guess there should be a return labels in this branch?

label generation problem

Hi, when I use your code to auto-generate labels, the label pairings are empty, and the output file is also empty. The attached screenshot shows the details, but I still have not found where the problem is.

question about label words which consist of more than 2 sub tokens

Thank you for your great work.

#13

I have already seen the issue above, and I understand that calculating the loss for one '[MASK]' token from multiple tokens is not possible in standard BERT/RoBERTa.

But in one of your examples, you used "terrible" as the negative label word, which would be tokenized into "ter" and "rible" by the RobertaTokenizer.

Then, how did you map the two sub-tokens "ter" and "rible" to "[MASK]" and calculate the loss from them?


Testing: New Data for GLUE Tasks

Now, I can see that there was another issue similar to this. However, I am still not clear on how to deal with OOD Test Data.

I want to train and validation on original train.tsv and dev.tsv in the folder ORIGINAL. But, I want to test on an out of distribution dataset.

So, let's say I want to test SST-2 on IMDB for roberta-base. How should I go about it? Currently, I replace test.tsv in the ORIGINAL folder and generate the K-shot data. Then I run the code using the commands given in the README on the repo page. However, the test eval accuracy is the same as for the original SST-2 test dataset. I don't know what is happening here. To reiterate:

My objective:

  1. Test IMDB on roberta-base, seed 42, SST-2, but train and validate on the original data provided with the repo.

Action:

  1. Replace test.tsv of ORIGINAL SST-2 with IMDB.

Observed Behaviour:

  1. Same test eval accuracy as the original one, as if test.tsv had not been replaced.

Expected Behaviour:

  1. Same test and dev accuracy, different test accuracy.

Request:

  1. Please help :) We changed the original test.tsv and then generated the K-shot data again, but there was no change.

How can I create templates for right to left languages?

I am trying to apply LM-BFF to some sentiment analysis and named entity recognition datasets in Arabic. I am having difficulties creating the templates, as the order in which I write the template is not what is output when I run the model.

12/17/2023 21:53:54 - INFO - __main__ -   | *cls**sent_0*كانت**mask*.*sep+* => *cls**sent_0*كانت**mask*.*sep+**sent_1*كانت**label_0*.*sep+**sent_2*كانت**label_1*.*sep+*

According to the template I wrote, the output should be [CLS] sent_0 كانت [MASK]. [SEP] sent_1 كانت [MASK]. [SEP], but the returned output is [CLS] كانت sent_0 [MASK]. [SEP] sent_1 كانت [MASK]. [SEP]. For some reason the first part of the template gets switched; I don't know if it is because I am using a right-to-left language or if my template is wrong.

Question about the regression problem

Hi, I have a question about the regression method. In Section 4.2 of your paper, why do you use a KL-divergence loss between p(y_u | x_in) and the scaled score (y − v_l)/(v_u − v_l) (loss = loss_fct(logits.view(-1, 2), labels)) rather than a cross-entropy loss? The logits and labels here are both probability distributions over the two polarities.
Thanks in advance!

Some detailed questions about the paper and code

1. From the paper it seems that you first search for the label mapping and then for the template, but searching for the label mapping also requires a template. Is that template manually specified at that stage?
2. How many GPUs should the code be run on? It seems that the total batch size in the current code grows with the number of GPUs, which changes the training process.
3. Your README gives a command for SST-2 after the sentence starting with "Our final results (LM-BFF)", but that command does not seem to use the label mapping you searched for, only the template; the mapping is just the manual one. Is that how it is meant to be?

Training sample template conversion

Hello,

first of all, thanks for your work and for publishing the code. I'm curious about one thing:
If I understand your paper correctly, you use the prompts with demonstrations also while training, i.e., I expected to see the training samples also augmented by demonstrations. However, I cannot find this in the code. In dataset.py you write "If it is not training, we pre-process the data; otherwise, we process the data online." Perhaps I misinterpret this, but then I'd wonder why the training samples would even get the context_indices added with the respective training-sample id left out.

Could you please point me to the location in the code where you build the templates for the training samples as you do for the dev and test samples? Thanks a lot!

num_k parameter is not used

run.py accepts parameter num_k, which should control how many training examples per class we have. However, it's not used anywhere in the code. It seems that num_sample is used instead.

What is the difference between --type prompt & --type prompt-demo?

Regarding the few-shot types prompt and prompt-demo: I know that prompt-demo includes demonstrations for training and inference. But I have a very basic question: how is prompt-based fine-tuning trained on RoBERTa if I choose only prompt as the few-shot type?

Does the --type prompt method train on the k shots one at a time?
Please shed some light on this query; I am getting good results but I am unaware of how the training happens.
Thanks.

Problem with the requirements file

I have been trying to install the requirements file in a virtual environment, and I am encountering a problem with dependencies. It says there may be an issue with the numpy version. Perhaps the Python version you used is the key. Is there any way to fix it?
