zhuchen03 / freelb

Adversarial Training for Natural Language Understanding

Introduction

This repository contains the implementation of FreeLB on GLUE tasks, based on both the fairseq and HuggingFace transformers libraries, under ./fairseq-RoBERTa/ and ./huggingface-transformers/ respectively. The fairseq version also includes our implementations of vanilla PGD, FreeAT and YOPO.

FreeLB is an adversarial training approach for improving transformer-based language models on Natural Language Understanding tasks. It accumulates the parameter gradients over the ascent steps and updates the parameters with the accumulated gradients, which is approximately equivalent to enlarging the batch size with diversified adversarial examples within different radii around the clean input. FreeLB improves the performance of BERT and RoBERTa on various Natural Language Understanding tasks, including Question Answering, Natural Language Inference, and Sentiment Analysis.
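The accumulate-then-update idea described above can be illustrated on a toy problem. The following is a minimal sketch, not the code in this repository: it assumes a linear model with squared loss in place of a transformer, and the names freelb_step, adv_lr, init_mag and max_norm are hypothetical, chosen only for this illustration.

```python
import math
import random


def freelb_step(w, x, y, K=3, adv_lr=0.1, init_mag=0.05, max_norm=0.2,
                lr=0.1, rng=None):
    """One FreeLB-style update on a toy linear model (illustration only).

    Toy loss: 0.5 * (w . (x + delta) - y) ** 2, where x stands in for the
    input embeddings and delta is the adversarial perturbation.
    """
    rng = rng or random.Random(0)
    # Random start for the perturbation.
    delta = [rng.uniform(-init_mag, init_mag) for _ in x]
    grad_w_acc = [0.0] * len(w)
    for _ in range(K):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        err = sum(wi * xi for wi, xi in zip(w, x_adv)) - y
        # Accumulate the parameter gradient, averaged over the K ascent steps.
        for i, xi in enumerate(x_adv):
            grad_w_acc[i] += err * xi / K
        # Gradient ascent on delta using the L2-normalized gradient.
        grad_delta = [err * wi for wi in w]
        g_norm = math.sqrt(sum(g * g for g in grad_delta))
        if g_norm > 0:
            delta = [d + adv_lr * g / g_norm for d, g in zip(delta, grad_delta)]
        # Project delta back into the L2 ball of radius max_norm.
        d_norm = math.sqrt(sum(d * d for d in delta))
        if d_norm > max_norm:
            delta = [d * max_norm / d_norm for d in delta]
    # Single descent step on the parameters with the accumulated gradient.
    new_w = [wi - lr * gi for wi, gi in zip(w, grad_w_acc)]
    return new_w, delta
```

Calling freelb_step repeatedly on the same (x, y) pair trains w on K perturbed copies of the input per update, which is the sense in which the accumulated gradient acts like a larger, more diverse batch.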

For technical details and additional experimental results, please refer to our paper:

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced Adversarial Training for Language Understanding. In ICLR, 2020.

What's New

Feb 15, 2020: Initial release of FreeLB based on fairseq and HuggingFace's transformers. The former contains our implementations of FreeLB, FreeAT and YOPO for RoBERTa, while the latter is FreeLB for ALBERT.
May 16, 2020: Hyperparameters for ALBERT are now available at huggingface-transformers/launch/run_glue.sh.

Prerequisites

The code is compatible with PyTorch 1.4.0. In addition, run the following commands in order to install the prerequisites for fairseq and HuggingFace's transformers:

# Install apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

# Configure fairseq
cd ../fairseq-RoBERTa
pip install --editable .

# Download and pre-process GLUE data
wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
python download_glue_data.py --data_dir glue_data --tasks all
source ./examples/roberta/preprocess_GLUE_tasks.sh glue_data ALL

cd ../huggingface-transformers
pip install --editable .
mkdir logs

Launch

The launch scripts are under ./fairseq-RoBERTa/launch/ and ./huggingface-transformers/launch/, where we have included most of the running scripts for RoBERTa and ALBERT on GLUE dev sets. We will release more details in the future.

freelb's Issues

Has anyone hit a NaN error during the final epochs of training?

First, thanks for your wonderful work.

Has anyone hit a NaN error during the final epochs of training?

I integrated FreeLB into my network as a plugin (without handling dropout_mask):
freelb.attack()
freelb.update()
but I run into a NaN error. The backbone is BERT.

In the early epochs everything works well: the loss converges and accuracy improves.
Once the loss has converged to a very small value, Apex (fp16) reduces the loss scale to a very small value, about 1e-100, and then the NaN error appears.
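One way to make a collapsing loss scale visible earlier is to put a floor under it. The sketch below is a generic dynamic loss scaler written in plain Python to show the idea; it is an assumption-laden illustration, not Apex's implementation, and the class name and parameters (DynamicLossScaler, min_scale, window) are hypothetical.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaler with a lower bound on the scale."""

    def __init__(self, init_scale=2.0 ** 15, factor=2.0, window=1000,
                 min_scale=1.0):
        self.scale = init_scale
        self.factor = factor
        self.window = window
        self.min_scale = min_scale
        self.good_steps = 0

    def update(self, grads):
        """Return True if the optimizer step should run, False to skip it."""
        overflow = any(not math.isfinite(g) for g in grads)
        if overflow:
            # Shrink the scale, but never below the floor.
            self.scale = max(self.scale / self.factor, self.min_scale)
            self.good_steps = 0
            return False
        self.good_steps += 1
        if self.good_steps >= self.window:
            # Grow the scale again after a run of finite gradients.
            self.scale *= self.factor
            self.good_steps = 0
        return True
```

With a floor in place, a run that keeps producing non-finite gradients stays pinned at min_scale and keeps skipping steps, which points to NaNs coming from the model itself rather than from fp16 overflow.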

NaN encountered if FreeLB is used at the beginning of the fine-tuning stage

Based on https://github.com/zhuchen03/FreeLB/blob/master/fairseq-RoBERTa/fairseq/tasks/sentence_prediction.py#L103, I implemented FreeLB at the fine-tuning stage for a GLM model. I have four questions.

First, how do I get <input_mask> for a GLM model? Is it right that <input_mask> should be 0 at every padding-token position? Do I need to set any other positions to 0 based on <input_ids>? This is not discussed in the paper.

Second, if I set <adv_begin_iter> to -1, optimization gets stuck with NaN values. But if I set <adv_begin_iter> to 20 or larger, the NaN issue disappears. Did you encounter the same issue in your experiments? Are there other ways to fix the NaN problem?

Third, I found that you don't use <adv_begin_iter> in your BERT model (https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L224). Does this mean bert-base is more stable than RoBERTa? Or does <adv_begin_iter> differ between models?

Finally, where can I find the code for 'when adversarial training meets dropout' from the paper?

Looking forward to your response. Thanks!
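The first two questions can be made concrete with a small sketch. It assumes that the mask is 1 for real tokens and 0 for padding, and that adversarial training is switched on once the update counter exceeds adv_begin_iter; both are assumptions for illustration and may not match the repository's code. The helper names are hypothetical.

```python
def build_input_mask(input_ids, pad_id):
    """Return 1.0 for real tokens and 0.0 for padding positions."""
    return [[0.0 if tok == pad_id else 1.0 for tok in seq] for seq in input_ids]


def mask_delta(delta, input_mask):
    """Zero the perturbation on padded positions.

    delta is indexed as [batch][position][embedding_dim].
    """
    return [
        [[d * m for d in vec] for vec, m in zip(seq, mask)]
        for seq, mask in zip(delta, input_mask)
    ]


def use_adversarial(num_updates, adv_begin_iter):
    """Assumed gate: perturb only after adv_begin_iter updates have passed."""
    return num_updates > adv_begin_iter
```

Under this assumed gate, adv_begin_iter = -1 enables the perturbation from the very first update, while a value of 20 gives the model 20 clean warm-up updates first.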

Errors generated during data preprocessing

Hi,

When I run the command source ./examples/roberta/preprocess_GLUE_tasks.sh glue_data ALL, I get the following error:

  File "/root/FreeLB/fairseq-RoBERTa/fairseq_cli/preprocess.py", line 2
    ../preprocess.py
    ^
SyntaxError: invalid syntax

https://github.com/zhuchen03/FreeLB/blob/master/fairseq-RoBERTa/fairseq_cli/preprocess.py#L1

I don't understand why this script is written like this (../preprocess.py).

After the data processing finished, however, I could run run_glue.sh successfully.

So what is the purpose of this script?

I hope to get your answer, thanks.

Some confusion about the detach operation and embeds_init

Hi, first of all, thank you for your excellent work.

I noticed that FreeLB uses many detach operations, on delta.grad, delta_norm, and delta. I understand that detach separates a variable from the current graph and stops the gradient from propagating further. If the goal is to keep gradients from flowing to the parameters through the computation of the final delta, a single detach on line 290 seems to be enough. I am not sure whether this is correct. (https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L290)
In addition, is there any difference between denorm and delta_norm? Why is delta_norm detached while denorm is not? I also found that many researchers detach delta.grad. Is there anything special about grad?

So I would like to know the purpose of the following five detaches, what role each plays, and whether some of them can be removed.

FreeLB-RoBERTa within HuggingFace's transformers?

Can you provide FreeLB-RoBERTa within HuggingFace's transformers?

'AlbertForSequenceClassification' object has no attribute 'encoder'

Hi, when I used the example from huggingface-transformers/examples/run_glue_freelb.py
I got this error: 'AlbertForSequenceClassification' object has no attribute 'encoder'
It seems that the following code

if isinstance(model, torch.nn.DataParallel):
    embeds_init = model.module.encoder.embeddings.word_embeddings(batch[0])
else:
    embeds_init = model.encoder.embeddings.word_embeddings(batch[0])

does not work!
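A possible workaround is to avoid hard-coding the attribute path to the embedding layer. The sketch below assumes the installed transformers version exposes get_input_embeddings() on its models and that a DataParallel wrapper stores the wrapped model in .module; the helper name get_word_embeddings is hypothetical.

```python
def get_word_embeddings(model):
    """Return the input embedding layer, unwrapping DataParallel if needed."""
    # torch.nn.DataParallel keeps the wrapped model in `.module`;
    # a plain model has no such attribute and is used as-is.
    base = getattr(model, "module", model)
    # Assumption: the model provides get_input_embeddings().
    return base.get_input_embeddings()
```

It would then be used as embeds_init = get_word_embeddings(model)(batch[0]), independent of whether the model nests its embeddings under encoder, albert, or another attribute.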

the "dp_mask" problem

I used the code for adversarial training and ran into the following problem:
forward() got an unexpected keyword argument 'dp_masks'
I wonder whether you changed the internals of the pretrained model, or something else?
How can I fix it?
Thanks!
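Before launching a run, it can help to check whether the model class that was actually imported accepts the extra keyword. This is a small diagnostic sketch using only the standard library; forward_accepts is a hypothetical helper, and the check says nothing about how dp_masks is used inside the model.

```python
import inspect


def forward_accepts(model, name):
    """Return True if model.forward takes `name` as a keyword argument."""
    params = inspect.signature(model.forward).parameters
    # A **kwargs parameter accepts any keyword.
    takes_any_kwarg = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    return name in params or takes_any_kwarg
```

If forward_accepts(model, "dp_masks") is False, the model class in use does not define that argument, which is consistent with the error message above.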

API Key

It looks like your Comet API key made its way into your commits; assuming that this was unintentional, it's probably best to revoke the key and update the code to take this as an input.

Cheers!

Is it still working with update_freq > 1?

In the fairseq implementation, the "update_freq" option (from the original fairseq code) specifies how often the optimizer updates the model parameters. When update_freq > 1, fairseq accumulates gradients and holds off gradient synchronization until the last step. In adversarial training, is gradient synchronization needed during the gradient computation? If so, does setting update_freq > 1 make the computation incorrect?

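Invariance of the word embedding space

Hello authors,
How should we understand the claim that FreeLB improves invariance in the word embedding space?
(This refers to the following sentence in the Introduction of the paper: We observe improved invariance in the embedding space for models trained with FreeLB, which is positively correlated with generalization.)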

Reproducing results from the paper with roberta using fairseq

Hi! Thanks for this repository.

I've been trying to reproduce the results from the paper but ran into some problems. I tried the script in fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh which I would've expected to score 88.13 on average as shown in Table 1 in the paper.

I tried:

five seeds with the setup currently checked into the repo:

# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     1  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     2  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     3  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     4  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     9016  1.4e-1

and got the scores 0.8597, 0.8884, 0.8057, 0.8669, 0.8633 (mean 0.8568). logs here.

five seeds with the parameters from Table 7 in the paper, using the default fairseq parameters (from https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md#3-fine-tuning-on-glue-task) for parameters which are not specified:

# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     1  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     2  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     3  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     4  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     5  0

and got the scores 0.8741, 0.7949, 0.8417, 0.6330, 0.6007 (mean 0.7488). logs here.
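The per-seed scores quoted in this issue can be summarized with a few lines of Python. The sketch below only recomputes the mean and adds the sample standard deviation from the numbers listed above; the variable names are chosen for illustration.

```python
from statistics import mean, stdev

# Scores copied from the two five-seed runs reported in this issue.
repo_setup = [0.8597, 0.8884, 0.8057, 0.8669, 0.8633]
table7_setup = [0.8741, 0.7949, 0.8417, 0.6330, 0.6007]


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of scores."""
    return mean(scores), stdev(scores)


for name, scores in [("repo setup", repo_setup), ("Table 7 setup", table7_setup)]:
    m, s = summarize(scores)
    print(f"{name}: mean={m:.4f} std={s:.4f}")
```

Reporting the spread next to the mean makes it easier to see how much of the gap to the paper's number could be explained by seed variance.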

Appreciate any help!

Could you add some comments in the code?

The code is hard to understand, including the bash scripts.

Having issues with training RoBERTa. Loss not decreasing

Hi! Thanks for your great repo.

I tried the script in fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh and used the same setting as in Issue #11.

# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      0        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         123     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         456     1.4e-1

But the best scores I got were 0.5152 and 0.5152. This is the log. It seems that the training loss does not decrease.
My environment is python 3.6.9, torch 1.6.0, torchvision 0.7.0 and cuda 10.2.
This is really confusing. I would appreciate your help!

Doesn't FreeLB use the original training samples?

In this code, if adv_init_mag > 0, is the model trained only on adversarial examples?
I ran an experiment on SST-2 using albert-base-v2 with the hyper-parameters in this shell script.

Experiment                   Result
No FreeLB                    93.00
FreeLB                       91.86
FreeLB with original data    93.46

For FreeLB with original data, I added this code before this line

# Extra forward/backward pass on the clean (unperturbed) embeddings
inputs['inputs_embeds'] = embeds_init
inputs['dp_masks'] = dp_masks
outputs, dp_masks = model(**inputs)
loss = outputs[0]
if args.n_gpu > 1:
    loss = loss.mean()  # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
    loss = loss / args.gradient_accumulation_steps
tr_loss += loss.item()
# Backpropagate so the clean-sample gradients are accumulated as well
if args.fp16:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    loss.backward()

(Maybe FreeLB's hyper-parameters for albert-base are very different from those for albert-xxlarge?)

A few questions about FreeLB and dropout

Thanks for your valuable work on FreeLB, which indeed improves my models' performance.
I have a question about the relationship between dropout and adversarial training. Section 3.3 of the paper (https://arxiv.org/pdf/1909.11764.pdf) suggests using the same dropout mask in each forward-backward step. However, I can't find this in the code (or maybe I missed it?). Does using the same mask affect the results? In my own experiments I ignored the same-mask suggestion and still saw improved performance.
Many thanks!
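For reference, the same-mask idea from Section 3.3 can be sketched independently of any model code: sample the dropout mask once per mini-batch and reuse it in every ascent step. The following is a plain-Python illustration under that reading of the paper; the helper names are hypothetical and it is not taken from this repository.

```python
import random


def sample_dropout_mask(size, p, rng):
    """Inverted-dropout mask: 0 with probability p, otherwise 1 / (1 - p)."""
    keep = 1.0 - p
    return [0.0 if rng.random() < p else 1.0 / keep for _ in range(size)]


def apply_mask(values, mask):
    """Element-wise product of activations and a fixed dropout mask."""
    return [v * m for v, m in zip(values, mask)]


def forward_k_steps(activations_per_step, p, rng):
    """Apply one shared mask to the activations of every ascent step."""
    # Sampled once, before the loop, so all K steps see the same mask.
    mask = sample_dropout_mask(len(activations_per_step[0]), p, rng)
    return [apply_mask(acts, mask) for acts in activations_per_step], mask
```

Sampling inside the loop instead would give each ascent step a different sub-network, which is the behaviour the same-mask suggestion is meant to avoid.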

Would you please release the hyper-parameters for FreeLB based on ALBERT (HuggingFace)?

This file contains hyper-parameters for only 4 tasks. Would you please release the others?

ImportError: cannot import name 'glue_criterion_metrics' from 'transformers'

I installed transformers from pip (version 2.8.0) and it reported the following error:

ImportError: cannot import name 'glue_criterion_metrics' from 'transformers'

I have worked around it by defining such a function manually in the code. But why does this error occur? Thanks!

Regarding the release of FreeLB ^_^

Hello, friend!
Could you please provide a TensorFlow version of FreeLB?
If possible, that would be great!