zhuchen03 / FreeLB
Adversarial Training for Natural Language Understanding
Hi, first of all, thank you for your excellent work.
I noticed that FreeLB uses a lot of detach operations: it detaches delta.grad, delta_norm, and delta. I understand that detach separates a variable from the current graph and can be used to stop gradients from propagating further. If the goal is only to prevent gradient computation for the parameters involved in producing the final delta, it feels like the single detach at line 290 should be enough (https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L290). I am not sure whether this is correct.
In addition, is there any difference between denorm and delta_norm? Why does delta_norm get a detach while denorm does not? I also found that many researchers detach delta.grad; is there anything special about the grad tensor?
So I would like to know the purpose of the following five detach calls, what role each one plays, and whether some of them can be removed (see the sketch after the links below).
Related code:
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L239
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L278
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L284
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L286
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L290
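For context, here is a minimal, self-contained sketch of a FreeLB-style inner ascent loop with five detach sites marked, analogous to the ones linked above. The toy model, the data, and the names adv_steps, adv_lr, adv_init_mag, and adv_max_norm are illustrative assumptions, not the repository's exact code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
adv_steps, adv_lr, adv_init_mag, adv_max_norm = 3, 1e-1, 1e-1, 3e-1
embedding = nn.Embedding(100, 16)            # stand-in for the word embeddings
classifier = nn.Linear(16, 2)                # stand-in for the rest of the model
input_ids = torch.randint(0, 100, (4, 10))
labels = torch.randint(0, 2, (4,))

embeds_init = embedding(input_ids)
# (1) initialize delta outside the autograd graph
delta = torch.zeros_like(embeds_init).uniform_(-adv_init_mag, adv_init_mag).detach()

for astep in range(adv_steps):
    delta.requires_grad_()                   # leaf tensor -> receives .grad on backward
    logits = classifier((embeds_init + delta).mean(dim=1))
    loss = nn.functional.cross_entropy(logits, labels) / adv_steps
    loss.backward()                          # accumulates grads on the model params and on delta
    if astep == adv_steps - 1:
        break
    # (2) treat the ascent direction as a constant
    delta_grad = delta.grad.clone().detach()
    denorm = torch.clamp(delta_grad.view(delta_grad.size(0), -1).norm(dim=1), min=1e-8).view(-1, 1, 1)
    # (3) the updated delta must again be a graph-free leaf
    delta = (delta + adv_lr * delta_grad / denorm).detach()
    if adv_max_norm > 0:
        # (4) the norm is only used for the projection, so no gradient is needed
        delta_norm = delta.view(delta.size(0), -1).norm(p=2, dim=1).detach()
        exceed = (delta_norm > adv_max_norm).to(delta)
        reweight = (adv_max_norm / delta_norm * exceed + (1 - exceed)).view(-1, 1, 1)
        # (5) keep the projected delta detached as well
        delta = (delta * reweight).detach()
    embeds_init = embedding(input_ids)       # rebuild embeddings so the next step has a fresh graph
```

In this sketch the detach calls keep delta, its gradient, and the projection quantities out of the graph, so each ascent step starts from a fresh leaf tensor; which of the five are strictly necessary in the repository's exact code is precisely the question above.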
Based on https://github.com/zhuchen03/FreeLB/blob/master/fairseq-RoBERTa/fairseq/tasks/sentence_prediction.py#L103, I implemented FreeLB at the fine-tuning stage for a GLM model. I have four questions.
First, how should I obtain <input_mask> for the GLM model? Is it right that all padding-token positions should be 0 in <input_mask>? Do I need to set any other positions to 0 based on <input_ids>? This is not discussed in the paper (see the sketch after this post).
Second, if I set <adv_begin_iter> to -1, the optimization gets stuck with NaN losses, but if I set <adv_begin_iter> to 20 or a larger value, the NaN issue disappears. Did you encounter the same issue in your experiments? Are there other ways to fix the NaN problem?
Third, I found that you didn't use <adv_begin_iter> in your BERT model (https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L224). Does this mean bert-base is more stable than RoBERTa, or does <adv_begin_iter> differ between models?
Finally, where can I find the code implementation of 'when adversarial training meets dropout' from the paper?
Looking forward to your response. Thanks!
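On the first question, here is a hedged sketch of one way to build <input_mask> so that padding positions receive no perturbation, following the same pattern as the l2 initialization in run_glue_freelb.py; the pad_token_id argument and anything GLM-specific are my assumptions, not something confirmed by the repository or the paper.

```python
import torch

def build_input_mask(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    # 1.0 for real tokens, 0.0 for padding (the same convention as an attention mask)
    return input_ids.ne(pad_token_id).float()

def init_delta(embeds_init: torch.Tensor, input_mask: torch.Tensor, adv_init_mag: float) -> torch.Tensor:
    # Uniform initialization, zeroed on padding positions and rescaled per example
    # so that the initial perturbation norm stays within adv_init_mag.
    input_lengths = input_mask.sum(dim=1)
    delta = torch.zeros_like(embeds_init).uniform_(-1, 1) * input_mask.unsqueeze(2)
    dims = input_lengths * embeds_init.size(-1)
    mag = adv_init_mag / torch.sqrt(dims)
    return (delta * mag.view(-1, 1, 1)).detach()
```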
Hello,
How should I understand the claim that FreeLB improves the invariance of the word embedding space?
(This sentence from the Introduction of the paper: "We observe improved invariance in the embedding space for models trained with FreeLB, which is positively correlated with generalization.")
Hi! Thanks for your great repo.
I tried the script in fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh and used the same settings as in Issue #11.
# run_exp GPU TOTAL_NUM_UPDATES WARMUP_UPDATES LR NUM_CLASSES MAX_SENTENCES FREQ DATA ADV_LR ADV_STEP INIT_MAG SEED MNORM
run_exp 0 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 123 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 3e-2 3 1.6e-1 456 1.4e-1
But I got best scores of 0.5152 and 0.5152. This is the log; it seems that the training loss does not decrease.
My environment is Python 3.6.9, torch 1.6.0, torchvision 0.7.0, and CUDA 10.2.
It's really confusing. I'd appreciate your help!
Hello, friend!
Could you please provide a TensorFlow version of FreeLB?
If possible, that would be great!
In the fairseq implementation, the "update_freq" option (from the original fairseq code) specifies how often the optimizer updates the model parameters. When update_freq > 1, gradients are accumulated and gradient synchronization is halted until the last step. In adversarial training, is gradient synchronization needed during gradient computation? If so, does setting update_freq > 1 make the computation incorrect?
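For reference, a generic sketch (not fairseq's actual trainer code) of what update_freq-style accumulation does with DistributedDataParallel: gradients from every micro-batch accumulate locally, and model.no_sync() defers the all-reduce so synchronization happens only on the last backward pass before optimizer.step(). The function and variable names are illustrative.

```python
import contextlib
import torch

def accumulate_then_step(model, optimizer, micro_batches, loss_fn):
    # Accumulate gradients over the micro-batches, all-reducing only on the last backward.
    optimizer.zero_grad()
    last = len(micro_batches) - 1
    for i, (inputs, labels) in enumerate(micro_batches):
        defer_sync = (i != last) and hasattr(model, "no_sync")  # DDP exposes no_sync()
        ctx = model.no_sync() if defer_sync else contextlib.nullcontext()
        with ctx:
            loss = loss_fn(model(inputs), labels) / len(micro_batches)
            loss.backward()  # gradients accumulate locally in .grad
    optimizer.step()
```

Since the summed gradient is the same whether it is all-reduced once at the end or after every micro-batch, deferring synchronization by itself should not change the final update; whether FreeLB's ascent steps need anything beyond this is a question for the authors.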
Thanks for your valuable work "FreeLB", which indeed improves my models' performance.
A question arises about the relationship between dropout and adversarial training: Section 3.3 of the paper (https://arxiv.org/pdf/1909.11764.pdf) suggests using the same dropout mask in each forward-backward step. However, I can't find such an implementation in the code (or maybe I missed it?). Does using the same mask affect the results? In my own experiments I ignored the same-mask suggestion, but performance still improved.
Many thanks!
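For illustration, a generic functional-dropout sketch of the "same mask across ascent steps" idea from Section 3.3; this is not the repository's dp_masks implementation, and the function name is mine.

```python
import torch

def dropout_with_mask(x, p=0.1, mask=None, training=True):
    # Inverted dropout that can reuse a previously sampled mask.
    if not training or p == 0.0:
        return x, mask
    if mask is None:
        mask = (torch.rand_like(x) > p).to(x.dtype) / (1.0 - p)
    return x * mask, mask

x = torch.randn(2, 4)
y1, m = dropout_with_mask(x, p=0.1)          # first ascent step: sample a fresh mask
y2, _ = dropout_with_mask(x, p=0.1, mask=m)  # later ascent steps: reuse the same mask
```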
There are hyper-parameters for only 4 tasks in this file; could you please release the others?
First, thanks for your wonderful work.
Has anyone met a NaN error during the final training epochs?
I embedded FreeLB into my network as a plugin (without handling the dropout mask):
freelb.attack()
freelb.update()
But I face a NaN error. The backbone is BERT.
In the early epochs everything works well: the loss converges and the accuracy improves.
Once the loss converges to a very small value, apex (fp16) scales the loss down to an extremely small value, around 1e-100, and then the NaN error appears.
I installed transformers from pip (version 2.8.0) and it reported this error:
ImportError: cannot import name 'glue_criterion_metrics' from 'transformers'
I have worked around it by defining the function manually in my code, but why does this error occur? Thanks!
Hi, when I used the example huggingface-transformers/examples/run_glue_freelb.py,
I got this error: 'AlbertForSequenceClassification' object has no attribute 'encoder'.
It seems that this code
if isinstance(model, torch.nn.DataParallel):
embeds_init = model.module.encoder.embeddings.word_embeddings(batch[0])
else:
embeds_init = model.encoder.embeddings.word_embeddings(batch[0])
does not work!
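One possible workaround, sketched under the assumption that you are running a stock transformers install rather than the copy bundled under huggingface-transformers/ in this repo: resolve the embedding layer through the standard HuggingFace get_input_embeddings() accessor instead of hard-coding model.encoder, which ALBERT's classification model does not expose.

```python
import torch

def lookup_word_embeddings(model, input_ids):
    # Resolve the word-embedding layer generically; get_input_embeddings() is the
    # standard HuggingFace PreTrainedModel accessor and works for ALBERT too.
    base_model = model.module if isinstance(model, torch.nn.DataParallel) else model
    return base_model.get_input_embeddings()(input_ids)

# Drop-in replacement for the snippet above:
# embeds_init = lookup_word_embeddings(model, batch[0])
```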
I used the code for adversarial training and ran into the following problem:
forward() got an unexpected keyword argument 'dp_masks'
I wonder if you changed the internals of the pretrained model, or something else.
How can I fix it?
Thanks!
Hi,
When I execute the command source ./examples/roberta/preprocess_GLUE_tasks.sh glue_data ALL, the following error is generated.
File "/root/FreeLB/fairseq-RoBERTa/fairseq_cli/preprocess.py", line 2
../preprocess.py
^
SyntaxError: invalid syntax
https://github.com/zhuchen03/FreeLB/blob/master/fairseq-RoBERTa/fairseq_cli/preprocess.py#L1
I don't understand why this script is written like this (../preprocess.py).
But after finishing the data processing, I can run run_glue.sh successfully.
So what is the purpose of this script?
Hope to get your answer, thanks.
Can you provide FreeLB-RoBERTa in HuggingFace's transformers?
The code, including the bash shell scripts, is hard to understand.
Hi! Thanks for this repository.
I've been trying to reproduce the results from the paper but ran into some problems. I tried the script in fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh
which I would've expected to score 88.13 on average as shown in Table 1 in the paper.
I tried:
# run_exp GPU TOTAL_NUM_UPDATES WARMUP_UPDATES LR NUM_CLASSES MAX_SENTENCES FREQ DATA ADV_LR ADV_STEP INIT_MAG SEED MNORM
run_exp 1 2036 122 1e-5 2 2 8 RTE 6e-2 3 1.6e-1 1 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 6e-2 3 1.6e-1 2 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 6e-2 3 1.6e-1 3 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 6e-2 3 1.6e-1 4 1.4e-1
run_exp 1 2036 122 1e-5 2 2 8 RTE 6e-2 3 1.6e-1 9016 1.4e-1
and got the scores 0.8597, 0.8884, 0.8057, 0.8669, 0.8633
(mean 0.8568). logs here.
# run_exp GPU TOTAL_NUM_UPDATES WARMUP_UPDATES LR NUM_CLASSES MAX_SENTENCES FREQ DATA ADV_LR ADV_STEP INIT_MAG SEED MNORM
run_exp 1 2036 122 2e-5 2 2 8 RTE 3e-2 3 1.5e-1 1 0
run_exp 1 2036 122 2e-5 2 2 8 RTE 3e-2 3 1.5e-1 2 0
run_exp 1 2036 122 2e-5 2 2 8 RTE 3e-2 3 1.5e-1 3 0
run_exp 1 2036 122 2e-5 2 2 8 RTE 3e-2 3 1.5e-1 4 0
run_exp 1 2036 122 2e-5 2 2 8 RTE 3e-2 3 1.5e-1 5 0
and got the scores 0.8741, 0.7949, 0.8417, 0.6330, 0.6007
(mean 0.7488). logs here.
Appreciate any help!
It looks like your Comet API key made its way into your commits; assuming that this was unintentional, it's probably best to revoke the key and update the code to take this as an input.
Cheers!
In this code, if adv_init_mag > 0, will the model only be trained on adversarial examples?
I ran an experiment on SST-2 using albert-base-v2 with the hyper-parameters in this shell script.
experiment | result |
---|---|
No FreeLB | 93.00 |
FreeLB | 91.86 |
FreeLB with original data | 93.46 |
For FreeLB with original data, I added the following code before this line:
# Extra forward/backward pass on the clean (unperturbed) embeddings, run before the adversarial loop
inputs['inputs_embeds'] = embeds_init
inputs['dp_masks'] = dp_masks
outputs, dp_masks = model(**inputs)
loss = outputs[0]
if args.n_gpu > 1:
    loss = loss.mean()  # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
    loss = loss / args.gradient_accumulation_steps
tr_loss += loss.item()
if args.fp16:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    loss.backward()
(Maybe FreeLB's hyper-parameters for albert-base are very different from those for albert-xxlarge?)