freelb's People

Contributors

dependabot[bot], zhuchen03

freelb's Issues

Some confusion about the detach operation and embeds_init

Hi, first of all, thank you for your excellent work.

I noticed that FreeLB calls detach many times: on delta.grad, delta_norm, and delta. I understand that detach separates a tensor from the current graph and stops gradients from propagating further. If the goal is only to keep gradients from flowing back through the computation of the final delta, then a single detach on line 290 seems sufficient. I am not sure whether this is correct. (https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L290)
In addition, is there any difference between denorm and delta_norm? Why is delta_norm detached while denorm is not? I have also seen many researchers detach delta.grad. Is there anything special about grad?

So I would like to know the purpose of each of the following five detach calls, what role each one plays, and whether any of them can be removed.

Related code:
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L239
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L278
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L284
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L286
https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L290
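
For readers following the question, here is a minimal sketch of the arithmetic that such detach calls usually surround: a normalized gradient-ascent step on the perturbation followed by an L2 projection. It is written with plain Python lists, so every value is a constant and nothing is tracked for differentiation. It assumes that denorm is the gradient norm used to normalize the step and that delta_norm is the perturbation norm used for the projection; that reading comes from the variable names, not from the repository code. The function name and arguments are hypothetical.

```python
import math


def ascent_and_project(delta, grad, adv_lr, max_norm, eps=1e-8):
    """One normalized ascent step on a flat perturbation, then an L2 projection.

    delta and grad are plain lists of floats (hypothetical stand-ins for the
    flattened perturbation and its gradient for a single example).
    """
    # Normalize the step by the gradient norm (assumed role of `denorm`).
    denorm = max(math.sqrt(sum(g * g for g in grad)), eps)
    stepped = [d + adv_lr * g / denorm for d, g in zip(delta, grad)]

    # Project back onto the L2 ball of radius max_norm (assumed role of `delta_norm`).
    delta_norm = math.sqrt(sum(d * d for d in stepped))
    if max_norm > 0 and delta_norm > max_norm:
        stepped = [d * max_norm / delta_norm for d in stepped]
    return stepped


new_delta = ascent_and_project([0.0, 0.0], [3.0, 4.0], adv_lr=0.5, max_norm=0.3)
print(new_delta)
```

In an autograd framework, each intermediate value above would be part of a graph unless it is detached or computed without gradient tracking, which is the context for asking which detach calls are redundant.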

NaN encountered if FreeLB is used at the beginning of the fine-tuning stage

Based on https://github.com/zhuchen03/FreeLB/blob/master/fairseq-RoBERTa/fairseq/tasks/sentence_prediction.py#L103, I implemented FreeLB at the fine-tuning stage for the GLM model. I have four questions.

First, how do I get <input_mask> for the GLM model? Is it right that all padding-token positions should be 0 in <input_mask>? Do I need to set any other positions to 0 based on <input_ids>? This question is not discussed in the paper.

Second, if I set <adv_begin_iter> to -1, optimization gets stuck with NaN values. But if I set <adv_begin_iter> to 20 or larger, the NaN issue disappears. Did you encounter the same issue in your experiments? Are there other ways to fix the NaN problem?

Third, I found that you didn't use <adv_begin_iter> in your BERT model (https://github.com/zhuchen03/FreeLB/blob/master/huggingface-transformers/examples/run_glue_freelb.py#L224). Does this mean bert-base is more stable than RoBERTa? Or does <adv_begin_iter> differ between models?

Finally, where can I find the implementation of 'when adversarial training meets dropout' from the paper?

Looking forward to your response. Thanks!
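
As an illustration of the first question only, the sketch below builds a mask that is 0 at padding positions and 1 elsewhere, using plain Python lists. The function name and the pad_id value are hypothetical, and the sketch does not settle whether GLM needs additional positions zeroed.

```python
def build_input_mask(input_ids, pad_id):
    """Return a 0/1 mask with 0 at padding positions.

    input_ids is a list of token-id lists; pad_id is an assumed padding id.
    """
    return [[0 if tok == pad_id else 1 for tok in seq] for seq in input_ids]


batch = [[5, 6, 0, 0], [7, 0, 0, 0]]  # toy ids, with 0 assumed to be padding
mask = build_input_mask(batch, pad_id=0)
print(mask)
```

A mask of this shape is typically multiplied into the perturbation so that padding positions receive no adversarial noise.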

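Invariance of the word embedding space

Hello,
How should I understand the claim that FreeLB improves the invariance of the word embedding space?
(This refers to the following sentence in the Introduction of the paper: We observe improved invariance in the embedding space for models trained with FreeLB, which is positively correlated with generalization.)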

Having issues with training RoBERTa. Loss not decreasing

Hi! Thanks for your great repo.

I tried the script in fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh and used the same settings as in Issue #11.

# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      0        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         123     1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           3e-2      3       1.6e-1         456     1.4e-1

But the best scores I got were 0.5152 and 0.5152. This is the log. It seems that the training loss does not decrease.
My environment is Python 3.6.9, torch 1.6.0, torchvision 0.7.0, and CUDA 10.2.
This is really confusing. I would appreciate your help!

Is it still working with update_freq > 1?

In the fairseq implementation, the "update_freq" setting (from the original fairseq code) specifies how often the optimizer updates the model parameters. When update_freq > 1, gradients are accumulated and gradient synchronization is withheld until the last step. In adversarial training, is gradient synchronization needed when computing the gradient? If so, does setting update_freq > 1 make the computation incorrect?
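
To make the two gradients in this question concrete, here is a toy sketch on a one-parameter model with loss (w*x - y)^2. It only illustrates two general facts: accumulating parameter gradients over micro-batches gives the same sum as one large batch when w is not updated in between, and the gradient with respect to one input depends only on that input and the current w. The model, data, and function names are made up for illustration; the sketch says nothing about how the repository handles update_freq.

```python
def param_grad(w, batch):
    """Sum of d/dw (w*x - y)^2 over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def input_grad(w, x, y):
    """d/dx (w*x - y)^2 for one sample; no other sample is involved."""
    return 2.0 * (w * x - y) * w


w = 0.5
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 0.0), (4.0, 4.0)]

full = param_grad(w, data)
# Accumulate over micro-batches of size 2 without changing w in between.
accumulated = sum(param_grad(w, data[i:i + 2]) for i in range(0, len(data), 2))
print(full, accumulated)
print([input_grad(w, x, y) for x, y in data])
```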

A few questions about FreeLB and dropout

Thanks for your valuable work on FreeLB, which has indeed improved my models' performance.
I have a question about the relationship between dropout and adversarial training. Section 3.3 of the paper (https://arxiv.org/pdf/1909.11764.pdf) suggests using the same dropout mask in every forward-backward step. However, I can't find such an implementation in the code (or maybe I missed it?). Does using the same mask affect the results? In my own experiments I ignored the same-mask suggestion and still saw improved performance.
Thanks very much!
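
The same-mask idea mentioned above can be sketched as follows: sample one dropout mask per batch and reuse it for each of the K ascent steps, instead of resampling on every forward pass. This is a generic illustration with plain Python lists; the helper names, the inverted-dropout scaling, and K = 3 are assumptions, not the repository's implementation.

```python
import random


def sample_dropout_mask(size, p, rng):
    """Inverted-dropout mask: 0 with probability p, otherwise 1 / (1 - p)."""
    keep = 1.0 - p
    return [0.0 if rng.random() < p else 1.0 / keep for _ in range(size)]


def apply_mask(hidden, mask):
    """Elementwise product of activations and a fixed mask."""
    return [h * m for h, m in zip(hidden, mask)]


rng = random.Random(0)
hidden = [1.0, 2.0, 3.0, 4.0]

# Sample once per batch, then reuse for every ascent step (K = 3 here).
mask = sample_dropout_mask(len(hidden), p=0.5, rng=rng)
outputs = [apply_mask(hidden, mask) for _ in range(3)]
print(outputs)
```

Because the mask is fixed, the only thing that changes between the K forward passes is the perturbation itself.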

Has anyone encountered a NaN error during the final epochs of training?

First, thanks for your wonderful work.

Has anyone encountered a NaN error in the final epochs of training?

I integrated FreeLB as a plugin (without handling dropout_mask):
freelb.attack()
freelb.update()
in my network, but I ran into a NaN error. The backbone is BERT.

In the early epochs everything works well: the loss converges and accuracy improves.
Once the loss converges to a very small value, APX (fp16) scales the loss down to a very small scale, about 1e-100, and then the NaN error appears.
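
For context on how a loss scale can shrink that far, here is a toy model of dynamic loss scaling: the scale is halved whenever a gradient is NaN or infinite. It is a generic sketch, not the implementation of any particular mixed-precision library, and the class name and initial scale of 2^16 are assumptions. If every step overflows, the scale keeps halving, so a scale near 1e-100 would be consistent with gradients having been non-finite for several hundred consecutive steps.

```python
import math


class ToyLossScaler:
    """Generic dynamic loss scaling: halve the scale on overflow."""

    def __init__(self, init_scale=2.0 ** 16):
        self.scale = init_scale

    def update(self, grads):
        """Return True and halve the scale if any gradient is non-finite."""
        overflow = any(math.isnan(g) or math.isinf(g) for g in grads)
        if overflow:
            self.scale /= 2.0  # a real trainer would also skip this optimizer step
        return overflow


scaler = ToyLossScaler()
steps = 0
while scaler.scale > 1e-100:
    scaler.update([float("nan")])  # simulate a step whose gradients are NaN
    steps += 1
print(steps, scaler.scale)
```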

'AlbertForSequenceClassification' object has no attribute 'encoder'

Hi, when I ran the example in huggingface-transformers/examples/run_glue_freelb.py,
I got this error: 'AlbertForSequenceClassification' object has no attribute 'encoder'.
It seems that the following code

if isinstance(model, torch.nn.DataParallel):
    embeds_init = model.module.encoder.embeddings.word_embeddings(batch[0])
else:
    embeds_init = model.encoder.embeddings.word_embeddings(batch[0])

does not work!
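
One possible workaround is to look the embedding layer up without hard-coding the attribute path. The sketch below first tries get_input_embeddings(), which upstream Hugging Face transformers models provide, and then falls back to a few attribute paths. The helper name and the list of paths are hypothetical, and the copy of transformers bundled with the repository may differ from upstream, so treat this as a starting point under those assumptions.

```python
from functools import reduce

# Candidate attribute paths; an illustrative, non-exhaustive list.
CANDIDATE_PATHS = (
    "encoder.embeddings.word_embeddings",  # path used in the snippet above
    "albert.embeddings.word_embeddings",   # upstream AlbertForSequenceClassification
    "bert.embeddings.word_embeddings",     # upstream BertForSequenceClassification
)


def find_word_embeddings(model):
    """Return the word-embedding module of a model, unwrapping DataParallel."""
    # torch.nn.DataParallel keeps the wrapped model in `.module`.
    inner = getattr(model, "module", model)

    getter = getattr(inner, "get_input_embeddings", None)
    if callable(getter):
        return getter()

    for path in CANDIDATE_PATHS:
        try:
            return reduce(getattr, path.split("."), inner)
        except AttributeError:
            continue
    raise AttributeError("no known word-embedding path on " + type(inner).__name__)
```

With such a helper, the lookup could be written as embeds_init = find_word_embeddings(model)(batch[0]), assuming batch[0] holds the input ids as in the original snippet.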

the "dp_mask" problem

I used the code for adversarial training and ran into the following problem:
forward() got an unexpected keyword argument 'dp_masks'
Did you change the internals of the pretrained model, or something else?
How can I fix it?
Thanks!
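
One plausible cause, stated here as an assumption, is that the model class being called comes from a stock transformers installation whose forward() does not take dp_masks, rather than from a modified copy. A quick way to check which forward() is actually in use is to inspect its signature. The helper name below is hypothetical; inspect.signature is part of the Python standard library.

```python
import inspect


def accepts_keyword(fn, name):
    """Return True if fn can be called with the keyword argument `name`."""
    params = inspect.signature(fn).parameters
    if name in params:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


def stock_forward(input_ids=None, attention_mask=None):
    """Stand-in for an unmodified forward()."""


def patched_forward(input_ids=None, attention_mask=None, dp_masks=None):
    """Stand-in for a forward() that accepts dp_masks."""


print(accepts_keyword(stock_forward, "dp_masks"))
print(accepts_keyword(patched_forward, "dp_masks"))
```

Calling accepts_keyword(model.forward, "dp_masks") on the real model, together with printing the module's __file__, would show which package the class was imported from.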

Errors generated during data preprocessing

Hi,

When I execute the command source ./examples/roberta/preprocess_GLUE_tasks.sh glue_data ALL, the following error is generated.

  File "/root/FreeLB/fairseq-RoBERTa/fairseq_cli/preprocess.py", line 2
    ../preprocess.py
    ^
SyntaxError: invalid syntax

https://github.com/zhuchen03/FreeLB/blob/master/fairseq-RoBERTa/fairseq_cli/preprocess.py#L1

I don't understand why this script is written like this (../preprocess.py).

However, once the data processing has finished, I can run run_glue.sh successfully.

So what is the purpose of this script?

Hope to get your answer, thanks.
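
A file whose entire content is a relative path such as ../preprocess.py is what Git writes when a symbolic link is checked out on a system or configuration that does not create symlinks. Whether that is the case for this file is an assumption based on its content. The sketch below checks a file for that pattern; the function name and the heuristic are hypothetical.

```python
import os


def looks_like_symlink_stub(path):
    """Heuristic: a regular file containing only one relative .py path."""
    if os.path.islink(path):
        return False  # a real symlink, nothing to repair
    with open(path, encoding="utf-8") as fh:
        content = fh.read().strip()
    return (
        "\n" not in content
        and content.startswith(("../", "./"))
        and content.endswith(".py")
    )
```

If the check returns True for fairseq_cli/preprocess.py, re-cloning with symlink support enabled, or invoking the target script directly, are options worth trying.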

Reproducing results from the paper with roberta using fairseq

Hi! Thanks for this repository.

I've been trying to reproduce the results from the paper but ran into some problems. I tried the script in fairseq-RoBERTa/launch/FreeLB/rte-fp32-clip.sh which I would've expected to score 88.13 on average as shown in Table 1 in the paper.

I tried:

  1. five seeds with the setup currently checked into the repo:
# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     1  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     2  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     3  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     4  1.4e-1
run_exp      1        2036                 122         1e-5       2           2            8     RTE           6e-2      3      1.6e-1     9016  1.4e-1

and got the scores 0.8597, 0.8884, 0.8057, 0.8669, 0.8633 (mean 0.8568). logs here.

  2. five seeds with the parameters from Table 7 in the paper, using the default fairseq parameters (from https://github.com/pytorch/fairseq/blob/master/examples/roberta/README.glue.md#3-fine-tuning-on-glue-task) for parameters which are not specified:
# run_exp   GPU    TOTAL_NUM_UPDATES    WARMUP_UPDATES  LR      NUM_CLASSES MAX_SENTENCES   FREQ    DATA    ADV_LR  ADV_STEP  INIT_MAG  SEED    MNORM
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     1  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     2  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     3  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     4  0
run_exp      1        2036                 122         2e-5       2           2            8     RTE           3e-2      3      1.5e-1     5  0

and got the scores 0.8741, 0.7949, 0.8417, 0.6330, 0.6007 (mean 0.7488). logs here.
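
To compare the two runs beyond their means, the spread across seeds can be summarized with the standard library. The scores below are copied from the two lists above; the variable names are made up. Because the reported scores are already rounded, the last digit of a recomputed mean may differ slightly from the one quoted.

```python
from statistics import mean, stdev

repo_setup = [0.8597, 0.8884, 0.8057, 0.8669, 0.8633]
table7_setup = [0.8741, 0.7949, 0.8417, 0.6330, 0.6007]

for name, scores in (("repo setup", repo_setup), ("Table 7 setup", table7_setup)):
    print(
        f"{name}: mean={mean(scores):.4f} "
        f"stdev={stdev(scores):.4f} min={min(scores):.4f} max={max(scores):.4f}"
    )
```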

Appreciate any help!

API Key

It looks like your Comet API key made its way into your commits; assuming that this was unintentional, it's probably best to revoke the key and update the code to take this as an input.

Cheers!

FreeLB didn't use the original training samples?

In this code, if adv_init_mag > 0, is the model trained only on adversarial examples?
I ran an experiment on SST-2 using albert-base-v2 with the hyper-parameters in this shell script.

Experiment                   Result
No FreeLB                    93.00
FreeLB                       91.86
FreeLB with original data    93.46

For FreeLB with original data, I added the following code before this line:

inputs['inputs_embeds'] = embeds_init
inputs['dp_masks'] = dp_masks
outputs, dp_masks = model(**inputs)
loss = outputs[0] 
if args.n_gpu > 1:
    loss = loss.mean()  # mean() to average on multi-gpu parallel training
if args.gradient_accumulation_steps > 1:
    loss = loss / args.gradient_accumulation_steps
tr_loss += loss.item()
if args.fp16:
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
else:
    loss.backward()

(Maybe FreeLB's hyper-parameters for albert-base are very different from those for albert-xxlarge?)
