
r-drop's Introduction

R-Drop: Regularized Dropout for Neural Networks

This repo contains the code for our NeurIPS 2021 paper, R-Drop: Regularized Dropout for Neural Networks.

R-Drop is a simple yet effective regularization method built upon dropout: it minimizes the bidirectional KL divergence between the output distributions of any pair of sub-models sampled from dropout during training.
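
Written out (following the description above and the loss composition used in the usage snippet below, where alpha weights the regularizer), the training objective is

$$
\mathcal{L} = \mathcal{L}_{CE} + \alpha \cdot \frac{1}{2}\Big(\mathcal{D}_{KL}\big(\mathcal{P}_1 \,\|\, \mathcal{P}_2\big) + \mathcal{D}_{KL}\big(\mathcal{P}_2 \,\|\, \mathcal{P}_1\big)\Big),
$$

where $\mathcal{P}_1$ and $\mathcal{P}_2$ are the output distributions of two dropout-sampled forward passes on the same input.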

@inproceedings{liang2021rdrop,
  title={R-Drop: Regularized Dropout for Neural Networks},
  author={Liang, Xiaobo* and Wu, Lijun* and Li, Juntao and Wang, Yue and Meng, Qi and Qin, Tao and Chen, Wei and Zhang, Min and Liu, Tie-Yan},
  booktitle={NeurIPS},
  year={2021}
}

Usage:

R-Drop is an almost universal method for supervised tasks and even performs well in semi-supervised settings. For other settings and tasks not covered in our paper, feel free to try the following piece of code.

import torch.nn.functional as F

# define your task model, which outputs the classifier logits
model = TaskModel()

def compute_kl_loss(p, q, pad_mask=None):

    p_loss = F.kl_div(F.log_softmax(p, dim=-1), F.softmax(q, dim=-1), reduction='none')
    q_loss = F.kl_div(F.log_softmax(q, dim=-1), F.softmax(p, dim=-1), reduction='none')

    # pad_mask is for seq-level tasks
    if pad_mask is not None:
        p_loss.masked_fill_(pad_mask, 0.)
        q_loss.masked_fill_(pad_mask, 0.)

    # you can choose "sum" or "mean" reduction depending on your task
    p_loss = p_loss.sum()
    q_loss = q_loss.sum()

    loss = (p_loss + q_loss) / 2
    return loss

# keep dropout active and forward the same input twice
logits = model(x)
logits2 = model(x)

# cross-entropy loss for the classifier
ce_loss = 0.5 * (F.cross_entropy(logits, label) + F.cross_entropy(logits2, label))

kl_loss = compute_kl_loss(logits, logits2)

# carefully choose the hyper-parameter alpha for your task
loss = ce_loss + alpha * kl_loss
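
Plugged into an ordinary training loop, the pieces above fit together as follows. This is only a minimal sketch: train_loader, the Adam optimizer and its learning rate, and alpha = 1.0 are placeholders, not settings prescribed by the paper.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
alpha = 1.0  # placeholder value; use the KL weight chosen for your task

model.train()  # keep dropout active
for x, label in train_loader:
    logits, logits2 = model(x), model(x)  # two forward passes = two dropout samples

    ce_loss = 0.5 * (F.cross_entropy(logits, label) + F.cross_entropy(logits2, label))
    loss = ce_loss + alpha * compute_kl_loss(logits, logits2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()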

Quick Links:

R-Drop can handle many tasks in both NLP and CV:

  1. Neural Machine Translation Task

  2. Abstractive Summarization Task

  3. Language Modeling Task

  4. Language Understanding Task

  5. Image Classification Task

r-drop's People

Contributors

apeterswu, double22a, dropreg


r-drop's Issues

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED

Hi, after following the instructions here to make the code run for abstractive text summarization, I am running into the following issue:

2021-08-02 18:15:48 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:15:48 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:15:48 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:15:48 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:15:53 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:15:53 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:15:53 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:15:53 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x5612fbcaa000 @  0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
tcmalloc: large alloc 1625169920 bytes == 0x56135ca8c000 @  0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
2021-08-02 18:16:00 | INFO | fairseq.trainer | NOTE: your device does NOT support faster training with --fp16, please switch to FP32 which is likely to be faster
2021-08-02 18:16:00 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:16:00 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:16:00 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:16:00 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:16:01 | INFO | fairseq.trainer | begin training epoch 1
2021-08-02 18:16:11 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
2021-08-02 18:16:20 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
2021-08-02 18:16:29 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
2021-08-02 18:16:38 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2021-08-02 18:16:48 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2021-08-02 18:16:57 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
2021-08-02 18:17:06 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
2021-08-02 18:17:15 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
2021-08-02 18:17:30 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
    main(cfg, **kwargs)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
    raise e
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
    ignore_grad=is_dummy_batch,
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
    loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
    optimizer.backward(loss)
  File "/content/R-Drop/fairseq_src/fairseq/optim/fp16_optimizer.py", line 101, in backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

I am using CUDA 11.4 (I tried 11.0 before), PyTorch 1.8.1, and Python 3.7. I have preprocessed the CNN/Daily Mail data as instructed, I am using bart.large, and script/run_train.sh is in its default configuration.

If I run without the --fp16 option, my code fails instead in the following way

2021-08-02 18:27:32 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:27:32 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:27:32 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:27:32 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:27:37 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:27:37 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:27:37 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:27:37 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x5610c8c0c000 @  0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
tcmalloc: large alloc 1625169920 bytes == 0x56112a1ee000 @  0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
2021-08-02 18:27:42 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:27:42 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:27:43 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:27:43 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:27:44 | INFO | fairseq.trainer | begin training epoch 1
/content/R-Drop/fairseq_src/fairseq/utils.py:345: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  "amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2021-08-02 18:28:42 | INFO | train_inner | epoch 001:    100 / 253944 loss=14.455, nll_loss=9.638, ppl=796.58, wps=117.7, ups=1.72, wpb=68.4, bsz=1.1, num_updates=100, lr=6e-06, gnorm=232.648, clip=100, train_wall=58, gb_free=4.4, wall=66
2021-08-02 18:29:37 | INFO | train_inner | epoch 001:    200 / 253944 loss=10.224, nll_loss=6.292, ppl=78.34, wps=125.7, ups=1.81, wpb=69.4, bsz=1.1, num_updates=200, lr=1.2e-05, gnorm=34.896, clip=100, train_wall=55, gb_free=6.8, wall=121
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
    main(cfg, **kwargs)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
    raise e
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
    ignore_grad=is_dummy_batch,
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
    loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
    optimizer.backward(loss)
  File "/content/R-Drop/fairseq_src/fairseq/optim/fairseq_optimizer.py", line 99, in backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

I have tried to use the bart.base model, thinking it could be due to the size requirements and that my GPU only has 16GB of memory, but I run into dictionary size issues as described here.

Any advice on the above?

Can MSE loss replace KL divergence?

Great job. R-Drop forces the output distributions of different sub-models generated by dropout to be consistent with each other. So can MSE loss replace the KL divergence? Looking forward to your reply.
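
For reference, a minimal sketch of the MSE-based alternative being asked about (this is not code from the repo, and no claim is made that it behaves like the KL term): penalize the squared distance between the two softmax outputs instead of their bidirectional KL.

    import torch.nn.functional as F

    def compute_mse_consistency(p_logits, q_logits):
        # Hypothetical MSE consistency term: squared distance between the
        # two predicted distributions, summed over all elements.
        p = F.softmax(p_logits, dim=-1)
        q = F.softmax(q_logits, dim=-1)
        return F.mse_loss(p, q, reduction='sum')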

About the implementation in transformers: why does ce_loss use mean reduction (the default), while the KL term uses sum reduction?

    for logits in logits_list:
        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                if loss:
                    loss += alpha * loss_fct(logits.view(-1), labels.view(-1))
                else:
                    loss = alpha * loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                if loss:
                    loss += alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
                else:
                    loss = alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

    if loss is not None:
        if self.num_labels == 1:
            loss_fct = MSELoss()
            loss += 1.0 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
        else:
            p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
            q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

            kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none').sum()
            reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none').sum()
            loss += 0.5 * (kl_loss + reverse_kl_loss) / 2.
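
For comparison, here is a hedged sketch (not code from this repo) of what a mean-scaled version of the KL term could look like: the summed bidirectional KL is divided by the batch size so that its magnitude is comparable to a mean-reduced cross entropy.

    import torch.nn.functional as F

    def rdrop_kl_mean_scaled(logits1, logits2):
        # Bidirectional KL summed over classes and batch, then divided by the
        # batch size, so the scale is comparable to a mean-reduced CE loss.
        log_p = F.log_softmax(logits1, dim=-1)
        log_q = F.log_softmax(logits2, dim=-1)
        kl = F.kl_div(log_p, log_q.exp(), reduction='none').sum()
        reverse_kl = F.kl_div(log_q, log_p.exp(), reduction='none').sum()
        return (kl + reverse_kl) / (2.0 * logits1.size(0))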

def bert_kl(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds,

Why do you use (p, q_tec) and (q, p_tec) rather than (p, q) and (q, p) to compute the KL loss?

Here is the function:

    def compute_kl_loss(self, model, net_output, pad_mask=None, reduce=True):
        net_prob = model.get_normalized_probs(net_output, log_probs=True)
        net_prob_tec = model.get_normalized_probs(net_output, log_probs=False)

        p, q = torch.split(net_prob, net_prob.size(0)//2, dim=0)
        p_tec, q_tec = torch.split(net_prob_tec, net_prob_tec.size(0)//2, dim=0)
        
        p_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none')
        q_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none')
        
        if pad_mask is not None:
            p_loss.masked_fill_(pad_mask, 0.)
            q_loss.masked_fill_(pad_mask, 0.)

        if reduce:
            p_loss = p_loss.sum()
            q_loss = q_loss.sum()

        loss = (p_loss + q_loss) / 2
        return loss
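
For context, this pairing follows from the PyTorch API rather than anything repo-specific: torch.nn.functional.kl_div expects its first argument as log-probabilities and its target as probabilities, so each log-space distribution is paired with the probability-space version of the other one. A small illustration:

    import torch
    import torch.nn.functional as F

    logits_a = torch.randn(2, 5)
    logits_b = torch.randn(2, 5)

    log_p = F.log_softmax(logits_a, dim=-1)  # first argument: log-probabilities
    q = F.softmax(logits_b, dim=-1)          # target argument: probabilities

    # Element-wise q * (log q - log p), i.e. KL(q || p) once summed; this is why
    # the function above pairs net_prob (log space) with net_prob_tec (prob space).
    kl = F.kl_div(log_p, q, reduction='none').sum()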

Clarification on Using Concatenated Input for R-Drop Training

Hello,

I'm currently studying your implementation of the R-Drop training algorithm and have a question regarding the use of concatenated input for Dropout.

In the original paper, it's mentioned that "we do not forward the input data twice, instead, we repeat the input data x and concatenate them ([x; x]) in the batch-size dimension, which can make forward procedure happen in the same mini-batch to save the training cost".

I understand that despite the inputs being identical, the Dropout layer might still lead to different outputs due to the randomness in neuron dropout. However, I'm still a bit unclear on why this approach is valid when we hope to obtain two different outputs affected by Dropout differently.

Could you please provide some insights into why and how this approach works as expected? Any clarification would be greatly appreciated.

Thank you very much for your time!

Best,
xyb314
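
A small, repo-independent check of why this works: in training mode, nn.Dropout samples an independent mask for every element it sees, so the two copies inside [x; x] are routed through two different sub-models even though the inputs are identical.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    layer = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
    layer.train()  # dropout active

    x = torch.randn(4, 8)
    xx = torch.cat([x, x], dim=0)  # [x; x] along the batch dimension

    out = layer(xx)
    first, second = out.chunk(2, dim=0)
    print(torch.allclose(first, second))  # False: each copy got a different dropout mask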

Unable to preprocess data for summarization

I followed these instructions:

git clone https://github.com/dropreg/R-Drop.git
cd R-Drop/fairseq_src/
pip install --editable .

and tried to preprocess the data for summarization by running,

bash script/preprocess.sh

However, I get the following error:

/users/gpu/samiks/anaconda3/envs/rdrop/bin/python: No module named examples.roberta.multiprocessing_bpe_encoder

It seems multiprocessing_bpe_encoder is missing from this repo. Are we supposed to run the preprocessing with a separate fairseq install?

Summarization task fails with 'Trying to backward through the graph a second time'

Hi, by following the instructions verbatim in the readme file, the summarization task defined here fails with the following error:

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

unless the following four lines are removed from here:

        if ignore_grad:
            loss *= 0
        with torch.autograd.profiler.record_function("backward"):
            optimizer.backward(loss)             

It seems like these lines are duplicated in this part of the code, causing the error.
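
For anyone hitting this, the error itself is easy to reproduce outside the repo: once backward() has run, the saved activations of the graph are freed, so a second backward() over the same graph fails.

    import torch

    x = torch.randn(4, requires_grad=True)
    y = x * x             # MulBackward saves its inputs for the backward pass
    loss = y.sum()
    loss.backward()       # first backward frees the saved intermediate results
    try:
        loss.backward()   # second backward over the same, already-freed graph
    except RuntimeError as err:
        print(err)        # "Trying to backward through the graph a second time ..."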

R-drop makes my model broken.

In my NMT task, I tried to make the encoder and decoder forward twice, but the kl_loss is too large.
Then I tried to compute the mean instead, but it is too small to have any effect.

Can someone help me?
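
Not something from this repo, but one hedged middle ground between "sum is too large" and "mean is too small" is to take a proper KL per token (summing over the vocabulary only) and then average over the non-padded tokens; pad_mask here is assumed to be a boolean tensor that is True at padding positions.

    import torch.nn.functional as F

    def per_token_kl(p_logits, q_logits, pad_mask=None):
        # Sum over the vocabulary dimension (a proper KL per token), then
        # average over the non-padded tokens instead of summing everything
        # or averaging over every single element.
        log_p = F.log_softmax(p_logits, dim=-1)
        log_q = F.log_softmax(q_logits, dim=-1)
        kl = F.kl_div(log_p, log_q.exp(), reduction='none').sum(dim=-1)
        reverse_kl = F.kl_div(log_q, log_p.exp(), reduction='none').sum(dim=-1)
        loss = 0.5 * (kl + reverse_kl)
        if pad_mask is not None:  # assumed bool mask, True where padded
            loss = loss.masked_fill(pad_mask, 0.)
            return loss.sum() / (~pad_mask).sum().clamp(min=1)
        return loss.mean()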

unable to reproduce results on GLUE

Hi,
I am trying to reproduce the results on GLUE, but my numbers are clearly lower than in the paper.
I ran the code with the suggested hyperparameters on a 32G V100 (CUDA 10.2, Ubuntu, Python 3.6, PyTorch 1.8).

==> run_task_baseline_CoLA.log <==
[INFO|trainer.py:1963] 2021-10-26 01:37:38,527 >>   Num examples = 1043
[INFO|trainer.py:1966] 2021-10-26 01:37:38,527 >>   Batch size = 8
100%|##########| 131/131 [00:05<00:00, 23.67it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 01:37:44,107 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   epoch                     =       9.97
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_loss                 =     1.7947
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_matthews_correlation =     0.6032
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_runtime              = 0:00:05.57
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_samples              =       1043
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_samples_per_second   =    186.945

==> run_task_baseline_MNLI.log <==
[INFO|trainer.py:1963] 2021-10-26 08:31:45,749 >>   Num examples = 9832
[INFO|trainer.py:1966] 2021-10-26 08:31:45,749 >>   Batch size = 8
100%|##########| 1229/1229 [00:54<00:00, 22.53it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 08:32:40,333 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   epoch                   =      10.09
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_accuracy           =      0.853
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_loss               =     0.8149
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_runtime            = 0:00:54.58
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_samples            =       9832
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_samples_per_second =    180.129

==> run_task_baseline_MRPC.log <==
100%|##########| 51/51 [00:02<00:00, 23.40it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 01:30:21,104 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   epoch                   =       9.98
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_accuracy           =      0.848
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_combined_score     =     0.8708
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_f1                 =     0.8935
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_loss               =     1.4835
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_runtime            = 0:00:02.22
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_samples            =        408
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_samples_per_second =    183.422

==> run_task_baseline_QNLI.log <==
[INFO|trainer.py:1963] 2021-10-26 03:19:05,420 >>   Num examples = 5463
[INFO|trainer.py:1966] 2021-10-26 03:19:05,420 >>   Batch size = 8
100%|##########| 683/683 [00:29<00:00, 22.80it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 03:19:35,416 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   epoch                   =      10.11
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_accuracy           =     0.9143
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_loss               =     0.5311
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_runtime            = 0:00:29.99
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_samples            =       5463
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_samples_per_second =    182.125

==> run_task_baseline_QQP.log <==
100%|##########| 5054/5054 [03:42<00:00, 22.72it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 08:26:18,073 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   epoch                   =       9.96
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   eval_accuracy           =      0.912
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   eval_combined_score     =     0.8972
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_f1                 =     0.8824
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_loss               =      0.533
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_runtime            = 0:03:42.46
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_samples            =      40430
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_samples_per_second =     181.74

==> run_task_baseline_RTE.log <==
[INFO|trainer.py:1963] 2021-10-26 01:28:46,682 >>   Num examples = 277
[INFO|trainer.py:1966] 2021-10-26 01:28:46,682 >>   Batch size = 8
100%|##########| 35/35 [00:01<00:00, 24.70it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 01:28:48,142 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   epoch                   =       6.53
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_accuracy           =     0.6462
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_loss               =     1.9563
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_runtime            = 0:00:01.46
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_samples            =        277
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_samples_per_second =    189.656

==> run_task_baseline_SST2.log <==
[INFO|trainer.py:1963] 2021-10-26 02:33:21,244 >>   Num examples = 872
[INFO|trainer.py:1966] 2021-10-26 02:33:21,244 >>   Batch size = 8
100%|##########| 109/109 [00:04<00:00, 23.88it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 02:33:25,854 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   epoch                   =       9.95
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_accuracy           =     0.9255
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_loss               =     0.6675
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_runtime            = 0:00:04.60
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_samples            =        872
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_samples_per_second =    189.197

==> run_task_baseline_STSB.log <==
100%|##########| 188/188 [00:08<00:00, 22.73it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 01:34:24,178 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   epoch                   =       9.99
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_combined_score     =     0.8904
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_loss               =     0.9493
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_pearson            =     0.8921
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_runtime            = 0:00:08.31
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_samples            =       1500
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_samples_per_second =    180.347
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_spearmanr          =     0.8887

Fairseq tasks install work?

I followed the instructions entering:
cd R-Drop/fairseq_src/
pip install --editable .

I still get:

fairseq-train: error: argument --task: invalid choice: 'rdrop_translation' (choose from 'translation', 'multilingual_translation', 'semisupervised_translation', 'language_modeling', 'speech_to_text', 'masked_lm', 'translation_from_pretrained_xlm', 'audio_pretraining', 'denoising', 'multilingual_denoising', 'legacy_masked_lm', 'translation_lev', 'sentence_prediction', 'sentence_ranking', 'cross_lingual_lm', 'translation_from_pretrained_bart', 'multilingual_masked_lm', 'translation_multi_simple_epoch', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')

It's not finding the task. I'm running
bash script/run_train.sh
in the examples/summeration_rdrop/ folder.

Some question about reproducing GLUE

Sorry to bother you; I'm very interested in your work R-Drop, but I encountered some problems when reproducing the GLUE experiments with bert-base-uncased. I used PyTorch 1.8, Python 3.6.13, and pip install --editable ., and set the different hyperparameters for the different datasets according to the readme, but my results on CoLA, RTE and MRPC are only 58.1, 66.4 and 82.8, which are very different from the 62.6, 71.1 and 87.3 reported in the paper.

Will the KLD loss decrease very fast?

Hi, as mentioned in the title, did you find that the KLD part converges very fast and that the value of the KLD loss becomes very small after only a few steps?

JS divergence in the research paper?

In your paper uploaded to arXiv, you mention that the "R-Drop method tries to regularize on the model predictions by minimizing the bidirectional Kullback-Leibler (KL) divergence between these two output distributions for the same sample, which is:"
[equation image from the paper: the bidirectional KL divergence between the two output distributions, $\frac{1}{2}\big(\mathcal{D}_{KL}(\mathcal{P}_1 \,\|\, \mathcal{P}_2) + \mathcal{D}_{KL}(\mathcal{P}_2 \,\|\, \mathcal{P}_1)\big)$]

Is this bidirectional KL divergence different from a standard JS divergence?
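
For reference, the two quantities as they are usually defined (standard definitions, not specific to this repo): the paper's term is the symmetrized (bidirectional) KL, while the JS divergence measures each distribution against their mixture.

$$
\tfrac{1}{2}\big(\mathcal{D}_{KL}(P \,\|\, Q) + \mathcal{D}_{KL}(Q \,\|\, P)\big)
\quad\text{vs.}\quad
\mathcal{D}_{JS}(P \,\|\, Q) = \tfrac{1}{2}\,\mathcal{D}_{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathcal{D}_{KL}(Q \,\|\, M),\quad M = \tfrac{1}{2}(P + Q).
$$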

Inconsistency for KL loss and CE loss hyper-parameters and baselines results in GLUE

An inconsistency exists between the bert_modeling and roberta_modeling files. The BERT loss is

ce(logits1, labels) + ce(logits2, labels) + 0.5/2.0 * (kl(logits1, logits2) + kl(logits2, logits1)), i.e. the paper's alpha is 0.5 here,

while the RoBERTa loss is

0.5 * (ce(logits1, labels) + ce(logits2, labels)) + 0.7/2.0 * (kl(logits1, logits2) + kl(logits2, logits1)), i.e. alpha is 0.7 and the CE loss is also averaged.

### What are the tricks here?

In BERT:

    alpha = 1.0
    for logits in logits_list:
        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                if loss:
                    loss += alpha * loss_fct(logits.view(-1), labels.view(-1))
                else:
                    loss = alpha * loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                if loss:
                    loss += alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
                else:
                    loss = alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

    if loss is not None:
        if self.num_labels == 1:
            loss_fct = MSELoss()
            loss += 1.0 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
        else:
            p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
            q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

            kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none').sum()
            reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none').sum()
            loss += 0.5 * (kl_loss + reverse_kl_loss) / 2.

In RoBERTa:

    loss = None
    if labels is not None:
        if self.num_labels == 1:
            # We are doing regression
            loss_fct = MSELoss()
            if loss is None:
                loss = 0.5 * loss_fct(logits_list[0].view(-1), labels.view(-1))
            else:
                loss += 0.5 * loss_fct(logits_list[-1].view(-1), labels.view(-1))
        else:
            loss_fct = CrossEntropyLoss()
            if loss is None:
                loss = 0.5 * loss_fct(logits_list[0].view(-1, self.num_labels), labels.view(-1))
            else:
                loss += 0.5 * loss_fct(logits_list[-1].view(-1, self.num_labels), labels.view(-1))

    if loss is not None:
        if self.num_labels == 1:
            loss_fct = MSELoss()
            loss += 0.8 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
        else:
            p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
            q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

            kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none')
            reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none')

            loss += 0.7 * (kl_loss.sum() + reverse_kl_loss.sum()) / 2

Question about the proof

In Appendix B, how does Equation 9 transfer to Equation 10?

I think it should be 2(1-p) rather than (1-p):

$\|w^T x_i - \frac{1}{p}(w^T x_i)\,\zeta\| = \|w^T x_i \cdot [1,\dots,1]^T - \frac{1}{p}(w^T x_i)\,\zeta\| = \|w^T x_i\| \cdot \|C\|$

where $\|\cdot\|$ is the 1-norm. A fraction $p$ of the entries of $C$ equal $(1-\frac{1}{p})$ and a fraction $1-p$ equal $1$, hence $\|C\| = p \cdot |1-\frac{1}{p}| + (1-p) = 2(1-p)$.
