
r-drop's Introduction

R-Drop: Regularized Dropout for Neural Networks

This repo contains the code for our NeurIPS 2021 paper, R-Drop: Regularized Dropout for Neural Networks.

R-Drop is a simple yet effective regularization method built upon dropout: it minimizes the bidirectional KL divergence between the output distributions of any pair of sub-models sampled from dropout during training.
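
Written out (following the description above and the loss composition used in the usage snippet below, where alpha weights the regularizer), the training objective is

$$
\mathcal{L} = \mathcal{L}_{CE} + \alpha \cdot \frac{1}{2}\Big(\mathcal{D}_{KL}\big(\mathcal{P}_1 \,\|\, \mathcal{P}_2\big) + \mathcal{D}_{KL}\big(\mathcal{P}_2 \,\|\, \mathcal{P}_1\big)\Big),
$$

where $\mathcal{P}_1$ and $\mathcal{P}_2$ are the output distributions of two dropout-sampled forward passes on the same input.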

@inproceedings{liang2021rdrop,
  title={R-Drop: Regularized Dropout for Neural Networks},
  author={Liang, Xiaobo* and Wu, Lijun* and Li, Juntao and Wang, Yue and Meng, Qi and Qin, Tao and Chen, Wei and Zhang, Min and Liu, Tie-Yan},
  booktitle={NeurIPS},
  year={2021}
}

Usage:

R-Drop is an almost universal method for supervised tasks and even performs well in semi-supervised settings. For other settings and tasks not covered in our paper, feel free to try the following piece of code.

import torch.nn.functional as F

# define your task model, which outputs the classifier logits
model = TaskModel()

def compute_kl_loss(p, q, pad_mask=None):

    p_loss = F.kl_div(F.log_softmax(p, dim=-1), F.softmax(q, dim=-1), reduction='none')
    q_loss = F.kl_div(F.log_softmax(q, dim=-1), F.softmax(p, dim=-1), reduction='none')

    # pad_mask is for seq-level tasks
    if pad_mask is not None:
        p_loss.masked_fill_(pad_mask, 0.)
        q_loss.masked_fill_(pad_mask, 0.)

    # you can choose "sum" or "mean" reduction depending on your task
    p_loss = p_loss.sum()
    q_loss = q_loss.sum()

    loss = (p_loss + q_loss) / 2
    return loss

# keep dropout active and forward the same input twice
logits = model(x)
logits2 = model(x)

# cross-entropy loss for the classifier
ce_loss = 0.5 * (F.cross_entropy(logits, label) + F.cross_entropy(logits2, label))

kl_loss = compute_kl_loss(logits, logits2)

# carefully choose the hyper-parameter alpha for your task
loss = ce_loss + alpha * kl_loss
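
Plugged into an ordinary training loop, the pieces above fit together as follows. This is only a minimal sketch: train_loader, the Adam optimizer and its learning rate, and alpha = 1.0 are placeholders, not settings prescribed by the paper.

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
alpha = 1.0  # placeholder value; use the KL weight chosen for your task

model.train()  # keep dropout active
for x, label in train_loader:
    logits, logits2 = model(x), model(x)  # two forward passes = two dropout samples

    ce_loss = 0.5 * (F.cross_entropy(logits, label) + F.cross_entropy(logits2, label))
    loss = ce_loss + alpha * compute_kl_loss(logits, logits2)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()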

Quick Links:

R-Drop can handle many tasks in both NLP and CV:

  1. Neural Machine Translation Task

  2. Abstractive Summarization Task

  3. Language Modeling Task

  4. Language Understanding Task

  5. Image Classification Task

r-drop's People

Contributors

apeterswu, double22a, dropreg


r-drop's Issues

CUDA error: CUBLAS_STATUS_EXECUTION_FAILED

Hi, after following the instructions here to make the code run for abstractive text summarization, I am running into the following issue:

2021-08-02 18:15:48 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:15:48 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:15:48 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:15:48 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:15:53 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:15:53 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2021-08-02 18:15:53 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:15:53 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:15:53 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:15:53 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x5612fbcaa000 @  0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
tcmalloc: large alloc 1625169920 bytes == 0x56135ca8c000 @  0x7f8425b51b6b 0x7f8425b71379 0x7f838e16525e 0x7f838e1669d2 0x7f838ff265f5 0x7f8401ea8c09 0x561256deea65 0x561256daf7b2 0x561256e22e65 0x561256e1e235 0x561256db034b 0x561256dafe59 0x561256ef725d 0x561256e66c3b 0x561256daef01 0x561256ea0c0d 0x561256e230d8 0x561256e1e235 0x561256cefe2c 0x561256e20318 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a 0x561256e1f93b 0x561256e1dc35 0x561256db073a
2021-08-02 18:16:00 | INFO | fairseq.trainer | NOTE: your device does NOT support faster training with --fp16, please switch to FP32 which is likely to be faster
2021-08-02 18:16:00 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:16:00 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:16:00 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:16:00 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:16:00 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:16:01 | INFO | fairseq.trainer | begin training epoch 1
2021-08-02 18:16:11 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 64.0
2021-08-02 18:16:20 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 32.0
2021-08-02 18:16:29 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 16.0
2021-08-02 18:16:38 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 8.0
2021-08-02 18:16:48 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 4.0
2021-08-02 18:16:57 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 2.0
2021-08-02 18:17:06 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 1.0
2021-08-02 18:17:15 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.5
2021-08-02 18:17:30 | INFO | fairseq.trainer | NOTE: gradient overflow detected, ignoring gradient, setting loss scale to: 0.25
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
    main(cfg, **kwargs)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
    raise e
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
    ignore_grad=is_dummy_batch,
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
    loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
    optimizer.backward(loss)
  File "/content/R-Drop/fairseq_src/fairseq/optim/fp16_optimizer.py", line 101, in backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

I am using CUDA 11.4 (I tried 11.0 before), PyTorch 1.8.1, and Python 3.7. I have preprocessed the CNN/Daily Mail data as instructed, I am using bart.large, and script/run_train.sh is in its default configuration.

If I run without the --fp16 option, my code fails instead in the following way

2021-08-02 18:27:32 | INFO | fairseq_cli.train | task: RDropTranslationTask
2021-08-02 18:27:32 | INFO | fairseq_cli.train | model: BARTModel
2021-08-02 18:27:32 | INFO | fairseq_cli.train | criterion: RegLabelSmoothedCrossEntropyCriterion
2021-08-02 18:27:32 | INFO | fairseq_cli.train | num. model params: 406,290,432 (num. trained: 406,290,432)
2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.embed_tokens.weight
2021-08-02 18:27:37 | INFO | fairseq.trainer | detected shared parameter: encoder.embed_tokens.weight <- decoder.output_projection.weight
2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:27:37 | INFO | fairseq.utils | rank   0: capabilities =  6.0  ; total memory = 15.899 GB ; name = Tesla P100-PCIE-16GB                    
2021-08-02 18:27:37 | INFO | fairseq.utils | ***********************CUDA enviroments for all 1 workers***********************
2021-08-02 18:27:37 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2021-08-02 18:27:37 | INFO | fairseq_cli.train | max tokens per GPU = 1024 and batch size per GPU = None
2021-08-02 18:27:37 | INFO | fairseq.trainer | Preparing to load checkpoint /content/bart.large/model.pt
tcmalloc: large alloc 1625169920 bytes == 0x5610c8c0c000 @  0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
tcmalloc: large alloc 1625169920 bytes == 0x56112a1ee000 @  0x7f6dd1f4eb6b 0x7f6dd1f6e379 0x7f6d3a56225e 0x7f6d3a5639d2 0x7f6d3c3235f5 0x7f6dae2a5c09 0x560ff2dfaa65 0x560ff2dbb7b2 0x560ff2e2ee65 0x560ff2e2a235 0x560ff2dbc34b 0x560ff2dbbe59 0x560ff2f0325d 0x560ff2e72c3b 0x560ff2dbaf01 0x560ff2eacc0d 0x560ff2e2f0d8 0x560ff2e2a235 0x560ff2cfbe2c 0x560ff2e2c318 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a 0x560ff2e2b93b 0x560ff2e29c35 0x560ff2dbc73a
2021-08-02 18:27:42 | INFO | fairseq.trainer | Loaded checkpoint /content/bart.large/model.pt (epoch 41 @ 0 updates)
2021-08-02 18:27:42 | INFO | fairseq.trainer | loading train data for epoch 1
2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.source
2021-08-02 18:27:43 | INFO | fairseq.data.data_utils | loaded 287,227 examples from: /content/cnn-dailymail/cnn_dm-bin/train.source-target.target
2021-08-02 18:27:43 | INFO | fairseq.tasks.translation | /content/cnn-dailymail/cnn_dm-bin/ train source-target 287227 examples
2021-08-02 18:27:43 | WARNING | fairseq.tasks.fairseq_task | 4 samples have invalid sizes and will be skipped, max_positions=(1024, 1024), first few sample ids=[189447, 112053, 286032, 172051]
2021-08-02 18:27:44 | INFO | fairseq.trainer | begin training epoch 1
/content/R-Drop/fairseq_src/fairseq/utils.py:345: UserWarning: amp_C fused kernels unavailable, disabling multi_tensor_l2norm; you may get better performance by installing NVIDIA's apex library
  "amp_C fused kernels unavailable, disabling multi_tensor_l2norm; "
2021-08-02 18:28:42 | INFO | train_inner | epoch 001:    100 / 253944 loss=14.455, nll_loss=9.638, ppl=796.58, wps=117.7, ups=1.72, wpb=68.4, bsz=1.1, num_updates=100, lr=6e-06, gnorm=232.648, clip=100, train_wall=58, gb_free=4.4, wall=66
2021-08-02 18:29:37 | INFO | train_inner | epoch 001:    200 / 253944 loss=10.224, nll_loss=6.292, ppl=78.34, wps=125.7, ups=1.81, wpb=69.4, bsz=1.1, num_updates=200, lr=1.2e-05, gnorm=34.896, clip=100, train_wall=55, gb_free=6.8, wall=121
Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 449, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/content/R-Drop/fairseq_src/fairseq/distributed/utils.py", line 361, in call_main
    main(cfg, **kwargs)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 143, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq_cli/train.py", line 243, in train
    log_output = trainer.train_step(samples)
  File "/usr/lib/python3.7/contextlib.py", line 74, in inner
    return func(*args, **kwds)
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 587, in train_step
    raise e
  File "/content/R-Drop/fairseq_src/fairseq/trainer.py", line 561, in train_step
    ignore_grad=is_dummy_batch,
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/rdrop_translation.py", line 22, in train_step
    loss, sample_size, logging_output = criterion.forward_reg(model, sample, optimizer, 0.7, ignore_grad)
  File "/content/R-Drop/fairseq_src/examples/summeration_rdrop/summeration_rdrop_src/loss/rdrop_cross_entropy_loss.py", line 156, in forward_reg
    optimizer.backward(loss)
  File "/content/R-Drop/fairseq_src/fairseq/optim/fairseq_optimizer.py", line 99, in backward
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

I have tried to use the bart.base model, thinking it could be due to the size requirements and that my GPU only has 16GB of memory, but I run into dictionary size issues as described here.

Any advice on the above?

Can MSE loss replace KL divergence?

Great job. R-Drop forces the output distributions of different sub-models generated by dropout to be consistent with each other. So can MSE loss replace the KL divergence? Looking forward to your reply.
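
For reference, a minimal sketch of the MSE-based alternative being asked about (this is not code from the repo, and no claim is made that it behaves like the KL term): penalize the squared distance between the two softmax outputs instead of their bidirectional KL.

    import torch.nn.functional as F

    def compute_mse_consistency(p_logits, q_logits):
        # Hypothetical MSE consistency term: squared distance between the
        # two predicted distributions, summed over all elements.
        p = F.softmax(p_logits, dim=-1)
        q = F.softmax(q_logits, dim=-1)
        return F.mse_loss(p, q, reduction='sum')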

About the implementation in transformers: why does ce_loss use mean reduction (the default), while the KL term uses sum reduction?

    for logits in logits_list:
        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                if loss:
                    loss += alpha * loss_fct(logits.view(-1), labels.view(-1))
                else:
                    loss = alpha * loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                if loss:
                    loss += alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
                else:
                    loss = alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

    if loss is not None:
        if self.num_labels == 1:
            loss_fct = MSELoss()
            loss += 1.0 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
        else:
            p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
            q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

            kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none').sum()
            reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none').sum()
            loss += 0.5 * (kl_loss + reverse_kl_loss) / 2.
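
For comparison, here is a hedged sketch (not code from this repo) of what a mean-scaled version of the KL term could look like: the summed bidirectional KL is divided by the batch size so that its magnitude is comparable to a mean-reduced cross entropy.

    import torch.nn.functional as F

    def rdrop_kl_mean_scaled(logits1, logits2):
        # Bidirectional KL summed over classes and batch, then divided by the
        # batch size, so the scale is comparable to a mean-reduced CE loss.
        log_p = F.log_softmax(logits1, dim=-1)
        log_q = F.log_softmax(logits2, dim=-1)
        kl = F.kl_div(log_p, log_q.exp(), reduction='none').sum()
        reverse_kl = F.kl_div(log_q, log_p.exp(), reduction='none').sum()
        return (kl + reverse_kl) / (2.0 * logits1.size(0))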

def bert_kl(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds,

Why do you use (p, q_tec) and (q, p_tec) rather than (p, q) and (q, p) to compute the KL loss?

Here is the function:

    def compute_kl_loss(self, model, net_output, pad_mask=None, reduce=True):
        net_prob = model.get_normalized_probs(net_output, log_probs=True)
        net_prob_tec = model.get_normalized_probs(net_output, log_probs=False)

        p, q = torch.split(net_prob, net_prob.size(0)//2, dim=0)
        p_tec, q_tec = torch.split(net_prob_tec, net_prob_tec.size(0)//2, dim=0)
        
        p_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none')
        q_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none')
        
        if pad_mask is not None:
            p_loss.masked_fill_(pad_mask, 0.)
            q_loss.masked_fill_(pad_mask, 0.)

        if reduce:
            p_loss = p_loss.sum()
            q_loss = q_loss.sum()

        loss = (p_loss + q_loss) / 2
        return loss
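
For context, this pairing follows from the PyTorch API rather than anything repo-specific: torch.nn.functional.kl_div expects its first argument as log-probabilities and its target as probabilities, so each log-space distribution is paired with the probability-space version of the other one. A small illustration:

    import torch
    import torch.nn.functional as F

    logits_a = torch.randn(2, 5)
    logits_b = torch.randn(2, 5)

    log_p = F.log_softmax(logits_a, dim=-1)  # first argument: log-probabilities
    q = F.softmax(logits_b, dim=-1)          # target argument: probabilities

    # Element-wise q * (log q - log p), i.e. KL(q || p) once summed; this is why
    # the function above pairs net_prob (log space) with net_prob_tec (prob space).
    kl = F.kl_div(log_p, q, reduction='none').sum()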

Clarification on Using Concatenated Input for R-Drop Training

Hello,

I'm currently studying your implementation of the R-Drop training algorithm and have a question regarding the use of concatenated input for Dropout.

In the original paper, it's mentioned that "we do not forward the input data twice, instead, we repeat the input data x and concatenate them ([x; x]) in the batch-size dimension, which can make forward procedure happen in the same mini-batch to save the training cost".

I understand that despite the inputs being identical, the Dropout layer might still lead to different outputs due to the randomness in neuron dropout. However, I'm still a bit unclear on why this approach is valid when we hope to obtain two different outputs affected by Dropout differently.

Could you please provide some insights into why and how this approach works as expected? Any clarification would be greatly appreciated.

Thank you very much for your time!

Best,
xyb314
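
A small, repo-independent check of why this works: in training mode, nn.Dropout samples an independent mask for every element it sees, so the two copies inside [x; x] are routed through two different sub-models even though the inputs are identical.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    layer = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
    layer.train()  # dropout active

    x = torch.randn(4, 8)
    xx = torch.cat([x, x], dim=0)  # [x; x] along the batch dimension

    out = layer(xx)
    first, second = out.chunk(2, dim=0)
    print(torch.allclose(first, second))  # False: each copy got a different dropout mask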

Unable to preprocess data for summarization

I followed these instructions:

git clone https://github.com/dropreg/R-Drop.git
cd R-Drop/fairseq_src/
pip install --editable .

and tried to preprocess the data for summarization by running,

bash script/preprocess.sh

However, I get the following error:

/users/gpu/samiks/anaconda3/envs/rdrop/bin/python: No module named examples.roberta.multiprocessing_bpe_encoder

It seems multiprocessing_bpe_encoder is missing from this repo. Are we supposed to run the preprocessing with a separate fairseq install?

Summarization task fails with 'Trying to backward through the graph a second time'

Hi, by following the instructions verbatim in the readme file, the summarization task defined here fails with the following error:

RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling .backward() or autograd.grad() the first time.

unless the following four lines are removed from here:

        if ignore_grad:
            loss *= 0
        with torch.autograd.profiler.record_function("backward"):
            optimizer.backward(loss)             

It seems like these lines are duplicated in this part of the code, causing the error.
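
For anyone hitting this, the error itself is easy to reproduce outside the repo: once backward() has run, the saved activations of the graph are freed, so a second backward() over the same graph fails.

    import torch

    x = torch.randn(4, requires_grad=True)
    y = x * x             # MulBackward saves its inputs for the backward pass
    loss = y.sum()
    loss.backward()       # first backward frees the saved intermediate results
    try:
        loss.backward()   # second backward over the same, already-freed graph
    except RuntimeError as err:
        print(err)        # "Trying to backward through the graph a second time ..."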

R-drop makes my model broken.

In my NMT task, I tried to make the encoder and decoder forward twice, but the kl_loss is too large.
Then I tried to compute the mean instead, but it is too small to have any effect.

Can someone help me?
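
Not something from this repo, but one hedged middle ground between "sum is too large" and "mean is too small" is to take a proper KL per token (summing over the vocabulary only) and then average over the non-padded tokens; pad_mask here is assumed to be a boolean tensor that is True at padding positions.

    import torch.nn.functional as F

    def per_token_kl(p_logits, q_logits, pad_mask=None):
        # Sum over the vocabulary dimension (a proper KL per token), then
        # average over the non-padded tokens instead of summing everything
        # or averaging over every single element.
        log_p = F.log_softmax(p_logits, dim=-1)
        log_q = F.log_softmax(q_logits, dim=-1)
        kl = F.kl_div(log_p, log_q.exp(), reduction='none').sum(dim=-1)
        reverse_kl = F.kl_div(log_q, log_p.exp(), reduction='none').sum(dim=-1)
        loss = 0.5 * (kl + reverse_kl)
        if pad_mask is not None:  # assumed bool mask, True where padded
            loss = loss.masked_fill(pad_mask, 0.)
            return loss.sum() / (~pad_mask).sum().clamp(min=1)
        return loss.mean()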

unable to reproduce results on GLUE

Hi,
I am trying to reproduce the results on GLUE, but my numbers are clearly lower than in the paper.
I ran the code with the suggested hyperparameters on a 32G V100 (CUDA 10.2, Ubuntu, Python 3.6, PyTorch 1.8).

==> run_task_baseline_CoLA.log <==
[INFO|trainer.py:1963] 2021-10-26 01:37:38,527 >>   Num examples = 1043
[INFO|trainer.py:1966] 2021-10-26 01:37:38,527 >>   Batch size = 8
100%|##########| 131/131 [00:05<00:00, 23.67it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 01:37:44,107 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   epoch                     =       9.97
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_loss                 =     1.7947
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_matthews_correlation =     0.6032
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_runtime              = 0:00:05.57
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_samples              =       1043
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:37:44,107 >>   eval_samples_per_second   =    186.945

==> run_task_baseline_MNLI.log <==
[INFO|trainer.py:1963] 2021-10-26 08:31:45,749 >>   Num examples = 9832
[INFO|trainer.py:1966] 2021-10-26 08:31:45,749 >>   Batch size = 8
100%|##########| 1229/1229 [00:54<00:00, 22.53it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 08:32:40,333 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   epoch                   =      10.09
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_accuracy           =      0.853
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_loss               =     0.8149
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_runtime            = 0:00:54.58
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_samples            =       9832
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:32:40,333 >>   eval_samples_per_second =    180.129

==> run_task_baseline_MRPC.log <==
100%|##########| 51/51 [00:02<00:00, 23.40it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 01:30:21,104 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   epoch                   =       9.98
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_accuracy           =      0.848
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_combined_score     =     0.8708
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_f1                 =     0.8935
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_loss               =     1.4835
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_runtime            = 0:00:02.22
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_samples            =        408
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:30:21,104 >>   eval_samples_per_second =    183.422

==> run_task_baseline_QNLI.log <==
[INFO|trainer.py:1963] 2021-10-26 03:19:05,420 >>   Num examples = 5463
[INFO|trainer.py:1966] 2021-10-26 03:19:05,420 >>   Batch size = 8
100%|##########| 683/683 [00:29<00:00, 22.80it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 03:19:35,416 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   epoch                   =      10.11
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_accuracy           =     0.9143
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_loss               =     0.5311
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_runtime            = 0:00:29.99
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_samples            =       5463
[INFO|trainer_pt_utils.py:903] 2021-10-26 03:19:35,416 >>   eval_samples_per_second =    182.125

==> run_task_baseline_QQP.log <==
100%|##########| 5054/5054 [03:42<00:00, 22.72it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 08:26:18,073 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   epoch                   =       9.96
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   eval_accuracy           =      0.912
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,073 >>   eval_combined_score     =     0.8972
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_f1                 =     0.8824
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_loss               =      0.533
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_runtime            = 0:03:42.46
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_samples            =      40430
[INFO|trainer_pt_utils.py:903] 2021-10-26 08:26:18,074 >>   eval_samples_per_second =     181.74

==> run_task_baseline_RTE.log <==
[INFO|trainer.py:1963] 2021-10-26 01:28:46,682 >>   Num examples = 277
[INFO|trainer.py:1966] 2021-10-26 01:28:46,682 >>   Batch size = 8
100%|##########| 35/35 [00:01<00:00, 24.70it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 01:28:48,142 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   epoch                   =       6.53
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_accuracy           =     0.6462
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_loss               =     1.9563
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_runtime            = 0:00:01.46
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_samples            =        277
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:28:48,143 >>   eval_samples_per_second =    189.656

==> run_task_baseline_SST2.log <==
[INFO|trainer.py:1963] 2021-10-26 02:33:21,244 >>   Num examples = 872
[INFO|trainer.py:1966] 2021-10-26 02:33:21,244 >>   Batch size = 8
100%|##########| 109/109 [00:04<00:00, 23.88it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 02:33:25,854 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   epoch                   =       9.95
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_accuracy           =     0.9255
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_loss               =     0.6675
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_runtime            = 0:00:04.60
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_samples            =        872
[INFO|trainer_pt_utils.py:903] 2021-10-26 02:33:25,854 >>   eval_samples_per_second =    189.197

==> run_task_baseline_STSB.log <==
100%|##########| 188/188 [00:08<00:00, 22.73it/s]
[INFO|trainer_pt_utils.py:898] 2021-10-26 01:34:24,178 >> ***** eval metrics *****
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   epoch                   =       9.99
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_combined_score     =     0.8904
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_loss               =     0.9493
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_pearson            =     0.8921
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_runtime            = 0:00:08.31
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_samples            =       1500
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_samples_per_second =    180.347
[INFO|trainer_pt_utils.py:903] 2021-10-26 01:34:24,178 >>   eval_spearmanr          =     0.8887

Fairseq tasks install work?

I followed the instructions entering:
cd R-Drop/fairseq_src/
pip install --editable .

I still get:

fairseq-train: error: argument --task: invalid choice: 'rdrop_translation' (choose from 'translation', 'multilingual_translation', 'semisupervised_translation', 'language_modeling', 'speech_to_text', 'masked_lm', 'translation_from_pretrained_xlm', 'audio_pretraining', 'denoising', 'multilingual_denoising', 'legacy_masked_lm', 'translation_lev', 'sentence_prediction', 'sentence_ranking', 'cross_lingual_lm', 'translation_from_pretrained_bart', 'multilingual_masked_lm', 'translation_multi_simple_epoch', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')

It's not finding the task. I'm running
bash script/run_train.sh
in the examples/summeration_rdrop/ folder.

Some question about reproducing GLUE

Sorry to bother you; I'm very interested in your work R-Drop, but I encountered some problems when reproducing the GLUE experiments with bert-base-uncased. I used PyTorch 1.8, Python 3.6.13, and pip install --editable ., and set the different hyperparameters for the different datasets according to the readme, but my results on CoLA, RTE and MRPC are only 58.1, 66.4 and 82.8, which are very different from the 62.6, 71.1 and 87.3 reported in the paper.

Will the KLD loss decrease very fast?

Hi, as mentioned in the title, did you find that the KLD part converges very fast and that the value of the KLD loss becomes very small after only a few steps?

JS divergence in the research paper?

In your paper uploaded to arXiv, you mention that the "R-Drop method tries to regularize on the model predictions by minimizing the bidirectional Kullback-Leibler (KL) divergence between these two output distributions for the same sample, which is:"
[equation image from the paper: the bidirectional KL divergence between the two output distributions, $\frac{1}{2}\big(\mathcal{D}_{KL}(\mathcal{P}_1 \,\|\, \mathcal{P}_2) + \mathcal{D}_{KL}(\mathcal{P}_2 \,\|\, \mathcal{P}_1)\big)$]

Is this bidirectional KL divergence different from a standard JS divergence?
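
For reference, the two quantities as they are usually defined (standard definitions, not specific to this repo): the paper's term is the symmetrized (bidirectional) KL, while the JS divergence measures each distribution against their mixture.

$$
\tfrac{1}{2}\big(\mathcal{D}_{KL}(P \,\|\, Q) + \mathcal{D}_{KL}(Q \,\|\, P)\big)
\quad\text{vs.}\quad
\mathcal{D}_{JS}(P \,\|\, Q) = \tfrac{1}{2}\,\mathcal{D}_{KL}(P \,\|\, M) + \tfrac{1}{2}\,\mathcal{D}_{KL}(Q \,\|\, M),\quad M = \tfrac{1}{2}(P + Q).
$$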

Inconsistency for KL loss and CE loss hyper-parameters and baselines results in GLUE

An inconsistency exists between the bert_modeling and roberta_modeling files. The BERT loss is

ce(logits1, labels) + ce(logits2, labels) + 0.5/2.0 * (kl(logits1, logits2) + kl(logits2, logits1)), i.e. the paper's alpha is 0.5 here,

while the RoBERTa loss is

0.5 * (ce(logits1, labels) + ce(logits2, labels)) + 0.7/2.0 * (kl(logits1, logits2) + kl(logits2, logits1)), i.e. alpha is 0.7 and the CE loss is also averaged.

### What are the tricks here?

In BERT:

    alpha = 1.0
    for logits in logits_list:
        if labels is not None:
            if self.num_labels == 1:
                #  We are doing regression
                loss_fct = MSELoss()
                if loss:
                    loss += alpha * loss_fct(logits.view(-1), labels.view(-1))
                else:
                    loss = alpha * loss_fct(logits.view(-1), labels.view(-1))
            else:
                loss_fct = CrossEntropyLoss()
                if loss:
                    loss += alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
                else:
                    loss = alpha * loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

    if loss is not None:
        if self.num_labels == 1:
            loss_fct = MSELoss()
            loss += 1.0 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
        else:
            p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
            q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

            kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none').sum()
            reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none').sum()
            loss += 0.5 * (kl_loss + reverse_kl_loss) / 2.

In RoBERTa:

    loss = None
    if labels is not None:
        if self.num_labels == 1:
            # We are doing regression
            loss_fct = MSELoss()
            if loss is None:
                loss = 0.5 * loss_fct(logits_list[0].view(-1), labels.view(-1))
            else:
                loss += 0.5 * loss_fct(logits_list[-1].view(-1), labels.view(-1))
        else:
            loss_fct = CrossEntropyLoss()
            if loss is None:
                loss = 0.5 * loss_fct(logits_list[0].view(-1, self.num_labels), labels.view(-1))
            else:
                loss += 0.5 * loss_fct(logits_list[-1].view(-1, self.num_labels), labels.view(-1))

    if loss is not None:
        if self.num_labels == 1:
            loss_fct = MSELoss()
            loss += 0.8 * loss_fct(logits_list[0].view(-1), logits_list[-1].view(-1))
        else:
            p = torch.log_softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            p_tec = torch.softmax(logits_list[0].view(-1, self.num_labels), dim=-1)
            q = torch.log_softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)
            q_tec = torch.softmax(logits_list[-1].view(-1, self.num_labels), dim=-1)

            kl_loss = torch.nn.functional.kl_div(p, q_tec, reduction='none')
            reverse_kl_loss = torch.nn.functional.kl_div(q, p_tec, reduction='none')

            loss += 0.7 * (kl_loss.sum() + reverse_kl_loss.sum()) / 2

Question about the proof

In Appendix B, how does Equation 9 transfer to Equation 10?

I think it should be 2(1-p) rather than (1-p):

$\|w^T x_i - \frac{1}{p}(w^T x_i)\,\zeta\| = \|w^T x_i \cdot [1,\dots,1]^T - \frac{1}{p}(w^T x_i)\,\zeta\| = \|w^T x_i\| \cdot \|C\|$

where $\|\cdot\|$ is the 1-norm. A fraction $p$ of the entries of $C$ equal $(1-\frac{1}{p})$ and a fraction $1-p$ equal $1$, hence $\|C\| = p \cdot |1-\frac{1}{p}| + (1-p) = 2(1-p)$.
