
rsc's Introduction

Self-Challenging Improves Cross-Domain Generalization

This is the official implementation of:

Zeyi Huang*, Haohan Wang*, Eric P. Xing, and Dong Huang, Self-Challenging Improves Cross-Domain Generalization, ECCV 2020 (Oral), arXiv version.


Notice about the DG task: to reproduce the numbers reported in the testing part, you should use the same environment configuration listed here, including software, hardware, and the random seed. With a different environment configuration, as with other DG repositories, you will need to tune the hyperparameters slightly; in my experience, a larger batch size and early stopping are usually enough. If that still does not solve the problem, don't panic! Send me an email (zeyih(at)andrew(dot)cmu(dot)edu) describing your environment and I'll help you out.

Update: To mitigate fluctuation across environments, we modified RSC to follow a curriculum schedule, and we unified the RSC implementation across network architectures. If you have any questions about the code, feel free to contact me or open an issue.
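As a rough illustration of what this curriculum looks like, here is a minimal sketch based on the scheduling line quoted in the issues below; the exact code lives in resnet.py and may differ. The fraction of features RSC challenges starts around 30% and grows as training progresses:

    # Sketch of the curriculum on the RSC percentage (not the exact repo code):
    # the challenged fraction starts at 30% and ramps up linearly with the epoch.
    def rsc_percent(epoch, interval=30):
        # 'interval' is assumed here to be the total number of training epochs
        return 3.0 / 10 + (epoch / interval) * 2.0 / 10

    for epoch in (0, 10, 20, 29):
        print(epoch, round(rsc_percent(epoch), 3))   # 0.3 -> ~0.49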

Citation:

@inproceedings{huangRSC2020,
  title={Self-Challenging Improves Cross-Domain Generalization},
  author={Zeyi Huang and Haohan Wang and Eric P. Xing and Dong Huang},
  booktitle={ECCV},
  year={2020}
}

Installation

Requirements:

  • Python == 3.7
  • PyTorch == 1.1.0
  • torchvision == 0.3.0
  • CUDA == 10.0
  • TensorFlow == 1.14
  • GPU: RTX 2080

Data Preparation

Download the PACS dataset from here. Once you have downloaded the data, update the files in data/correct_txt_list so that the image paths match the actual location of your files. Note: make sure you use the same train/val/test split as in the PACS paper.
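If it helps, here is a small hypothetical helper for rewriting those lists, assuming each line follows the usual "<image_path> <label>" convention of JigenDG-style txt lists (check your files first; the format may differ):

    # Hypothetical helper: prepend your local PACS root to every image path in a
    # txt list, assuming each line is "<image_path> <label>" (JigenDG convention).
    import os

    def rewrite_txt_list(txt_in, txt_out, pacs_root):
        with open(txt_in) as f_in, open(txt_out, 'w') as f_out:
            for line in f_in:
                path, label = line.strip().rsplit(' ', 1)
                f_out.write('%s %s\n' % (os.path.join(pacs_root, path), label))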

Running on the PACS dataset

Experiments with different source/target domains are listed in train.py (L145-152).

To train a ResNet18, simply run:

  python train.py --net resnet18

To test a ResNet18, you can download the RSC models and training logs below:

Backbone    Target Domain    Acc %    Models
ResNet-18   Photo            96.05    download
ResNet-18   Sketch           82.67    download
ResNet-18   Cartoon          81.61    download
ResNet-18   Art              85.16    download

To Do

  • Faster-RCNN

  • Other pretrained models

New ImageNet ResNet baselines trained with RSC.

Backbone     Top-1 Acc %    Top-5 Acc %    pth models
ResNet-50    77.18          93.53          download
ResNet-101   78.23          94.16          download

Acknowledgement

We borrowed code and data augmentation techniques from JigenDG and ImageNet-pytorch.


rsc's Issues

Code breaks at epoch 40

Hi All,

I've tried to run the code for more than 30 epochs (e.g. 100 epochs), but it breaks with an IndexError. Please advise.

Where is the caffenet.py file?

I tried to run the training command, but it fails with ImportError: cannot import name 'caffenet', and I can't find caffenet.py in ./models/.

The result record in train.py has two items.

In train.py (L145), I see that the code also records the best performance on the test domain in logger.save_best().

Could you point out which result the paper reports: the best performance on the test domain (test_res.max()) or the test performance at the best validation epoch (test_res[idx_best])?

Thanks a lot!

Alexnet issue

Hi,
I am trying to run the RSC code with an AlexNet backbone to reproduce the results reported in Table 6. Could you confirm whether you used the same AlexNet model file as JigenDG (https://github.com/fmcarlucci/JigenDG)?

thanks

Experimental results in paper

Dear authors, thank you for your work.
I ran your implementation and got lower results than the ones reported in the paper. I have some questions:

  1. Are the results reported in the paper (average ~85.15) selected by the test set?

  2. The average results I got from your implementation are:

  • selected by validation set: ~83.37
  • selected by test set: ~84.25

  3. When I turn off your seed configs and try 5 runs with random seeds, I get:

  • selected by validation set: ~82.63
  • selected by test set: ~83.81

Thank you for your contribution!

Choice of datapoints to which RSC is applied

In the paper it is mentioned that RSC is applied to a random subset of the current batch, but lines 126 to 146 in resnet.py seem to do something more sophisticated.

a) Can you explain what is done in that part of the code, especially the meaning of the variables used in lines 142 to 146?
b) Why is the mask a variable that requires grad? (line 149 in resnet.py)
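For contrast, the paper's description corresponds to something much simpler. A hedged sketch of applying the mask only to a random fraction of the batch (not the actual code in resnet.py) could look like this:

    # Minimal sketch of the batch-percentage idea as described in the paper:
    # apply the self-challenging mask only to a random subset of the batch and
    # leave the remaining samples untouched. Not the exact logic in resnet.py.
    import torch

    def apply_to_random_subset(features, mask, batch_percent=1.0 / 3):
        # features, mask: [N, C, H, W]; keep the mask only for a random batch_percent of rows
        n = features.size(0)
        chosen = torch.randperm(n)[: int(round(n * batch_percent))]
        keep = torch.ones_like(mask)
        keep[chosen] = mask[chosen]
        return features * keep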

Reasoning & Effects for batch part in the forward pass of resnet.py

Hey guys,

I've worked through the code in resnet.py and found lines 152-172 (see below) in addition to the spatial-wise and channel-wise RSC.

            # ----------------------------------- batch ----------------------------------------
            # Class probabilities before masking
            cls_prob_before = F.softmax(output, dim=1)
            # Re-run the classifier head on the masked features
            x_new_view_after = x_new * mask_all
            x_new_view_after = self.avgpool(x_new_view_after)
            x_new_view_after = x_new_view_after.view(x_new_view_after.size(0), -1)
            x_new_view_after = self.class_classifier(x_new_view_after)
            cls_prob_after = F.softmax(x_new_view_after, dim=1)

            # One-hot encoding of the ground-truth labels (index) as a dense [num_rois, class_num] matrix
            sp_i = torch.ones([2, num_rois]).long()
            sp_i[0, :] = torch.arange(num_rois)
            sp_i[1, :] = index
            sp_v = torch.ones([num_rois])
            one_hot_sparse = torch.sparse.FloatTensor(sp_i, sp_v, torch.Size([num_rois, class_num])).to_dense().cuda()
            # Per-sample drop in ground-truth class probability caused by the mask
            before_vector = torch.sum(one_hot_sparse * cls_prob_before, dim=1)
            after_vector = torch.sum(one_hot_sparse * cls_prob_after, dim=1)
            change_vector = before_vector - after_vector - 0.0001
            change_vector = torch.where(change_vector > 0, change_vector, torch.zeros(change_vector.shape).cuda())
            # Keep the mask only for the ~1/3 of samples whose confidence dropped the most;
            # reset the mask to all-ones for every other sample
            th_fg_value = torch.sort(change_vector, dim=0, descending=True)[0][int(round(float(num_rois) * 1/3))]
            drop_index_fg = change_vector.gt(th_fg_value)
            ignore_index_fg = 1 - drop_index_fg
            not_01_ignore_index_fg = ignore_index_fg.nonzero()[:, 0]
            mask_all[not_01_ignore_index_fg.long(), :] = 1

This part is not described in the paper, and I don't know the reasoning behind it or the effect of excluding it. From my understanding, you measure the effect of your method per sample and drop the mask where it doesn't help. Is this necessary? How much does it improve performance?

I guess this was added during the reviews, which is why it is not included in the original paper.

Best,
Robin

JigsawNewDataset Class error

Hi,
I am trying to run the code, but I am getting an error while executing the following line inside the JigsawNewDataset class:

    all_perm = np.load('permutations_%d.npy' % (classes))

I have looked through all the files and there is no permutations_%d.npy file as part of the code or the dataset, and I cannot find a line that generates it.

Help would be appreciated

Random seeds value

Dear authors, thank you for providing code for DG; it is a good baseline for us to build new ideas on.
However, I have a question about your reported results. I see that your code sets torch.manual_seed(0) and torch.cuda.manual_seed(0), i.e. it fixes the CUDA seed, but your paper says you ran 5 different times (which I understand to mean with different seeds) and reported the average.
I am asking because if I remove those two lines, I cannot reproduce your results, especially when tuning on the validation set (only around 81.2% accuracy). Could you please explain this?

Looking forward to your answer, thanks.

Confusing naming conventions

I find myself confused by the naming conventions/comments, and you should probably be more precise here. If I understood correctly, channel_mean is the mean used in channel-wise RSC and is computed across the spatial dimensions, while spatial_mean is the mean used in spatial-wise RSC and is computed across the channel dimension.

This is why we need F.interpolate in the spatial-wise RSC, since it doesn't make sense for architectures with average pooling afterwards, as mentioned in #3. By the way, this isn't mentioned in the paper; what is the intuition behind it?
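As a minimal sketch of the two means as interpreted in this comment (dummy tensors, variable names roughly follow resnet.py; note that the actual spatial-wise computation in the repo differs, as discussed in the issues below):

    # Sketch of the two means as described above; grads_val stands in for the
    # gradient of the feature map, shape [N, C, H, W]. Not the repo's exact code.
    import torch

    n, c, h, w = 8, 512, 7, 7
    grads_val = torch.randn(n, c, h, w)                   # dummy gradients
    channel_mean = grads_val.view(n, c, -1).mean(dim=2)   # [N, C]: mean over spatial dims (channel-wise RSC)
    spatial_mean = grads_val.mean(dim=1)                  # [N, H, W]: mean over channels (spatial-wise RSC)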

cannot reproduce results on PACS

I ran your code on PACS to reproduce your results in Table 6. I ran it three times; the results are:

Best val 0.965909, corresponding test 0.68847 - best test: 0.787732, best epoch: 18

Best val 0.967532, corresponding test 0.771698 - best test: 0.782642, best epoch: 25

Best val 0.962662, corresponding test 0.750573 - best test: 0.780606, best epoch: 17

The average result is 73.69, which is significantly lower than your result (80.85) in Table 6 for ResNet-18.

I guess I may be doing something wrong. How can I reproduce the results?

My environment is:

pytorch=1.1.0, torchvision=0.3.0.

Questions about the implementation.

Thanks for sharing your code and this interesting work. I have some questions about the method details.

  1. In the newest version of the code, it seems that spatial-wise RSC is calculated from the activations instead of the gradients. Is this intended?

    spatial_mean = torch.sum(x_new * grad_channel_mean, 1)

  2. Another thing that confuses me is why we choose to mute the top 33% of features with the largest gradients. The bottom 33% of features (those whose gradients are negative with a large magnitude) are also highly correlated with the prediction, just in a negative way.

Thanks,
Zeju

Question about the spatial-wise RSC implementation

Can you please elaborate more on question (1)? I still don't understand why spatial-wise RSC was implemented as a sum of the activations (x_new) weighted by the mean gradient of each channel (spatial_mean = torch.sum(x_new * grad_channel_mean, 1)).

Based on the description in the paper ("global average pooling is applied along the channel dimension to the gradient tensor G to produce a weighting matrix w_i of size [7 × 7]"), instead of lines 99-103 it should be just:
spatial_mean = torch.mean(grads_val.view(num_rois, num_channel, -1), dim=1).

Originally posted by @AhmedFrikha in #15 (comment)
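To make the contrast concrete, here is a minimal sketch of the two weightings discussed in this thread, using dummy tensors (shapes follow resnet.py; the first line mirrors the repo's current computation, the second the paper-style version quoted above):

    # Sketch contrasting the two spatial weightings. x_new and grads_val stand
    # in for the [N, C, H, W] feature map and its gradient; not the repo's exact code.
    import torch

    num_rois, num_channel, h, w = 8, 512, 7, 7
    x_new = torch.randn(num_rois, num_channel, h, w)
    grads_val = torch.randn(num_rois, num_channel, h, w)

    # Repo version: activations weighted by each channel's mean gradient, summed over channels
    grad_channel_mean = grads_val.view(num_rois, num_channel, -1).mean(2).view(num_rois, num_channel, 1, 1)
    spatial_mean_repo = torch.sum(x_new * grad_channel_mean, 1)                  # [N, H, W]

    # Paper-style version suggested in the issue: average the gradient tensor over channels
    spatial_mean_paper = torch.mean(grads_val.view(num_rois, num_channel, -1), dim=1).view(num_rois, h, w)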

Why not use cross entropy for determining which features to mask?

In equation 1 in the paper, you compute the gradient of the element-wise product of the output and the ground-truth one-hot label with respect to the input feature vector, in order to find the features that contribute most to the ground-truth class logit. For a softmax output, we ideally want the true-label logit to go towards positive infinity while the other logits go towards negative infinity.

So my question is, why not compute a more classical cross-entropy loss here:

one_hot = torch.sum(output * one_hot_sparse)

instead of just the sum of the true logits?
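For illustration, here is a minimal sketch of the two gradient targets being compared here, using a toy, hypothetical linear classifier head; only the line computing one_hot is taken from the repo, everything else is an assumption for the sake of the example:

    # Toy comparison of the two gradient targets (not the repo's exact code).
    import torch
    import torch.nn.functional as F

    features = torch.randn(8, 512, requires_grad=True)     # pooled features
    classifier = torch.nn.Linear(512, 7)                   # hypothetical classifier head
    labels = torch.randint(0, 7, (8,))
    output = classifier(features)
    one_hot_sparse = F.one_hot(labels, num_classes=7).float()

    # Variant used in the repo: gradient of the summed ground-truth logits
    one_hot = torch.sum(output * one_hot_sparse)
    grads_logit = torch.autograd.grad(one_hot, features, retain_graph=True)[0]

    # Variant proposed in this issue: gradient of the cross-entropy loss
    ce = F.cross_entropy(output, labels)
    grads_ce = torch.autograd.grad(ce, features)[0]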

Question about the batch part

The implementation of the batching part seems quite unintuitive for me, maybe you can clear up some of my understanding:

We calculate the before_vector and after_vector which represent the class probabilities for the correct class before and after applying the masking for certain samples inside each batch.

Next, we subtract the after_vector from the before_vector, which means entries in change_vector indicate whether the masking makes our classifier more or less certain about the correct class for that specific sample: negative values mean more certain, positive values mean less certain.

We are only interested in the positive values, cases where masking decreases confidence, hence we calculate the threshold for Top-p according to only the positive values as done in L.134 and in L.135.

Next, we check which entries are greater than our threshold in L.136, this yields a binary mask.

This is where my question comes in:

L.137 basically inverts the mask. So instead of reverting the masking for the Top-p percentage of samples where it decreases confidence, we are now reverting it for all samples besides the Top-p?

Am I correct about this? Why was it done this way? For self-challenging, applying the masking to the Top-p percentage of samples with negative values seems more intuitive.

Also, while you're at it:

What is the purpose of subtracting 1e-5 in L.133? To me, this looks like a threshold (epsilon), i.e. the minimum confidence change required to keep the masking. How did the performance change without it? In theory, this would be another hyperparameter.
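As a tiny numeric sketch of the thresholding and reversion described above (illustrative values only, not the repo's exact code):

    # Toy example: keep the mask for the ~1/3 of samples whose confidence dropped
    # most, and revert it (set to all-ones) for everyone else.
    import torch

    change_vector = torch.tensor([0.30, 0.00, 0.05, 0.20, 0.00, 0.10])  # per-sample confidence drop
    num_rois = change_vector.numel()
    th = torch.sort(change_vector, descending=True)[0][int(round(num_rois * 1.0 / 3))]
    drop_index = change_vector.gt(th)      # top-drop samples: their mask is kept
    ignore_index = drop_index == 0         # all other samples: their mask is reverted
    print(drop_index, ignore_index)        # first and fourth samples keep the mask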

Application of the method on regression problems

Dear authors,

Thank you for your wonderful and interesting work! I have a question about adapting your method to regression problems. When the label space is continuous, as in monocular depth estimation, could you please provide some insight into how to modify the current version?

Thanks in advance!

Performance without pre-processing transforms.ColorJitter

Hi,

I find that DeepAll with the transforms.ColorJitter pre-processing can reach close to 76% accuracy when the test domain is Sketch (the reported number is 70.59%).

Have you tried your method without transforms.ColorJitter? I see that your code uses it.

Thanks.

Non-Reproducible Results

I've been following the repository for a while, and it seems other people have observed something similar: the results presented in the paper do not seem to be reproducible, even with the code you provide. The method has also changed heavily in the implementation over the last few weeks, at one point also using downsampled middle layers, which isn't described in the paper. Very recently you also introduced self.pecent = 3.0 / 10 + (epoch / interval) * 2.0 / 10 as a scheduling factor.

I've run the currently provided code, producing the results below. Funnily enough, the Sketch domain seems to be the only one not underperforming. I've tested the performance over the last few weeks with your different iterations and always observed similar underperformance; I just never bothered to create an issue since there were already a few open for the code base at the time, e.g. #2 and #5.

Commit: 10540a8
Details: Environment as specified, Default hyperparameters, running one script at a time, 5 runs, Official PACS splits
Backbone: ResNet-18

PHOTO: Paper Result - 95.99%

Difference to observed maximum is -1.62%, difference to observed minimum is -2.76%
Sample mean/std deviation for best validation performance: 93.73 +/- 0.4

Best val 0.958482, corresponding test 0.932335 - best test: 0.943713, best epoch: 17
Best val 0.95255, corresponding test 0.934132 - best test: 0.943114, best epoch: 17
Best val 0.946619, corresponding test 0.943713 - best test: 0.943713, best epoch: 3
Best val 0.950178, corresponding test 0.937126 - best test: 0.943114, best epoch: 18
Best val 0.953737, corresponding test 0.939521 - best test: 0.947904, best epoch: 19

ART: Paper Result - 83.43%

Difference to observed maximum is -1.75%, Difference to observed minimum is -4.04%
Sample mean/std deviation for best validation performance: 80.41 +/- 1.1

Best val 0.967742, corresponding test 0.816895 - best test: 0.817383, best epoch: 18
Best val 0.962779, corresponding test 0.794434 - best test: 0.807617, best epoch: 13
Best val 0.961538, corresponding test 0.800781 - best test: 0.81543, best epoch: 19
Best val 0.967742, corresponding test 0.814941 - best test: 0.825684, best epoch: 19
Best val 0.961538, corresponding test 0.793945 - best test: 0.805664, best epoch: 16

CARTOON: Paper Result - 80.31%

Difference to observed maximum is -1.47%, Difference to observed minimum is -3.37%
Sample mean/std deviation for best validation performance: 77.53 +/- 0.9

Best val 0.965251, corresponding test 0.781143 - best test: 0.808874, best epoch: 19
Best val 0.96139, corresponding test 0.773891 - best test: 0.799488, best epoch: 18
Best val 0.956242, corresponding test 0.765785 - best test: 0.789676, best epoch: 18
Best val 0.96139, corresponding test 0.788396 - best test: 0.799915, best epoch: 19
Best val 0.96139, corresponding test 0.767491 - best test: 0.800341, best epoch: 19

SKETCH: Paper Result - 80.85%

Difference to observed maximum is +1.05%, Difference to observed minimum is -1.67%
Sample mean/std deviation for best validation performance: 80.79 +/- 1.0

Best val 0.957792, corresponding test 0.811402 - best test: 0.811402, best epoch: 16
Best val 0.962662, corresponding test 0.805548 - best test: 0.808094, best epoch: 18
Best val 0.965909, corresponding test 0.791805 - best test: 0.799949, best epoch: 18
Best val 0.967532, corresponding test 0.819038 - best test: 0.820056, best epoch: 18
Best val 0.956169, corresponding test 0.811911 - best test: 0.82082, best epoch: 17

Question about the spatial-wise RSC

I am confused about the spatial-wise RSC. If you apply average pooling to z (the final feature map) and feed it into the fully connected layer, I believe that for any channel of z, all gradient values in that channel will be the same. Therefore, after average pooling along the channel dimension, all cells in the 7x7 weighting matrix will also have the same value. So how do you select the top p percentage? Did I miss something? Thank you.
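A quick sketch that checks the observation above, assuming a plain global-average-pool plus a hypothetical linear head (a toy check, not code from this repo):

    # With global average pooling before the classifier, the gradient w.r.t. the
    # final feature map is constant across the spatial positions of each channel,
    # so its channel-wise mean is a uniform 7x7 map.
    import torch
    import torch.nn.functional as F

    z = torch.randn(1, 512, 7, 7, requires_grad=True)
    fc = torch.nn.Linear(512, 7)                        # hypothetical classifier head
    logits = fc(F.adaptive_avg_pool2d(z, 1).view(1, -1))
    score = logits[0, 3]                                # any class score
    grad = torch.autograd.grad(score, z)[0]             # [1, 512, 7, 7]
    print(grad[0, 0])                                   # identical values at all 49 positions
    print(grad.mean(dim=1)[0])                          # uniform 7x7 weighting matrix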
