repdistiller's People

Contributors

erjanmx, hobbitlong, sieu-n

repdistiller's Issues

Question on memory consumption for CRD loss when the dataset is very large

Hi,

Thank you for your great work, which helps me a lot.

I want to ask about the CRD contrast memory. In the class ContrastMemory, two buffers are created as random tensors, each of shape (number of data points, number of features). Assuming 128 features, these buffers become really huge when training on a large dataset such as Glint360K. I tried to use CRD for my face-recognition project, whose dataset contains 17091657 pictures; this blows up GPU-memory use and leaves no room for training.

I wonder if you can tell me whether I am understanding this part correctly, and if so, whether there is a solution to this problem. Thanks.
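
To make the scale concrete, here is a rough back-of-the-envelope estimate (my own sketch, not code from the repo) of what the two buffers cost at this dataset size:

    # Rough size of the two ContrastMemory buffers (sketch; float32 assumed).
    n_data = 17_091_657      # images in the dataset mentioned above
    feat_dim = 128           # assumed CRD embedding dimension
    bytes_per_float = 4      # float32

    one_buffer = n_data * feat_dim * bytes_per_float
    print(f"one buffer:  {one_buffer / 2**30:.1f} GiB")      # ~8.1 GiB
    print(f"two buffers: {2 * one_buffer / 2**30:.1f} GiB")  # ~16.3 GiB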

Has anyone implemented Wasserstein Contrastive Representation Distillation?

I noticed a recent paper titled "Wasserstein Contrastive Representation Distillation", which borrows most of its ideas from RepDistiller while replacing the distance metric with the Wasserstein distance. Before implementing the algorithm myself, I want to check whether any code is publicly available, so that I don't reinvent the wheel. BTW, I sent an email to the authors of that paper yesterday but haven't received a response yet.

Regression task

Thank you for sharing your work!
In your paper, the CRD method seems suited to classification tasks; is it also compatible with regression tasks?

Question about pretrained teacher model

Thanks for your work and code!
I just fetched the pretrained teacher model (ResNet-110) from http://shape2prog.csail.mit.edu/repo/resnet110_vanilla/ckpt_epoch_240.pth and tested it on CIFAR-100. The accuracy is 70.27%, but according to your paper it should be around 74%. Is there something wrong with this pretrained model?

The setting of Z_v1 and Z_v2 in class ContrastMemory?

Thanks for your great work and great code!

When reading the code of the class ContrastMemory in memory.py, I could not find any explanation of Z_v1 and Z_v2 in your arXiv preprint. Why should out_v1 be divided by Z_v1? If outputSize is large, out_v1 may become very small, and since outputSize differs a lot between datasets, will it affect the value of out_v1 too much, and even hurt the performance of the student network?

Looking forward to your reply. @HobbitLong

Here is the related code:

        # set Z if haven't been set yet
        if Z_v1 < 0:
            self.params[2] = out_v1.mean() * outputSize
            Z_v1 = self.params[2].clone().detach().item()
            print("normalization constant Z_v1 is set to {:.1f}".format(Z_v1))
        if Z_v2 < 0:
            self.params[3] = out_v2.mean() * outputSize
            Z_v2 = self.params[3].clone().detach().item()
            print("normalization constant Z_v2 is set to {:.1f}".format(Z_v2))

        # compute out_v1, out_v2
        out_v1 = torch.div(out_v1, Z_v1).contiguous()
        out_v2 = torch.div(out_v2, Z_v2).contiguous()
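
For what it's worth, my reading (an interpretation, not something stated in the repo) is that this mirrors the NCE normalization used by memory-bank instance discrimination: Z is a one-shot Monte Carlo estimate of the partition function,

    P(i \mid v) = \frac{\exp(v_i^{\top} v / T)}{Z},
    \qquad
    Z \approx n \,\mathbb{E}_{j}\big[\exp(v_j^{\top} v / T)\big]

which is exactly out_v1.mean() * outputSize, since out_v1 already holds the exponentiated scores and outputSize is n. Smaller out_v1 values on larger datasets are then expected: they are (approximate) probabilities over more items, not an arbitrary shrinking.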

questions about ContrastMemory

Hi, according to Eq. 19 in the paper, the linear transforms gT and gS are applied to the teacher and the student respectively, i.e., gT(t) and gS(s).

But in your code, the teacher transform gT is applied to the student feature, gT(s), and the student transform gS is applied to the teacher feature, gS(t), like
out_v2 = torch.bmm(weight_v1, v2.view(batchSize, inputSize, 1))
out_v2 = torch.exp(torch.div(out_v2, T))
out_v1 = torch.bmm(weight_v2, v1.view(batchSize, inputSize, 1))
out_v1 = torch.exp(torch.div(out_v1, T))

and thus your contrastive loss becomes the sum ContrastLoss(out_v1) + ContrastLoss(out_v2).

I wonder why you did this, instead of computing the output as in Eq. 19, gT(t)·gS(s)/τ, and a single ContrastLoss(out).
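
For reference, my reading of the code (an interpretation, not something stated in the repo) is that the objective becomes the symmetric sum

    \mathcal{L} = \mathcal{L}_{\mathrm{contrast}}(\mathrm{out\_v1}) + \mathcal{L}_{\mathrm{contrast}}(\mathrm{out\_v2}),
    \qquad
    \mathrm{out\_v1} \propto e^{\langle m_{v2},\, v_1 \rangle / T},
    \quad
    \mathrm{out\_v2} \propto e^{\langle m_{v1},\, v_2 \rangle / T}

where m_{v1} and m_{v2} are rows sampled from the v1-side and v2-side memory banks.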

Thanks.

ImageNet results

Hi @HobbitLong, @erjanmx,

Thank you for the very interesting paper!

Is it correct that the code reproduces only the CIFAR results? If so, could you release the code or scripts for the ImageNet experiments too? Even if the results differ a bit, the scripts and running commands would be very helpful.

Best,
Arsenii

how to train my model?

Hi, thanks for your wonderful work. I want to distill my own model in order to reduce its size; how should I modify the code to plug my model in?
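
In case it helps anyone with the same goal, here is a minimal sketch (hypothetical names; not the repo's actual code, though the repo appears to follow a similar registry pattern in models/__init__.py) of how a custom student can be made selectable via --model_s:

    import torch.nn as nn

    # Hypothetical custom student network.
    class MySmallNet(nn.Module):
        def __init__(self, num_classes=100):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1))
            self.classifier = nn.Linear(16, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    # Register it under a name so a training script can build it from a flag.
    model_dict = {'my_small_net': MySmallNet}   # assumed registry
    model_s = model_dict['my_small_net'](num_classes=100)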

About the CE loss

Thanks for your sharing. Do all of the distillation experiments also use the CE loss? I have a question about this training strategy. First, the well-trained teacher model has its parameters fixed; then a one-layer linear transfer (dimension-matching) layer is added before the last classification layer of the teacher and student models respectively, and this linear layer is trainable along with the student model. But if you use the CE loss, it is applied after the original teacher/student models' last layer, so it has no relationship with the linear transfer layer. This feels a little strange: the linear transfer layer has no connection to the final classification task, so how does it learn? And another question: if my student's and teacher's penultimate layers have the same dimension, can I drop the linear transfer layer?
Thanks very much for your reply.

Ensemble Task Implementation

@HobbitLong Thank you very much for making the effort to clean up and post your code for these benchmarks! I'm sure you don't have time to post code for the ensemble distillation task, but I am going to try reproducing that benchmark, so if you remember any tricks or different hyper-parameter settings for that particular task off the top of your head, perhaps we can document them in this issue.

Hyper-parameters for reproducing the results on ImageNet

This is great work. However, when I try to reproduce the results on the ImageNet dataset, there is a 1% accuracy gap between mine and those in your paper.

Would you mind providing the hyper-parameters for training on ImageNet?

Here is mine:
-r 1.0
-a 0.0
-b 0.8
--trial 1
--weight_decay 0.0001
--learning_rate 0.1
--epochs 100
--lr_decay_epochs 30,60,90
--print_freq 500
--batch_size 256

Eight 1080 Ti GPUs are used.
Thanks!

Question about normalization constant Z_v1 and Z_v2 in the ContrastMemory

Excellent work! I have learned a lot from this paper.
But I have a question about the normalization constants Z_v1 and Z_v2 in ContrastMemory.
Z_v1 and Z_v2 are initialized to -1 and are then calculated from out_v1/out_v2.
As the following code shows, Z_v1/Z_v2 are updated only if Z_v1 < 0 and Z_v2 < 0, so they are set only on the first batch.

if Z_v1 < 0:  
    self.params[2] = out_v1.mean() * outputSize  
    Z_v1 = self.params[2].clone().detach().item()  
    print("normalization constant Z_v1 is set to {:.1f}".format(Z_v1))  
if Z_v2 < 0:  
    self.params[3] = out_v2.mean() * outputSize  
    Z_v2 = self.params[3].clone().detach().item()  
    print("normalization constant Z_v2 is set to {:.1f}".format(Z_v2))  

But I think Z_v1 and Z_v2 should be updated on every batch.
What is your opinion on this? What are the benefits of the current design?
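
If it helps the discussion, here is one way to experiment with a per-batch estimate (my own untested sketch, replacing the one-shot guard above; not the repo's code):

    # Hypothetical variant: smooth Z_v1 with an exponential moving average on
    # every batch, instead of freezing it after the first one (untested sketch).
    ema = 0.9                                    # assumed smoothing factor
    z_batch = out_v1.mean().item() * outputSize  # per-batch Monte Carlo estimate
    if Z_v1 < 0:
        Z_v1 = z_batch                           # first batch: initialize
    else:
        Z_v1 = ema * Z_v1 + (1 - ema) * z_batch  # later batches: smooth update

My guess at the benefit of the current design: freezing Z keeps the loss scale constant over training, and since Z only rescales the scores, a slightly stale estimate may not matter much.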

Problem of the order of the normalization in Similarity-Preserving loss.

In the paper, for the Similarity-Preserving loss, the normalization comes before the matrix multiplication. Does the order affect the performance?

import torch
org_f_s = torch.rand((64, 96))
org_f_t = torch.rand((64, 96))

# variant 1: build the Gram matrices first, then row-normalize them
bsz = org_f_s.shape[0]  # originally f_s.shape[0], which is undefined at this point
f_s = org_f_s.view(bsz, -1)
f_t = org_f_t.view(bsz, -1)

G_s = torch.mm(f_s, torch.t(f_s))
# G_s = G_s / G_s.norm(2)
G_s = torch.nn.functional.normalize(G_s)
G_t = torch.mm(f_t, torch.t(f_t))
# G_t = G_t / G_t.norm(2)
G_t = torch.nn.functional.normalize(G_t)

G_diff = G_t - G_s
loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
print(loss)

# variant 2: normalize the features first, then build the Gram matrices
f_s = org_f_s.view(bsz, -1)
f_t = org_f_t.view(bsz, -1)

f_s = torch.nn.functional.normalize(f_s)
G_s = torch.mm(f_s, torch.t(f_s))
# G_s = G_s / G_s.norm(2)

f_t = torch.nn.functional.normalize(f_t)
G_t = torch.mm(f_t, torch.t(f_t))
# G_t = G_t / G_t.norm(2)
G_diff = G_t - G_s
loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
print(loss)
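
For what it's worth, the two variants are genuinely different operations: torch.nn.functional.normalize applied to a matrix L2-normalizes each row by default (dim=1), so the first variant row-normalizes the Gram matrix, while the second builds a Gram matrix of unit-norm features, i.e., pairwise cosine similarities. The commented-out G.norm(2) lines are a third option again, dividing by the Frobenius norm of the whole matrix, so the two printed losses above are expected to differ.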

A question about the experimental results

Thank you for sharing the benchmark.

Your experimental results cover many teacher-student pairs. In particular, for the KD method (Distilling the Knowledge in a Neural Network), the optimal setting (i.e., the temperature) may differ for each pair. Does performance change a lot with the temperature?

A similar issue may exist not only for KD but also for the other methods; what do you think about this?

Selection of teacher

Hi @HobbitLong
It is great work; I really like it. I have a question about the selection of the teacher model. As shown in previous classification papers, researchers usually use frameworks containing vgg16, vgg19, resnet18, resnet34, resnet50, resnet101, and so on. However, most of the teacher models you use are different. Can you explain how you selected the teacher models? Is your method sensitive to the architecture? Please forgive me if I missed something.

Form of the h function for infinite dataset

Thanks for the great code and paper!

I have a question regarding the form of the h function. I have a huge dataset, so it is impossible to store all embeddings in memory; I therefore decided to increase the batch size and mine negatives from it.
So far so good, but as I understand it, due to the big dataset size the numerator almost equals the denominator and h approaches 1.

Do you think it is a good idea to replace h with an angular similarity between embeddings instead of the ratio proposed in the paper? Or could you kindly suggest some other appropriate choice for h?
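
For reference, the critic under discussion (Eq. 19 of the paper, quoted from memory, so please double-check against the arXiv version) has the form

    h(t, s) = \frac{e^{\,g^{T}(t)^{\top} g^{S}(s)/\tau}}{e^{\,g^{T}(t)^{\top} g^{S}(s)/\tau} + N/M}

where N is the number of negatives and M the dataset size. That makes the observation above concrete: with M \gg N the correction term N/M vanishes and h saturates toward 1 for any positive score.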

Results on ImageNet

Thanks for your great work.

When I run experiments on ImageNet, I use the same training hyper-parameters as the official PyTorch examples. The initial learning rate is 0.1 and it is decayed at epochs 30 and 60. I find that in the first two stages, i.e., epochs 1-30 and 31-60, standard KD has higher accuracy than the student trained from scratch. But in the third stage (epochs 61-90), KD's accuracy is lower than that of the student trained from scratch. This phenomenon is exactly the same as in Figure 3 of this paper.

In your work, KD's top-1 accuracy is 0.9 points better than the student trained from scratch. I wonder whether some special procedures, such as the training scheme or hyper-parameters, differ from the official PyTorch examples. It would be even better if you could provide your ImageNet code.

Thanks in advance!

Error while running the code

Hi, I am getting this error while running your code -- any suggestion?

$ python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet8x4 -r 0.1 -a 0.9 -b 0 --trial 1
2021-10-20 11:38:21.416626: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Files already downloaded and verified
Files already downloaded and verified
==> loading teacher model
Traceback (most recent call last):
  File "train_student.py", line 347, in <module>
    main()
  File "train_student.py", line 167, in main
    model_t = load_teacher(opt.path_t, n_cls)
  File "train_student.py", line 138, in load_teacher
    model.load_state_dict(torch.load(model_path)['model'])
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './save/models/resnet32x4_vanilla/ckpt_epoch_240.pth'

The calculation of correlation matrix

Hi, I am interested in the visualization of the correlation matrix in your paper. Could you explain how to compute the correlation matrix of logits across the full CIFAR dataset? Thanks!

AttributeError: 'CIFAR100InstanceSample' object has no attribute 'train_data'

I run your code: python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill crd --model_s resnet8x4 -a 0 -b 0.8 --trial 1
but encounter an error:

....../RepDistiller/dataset/cifar100.py", line 124, in __init__
num_samples = len(self.train_data)

AttributeError: 'CIFAR100InstanceSample' object has no attribute 'train_data'
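
This looks like the torchvision API change: newer torchvision versions (around 0.4/0.5) replaced .train_data/.train_labels with .data/.targets on the CIFAR datasets. A version-tolerant accessor for dataset/cifar100.py might look like this (a sketch, untested; the helper name is mine):

    # Hypothetical helper: read CIFAR arrays across torchvision versions.
    def cifar_arrays(dataset, train=True):
        if hasattr(dataset, 'data'):            # newer torchvision: .data/.targets
            return dataset.data, dataset.targets
        if train:                               # legacy attribute names
            return dataset.train_data, dataset.train_labels
        return dataset.test_data, dataset.test_labels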

teacher model is too big to run with batch_size 64

When I try to train a teacher model on CUB-200 (200 classes) using resnet50 with batch size 64, it runs out of memory on a 16 GB GPU; it only runs when I set the batch size to 8. Why is resnet50 so big?

Teacher/Student Parameter ratio

Hey @HobbitLong, nice paper. Do you have any guidelines on the effectiveness of the approach as the number of parameters in the student decreases relative to the teacher (same architecture), and on what ratio is a sensible starting point?

What difference does the placement of "with torch.no_grad()" make?

Hi,
Thank you for your contribution; the code is very useful for me. I want to ask a question about this code:

    with torch.no_grad():
        # momentum update of the v1-side memory at the indices y
        l_pos = torch.index_select(self.memory_v1, 0, y.view(-1))
        l_pos.mul_(momentum)
        l_pos.add_(torch.mul(v1, 1 - momentum))
        # re-normalize to unit L2 norm and write back
        l_norm = l_pos.pow(2).sum(1, keepdim=True).pow(0.5)
        updated_v1 = l_pos.div(l_norm)
        self.memory_v1.index_copy_(0, y, updated_v1)
        # the same momentum update for the v2-side memory
        ab_pos = torch.index_select(self.memory_v2, 0, y.view(-1))
        ab_pos.mul_(momentum)
        ab_pos.add_(torch.mul(v2, 1 - momentum))
        ab_norm = ab_pos.pow(2).sum(1, keepdim=True).pow(0.5)
        updated_v2 = ab_pos.div(ab_norm)
        self.memory_v2.index_copy_(0, y, updated_v2)

In your implementation of the paper, you compute the loss terms first and then update the memory. If I update the memory first and then compute the loss, what difference would these two orderings make?

Looking forward to your reply! Thank you!

about using the resnet models for cifar10

Hello Sir,
I would like to start by saying how great this work is!
I would like to know whether the resnet models can be used directly on CIFAR-10 instead of CIFAR-100, or whether there are modifications I should make.
I tried using KD (logits) to make resnet32 learn from resnet110, but unfortunately the accuracy I get is pretty bad:
43.35% after 250 epochs with a=0.1 and b=0.9, with no modification to the resnet models or to the adjust_learning_rate function.

Multiple GPU training

Hi, I'm new to knowledge distillation.
I wonder why there is no multi-GPU training in your code.
What is the reason, and is there a solution?
I would appreciate any response, thank you!
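
Not an authoritative answer, but the stock PyTorch route to basic multi-GPU data parallelism is nn.DataParallel; a minimal sketch (note that another issue in this thread reports accuracy shifts once the effective batch size grows):

    import torch
    import torch.nn as nn

    # Minimal sketch: replicate a model across all visible GPUs.
    # Batch size and learning rate may then need retuning.
    def to_multi_gpu(model):
        model = model.cuda()
        if torch.cuda.device_count() > 1:
            model = nn.DataParallel(model)
        return model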

Questions about ContrastMemory

@HobbitLong
Thanks for your excellent work!
I'm wondering why you register two memories here:

    self.register_buffer('memory_v1', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))
    self.register_buffer('memory_v2', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))

since only the student network is trained while the teacher is fixed.
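
One plausible reading (not confirmed by the authors here): although the teacher backbone is frozen, CRD attaches a trainable linear embedding head to each network, so the teacher-side embeddings stored in memory_v2 keep changing as that head trains and cannot be precomputed once. Keeping one bank per side also lets each direction of the contrast draw its negatives cheaply.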

How do you choose the optimal hyper-parameters?

There are several kinds of hyper-parameters:

  1. teacher model hyper-parameters
  2. student model hyper-parameters
  3. KD hyper-parameters (e.g., balancing weights for the different losses)
  4. training hyper-parameters (e.g., learning rate)

It is hard to enumerate every combination, because the search space explodes. How do you find the best (or a good suboptimal) hyper-parameter setting?

Thanks!

Memory issue with the NST loss

Hi,
Thank you for your code; I am trying to reproduce your benchmark. When running the NST loss, the Gram matrix is too large: even on an 8 x P100 (32 GB) machine it runs out of memory. Could you please tell me whether you used some memory trick? Thank you.

KD method in both configurations seems to be doing better than all other methods except the one from your paper

Hi,

Maybe I am not correctly understanding the tables below (copied from the README at the root of the repo), but it seems that for almost all configurations, KD works better than all the other methods except the one proposed in your paper.

The original purpose of all these methods was to improve upon KD (Hinton et al.), so I am very surprised by this table.

Please advise.

Regards & thanks
Kapil

1. Teacher and student are of the same architectural type.

| Teacher / Student | wrn-40-2 / wrn-16-2 | wrn-40-2 / wrn-40-1 | resnet56 / resnet20 | resnet110 / resnet20 | resnet110 / resnet32 | resnet32x4 / resnet8x4 | vgg13 / vgg8 |
|---|---|---|---|---|---|---|---|
| Teacher | 75.61 | 75.61 | 72.34 | 74.31 | 74.31 | 79.42 | 74.64 |
| Student | 73.26 | 71.98 | 69.06 | 69.06 | 71.14 | 72.50 | 70.36 |
| KD | 74.92 | 73.54 | 70.66 | 70.67 | 73.08 | 73.33 | 72.98 |
| FitNet | 73.58 | 72.24 | 69.21 | 68.99 | 71.06 | 73.50 | 71.02 |
| AT | 74.08 | 72.77 | 70.55 | 70.22 | 72.31 | 73.44 | 71.43 |
| SP | 73.83 | 72.43 | 69.67 | 70.04 | 72.69 | 72.94 | 72.68 |
| CC | 73.56 | 72.21 | 69.63 | 69.48 | 71.48 | 72.97 | 70.71 |
| VID | 74.11 | 73.30 | 70.38 | 70.16 | 72.61 | 73.09 | 71.23 |
| RKD | 73.35 | 72.22 | 69.61 | 69.25 | 71.82 | 71.90 | 71.48 |
| PKT | 74.54 | 73.45 | 70.34 | 70.25 | 72.61 | 73.64 | 72.88 |
| AB | 72.50 | 72.38 | 69.47 | 69.53 | 70.98 | 73.17 | 70.94 |
| FT | 73.25 | 71.59 | 69.84 | 70.22 | 72.37 | 72.86 | 70.58 |
| FSP | 72.91 | 0.00 | 69.95 | 70.11 | 71.89 | 72.62 | 70.23 |
| NST | 73.68 | 72.24 | 69.60 | 69.53 | 71.96 | 73.30 | 71.53 |
| CRD | 75.48 | 74.14 | 71.16 | 71.46 | 73.48 | 75.51 | 73.94 |
2. Teacher and student are of different architectural type.

| Teacher / Student | vgg13 / MobileNetV2 | ResNet50 / MobileNetV2 | ResNet50 / vgg8 | resnet32x4 / ShuffleNetV1 | resnet32x4 / ShuffleNetV2 | wrn-40-2 / ShuffleNetV1 |
|---|---|---|---|---|---|---|
| Teacher | 74.64 | 79.34 | 79.34 | 79.42 | 79.42 | 75.61 |
| Student | 64.60 | 64.60 | 70.36 | 70.50 | 71.82 | 70.50 |
| KD | 67.37 | 67.35 | 73.81 | 74.07 | 74.45 | 74.83 |
| FitNet | 64.14 | 63.16 | 70.69 | 73.59 | 73.54 | 73.73 |
| AT | 59.40 | 58.58 | 71.84 | 71.73 | 72.73 | 73.32 |
| SP | 66.30 | 68.08 | 73.34 | 73.48 | 74.56 | 74.52 |
| CC | 64.86 | 65.43 | 70.25 | 71.14 | 71.29 | 71.38 |
| VID | 65.56 | 67.57 | 70.30 | 73.38 | 73.40 | 73.61 |
| RKD | 64.52 | 64.43 | 71.50 | 72.28 | 73.21 | 72.21 |
| PKT | 67.13 | 66.52 | 73.01 | 74.10 | 74.69 | 73.89 |
| AB | 66.06 | 67.20 | 70.65 | 73.55 | 74.31 | 73.34 |
| FT | 61.78 | 60.99 | 70.29 | 71.75 | 72.50 | 72.03 |
| NST | 58.16 | 64.96 | 71.28 | 74.12 | 74.68 | 74.89 |
| CRD | 69.73 | 69.11 | 74.30 | 75.11 | 75.65 | 76.05 |

Why use log_softmax instead of softmax?

I think it should be softmax instead; otherwise p_t and p_s are not comparable.

Could you please explain why?

    def forward(self, y_s, y_t):
        p_s = F.log_softmax(y_s/self.T, dim=1)
        p_t = F.softmax(y_t/self.T, dim=1)
        loss = F.kl_div(p_s, p_t, size_average=False) * (self.T**2) / y_s.shape[0]
        return loss
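
One PyTorch detail that may answer this: F.kl_div expects its first argument to already be log-probabilities, while the target stays in probability space, so the asymmetry is intentional. A quick self-contained check (my sketch):

    import torch
    import torch.nn.functional as F

    # F.kl_div(input, target) computes sum(target * (log(target) - input)),
    # i.e. the input must be in log space while the target is a probability.
    T = 4.0
    y_s, y_t = torch.randn(4, 10), torch.randn(4, 10)
    p_s = F.log_softmax(y_s / T, dim=1)   # log-probabilities (input)
    p_t = F.softmax(y_t / T, dim=1)       # probabilities (target)

    auto = F.kl_div(p_s, p_t, reduction='sum')
    manual = (p_t * (p_t.log() - p_s)).sum()
    print(torch.allclose(auto, manual))   # True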

Cannot achieve the reported accuracy in paper

Thanks for your great work!

But I cannot achieve the accuracy reported in your paper. In the case where the teacher and student have similar architectures, my accuracy is about 1% lower than your results, and in the case where they have different architectures, the performance of KD and CRD is even worse than the model trained from scratch.
The only change I made was to wrap the model with nn.DataParallel and run on 8 GPUs. I enlarged the batch size so that each GPU's batch size matches your original single-GPU setting. I ran all experiments with the hyper-parameters (the several loss weights) from this repo, changing only the teacher and student architectures. I wonder whether DataParallel hurts accuracy, or whether the hyper-parameters have to be tuned carefully for each architecture.

Looking forward to your reply. :)

the introduction of ContrastMemory

Thanks for your excellent work!
I wonder where I can learn about the implementation of the memory bank in the paper. Is it the same as the memory bank in Kaiming's MoCo? I could not find a description in your paper.

AttributeError: 'CIFAR100Instance' object has no attribute 'train_data'

Thanks for your impressive work!

I am trying to reproduce the results from your paper, but when I test the code (i.e., the command python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet8x4 -r 0.1 -a 0.9 -b 0 --trial 1), it shows:

File "~/work/RepDistiller/dataset/cifar100.py", line 44, in getitem
img, target = self.train_data[index], self.train_labels[index]
AttributeError: 'CIFAR100Instance' object has no attribute 'train_data'

By the way, my environment is Ubuntu 18.01, Python 3.6, PyTorch 1.2, torchvision 0.4.1.
I also tested with PyTorch 0.4.0 and hit the same issue.

Could you please help me with this?

Thanks

How to train teacher model

When I try to train a resnet50 teacher model on a new dataset (CUB-200), the backbone is different from the original resnet50. On the one hand, it is too big to use a large batch size (8 is OK on a 16 GB GPU); on the other hand, the top-1 accuracy is only about 10% after training for 300 epochs. Why?
