hobbitlong / repdistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods

License: BSD 2-Clause "Simplified" License

Python 97.44% Shell 2.56%

repdistiller's Issues

Error while running the code

Hi, I am getting this error while running your code -- any suggestion?

$ python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet8x4 -r 0.1 -a 0.9 -b 0 --trial 1
2021-10-20 11:38:21.416626: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Files already downloaded and verified
Files already downloaded and verified
==> loading teacher model
Traceback (most recent call last):
  File "train_student.py", line 347, in <module>
    main()
  File "train_student.py", line 167, in main
    model_t = load_teacher(opt.path_t, n_cls)
  File "train_student.py", line 138, in load_teacher
    model.load_state_dict(torch.load(model_path)['model'])
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './save/models/resnet32x4_vanilla/ckpt_epoch_240.pth'
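
The traceback ends in a FileNotFoundError, which means the checkpoint passed via --path_t does not exist on disk yet (the teacher has to be trained or downloaded first). A minimal sketch of a pre-flight check that fails with a clearer message; the helper name `require_checkpoint` is hypothetical and not part of the repository:

```python
import os


def require_checkpoint(path):
    """Return the absolute path if the file exists, else raise a descriptive error."""
    full = os.path.abspath(path)
    if not os.path.isfile(full):
        # Hypothetical message; the point is to show the resolved absolute path.
        raise FileNotFoundError(
            "teacher checkpoint not found: {} "
            "(train or download the teacher first)".format(full)
        )
    return full
```

Calling it on the --path_t value before loading would show which absolute path was actually looked up.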

In the two result tables, with WRN-40-2 as the teacher, the distilled students (CRD+KD) reach higher accuracy than the teacher itself. Why?

the introduction of ContrastMemory

Thanks for your excellent work!
I wonder how I can learn about the implementation of the memory bank in the paper. Is it the same as the memory bank in Kaiming's MoCo? I can't find a description of it in your paper.

Teacher/Student Parameter ratio

Hey @HobbitLong, nice paper. Do you have any guidelines on how effective the approach is as the number of parameters in the student model decreases relative to the teacher (same architecture), and what ratio is a sensible starting point?

Why using log_softmax instead of softmax?

I think it should be softmax instead. Otherwise p_t and p_s are not comparable.

Could you please explain why?

RepDistiller/distiller_zoo/KD.py

Lines 13 to 17 in dcc0432

def forward(self, y_s, y_t):
    p_s = F.log_softmax(y_s/self.T, dim=1)
    p_t = F.softmax(y_t/self.T, dim=1)
    loss = F.kl_div(p_s, p_t, size_average=False) * (self.T**2) / y_s.shape[0]
    return loss
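
On the log_softmax question: `F.kl_div` expects its first argument as log-probabilities and its second as probabilities, so the two tensors are deliberately in different spaces. A minimal plain-Python sketch of the same computation for a single sample, without PyTorch; the function names and the temperature value are illustrative assumptions, not taken from the repository:

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, T=1.0):
    """Temperature-scaled log-softmax computed via log-sum-exp."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def kd_loss(student_logits, teacher_logits, T=4.0):
    """KL(p_t || p_s) * T^2 for one sample, with the student given in log-space."""
    log_p_s = log_softmax(student_logits, T)
    p_t = softmax(teacher_logits, T)
    # Pointwise term p_t * (log p_t - log p_s): the student enters as a log-probability.
    kl = sum(pt * (math.log(pt) - lps) for pt, lps in zip(p_t, log_p_s) if pt > 0)
    return kl * T ** 2
```

The loss is zero when student and teacher logits match and positive otherwise, which is the behaviour expected of a KL divergence between the two distributions.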

Why is "opt.nce_k" set to 16384 for the CIFAR-100 dataset? How should I choose this value?

teacher model is too big to run with batch_size 64

When I try to train a teacher model on cub200 (200 classes) with resnet50 and a batch size of 64, it runs out of memory on a 16 GB GPU. It only runs when I set the batch size to 8. Why is resnet50 so big?

Failed to download the teacher models

Sorry! When I try to run fetch_pretrained_teachers.sh, I can't download the teacher models.
Is the server down?

Question on memory consumption for CRD loss when the dataset is very large

Hi,

Thank you for your great work, which helps me a lot.

I want to ask about the CRD contrast memory. In class ContrastMemory, two buffers are created as random tensors, each of shape (number of data, number of features). Assuming the number of features is 128, these buffers become really huge when training on a large dataset such as Glint 360k. I tried to use CRD for my face recognition project; there are 17091657 pictures in this dataset, which leads to excessive GPU memory use and leaves no room for training.

I wonder if you can tell me if I am understanding this part right, and if I am right, is there any solution for this problem? Thanks.
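
The size of the two buffers can be estimated with simple arithmetic. A short sketch, assuming 4 bytes per value (float32, the default dtype of `torch.rand`); the function name is hypothetical:

```python
def contrast_memory_bytes(num_samples, feat_dim, bytes_per_value=4, num_buffers=2):
    """Total bytes for the memory buffers of shape (num_samples, feat_dim)."""
    return num_samples * feat_dim * bytes_per_value * num_buffers


total = contrast_memory_bytes(17091657, 128)
print("{:.1f} GiB".format(total / 2 ** 30))  # prints "16.3 GiB"
```

Under these assumptions the two buffers alone take about 16.3 GiB, which matches the observation that nothing is left for training on a typical GPU.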

Question about pretrained teacher model

Thanks for your work and code!
I just fetched the pretrained teacher model(ResNet-110) from [http://shape2prog.csail.mit.edu/repo/resnet110_vanilla/ckpt_epoch_240.pth], and tested it with CIFAR-100. The accuracy is 70.27%, and in your paper the accuracy should be around 74%. Is there something wrong with this pretrained model?

About deep mutual learning setting

Thank you for your excellent code! I'm interested in "deep mutual learning setting". Can you tell me about some training details?

Can CRD be applied to cross-domain distillation?

Hi, I wonder if you have applied CRD or CMC to cross-domain distillation, for example VQA or different visual domains. Will it work?

how to train my model?

Hi, thanks for your wonderful work. I want to train my own model in order to reduce its size. How should I modify my model?

Ensemble Task Implementation

@HobbitLong Thank you very much for making the effort to clean and post your code for these benchmarks! I'm sure that you don't have time to post code for the ensemble distillation task, but I am going to try reproducing that benchmark so perhaps if there are any tricks or different hyperparameters settings that you can remember for that particular task off the top of your head then we can document them in this issue.

The calculation of correlation matrix

Hi, I am interested in the visualization of the correlation matrix in your paper. I would like to know how to calculate the correlation matrix of logits across the full CIFAR dataset. Thanks!

resnet structure seems to be a bit wrong

ResNet uses a 7x7 conv and a maxpool at the beginning, but this repo uses a 3x3 conv and no maxpool. Is there any reason for doing this?

Multiple GPU training

Hi, I'm new to knowledge distillation.
I wonder why there is no multi-GPU training in your code.
What is the reason, and is there any solution for this?
I would appreciate any response, thank you!

Cannot achieve the reported accuracy in paper

Thanks for your great work!

But I cannot achieve the accuracy reported in your paper. When the teacher and student have similar architectures, my accuracy is ~1% lower than your results. When the teacher and student have different architectures, KD and CRD perform even worse than a model trained from scratch.
The only change I've made is to wrap the model with nn.DataParallel and run on 8 GPUs. I enlarged the batch_size so that each GPU's batch size is the same as in your original single-GPU setting. I ran all the experiments with the hyperparameters (the several loss weights) from this repo and only changed the teacher and student architectures. I wonder whether DataParallel hurts the accuracy, or whether the hyperparameters have to be tuned carefully for each architecture.

Looking forward to your reply.:)

Regression task

Thank you for sharing!
In your paper, the CRD method seems suited to classification tasks. Is it also compatible with regression tasks?

Problem of the order of the normalization in Similarity-Preserving loss.

In the paper for the Similarity-Preserving loss, the normalization comes before the matrix multiplication. Does the order affect performance?

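import torch
org_f_s = torch.rand((64, 96))
org_f_t = torch.rand((64, 96))

# Take the batch size from the original tensor (f_s is not defined yet at this point)
bsz = org_f_s.shape[0]
f_s = org_f_s.view(bsz, -1)
f_t = org_f_t.view(bsz, -1)

# Variant 1: build the Gram matrix first, then L2-normalize each of its rows
G_s = torch.mm(f_s, torch.t(f_s))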
# G_s = G_s / G_s.norm(2)
G_s = torch.nn.functional.normalize(G_s)
G_t = torch.mm(f_t, torch.t(f_t))
# G_t = G_t / G_t.norm(2)
G_t = torch.nn.functional.normalize(G_t)

G_diff = G_t - G_s
loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
print(loss)

# Variant 2: L2-normalize each feature row first, then build the Gram matrix
f_s = org_f_s.view(bsz, -1)
f_t = org_f_t.view(bsz, -1)

f_s = torch.nn.functional.normalize(f_s)
G_s = torch.mm(f_s, torch.t(f_s))
# G_s = G_s / G_s.norm(2)

f_t = torch.nn.functional.normalize(f_t)
G_t = torch.mm(f_t, torch.t(f_t))
# G_t = G_t / G_t.norm(2)
G_diff = G_t - G_s
loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
print(loss)

KD method in both configurations seems to be doing better than all other methods except the one from your paper

Hi,

Maybe I am not correctly understanding these below-mentioned tables (copied from your README at the root of the repo) but it seems that for almost all the configurations KD worked better than all the methods except the one proposed by you.

The purpose of all these methods originally was to improve upon KD (Hinton et al) so I am very surprised by this table.

Please guide

Regards & thanks
Kapil

Teacher and student are of the same architectural type.

Teacher / Student	wrn-40-2 / wrn-16-2	wrn-40-2 / wrn-40-1	resnet56 / resnet20	resnet110 / resnet20	resnet110 / resnet32	resnet32x4 / resnet8x4	vgg13 / vgg8
Teacher acc. / Student acc.	75.61 / 73.26	75.61 / 71.98	72.34 / 69.06	74.31 / 69.06	74.31 / 71.14	79.42 / 72.50	74.64 / 70.36
KD	74.92	73.54	70.66	70.67	73.08	73.33	72.98
FitNet	73.58	72.24	69.21	68.99	71.06	73.50	71.02
AT	74.08	72.77	70.55	70.22	72.31	73.44	71.43
SP	73.83	72.43	69.67	70.04	72.69	72.94	72.68
CC	73.56	72.21	69.63	69.48	71.48	72.97	70.71
VID	74.11	73.30	70.38	70.16	72.61	73.09	71.23
RKD	73.35	72.22	69.61	69.25	71.82	71.90	71.48
PKT	74.54	73.45	70.34	70.25	72.61	73.64	72.88
AB	72.50	72.38	69.47	69.53	70.98	73.17	70.94
FT	73.25	71.59	69.84	70.22	72.37	72.86	70.58
FSP	72.91	0.00	69.95	70.11	71.89	72.62	70.23
NST	73.68	72.24	69.60	69.53	71.96	73.30	71.53
CRD	75.48	74.14	71.16	71.46	73.48	75.51	73.94

Teacher and student are of different architectural type.

Teacher / Student	vgg13 / MobileNetV2	ResNet50 / MobileNetV2	ResNet50 / vgg8	resnet32x4 / ShuffleNetV1	resnet32x4 / ShuffleNetV2	wrn-40-2 / ShuffleNetV1
Teacher acc. / Student acc.	74.64 / 64.60	79.34 / 64.60	79.34 / 70.36	79.42 / 70.50	79.42 / 71.82	75.61 / 70.50
KD	67.37	67.35	73.81	74.07	74.45	74.83
FitNet	64.14	63.16	70.69	73.59	73.54	73.73
AT	59.40	58.58	71.84	71.73	72.73	73.32
SP	66.30	68.08	73.34	73.48	74.56	74.52
CC	64.86	65.43	70.25	71.14	71.29	71.38
VID	65.56	67.57	70.30	73.38	73.40	73.61
RKD	64.52	64.43	71.50	72.28	73.21	72.21
PKT	67.13	66.52	73.01	74.10	74.69	73.89
AB	66.06	67.20	70.65	73.55	74.31	73.34
FT	61.78	60.99	70.29	71.75	72.50	72.03
NST	58.16	64.96	71.28	74.12	74.68	74.89
CRD	69.73	69.11	74.30	75.11	75.65	76.05

questions about ContrastMemory

Hi, according to Eq.19 in the paper, linear transform gT and gS are conducted on the teacher and student, respectively, i.e., gT(t), gS(s).

But in your code, the teacher transform gT is applied to the student feature, gT(s), and the student transform gS is applied to the teacher feature, gS(t):
out_v2 = torch.bmm(weight_v1, v2.view(batchSize, inputSize, 1))
out_v2 = torch.exp(torch.div(out_v2, T))
out_v1 = torch.bmm(weight_v2, v1.view(batchSize, inputSize, 1))
out_v1 = torch.exp(torch.div(out_v1, T))

and thus your contrast loss becomes the sum ContrastLoss(out_v1) + ContrastLoss(out_v2).

I wonder why you did this instead of computing the output as in Eq.19, gT(t)*gS(s)/t, followed by ContrastLoss(out).

Thanks.

Reported results based on early stopping?

Thanks for sharing this repo.

I noticed that you store the best model based on test accuracy.
I wonder whether the published results are also based on the best

How can I use CRD_loss for face landmark detection in model compression? There is no "opt.nce_k: number of negatives paired with each positive".

code for ensemble distillation

Nice work! I was wondering if you are planning to release code for ensemble distillation?

How to train teacher model

When I try to train a resnet50 teacher model on a new dataset (cub200), the backbone is different from the original resnet50. On the one hand, it is too big to use a large batch size (8 is OK on a 16 GB GPU); on the other hand, acc1 is only 10% after training for 300 epochs. Why?

Training scheme for linear probe on STL10 and TinyImagenet

Hello,

I was wondering if you could provide the exact strategy and optimization details for the transferability of representations experiment (Table 4) in the paper.

The sampler is not consistent with the original implementation of CCKD

Hi, why is the sampler (CUR or SUR) not consistent with the original implementation of CCKD (Correlation Congruence for Knowledge Distillation)?
I would also like to know what delta[:-1] * delta[1:] denotes.
Thanks!

Selection of teacher

Hi @HobbitLong
It is great work, I really like it. I have one question about the selection of the teacher model. Previous classification papers usually use architectures such as vgg16, vgg19, resnet18, resnet34, resnet50, resnet101 and so on. However, most of the teacher models you use are different. Can you explain the rule by which you select the teacher model? Is your method sensitive to the architecture? Please forgive me if I missed something.

The setting of Z_v1 and Z_v2 in class ContrastMemory?

Thanks for your great work and great code!

When I read your code of class "ContrastMemory" in "memory.py", I cannot find any description of "Z_v1" and "Z_v2" in your arXiv preprint. I want to know why "out_v1" should be divided by "Z_v1". If "outputSize" is big, then "out_v1" may become very small. Since "outputSize" differs a lot between datasets, will it influence the value of "out_v1" too much, and even affect the performance of the student network?

Looking forward to your reply. @HobbitLong

Here is the related code:

        # set Z if haven't been set yet
        if Z_v1 < 0:
            self.params[2] = out_v1.mean() * outputSize
            Z_v1 = self.params[2].clone().detach().item()
            print("normalization constant Z_v1 is set to {:.1f}".format(Z_v1))
        if Z_v2 < 0:
            self.params[3] = out_v2.mean() * outputSize
            Z_v2 = self.params[3].clone().detach().item()
            print("normalization constant Z_v2 is set to {:.1f}".format(Z_v2))

        # compute out_v1, out_v2
        out_v1 = torch.div(out_v1, Z_v1).contiguous()
        out_v2 = torch.div(out_v2, Z_v2).contiguous()
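
One way to read the snippet above: Z is estimated once as mean(out) * outputSize, so after division the scores of that batch average to 1 / outputSize, i.e. they behave roughly like probabilities over outputSize entries. Small values for a large outputSize are therefore expected rather than a sign of a bug. A plain-Python sketch with made-up scores; `estimate_z` and all numbers are hypothetical:

```python
def estimate_z(scores, output_size):
    """Estimate the normalization constant as (mean score) * output_size."""
    return sum(scores) / len(scores) * output_size


scores = [0.5, 1.5, 2.0, 4.0]  # hypothetical exp(similarity / T) values
output_size = 50000            # assumed number of entries in the memory
z = estimate_z(scores, output_size)
normalized = [s / z for s in scores]

# After normalization the mean score equals 1 / output_size.
mean_normalized = sum(normalized) / len(normalized)
```

Whether a constant estimated from a single batch is accurate enough for a given dataset is a separate question that this sketch does not answer.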

Cross modal KD implementation release?

Thank you for this great work! It's awesome.

I wonder will you release the implementation of cross-modal KD in the future? Thanks!

Memory issue about the NST LOSS

Hi:
Thank you for your code. I am trying to reimplement your benchmark. When running the NST loss, the Gram matrix is too large: even on a machine with 8 * P100 (32GB) GPUs it still runs out of memory. Could you please tell me whether you use some memory trick? Thank you.

Form of the h function for infinite dataset

Thanks for the great code and paper!

I have a question regarding the form of the h function. I have a huge dataset, so it is impossible to store all embeddings in memory; I therefore decided to increase the batch size and mine negatives from the batch.
So far so good, but as I understand it, because the dataset is so large the numerator almost equals the denominator and h approaches 1.

Do you think that it's a good idea to replace h with an angular similarity between embeddings instead of the ratio proposed in the paper? Or maybe you could kindly propose some other appropriate choice for h?

Hyper-parameters for reproducing the results on ImageNet

This is a great work. However, when I try to reproduce results on the ImageNet dataset, there is a 1% accuracy gap between mine and that in your paper.

Would you mind providing the hyper-parameters for training on ImageNet?

Here is mine:
-r 1.0
-a 0.0
-b 0.8
--trial 1
--weight_decay 0.0001
--learning_rate 0.1
--epochs 100
--lr_decay_epochs 30,60,90
--print_freq 500
--batch_size 256

Eight 1080Ti GPUs are used.
Thanks!
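
For reference, the step schedule implied by --learning_rate 0.1 and --lr_decay_epochs 30,60,90 can be sketched as follows. The decay factor of 0.1 and the convention that the rate drops on the epoch after each boundary are assumptions, since neither is stated in the settings above; `step_lr` is a hypothetical helper:

```python
def step_lr(base_lr, epoch, decay_epochs, decay_rate=0.1):
    """Learning rate at a given 1-indexed epoch under a step schedule."""
    # Count how many decay boundaries have already been passed.
    steps = sum(1 for boundary in decay_epochs if epoch > boundary)
    return base_lr * decay_rate ** steps


for epoch in (1, 31, 61, 91):
    print(epoch, step_lr(0.1, epoch, [30, 60, 90]))
```

Under these assumptions the rate is 0.1 for epochs 1 to 30 and is multiplied by 0.1 after each of the three boundaries.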

the implementation of cckd is not consistent with the paper

。

How do you choose the optimal hyper-parameters?

There are several hyper parameters existing:

teacher model hyper parameters
student model hyper parameters
KD hyper parameters (e.g., balance weight for different losses)
Training hyper parameters (e.g., learning rate)

It is hard to enumerate for every combination, because it may explode. How do you find the best (or suboptimal) hyper parameter?

Thanks!

Results on ImageNet

Thanks for your great work.

When I run experiments on ImageNet, I use the same training hyper-parameters as the official PyTorch examples. The initial learning rate is 0.1 and it is decayed at epochs 30 and 60. I find that in the first two stages, i.e. epochs 1-30 and 31-60, standard KD has a higher accuracy than the student trained from scratch. But in the third stage (epochs 61-90), KD's accuracy is lower than that of the student trained from scratch. This phenomenon is exactly the same as in Figure 3 of this paper.

In your work, KD's top1 accuracy is 0.9 points better than student trained from scratch. I wonder if there are some special processes such as training scheme or hyper-parameters that are different from that in PyTorch official examples. It would be much better if you can provide your code for ImageNet.

Thanks in advance!

AttributeError: 'CIFAR100Instance' object has no attribute 'train_data'

About the CE loss

Thanks for sharing. Did all of the distillation experiments use the CE loss? I have a question about this training strategy. First, the well-trained teacher model has fixed parameters; then a linear dimension-transfer layer is added before the last classification layer of the teacher and student models respectively, and this layer is trainable like the student model. But the CE loss is applied after the last layer of the original teacher and student models, so it has no relationship with the linear dimension-transfer layer. This seems a little strange to me: the linear transfer layer has no connection with the final classification task, so how can this layer learn? Another question: if the penultimate layers of my student and teacher models have the same dimension, can I drop the linear dimension-transfer layer?
Thanks very much for your reply.

AttributeError: 'CIFAR100Instance' object has no attribute 'train_data'

Thanks for your impressive work!

I try to reproduce the results from your paper, but when I test the code (i.e. cmd python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet8x4 -r 0.1 -a 0.9 -b 0 --trial 1), it shows:

File "~/work/RepDistiller/dataset/cifar100.py", line 44, in __getitem__
img, target = self.train_data[index], self.train_labels[index]
AttributeError: 'CIFAR100Instance' object has no attribute 'train_data'

By the way, my environment: Ubuntu 18.01, Python 3.6, PyTorch 1.2, torchvision 0.4.1.
I also tested on PyTorch 0.4.0 and hit the same issue.

Could you please help me with this?

Thanks
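
This error is consistent with newer torchvision releases, in which the CIFAR datasets store samples in `data` and labels in `targets` instead of `train_data` and `train_labels`. A hedged sketch of a lookup that tolerates both layouts; the helper `get_samples_and_labels` and the two stand-in classes are hypothetical and only mimic the attribute names:

```python
def get_samples_and_labels(dataset):
    """Return (samples, labels) from either the new or the old attribute layout."""
    if hasattr(dataset, "data") and hasattr(dataset, "targets"):
        return dataset.data, dataset.targets           # newer attribute names
    return dataset.train_data, dataset.train_labels    # older attribute names


class NewStyleDataset:
    """Stand-in with the newer attribute names."""
    data = ["img0", "img1"]
    targets = [3, 7]


class OldStyleDataset:
    """Stand-in with the older attribute names."""
    train_data = ["img0", "img1"]
    train_labels = [3, 7]
```

In the dataset wrapper, the same idea would amount to reading `self.data[index]` and `self.targets[index]` when those attributes exist.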

AttributeError: 'CIFAR100InstanceSample' object has no attribute 'train_data'

I ran your code: python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill crd --model_s resnet8x4 -a 0 -b 0.8 --trial 1
but encountered an error:

....../RepDistiller/dataset/cifar100.py", line 124, in __init__
num_samples = len(self.train_data)

AttributeError: 'CIFAR100InstanceSample' object has no attribute 'train_data'

ImageNet results

Hi, @HobbitLong @erjanmx ,

Thank you for the very interesting paper!

Is it correct that the code can reproduce only the results for CIFAR? If so, is it possible that you release code or scripts for ImageNet experiments too? I mean even if the results differ a bit, scripts and running commands will be very helpful.

Best,
Arsenii

about using the resnet models for cifar10

Hello Sir,
I would like to start by saying how great this work is!
And I would like to know if the resnet models can be used directly on cifar10 instead of cifar100 or if there is some modification I should make?
I tried using KD (logits) to make resnet32 learn from resnet110 but unfortunately, the accuracy I get is pretty bad:
Acc= 43.35 after 250 epochs, a=0.1, b= 0.9 with no modification in the resnet models nor in function adjust_learning_rate.

Questions about ContrastMemory

@HobbitLong
Thanks for your excellent work!
I'm wondering why you register two memories here:
self.register_buffer('memory_v1', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))
self.register_buffer('memory_v2', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))
since only the student network is trained while the teacher is fixed.

crd used in image enhancement task like Denoise\SR\Deblur.

Hi, thanks for your great works. I wonder whether the CRD distillation loss could be used in image enhancement tasks like Denoise\SR\Deblur. I would appreciate it if someone could answer this questions.

A question for Experimental result

Thank you for sharing the benchmark.

Your experimental results cover many teacher and student pairs.
In particular, for the KD method (Distilling the Knowledge in a Neural Network), the optimal setting (e.g. temperature) may differ for each pair.
Does performance change a lot with the temperature?

The same issue may apply to other methods as well as KD. What do you think about this?

hyperparameters for other methods

Hi, I want to know about the hyperparameters that were used to train the other methods.
Are these methods well trained?

Question about normalization constant Z_v1 and Z_v2 in the ContrastMemory

What an excellent work! I have learned so much from this paper.
But I have a question about normalization constant Z_v1 and Z_v2 in the ContrastMemory.
Z_v1 and Z_v2 are initialized to -1, and Z_v1/Z_v2 are calculated based on out_v1/out_v2.
As shown in the following code, Z_v1/Z_v2 are updated only if Z_v1<0 and Z_v2<0, so they will only be set in the first batch.

if Z_v1 < 0:  
    self.params[2] = out_v1.mean() * outputSize  
    Z_v1 = self.params[2].clone().detach().item()  
    print("normalization constant Z_v1 is set to {:.1f}".format(Z_v1))  
if Z_v2 < 0:  
    self.params[3] = out_v2.mean() * outputSize  
    Z_v2 = self.params[3].clone().detach().item()  
    print("normalization constant Z_v2 is set to {:.1f}".format(Z_v2))

But I think Z_v1 and Z_v2 should be updated in every batch.
What is your opinion on this, and what are the benefits of this design?

what is the difference between the position of putting "with torch.no_grad()"

Hi,
Thank you for your contribution. The code is very useful for me, but I want to ask a question about this code:

 with torch.no_grad():
        l_pos = torch.index_select(self.memory_v1, 0, y.view(-1))
        l_pos.mul_(momentum)
        l_pos.add_(torch.mul(v1, 1 - momentum))
        l_norm = l_pos.pow(2).sum(1, keepdim=True).pow(0.5)
        updated_v1 = l_pos.div(l_norm)
        self.memory_v1.index_copy_(0, y, updated_v1)
        ab_pos = torch.index_select(self.memory_v2, 0, y.view(-1))
        ab_pos.mul_(momentum)
        ab_pos.add_(torch.mul(v2, 1 - momentum))
        ab_norm = ab_pos.pow(2).sum(1, keepdim=True).pow(0.5)
        updated_v2 = ab_pos.div(ab_norm)
        self.memory_v2.index_copy_(0, y, updated_v2)



In your implementation of the paper, you compute the loss terms first and then update the memory.
If I update the memory first and then compute the loss terms, what is the difference between these two approaches?

Looking forward to your reply! Thank you!
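
For context, the update in the snippet blends each stored entry with the new feature and rescales the result to unit length. A plain-Python sketch of that step for a single vector; the function name and the momentum value are illustrative assumptions:

```python
import math


def momentum_update(old, new, momentum=0.5):
    """Blend the stored vector with the new feature, then L2-normalize."""
    mixed = [momentum * o + (1.0 - momentum) * n for o, n in zip(old, new)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]


updated = momentum_update([1.0, 0.0], [0.0, 1.0], momentum=0.5)
```

If the loss reads its entries from the memory, then updating first would presumably mean the entry for the current sample already contains the current feature when the loss is computed, whereas updating afterwards compares against the entry from earlier iterations.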
