HobbitLong / RepDistiller
[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods
License: BSD 2-Clause "Simplified" License
Test
Hi, I am getting this error while running your code -- any suggestion?
$ python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet8x4 -r 0.1 -a 0.9 -b 0 --trial 1
2021-10-20 11:38:21.416626: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
Files already downloaded and verified
Files already downloaded and verified
==> loading teacher model
Traceback (most recent call last):
  File "train_student.py", line 347, in <module>
    main()
  File "train_student.py", line 167, in main
    model_t = load_teacher(opt.path_t, n_cls)
  File "train_student.py", line 138, in load_teacher
    model.load_state_dict(torch.load(model_path)['model'])
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 594, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "/home/frestuc/.conda/envs/py37/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
    super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: './save/models/resnet32x4_vanilla/ckpt_epoch_240.pth'
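The traceback ends in a plain FileNotFoundError, so the path given to --path_t does not exist on disk; the teacher checkpoint has to be trained or downloaded before train_student.py can load it. A minimal sketch of a pre-flight check follows. The helper name missing_checkpoint_hint is hypothetical and not part of the repository.

```python
import os


def missing_checkpoint_hint(path_t):
    """Return None if the teacher checkpoint exists, else a readable hint.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    if os.path.isfile(path_t):
        return None
    return (
        f"Teacher checkpoint not found: {path_t!r}. "
        "Train the teacher first or download a pretrained one, "
        "then pass its location via --path_t."
    )


if __name__ == "__main__":
    hint = missing_checkpoint_hint(
        "./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth"
    )
    print(hint or "checkpoint found")
```

Running a check like this before the training script makes the failure explicit instead of surfacing deep inside torch.load.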
Thanks for your excellent work!
I wonder how I can learn about the implementation of the memory bank in the paper. Is it the same as the memory bank in Kaiming's MoCo? I can't find a description of it in your paper.
Hey @HobbitLong, nice paper. Do you have any guidelines on how effective the approach is as the number of parameters in the student model decreases relative to the teacher (same architecture), and what ratio would be a sensible starting point?
I think it should be softmax instead. Otherwise p_t and p_s are not comparable.
Could you please explain why?
RepDistiller/distiller_zoo/KD.py
Lines 13 to 17 in dcc0432
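The referenced lines are not reproduced here, so the following is an assumption: if they apply log_softmax to the student logits and softmax to the teacher logits before calling torch.nn.functional.kl_div, the asymmetry is expected, because kl_div takes its first argument in log space and, by default, its target as probabilities. A plain-Python sketch of the same quantity for one sample is below; the function names and the default T=4.0 are illustrative choices.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, T=1.0):
    """Temperature-scaled log-softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - log_total for z in scaled]


def kd_kl(student_logits, teacher_logits, T=4.0):
    """KL(p_t || p_s) scaled by T^2 for a single sample."""
    log_p_s = log_softmax(student_logits, T)  # student enters in log space
    p_t = softmax(teacher_logits, T)          # teacher enters as probabilities
    return (T ** 2) * sum(
        pt * (math.log(pt) - lps) for pt, lps in zip(p_t, log_p_s) if pt > 0
    )
```

Both distributions are still compared on the same footing: the log of the student appears only because the KL sum itself contains log p_s.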
When I try to train a teacher model on cub200 (200 classes) with resnet50 and a batch size of 64, it runs out of memory on a 16 GB GPU. It only runs when I set the batch size to 8. Why does resnet50 need so much memory?
Sorry! When I run fetch_pretrained_teachers.sh, I can't download the teacher model.
Is the server down?
Hi,
Thank you for your great work, which helps me a lot.
I want to ask about the CRD contrast memory. In class ContrastMemory, two buffers are created as random tensors, each of shape (number of data, number of features). Assuming the number of features is 128, these buffers become huge when training on a large dataset such as Glint360K. I tried to use CRD for my face recognition project, and this dataset has 17091657 pictures, which exhausts the GPU memory and leaves no room for training.
I wonder if you can tell me whether I am understanding this part correctly and, if so, whether there is any solution to this problem. Thanks.
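A quick back-of-the-envelope check of the numbers in this question, assuming the two buffers hold 32-bit floats (the default dtype of torch.rand). The function name is hypothetical.

```python
def contrast_memory_bytes(num_samples, feat_dim=128, bytes_per_value=4, num_buffers=2):
    """Size of the memory-bank buffers, assuming float32 entries."""
    return num_buffers * num_samples * feat_dim * bytes_per_value


if __name__ == "__main__":
    total = contrast_memory_bytes(17_091_657)
    print(f"{total} bytes = {total / 2**30:.2f} GiB")
```

Under these assumptions the two buffers alone take about 16.3 GiB, which matches the observation that a 16 GB GPU has no room left for training.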
Thanks for your work and code!
I just fetched the pretrained teacher model (ResNet-110) from http://shape2prog.csail.mit.edu/repo/resnet110_vanilla/ckpt_epoch_240.pth and tested it on CIFAR-100. The accuracy is 70.27%, whereas in your paper it should be around 74%. Is there something wrong with this pretrained model?
Thank you for your excellent code! I'm interested in the "deep mutual learning" setting. Can you share some training details?
Hi, I wonder if you have applied CRD or CMC to cross-domain distillation, for example VQA or different visual domains. Will it work?
Hi, thanks for your wonderful work. I want to train my model to reduce its size. How should I modify my model?
@HobbitLong Thank you very much for making the effort to clean and post your code for these benchmarks! I'm sure that you don't have time to post code for the ensemble distillation task, but I am going to try reproducing that benchmark so perhaps if there are any tricks or different hyperparameters settings that you can remember for that particular task off the top of your head then we can document them in this issue.
Hi, I am interested in the visualization of the correlation matrix in your paper. How do you calculate the correlation matrix of logits across the full CIFAR dataset? Thanks!
ResNet uses a 7x7 conv and maxpool at the beginning, but this repo uses a 3x3 conv and no maxpool. Is there any reason for this?
Hi, I'm a newcomer to knowledge distillation.
I wonder why there is no multi-GPU training in your code.
What is the reason, and is there any solution?
I would appreciate any response, thank you!
Thanks for your great work!
But I cannot reach the accuracy reported in your paper. When the teacher and student have similar architectures, my accuracy is ~1% lower than your results. When they have different architectures, KD and CRD perform even worse than the model trained from scratch.
The only change I've made is to wrap the model with nn.DataParallel and run on 8 GPUs. I enlarged batch_size so that each GPU's batch size matches your original single-GPU setting. I ran all the experiments with the hyperparameters (several loss weights) in this repo and only changed the teacher and student architectures. I wonder whether DataParallel hurts the accuracy or whether the hyperparameters have to be tuned carefully for each architecture.
Looking forward to your reply. :)
Thank you for sharing!
In your paper, the CRD method seems suited to classification tasks. Is it compatible with regression tasks?
In the paper, for the Similarity-Preserving loss, the normalization is applied before the matrix multiplication. Does the order affect performance?
import torch

org_f_s = torch.rand((64, 96))
org_f_t = torch.rand((64, 96))
bsz = org_f_s.shape[0]  # was f_s.shape[0], but f_s is not defined at this point

# Variant 1: normalize the similarity matrices after the matrix multiplication
f_s = org_f_s.view(bsz, -1)
f_t = org_f_t.view(bsz, -1)
G_s = torch.mm(f_s, torch.t(f_s))
# G_s = G_s / G_s.norm(2)
G_s = torch.nn.functional.normalize(G_s)
G_t = torch.mm(f_t, torch.t(f_t))
# G_t = G_t / G_t.norm(2)
G_t = torch.nn.functional.normalize(G_t)
G_diff = G_t - G_s
loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
print(loss)

# Variant 2: normalize the features before the matrix multiplication
f_s = org_f_s.view(bsz, -1)
f_t = org_f_t.view(bsz, -1)
f_s = torch.nn.functional.normalize(f_s)
G_s = torch.mm(f_s, torch.t(f_s))
# G_s = G_s / G_s.norm(2)
f_t = torch.nn.functional.normalize(f_t)
G_t = torch.mm(f_t, torch.t(f_t))
# G_t = G_t / G_t.norm(2)
G_diff = G_t - G_s
loss = (G_diff * G_diff).view(-1, 1).sum(0) / (bsz * bsz)
print(loss)
Hi,
Maybe I am not correctly understanding these below-mentioned tables (copied from your README at the root of the repo) but it seems that for almost all the configurations KD worked better than all the methods except the one proposed by you.
The purpose of all these methods originally was to improve upon KD (Hinton et al) so I am very surprised by this table.
Please guide
Regards & thanks
Kapil
Teacher / Student | wrn-40-2 / wrn-16-2 | wrn-40-2 / wrn-40-1 | resnet56 / resnet20 | resnet110 / resnet20 | resnet110 / resnet32 | resnet32x4 / resnet8x4 | vgg13 / vgg8 |
---|---|---|---|---|---|---|---|
Teacher / Student | 75.61 / 73.26 | 75.61 / 71.98 | 72.34 / 69.06 | 74.31 / 69.06 | 74.31 / 71.14 | 79.42 / 72.50 | 74.64 / 70.36 |
KD | 74.92 | 73.54 | 70.66 | 70.67 | 73.08 | 73.33 | 72.98 |
FitNet | 73.58 | 72.24 | 69.21 | 68.99 | 71.06 | 73.50 | 71.02 |
AT | 74.08 | 72.77 | 70.55 | 70.22 | 72.31 | 73.44 | 71.43 |
SP | 73.83 | 72.43 | 69.67 | 70.04 | 72.69 | 72.94 | 72.68 |
CC | 73.56 | 72.21 | 69.63 | 69.48 | 71.48 | 72.97 | 70.71 |
VID | 74.11 | 73.30 | 70.38 | 70.16 | 72.61 | 73.09 | 71.23 |
RKD | 73.35 | 72.22 | 69.61 | 69.25 | 71.82 | 71.90 | 71.48 |
PKT | 74.54 | 73.45 | 70.34 | 70.25 | 72.61 | 73.64 | 72.88 |
AB | 72.50 | 72.38 | 69.47 | 69.53 | 70.98 | 73.17 | 70.94 |
FT | 73.25 | 71.59 | 69.84 | 70.22 | 72.37 | 72.86 | 70.58 |
FSP | 72.91 | 0.00 | 69.95 | 70.11 | 71.89 | 72.62 | 70.23 |
NST | 73.68 | 72.24 | 69.60 | 69.53 | 71.96 | 73.30 | 71.53 |
CRD | 75.48 | 74.14 | 71.16 | 71.46 | 73.48 | 75.51 | 73.94 |
Teacher / Student | vgg13 / MobileNetV2 | ResNet50 / MobileNetV2 | ResNet50 / vgg8 | resnet32x4 / ShuffleNetV1 | resnet32x4 / ShuffleNetV2 | wrn-40-2 / ShuffleNetV1 |
---|---|---|---|---|---|---|
Teacher / Student | 74.64 / 64.60 | 79.34 / 64.60 | 79.34 / 70.36 | 79.42 / 70.50 | 79.42 / 71.82 | 75.61 / 70.50 |
KD | 67.37 | 67.35 | 73.81 | 74.07 | 74.45 | 74.83 |
FitNet | 64.14 | 63.16 | 70.69 | 73.59 | 73.54 | 73.73 |
AT | 59.40 | 58.58 | 71.84 | 71.73 | 72.73 | 73.32 |
SP | 66.30 | 68.08 | 73.34 | 73.48 | 74.56 | 74.52 |
CC | 64.86 | 65.43 | 70.25 | 71.14 | 71.29 | 71.38 |
VID | 65.56 | 67.57 | 70.30 | 73.38 | 73.40 | 73.61 |
RKD | 64.52 | 64.43 | 71.50 | 72.28 | 73.21 | 72.21 |
PKT | 67.13 | 66.52 | 73.01 | 74.10 | 74.69 | 73.89 |
AB | 66.06 | 67.20 | 70.65 | 73.55 | 74.31 | 73.34 |
FT | 61.78 | 60.99 | 70.29 | 71.75 | 72.50 | 72.03 |
NST | 58.16 | 64.96 | 71.28 | 74.12 | 74.68 | 74.89 |
CRD | 69.73 | 69.11 | 74.30 | 75.11 | 75.65 | 76.05 |
Hi, according to Eq.19 in the paper, the linear transforms gT and gS are applied to the teacher and student, respectively, i.e., gT(t), gS(s).
But in your code, the teacher transform gT is applied to the student feature, gT(s), and the student transform gS is applied to the teacher feature, gS(t), like
out_v2 = torch.bmm(weight_v1, v2.view(batchSize, inputSize, 1))
out_v2 = torch.exp(torch.div(out_v2, T))
out_v1 = torch.bmm(weight_v2, v1.view(batchSize, inputSize, 1))
out_v1 = torch.exp(torch.div(out_v1, T))
and thus your contrast loss becomes the sum ContrastLoss(out_v1) + ContrastLoss(out_v2).
I wonder why you did this instead of calculating the output as in Eq.19 by gT(t)*gS(s)/t and ContrastLoss(out).
Thanks.
Thanks for sharing this repo.
I noticed that you store the best model based on test accuracy.
I wonder whether the published results are also based on the best
Nice work! I was wondering if you are planning to release code for ensemble distillation?
When I try to train a resnet50 teacher model on a new dataset (cub200), the backbone is different from the original resnet50. On the one hand, it is too big for a large batch size (8 is OK on a 16 GB GPU); on the other hand, acc1 is very low, about 10%, after training for 300 epochs. Why?
Hello,
I was wondering if you could provide the exact strategy and optimization details for the transferability of representations experiment (Table 4) in the paper.
Hi, why is the sampler (CUR or SUR) not consistent with the original implementation of CCKD (Correlation Congruence for Knowledge Distillation)?
Also, what does delta[:-1] * delta[1:] denote?
Thanks!
Hi @HobbitLong
It is great work, I really like it. I have one question about the selection of teacher models. In previous classification papers, researchers usually use architectures such as vgg16, vgg19, resnet18, resnet34, resnet50, resnet101 and so on. However, most of the teacher models you use are different. Can you explain how you select the teacher models? Is your method sensitive to the architecture? Please forgive me if I missed something.
Thanks for your great work and great code!
When I read the code of class "ContrastMemory" in "memory.py", I could not find any explanation of "Z_v1" and "Z_v2" in your arXiv preprint. Why is "out_v1" divided by "Z_v1"? If "outputSize" is large, "out_v1" may become very small. Since "outputSize" differs a lot between datasets, will it influence the value of "out_v1" too much, and even affect the performance of the student network?
Looking forward to your reply. @HobbitLong
Here is the related code:
# set Z if haven't been set yet
if Z_v1 < 0:
    self.params[2] = out_v1.mean() * outputSize
    Z_v1 = self.params[2].clone().detach().item()
    print("normalization constant Z_v1 is set to {:.1f}".format(Z_v1))
if Z_v2 < 0:
    self.params[3] = out_v2.mean() * outputSize
    Z_v2 = self.params[3].clone().detach().item()
    print("normalization constant Z_v2 is set to {:.1f}".format(Z_v2))
# compute out_v1, out_v2
out_v1 = torch.div(out_v1, Z_v1).contiguous()
out_v2 = torch.div(out_v2, Z_v2).contiguous()
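One way to read the fragment above (an interpretation, not confirmed by the author): out_v1.mean() * outputSize scales the average sampled score up to all outputSize memory entries, giving a rough estimate of the sum of scores over the whole memory. Dividing by it then puts each score on the scale of a probability. A plain-Python sketch of that arithmetic, with hypothetical function names:

```python
def estimate_z(sampled_scores, output_size):
    """Estimate the sum of scores over all entries from a sample of them."""
    mean_score = sum(sampled_scores) / len(sampled_scores)
    return mean_score * output_size


def normalize(sampled_scores, z):
    """Divide each score by the normalization constant."""
    return [s / z for s in sampled_scores]


if __name__ == "__main__":
    scores = [2.0, 2.0, 2.0, 2.0]          # pretend exp(similarity / T) values
    z = estimate_z(scores, output_size=10)  # 2.0 * 10 entries
    print(z, normalize(scores, z))
```

Under this reading, a larger outputSize does shrink each normalized score, but roughly in the way a probability over more entries would shrink.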
Thank you for this great work! It's awesome.
I wonder will you release the implementation of cross-modal KD in the future? Thanks!
Hi:
Thank you for your code. I am trying to reimplement your benchmark. When running the NST loss, the Gram matrix is too large: even on an 8 * P100 (32GB) machine it still runs out of memory. Could you please tell me whether you use some memory trick? Thank you.
Thanks for the great code and paper!
I have a question regarding the form of the h function. I have a huge dataset, so it's impossible to store all embeddings in memory, and I decided to increase the batch size and mine negatives from it.
So far so good, but from my understanding, due to the large dataset size the numerator almost equals the denominator and h approaches 1.
Do you think it's a good idea to replace h with an angular similarity between embeddings instead of the ratio proposed in the paper? Or could you kindly propose some other appropriate choice for h?
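To make the saturation concern concrete, here is a small sketch that assumes the critic has the form h = e^(s/tau) / (e^(s/tau) + N/M), where s is the teacher-student similarity, N the number of negatives and M the dataset size. This form is my reading of the paper and should be checked against it; the numeric values are illustrative only.

```python
import math


def critic_h(score, tau, num_negatives, dataset_size):
    """Assumed critic form: e^(s/tau) / (e^(s/tau) + N/M)."""
    e = math.exp(score / tau)
    return e / (e + num_negatives / dataset_size)


if __name__ == "__main__":
    # Same similarity and temperature, growing dataset size M.
    for m in (50_000, 1_000_000, 17_091_657):
        print(m, critic_h(score=0.5, tau=0.1, num_negatives=16_384, dataset_size=m))
```

With N fixed, the ratio N/M shrinks as M grows, so under this assumed form h moves toward 1 for any given similarity, which is the effect described in the question.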
This is great work. However, when I try to reproduce the results on the ImageNet dataset, there is a 1% accuracy gap between my result and the one in your paper.
Would you mind providing the hyper-parameters for training on ImageNet?
Here is mine:
-r 1.0
-a 0.0
-b 0.8
--trial 1
--weight_decay 0.0001
--learning_rate 0.1
--epochs 100
--lr_decay_epochs 30,60,90
--print_freq 500
--batch_size 256
Eight 1080Ti GPUs are used.
Thanks!
There are several hyperparameters:
It is hard to try every combination, because the number of combinations explodes. How do you find the best (or a near-optimal) hyperparameter setting?
Thanks!
Thanks for your great work.
When I run experiments on ImageNet, I use the same training hyper-parameters as the official PyTorch examples. The initial learning rate is 0.1 and it is decayed at epochs 30 and 60. I find that in the first two stages, i.e. epochs 1-30 and 31-60, standard KD has a higher accuracy than the student trained from scratch. But in the third stage (epochs 61-90), KD's accuracy is lower than that of the student trained from scratch. This phenomenon is exactly the same as in Figure 3 of this paper.
In your work, KD's top-1 accuracy is 0.9 points better than the student trained from scratch. I wonder whether there are special settings, such as the training scheme or hyper-parameters, that differ from the official PyTorch examples. It would be very helpful if you could provide your code for ImageNet.
Thanks in advance!
Thanks for sharing. Were all the distillation experiments run together with the CE loss? I have a question about this training strategy. First, the well-trained teacher model has fixed parameters; then a linear dimension-transfer layer is added before the last classification layer of the teacher and of the student model, and this layer is trainable, like the student model. But the CE loss is applied after the last layer of the original teacher and student models, so it has no relationship with the linear transfer layer. This seems a little strange to me: the linear transfer layer has no connection to the final classification task, so how can it learn? Another question: if the penultimate layers of my student and teacher models have the same dimension, can I drop the linear transfer layer?
Thanks very much for your reply.
Thanks for your impressive work!
I am trying to reproduce the results from your paper, but when I test the code (i.e. python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill kd --model_s resnet8x4 -r 0.1 -a 0.9 -b 0 --trial 1), it shows:
File "~/work/RepDistiller/dataset/cifar100.py", line 44, in __getitem__
  img, target = self.train_data[index], self.train_labels[index]
AttributeError: 'CIFAR100Instance' object has no attribute 'train_data'
By the way, my environment: Ubuntu 18.01, Python 3.6, Pytorch 1.2, torchvision 0.4.1.
I also tested on Pytorch 0.4.0 and got the same issue.
Could you please help me with this?
Thanks
I ran your code: python train_student.py --path_t ./save/models/resnet32x4_vanilla/ckpt_epoch_240.pth --distill crd --model_s resnet8x4 -a 0 -b 0.8 --trial 1
but encountered an error:
....../RepDistiller/dataset/cifar100.py", line 124, in __init__
  num_samples = len(self.train_data)
AttributeError: 'CIFAR100InstanceSample' object has no attribute 'train_data'
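Both AttributeError reports above are consistent with a torchvision change: newer releases store CIFAR images and labels in data and targets, while older ones used train_data/train_labels and test_data/test_labels. A hedged sketch of a compatibility accessor follows; the helper name get_data_and_labels is hypothetical and not part of the repository, and the dataset classes would still need to call it.

```python
def get_data_and_labels(dataset):
    """Return (data, labels) from a CIFAR-style dataset object.

    Works with both attribute naming schemes: the newer data/targets and the
    older train_data/train_labels or test_data/test_labels.
    """
    if hasattr(dataset, "data") and hasattr(dataset, "targets"):
        return dataset.data, dataset.targets
    if getattr(dataset, "train", True):
        return dataset.train_data, dataset.train_labels
    return dataset.test_data, dataset.test_labels
```

Inside a subclass, self.train_data[index] would then become something like data, labels = get_data_and_labels(self) followed by data[index], labels[index]. Pinning an older torchvision release is the other common workaround.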
Hi, @HobbitLong @erjanmx ,
Thank you for the very interesting paper!
Is it correct that the code can reproduce only the results for CIFAR? If so, is it possible that you release code or scripts for ImageNet experiments too? I mean even if the results differ a bit, scripts and running commands will be very helpful.
Best,
Arsenii
Hello Sir,
I would like to start by saying how great this work is!
And I would like to know if the resnet models can be used directly on cifar10 instead of cifar100 or if there is some modification I should make?
I tried using KD (logits) to make resnet32 learn from resnet110 but unfortunately, the accuracy I get is pretty bad:
Acc= 43.35 after 250 epochs, a=0.1, b= 0.9 with no modification in the resnet models nor in function adjust_learning_rate.
@HobbitLong
Thanks for your excellent work!
I'm wondering why you register two memories here in
self.register_buffer('memory_v1', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))
self.register_buffer('memory_v2', torch.rand(outputSize, inputSize).mul_(2 * stdv).add_(-stdv))
since only the student network is trained while the teacher is fixed.
Hi, thanks for your great work. I wonder whether the CRD distillation loss could be used in image enhancement tasks such as denoising, super-resolution (SR), or deblurring. I would appreciate it if someone could answer this question.
Thank you for sharing the benchmark.
Your experimental results cover many teacher and student pairs.
In the KD method (Distilling the Knowledge in a Neural Network) in particular, the optimal setting (e.g. temperature) may differ for each pair.
Does performance change a lot with the temperature?
Other methods may have a similar problem. What do you think about this?
Hi, I want to know about the hyperparameters used to train the other methods.
Are these methods well trained?
What an excellent work! I have learned so much from this paper.
But I have a question about the normalization constants Z_v1 and Z_v2 in ContrastMemory.
Z_v1 and Z_v2 are initialized to -1, and Z_v1/Z_v2 is calculated based on out_v1/out_v2.
As shown in the following code, Z_v1/Z_v2 is updated only while it is negative (Z_v1 < 0, Z_v2 < 0), so it will only be set in the first batch.
if Z_v1 < 0:
    self.params[2] = out_v1.mean() * outputSize
    Z_v1 = self.params[2].clone().detach().item()
    print("normalization constant Z_v1 is set to {:.1f}".format(Z_v1))
if Z_v2 < 0:
    self.params[3] = out_v2.mean() * outputSize
    Z_v2 = self.params[3].clone().detach().item()
    print("normalization constant Z_v2 is set to {:.1f}".format(Z_v2))
But I think Z_v1 and Z_v2 should be updated in every batch.
What is your opinion on this? What are the benefits of this design?
Hi,
Thank you for your contribution. The code is very useful to me, but I want to ask a question about this part:
with torch.no_grad():
    l_pos = torch.index_select(self.memory_v1, 0, y.view(-1))
    l_pos.mul_(momentum)
    l_pos.add_(torch.mul(v1, 1 - momentum))
    l_norm = l_pos.pow(2).sum(1, keepdim=True).pow(0.5)
    updated_v1 = l_pos.div(l_norm)
    self.memory_v1.index_copy_(0, y, updated_v1)
    ab_pos = torch.index_select(self.memory_v2, 0, y.view(-1))
    ab_pos.mul_(momentum)
    ab_pos.add_(torch.mul(v2, 1 - momentum))
    ab_norm = ab_pos.pow(2).sum(1, keepdim=True).pow(0.5)
    updated_v2 = ab_pos.div(ab_norm)
    self.memory_v2.index_copy_(0, y, updated_v2)
In your implementation of the paper, you compute the loss terms first and then update the memory.
If I update the memory first and then compute the loss terms, what is the difference between the two approaches?
Looking forward to your reply! Thank you!
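For readers following the snippet above, here is a plain-Python sketch of what one row update does: blend the stored entry with the new feature using the momentum weight, then rescale the result to unit L2 norm. The function name and the default momentum value are illustrative, not taken from the repository.

```python
import math


def momentum_update(memory_row, feature, momentum=0.5):
    """Blend a stored memory row with a new feature, then L2-normalize."""
    mixed = [momentum * m + (1.0 - momentum) * f for m, f in zip(memory_row, feature)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed]


if __name__ == "__main__":
    print(momentum_update([1.0, 0.0], [0.0, 1.0], momentum=0.5))
```

Because the update overwrites the stored rows in place, the order matters in one respect: scores computed before the update see the old entries for the current batch, while scores computed after it would see entries that already contain the current features.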