szagoruyko / attention-transfer
Improving Convolutional Networks via Attention Transfer (ICLR 2017)
Home Page: https://arxiv.org/abs/1612.03928
Hi, thanks for sharing the code. I have a question about the KL loss implementation. PyTorch's KL loss is averaged over both the batch and the class dimension, but the original knowledge distillation loss is not averaged over the class dimension. So I assume there is a bug here?
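The difference between the two averaging conventions can be checked numerically. The following is a minimal sketch in plain Python, with made-up logits, that computes the KL divergence per sample and then averages it either over the batch only or over batch and classes. In recent PyTorch versions, `torch.nn.functional.kl_div` exposes these two conventions as `reduction='batchmean'` and `reduction='mean'`; whether the repository's loss matches one or the other depends on the PyTorch version in use and is not settled here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_per_sample(student_log_probs, teacher_probs):
    """KL(teacher || student) for one sample, summed over classes."""
    return sum(t * (math.log(t) - s)
               for s, t in zip(student_log_probs, teacher_probs) if t > 0)

# Hypothetical logits: a batch of 2 samples with 4 classes each.
student_logits = [[2.0, 0.5, -1.0, 0.1], [0.3, 1.2, 0.0, -0.7]]
teacher_logits = [[1.5, 0.2, -0.5, 0.0], [0.1, 2.0, -0.3, -1.0]]

batch_size = len(student_logits)
num_classes = len(student_logits[0])

per_sample = [
    kl_per_sample(log_softmax(s), [math.exp(v) for v in log_softmax(t)])
    for s, t in zip(student_logits, teacher_logits)
]
total = sum(per_sample)

kl_batch_mean = total / batch_size                    # averaged over the batch only
kl_element_mean = total / (batch_size * num_classes)  # averaged over batch and classes

print(kl_batch_mean, kl_element_mean)
```

The two values differ exactly by a factor of `num_classes`, so the choice acts as a rescaling of the distillation term relative to the other loss terms.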
When I ran the cifar code with GPU, I got the following error. Any suggestions would be appreciated!
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same
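This error message says that the input tensor is on the GPU while the weights are still on the CPU. A minimal sketch of keeping both on one device follows; the parameter dictionary and its key `conv0` are hypothetical stand-ins, not the repository's actual parameters, and the sketch falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn.functional as F

# Pick one device and use it for both the inputs and the weights.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hypothetical functional-style parameter dict (the name is illustrative only).
params = {'conv0': torch.randn(16, 3, 3, 3)}
params = {k: v.to(device).requires_grad_() for k, v in params.items()}

# The input must live on the same device as the weights.
x = torch.randn(2, 3, 32, 32).to(device)
y = F.conv2d(x, params['conv0'], padding=1)
print(y.shape, y.device)
```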
First of all, thanks for your great work!
I wonder why no BN layer is used when running inference with the teacher model here (https://github.com/szagoruyko/attention-transfer/blob/master/imagenet.py#L117). Is it a typo?
I hope to hear from you!
I get an error when I use 2 GPUs to train the model on ImageNet. I followed the steps in the README, but got the following traceback.
Traceback (most recent call last):
  File "imagenet.py", line 340, in <module>
    main()
  File "imagenet.py", line 336, in main
    engine.train(h, iter_train, opt.epochs, optimizer)
  File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 63, in train
    state['optimizer'].step(closure)
  File "/usr/local/lib/python3.6/site-packages/torch/optim/sgd.py", line 80, in step
    loss = closure()
  File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 52, in closure
    loss, output = state['network'](state['sample'])
  File "imagenet.py", line 265, in h
    y_s, y_t, loss_groups = utils.data_parallel(f, inputs, params, mode, range(opt.ngpu))
  File "/opt/ml/job/utils.py", line 64, in data_parallel
    return gather(outputs, output_device)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    return gather_map(outputs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in <lambda>
    ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
RuntimeError: dimension specified as 0 but tensor has no dimensions
Thanks in advance for any solution.
(py35) user@user-ASUS:~/fzz/study/attention-transfer-master$ python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
parsed options: {'data_root': '.', 'dataset': 'CIFAR10', 'cuda': False, 'width': 1.0, 'lr': 0.1, 'alpha': 0, 'teacher_id': '', 'gpu_id': '0', 'lr_decay_ratio': 0.2, 'epoch_step': '[60,120,160]', 'beta': 0, 'batchSize': 128, 'ngpu': 1, 'depth': 40, 'randomcrop_pad': 4, 'optim_method': 'SGD', 'nthread': 4, 'weightDecay': 0.0005, 'resume': '', 'epochs': 200, 'save': 'logs/resnet_40_1_teacher', 'temperature': 4, 'dtype': 'float'}
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
  File "cifar.py", line 331, in <module>
    main()
  File "cifar.py", line 212, in main
    optimizable = [v for v in params.itervalues() if v.requires_grad]
AttributeError: 'collections.OrderedDict' object has no attribute 'itervalues'
When I run 'cifar.py' I get this error, and I can't find a definition of "itervalues" in any file.
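`itervalues` is a Python 2 dictionary method that was removed in Python 3, where `values()` returns a lazy view instead, which is why no definition shows up in the repository files. The sketch below shows the Python 3 spelling of the failing line; `FakeParam` and the parameter names are hypothetical stand-ins for the real tensors.

```python
from collections import OrderedDict

class FakeParam:
    """Hypothetical stand-in for a tensor with a requires_grad flag."""
    def __init__(self, requires_grad):
        self.requires_grad = requires_grad

params = OrderedDict([
    ('conv0', FakeParam(True)),
    ('bn0.running_mean', FakeParam(False)),
    ('fc.weight', FakeParam(True)),
])

# Python 2: params.itervalues()    Python 3: params.values()
optimizable = [v for v in params.values() if v.requires_grad]
print(len(optimizable))
```

The same applies to `iteritems` and `iterkeys`, which become `items()` and `keys()` in Python 3.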
Could you add the AT+KD code to this repository?
Hello, I have a question!
Table 1 of your paper says "Error is computed as median of 5 runs with different seed." How should I get the result of a single run when running the experiments? Is it the accuracy of the model on the test set after the last epoch? And do I then run it 5 times and take the median of the results?
How should I calculate the results of the experiment? This is my first time reproducing a deep learning paper.
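Under the reading proposed in the question (one final test error per seed, then the median over five seeds), the aggregation step can be sketched as follows. The error values are placeholders invented purely for illustration and are not results from the paper or from any actual run.

```python
import statistics

# Placeholder final-epoch test errors (%), one per seed.
# These numbers are made up for illustration only.
final_test_errors = {
    'seed_1': 5.2,
    'seed_2': 5.0,
    'seed_3': 5.4,
    'seed_4': 5.1,
    'seed_5': 5.3,
}

median_error = statistics.median(final_test_errors.values())
print(median_error)
```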
Hi.
In the paper, the authors said "As for parameter β in eq. 2, it usually varies about 0.1, as we
set it to 10^3 divided by number of elements in attention map and batch size for each layer. "
But I am still confused. What does 10^3 mean, and how is the value 0.1 obtained?
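One reading of the quoted sentence is that β for each layer equals 10^3 divided by the product of the number of spatial elements in that layer's attention map and the batch size. The sketch below works through that arithmetic; the attention map sizes (8x8, 16x16, 32x32) and the batch size of 128 are assumptions chosen for illustration, and the function name `beta_for_layer` is hypothetical.

```python
def beta_for_layer(height, width, batch_size, scale=1e3):
    """beta = scale / (elements in the attention map * batch size)."""
    return scale / (height * width * batch_size)

batch_size = 128  # assumed batch size
betas = {side: beta_for_layer(side, side, batch_size) for side in (8, 16, 32)}

for side, beta in betas.items():
    print(f"{side}x{side} attention map: beta = {beta:.4f}")
```

With these assumed sizes, the 8x8 map gives 1000 / (64 * 128), roughly 0.12, which is of the order of the 0.1 mentioned in the quote; larger maps give smaller values.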
Thank you for the attention-transfer source code, but I am not familiar with PyTorch. I do not understand how the interpolation is implemented. How does it work? How is the interpolation done when the two feature maps have different dimensions? Could you explain it clearly?
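As a general illustration of matching spatial sizes, and not necessarily the way the repository does it, `torch.nn.functional.interpolate` can resize one map to the spatial size of the other before the two are compared. The tensor shapes below are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F

# Assumed shapes: (batch, channels, height, width).
student_map = torch.randn(4, 1, 8, 8)
teacher_map = torch.randn(4, 1, 16, 16)

# Resize the student map to the teacher's spatial size with bilinear interpolation.
target_size = tuple(teacher_map.shape[-2:])
student_resized = F.interpolate(student_map, size=target_size,
                                mode='bilinear', align_corners=False)

print(student_resized.shape)
```

After resizing, the two maps have the same shape and can be compared element by element, for example with a squared difference.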
What is the ultimate purpose of this model?
Is it for improving the performance of a CNN?
Or is it for obtaining attention regions, like a saliency map?
Thanks in advance .
The Hinton distillation paper states:
"The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross entropy with the correct labels."
In https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L13-L15, the first objective function is computed using kl_div, which is different from cross_entropy.
kl_div computes (- \sum t log(x/t))
cross_entropy computes (- \sum t log(x))
In general, cross_entropy is kl_div + entropy(t)
Did I misunderstand something, or did you use a slightly different loss in your implementation?
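The identity stated in the question can be verified numerically. This sketch uses two made-up probability vectors, `t` for the teacher targets and `x` for the student predictions, and checks that cross_entropy equals kl_div plus entropy(t).

```python
import math

# Made-up distributions for illustration.
t = [0.7, 0.2, 0.1]  # soft targets
x = [0.5, 0.3, 0.2]  # student predictions

cross_entropy = -sum(ti * math.log(xi) for ti, xi in zip(t, x))
kl_div = -sum(ti * math.log(xi / ti) for ti, xi in zip(t, x))
entropy_t = -sum(ti * math.log(ti) for ti in t)

print(cross_entropy, kl_div + entropy_t)
```

Because entropy(t) does not depend on the student's output, the two losses differ only by a constant with respect to the student's parameters, so their gradients with respect to the student are the same.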
Hello! Does anyone know how to get the attention map from model.parameters() or model.named_parameters() in PyTorch?
Thank you for your code; it is very helpful for my study of computer vision.
Unfortunately, I can't run the code correctly.
I think it may be a matter of software versions,
so I would like to know which versions of the software, such as Python and OpenCV, are required.
I am using Python 2.7 and OpenCV 3.2.0.
Thank you very much.
How do I install cvtransforms?
Thank you.
How does AT perform on other models? Are there any experimental results?
Besides, I tried reproducing the experiment with ResNet-18 on ImageNet and got no improvement after 100 epochs of training.
Thanks!
Thank you for your code!
I don't understand why the KL loss here is multiplied by 2.
https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L60
Could you explain it?
When I run cifar.py I get the error:
new() received an invalid combination of arguments - got (Tensor, int, int, int), but expected one of:
How to visualize the attention maps?
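A minimal sketch of one way to build a displayable attention map, following the paper's activation-based definition (squared activations aggregated over the channel dimension). The feature tensor is a random stand-in for an intermediate layer output, the 32x32 target size is an assumed input resolution, and the function name `attention_map` is hypothetical rather than taken from the repository.

```python
import torch
import torch.nn.functional as F

def attention_map(activations):
    """(N, C, H, W) activations -> (N, H, W) map: mean of squares over channels."""
    return activations.pow(2).mean(dim=1)

# Random stand-in for the output of an intermediate layer.
feats = torch.randn(2, 64, 8, 8)
amap = attention_map(feats)

# Upsample to an assumed 32x32 input size so the map can be overlaid on the image.
amap_up = F.interpolate(amap.unsqueeze(1), size=(32, 32),
                        mode='bilinear', align_corners=False).squeeze(1)

# Rescale each map to [0, 1] for display.
flat = amap_up.flatten(1)
lo = flat.min(dim=1)[0].view(-1, 1, 1)
hi = flat.max(dim=1)[0].view(-1, 1, 1)
amap_vis = (amap_up - lo) / (hi - lo + 1e-8)

print(amap_vis.shape)
```

Each `amap_vis[i]` is a 2-D array that a plotting library can display as a heat map, for example with matplotlib's `imshow`. Note that these maps are computed from activations, so they cannot be read off `model.parameters()`; for an `nn.Module` model, `register_forward_hook` is one way to capture the intermediate activations.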
Hello, first of all, thank you very much for your great work!
When reproducing the results of your paper, I ran into several confusing problems in the ImageNet part, and my results are much worse than those reported in the paper.
First, the accuracy of the ResNet-34 teacher network reported in the paper differs from that of the pretrained ResNet-34 model you provide. I don't know whether this is what leads to the poor student results. Could you provide the ResNet-34 model used in the paper?
Second, for the ImageNet experiment you mention that the hyperparameters are the same as those used in the transfer learning experiments, but no specific values are given. What is the specific beta value?
Here is my reproduction ("Imagenet_AT" is the experimental result with beta set to 1000, which is much worse than the result in the paper. "Imagenet_AT2000" is the result after I tried raising beta to 2000. This experiment is very computationally expensive, so I stopped it after observing that the earlier result was very poor):
Result in your paper:
Hello! Thanks for your work!
I have one more question: how can I get a visualization of the attention maps?
@szagoruyko @EderSantana
Hi, thank you for sharing the code. Could you please specify your strategy for decaying the two multipliers α and β during training? Thanks in advance.
Hello!
First of all thanks for sharing the code! It's amazing!
I wanted to ask a (perhaps noobish) question; apologies for my ignorance if this is something obvious.
Do you see any relevance or advantage in using attention transfer techniques in areas other than computer vision? I was mainly thinking of NLP. Could something like this help in cases where specific important parts of a text need to be identified, or in text summarization?
Thanks in advance for the great work!
Kind regards,
Theodore.
Hi, I like your work and am curious when you plan to add the implementation of AT+KD with decaying beta.
Will it be committed soon?
Thank you.