szagoruyko / attention-transfer

Improving Convolutional Networks via Attention Transfer (ICLR 2017)

Home Page: https://arxiv.org/abs/1612.03928

Python 4.77% Jupyter Notebook 95.23%

attention deep-learning knowledge-distillation pytorch

attention-transfer's Issues

Question about KL_loss average

Hi, thanks for sharing the code. I have a question about the KL loss implementation. PyTorch's KL loss is calculated by averaging over both the batch size and the class dimension, but the original knowledge distillation loss does not average over the class dimension. Is this a bug?
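
For reference, the two averaging conventions can be compared in plain Python, without PyTorch. This is only a sketch: the logits are made-up values and `kl_per_sample` is a hypothetical helper, not a function from the repository. By definition the KL divergence sums over classes, so averaging over every element divides the total by batch size times number of classes, while averaging per sample divides by batch size only. In recent PyTorch versions these correspond to `reduction='mean'` and `reduction='batchmean'` of `torch.nn.functional.kl_div`.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_per_sample(teacher_logits, student_logits):
    # KL(teacher || student) for one sample, summed over classes.
    t = softmax(teacher_logits)
    s = softmax(student_logits)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))


# Made-up logits: a batch of 2 samples with 3 classes each.
teacher = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]]
student = [[1.5, 1.2, 0.3], [0.2, 1.9, 0.9]]

batch_size = len(teacher)
num_classes = len(teacher[0])
total_kl = sum(kl_per_sample(t, s) for t, s in zip(teacher, student))

mean_over_elements = total_kl / (batch_size * num_classes)  # element-wise average
mean_over_batch = total_kl / batch_size                     # per-sample average

print(mean_over_elements, mean_over_batch)
```

The two values differ only by the constant factor `num_classes`, so the choice rescales the loss (and hence its weight relative to other terms) rather than changing its minimizer.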

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

When I ran the CIFAR code on a GPU, I got the following error. Any suggestions would be appreciated!

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Why not use bn for teacher net in imagenet.py

First of all, thanks for your great work!

I wonder why you do not use the BN layers when running inference with the teacher model here (https://github.com/szagoruyko/attention-transfer/blob/master/imagenet.py#L117). Is it a typo?

Looking forward to your reply!

Error when using 2 GPUs

I got an error when training the model on ImageNet with 2 GPUs. I followed the steps in the README, but got the following error:
Traceback (most recent call last):
  File "imagenet.py", line 340, in <module>
    main()
  File "imagenet.py", line 336, in main
    engine.train(h, iter_train, opt.epochs, optimizer)
  File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 63, in train
    state['optimizer'].step(closure)
  File "/usr/local/lib/python3.6/site-packages/torch/optim/sgd.py", line 80, in step
    loss = closure()
  File "/usr/local/lib/python3.6/site-packages/torchnet/engine/engine.py", line 52, in closure
    loss, output = state['network'](state['sample'])
  File "imagenet.py", line 265, in h
    y_s, y_t, loss_groups = utils.data_parallel(f, inputs, params, mode, range(opt.ngpu))
  File "/opt/ml/job/utils.py", line 64, in data_parallel
    return gather(outputs, output_device)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    return gather_map(outputs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
    ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
  File "/usr/local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in <lambda>
    ctx.input_sizes = tuple(map(lambda i: i.size(ctx.dim), inputs))
RuntimeError: dimension specified as 0 but tensor has no dimensions

Thanks in advance for any solution.

question about "params.itervalues()"

(py35) user@user-ASUS:~/fzz/study/attention-transfer-master$ python cifar.py --save logs/resnet_40_1_teacher --depth 40 --width 1
parsed options: {'data_root': '.', 'dataset': 'CIFAR10', 'cuda': False, 'width': 1.0, 'lr': 0.1, 'alpha': 0, 'teacher_id': '', 'gpu_id': '0', 'lr_decay_ratio': 0.2, 'epoch_step': '[60,120,160]', 'beta': 0, 'batchSize': 128, 'ngpu': 1, 'depth': 40, 'randomcrop_pad': 4, 'optim_method': 'SGD', 'nthread': 4, 'weightDecay': 0.0005, 'resume': '', 'epochs': 200, 'save': 'logs/resnet_40_1_teacher', 'temperature': 4, 'dtype': 'float'}
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
  File "cifar.py", line 331, in <module>
    main()
  File "cifar.py", line 212, in main
    optimizable = [v for v in params.itervalues() if v.requires_grad]
AttributeError: 'collections.OrderedDict' object has no attribute 'itervalues'

When I run cifar.py, I get this error, and I cannot find "itervalues" defined in any file.
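
`itervalues()` is a Python 2 dictionary method; Python 3 removed it, and `values()` returns a lazy view instead. Since the prompt above shows a `py35` environment, that is the likely cause. A minimal sketch of the change, where `Param` is a hypothetical stand-in for a tensor with a `requires_grad` flag so that the snippet runs without PyTorch:

```python
from collections import OrderedDict


class Param:
    """Hypothetical stand-in for a tensor that carries a requires_grad flag."""

    def __init__(self, requires_grad):
        self.requires_grad = requires_grad


# Made-up parameter names, for illustration only.
params = OrderedDict([
    ("conv0", Param(True)),
    ("bn0.running_mean", Param(False)),
    ("fc.weight", Param(True)),
])

# Python 2 spelling: params.itervalues()
# Python 3 spelling: params.values()
optimizable = [v for v in params.values() if v.requires_grad]

print(len(optimizable))
```

Alternatively, running the script with the Python version it was written for avoids editing the source.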

AT+KD Code

Could you add the AT+KD code to this repository?

Table1: Experiments on CIFAR-10

Hello, I have a question!
Table 1 of your paper states: "Error is computed as median of 5 runs with different seed." How should I get the result of a single run when running the experiment? Is it the accuracy of the model on the test set after the last epoch? And do I then run it 5 times and take the median?

How should I calculate the results of the experiment? This is my first time reproducing a deep learning paper.
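
One way to read the table caption: each run yields one test error, and the reported number is the median of the five. A minimal sketch with Python's `statistics` module follows. The five error values are placeholders for illustration, not results from the paper or the repository, and whether the per-run number is the final-epoch error is something only the authors can confirm.

```python
import statistics

# Placeholder test errors (%) keyed by random seed; NOT real results.
errors_by_seed = {0: 7.10, 1: 6.95, 2: 7.30, 3: 7.02, 4: 7.18}

# With an odd number of runs the median is the middle value after sorting.
median_error = statistics.median(errors_by_seed.values())

print(median_error)
```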

Setting of β

Hi.

In the paper, the authors say: "As for parameter β in eq. 2, it usually varies about 0.1, as we set it to 10^3 divided by number of elements in attention map and batch size for each layer."

But I am still confused. What does 10^3 mean, and how was 0.1 obtained?
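
The quoted sentence can be worked through numerically. The sketch below rests on assumptions: the batch size of 128 matches the `batchSize` shown in the parsed options elsewhere on this page, the square attention-map sizes are chosen for illustration, and `effective_beta` is a hypothetical helper name. It reads the sentence as β = 10^3 / (elements in the attention map × batch size).

```python
def effective_beta(height, width, batch_size, scale=1e3):
    # 10^3 divided by (number of elements in the attention map * batch size).
    return scale / (height * width * batch_size)


batch_size = 128  # assumed, taken from the batchSize option shown on this page

# Assumed attention-map resolutions, one per layer group.
for size in (32, 16, 8):
    print(size, effective_beta(size, size, batch_size))
```

For an 8×8 map this gives 1000 / 8192, roughly 0.12, which is in line with the "about 0.1" in the paper; larger maps give smaller values. Under this reading, the division would happen implicitly when a loss ends in `.mean()` over all elements and the command-line β is set to 10^3.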

how to do the interpolation?

Thank you for your source code for attention-transfer. I am not familiar with PyTorch and do not understand how the interpolation is implemented. How does it work? How is the interpolation done if the two feature maps have different dimensions? Could you explain it clearly?

What is this model's final purpose?

Hi, @EderSantana @szagoruyko

What is this model's final purpose?

Is it for improving the performance of a CNN?
Or is it for obtaining attention regions, like a saliency map?

Thanks in advance .

KL div v/s xentropy

The Hinton distillation paper states:
"The first objective function is the cross entropy with the soft targets and this cross entropy is computed using the same high temperature in the softmax of the distilled model as was used for generating the soft targets from the cumbersome model. The second objective function is the cross entropy with the correct labels."

In https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L13-L15, the first objective function is computed using kl_div which is different from cross_entropy.
kl_div computes (- \sum t log(x/t))
cross_entropy computes (- \sum t log(x))
In general, cross_entropy is kl_div + entropy(t)

Did I misunderstand something, or did you use a slightly different loss in your implementation?
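
The identity stated above can be checked numerically. This is a sketch in plain Python with made-up distributions: `t` stands for the teacher's soft targets and `x` for the student's probabilities. Because entropy(t) does not depend on the student, the two losses differ only by a constant with respect to the student's parameters, so their gradients with respect to the student coincide when the teacher is fixed.

```python
import math

# Made-up probability distributions over 3 classes.
t = [0.7, 0.2, 0.1]  # teacher soft targets
x = [0.5, 0.3, 0.2]  # student probabilities

kl_div = -sum(ti * math.log(xi / ti) for ti, xi in zip(t, x))
cross_entropy = -sum(ti * math.log(xi) for ti, xi in zip(t, x))
entropy_t = -sum(ti * math.log(ti) for ti in t)

# cross_entropy should equal kl_div + entropy_t up to rounding.
print(cross_entropy, kl_div + entropy_t)
```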

question about cifar.py

How to get the attention map with input model.parameters() or model.named_parameters() in pytorch

Hello! Does anyone know how to get the attention map when the input is model.parameters() or model.named_parameters() in PyTorch?

Question on Code

Thank you for your code; it is very helpful for studying computer vision.
Unfortunately, I cannot run it correctly, and I think it may be a matter of software versions.
Which versions of the software, such as Python and OpenCV, are required?
I am using Python 2.7 and OpenCV 3.2.0.

thank you very much

how to install cvtransforms?

how to install cvtransforms?
thank you

Any experiment results updated with AT on imagenet?

How does AT perform on other models? Are there any experimental results available?
Also, I tried reproducing the experiment with ResNet-18 on ImageNet and got no improvement after 100 epochs of training.
Thanks!

Question on KL loss

Thank you for your code!

I don't understand why the KL loss here is multiplied by 2.
https://github.com/szagoruyko/attention-transfer/blob/master/utils.py#L60
Could you explain it?

invalid variables

When I run cifar.py, I get the following error:
new() received an invalid combination of arguments - got (Tensor, int, int, int), but expected one of:

(torch.device device)
(tuple of ints size, torch.device device)
(torch.Storage storage)
(Tensor other)
(object data, torch.device device)
What is the problem?
Thanks for the answer!

How to visualize the attention maps?

My Imagenet replication results are poor

Hello, first of all, thank you very much for your great work!

When trying to reproduce the results of your paper, I ran into several confusing points in the ImageNet part, and my results are much worse than those reported in the paper.

First, the accuracy of the ResNet-34 teacher network mentioned in the paper differs from that of the pretrained ResNet-34 model you provide. I don't know whether this is what leads to the poor student results. Could you provide the ResNet-34 model used in the paper?
Second, for the ImageNet experiment you mention that the hyperparameters are the same as those used in the transfer learning experiments, but no specific values are given. What is the specific beta value?

Here are my reproduction results. "Imagenet_AT" is the experimental result with beta set to 1000, which is much worse than the result in the paper. "Imagenet_AT2000" is the result after I adjusted beta to 2000. This experiment is very computationally expensive, so I stopped it after observing that the earlier result was very poor.

Result in your paper:

Loss function problems

Hi,

Thanks for your great work.
I have a question.
Why does the implementation just square the activations and take the mean, rather than using the L2 norm described in the paper?

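import torch.nn.functional as F


def at(x):
    # Mean of squared activations over channels, flattened, then L2-normalized per sample.
    return F.normalize(x.pow(2).mean(1).view(x.size(0), -1))


def at_loss(x, y):
    # Mean squared difference between the two normalized attention maps.
    return (at(x) - at(y)).pow(2).mean()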
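
For readers unfamiliar with the PyTorch calls, here is a dependency-free sketch of what the two functions compute for a single sample. `F.normalize` with its default arguments divides each row by its L2 norm, so an L2 normalization is applied to each flattened attention map; the loss is then the mean of squared differences rather than a norm of the difference. The helper names and the tiny 2-channel, 2×2 feature maps are made up for illustration.

```python
import math


def attention_map(feature_maps):
    """feature_maps: list of channels, each a list of rows (H x W floats)."""
    num_channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    # Mean over channels of the squared activations, flattened to H*W values.
    flat = [
        sum(feature_maps[c][i][j] ** 2 for c in range(num_channels)) / num_channels
        for i in range(height)
        for j in range(width)
    ]
    # L2 normalization; the floor plays the same role as eps in F.normalize.
    norm = max(math.sqrt(sum(v * v for v in flat)), 1e-12)
    return [v / norm for v in flat]


def attention_loss(student_maps, teacher_maps):
    # Mean of squared differences between the normalized attention maps.
    a = attention_map(student_maps)
    b = attention_map(teacher_maps)
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


# Made-up activations; the teacher is the student scaled by 2.
student = [[[1.0, 2.0], [0.0, 1.0]], [[0.5, 1.0], [1.0, 0.0]]]
teacher = [[[2.0, 4.0], [0.0, 2.0]], [[1.0, 2.0], [2.0, 0.0]]]

print(attention_loss(student, teacher))
```

Because of the normalization, rescaling the activations leaves the attention map unchanged, so the loss here is zero up to rounding. With a batch, the final mean would run over batch size times map elements.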

The file of the LightCNN-9 model released on Google Drive seems corrupt

Attention map

Hello! Thanks for the work!
I have one more question: how can I visualize the attention map?

how to resolve this error

File "cifar.py", line 126
o = block(o, params, f'{base}.block{i} ', mode, stride if i == 0 else 1)
^
SyntaxError: invalid syntax
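
f-strings were introduced in Python 3.6, so older interpreters report a `SyntaxError` on a line like the one above. Assuming that is the cause here, either upgrading Python or rewriting the line with `str.format` avoids it. A small sketch of the equivalence; the values of `base` and `i` are made up:

```python
# Made-up values, for illustration only.
base = "group0"
i = 1

# Requires Python 3.6 or newer.
with_fstring = f"{base}.block{i}"

# Equivalent spelling that also works on older Python 3 and on Python 2.7.
with_format = "{}.block{}".format(base, i)

print(with_fstring == with_format)
```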

Strategy of α and β decay during training

@szagoruyko @EderSantana
Hi, thanks for sharing the code. Could you please specify your strategy for decaying the two multipliers α and β during training? Thanks in advance.

Crossing computer vision boundaries

Hello!

First of all thanks for sharing the code! It's amazing!

I wanted to ask a (perhaps noobish) question; apologies for my ignorance if this is something obvious.

Do you see any relevance or advantage in applying attention transfer techniques to areas other than computer vision? I was mainly thinking of NLP. Could something like this help in cases where specific important parts of a text are to be identified, or in text summarization?

Thanks in advance for the great work!

Kind regards,
Theodore.

Hoping to see the implementation of AT+KD with decaying beta

Hi, I like your work and am curious when you plan to add the implementation of AT+KD with decaying beta.
Will it be committed soon?

Thank you.