
dynamic-vision-transformer's Introduction

Hi there 👋

I’m currently a Ph.D. student at Tsinghua University. 🔭


dynamic-vision-transformer's People

Contributors

blackfeather-wang, star9988rr


dynamic-vision-transformer's Issues

Some questions about FLOPs calculation in ViT

Thanks for your great work. I am interested in the FLOPs reported in your paper, e.g. in Tables 1 and 4. I am wondering if you could release the code for the FLOPs calculation of ViT. Thank you!
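In case it helps while waiting for the official script: a minimal sketch of the standard way ViT FLOPs are usually estimated (per-block attention + MLP costs, summed over depth). This is a generic accounting, not the paper's released code; the exact conventions (e.g. whether multiply-adds count as one or two operations) may differ from the numbers in the tables.

```python
def vit_flops(num_tokens, dim, depth, mlp_ratio=4.0):
    """Rough FLOPs estimate for a plain ViT encoder.

    Per block (N = tokens, D = embedding dim):
      - QKV projections:          3 * N * D * D
      - attention logits Q @ K^T: N * N * D
      - attention @ V:            N * N * D
      - output projection:        N * D * D
      - MLP (two linear layers):  2 * N * D * (mlp_ratio * D)
    Multiply-adds are counted as 2 ops (the final factor of 2).
    """
    n, d = num_tokens, dim
    attn = 3 * n * d * d + 2 * n * n * d + n * d * d
    mlp = 2 * n * d * int(mlp_ratio * d)
    return 2 * depth * (attn + mlp)

# Example: a ViT-Base-like configuration (196 patches + 1 class token).
print(vit_flops(num_tokens=197, dim=768, depth=12))
```

Note that the estimate is linear in depth, so per-block costs can be read off directly.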

About the implementation of upsampling in relation_reuse

My main question is why it is necessary to split relation_temp like this:

  split_index = int(relation_temp.size(0) / 2)
  relation_temp = torch.cat(
      (
          self.relation_reuse_upsample(relation_temp[:split_index * 1]),
          self.relation_reuse_upsample(relation_temp[split_index * 1:]),
      ), 0
  )

It seems more straightforward to implement the upsampling like this:

  relation_temp = self.relation_reuse_upsample(relation_temp)

Could you please explain the difference between the above two implementations?
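For what it's worth, the two implementations should be numerically identical, since upsampling is applied independently per item along the batch dimension; the split presumably only changes the peak memory of the interpolation kernel. A small check (using nn.Upsample as a stand-in for relation_reuse_upsample, whose exact mode I'm assuming):

```python
import torch
import torch.nn as nn

upsample = nn.Upsample(scale_factor=2, mode='nearest')
x = torch.randn(8, 3, 14, 14)  # stand-in for the stacked relation maps

# One-shot upsampling.
full = upsample(x)

# Chunked upsampling, as in the repo snippet above.
split_index = x.size(0) // 2
chunked = torch.cat(
    (upsample(x[:split_index]), upsample(x[split_index:])), 0
)

print(torch.equal(full, chunked))  # the two results match exactly
```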

Questions about feature and relation reuse

  1. A transformer consists of multiple encoder blocks. What I am curious about is: should the output of the last layer of the upstream transformer be concatenated with the MLP output of every encoder block in the downstream transformer?
  2. The paper mentions reusing the attention logits of the upstream transformer, i.e., the attention maps produced from Q and K. Should the attention map of each encoder block in the upstream transformer be concatenated with the attention map of the depth-wise corresponding encoder block in the downstream transformer to achieve relation reuse?
  3. In theory, the extra computation introduced by this reuse mechanism should be substantial, much like the dense connections in DenseNet, yet the paper says the extra overhead is small. The only explanation I can think of is that the embedding dimension D obtained from the linear projection of each patch is small. Is my understanding correct?
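On question 2, the shape bookkeeping can be sketched as follows. This is an illustrative guess at the mechanism, not the repo's actual code: the token counts, head count, and the choice to add (rather than concatenate) the reused logits before softmax are all assumptions here.

```python
import torch
import torch.nn.functional as F

B, H = 2, 6                        # batch size, attention heads (hypothetical)
up = torch.randn(B, H, 49, 49)     # upstream logits: 7x7 = 49 tokens
down = torch.randn(B, H, 196, 196) # downstream logits: 14x14 = 196 tokens

# Treat the N x N logit matrix as a 2-D map and upsample it to the
# downstream token count (49 -> 196 along both axes).
reused = F.interpolate(
    up.reshape(B * H, 1, 49, 49),
    size=(196, 196), mode='bilinear', align_corners=False,
).reshape(B, H, 196, 196)

# Combine with the downstream logits before the softmax (one possible choice).
attn = F.softmax(down + reused, dim=-1)
print(attn.shape)  # torch.Size([2, 6, 196, 196])
```

Since the reuse is a single upsample plus an elementwise combine per block, its cost is tiny compared with computing Q @ K^T from scratch, which may be part of the answer to question 3 as well.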

Error 'Unknown model (DVT_T2t_vit_12)'

Hi!

I tried to evaluate DVT_T2t_vit_12 by running 'python inference.py --data_url ./data/ --batch_size 64 --model DVT_T2t_vit_12 --checkpoint_path .\checkpoint\DVT_T2t_vit_12.pth.tar --eval_mode 1', and I get this error:

"
Traceback (most recent call last):
File "inference.py", line 226, in
main()
File "inference.py", line 57, in main
model = create_model(
File "A:\transformer\DViT\Dynamic-Vision-Transformer-main\Dynamic-Vision-Transformer-main\timm\models\factory.py", line 59, in create_model
raise RuntimeError('Unknown model (%s)' % model_name)
RuntimeError: Unknown model (DVT_T2t_vit_12)
"

I also printed _model_entrypoints from ..Dynamic-Vision-Transformer-main/timm/models/registry.py to look for the model name 'DVT_T2t_vit_12', but it is not there.

Environment: Python 3.8, PyTorch 1.8.1, torchvision 0.9.1
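For context on this error: timm's create_model looks the name up in _model_entrypoints, a dict populated by the @register_model decorator when the module defining the model is imported. If the module that defines DVT_T2t_vit_12 is never imported, the name stays unregistered. A minimal, self-contained sketch of the mechanism (not timm's actual code):

```python
# Simplified version of timm's model registry.
_model_entrypoints = {}

def register_model(fn):
    """Register a model-building function under its own name."""
    _model_entrypoints[fn.__name__] = fn
    return fn

def create_model(model_name, **kwargs):
    """Look the name up in the registry, as timm's factory does."""
    if model_name not in _model_entrypoints:
        raise RuntimeError('Unknown model (%s)' % model_name)
    return _model_entrypoints[model_name](**kwargs)

@register_model
def DVT_T2t_vit_12(pretrained=False):
    return 'model'  # placeholder: the real function builds the network

print('DVT_T2t_vit_12' in _model_entrypoints)  # True once this module runs
```

So the fix is usually to make sure the file containing the @register_model-decorated DVT_T2t_vit_12 function is imported (directly or via the package's __init__) before calling create_model.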

Termination condition

Your paper says that when the prediction does not satisfy the termination condition, the model improves prediction accuracy by increasing the number of tokens and introducing additional transformer layers. What exactly is the termination condition here, and where is it reflected in the code?
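One common form such a condition takes in early-exiting models is a confidence threshold on the softmax output: inference stops at the first stage whose top-class probability exceeds a threshold. The sketch below illustrates that pattern only; the threshold value and the exact criterion used by DVT (which calibrates its thresholds to a target computational budget) are assumptions here.

```python
import torch
import torch.nn.functional as F

def early_exit(logits_per_stage, threshold=0.9):
    """Return (stage index, predicted class) of the first stage whose
    softmax confidence reaches the threshold; fall through to the last
    stage otherwise. Threshold value is illustrative, not from the paper."""
    for stage, logits in enumerate(logits_per_stage):
        probs = F.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= threshold:
            return stage, pred.item()
    return len(logits_per_stage) - 1, pred.item()

# Stage 0 is unsure (near-uniform logits); stage 1 is confident.
stage, pred = early_exit([torch.tensor([0.0, 0.1]),
                          torch.tensor([0.0, 10.0])])
print(stage, pred)  # 1 1
```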
