tahmid0007 / visualtransformers

A PyTorch implementation of the paper "Visual Transformers: Token-based Image Representation and Processing for Computer Vision"

Python 100.00%

visualtransformers's Issues

Semantic Segmentation

Hi. Thanks for your work! Could you provide an alternative version for semantic segmentation? I see your version is for classification. Thanks!

Only 'BasicBlock', no 'Bottleneck'

Nice work!

The code contains only the 'BasicBlock' of ResNet, not the 'Bottleneck' block needed for ResNet50 or ResNet101.

Could you update it?
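For reference, a minimal sketch of a standard ResNet-style bottleneck block in PyTorch. This is not the repository's code: the class layout, the argument names `in_planes` and `planes`, and the `shortcut` attribute are assumptions chosen for illustration, following the common 1x1, 3x3, 1x1 design with an expansion factor of 4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Bottleneck(nn.Module):
    """Illustrative ResNet bottleneck block (not taken from the repository)."""

    expansion = 4  # output channels = planes * expansion

    def __init__(self, in_planes, planes, stride=1):
        super().__init__()
        out_planes = planes * self.expansion
        # 1x1 conv reduces the channel count
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)
        # 3x3 conv does the spatial processing (and optional downsampling)
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=stride,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        # 1x1 conv expands the channel count again
        self.conv3 = nn.Conv2d(planes, out_planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_planes)

        # Identity shortcut unless the shape changes, then a 1x1 projection
        self.shortcut = nn.Sequential()
        if stride != 1 or in_planes != out_planes:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_planes, out_planes, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_planes),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        out = out + self.shortcut(x)  # residual connection
        return F.relu(out)


block = Bottleneck(in_planes=16, planes=8, stride=2)
y = block(torch.randn(2, 16, 8, 8))
```

With `stride=2` the spatial size is halved and the channel count becomes `planes * 4`, so `y` here has 32 channels.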

No Documentation!

A little documentation would be appreciated.

I am having an especially hard time understanding the parameters of the main ViTResNet class.

Classification tokens

In line 228, why do you use only the first token for classification?

Question on token

I'm reviewing your code for the Visual Transformer. I have some questions and hope to get your answers.

First, what does token_wV mean in your code? Why do you multiply wV with the input feature? In the paper, the authors didn't do this step.

The second question is about the way you define self.token_wA and self.token_wV. Why did you define them with the batch size? Can I define them without the batch size and expand them in forward, like cls_tokens? I don't want to use the model with a fixed batch size. I'm also not sure whether wA and wV are trainable in the model.
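A minimal sketch of the batch-independent variant asked about here, assuming the tokenizer computes a spatial attention map from `token_wA` and value features from `token_wV`. The class name `Tokenizer`, its constructor arguments, and the tensor layout `(batch, pixels, channels)` are assumptions for illustration, not the repository's actual code. Wrapping the tensors in `nn.Parameter` registers them as trainable, and a leading dimension of 1 lets `torch.matmul` broadcast over any batch size.

```python
import torch
import torch.nn as nn


class Tokenizer(nn.Module):
    """Hypothetical tokenizer whose weights do not depend on the batch size."""

    def __init__(self, channels, num_tokens, token_dim):
        super().__init__()
        # Leading dim of 1 instead of the batch size; broadcast in forward
        self.token_wA = nn.Parameter(torch.empty(1, num_tokens, channels))
        self.token_wV = nn.Parameter(torch.empty(1, channels, token_dim))
        nn.init.xavier_uniform_(self.token_wA)
        nn.init.xavier_uniform_(self.token_wV)

    def forward(self, x):
        # x: (batch, pixels, channels)
        # Attention logits per token over pixels: (batch, pixels, num_tokens)
        attn = torch.matmul(x, self.token_wA.transpose(1, 2))
        # Normalize over pixels: (batch, num_tokens, pixels)
        attn = attn.transpose(1, 2).softmax(dim=-1)
        # Value projection of the features: (batch, pixels, token_dim)
        values = torch.matmul(x, self.token_wV)
        # Weighted sum of values per token: (batch, num_tokens, token_dim)
        return torch.matmul(attn, values)


tokenizer = Tokenizer(channels=64, num_tokens=8, token_dim=32)
tokens_small = tokenizer(torch.randn(2, 49, 64))
tokens_large = tokenizer(torch.randn(5, 49, 64))
```

The same module instance handles both batch sizes, and both weights appear in `tokenizer.parameters()`, so an optimizer would update them.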

Performances

Hi,
Can you please share performance numbers and a trained model, so that the implementation can be verified?

Mask shape is not correct

In the forward function of Attention, the code is incorrect when the input mask is not None. See this code:

==================
if mask is not None:
    mask = F.pad(mask.flatten(1), (1, 0), value=True)
    assert mask.shape[-1] == dots.shape[-1], 'mask has incorrect dimensions'
    mask = mask[:, None, :] * mask[:, :, None]
    dots.masked_fill_(~mask, float('-inf'))
    del mask

Suppose we input x with shape (2, 3, 5), where 2 is the batch size, 3 is the number of regions, and 5 is the feature size. Then the mask should have shape (2, 2), where the first 2 is the batch size and the second 2 is the number of regions. x has one more region than the mask because cls_token is concatenated into x. The line mask = F.pad(mask.flatten(1), (1, 0), value=True) then turns the mask shape into (2, 3).
However, when dots.masked_fill_(~mask, float('-inf')) runs, the shapes of dots and mask do not match: dots has shape (2, 5, 3, 3) (with 5 heads), while mask has shape (2, 3, 3).
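One possible fix, sketched under the assumption that dots has shape (batch, heads, n, n) and the incoming mask has shape (batch, n - 1) with True marking valid regions: give the pairwise mask a singleton head axis so it broadcasts against dots. The helper name `apply_padding_mask` is hypothetical and is used here only to make the snippet self-contained.

```python
import torch
import torch.nn.functional as F


def apply_padding_mask(dots, mask):
    """Fill masked positions of `dots` (batch, heads, n, n) with -inf.

    `mask` is (batch, n - 1), True for valid regions, without a cls slot.
    """
    # Prepend an always-valid slot for the cls token: (batch, n)
    mask = F.pad(mask.flatten(1), (1, 0), value=True)
    assert mask.shape[-1] == dots.shape[-1], 'mask has incorrect dimensions'
    # Pairwise validity with a singleton head axis: (batch, 1, n, n)
    mask = mask[:, None, :, None] & mask[:, None, None, :]
    # (batch, 1, n, n) broadcasts over the heads axis of dots
    return dots.masked_fill(~mask, float('-inf'))


dots = torch.zeros(2, 5, 3, 3)                      # batch=2, heads=5, n=3
mask = torch.tensor([[True, True], [True, False]])  # last region of sample 1 is padding
masked = apply_padding_mask(dots, mask)
```

Note that a query row belonging to a padded region ends up entirely -inf, so a softmax over that row yields NaN; such rows would need to be ignored or handled downstream.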

Code for Visual Transformer or Vision Transformer?

Hi! Thank you for your contribution. However, I found that the code includes a position embedding and a class token, which are specified in the Vision Transformer paper and not mentioned in the Visual Transformer paper. The activation function in the code is also GELU, whereas the Visual Transformer paper specifies ReLU. The code seems to be a mixture of the Visual Transformer and the Vision Transformer.

Static Tokenization

Hi,

It doesn't seem like you implement the tokenization step (section 3.1.1 of the paper). Do you have plans to add this?

Thanks.

Did you train a semantic segmentation example using this model?

some questions

Hello, it seems that your implementation differs from the original paper. The paper makes some modifications to the transformer layer, while you simply use the original transformer here. According to the paper, I think the transformer is just a single layer with a single head (simply T_in and T_out), so there are no such parameters to specify (number of heads, depth of the transformer, etc.).

Hi, regarding the nn1 of ViTResNet:

Hi, regarding the nn1 of ViTResNet
shouldn't the dim be:

nn.Linear(dim, num_classes)

and not nn.Linear(dim, mlp_dim)?

because we want to do classification.
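To make the shapes concrete, here is a minimal sketch of a single-layer classification head applied to the first (class) token. The sizes and the variable names `head` and `tokens` are assumptions for illustration, and the repository may apply additional layers after nn1.

```python
import torch
import torch.nn as nn

dim, num_classes, batch = 128, 10, 4

# Single linear layer mapping the token dimension straight to class logits
head = nn.Linear(dim, num_classes)

# Transformer output: one class token followed by 16 visual tokens (assumed layout)
tokens = torch.randn(batch, 1 + 16, dim)

# Classify from the class token only
logits = head(tokens[:, 0])
```

If the head were `nn.Linear(dim, mlp_dim)` instead, its output would have `mlp_dim` features per sample and a second layer would be needed to reach `num_classes`.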
