
visualtransformers's Introduction

(PyTorch) Visual Transformers: Token-based Image Representation and Processing for Computer Vision

A PyTorch implementation of the paper "Visual Transformers: Token-based Image Representation and Processing for Computer Vision". Find the original paper here.

  • This PyTorch implementation is based on this repo. The default dataset is CIFAR10, which can easily be changed to ImageNet or any other dataset (see the sketch below).
  • You might need to install einops.
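For illustration, here is a minimal sketch of how the dataset swap might look; the transform, batch size, and data path below are assumptions for the example, not the repo's actual defaults:

import torch
import torchvision
import torchvision.transforms as transforms

# Illustrative preprocessing; the repo's actual transforms may differ.
transform = transforms.Compose([transforms.ToTensor()])

# CIFAR10 is the default. Swapping datasets is a one-line change, e.g.
# torchvision.datasets.ImageFolder pointed at an ImageNet directory.
train_set = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=64, shuffle=True)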

visualtransformers's People

Contributors

tahmid0007


visualtransformers's Issues

Static Tokenization

Hi,

It doesn't seem like you implement the tokenization step (section 3.1.1 of the paper). Do you have plans to add this?

Thanks.
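For reference, a minimal sketch of what the filter-based (static) tokenizer of section 3.1.1 could look like in PyTorch. It illustrates the paper's equation T = softmax_HW(X W_A)^T X; the class and parameter names are mine, not the repo's:

import torch
import torch.nn as nn

class FilterBasedTokenizer(nn.Module):
    def __init__(self, channels, num_tokens):
        super().__init__()
        # W_A projects each pixel's features to one score per token.
        self.w_a = nn.Linear(channels, num_tokens, bias=False)

    def forward(self, x):                  # x: (B, HW, C) flattened feature map
        a = self.w_a(x).softmax(dim=1)     # (B, HW, L), normalized over pixels
        return a.transpose(1, 2) @ x       # tokens T: (B, L, C)

# e.g. FilterBasedTokenizer(256, 16)(torch.randn(2, 196, 256)) -> (2, 16, 256)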

Performances

Hi,
Could you please share performance numbers and a trained model for verifying the implementation?

Code for Visual Transformer or Vision Transformer?

Hi! Thank you for your contribution. However, I found that the code includes a position embedding and a class token, which are specified in the Vision Transformer paper but not mentioned in the Visual Transformer paper. I also see that the activation function in the code is GELU, whereas the Visual Transformer paper indicates ReLU. The code seems to be a mixture of Visual Transformer and Vision Transformer.
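To make the activation point concrete, a toy sketch of the two feed-forward variants (the dimensions are arbitrary; this is not the repo's actual FeedForward class):

import torch.nn as nn

# What the code reportedly does (Vision Transformer style):
ff_gelu = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 128))

# What the Visual Transformer paper indicates (ReLU):
ff_relu = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))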

Mask shape is not correct

In the Attention.forward function, if the input mask is not None, the code is not correct. See this code:

if mask is not None:
    mask = F.pad(mask.flatten(1), (1, 0), value=True)
    assert mask.shape[-1] == dots.shape[-1], 'mask has incorrect dimensions'
    mask = mask[:, None, :] * mask[:, :, None]
    dots.masked_fill_(~mask, float('-inf'))
    del mask

Suppose we input x with shape (2, 3, 5), where 2 is the batch size, 3 is the number of regions, and 5 is the feature size. Then we should input a mask with shape (2, 2), where 2 is the batch size and 2 is the number of regions; x has one more region than the mask because the cls_token is concatenated onto x. The line mask = F.pad(mask.flatten(1), (1, 0), value=True) then turns the mask's shape into (2, 3).
However, when dots.masked_fill_(~mask, float('-inf')) runs, the shapes of dots and mask do not match: with heads=5, dots has shape (2, 5, 3, 3), while the mask has shape (2, 3, 3), which cannot broadcast against it.
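One possible fix, sketched as a self-contained example under the assumption that dots has shape (batch, heads, N, N): insert a head dimension into the mask before masked_fill_ so it can broadcast. This is a suggestion, not a patch from the author:

import torch
import torch.nn.functional as F

B, H, N = 2, 5, 3                        # batch, heads, tokens (2 regions + cls)
dots = torch.randn(B, H, N, N)           # attention logits
mask = torch.tensor([[True, True],
                     [True, False]])     # (B, N - 1): validity of each region

mask = F.pad(mask.flatten(1), (1, 0), value=True)  # prepend cls slot -> (B, N)
mask = mask[:, None, :] * mask[:, :, None]         # outer product -> (B, N, N)
dots.masked_fill_(~mask[:, None, :, :], float('-inf'))  # head dim broadcasts over H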

Semantic Segmentation

Hi. Thanks for your work! Would it be possible for you to provide an alternative version for semantic segmentation? I see that this version is for classification. Thanks!

some questions

Hello, it seems that your implementation differs from the original paper: the paper makes some modifications to the transformer layer, while you simply use the original transformer here. Also, according to the paper, the transformer layer is just a single, single-headed layer (simply T_in and T_out), so there are no parameters to specialize (number of heads, depth of the transformer, etc.).
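For comparison, a minimal sketch of the single-headed, single-layer token transformer the paper describes (mapping T_in to T_out with a small feed-forward); the names and hidden size are assumptions, not the paper's exact notation:

import torch
import torch.nn as nn

class TokenTransformer(nn.Module):
    def __init__(self, channels, hidden):
        super().__init__()
        self.key = nn.Linear(channels, channels, bias=False)
        self.query = nn.Linear(channels, channels, bias=False)
        self.f1 = nn.Linear(channels, hidden, bias=False)
        self.f2 = nn.Linear(hidden, channels, bias=False)

    def forward(self, t_in):               # t_in: (B, L, C) visual tokens
        # Single-headed self-attention over the tokens, with a residual.
        attn = (self.key(t_in) @ self.query(t_in).transpose(1, 2)).softmax(dim=-1)
        t = t_in + attn @ t_in
        # Pointwise feed-forward with ReLU, as the paper indicates, plus a residual.
        return t + self.f2(torch.relu(self.f1(t)))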

Question on token

I'm reviewing your code on the Visual Transformer. I have some questions and hope to get your answers.

First, what does token_wV mean in your code? Why do you do a multiplication between wV and the input feature? In the paper, the author didn't do this step.


The second question is about the way you define self.token_wA and token_wV. Why have you defined them with the batch size? Can I define them without the batch size and expand them in forward like the cls_tokens? I don't want to use the model with a fixed batch size. I'm also not sure whether token_wA and token_wV are trainable in the model.
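To illustrate the batch-independent alternative the question asks about, here is a sketch where token_wA and token_wV are nn.Parameter (hence trainable) defined without a batch dimension and expanded in forward, the way cls_tokens usually are. The shapes and names are guesses at the repo's intent, not its actual code:

import torch
import torch.nn as nn
from einops import repeat

class Tokenizer(nn.Module):
    def __init__(self, num_tokens, channels):
        super().__init__()
        # nn.Parameter makes both matrices trainable; no batch dimension here.
        self.token_wA = nn.Parameter(torch.randn(1, num_tokens, channels))
        self.token_wV = nn.Parameter(torch.randn(1, channels, channels))

    def forward(self, x):                      # x: (B, N, C)
        b = x.shape[0]
        wa = repeat(self.token_wA, '() l c -> b l c', b=b)
        wv = repeat(self.token_wV, '() c d -> b c d', b=b)
        attn = (wa @ x.transpose(1, 2)).softmax(dim=-1)  # (B, L, N)
        return attn @ (x @ wv)                           # tokens: (B, L, C)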

No Documentation!

...a little documentation would be appreciated.

I am especially having a hard time understanding the parameters of the main ViTResNet class.

Hi, regarding the nn1 of ViTResNet

Shouldn't the dim be:

nn.Linear(dim, num_classes)

and not nn.Linear(dim, mlp_dim)?

because we want to do classification.
