
visualtransformers's Introduction

(PyTorch) Visual Transformers: Token-based Image Representation and Processing for Computer Vision

A PyTorch implementation of the paper "Visual Transformers: Token-based Image Representation and Processing for Computer Vision". Find the original paper here.

  • This PyTorch implementation is based on this repo. The default dataset is CIFAR10, which can easily be changed to ImageNet or any other dataset (see the sketch below).
  • You might need to install einops.
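For illustration, here is a minimal sketch of how the dataset swap might look; the transform, batch size, and data path below are assumptions for the example, not the repo's actual defaults:

import torch
import torchvision
import torchvision.transforms as transforms

# Illustrative preprocessing; the repo's actual transforms may differ.
transform = transforms.Compose([transforms.ToTensor()])

# CIFAR10 is the default. Swapping datasets is a one-line change, e.g.
# torchvision.datasets.ImageFolder pointed at an ImageNet directory.
train_set = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=64, shuffle=True)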

visualtransformers's People

Contributors

tahmid0007


visualtransformers's Issues

Static Tokenization

Hi,

It doesn't seem like you implement the tokenization step (section 3.1.1 of the paper). Do you have plans to add this?

Thanks.
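For reference, a minimal sketch of what the filter-based (static) tokenizer of section 3.1.1 could look like in PyTorch. It illustrates the paper's equation T = softmax_HW(X W_A)^T X; the class and parameter names are mine, not the repo's:

import torch
import torch.nn as nn

class FilterBasedTokenizer(nn.Module):
    def __init__(self, channels, num_tokens):
        super().__init__()
        # W_A projects each pixel's features to one score per token.
        self.w_a = nn.Linear(channels, num_tokens, bias=False)

    def forward(self, x):                  # x: (B, HW, C) flattened feature map
        a = self.w_a(x).softmax(dim=1)     # (B, HW, L), normalized over pixels
        return a.transpose(1, 2) @ x       # tokens T: (B, L, C)

# e.g. FilterBasedTokenizer(256, 16)(torch.randn(2, 196, 256)) -> (2, 16, 256)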

Performances

Hi,
Could you please share performance numbers and a trained model for verifying the implementation?

Code for Visual Transformer or Vision Transformer?

Hi! Thank you for your contribution. However, I found that the code includes a position embedding and a class token, which are specified in the Vision Transformer paper but not mentioned in the Visual Transformer paper. I also see that the activation function in the code is GELU, whereas the Visual Transformer paper indicates ReLU. The code seems to be a mixture of Visual Transformer and Vision Transformer.
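To make the activation point concrete, a toy sketch of the two feed-forward variants (the dimensions are arbitrary; this is not the repo's actual FeedForward class):

import torch.nn as nn

# What the code reportedly does (Vision Transformer style):
ff_gelu = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 128))

# What the Visual Transformer paper indicates (ReLU):
ff_relu = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 128))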

Mask shape is not correct

In the Attention.forward function, if the input mask is not None, the code is not correct. See this code:

if mask is not None:
    mask = F.pad(mask.flatten(1), (1, 0), value=True)
    assert mask.shape[-1] == dots.shape[-1], 'mask has incorrect dimensions'
    mask = mask[:, None, :] * mask[:, :, None]
    dots.masked_fill_(~mask, float('-inf'))
    del mask

Suppose we input x with shape (2, 3, 5), where 2 is the batch size, 3 is the number of regions, and 5 is the feature size. Then we should input a mask with shape (2, 2), where 2 is the batch size and 2 is the number of regions; x has one more region than the mask because the cls_token is concatenated onto x. The line mask = F.pad(mask.flatten(1), (1, 0), value=True) then turns the mask's shape into (2, 3).
However, when dots.masked_fill_(~mask, float('-inf')) runs, the shapes of dots and mask do not match: with heads=5, dots has shape (2, 5, 3, 3), while the mask has shape (2, 3, 3), which cannot broadcast against it.
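One possible fix, sketched as a self-contained example under the assumption that dots has shape (batch, heads, N, N): insert a head dimension into the mask before masked_fill_ so it can broadcast. This is a suggestion, not a patch from the author:

import torch
import torch.nn.functional as F

B, H, N = 2, 5, 3                        # batch, heads, tokens (2 regions + cls)
dots = torch.randn(B, H, N, N)           # attention logits
mask = torch.tensor([[True, True],
                     [True, False]])     # (B, N - 1): validity of each region

mask = F.pad(mask.flatten(1), (1, 0), value=True)  # prepend cls slot -> (B, N)
mask = mask[:, None, :] * mask[:, :, None]         # outer product -> (B, N, N)
dots.masked_fill_(~mask[:, None, :, :], float('-inf'))  # head dim broadcasts over H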

Semantic Segmentation

Hi. Thanks for your work! Would it be possible for you to provide an alternative version for semantic segmentation? I see that this version is for classification. Thanks!

some questions

Hello, it seems that your implementation differs from the original paper: the paper makes some modifications to the transformer layer, while you simply use the original transformer here. Also, according to the paper, the transformer layer is just a single, single-headed layer (simply T_in and T_out), so there are no parameters to specialize (number of heads, depth of the transformer, etc.).
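For comparison, a minimal sketch of the single-headed, single-layer token transformer the paper describes (mapping T_in to T_out with a small feed-forward); the names and hidden size are assumptions, not the paper's exact notation:

import torch
import torch.nn as nn

class TokenTransformer(nn.Module):
    def __init__(self, channels, hidden):
        super().__init__()
        self.key = nn.Linear(channels, channels, bias=False)
        self.query = nn.Linear(channels, channels, bias=False)
        self.f1 = nn.Linear(channels, hidden, bias=False)
        self.f2 = nn.Linear(hidden, channels, bias=False)

    def forward(self, t_in):               # t_in: (B, L, C) visual tokens
        # Single-headed self-attention over the tokens, with a residual.
        attn = (self.key(t_in) @ self.query(t_in).transpose(1, 2)).softmax(dim=-1)
        t = t_in + attn @ t_in
        # Pointwise feed-forward with ReLU, as the paper indicates, plus a residual.
        return t + self.f2(torch.relu(self.f1(t)))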

Question on token

I'm reviewing your code on the Visual Transformer. I have some questions and hope to get your answers.

First, what does token_wV mean in your code? Why do you do a multiplication between wV and the input feature? In the paper, the author didn't do this step.


The second question is about the way you define self.token_wA and token_wV. Why have you defined them with the batch size? Can I define them without the batch size and expand them in forward like the cls_tokens? I don't want to use the model with a fixed batch size. I'm also not sure whether token_wA and token_wV are trainable in the model.
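To illustrate the batch-independent alternative the question asks about, here is a sketch where token_wA and token_wV are nn.Parameter (hence trainable) defined without a batch dimension and expanded in forward, the way cls_tokens usually are. The shapes and names are guesses at the repo's intent, not its actual code:

import torch
import torch.nn as nn
from einops import repeat

class Tokenizer(nn.Module):
    def __init__(self, num_tokens, channels):
        super().__init__()
        # nn.Parameter makes both matrices trainable; no batch dimension here.
        self.token_wA = nn.Parameter(torch.randn(1, num_tokens, channels))
        self.token_wV = nn.Parameter(torch.randn(1, channels, channels))

    def forward(self, x):                      # x: (B, N, C)
        b = x.shape[0]
        wa = repeat(self.token_wA, '() l c -> b l c', b=b)
        wv = repeat(self.token_wV, '() c d -> b c d', b=b)
        attn = (wa @ x.transpose(1, 2)).softmax(dim=-1)  # (B, L, N)
        return attn @ (x @ wv)                           # tokens: (B, L, C)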

No Documentation!

...a little documentation would be appreciated.

I am especially having a hard time understanding the parameters of the main ViTResNet class.

Hi, regarding the nn1 of ViTResNet

Shouldn't the dim be:

nn.Linear(dim, num_classes)

and not nn.Linear(dim, mlp_dim)?

because we want to do classification.
