tahmid0007 / visualtransformers
A PyTorch implementation of the paper "Visual Transformers: Token-based Image Representation and Processing for Computer Vision"
Hi. Thanks for your work! Could you provide an alternative version for semantic segmentation? I see your version is for classification. Thanks!
Nice work!
The code only contains the 'BasicBlock' of ResNet, but not the 'Bottleneck' block needed for ResNet-50 or ResNet-101.
Could you update it?
...a little documentation would be appreciated.
I am especially having a hard time understanding the parameters of the main ViTResNet class.
I'm reviewing your code for the Visual Transformer. I have some questions and hope you can answer them.
First, what does token_wV mean in your code? Why do you multiply wV with the input feature? In the paper, the authors didn't do this step.
The second question is about the way you define self.token_wA and self.token_wV. Why are they defined with the batch size as their first dimension? Can I define them without the batch size and expand them in forward, like the cls_tokens? I don't want to use the model with a fixed batch size. I'm also not sure whether wA and wV are trainable in the model.
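For anyone with the same question, here is a minimal sketch of the batch-independent variant being asked about. It is not the repository's code: the class name `TokenWeights`, the tensor shapes, and the Xavier initialisation are assumptions made for illustration. The weights get a leading dimension of 1 and are expanded to the incoming batch size inside `forward`, the same pattern used for `cls_tokens`. Wrapping them in `nn.Parameter` registers them with the module, so they are trainable.

```python
import torch
import torch.nn as nn


class TokenWeights(nn.Module):
    """Hypothetical tokenizer holding batch-independent weights."""

    def __init__(self, num_tokens: int, channels: int):
        super().__init__()
        # Leading dimension is 1, not the batch size.
        # nn.Parameter registers the tensors as trainable weights.
        self.token_wA = nn.Parameter(torch.empty(1, num_tokens, channels))
        self.token_wV = nn.Parameter(torch.empty(1, channels, channels))
        nn.init.xavier_uniform_(self.token_wA)
        nn.init.xavier_uniform_(self.token_wV)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, positions, channels)
        b = x.shape[0]
        # expand() returns a view, so no per-sample copy of the weights is made.
        wA = self.token_wA.expand(b, -1, -1)  # (b, tokens, channels)
        wV = self.token_wV.expand(b, -1, -1)  # (b, channels, channels)
        # Attention of every token over all spatial positions.
        A = torch.einsum('bnc,blc->bln', x, wA).softmax(dim=-1)
        V = torch.einsum('bnc,bcd->bnd', x, wV)
        return torch.einsum('bln,bnd->bld', A, V)  # (b, tokens, channels)


tokenizer = TokenWeights(num_tokens=8, channels=64)
out_small = tokenizer(torch.randn(2, 196, 64))
out_large = tokenizer(torch.randn(7, 196, 64))
```

The same module instance accepts both batch sizes, and both weights appear in `tokenizer.parameters()`, so an optimizer would update them.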
Hi,
Can you please share the performance results and a trained model
for verifying the implementation?
In the forward function of Attention, the code is incorrect when the input mask is not None. See this code:
Suppose we input x with shape (2, 3, 5), where 2 is the batch size, 3 is the number of regions, and 5 is the feature size. Then we should input a mask with shape (2, 2), where the first 2 is the batch size and the second 2 is the number of regions. x has one more region than the mask because the cls_token has been added to x. The code (mask = F.pad(mask.flatten(1), (1, 0), value=True)) then pads the mask so that its shape becomes (2, 3).
However, when running the code (dots.masked_fill_(~mask, float('-inf'))), the shapes of dots and mask do not match. The shape of dots is (2, 5, 3, 3) (with 5 heads), while the shape of mask is (2, 3, 3), so the mask cannot be broadcast over the head dimension.
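The mismatch described above can be reproduced with a small sketch, using the shapes from this report. It is not the repository's code: the variable names are illustrative, and building the pairwise mask with `&` is an assumption about how the (2, 3, 3) mask is formed. One possible fix is to add a singleton head dimension so the mask broadcasts across all heads.

```python
import torch
import torch.nn.functional as F

batch, heads, regions = 2, 5, 2
n = regions + 1  # one extra position for the cls token

dots = torch.zeros(batch, heads, n, n)  # attention logits: (2, 5, 3, 3)
mask = torch.ones(batch, regions, dtype=torch.bool)  # (2, 2)
mask[1, -1] = False  # mask out the last region of the second sample

# Prepend True for the cls token: (2, 2) -> (2, 3)
mask = F.pad(mask.flatten(1), (1, 0), value=True)
# Pairwise mask, assumed here to be an outer AND: (2, 3, 3)
pair_mask = mask[:, None, :] & mask[:, :, None]

# (2, 3, 3) cannot broadcast to (2, 5, 3, 3): the batch axis of the mask
# lines up with the head axis of dots.
try:
    dots.clone().masked_fill_(~pair_mask, float('-inf'))
    broadcast_ok = True
except RuntimeError:
    broadcast_ok = False

# Possible fix: insert a head axis of size 1 so the mask has shape (2, 1, 3, 3).
fixed = dots.masked_fill(~pair_mask[:, None, :, :], float('-inf'))
```

Note that a query row whose keys are all masked would produce NaN after softmax, so masked positions may need separate handling downstream.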
Hi! Thank you for your contribution. However, I found that the code includes a position embedding and a class token, which are specified in the Vision Transformer paper and not mentioned in the Visual Transformer paper. The activation function in the code is also GELU, whereas the Visual Transformer paper specifies ReLU. The code seems to be a mixture of the Visual Transformer and the Vision Transformer.
Hi,
It doesn't seem like you implement the tokenization step (section 3.1.1 of the paper). Do you have plans to add this?
Thanks.
Did you train a semantic segmentation example using this model?
Hello, it seems that your implementation differs from the original paper. The paper makes some modifications to the transformer layer, while you simply use the original transformer here. According to the paper, I think the transformer is just a single, single-headed layer (simply T_in and T_out), so there are no parameters to specify (number of heads, transformer depth, etc.).
Hi, regarding the nn1 layer of ViTResNet:
shouldn't it be
nn.Linear(dim, num_classes)
and not nn.Linear(dim, mlp_dim)?
After all, we want to do classification.