
external-attention-pytorch's Introduction

external-attention-pytorch's People

Contributors

epsilon-deltta, iscyy, rushi-the-neural-arch, wmkai, xmu-xiaoma666

external-attention-pytorch's Issues

How to use attention module efficiently?

Hi, this repository helped me a lot, thank you.

By the way, I have a question.
Is there a way to apply attention to only certain parts of the image?

In other words, is there a way to specify the part of the image that needs attention?

I want to use the attention modules more efficiently in CV tasks.
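
One common workaround, for what it's worth, is to crop the region of interest, run the attention module on the crop only, and paste the result back. A minimal sketch, assuming any shape-preserving attention module from this repo (the ROI bounds and the `attn` placeholder below are purely illustrative):

    import torch
    import torch.nn as nn

    # Placeholder: swap in any shape-preserving attention module from the repo.
    attn = nn.Identity()

    feat = torch.randn(1, 64, 56, 56)      # (bs, c, h, w) feature map
    y1, y2, x1, x2 = 16, 48, 16, 48        # hypothetical ROI bounds

    roi = attn(feat[:, :, y1:y2, x1:x2])   # attend only to the cropped region
    out = feat.clone()
    out[:, :, y1:y2, x1:x2] = roi          # paste the attended crop back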

Question about Polarized Self-Attention

Hello, thank you very much for your work! In PolarizedSelfAttention.py, when the channel branch and the spatial branch are composed sequentially, does the module output still need to add the channel branch's output channel_out? In other words, is line 88 of the code necessary, or should it simply return spatial_out?
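
For reference, a minimal sketch of the two compositions being asked about; `channel_branch` and `spatial_branch` are hypothetical stand-ins for the PSA sub-modules, not the repo's code:

    def psa_parallel(x, channel_branch, spatial_branch):
        # parallel composition: both branches see x and the outputs are summed
        return channel_branch(x) + spatial_branch(x)

    def psa_sequential(x, channel_branch, spatial_branch):
        # sequential composition: the spatial branch already operates on the
        # channel branch's output, so returning spatial_out alone would suffice
        channel_out = channel_branch(x)
        spatial_out = spatial_branch(channel_out)
        return spatial_out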

Question about the Linear layers in the WeightedPermuteMLP code

WeightedPermuteMLP uses several fully connected (nn.Linear) layers, defined at lines 21-23 of ViP.py:

        self.mlp_c=nn.Linear(dim,dim,bias=qkv_bias)
        self.mlp_h=nn.Linear(dim,dim,bias=qkv_bias)
        self.mlp_w=nn.Linear(dim,dim,bias=qkv_bias)

All of these linear layers have dim input and output features, i.e. the number of channels is unchanged.
In forward, mlp_c is applied directly to x, which is fine:

    def forward(self,x) :
        B,H,W,C=x.shape

        c_embed=self.mlp_c(x)

        S=C//self.seg_dim
        h_embed=x.reshape(B,H,W,self.seg_dim,S).permute(0,3,2,1,4).reshape(B,self.seg_dim,W,H*S)
        h_embed=self.mlp_h(h_embed).reshape(B,self.seg_dim,W,H,S).permute(0,3,2,1,4).reshape(B,H,W,C)

        w_embed=x.reshape(B,H,W,self.seg_dim,S).permute(0,3,1,2,4).reshape(B,self.seg_dim,H,W*S)
        w_embed=self.mlp_w(w_embed).reshape(B,self.seg_dim,H,W,S).permute(0,2,3,1,4).reshape(B,H,W,C)

        weight=(c_embed+h_embed+w_embed).permute(0,3,1,2).flatten(2).mean(2)
        weight=self.reweighting(weight).reshape(B,C,3).permute(2,0,1).softmax(0).unsqueeze(2).unsqueeze(2)

        x=c_embed*weight[0]+w_embed*weight[1]+h_embed*weight[2]

        x=self.proj_drop(self.proj(x))
        return x

The other two linear layers are problematic when used this way.
Look at this step:

h_embed=x.reshape(B,H,W,self.seg_dim,S).permute(0,3,2,1,4).reshape(B,self.seg_dim,W,H*S)

The last dimension is changed to H*S, so at execution time, whenever H*S is not equal to C the following linear layer fails; in fact this step is bound to go wrong in that case.
The paper's own code handles it in a similar way, so I'm not sure how to resolve this.
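
A hedged reading of the shapes involved: with S = C // seg_dim, the last dimension fed to mlp_h is H*S, which equals dim (= C) only when seg_dim == H (and likewise seg_dim == W for mlp_w). A quick check with illustrative sizes:

    # Illustrative sizes only, not repo code.
    B, H, W, C, seg_dim = 2, 8, 8, 64, 8   # choosing seg_dim == H == W
    S = C // seg_dim
    assert H * S == C and W * S == C, \
        "mlp_h / mlp_w only match nn.Linear(dim, dim) when seg_dim == H == W"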

BUG in CoAtNet

Hello, I have recently been using your CoAtNet implementation for some research. After feeding images into it, I found that the final output stride is 16 rather than the 32 stated in the paper. Looking at the source, it seems the last two downsampling steps use 1-D max pooling, so two 1-D pooling steps are needed before the stride actually doubles.

One idea: when the feature map passes through the last two self-attention stages, reshape it back to (B, C, H, W) with view and permute before pooling, apply 2-D max pooling, and then view/permute it back to the input format expected by the self-attention block (a somewhat convoluted approach).

One more question: when downsampling, why do you use max pooling even in the convolutional part instead of a stride-2 convolution? Did you find a justification for this, or is it just simpler to write for now? If I simply missed it in the paper, I'll go stand in the corner for a minute.
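
For reference, a minimal sketch of the reshape-then-2-D-pool workaround described above, assuming a (B, N, C) token layout with N = H*W (this is not the repo's code):

    import torch
    import torch.nn.functional as F

    def downsample_tokens_2d(tokens, h, w):
        """tokens: (B, N, C) with N == h*w; returns downsampled tokens and the new h, w."""
        b, n, c = tokens.shape
        assert n == h * w
        x = tokens.transpose(1, 2).reshape(b, c, h, w)  # back to (B, C, H, W)
        x = F.max_pool2d(x, kernel_size=2)              # one true 2-D downsample: stride doubles once
        b, c, h2, w2 = x.shape
        return x.reshape(b, c, h2 * w2).transpose(1, 2), h2, w2

    tok = torch.randn(1, 14 * 14, 96)
    tok, h, w = downsample_tokens_2d(tok, 14, 14)       # -> (1, 49, 96), h = w = 7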

Does init_weights need to be called explicitly?

Hi, I read your code and noticed that you implement an init_weights method in every attention module (e.g. this link), but it is never called. Does it need to be called explicitly, or will PyTorch call it automatically during initialization?
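
For what it's worth, PyTorch only runs each layer's default reset_parameters(); it will not call a custom init_weights() automatically, so it does need to be called explicitly (typically at the end of __init__). A minimal sketch of that pattern, not the repo's code:

    import torch.nn as nn

    class ToyBlock(nn.Module):
        def __init__(self, dim=64):
            super().__init__()
            self.fc = nn.Linear(dim, dim)
            self.init_weights()                      # explicit call is required

        def init_weights(self):
            for m in self.modules():
                if isinstance(m, nn.Linear):
                    nn.init.normal_(m.weight, std=0.001)
                    if m.bias is not None:
                        nn.init.constant_(m.bias, 0)

        def forward(self, x):
            return self.fc(x)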

About coatnet

I feel the CoAtNet implementation here has problems in quite a few places (and, to be fair, the CoAtNet paper itself leaves many details unexplained).
The most important concept, in my view, is what the authors call relative attention. The paper does not dwell on the concept itself, but it builds its combined convolution/self-attention weight formula on top of it. Most importantly, the authors fuse convolution with the transformer by introducing a global static convolution kernel (put more simply: the blocks in the paper's model figure are Rel-Attention, not plain Attention). Honestly, I do not see this global static kernel anywhere in your implementation.
Also, I don't seem to see any residual connections, i.e. x = out + x.
Sorry, it's late at night and my head is a bit fuzzy, so some of the wording may be off, but I think the core issues come across.
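
For reference, a minimal 1-D sketch of the relative-attention idea being described: the attention logits receive an input-independent, learned bias w_{i-j} (the "global static kernel"), and the block output is added back to the input as a residual. This is only an illustration of the concept, not a claim about how the repo's CoAtNet should be written:

    import torch
    import torch.nn as nn

    class RelAttention1D(nn.Module):
        def __init__(self, dim, n_tokens):
            super().__init__()
            self.qkv = nn.Linear(dim, 3 * dim, bias=False)
            self.rel_bias = nn.Parameter(torch.zeros(2 * n_tokens - 1))  # one weight per offset i-j
            idx = torch.arange(n_tokens)
            self.register_buffer("offset", idx[None, :] - idx[:, None] + n_tokens - 1)

        def forward(self, x):                                 # x: (B, N, C)
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
            logits = logits + self.rel_bias[self.offset]      # static, input-independent term
            out = logits.softmax(dim=-1) @ v
            return out + x                                    # the residual the issue also asks about

    print(RelAttention1D(dim=32, n_tokens=49)(torch.randn(2, 49, 32)).shape)  # torch.Size([2, 49, 32])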

Question about PSA

The PSA code seems to apply convolutions at four different scales to the original feature map and then concatenate the four resulting feature maps, right? A post on Zhihu says the feature map is first split into four parts, which are then convolved and concatenated.

Some errors when using PSA

I wanted to insert the PSA module into FCOS, but I ran into some errors.
Fortunately, I have solved them, and I will show the changes I made.
There are three changes. The first one is in the __init__ function.

def __init__(self, channel=512, reduction=4, S=4):
    super().__init__()
    self.S = S

    self.convs = nn.ModuleList([])
    for i in range(S):
        # Add groups
        self.convs.append(nn.Conv2d(channel//S, channel//S, kernel_size=2*(i+1)+1, padding=i+1, groups=2**i))

    self.se_blocks = nn.ModuleList([])
    for i in range(S):
        self.se_blocks.append(nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channel//S, channel//(S*reduction), kernel_size=1, bias=False),
            nn.ReLU(inplace=False),
            nn.Conv2d(channel//(S*reduction), channel//S, kernel_size=1, bias=False),
            nn.Sigmoid()
        ))

    self.softmax = nn.Softmax(dim=1)

I found that the original EPSANet uses different groups settings for the different SPC convolutions, so I added groups in the same places. But I am not sure this is right.
Second, also in this function, I changed self.convs=[] and self.se_blocks=[] to self.convs=nn.ModuleList([]) and self.se_blocks=nn.ModuleList([]). When I trained my model on the GPU, the initial weights were not moved to the GPU while they were plain Python lists; using nn.ModuleList() fixes that.
Third, in the forward function. When I trained the model, I got this error:
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 128, 32, 32]], which is output 0 of SliceBackward, is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
After trying many different ways, I found the reason: the error is caused by variables being modified in place inside the forward function. Here is my code.
def forward(self, x):
    b, c, h, w = x.size()

    # Step1: SPC module
    PSA_input = x.view(b, self.S, c//self.S, h, w)  # bs,s,ci,h,w
    outs = []
    for idx, conv in enumerate(self.convs):
        SPC_input = PSA_input[:, idx, :, :, :]
        # SPC_out[:,idx,:,:,:] = conv(SPC_input)   # old in-place version
        outs.append(conv(SPC_input))
    SPC_out = torch.stack(outs, dim=1)

    # Step2: SE weight
    outs = []
    for idx, se in enumerate(self.se_blocks):
        SE_input = SPC_out[:, idx, :, :, :]
        # SE_out[:,idx,:,:,:] = se(SE_input)       # old in-place version
        outs.append(se(SE_input))
    SE_out = torch.stack(outs, dim=1)

    # Step3: Softmax
    softmax_out = self.softmax(SE_out)

    # Step4: SPA
    PSA_out = SPC_out * softmax_out
    PSA_out = PSA_out.view(b, -1, h, w)

    return PSA_out

Now my model with PSA can be trained, but I don't know the results yet.
I think these changes don't change the structure of PSA, but I am not sure. I hope you can check whether these changes are right.
Finally, thanks for your work.

Maybe an error in SKAttention?

First of all, thank you for your work.
attention_weughts=self.softmax(attention_weughts)#k,bs,channel,1,1
At this point attention_weughts has shape (k, bs, channel, 1, 1), and the softmax should be taken over the k dimension.
So in SKAttention, should self.softmax=nn.Softmax(dim=1) be changed to self.softmax=nn.Softmax(dim=0)?
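
A quick check of the dimension in question (illustrative only):

    import torch

    w = torch.randn(3, 4, 64, 1, 1)                   # (k, bs, channel, 1, 1)
    print(w.softmax(dim=0).sum(dim=0).flatten()[:3])  # all ≈ 1.0: dim=0 normalises over the k branches
    print(w.softmax(dim=1).sum(dim=1).flatten()[:3])  # all ≈ 1.0 over the batch dimension instead
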
@xmu-xiaoma666 Looking forward to your reply.

About HaloNet

Hi, when running the HaloNet core code, I found that two different randomly generated inputs produce the same output. What could be causing this?

Maybe an error in mlp/mlp_mixer.py

Dear Author:
Hello.
I found an issue here. After reading the paper, the skip-connections should be as in the figure below:
[figure from the MLP-Mixer paper showing the skip-connections in a Mixer block]

And the code here should be:

class MixerBlock(nn.Module):
    def __init__(self,tokens_mlp_dim=16,channels_mlp_dim=1024,tokens_hidden_dim=32,channels_hidden_dim=1024):
        super().__init__()
        self.ln=nn.LayerNorm(channels_mlp_dim)
        self.tokens_mlp_block=MlpBlock(tokens_mlp_dim,mlp_dim=tokens_hidden_dim)
        self.channels_mlp_block=MlpBlock(channels_mlp_dim,mlp_dim=channels_hidden_dim)

    def forward(self,x):
        """
        x: (bs,tokens,channels)
        """
        ### tokens mixing
        y=self.ln(x)
        y=y.transpose(1,2) #(bs,channels,tokens)
        y=self.tokens_mlp_block(y) #(bs,channels,tokens)
        ### channels mixing
        y=y.transpose(1,2) #(bs,tokens,channels)
        # fixme: start
        out=x+y #(bs,tokens,channels)
        y=self.ln(out) #(bs,tokens,channels)
        y=out+self.channels_mlp_block(y) #(bs,tokens,channels)
        # fixme: end
        return y

Looking forward to your reply!
Best wishes!

Question about the input and output parameters

Thank you for putting this together, it is very helpful. I have a few questions about the parameters and hope you can help:
from attention.DANet import DAModule
import torch

input=torch.randn(50,512,7,7) → (512,7,7) is the feature-map size, right? And what does the 50 mean, the batch size?
danet=DAModule(d_model=512,kernel_size=3,H=7,W=7) → what does d_model stand for?
print(danet(input).shape)
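
My reading of those arguments, based only on the snippet above (a hedged annotation, not an authoritative answer): the tensor layout is (batch_size, channels, H, W), and d_model is the channel dimension the attention operates on.

    import torch
    from attention.DANet import DAModule      # same import as in the snippet above

    input = torch.randn(50, 512, 7, 7)        # (batch_size=50, channels=512, H=7, W=7)
    danet = DAModule(d_model=512, kernel_size=3, H=7, W=7)  # d_model = number of input channels
    print(danet(input).shape)                 # expected: torch.Size([50, 512, 7, 7])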

CondConv and DynamicConv are the same

I think there is a problem with the CondConv implementation: it is almost identical to DynamicConv and differs from the architecture in the original paper (according to the paper, it should not have an attention module).
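
For reference, a sketch of the routing difference as I read the two papers (not this repo's code): CondConv weights its expert kernels with a sigmoid routing function, while DynamicConv uses a temperature-softened softmax over the K kernels.

    import torch

    def condconv_routing(x, w_r):                  # x: (B, C, H, W), w_r: (C, K)
        pooled = x.mean(dim=(2, 3))                # global average pooling -> (B, C)
        return torch.sigmoid(pooled @ w_r)         # (B, K): independent weight per expert kernel

    def dynamicconv_routing(x, w_r, temperature=30.0):
        pooled = x.mean(dim=(2, 3))
        return torch.softmax(pooled @ w_r / temperature, dim=1)  # (B, K): weights sum to 1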

For ExternalAttention, should queries have shape (bs, n, c)? Does c refer to channels?

def forward(self, queries):
    attn=self.mk(queries) #bs,n,S
    attn=self.softmax(attn) #bs,n,S
    attn=attn/torch.sum(attn,dim=2,keepdim=True) #bs,n,S
    out=self.mv(attn) #bs,n,d_model
    return out

I want to use this in segmentation code. For queries of shape (bs, n, c), is c split by class, or is it simply the number of channels of my feature map?
My feature maps are (bs, c, m, n), i.e. bs maps with c channels and spatial size m*n. Do I need to convert them to the (bs, n, c) format first?
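
My hedged reading: c is simply the channel dimension of your feature map (it becomes d_model), not anything class-related, and n is the number of spatial positions. A sketch of the reshape for a (bs, c, m, n) feature map:

    import torch

    feat = torch.randn(2, 512, 16, 16)                       # (bs, c, m, n) from the backbone
    bs, c, m, n = feat.shape
    tokens = feat.flatten(2).transpose(1, 2)                  # (bs, m*n, c): feed this to ExternalAttention(d_model=c)
    # out = external_attention(tokens)                        # hypothetical call on the module
    # feat_out = out.transpose(1, 2).reshape(bs, c, m, n)     # back to (bs, c, m, n) for the segmentation head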

Request for sample code

Regarding where to embed the attention modules: could anyone please provide examples of embedding each attention module into a model structure such as AlexNet?
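
Until an official example appears, here is a minimal sketch (not from the repo) of where an attention block is usually inserted in a plain CNN: after a conv stage and before the next downsampling. `attn` is a placeholder for any shape-preserving attention module from this repo, built for 64 channels here.

    import torch
    import torch.nn as nn

    class TinyNetWithAttention(nn.Module):
        def __init__(self, attn: nn.Module, num_classes=10):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.attn = attn                          # e.g. a channel-attention block built for 64 channels
            self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
            self.fc = nn.Linear(128, num_classes)

        def forward(self, x):
            x = self.stage1(x)
            x = self.attn(x)                          # attention on the (B, 64, H, W) feature map
            x = self.stage2(x).flatten(1)
            return self.fc(x)

    model = TinyNetWithAttention(attn=nn.Identity())  # swap Identity for a real attention module
    print(model(torch.randn(2, 3, 32, 32)).shape)     # torch.Size([2, 10])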

Bug in CBAM

[screenshot of the CBAM code in question]

Hello, should self.maxpool be an adaptive max pooling (nn.AdaptiveMaxPool2d)?
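
If the intent is a global max pool down to (B, C, 1, 1) regardless of the input size, the usual choice would be the following (a guess at the intended fix, not a confirmed patch):

    import torch
    import torch.nn as nn

    maxpool = nn.AdaptiveMaxPool2d(1)                  # global max pooling to (B, C, 1, 1)
    print(maxpool(torch.randn(2, 64, 14, 14)).shape)   # torch.Size([2, 64, 1, 1])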

MobileViT's structure output is not consistent with the paper

Thank you for the great work.
Here is the paper's architecture figure:
[image failed to upload]

I printed the layer inputs and outputs below:
0 fc x.shape torch.Size([1, 3, 224, 224])
1 fc y.shape torch.Size([1, 16, 112, 112])
2 fc y.shape torch.Size([1, 16, 112, 112])
3 fc y.shape torch.Size([1, 24, 112, 112])
4 fc y.shape torch.Size([1, 24, 112, 112])
5 fc y.shape torch.Size([1, 24, 112, 112])
m_vits 1 b y.shape torch.Size([1, 48, 112, 112])
m_vits 1 b y.shape torch.Size([1, 48, 112, 112])
m_vits 2 b y.shape torch.Size([1, 64, 112, 112])
m_vits 2 b y.shape torch.Size([1, 64, 112, 112])
m_vits 3 b y.shape torch.Size([1, 80, 112, 112])
m_vits 3 b y.shape torch.Size([1, 80, 112, 112])
2222 fc y.shape torch.Size([1, 320, 112, 112])
3 fc y.shape torch.Size([1, 3595520])

Error when using ExternalAttention

Has anyone run into this error?
RuntimeError: mat1 dim 1 must match mat2 dim 0
The nn.Linear setup in the program looks fine:
(mk): Linear(in_features=576, out_features=8, bias=False)
(mv): Linear(in_features=8, out_features=576, bias=False)

Also, the main program works fine when using EAattention.

Citing this work

Hello. I am currently writing a paper that uses some of the algorithms you reimplemented, so in addition to citing the original authors I would also like to cite your work. Should I cite your GitHub repository directly, or is there a related paper I can cite? Thank you very much; your code has brought a lot of convenience and new ideas to our research.

How is the reproduced performance?

Hi, this is a great and concise reimplementation of the MLP works. I'm wondering how the performance of the reimplemented versions compares to the results reported in the original manuscripts. It would be much appreciated if you could share experimental results.

CoAtNet no residuals

Hi guys,

I've noticed that your CoAtNet has no residual connections or normalization layers.

Padding value should match the dilation value

Hi

When I checked this line, I realized that to keep the spatial size unchanged, the padding value should match the dilation value: with kernel_size = 3 and stride = 1 (the default), 2p = d(3-1) = 2d, so p = d (not the constant 1).


self.sa.add_module('conv_%d'%i,nn.Conv2d(kernel_size=3,in_channels=channel//reduction,out_channels=channel//reduction,padding=1,dilation=dia_val))
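
A quick check of the claim: with kernel_size=3 and stride=1, the spatial size is preserved only when the padding equals the dilation.

    import torch
    import torch.nn as nn

    x = torch.randn(1, 8, 7, 7)
    for d in (1, 2, 3):
        y_fixed = nn.Conv2d(8, 8, kernel_size=3, padding=d, dilation=d)(x)  # padding = dilation
        y_repo = nn.Conv2d(8, 8, kernel_size=3, padding=1, dilation=d)(x)   # padding = 1
        print(d, tuple(y_fixed.shape[2:]), tuple(y_repo.shape[2:]))
    # d=1: (7, 7) vs (7, 7);  d=2: (7, 7) vs (5, 5);  d=3: (7, 7) vs (3, 3)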

Best

Installation

This collection is very well put together, thanks for the hard work. One question: can it be installed via pip or some other way, or do I need to git clone the entire repository?

Need Help in modifying the given model

I need to modify the following model by adding one linear layer, followed by a dropout layer, and finally another linear layer (which concatenates the dropout output with tabular data of 12 columns) to produce a single regression value as output.

Model class link:–> https://github.com/xmu-xiaoma666/External-Attention-pytorch/blob/master/model/attention/CoAtNet.py

I tried this:

class coAtNet_Model(nn.Module):
    def __init__(self):
        super(coAtNet_Model, self).__init__()
        self.model = CoAtNet(3,224)
        self.classifier = nn.Linear(14, 128)
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(128 + 12, 1)

    def forward(self, image, tabular_data_inputs):
        x = self.model(image)
        x = self.classifier(x)
        x = self.dropout(x)
        x = torch.cat([x, tabular_data_inputs], dim=1)
        x = self.out(x)

        return x
model = coAtNet_Model()

but I am getting this error:

-->x = torch.cat([x, tabular_data_inputs], dim=1)
   x = self.out(x)

RuntimeError: Tensors must have same number of dimensions: got 2 and 4

Please help me with this.
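
The error comes from concatenating a 4-D feature map with the 2-D tabular tensor. One hedged fix is to pool and flatten the backbone output to (B, C) before the linear layers; the backbone_channels argument below is an assumption, so check the actual output shape of CoAtNet(3, 224) in your setup.

    import torch
    import torch.nn as nn

    class coAtNet_Model(nn.Module):
        def __init__(self, backbone: nn.Module, backbone_channels: int):
            super().__init__()
            self.model = backbone                      # e.g. CoAtNet(3, 224)
            self.pool = nn.AdaptiveAvgPool2d(1)        # (B, C, H, W) -> (B, C, 1, 1)
            self.classifier = nn.Linear(backbone_channels, 128)
            self.dropout = nn.Dropout(0.1)
            self.out = nn.Linear(128 + 12, 1)

        def forward(self, image, tabular_data_inputs):
            x = self.model(image)
            if x.dim() == 4:                           # flatten a 4-D feature map to (B, C)
                x = self.pool(x).flatten(1)
            x = self.dropout(self.classifier(x))
            x = torch.cat([x, tabular_data_inputs], dim=1)  # both tensors are now 2-D
            return self.out(x)

    # model = coAtNet_Model(CoAtNet(3, 224), backbone_channels=...)  # fill in the real channel count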

Maybe an error in attention/VIP.py

Dear Author:
Hello.
I found an issue here:
x=h_embed*weight[0]+w_embed*weight[1]+h_embed*weight[2]
Maybe it should be x=c_embed*weight[0]+w_embed*weight[1]+h_embed*weight[2]
Thanks

No Adaptive Kernel Size Being Used in ECA Attention.

Hi, in the current code for ECA attention, the kernel size of the convolution layer has to be passed in as a parameter, but in the original paper the kernel size is determined by a mapping function that takes the number of channels as input.

Refer to Figure 3 of the paper.

Do you think the code in this repo needs to change accordingly? Or could you explain the reasoning behind using a fixed kernel size?
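
For reference, a sketch of the adaptive rule described in the ECA paper, k = psi(C) = |log2(C)/gamma + b/gamma| adjusted to the nearest odd number, with the paper's defaults gamma=2 and b=1 (this is not the repo's current code):

    import math

    def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        return t if t % 2 else t + 1      # force an odd kernel size

    print([eca_kernel_size(c) for c in (64, 128, 256, 512)])  # [3, 5, 5, 5]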

torch load mobilevit_s.pt error

mvit_s = mobilevit_s()
checkpoint = torch.load("mobilevit_s.pt",map_location='cpu')
mvit_s.load_state_dict(checkpoint)
I downloaded the checkpoint from https://github.com/apple/ml-cvnets/blob/main/examples/README-mobilevit.md.
How can I load it? I get the following error:

RuntimeError: Error(s) in loading state_dict for MobileViT:
Missing key(s) in state_dict: "conv_1.0.weight", "conv_1.0.bias", "conv_1.1.weight", "conv_1.1.bias", "conv_1.1.running_mean", "conv_1.1.running_var", "mv2.0.conv.0.weight", "mv2.0.conv.1.weight", "mv2.0.conv.1.bias", "mv2.0.conv.1.running_mean", "mv2.0.conv.1.running_var", "mv2.0.conv.3.weight", "mv2.0.conv.4.weight", "mv2.0.conv.4.bias", "mv2.0.conv.4.running_mean", "mv2.0.conv.4.running_var", "mv2.0.conv.6.weight", "mv2.0.conv.8.weight", "mv2.0.conv.8.bias", "mv2.0.conv.8.running_mean", "mv2.0.conv.8.running_var", "mv2.1.conv.0.weight", "mv2.1.conv.1.weight", "mv2.1.conv.1.bias", "mv2.1.conv.1.running_mean", "mv2.1.conv.1.running_var", "mv2.1.conv.3.weight", "mv2.1.conv.4.weight", "mv2.1.conv.4.bias", "mv2.1.conv.4.running_mean", "mv2.1.conv.4.running_var", "mv2.1.conv.6.weight", "mv2.1.conv.8.weight", "mv2.1.conv.8.bias", "mv2.1.conv.8.running_mean", "mv2.1.conv.8.running_var", "mv2.2.conv.0.weight", "mv2.2.conv.1.weight", "mv2.2.conv.1.bias", "mv2.2.conv.1.running_mean", "mv2.2.conv.1.running_var", "mv2.2.conv.3.weight", "mv2.2.conv.4.weight", "mv2.2.conv.4.bias", "mv2.2.conv.4.running_mean", "mv2.2.conv.4.running_var", "mv2.2.conv.6.weight", "mv2.2.conv.8.weight", "mv2.2.conv.8.bias", "mv2.2.conv.8.running_mean", "mv2.2.conv.8.running_var", "mv2.3.conv.0.weight", "mv2.3.conv.1.weight", "mv2.3.conv.1.bias", "mv2.3.conv.1.running_mean", "mv2.3.conv.1.running_var", "mv2.3.conv.3.weight", "mv2.3.conv.4.weight", "mv2.3.conv.4.bias", "mv2.3.conv.4.running_mean", "mv2.3.conv.4.running_var", "mv2.3.conv.6.weight", "mv2.3.conv.8.weight", "mv2.3.conv.8.bias", "mv2.3.conv.8.running_mean", "mv2.3.conv.8.running_var", "mv2.4.conv.0.weight", "mv2.4.conv.1.weight", "mv2.4.conv.1.bias", "mv2.4.conv.1.running_mean", "mv2.4.conv.1.running_var", "mv2.4.conv.3.weight", "mv2.4.conv.4.weight", "mv2.4.conv.4.bias", "mv2.4.conv.4.running_mean", "mv2.4.conv.4.running_var", "mv2.4.conv.6.weight", "mv2.4.conv.8.weight", "mv2.4.conv.8.bias", "mv2.4.conv.8.running_mean", "mv2.4.conv.8.running_var", "mv2.5.conv.0.weight", "mv2.5.conv.1.weight", "mv2.5.conv.1.bias", "mv2.5.conv.1.running_mean", "mv2.5.conv.1.running_var", "mv2.5.conv.3.weight", "mv2.5.conv.4.weight", "mv2.5.conv.4.bias", "mv2.5.conv.4.running_mean", "mv2.5.conv.4.running_var", "mv2.5.conv.6.weight", "mv2.5.conv.8.weight", "mv2.5.conv.8.bias", "mv2.5.conv.8.running_mean", "mv2.5.conv.8.running_var", "mv2.6.conv.0.weight", "mv2.6.conv.1.weight", "mv2.6.conv.1.bias", "mv2.6.conv.1.running_mean", "mv2.6.conv.1.running_var", "mv2.6.conv.3.weight", "mv2.6.conv.4.weight", "mv2.6.conv.4.bias", "mv2.6.conv.4.running_mean", "mv2.6.conv.4.running_var", "mv2.6.conv.6.weight", "mv2.6.conv.8.weight", "mv2.6.conv.8.bias", "mv2.6.conv.8.running_mean", "mv2.6.conv.8.running_var", "m_vits.0.conv_1.weight", "m_vits.0.conv_1.bias", "m_vits.0.conv2.weight", "m_vits.0.conv2.bias", "m_vits.0.trans.layers.0.0.ln.weight", "m_vits.0.trans.layers.0.0.ln.bias", "m_vits.0.trans.layers.0.0.fn.to_qkv.weight", "m_vits.0.trans.layers.0.0.fn.to_out.0.weight", "m_vits.0.trans.layers.0.0.fn.to_out.0.bias", "m_vits.0.trans.layers.0.1.ln.weight", "m_vits.0.trans.layers.0.1.ln.bias", "m_vits.0.trans.layers.0.1.fn.net.0.weight", "m_vits.0.trans.layers.0.1.fn.net.0.bias", "m_vits.0.trans.layers.0.1.fn.net.3.weight", "m_vits.0.trans.layers.0.1.fn.net.3.bias", "m_vits.0.trans.layers.1.0.ln.weight", "m_vits.0.trans.layers.1.0.ln.bias", "m_vits.0.trans.layers.1.0.fn.to_qkv.weight", "m_vits.0.trans.layers.1.0.fn.to_out.0.weight", "m_vits.0.trans.layers.1.0.fn.to_out.0.bias", 
"m_vits.0.trans.layers.1.1.ln.weight", "m_vits.0.trans.layers.1.1.ln.bias", "m_vits.0.trans.layers.1.1.fn.net.0.weight", "m_vits.0.trans.layers.1.1.fn.net.0.bias", "m_vits.0.trans.layers.1.1.fn.net.3.weight", "m_vits.0.trans.layers.1.1.fn.net.3.bias", "m_vits.0.conv3.weight", "m_vits.0.conv3.bias", "m_vits.0.conv4.weight", "m_vits.0.conv4.bias", "m_vits.1.conv_1.weight", "m_vits.1.conv_1.bias", "m_vits.1.conv2.weight", "m_vits.1.conv2.bias", "m_vits.1.trans.layers.0.0.ln.weight", "m_vits.1.trans.layers.0.0.ln.bias", "m_vits.1.trans.layers.0.0.fn.to_qkv.weight", "m_vits.1.trans.layers.0.0.fn.to_out.0.weight", "m_vits.1.trans.layers.0.0.fn.to_out.0.bias", "m_vits.1.trans.layers.0.1.ln.weight", "m_vits.1.trans.layers.0.1.ln.bias", "m_vits.1.trans.layers.0.1.fn.net.0.weight", "m_vits.1.trans.layers.0.1.fn.net.0.bias", "m_vits.1.trans.layers.0.1.fn.net.3.weight", "m_vits.1.trans.layers.0.1.fn.net.3.bias", "m_vits.1.trans.layers.1.0.ln.weight", "m_vits.1.trans.layers.1.0.ln.bias", "m_vits.1.trans.layers.1.0.fn.to_qkv.weight", "m_vits.1.trans.layers.1.0.fn.to_out.0.weight", "m_vits.1.trans.layers.1.0.fn.to_out.0.bias", "m_vits.1.trans.layers.1.1.ln.weight", "m_vits.1.trans.layers.1.1.ln.bias", "m_vits.1.trans.layers.1.1.fn.net.0.weight", "m_vits.1.trans.layers.1.1.fn.net.0.bias", "m_vits.1.trans.layers.1.1.fn.net.3.weight", "m_vits.1.trans.layers.1.1.fn.net.3.bias", "m_vits.1.trans.layers.2.0.ln.weight", "m_vits.1.trans.layers.2.0.ln.bias", "m_vits.1.trans.layers.2.0.fn.to_qkv.weight", "m_vits.1.trans.layers.2.0.fn.to_out.0.weight", "m_vits.1.trans.layers.2.0.fn.to_out.0.bias", "m_vits.1.trans.layers.2.1.ln.weight", "m_vits.1.trans.layers.2.1.ln.bias", "m_vits.1.trans.layers.2.1.fn.net.0.weight", "m_vits.1.trans.layers.2.1.fn.net.0.bias", "m_vits.1.trans.layers.2.1.fn.net.3.weight", "m_vits.1.trans.layers.2.1.fn.net.3.bias", "m_vits.1.trans.layers.3.0.ln.weight", "m_vits.1.trans.layers.3.0.ln.bias", "m_vits.1.trans.layers.3.0.fn.to_qkv.weight", "m_vits.1.trans.layers.3.0.fn.to_out.0.weight", "m_vits.1.trans.layers.3.0.fn.to_out.0.bias", "m_vits.1.trans.layers.3.1.ln.weight", "m_vits.1.trans.layers.3.1.ln.bias", "m_vits.1.trans.layers.3.1.fn.net.0.weight", "m_vits.1.trans.layers.3.1.fn.net.0.bias", "m_vits.1.trans.layers.3.1.fn.net.3.weight", "m_vits.1.trans.layers.3.1.fn.net.3.bias", "m_vits.1.conv3.weight", "m_vits.1.conv3.bias", "m_vits.1.conv4.weight", "m_vits.1.conv4.bias", "m_vits.2.conv_1.weight", "m_vits.2.conv_1.bias", "m_vits.2.conv2.weight", "m_vits.2.conv2.bias", "m_vits.2.trans.layers.0.0.ln.weight", "m_vits.2.trans.layers.0.0.ln.bias", "m_vits.2.trans.layers.0.0.fn.to_qkv.weight", "m_vits.2.trans.layers.0.0.fn.to_out.0.weight", "m_vits.2.trans.layers.0.0.fn.to_out.0.bias", "m_vits.2.trans.layers.0.1.ln.weight", "m_vits.2.trans.layers.0.1.ln.bias", "m_vits.2.trans.layers.0.1.fn.net.0.weight", "m_vits.2.trans.layers.0.1.fn.net.0.bias", "m_vits.2.trans.layers.0.1.fn.net.3.weight", "m_vits.2.trans.layers.0.1.fn.net.3.bias", "m_vits.2.trans.layers.1.0.ln.weight", "m_vits.2.trans.layers.1.0.ln.bias", "m_vits.2.trans.layers.1.0.fn.to_qkv.weight", "m_vits.2.trans.layers.1.0.fn.to_out.0.weight", "m_vits.2.trans.layers.1.0.fn.to_out.0.bias", "m_vits.2.trans.layers.1.1.ln.weight", "m_vits.2.trans.layers.1.1.ln.bias", "m_vits.2.trans.layers.1.1.fn.net.0.weight", "m_vits.2.trans.layers.1.1.fn.net.0.bias", "m_vits.2.trans.layers.1.1.fn.net.3.weight", "m_vits.2.trans.layers.1.1.fn.net.3.bias", "m_vits.2.trans.layers.2.0.ln.weight", "m_vits.2.trans.layers.2.0.ln.bias", 
"m_vits.2.trans.layers.2.0.fn.to_qkv.weight", "m_vits.2.trans.layers.2.0.fn.to_out.0.weight", "m_vits.2.trans.layers.2.0.fn.to_out.0.bias", "m_vits.2.trans.layers.2.1.ln.weight", "m_vits.2.trans.layers.2.1.ln.bias", "m_vits.2.trans.layers.2.1.fn.net.0.weight", "m_vits.2.trans.layers.2.1.fn.net.0.bias", "m_vits.2.trans.layers.2.1.fn.net.3.weight", "m_vits.2.trans.layers.2.1.fn.net.3.bias", "m_vits.2.conv3.weight", "m_vits.2.conv3.bias", "m_vits.2.conv4.weight", "m_vits.2.conv4.bias", "conv2.0.weight", "conv2.0.bias", "conv2.1.weight", "conv2.1.bias", "conv2.1.running_mean", "conv2.1.running_var", "fc.weight".
Unexpected key(s) in state_dict: "layer_1.0.block.exp_1x1.block.conv.weight", "layer_1.0.block.exp_1x1.block.norm.weight", "layer_1.0.block.exp_1x1.block.norm.bias", "layer_1.0.block.exp_1x1.block.norm.running_mean", "layer_1.0.block.exp_1x1.block.norm.running_var", "layer_1.0.block.exp_1x1.block.norm.num_batches_tracked", "layer_1.0.block.conv_3x3.block.conv.weight", "layer_1.0.block.conv_3x3.block.norm.weight", "layer_1.0.block.conv_3x3.block.norm.bias", "layer_1.0.block.conv_3x3.block.norm.running_mean", "layer_1.0.block.conv_3x3.block.norm.running_var", "layer_1.0.block.conv_3x3.block.norm.num_batches_tracked", "layer_1.0.block.red_1x1.block.conv.weight", "layer_1.0.block.red_1x1.block.norm.weight", "layer_1.0.block.red_1x1.block.norm.bias", "layer_1.0.block.red_1x1.block.norm.running_mean", "layer_1.0.block.red_1x1.block.norm.running_var", "layer_1.0.block.red_1x1.block.norm.num_batches_tracked", "layer_2.0.block.exp_1x1.block.conv.weight", "layer_2.0.block.exp_1x1.block.norm.weight", "layer_2.0.block.exp_1x1.block.norm.bias", "layer_2.0.block.exp_1x1.block.norm.running_mean", "layer_2.0.block.exp_1x1.block.norm.running_var", "layer_2.0.block.exp_1x1.block.norm.num_batches_tracked", "layer_2.0.block.conv_3x3.block.conv.weight", "layer_2.0.block.conv_3x3.block.norm.weight", "layer_2.0.block.conv_3x3.block.norm.bias", "layer_2.0.block.conv_3x3.block.norm.running_mean", "layer_2.0.block.conv_3x3.block.norm.running_var", "layer_2.0.block.conv_3x3.block.norm.num_batches_tracked", "layer_2.0.block.red_1x1.block.conv.weight", "layer_2.0.block.red_1x1.block.norm.weight", "layer_2.0.block.red_1x1.block.norm.bias", "layer_2.0.block.red_1x1.block.norm.running_mean", "layer_2.0.block.red_1x1.block.norm.running_var", "layer_2.0.block.red_1x1.block.norm.num_batches_tracked", "layer_2.1.block.exp_1x1.block.conv.weight", "layer_2.1.block.exp_1x1.block.norm.weight", "layer_2.1.block.exp_1x1.block.norm.bias", "layer_2.1.block.exp_1x1.block.norm.running_mean", "layer_2.1.block.exp_1x1.block.norm.running_var", "layer_2.1.block.exp_1x1.block.norm.num_batches_tracked", "layer_2.1.block.conv_3x3.block.conv.weight", "layer_2.1.block.conv_3x3.block.norm.weight", "layer_2.1.block.conv_3x3.block.norm.bias", "layer_2.1.block.conv_3x3.block.norm.running_mean", "layer_2.1.block.conv_3x3.block.norm.running_var", "layer_2.1.block.conv_3x3.block.norm.num_batches_tracked", "layer_2.1.block.red_1x1.block.conv.weight", "layer_2.1.block.red_1x1.block.norm.weight", "layer_2.1.block.red_1x1.block.norm.bias", "layer_2.1.block.red_1x1.block.norm.running_mean", "layer_2.1.block.red_1x1.block.norm.running_var", "layer_2.1.block.red_1x1.block.norm.num_batches_tracked", "layer_2.2.block.exp_1x1.block.conv.weight", "layer_2.2.block.exp_1x1.block.norm.weight", "layer_2.2.block.exp_1x1.block.norm.bias", "layer_2.2.block.exp_1x1.block.norm.running_mean", "layer_2.2.block.exp_1x1.block.norm.running_var", "layer_2.2.block.exp_1x1.block.norm.num_batches_tracked", "layer_2.2.block.conv_3x3.block.conv.weight", "layer_2.2.block.conv_3x3.block.norm.weight", "layer_2.2.block.conv_3x3.block.norm.bias", "layer_2.2.block.conv_3x3.block.norm.running_mean", "layer_2.2.block.conv_3x3.block.norm.running_var", "layer_2.2.block.conv_3x3.block.norm.num_batches_tracked", "layer_2.2.block.red_1x1.block.conv.weight", "layer_2.2.block.red_1x1.block.norm.weight", "layer_2.2.block.red_1x1.block.norm.bias", "layer_2.2.block.red_1x1.block.norm.running_mean", "layer_2.2.block.red_1x1.block.norm.running_var", 
"layer_2.2.block.red_1x1.block.norm.num_batches_tracked", "layer_3.0.block.exp_1x1.block.conv.weight", "layer_3.0.block.exp_1x1.block.norm.weight", "layer_3.0.block.exp_1x1.block.norm.bias", "layer_3.0.block.exp_1x1.block.norm.running_mean", "layer_3.0.block.exp_1x1.block.norm.running_var", "layer_3.0.block.exp_1x1.block.norm.num_batches_tracked", "layer_3.0.block.conv_3x3.block.conv.weight", "layer_3.0.block.conv_3x3.block.norm.weight", "layer_3.0.block.conv_3x3.block.norm.bias", "layer_3.0.block.conv_3x3.block.norm.running_mean", "layer_3.0.block.conv_3x3.block.norm.running_var", "layer_3.0.block.conv_3x3.block.norm.num_batches_tracked", "layer_3.0.block.red_1x1.block.conv.weight", "layer_3.0.block.red_1x1.block.norm.weight", "layer_3.0.block.red_1x1.block.norm.bias", "layer_3.0.block.red_1x1.block.norm.running_mean", "layer_3.0.block.red_1x1.block.norm.running_var", "layer_3.0.block.red_1x1.block.norm.num_batches_tracked", "layer_3.1.local_rep.conv_3x3.block.conv.weight", "layer_3.1.local_rep.conv_3x3.block.norm.weight", "layer_3.1.local_rep.conv_3x3.block.norm.bias", "layer_3.1.local_rep.conv_3x3.block.norm.running_mean", "layer_3.1.local_rep.conv_3x3.block.norm.running_var", "layer_3.1.local_rep.conv_3x3.block.norm.num_batches_tracked", "layer_3.1.local_rep.conv_1x1.block.conv.weight", "layer_3.1.global_rep.0.pre_norm_mha.0.weight", "layer_3.1.global_rep.0.pre_norm_mha.0.bias", "layer_3.1.global_rep.0.pre_norm_mha.1.qkv_proj.weight", "layer_3.1.global_rep.0.pre_norm_mha.1.qkv_proj.bias", "layer_3.1.global_rep.0.pre_norm_mha.1.out_proj.weight", "layer_3.1.global_rep.0.pre_norm_mha.1.out_proj.bias", "layer_3.1.global_rep.0.pre_norm_ffn.0.weight", "layer_3.1.global_rep.0.pre_norm_ffn.0.bias", "layer_3.1.global_rep.0.pre_norm_ffn.1.weight", "layer_3.1.global_rep.0.pre_norm_ffn.1.bias", "layer_3.1.global_rep.0.pre_norm_ffn.4.weight", "layer_3.1.global_rep.0.pre_norm_ffn.4.bias", "layer_3.1.global_rep.1.pre_norm_mha.0.weight", "layer_3.1.global_rep.1.pre_norm_mha.0.bias", "layer_3.1.global_rep.1.pre_norm_mha.1.qkv_proj.weight", "layer_3.1.global_rep.1.pre_norm_mha.1.qkv_proj.bias", "layer_3.1.global_rep.1.pre_norm_mha.1.out_proj.weight", "layer_3.1.global_rep.1.pre_norm_mha.1.out_proj.bias", "layer_3.1.global_rep.1.pre_norm_ffn.0.weight", "layer_3.1.global_rep.1.pre_norm_ffn.0.bias", "layer_3.1.global_rep.1.pre_norm_ffn.1.weight", "layer_3.1.global_rep.1.pre_norm_ffn.1.bias", "layer_3.1.global_rep.1.pre_norm_ffn.4.weight", "layer_3.1.global_rep.1.pre_norm_ffn.4.bias", "layer_3.1.global_rep.2.weight", "layer_3.1.global_rep.2.bias", "layer_3.1.conv_proj.block.conv.weight", "layer_3.1.conv_proj.block.norm.weight", "layer_3.1.conv_proj.block.norm.bias", "layer_3.1.conv_proj.block.norm.running_mean", "layer_3.1.conv_proj.block.norm.running_var", "layer_3.1.conv_proj.block.norm.num_batches_tracked", "layer_3.1.fusion.block.conv.weight", "layer_3.1.fusion.block.norm.weight", "layer_3.1.fusion.block.norm.bias", "layer_3.1.fusion.block.norm.running_mean", "layer_3.1.fusion.block.norm.running_var", "layer_3.1.fusion.block.norm.num_batches_tracked", "layer_4.0.block.exp_1x1.block.conv.weight", "layer_4.0.block.exp_1x1.block.norm.weight", "layer_4.0.block.exp_1x1.block.norm.bias", "layer_4.0.block.exp_1x1.block.norm.running_mean", "layer_4.0.block.exp_1x1.block.norm.running_var", "layer_4.0.block.exp_1x1.block.norm.num_batches_tracked", "layer_4.0.block.conv_3x3.block.conv.weight", "layer_4.0.block.conv_3x3.block.norm.weight", "layer_4.0.block.conv_3x3.block.norm.bias", 
"layer_4.0.block.conv_3x3.block.norm.running_mean", "layer_4.0.block.conv_3x3.block.norm.running_var", "layer_4.0.block.conv_3x3.block.norm.num_batches_tracked", "layer_4.0.block.red_1x1.block.conv.weight", "layer_4.0.block.red_1x1.block.norm.weight", "layer_4.0.block.red_1x1.block.norm.bias", "layer_4.0.block.red_1x1.block.norm.running_mean", "layer_4.0.block.red_1x1.block.norm.running_var", "layer_4.0.block.red_1x1.block.norm.num_batches_tracked", "layer_4.1.local_rep.conv_3x3.block.conv.weight", "layer_4.1.local_rep.conv_3x3.block.norm.weight", "layer_4.1.local_rep.conv_3x3.block.norm.bias", "layer_4.1.local_rep.conv_3x3.block.norm.running_mean", "layer_4.1.local_rep.conv_3x3.block.norm.running_var", "layer_4.1.local_rep.conv_3x3.block.norm.num_batches_tracked", "layer_4.1.local_rep.conv_1x1.block.conv.weight", "layer_4.1.global_rep.0.pre_norm_mha.0.weight", "layer_4.1.global_rep.0.pre_norm_mha.0.bias", "layer_4.1.global_rep.0.pre_norm_mha.1.qkv_proj.weight", "layer_4.1.global_rep.0.pre_norm_mha.1.qkv_proj.bias", "layer_4.1.global_rep.0.pre_norm_mha.1.out_proj.weight", "layer_4.1.global_rep.0.pre_norm_mha.1.out_proj.bias", "layer_4.1.global_rep.0.pre_norm_ffn.0.weight", "layer_4.1.global_rep.0.pre_norm_ffn.0.bias", "layer_4.1.global_rep.0.pre_norm_ffn.1.weight", "layer_4.1.global_rep.0.pre_norm_ffn.1.bias", "layer_4.1.global_rep.0.pre_norm_ffn.4.weight", "layer_4.1.global_rep.0.pre_norm_ffn.4.bias", "layer_4.1.global_rep.1.pre_norm_mha.0.weight", "layer_4.1.global_rep.1.pre_norm_mha.0.bias", "layer_4.1.global_rep.1.pre_norm_mha.1.qkv_proj.weight", "layer_4.1.global_rep.1.pre_norm_mha.1.qkv_proj.bias", "layer_4.1.global_rep.1.pre_norm_mha.1.out_proj.weight", "layer_4.1.global_rep.1.pre_norm_mha.1.out_proj.bias", "layer_4.1.global_rep.1.pre_norm_ffn.0.weight", "layer_4.1.global_rep.1.pre_norm_ffn.0.bias", "layer_4.1.global_rep.1.pre_norm_ffn.1.weight", "layer_4.1.global_rep.1.pre_norm_ffn.1.bias", "layer_4.1.global_rep.1.pre_norm_ffn.4.weight", "layer_4.1.global_rep.1.pre_norm_ffn.4.bias", "layer_4.1.global_rep.2.pre_norm_mha.0.weight", "layer_4.1.global_rep.2.pre_norm_mha.0.bias", "layer_4.1.global_rep.2.pre_norm_mha.1.qkv_proj.weight", "layer_4.1.global_rep.2.pre_norm_mha.1.qkv_proj.bias", "layer_4.1.global_rep.2.pre_norm_mha.1.out_proj.weight", "layer_4.1.global_rep.2.pre_norm_mha.1.out_proj.bias", "layer_4.1.global_rep.2.pre_norm_ffn.0.weight", "layer_4.1.global_rep.2.pre_norm_ffn.0.bias", "layer_4.1.global_rep.2.pre_norm_ffn.1.weight", "layer_4.1.global_rep.2.pre_norm_ffn.1.bias", "layer_4.1.global_rep.2.pre_norm_ffn.4.weight", "layer_4.1.global_rep.2.pre_norm_ffn.4.bias", "layer_4.1.global_rep.3.pre_norm_mha.0.weight", "layer_4.1.global_rep.3.pre_norm_mha.0.bias", "layer_4.1.global_rep.3.pre_norm_mha.1.qkv_proj.weight", "layer_4.1.global_rep.3.pre_norm_mha.1.qkv_proj.bias", "layer_4.1.global_rep.3.pre_norm_mha.1.out_proj.weight", "layer_4.1.global_rep.3.pre_norm_mha.1.out_proj.bias", "layer_4.1.global_rep.3.pre_norm_ffn.0.weight", "layer_4.1.global_rep.3.pre_norm_ffn.0.bias", "layer_4.1.global_rep.3.pre_norm_ffn.1.weight", "layer_4.1.global_rep.3.pre_norm_ffn.1.bias", "layer_4.1.global_rep.3.pre_norm_ffn.4.weight", "layer_4.1.global_rep.3.pre_norm_ffn.4.bias", "layer_4.1.global_rep.4.weight", "layer_4.1.global_rep.4.bias", "layer_4.1.conv_proj.block.conv.weight", "layer_4.1.conv_proj.block.norm.weight", "layer_4.1.conv_proj.block.norm.bias", "layer_4.1.conv_proj.block.norm.running_mean", "layer_4.1.conv_proj.block.norm.running_var", 
"layer_4.1.conv_proj.block.norm.num_batches_tracked", "layer_4.1.fusion.block.conv.weight", "layer_4.1.fusion.block.norm.weight", "layer_4.1.fusion.block.norm.bias", "layer_4.1.fusion.block.norm.running_mean", "layer_4.1.fusion.block.norm.running_var", "layer_4.1.fusion.block.norm.num_batches_tracked", "layer_5.0.block.exp_1x1.block.conv.weight", "layer_5.0.block.exp_1x1.block.norm.weight", "layer_5.0.block.exp_1x1.block.norm.bias", "layer_5.0.block.exp_1x1.block.norm.running_mean", "layer_5.0.block.exp_1x1.block.norm.running_var", "layer_5.0.block.exp_1x1.block.norm.num_batches_tracked", "layer_5.0.block.conv_3x3.block.conv.weight", "layer_5.0.block.conv_3x3.block.norm.weight", "layer_5.0.block.conv_3x3.block.norm.bias", "layer_5.0.block.conv_3x3.block.norm.running_mean", "layer_5.0.block.conv_3x3.block.norm.running_var", "layer_5.0.block.conv_3x3.block.norm.num_batches_tracked", "layer_5.0.block.red_1x1.block.conv.weight", "layer_5.0.block.red_1x1.block.norm.weight", "layer_5.0.block.red_1x1.block.norm.bias", "layer_5.0.block.red_1x1.block.norm.running_mean", "layer_5.0.block.red_1x1.block.norm.running_var", "layer_5.0.block.red_1x1.block.norm.num_batches_tracked", "layer_5.1.local_rep.conv_3x3.block.conv.weight", "layer_5.1.local_rep.conv_3x3.block.norm.weight", "layer_5.1.local_rep.conv_3x3.block.norm.bias", "layer_5.1.local_rep.conv_3x3.block.norm.running_mean", "layer_5.1.local_rep.conv_3x3.block.norm.running_var", "layer_5.1.local_rep.conv_3x3.block.norm.num_batches_tracked", "layer_5.1.local_rep.conv_1x1.block.conv.weight", "layer_5.1.global_rep.0.pre_norm_mha.0.weight", "layer_5.1.global_rep.0.pre_norm_mha.0.bias", "layer_5.1.global_rep.0.pre_norm_mha.1.qkv_proj.weight", "layer_5.1.global_rep.0.pre_norm_mha.1.qkv_proj.bias", "layer_5.1.global_rep.0.pre_norm_mha.1.out_proj.weight", "layer_5.1.global_rep.0.pre_norm_mha.1.out_proj.bias", "layer_5.1.global_rep.0.pre_norm_ffn.0.weight", "layer_5.1.global_rep.0.pre_norm_ffn.0.bias", "layer_5.1.global_rep.0.pre_norm_ffn.1.weight", "layer_5.1.global_rep.0.pre_norm_ffn.1.bias", "layer_5.1.global_rep.0.pre_norm_ffn.4.weight", "layer_5.1.global_rep.0.pre_norm_ffn.4.bias", "layer_5.1.global_rep.1.pre_norm_mha.0.weight", "layer_5.1.global_rep.1.pre_norm_mha.0.bias", "layer_5.1.global_rep.1.pre_norm_mha.1.qkv_proj.weight", "layer_5.1.global_rep.1.pre_norm_mha.1.qkv_proj.bias", "layer_5.1.global_rep.1.pre_norm_mha.1.out_proj.weight", "layer_5.1.global_rep.1.pre_norm_mha.1.out_proj.bias", "layer_5.1.global_rep.1.pre_norm_ffn.0.weight", "layer_5.1.global_rep.1.pre_norm_ffn.0.bias", "layer_5.1.global_rep.1.pre_norm_ffn.1.weight", "layer_5.1.global_rep.1.pre_norm_ffn.1.bias", "layer_5.1.global_rep.1.pre_norm_ffn.4.weight", "layer_5.1.global_rep.1.pre_norm_ffn.4.bias", "layer_5.1.global_rep.2.pre_norm_mha.0.weight", "layer_5.1.global_rep.2.pre_norm_mha.0.bias", "layer_5.1.global_rep.2.pre_norm_mha.1.qkv_proj.weight", "layer_5.1.global_rep.2.pre_norm_mha.1.qkv_proj.bias", "layer_5.1.global_rep.2.pre_norm_mha.1.out_proj.weight", "layer_5.1.global_rep.2.pre_norm_mha.1.out_proj.bias", "layer_5.1.global_rep.2.pre_norm_ffn.0.weight", "layer_5.1.global_rep.2.pre_norm_ffn.0.bias", "layer_5.1.global_rep.2.pre_norm_ffn.1.weight", "layer_5.1.global_rep.2.pre_norm_ffn.1.bias", "layer_5.1.global_rep.2.pre_norm_ffn.4.weight", "layer_5.1.global_rep.2.pre_norm_ffn.4.bias", "layer_5.1.global_rep.3.weight", "layer_5.1.global_rep.3.bias", "layer_5.1.conv_proj.block.conv.weight", "layer_5.1.conv_proj.block.norm.weight", "layer_5.1.conv_proj.block.norm.bias", 
"layer_5.1.conv_proj.block.norm.running_mean", "layer_5.1.conv_proj.block.norm.running_var", "layer_5.1.conv_proj.block.norm.num_batches_tracked", "layer_5.1.fusion.block.conv.weight", "layer_5.1.fusion.block.norm.weight", "layer_5.1.fusion.block.norm.bias", "layer_5.1.fusion.block.norm.running_mean", "layer_5.1.fusion.block.norm.running_var", "layer_5.1.fusion.block.norm.num_batches_tracked", "conv_1x1_exp.block.conv.weight", "conv_1x1_exp.block.norm.weight", "conv_1x1_exp.block.norm.bias", "conv_1x1_exp.block.norm.running_mean", "conv_1x1_exp.block.norm.running_var", "conv_1x1_exp.block.norm.num_batches_tracked", "classifier.fc.weight", "classifier.fc.bias", "conv_1.block.conv.weight", "conv_1.block.norm.weight", "conv_1.block.norm.bias", "conv_1.block.norm.running_mean", "conv_1.block.norm.running_var", "conv_1.block.norm.num_batches_tracked".
