
axial-deeplab's Introduction

Axial-DeepLab (ECCV 2020, Spotlight)

News: The official TF2 re-implementation is available in DeepLab2. Axial-SWideRNet achieves 68.0% PQ or 83.5% mIoU on the Cityscapes validation set, with only single-scale inference and ImageNet-1K pretrained checkpoints.

This is a PyTorch re-implementation of the Axial-DeepLab paper. The re-implementation is mainly done by an amazing senior student, Huaijin Pi.

@inproceedings{wang2020axial,
  title={Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation},
  author={Wang, Huiyu and Zhu, Yukun and Green, Bradley and Adam, Hartwig and Yuille, Alan and Chen, Liang-Chieh},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2020}
}

Currently, only ImageNet classification with the "Conv-Stem + Axial-Attention" backbone is supported. If you are interested in contributing to this repo, please open an issue and we can further discuss.

Preparation

pip install tensorboardX
mkdir data
cd data
ln -s path/to/dataset imagenet

Training

  • Non-distributed training
python train.py --model axial50s --gpu_id 0,1,2,3 --batch_size 128 --val_batch_size 128 --name axial50s --lr 0.05 --nesterov
  • Distributed training
CUDA_VISIBLE_DEVICES=0,1,2,3 python dist_train.py --model axial50s --batch_size 128 --val_batch_size 128 --name axial50s --lr 0.05 --nesterov --dist-url 'tcp://127.0.0.1:4128' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0

You can change the model name to train different models.

Testing

python train.py --model axial50s --gpu_id 0,1,2,3 --batch_size 128 --val_batch_size 128 --name axial50s --lr 0.05 --nesterov --test

You can test with distributed settings in the same way.

Model Zoo

Method              Params (M)   Top-1 Acc (%)
ResNet-26           13.7         74.5
Axial-ResNet-26-S   5.9          75.8

Credits

axial-deeplab's People

Contributors

csrhddlam, phj128



axial-deeplab's Issues

Regarding the dimension of query and key

Hi,

I observed in the code that the query's and key's dimensions are half of the value's (out_planes // 2, group_planes // 2). Is there a specific reason for that (apart from making it faster)?

Thanks.
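
For reference, a minimal shape sketch (hypothetical sizes, mirroring the channel split described above rather than the repo's actual code):

import torch

# hypothetical sizes: batch N, channels out_planes, axis length L
N, out_planes, L = 8, 64, 56
qkv = torch.randn(N, out_planes * 2, L)  # output of the 1x1 qkv transform

# q and k each use out_planes // 2 channels; v uses out_planes
q, k, v = torch.split(qkv, [out_planes // 2, out_planes // 2, out_planes], dim=1)
print(q.shape, k.shape, v.shape)  # (8, 32, 56), (8, 32, 56), (8, 64, 56)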

Strange runtime results

Hello, I tested inference speed and compared it with the standard torchvision ResNet-50.
I used a 2080 Ti and PyTorch 1.4.
Results:
torchvision ResNet-50: 13-15 ms
axial-resnet-s: 79-81 ms
But in the paper the authors show that the inference speed of the L model is comparable with ResNet-101.

Concerns on the implementation of the relative positional embedding

Really nice work! I have a small concern on the implementation details of the relative positional embedding.

This is the overall random initialization of the positional embedding:

self.q_relative = nn.Parameter(torch.randn(self.group_planes // 2, kernel_size * 2 - 1, 1), requires_grad=True)
self.k_relative = nn.Parameter(torch.randn(self.group_planes // 2, kernel_size * 2 - 1, 1), requires_grad=True)
self.v_relative = nn.Parameter(torch.randn(self.group_planes, kernel_size * 2 - 1, 1), requires_grad=True)

This is the construction of the relative positional embedding by taking slices of the randomly initialized vector:

for i in range(self.kernel_size):
    q_embedding.append(self.q_relative[:, self.kernel_size - 1 - i: self.kernel_size * 2 - 1 - i])
    k_embedding.append(self.k_relative[:, self.kernel_size - 1 - i: self.kernel_size * 2 - 1 - i])
    v_embedding.append(self.v_relative[:, self.kernel_size - 1 - i: self.kernel_size * 2 - 1 - i])
q_embedding = torch.cat(q_embedding, dim=2)
k_embedding = torch.cat(k_embedding, dim=2)
v_embedding = torch.cat(v_embedding, dim=2)

According to the above implementation, I guess you realize the relative-position property by sharing overlapping slices of one positional embedding vector, instead of subtracting one positional embedding from another.

Besides, I am also concerned about the efficiency of the above loop. How could we improve the efficiency here?

I am wondering why you chose this implementation and whether there is any other work that has used a similar implementation?
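
Regarding the loop efficiency, here is a sketch (not the repo's code) that builds the same (C, K, K) embedding with a single indexing operation, assuming the q_relative layout shown above:

import torch

C, K = 16, 7                           # hypothetical channels and span
q_relative = torch.randn(C, 2 * K - 1, 1)

# index[j, i] = K - 1 - i + j reproduces the slices taken in the loop above
j = torch.arange(K).view(K, 1)
i = torch.arange(K).view(1, K)
index = K - 1 - i + j                  # shape (K, K), values in [0, 2K - 2]

q_embedding = q_relative[:, index, 0]  # shape (C, K, K), no Python loop

The index_select-based code quoted in the later issue "Confused about the shape of relative position encoding" appears to take a similar route with a precomputed flatten_index.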

Why batch normalization after the qkv transform?

I wonder why there is batch normalization after the qkv transform. Is it because of the covariate shift issue?

self.qkv_transform = qkv_transform(in_planes, out_planes * 2, kernel_size=1, stride=1,
                                   padding=0, bias=False)
self.bn_qkv = nn.BatchNorm1d(out_planes * 2)
self.bn_similarity = nn.BatchNorm2d(groups * 3)

How does BatchNorm2d work when calculating the similarity score? It really confuses me.

Thanks
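
For context, a shape sketch (hypothetical sizes, not the repo's exact code) of where the two batch norms sit and which axes they normalize over:

import torch
import torch.nn as nn

N, in_planes, out_planes, L, groups = 8, 64, 64, 56, 8  # hypothetical sizes

qkv_transform = nn.Conv1d(in_planes, out_planes * 2, kernel_size=1, bias=False)
bn_qkv = nn.BatchNorm1d(out_planes * 2)      # per-channel stats over (N, L)
bn_similarity = nn.BatchNorm2d(groups * 3)   # per-channel stats over (N, L, L)

x = torch.randn(N, in_planes, L)             # one axis of the feature map
qkv = bn_qkv(qkv_transform(x))               # (N, 2 * out_planes, L)

# the three similarity terms (qk, qr, kr) are stacked along dim=1 before BN,
# so each term of each group gets its own normalization statistics
logits = torch.randn(N, groups * 3, L, L)
logits = bn_similarity(logits)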

Pretrained weights

It seems that the Axial-ResNet-26-S pretrained weights are not available.

Question about table 9 in paper

Hi,

Thanks for the work. I noticed from Table 9 in the paper that the performance is relatively stable regardless of whether the output stride is 16 or 32, and regardless of whether the decoder is the axial decoder. Have you noticed this in practice, and does this mean we can simply use an output stride of 32 without the axial decoder, which would make the model much more lightweight?

Relative Positional Encoding

Thanks for your great work!

I have some questions regarding relative positional encoding.
I believe that relative positional encoding was proposed in 'Self-Attention with Relative Position Representations', and it can be formulated as

y = sigma( softmax( q * (k + r^k) ) * (v + r^v) )

in a simplified version. In Stand-Alone Self-Attention in Vision Models, the self-attention layer omits r^v and therefore

y = sigma( softmax( q * k + q * r^k ) * v )

where q * (k + r^k) is unfolded to q * k + q * r^k, and the superscript k in r^k is omitted in the paper 'Stand-Alone Self-Attention in Vision Models'.
For the same reason, adding relative positional information to the queries, keys, and values should be unfolded to

y = sigma( softmax( q * k + q * r^k + k * r^q + r^q * r^k ) * (v + r^v) )

instead of

y = sigma( softmax( q * k + q * r^q + k * r^k ) * (v + r^v) )

as the Eq. (3) in the paper.

Please point it out if I misunderstood.

BTW, I suggest citing 'Self-Attention with Relative Position Representations' in Sec. 3.1.

model train error

I was training this network on my own dataset, but I got an error. What could be the cause of this?
[Screenshot of the error attached.]

Pretrained weights

Can you release more pretrained weights in addition to Axial-ResNet-26-S?

span sizes

I have 256x256 images. I read your paper, which uses a 65x65 span. In my case of 256x256 inputs, for applying axial attention once, can I use a 256x256 span, or should I use the conv-stem to decrease the size to 56 as in your paper and use a span of 56x56?

About local constraints

I'm sorry if I am misunderstanding the code: does this repo implement the local-constraint part for larger input sizes?
I assume that the AxialAttention module can only accept an input feature map whose size equals kernel_size * kernel_size. Am I right?

Question about kernel_size in AxialAttention

Hello! I would like to ask: if I do not use AxialAttention as a backbone but only as an attention mechanism, how well does it work? Also, does the kernel_size in AxialAttention depend on the size of the input feature map? If the input sizes at training and inference time differ, so the feature map sizes also differ, will that cause problems? I hope to get an answer, thank you very much!

position-sensitive attention

Thanks for the great work!

I am a bit confused about this piece of code:

kr = torch.einsum('bgci,cij->bgij', k, k_embedding).transpose(2, 3)

According to Eq. 4 in the paper, I have the impression that it should be torch.einsum('bgcj,cij->bgij', k, k_embedding) since p is the varying index. Please correct me if I am wrong. Thanks!
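
For anyone comparing the two contractions, a small shape check (hypothetical sizes):

import torch

b, g, c, K = 2, 4, 8, 7                 # hypothetical batch, groups, channels, span
k = torch.randn(b, g, c, K)
k_embedding = torch.randn(c, K, K)

kr_repo = torch.einsum('bgci,cij->bgij', k, k_embedding).transpose(2, 3)  # as in the repo
kr_alt = torch.einsum('bgcj,cij->bgij', k, k_embedding)                   # as suggested above

# both are (b, g, K, K), but they pair k with different position indices of
# k_embedding, so in general they are not equal
print(kr_repo.shape, kr_alt.shape, torch.allclose(kr_repo, kr_alt))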

Confused about the shape of relative position encoding

The following code generates a position embedding of shape (C, K, K), where C=self.group_planes*2, K=self.kernel_size:

all_embeddings = torch.index_select(self.relative, 1, self.flatten_index).view(self.group_planes * 2, self.kernel_size, self.kernel_size)

It seems that each position in the (K, K) window owns its own position encoding. But for axial attention applied along the w-axis, shouldn't the shape be (C, W), meaning that all rows share the same position encoding?

size error

I got this error while running training. What could be the cause of this?
[Screenshot of the error attached.]

Confused about the `transpose` in positional encoding of key

kr = torch.einsum('bgci,cij->bgij', k, k_embedding).transpose(2, 3)

I'm a bit confused about this transposition. After reading issue #17, I roughly understand it, but if the key should be transposed, why shouldn't the value be?

In addition, I noticed that there is a scaling factor of 1/sqrt(n) after the multiplication of q and k in the original Transformer model, but that scaling is removed here. Why?

Pretrained weights

Hi,

Do you think you will be able to release some pre-trained weights (e.g. on ImageNet) some time soon?

Thanks.

Question about Axial-Res50

Thank you for sharing this great work. I have a question: is Axial-Res50 the same as the pretrained backbone you used in MaX-DeepLab-S? BTW, would you mind letting me know the performance of MaX-DeepLab-S & L when you pretrain them on ImageNet? I implemented both of them and got about 77% top-1 accuracy for S; does that seem normal? I want to sanity-check whether my implementation is correct. Thanks.

Segmentation

Hi, I am writing a paper on segmentation using Axial-DeepLab. Could you please provide a rough sketch of the encoder and decoder of this network for segmentation?

Different resolution for inference

Hi @csrhddlam,

very nice work! I have a question regarding using a different resolution at inference. Is there any practical way to use the Axial-Attention block with a different resolution at inference time than at training time, or is this not possible due to the learnable positional encodings?

Cheers
Christoph

Shape of relative position encoding r^q, r^k, r^v

If the span size K is smaller than the width W, do we then have a relative position encoding matrix r^q of size (C, W, K)?
So that it is einsummed with the query like Q (H, (W, C)) * r^q ((W, C), K) -> A (H, W, K)? (A: attention matrix)
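
As a pure shape check (hypothetical sizes; not a claim about how the repo actually stores r^q), the contraction described above would look like:

import torch

H, W, C, K = 32, 32, 16, 9              # hypothetical height, width, channels, span
Q = torch.randn(H, W, C)                # queries of one head
rq = torch.randn(W, C, K)               # a (C, K) relative encoding per column

A = torch.einsum('hwc,wck->hwk', Q, rq)  # attention logits of shape (H, W, K)
print(A.shape)                           # torch.Size([32, 32, 9])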

Question on paper COCO panoptic segmentation results

Hi! It is a very nice and very solid work. I have some questions about this paper.

Since you use the Panoptic-DeepLab framework (first semantic segmentation, then instance grouping),
I want to know: what is the mIoU and mAP improvement on COCO compared with the Panoptic-DeepLab baseline?

Also, what are the advantages of using axial-attention layers over the ASPP module in Panoptic-DeepLab?

Thanks a lot !

License

Hello,
Thank you for the implementation! Is it possible for you to add a license to this repo?
