
axial-deeplab's Introduction

Axial-DeepLab (ECCV 2020, Spotlight)

News: The official TF2 re-implementation is available in DeepLab2. Axial-SWideRNet achieves 68.0% PQ or 83.5% mIoU on the Cityscapes validation set, with only single-scale inference and ImageNet-1K pretrained checkpoints.

This is a PyTorch re-implementation of the Axial-DeepLab paper. The re-implementation is mainly done by an amazing senior student, Huaijin Pi.

@inproceedings{wang2020axial,
  title={Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation},
  author={Wang, Huiyu and Zhu, Yukun and Green, Bradley and Adam, Hartwig and Yuille, Alan and Chen, Liang-Chieh},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2020}
}

Currently, only ImageNet classification with the "Conv-Stem + Axial-Attention" backbone is supported. If you are interested in contributing to this repo, please open an issue and we can further discuss.

Preparation

pip install tensorboardX
mkdir data
cd data
ln -s path/to/dataset imagenet

Training

  • Non-distributed training
python train.py --model axial50s --gpu_id 0,1,2,3 --batch_size 128 --val_batch_size 128 --name axial50s --lr 0.05 --nesterov
  • Distributed training
CUDA_VISIBLE_DEVICES=0,1,2,3 python dist_train.py --model axial50s --batch_size 128 --val_batch_size 128 --name axial50s --lr 0.05 --nesterov --dist-url 'tcp://127.0.0.1:4128' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0

You can change the model name to train different models.

Testing

python train.py --model axial50s --gpu_id 0,1,2,3 --batch_size 128 --val_batch_size 128 --name axial50s --lr 0.05 --nesterov --test

You can test with distributed settings in the same way.

Model Zoo

Method              Params (M)   Top-1 Acc (%)
ResNet-26           13.7         74.5
Axial-ResNet-26-S   5.9          75.8

Credits

axial-deeplab's People

Contributors

csrhddlam, phj128



axial-deeplab's Issues

Regarding the dimension of query and key

Hi,

I observed in the code that the query's and key's dimensions are half of the value's (out_planes // 2, group_planes // 2). Is there a specific reason for that (apart from making it faster)?

Thanks.
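
For reference, a minimal shape sketch (hypothetical sizes, mirroring the channel split described above rather than the repo's actual code):

import torch

# hypothetical sizes: batch N, channels out_planes, axis length L
N, out_planes, L = 8, 64, 56
qkv = torch.randn(N, out_planes * 2, L)  # output of the 1x1 qkv transform

# q and k each use out_planes // 2 channels; v uses out_planes
q, k, v = torch.split(qkv, [out_planes // 2, out_planes // 2, out_planes], dim=1)
print(q.shape, k.shape, v.shape)  # (8, 32, 56), (8, 32, 56), (8, 64, 56)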

Strange runtime results

Hello, I tested inference speed and compared it with the standard torchvision ResNet-50.
I used a 2080 Ti and PyTorch 1.4.
Results:
torchvision ResNet-50: 13-15 ms
axial-resnet-s: 79-81 ms
But in the paper the authors show that the inference speed of the L model is comparable with ResNet-101.

Concerns on the implementation of the relative positional embedding

Really nice work! I have a small concern on the implementation details of the relative positional embedding.

This is the overall random initialization of the positional embedding:

self.q_relative = nn.Parameter(torch.randn(self.group_planes // 2, kernel_size * 2 - 1, 1), requires_grad=True)
self.k_relative = nn.Parameter(torch.randn(self.group_planes // 2, kernel_size * 2 - 1, 1), requires_grad=True)
self.v_relative = nn.Parameter(torch.randn(self.group_planes, kernel_size * 2 - 1, 1), requires_grad=True)

This is the construction of the relative positional embedding by taking slices of the randomly initialized vector:

for i in range(self.kernel_size):
    q_embedding.append(self.q_relative[:, self.kernel_size - 1 - i: self.kernel_size * 2 - 1 - i])
    k_embedding.append(self.k_relative[:, self.kernel_size - 1 - i: self.kernel_size * 2 - 1 - i])
    v_embedding.append(self.v_relative[:, self.kernel_size - 1 - i: self.kernel_size * 2 - 1 - i])
q_embedding = torch.cat(q_embedding, dim=2)
k_embedding = torch.cat(k_embedding, dim=2)
v_embedding = torch.cat(v_embedding, dim=2)

According to the above implementation, I guess you realize the relative-position property by sharing overlapping slices of one positional embedding vector, instead of subtracting one positional embedding from another.

Besides, I am also concerned about the efficiency of the above loop. How could we improve the efficiency here?

I am wondering why you chose this implementation and whether there is any other work that has used a similar implementation?
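
Regarding the loop efficiency, here is a sketch (not the repo's code) that builds the same (C, K, K) embedding with a single indexing operation, assuming the q_relative layout shown above:

import torch

C, K = 16, 7                           # hypothetical channels and span
q_relative = torch.randn(C, 2 * K - 1, 1)

# index[j, i] = K - 1 - i + j reproduces the slices taken in the loop above
j = torch.arange(K).view(K, 1)
i = torch.arange(K).view(1, K)
index = K - 1 - i + j                  # shape (K, K), values in [0, 2K - 2]

q_embedding = q_relative[:, index, 0]  # shape (C, K, K), no Python loop

The index_select-based code quoted in the later issue "Confused about the shape of relative position encoding" appears to take a similar route with a precomputed flatten_index.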

Why batch normalization after the qkv transform?

I wonder why there is batch normalization after the qkv transform. Is it because of the covariate shift issue?

self.qkv_transform = qkv_transform(in_planes, out_planes * 2, kernel_size=1, stride=1,
                                   padding=0, bias=False)
self.bn_qkv = nn.BatchNorm1d(out_planes * 2)
self.bn_similarity = nn.BatchNorm2d(groups * 3)

How does BatchNorm2d work when calculating the similarity score? It really confuses me.

Thanks
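
For context, a shape sketch (hypothetical sizes, not the repo's exact code) of where the two batch norms sit and which axes they normalize over:

import torch
import torch.nn as nn

N, in_planes, out_planes, L, groups = 8, 64, 64, 56, 8  # hypothetical sizes

qkv_transform = nn.Conv1d(in_planes, out_planes * 2, kernel_size=1, bias=False)
bn_qkv = nn.BatchNorm1d(out_planes * 2)      # per-channel stats over (N, L)
bn_similarity = nn.BatchNorm2d(groups * 3)   # per-channel stats over (N, L, L)

x = torch.randn(N, in_planes, L)             # one axis of the feature map
qkv = bn_qkv(qkv_transform(x))               # (N, 2 * out_planes, L)

# the three similarity terms (qk, qr, kr) are stacked along dim=1 before BN,
# so each term of each group gets its own normalization statistics
logits = torch.randn(N, groups * 3, L, L)
logits = bn_similarity(logits)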

Pretrained weights

It seems that the Axial-ResNet-26-S pretrained weights are not available.

Question about table 9 in paper

Hi,

Thanks for the work. I noticed from Table 9 in the paper that the performance is relatively stable regardless of whether the output stride is 16 or 32, and regardless of whether the decoder is the axial decoder. Have you noticed this in practice, and does this mean we can simply use an output stride of 32 without the axial decoder, which would make the model much more lightweight?

Relative Positional Encoding

Thanks for your great work!

I have some questions regarding relative positional encoding.
I believe that relative positional encoding was proposed in 'Self-Attention with Relative Position Representations', and it can be formulated as

y = sigma( softmax( q * (k + r^k) ) * (v + r^v) )

in a simplified version. In Stand-Alone Self-Attention in Vision Models, the self-attention layer omits r^v and therefore

y = sigma( softmax( q * k + q * r^k ) * v )

where q * (k + r^k) is unfolded to q * k + q * r^k, and the superscript k in r^k is omitted in the paper 'Stand-Alone Self-Attention in Vision Models'.
For the same reason, adding relative positional information to the queries, keys, and values should be unfolded to

y = sigma( softmax( q * k + q * r^k + k * r^q + r^q * r^k ) * (v + r^v) )

instead of

y = sigma( softmax( q * k + q * r^q + k * r^k ) * (v + r^v) )

as the Eq. (3) in the paper.

Please point it out if I misunderstood.

BTW, I suggest citing 'Self-Attention with Relative Position Representations' in Sec. 3.1.

model train error

I was training this network on my own dataset, but I got an error. What could be the cause of this?
[Screenshot of the error attached.]

Pretrained weights

Can you release more pretrained weights in addition to Axial-ResNet-26-S?

span sizes

I have 256x256 images. I read your paper, which uses a 65x65 span. In my case of 256x256 inputs, for applying axial attention once, can I use a 256x256 span, or should I use the conv-stem to decrease the size to 56 as in your paper and use a span of 56x56?

About local constraints

I'm sorry if I am misunderstanding the code: does this repo implement the local-constraint part for larger input sizes?
I assume that the AxialAttention module can only accept an input feature map whose size equals kernel_size * kernel_size. Am I right?

Question about kernel_size in AxialAttention

Hello! I would like to ask: if I do not use AxialAttention as a backbone but only as an attention mechanism, how well does it work? Also, does the kernel_size in AxialAttention depend on the size of the input feature map? If the input sizes at training and inference time differ, so the feature map sizes also differ, will that cause problems? I hope to get an answer, thank you very much!

position-sensitive attention

Thanks for the great work!

I am a bit confused about this piece of code:

kr = torch.einsum('bgci,cij->bgij', k, k_embedding).transpose(2, 3)

According to Eq. 4 in the paper, I have the impression that it should be torch.einsum('bgcj,cij->bgij', k, k_embedding) since p is the varying index. Please correct me if I am wrong. Thanks!
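
For anyone comparing the two contractions, a small shape check (hypothetical sizes):

import torch

b, g, c, K = 2, 4, 8, 7                 # hypothetical batch, groups, channels, span
k = torch.randn(b, g, c, K)
k_embedding = torch.randn(c, K, K)

kr_repo = torch.einsum('bgci,cij->bgij', k, k_embedding).transpose(2, 3)  # as in the repo
kr_alt = torch.einsum('bgcj,cij->bgij', k, k_embedding)                   # as suggested above

# both are (b, g, K, K), but they pair k with different position indices of
# k_embedding, so in general they are not equal
print(kr_repo.shape, kr_alt.shape, torch.allclose(kr_repo, kr_alt))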

Confused about the shape of relative position encoding

The following code generates a position embedding of shape (C, K, K), where C=self.group_planes*2, K=self.kernel_size:

all_embeddings = torch.index_select(self.relative, 1, self.flatten_index).view(self.group_planes * 2, self.kernel_size, self.kernel_size)

It seems that each position in the (K, K) window owns its own position encoding. But for axial attention applied along the w-axis, shouldn't the shape be (C, W), meaning that all rows share the same position encoding?

size error

I got this error while running training. What could be the cause of this?
[Screenshot of the error attached.]

Confused about the `transpose` in positional encoding of key

kr = torch.einsum('bgci,cij->bgij', k, k_embedding).transpose(2, 3)

I'm a bit confused about this transposition. After reading issue #17, I roughly understand it, but if the key should be transposed, why shouldn't the value be?

In addition, I noticed that there is a scaling factor of 1/sqrt(n) after the multiplication of q and k in the original Transformer model, but that scaling is removed here. Why?

Pretrained weights

Hi,

Do you think you will be able to release some pre-trained weights (e.g. on ImageNet) some time soon?

Thanks.

Question about Axial-Res50

Thank you for sharing this great work. I have a question: is Axial-Res50 the same as the pretrained backbone you used in MaX-DeepLab-S? BTW, would you mind letting me know the performance of MaX-DeepLab-S & L when you pretrain them on ImageNet? I implemented both of them and got about 77% top-1 accuracy for S; does that seem normal? I want to sanity-check whether my implementation is correct. Thanks.

Segmentation

Hi, I am writing a paper on segmentation using Axial-DeepLab. Could you please provide a rough sketch of the encoder and decoder of this network for segmentation?

Different resolution for inference

Hi @csrhddlam,

very nice work! I have a question regarding using a different resolution at inference. Is there any practical way to use the Axial-Attention block with a different resolution at inference time than at training time, or is this not possible due to the learnable positional encodings?

Cheers
Christoph

Shape of relative position encoding r^q, r^k, r^v

If the span size K is smaller than the width W, do we then have a relative position encoding matrix r^q of size (C, W, K)?
So that it is einsummed with the query like Q (H, (W, C)) * r^q ((W, C), K) -> A (H, W, K)? (A: attention matrix)
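
As a pure shape check (hypothetical sizes; not a claim about how the repo actually stores r^q), the contraction described above would look like:

import torch

H, W, C, K = 32, 32, 16, 9              # hypothetical height, width, channels, span
Q = torch.randn(H, W, C)                # queries of one head
rq = torch.randn(W, C, K)               # a (C, K) relative encoding per column

A = torch.einsum('hwc,wck->hwk', Q, rq)  # attention logits of shape (H, W, K)
print(A.shape)                           # torch.Size([32, 32, 9])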

Question on paper COCO panoptic segmentation results

Hi! It is a very nice and very solid work. I have some questions about this paper.

Since you use the Panoptic-DeepLab framework (first semantic segmentation, then instance grouping),
I want to know: what is the mIoU and mAP improvement on COCO compared with the Panoptic-DeepLab baseline?

Also, what are the advantages of using axial-attention layers over the ASPP module in Panoptic-DeepLab?

Thanks a lot !

License

Hello,
Thank you for the implementation! Is it possible for you to add a license to this repo?
