
fsnet's Introduction

Full-Duplex Strategy for Video Object Segmentation (ICCV, 2021)

Authors: Ge-Peng Ji, Keren Fu, Zhe Wu, Deng-Ping Fan*, Jianbing Shen, & Ling Shao

  • This repository provides code for the paper "Full-Duplex Strategy for Video Object Segmentation", accepted at the ICCV-2021 conference (official version / arXiv version / Chinese translation).

  • This project is under construction. If you have any questions about our paper or find bugs in our project, feel free to contact me.

  • If you use FSNet in your research, please cite this paper (BibTeX).

1. News

  • [2022/10/22] Our journal extension is now open access (Springer Link).
  • [2021/10/16] Our journal extension was accepted by Computational Visual Media. The pre-print version can be found at this link.
  • [2021/08/24] Uploaded the training script for video object segmentation.
  • [2021/08/22] Uploaded the pre-trained snapshot and the pre-computed results on the U-VOS and V-SOD tasks.
  • [2021/08/20] Released the inference code and evaluation code (V-SOD).
  • [2021/07/20] Created the GitHub page.

2. Introduction

Why?

Appearance and motion are two important sources of information in video object segmentation (VOS). Previous methods mainly focus on simplex (one-way) solutions, which lowers the upper bound of feature collaboration between these two cues.


Figure 1: Visual comparison between the simplex (i.e., (a) appearance-refined motion and (b) motion-refined appearance) and our full-duplex strategy. In contrast, our FSNet offers a collaborative way to leverage the appearance and motion cues under the mutual restraint of the full-duplex strategy, thus providing more accurate structural details and alleviating the short-term feature-drifting issue.

What?

In this paper, we study a novel framework, termed FSNet (Full-duplex Strategy Network), which introduces a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding subspaces. Furthermore, a bidirectional purification module (BPM) is introduced to update the inconsistent features between the spatial-temporal embeddings, effectively improving the model's robustness.
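
To make the full-duplex idea concrete, below is a minimal PyTorch sketch of bidirectional cross-modal attention between appearance and motion features. It illustrates the two-way message passing only; it is not the authors' RCAM implementation, and all module and variable names here are assumptions.

```python
# Minimal sketch: each stream is re-weighted by channel attention computed
# from the *other* stream, so messages travel in both directions at once.
import torch
import torch.nn as nn

class FullDuplexCrossAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc_app = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.fc_mot = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, app, mot):
        b, c, _, _ = app.shape
        # Attention for the appearance stream is derived from motion,
        # and vice versa (the "full-duplex" direction of message passing).
        w_app = self.fc_app(self.gap(mot).view(b, c)).view(b, c, 1, 1)
        w_mot = self.fc_mot(self.gap(app).view(b, c)).view(b, c, 1, 1)
        return app * w_app, mot * w_mot

# Usage: two same-shape feature maps are refined jointly.
app = torch.randn(2, 64, 44, 44)
mot = torch.randn(2, 64, 44, 44)
app_refined, mot_refined = FullDuplexCrossAttention(64)(app, mot)
```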


Figure 2: The pipeline of our FSNet. The Relational Cross-Attention Module (RCAM) abstracts more discriminative representations between the motion and appearance cues using the full-duplex strategy. Then four Bidirectional Purification Modules (BPM) are stacked to further re-calibrate inconsistencies between the motion and appearance features. Finally, we utilize the decoder to generate our prediction.
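
The purification step can similarly be pictured as gating each stream by its disagreement with the other. The following is a hedged sketch of that idea only, not the repository's BPM code; the gate design is an assumption.

```python
# Sketch of bidirectional purification: where the two embeddings disagree,
# a learned gate re-calibrates each stream's responses.
import torch
import torch.nn as nn

class BidirectionalPurification(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate_app = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
        self.gate_mot = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, app, mot):
        diff = torch.abs(app - mot)             # inconsistency between the streams
        app = app + app * self.gate_app(diff)   # re-calibrate appearance features
        mot = mot + mot * self.gate_mot(diff)   # re-calibrate motion features
        return app, mot
```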

How?

By considering the mutual restraint within the full-duplex strategy, our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage, making it robust to various challenging scenarios (e.g., motion blur, occlusion) in VOS. Extensive experiments on five popular benchmarks (i.e., DAVIS16, FBMS, MCL, SegTrack-V2, and DAVSOD19) show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.


Figure 3: Qualitative results on five datasets, including DAVIS16, MCL, FBMS, SegTrack-V2, and DAVSOD19.

3. Usage

How to Inference?

  • Download the test dataset from Baidu Drive (PSW: aaw8) or Google Drive and save it at ./dataset/*.

  • Install the necessary libraries: PyTorch 1.1+, scipy 1.2.2, and PIL.

  • Download the pre-trained weights from Baidu Drive (psw: 36lm) or Google Drive and save them at ./snapshot/FSNet/2021-ICCV-FSNet-20epoch-new.pth.

  • Run python inference.py to generate the segmentation results (a conceptual sketch of this step follows this list).

  • The DenseCRF post-processing technique used in the original paper can be found here: DSS-CRF.
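
As mentioned above, here is a conceptual sketch of what a single inference step looks like. The function, the input size, and the two-stream forward signature model(frame, flow) are assumptions for illustration; the actual entry point is inference.py in this repository.

```python
# Hypothetical single-pair inference; not this repository's exact API.
import torch
from PIL import Image
from torchvision import transforms

def segment(model, frame_path, flow_path, size=352):
    """Run one frame/optical-flow pair through the network and return a
    soft mask in [0, 1] as a numpy array."""
    tf = transforms.Compose([
        transforms.Resize((size, size)),
        transforms.ToTensor(),
    ])
    frame = tf(Image.open(frame_path).convert("RGB")).unsqueeze(0)
    flow = tf(Image.open(flow_path).convert("RGB")).unsqueeze(0)
    model.eval()
    with torch.no_grad():
        pred = model(frame, flow)  # assumed two-stream forward pass
    return torch.sigmoid(pred).squeeze().cpu().numpy()
```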

How to train our model from scratch?

Download the training dataset from Baidu Drive (PSW: u01t) or Google Drive (VOS-TrainSet_StaticAndVideo.zip / VOS-TrainSet_Video.zip) and save it at ./dataset/*. Our training pipeline consists of three steps (a minimal driver sketch follows this list):

  • First, train the model on the combination of the static SOD dataset (i.e., DUTS, 12,926 samples) and the U-VOS datasets (i.e., DAVIS16 & FBMS, 2,373 samples).

    • Set --train_type='pretrain_rgb' and run python train.py in the terminal.
  • Second, train the model on the optical-flow maps of the U-VOS datasets (i.e., DAVIS16 & FBMS).

    • Set --train_type='pretrain_flow' and run python train.py in the terminal.
  • Third, train the model on paired frames and optical-flow maps of the U-VOS datasets (i.e., DAVIS16 & FBMS).

    • Set --train_type='finetune' and run python train.py in the terminal.
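
The three stages can be chained mechanically, assuming train.py accepts the --train_type flag exactly as described above. This driver is a convenience sketch, not part of the repository:

```python
# Run the three training stages in order; stop on the first failure.
import subprocess

for stage in ("pretrain_rgb", "pretrain_flow", "finetune"):
    subprocess.run(["python", "train.py", f"--train_type={stage}"], check=True)
```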

4. Benchmark

Unsupervised/Zero-shot Video Object Segmentation (U/Z-VOS) task

NOTE: For U-VOS, all prediction results are strictly binary. We adopt only a naive binarization algorithm (i.e., threshold = 0.5) in our experiments, as sketched below.
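
A minimal sketch of that binarization, assuming the soft prediction is saved as an 8-bit grayscale image; the file paths here are placeholders.

```python
# Threshold a soft saliency map at 0.5 to obtain a strictly binary mask.
import numpy as np
from PIL import Image

pred = np.asarray(Image.open("pred.png").convert("L"), dtype=np.float32) / 255.0
mask = (pred > 0.5).astype(np.uint8) * 255
Image.fromarray(mask).save("pred_binary.png")
```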

  • Quantitative results (NOTE: the following results are slightly improved over those reported in our conference paper):

                      mean-J  recall-J  decay-J  mean-F  recall-F  decay-F  T
    FSNet (w/ CRF)     0.834     0.945    0.032   0.831     0.902    0.026  0.213
    FSNet (w/o CRF)    0.823     0.943    0.033   0.833     0.919    0.028  0.213
  • Pre-computed results: please download the prediction results of FSNet from Baidu Drive (psw: ojsl) or Google Drive.

  • Evaluation toolbox: we use the standard evaluation toolbox from DAVIS16. (Note that all the pre-computed segmentations were downloaded from this link.) A minimal sketch of the J measure used in the table above follows.
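
For reference, the DAVIS J measure is the Jaccard index (region IoU) between a binary prediction and the ground truth; mean-J averages it over frames. A minimal numpy sketch (the function name is ours, not the toolbox's):

```python
import numpy as np

def j_measure(pred: np.ndarray, gt: np.ndarray) -> float:
    """Jaccard index between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else float(inter) / float(union)
```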

Video Salient Object Detection (V-SOD) task

NOTE: For V-SOD, all prediction results are non-binary.

5. Citation

@article{ji2022fsnet-CVMJ,
  title={Full-Duplex Strategy for Video Object Segmentation},
  author={Ji, Ge-Peng and Fan, Deng-Ping and Fu, Keren and Wu, Zhe and Shen, Jianbing and Shao, Ling},
  journal={Computational Visual Media},
  pages={155--175},
  volume={8},
  number={1},
  year={2022},
  publisher={Springer}
}

@inproceedings{ji2021full,
  title={Full-Duplex Strategy for Video Object Segmentation},
  author={Ji, Ge-Peng and Fu, Keren and Wu, Zhe and Fan, Deng-Ping and Shen, Jianbing and Shao, Ling},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={4922--4933},
  year={2021}
}

6. Acknowledgements

Many thanks to my collaborator Dr. Zhe Wu, who provided the excellent SCRN work and design inspiration.


fsnet's Issues

DenseCRF

Hello, could you provide the DenseCRF code for batch-processing images? After I process the masks with my own DenseCRF code, the metrics are the same as before processing.

the update_predict function defined in func.py?

Hello, is there an issue with your update_predict code? It seems to replace the weights of resnet.conv1_rgb and resnet.bn1_rgb in the model with the resnet.conv1 and resnet.bn1 weights from the pretrain_rgb checkpoint.

error

Hello, what causes the following error when I load the two pre-trained models? The update_predict function seems to be the problem. I would be very grateful for your help.

Metrics cannot be computed

Using the provided evaluation code, maxE and maxF both come out as 0.000. Why is that?

problem

Hello, this is great work. However, when I test on the DAVIS16 dataset with the weights provided in the Baidu Drive link (psw: 36lm), the mean-J and mean-F are only 47.5 and 27.8, respectively. I want to know whether these weights are the final weights.

Hello, sorry to bother you; I have a few questions

  1. Did you train only on DAVIS and FBMS, without the DAVSOD training set?
  2. FBMS is only partially annotated; did you train only on the annotated frames?
  3. I ran your model on DAVSOD/DAVIS: it performs well on Easy35, but only moderately on the smaller DAVIS/FBMS datasets; treating the small datasets as still images even works better :) Have you run into this, or do you have any thoughts? Thanks.

Inference on custom dataset

Thank you for sharing this work.
I would like to know how to run inference on our own dataset, without any ground-truth masks and without OF_FlowNet2 images. We only have frames in which we want to detect moving persons. Thank you in advance.

Evaluation: meanings of the metrics in the output text file

seq_Smeasure:0.135;seq_wFmeasure:0.095;seq_adpFmeasure:0.041;seq_maxF:0.185;seq_meanF:0.121;seq_adpEmeasure:0.249;seq_maxE:0.627;seq_meanE:0.202;seq_MAE:0.819
Above is one line of the output file; some of the metric names can be matched to the paper:
Smeasure    -> Sec. 4.2.2 (5), structure measure
wFmeasure   -> (not found)
adpFmeasure -> (not found)
maxF        -> maximum F-measure
meanF       -> (not found)
adpEmeasure -> (not found)
maxE        -> Sec. 4.2.2 (4), maximum enhanced-alignment measure
meanE       -> (not found)
seq_MAE     -> Sec. 4.2.2, MAE

However, the meanings of the unmatched metrics are still unclear; could you please help me with this? Thank you sincerely!

effectiveness of the PPM

Thanks for your code. I don't quite understand why you plug a PPM into each decoder. In the last decoder, the PPM extracts features at resolutions of 1x1, 2x2, 3x3, and 6x6, yet the input feature is 88x88. These intermediate features extracted by the PPM seem too coarse to facilitate refinement of the input feature. Have you evaluated the effectiveness of the PPM?

finetune cannot load the trained optical-flow and RGB weights

Hi, I used the datasets provided by this project to train the RGB and optical-flow branches separately and obtained weights for both. I then changed the paths to these two checkpoints in func.py, but the weights loaded during finetuning are wrong. I saw an earlier question about this, but that user trained on two GPUs; I train on a single 3090, so it may not be the model.module issue you mentioned. Could you take a look at my bug? Are my trained weights wrong, or is it something else?

RGB and optical-flow .pth files

Hello, could you provide the trained .pth files for the RGB and optical-flow branches? I have recently been training and testing with the code you provided; it is clear and concise. I use two 2080 Ti GPUs. The batch size in your code is 16, but my GPUs do not have enough memory, so I set it to 8. During training I hit a problem at the last line of utils/func.py, model.load_state_dict(model_dict), which reported that the weights did not match. I tried two workarounds: passing strict=False directly, and handling the flow model the same way you handle the RGB model. The best J I obtained is 80.4. I am not sure whether this comes from my GPUs or from my modifications, so I would like to ask whether the trained RGB and optical-flow .pth files are available for the finetune stage, or what might have caused the training gap, so that I can keep training. Many thanks!

Chinese version of the paper

The link to the Chinese version of the paper is broken; could you provide it again?
