
avsbench's Introduction

Audio-Visual Segmentation

This repository provides the PyTorch implementation for the ECCV 2022 paper "Audio-Visual Segmentation". This paper proposes the audio-visual segmentation (AVS) problem and, accordingly, the AVSBench dataset. [Project Page] [arXiv]

Recently, we expanded the AVS task with a more challenging setting: fully-supervised audio-visual semantic segmentation (AVSS), which requires generating semantic masks of the sounding objects. Accordingly, we collected a new AVSBench-semantic dataset. Please refer to our arXiv paper "Audio-Visual Segmentation with Semantics" for more details. [Online Benchmark]


Updates

  • (2023.1.31) The AVSBench-semantic dataset has been released; you can download it from our official benchmark website. Please refer to our arXiv paper for more details on this dataset.
  • (2022.10.18) We have completed the collection and annotation of AVSBench-semantic. Compared to the original AVSBench dataset, it contains ~7k more multi-source videos covering 70 categories, and the ground truths are provided as multi-label semantic maps (labels of the original AVSBench dataset are also updated). We will release it as soon as possible.
  • (2022.7.13) We are preparing the AVSBench-semantic dataset, which will pay more attention to multi-source situations and provide semantic annotations.

Data preparation

1. AVSBench dataset

The AVSBench dataset was first proposed in our ECCV paper. It contains a Single-source subset and a Multi-source subset. Ground truths of these two subsets are binary segmentation maps indicating the pixels of the sounding objects. Recently, we collected a new Semantic-labels subset that provides semantic segmentation maps as labels, and we add it to the original AVSBench dataset as a third subset. For convenience, we denote the original AVSBench dataset as AVSBench-object and the newly added Semantic-labels subset as AVSBench-semantic.

AVSBench-object is used for the Single Sound Source Segmentation (S4) and Multiple Sound Source Segmentation (MS3) settings, while AVSBench-semantic is used for Audio-Visual Semantic Segmentation (AVSS).

The updated AVSBench dataset is available at http://www.avlbench.opennlplab.cn/download. You may request the dataset by email at [email protected]. We will reply as soon as we receive your application.

The downloaded data should be placed in the directory avsbench_data.

2. Pretrained backbones

The pretrained ResNet50/PVT-v2-b5 (vision) and VGGish (audio) backbones can be downloaded from here and placed in the directory pretrained_backbones.

Notice: please update the paths of the data and pretrained backbones in avs_s4/config.py, avs_ms3/config.py, and avss/config.py accordingly, as sketched below.
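For instance, avs_scripts/avss/config.py contains entries like the following (these field names appear in the config excerpt quoted in the issues further down; the S4/MS3 configs have analogous fields, and the backbone entry shown here is a hypothetical illustration rather than a confirmed name):

# Inside avs_scripts/avss/config.py -- point these at your local copies.
cfg.DATA.META_CSV_PATH = "/path/to/avsbench_data/metadata.csv"
cfg.DATA.LABEL_IDX_PATH = "/path/to/avsbench_data/label2idx.json"
cfg.DATA.DIR_BASE = "/path/to/avsbench_data"
# Hypothetical field name; check the actual config for the backbone path.
cfg.TRAIN.PRETRAINED_BACKBONE_PATH = "/path/to/pretrained_backbones/resnet50.pth"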


S4 setting

  • Train AVS Model
cd avs_scripts/avs_s4
bash train.sh
  • Test AVS Model
cd avs_scripts/avs_s4
bash test.sh

MS3 setting

  • Train AVS Model
cd avs_scripts/avs_ms3
bash train.sh
  • Test AVS Model
cd avs_scripts/avs_ms3
bash test.sh

AVSS setting

  • Train AVS model
cd avs_scripts/avss
bash train.sh
  • Test AVS model
cd avs_scripts/avss
bash test.sh

Notably, the AVSS setting can be viewed as an independent task by the research community, i.e., the audio-visual semantic segmentation task. The pretrained AVSS models are available here.


Citation

If you use this dataset or code, please consider citing the following papers:

@inproceedings{zhou2022avs,
  title     = {Audio-Visual Segmentation},
  author    = {Zhou, Jinxing and Wang, Jianyuan and Zhang, Jiayi and Sun, Weixuan and Zhang, Jing and Birchfield, Stan and Guo, Dan and Kong, Lingpeng and Wang, Meng and Zhong, Yiran},
  booktitle = {European Conference on Computer Vision},
  year      = {2022}
}

@article{zhou2023avss,
  title   = {Audio-Visual Segmentation with Semantics},
  author  = {Zhou, Jinxing and Shen, Xuyang and Wang, Jianyuan and Zhang, Jiayi and Sun, Weixuan and Zhang, Jing and Birchfield, Stan and Guo, Dan and Kong, Lingpeng and Wang, Meng and Zhong, Yiran},
  journal = {arXiv preprint arXiv:2301.13190},
  year    = {2023}
}

License

This project is released under the Apache 2.0 license as found in the LICENSE file.

avsbench's People

Contributors

jasongief, opennlplab123, xuyangshen


avsbench's Issues

Is it right to use 'bg_idx = (masks.shape[1] - 1)'?

Hi,
Your work is very interesting and I have a question about your code.
Referring to line 42 in avss/loss.py,
bg_idx = (pred_masks.shape[1] - 1)
you define the background idx as 'pred_masks.shape[1] - 1', which is 70. But after running the code, I think the background idx should be 0, which is also how it is defined in label2idx.json. Is there a problem here?
Thank you in advance.
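A minimal sketch of the discrepancy described above (shapes are assumed; this is not the repo's exact code): with 71 output channels, shape[1] - 1 selects channel 70, while label2idx.json reportedly maps background to index 0.

import torch

pred_masks = torch.randn(4, 71, 224, 224)   # [B, num_classes, H, W]: 70 categories + background
bg_idx_from_code = pred_masks.shape[1] - 1  # 70, as on line 42 of avss/loss.py
bg_idx_from_json = 0                        # background index per label2idx.json
print(bg_idx_from_code, bg_idx_from_json)   # 70 0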

About data loading paths in config file

Hi,
I noticed that the paths in the config file do not match the file structure of the AVSS dataset I received.

Referring to lines 37-42 in config.py:
https://github.com/OpenNLPLab/AVSBench/blob/9b3ef722c0fc86574c1c78dc9ce8819e01e774d4/avs_scripts/avss/config.py#L37-L42

cfg.DATA.META_CSV_PATH = "./avs_released/metadata.csv" #! notice: you need to change the path
cfg.DATA.LABEL_IDX_PATH = "./avs_released/label2idx.json" #! notice: you need to change the path

cfg.DATA.DIR_BASE = "./avs_released" #! notice: you need to change the path
cfg.DATA.DIR_MASK = "../../avsbench_data/v2_data/gt_masks" #! notice: you need to change the path
cfg.DATA.DIR_COLOR_MASK = "../../avsbench_data/v2_data/gt_color_masks_rgb" #! notice: you need to change the path

However, the AVSS dataset I received is structured like this:

AVSBench-semantic
├── v1m
│   └── _19NVGk6Zt8_0
│       ├── frames
│       │   ├── 0.jpg
│       │   └── ...
│       ├── labels_semantic
│       │   ├── 0.png
│       │   └── ...
│       └── audio.wav
├── v1s
│   └── ...
├── v2
│   └── ...
├── label2idx.json
└── metadata.csv
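A hedged sketch of how paths can be derived from the layout above (the helper below is illustrative and not part of the repo's dataloader):

import os

def video_paths(base_dir, subset, video_id):
    # e.g. base_dir="./AVSBench-semantic", subset="v1m", video_id="_19NVGk6Zt8_0"
    root = os.path.join(base_dir, subset, video_id)
    return {
        "frames": os.path.join(root, "frames"),           # 0.jpg, 1.jpg, ...
        "labels": os.path.join(root, "labels_semantic"),  # 0.png, 1.png, ...
        "audio": os.path.join(root, "audio.wav"),
    }

print(video_paths("./AVSBench-semantic", "v1m", "_19NVGk6Zt8_0"))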

Maybe there is a newer version of the code?
Thank you

About training

I downloaded your data, but I only see 12,971 mask images, with no videos and no audio. How can training work with this? Is this normal? Please send a complete, trainable copy of the data; even a small portion would be fine. Thanks: [email protected]

hardware and runtime

Which GPUs, and how many, did you use for training? And how long did training take?

about FREEZE_VISUAL_EXTRACTOR

Hello,
In AVSBench/avs_scripts/avss/config.py, it is set that "cfg.TRAIN.FREEZE_VISUAL_EXTRACTOR = True".
However, I can't find where this parameter is used.
Could you please tell me where the visual extractor's parameters are actually frozen?

Many thanks!

Bug report 'from torchvggish import vggish'

When I cd to 'avs_scripts/avs_s4' and run 'from torchvggish import vggish', it reports the following error message:

Traceback (most recent call last):
  File "/root/miniconda3/envs/avs/lib/python3.7/site-packages/soundfile.py", line 151, in <module>
    raise OSError('sndfile library not found')
OSError: sndfile library not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/root/miniconda3/envs/avs/lib/python3.7/site-packages/soundfile.py", line 178, in <module>
    _snd = _ffi.dlopen(_os.path.join(_path, '_soundfile_data', _packaged_libname))
OSError: cannot load library '/root/miniconda3/envs/avs/lib/python3.7/site-packages/_soundfile_data/libsndfile.so': /root/miniconda3/envs/avs/lib/python3.7/site-packages/_soundfile_data/libsndfile.so: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/yhling/avs/avs_scripts/avs_s4/torchvggish/vggish.py", line 6, in <module>
    from . import vggish_input, vggish_params
  File "/home/yhling/avs/avs_scripts/avs_s4/torchvggish/vggish_input.py", line 27, in <module>
    import soundfile as sf
  File "/root/miniconda3/envs/avs/lib/python3.7/site-packages/soundfile.py", line 189, in <module>
    _snd = _ffi.dlopen(_libname)
OSError: cannot load library 'libsndfile.so': libsndfile.so: cannot open shared object file: No such file or directory

I wonder if there is something I missed. Thank you for your help in advance~
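A hedged note, not an official fix from the authors: this traceback means the libsndfile system library wrapped by the soundfile package is missing. Installing it (e.g. apt-get install libsndfile1 on Debian/Ubuntu, or conda install -c conda-forge libsndfile) and re-running usually resolves it:

# Sanity check that soundfile can now locate libsndfile.
import soundfile as sf
print(sf.__libsndfile_version__)  # prints the loaded libsndfile version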

Raw video dataset

Hi, thank you for providing an awesome dataset in public!
I have downloaded the AVSBench-semantic dataset, but would it be possible to provide the raw videos instead of the preprocessed frames/audio?

Thank you in advance.

IoU & F1-score computation

Hi, first of all, thanks for your wonderful work!

After reading your code, I found something interesting that I cannot understand.

In your IoU calculation, totally black (all-background) GTs are also included, so the result for such frames is effectively the ratio of the predicted area to the whole image. However, when calculating the F1-score, the totally black GTs are excluded. Is there an explanation for why IoU and F1-score are computed differently? (A sketch of this effect appears after this message.)

Thanks for your amazing work again.
Looking forward to your reply!
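A minimal sketch of the effect described in this issue (assuming binary masks; this is not the repo's exact implementation): a frame whose GT is entirely background yields an IoU near 0 whenever the model predicts any foreground, which pulls the averaged metric down.

import torch

def binary_iou(pred, gt, eps=1e-7):
    # pred, gt: binary [H, W] masks
    inter = (pred * gt).sum()
    union = ((pred + gt) > 0).float().sum()
    return float((inter + eps) / (union + eps))

pred = torch.zeros(224, 224)
pred[50:100, 50:100] = 1           # some predicted foreground
gt_empty = torch.zeros(224, 224)   # totally black GT
print(binary_iou(pred, gt_empty))  # ~0: the empty-GT frame drags the mean IoU down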

Changing the data set

Hi, firstly, thanks for the code. Secondly, I want to modify your data pipeline to change the resolution of the images and segmentation masks from the original 224 to 640. Which parts need to be modified?
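A sketch of the kind of change involved (illustrative; locate the actual resize in the repo's dataloader and any size constant in config.py). One point worth noting: masks must be resized with nearest-neighbor interpolation so the label values stay intact.

import torchvision.transforms as T

img_transform = T.Compose([
    T.Resize((640, 640)),
    T.ToTensor(),
])
mask_transform = T.Compose([
    T.Resize((640, 640), interpolation=T.InterpolationMode.NEAREST),
    T.PILToTensor(),  # keeps integer label values (no 0-1 rescaling)
])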

Labeling noise?

Dear authors,

I have a minor question about the annotation in the dataset.
When I debugged the code, I found some inconsistencies between the ground-truth annotation and the audio label in the metadata.

For example, in the video ID 2Sg6Yq5S9Bo_0_5000, the annotated audio object is a guitar, and I manually checked the RGB-labeled image to confirm that only the guitar is annotated.
However, when I loaded that image, I found that a few pixels were annotated with other labels (the same holds for the index-labeled image):


>>> from PIL import Image
>>> import numpy as np
>>> pil = Image.open('0.png').convert('RGB')
>>> data = np.array(pil)
>>> np.unique(data.reshape(-1, data.shape[2]), return_counts=True, axis=0)
(array([[  0,   0,   0],
       [128,   0,   0],
       [128,  64,  64]], dtype=uint8), array([230097,      8,  61495]))

[128, 0, 0] corresponds to the category baby. However, there is no baby in either the image or the audio.
It is not only this video; several videos have these inconsistent annotations.

Please correct me if I misunderstood something!

Thank you in advance.

RuntimeError: shape '[-1, 5, 128]' is invalid for input of size 384

I got this error during training; it seems something is wrong with audio_feature.
Traceback (most recent call last):
  File "train.py", line 227, in <module>
    output, _, _ = model(imgs, audio_feature) # [bs*5, 1, 224, 224]
  File "/home/lsc/anaconda3/envs/avs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lsc/anaconda3/envs/avs/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 167, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/lsc/anaconda3/envs/avs/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 177, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/lsc/anaconda3/envs/avs/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/lsc/anaconda3/envs/avs/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/lsc/anaconda3/envs/avs/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/lsc/anaconda3/envs/avs/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lsc/AVSBench/avs_scripts/avs_s4/model/HRNet_AVSModel.py", line 249, in forward
    conv_feat_va, a_fea = self.tpavi_va(feature_map_list[i], audio_feature, stage=i)
  File "/home/lsc/AVSBench/avs_scripts/avs_s4/model/HRNet_AVSModel.py", line 210, in tpavi_va
    audio = audio.view(-1, 5, audio.shape[-1]) # [B, T, 128]
RuntimeError: shape '[-1, 5, 128]' is invalid for input of size 384

The audio features are torch.Size([3, 128]) and torch.Size([2, 128]), but they should be [B*5, 128]. What causes the error?
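A hedged observation: 384 = 3 * 128, so the batch contains only 3 one-second audio clips where view(-1, 5, 128) expects a multiple of 5; this typically happens when a video is shorter than 5 seconds. One possible workaround (an assumption, not an official fix) is padding short feature sequences before batching:

import torch

def pad_audio_feature(feat, T=5):
    # feat: [t, 128] VGGish features for one video; pad (or trim) to [T, 128].
    t = feat.shape[0]
    if t < T:
        feat = torch.cat([feat, feat[-1:].repeat(T - t, 1)], dim=0)  # repeat the last clip
    return feat[:T]

print(pad_audio_feature(torch.randn(3, 128)).shape)  # torch.Size([5, 128])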

About Detect

Do you have code to run the model on an arbitrary video? That is: input a video, automatically sample 5 frames (since your inference needs 5 frames), generate its mel spectrogram, split the video into audio and image frames, feed them to the model, and visualize the resulting masks. This is not very difficult, nor is it a secret. Have you done it, and could you share it? I have already implemented it, but my version is very complicated and not real-time.
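A rough sketch of the frame-sampling step of such a demo (assumptions throughout: 224-pixel inputs, uniform sampling, no normalization; the audio would still need to be extracted, e.g. with ffmpeg, and passed through VGGish as in the training scripts):

import cv2
import numpy as np
import torch

def sample_frames(video_path, n=5, size=224):
    # Uniformly sample n frames from the video and stack them as [n, 3, H, W].
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in np.linspace(0, total - 1, n).astype(int):
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV loads BGR
        frames.append(cv2.resize(frame, (size, size)))
    cap.release()
    return torch.from_numpy(np.stack(frames)).permute(0, 3, 1, 2).float() / 255.0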

Performance difference

I re-ran the experiments with 5 random seeds based on your open-source code, but failed to reproduce the performance reported in your paper.

Concretely, in the paper:
[screenshot: results table from the paper]

While in my experiment results:
[screenshot: my results]

I wonder if I missed something, or if the difference is due to the version of a specific dependency package.

Thank you for your reply and time!

avss dataset

The AVSS dataset is not organized into folders the way S4 and MS3 are (e.g., one directory with all GT masks and one with all visual frames); instead, it uses one folder per video. How should I use it?

Some questions for your setting

Thanks so much for sharing such amazing datasets! @jasongief @OpenNLPLab123
We are trying to follow your work and have some questions about the setting.

1) In segmentation, previous datasets don't change the resolution/aspect ratio, because:

  1. forcing the image into one size (224, in your case) can destroy the semantic meaning of the input data, and
  2. it also reduces the project's practicability in real-life challenges, as the pixel-wise classifier is trained in a different domain.

For example, Pascal VOC12 and Cityscapes both provide the images and masks at their original resolution. I'm wondering whether you could provide them in the original size (as in VGG-Sound) as an additional source.

2) The traditional segmentation task is multi-class pixel-wise classification, whereas your evaluation is based on binary labels, which is more like salient object detection or foreground segmentation. Please give us some advice.

Best regards,
Yuyuan Liu

visual_backbone frozen issue

Hello,
I found that the visual backbone might not be frozen correctly. When I added the following to train.py:
for name, param in model.named_parameters():
    print(str(name) + ':' + str(param.requires_grad))
the output showed requires_grad=True for the backbone parameters:
[screenshot of the requires_grad printout]

I used the following code to correct the problem:
for name, param in model.named_parameters():
    if "encoder_backbone" in name:
        param.requires_grad = False

I was not too sure about this, and hope to hear from you~
Thx a lot!!!

Question on loss AVM-AV

Hi! The AVM-AV loss formula in the paper is hard for me to understand, so I checked the source code. However, the code seems quite different from what is presented in the paper. Could you please clarify this?
