
referformer's Introduction

Hi there 👋


referformer's People

Contributors

wjn922


referformer's Issues

Question of Pretraining on RefCOCO/+/g train/val.json

Hi! I'm so glad that you released the pretraining code for the RefCOCO datasets.
In datasets/refexp.py, lines 168/169, you divide the original json files into train/val splits.
However, I downloaded the json files from refer, and they use several different split modes like 'unc', 'berkeley', and 'google'.
So could you please share your RefCOCO/+/g train/val.json files?

Thanks a lot!
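For illustration, here is a minimal sketch of the kind of train/val split being discussed, assuming a single COCO-format annotation file whose images carry a per-image "split" field (the field name and file layout are assumptions, not the authors' actual format):

import json

def split_refexp_json(path):
    # Hypothetical sketch: split one COCO-format RefCOCO json into train/val by a "split" field.
    with open(path) as f:
        coco = json.load(f)
    for split in ("train", "val"):
        images = [img for img in coco["images"] if img.get("split") == split]
        image_ids = {img["id"] for img in images}
        anns = [a for a in coco["annotations"] if a["image_id"] in image_ids]
        out = {"images": images, "annotations": anns, "categories": coco.get("categories", [])}
        with open(path.replace(".json", f"_{split}.json"), "w") as g:
            json.dump(out, g)

split_refexp_json("refcoco_unc.json")  # hypothetical filename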

Reproducing training, recommended hardware setup

Hi,

Thanks for releasing the great code!
I'm trying to reproduce the training, for now I start with the pre-trained model and just do the fine-tuning on YouTube-VOS. What is the recommended number of GPUs and what run-time should I expect?
Is the currently released code able to support multi-node training?
So far, I was able to run the YouTube-VOS training on a single machine with 4x A100 GPUs, and it took ~2 hours per epoch, so 12 hours in total.
Please let me know the recommended hardware setup for the YouTube-VOS fine-tuning and also for the pre-training step (I think not all code for this is released yet?).
And if I use a different number of GPUs, can I expect the same result quality with just a longer run-time, or might this lead to problems/worse results?

Thank you!

Best,

Paul

Size mismatch when loading state_dict from checkpoint

Hi Jonas,
If I add args to inference_davis.py and run the .py file directly, a size-mismatch error occurs when loading the state_dict from the checkpoint.

I add the following args to inference_davis.py and run it:
args.output_dir = './output'
args.resume = './pretrained_weights/ytvos_r50.pth'
args.backbone = 'resnet50'
Error message:
RuntimeError: Error(s) in loading state_dict for ReferFormer:
missing_keys, unexpected_keys = model_without_ddp.load_state_dict(checkpoint['model'], strict=False)
size mismatch for class_embed.0.weight: copying a param with shape torch.Size([1, 256]) from checkpoint, the shape in current model is torch.Size([78, 256]).
size mismatch for class_embed.0.bias: copying a param with shape torch.Size([1]) from checkpoint, the shape in current model is torch.Size([78]).

However, if I run dist_test_davis.sh with the same argument values, no error occurs:
./scripts/dist_test_davis.sh ./output ./pretrained_weights/r50_pretrained.pth --backbone resnet5

Thanks a lot!
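A quick way to see exactly which parameters disagree is to compare the checkpoint against the freshly built model. A minimal diagnostic sketch follows; model_without_ddp and args are the variables used in inference_davis.py, and the guess that flags such as --binary change the size of the classification head is an assumption:

import torch

def report_shape_mismatches(model, ckpt_path):
    # Print every parameter whose shape differs between the checkpoint and the current model.
    ckpt_state = torch.load(ckpt_path, map_location='cpu')['model']
    for name, param in model.state_dict().items():
        if name in ckpt_state and ckpt_state[name].shape != param.shape:
            print(name, tuple(ckpt_state[name].shape), '(checkpoint) vs', tuple(param.shape), '(model)')

report_shape_mismatches(model_without_ddp, args.resume)  # called inside inference_davis.py after building the model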

CUDA OOM

Platform: Windows 10, Anaconda, RTX 2080 (8 GB)

python inference_davis.py --with_box_refine --binary --freeze_text_encoder --output_dir davis_dirs/resnet50 --resume ckpt/ytvos_r50.pth --backbone resnet50 --ngpu 1

Inference only supports for batch size = 1
Namespace(a2d_path='data/a2d_sentences', aux_loss=True, backbone='resnet50', backbone_pretrained=None, batch_size=1, bbox_loss_coef=5, binary=True, cache_mode=False, clip_max_norm=0.1, cls_loss_coef=2, coco_path='data/coco', controller_layers=3, dataset_file='davis', davis_path='data/ref-davis', dec_layers=4, dec_n_points=4, device='cuda', dice_loss_coef=5, dilation=False, dim_feedforward=2048, dist_url='env://', dropout=0.1, dynamic_mask_channels=8, enc_layers=4, enc_n_points=4, eos_coef=0.1, epochs=10, eval=False, focal_alpha=0.25, freeze_text_encoder=True, giou_loss_coef=2, hidden_dim=256, jhmdb_path='data/jhmdb_sentences', lr=0.0001, lr_backbone=5e-05, lr_backbone_names=['backbone.0'], lr_drop=[6, 8], lr_linear_proj_mult=1.0, lr_linear_proj_names=['reference_points', 'sampling_offsets'], lr_text_encoder=1e-05, lr_text_encoder_names=['text_encoder'], mask_dim=256, mask_loss_coef=2, masks=True, max_size=640, max_skip=3, ngpu=1, nheads=8, num_feature_levels=4, num_frames=5, num_queries=5, num_workers=4, output_dir='davis_dirs/resnet50', position_embedding='sine', pre_norm=False, pretrained_weights=None, rel_coord=True, remove_difficult=False, resume='ckpt/ytvos_r50.pth', seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_dice=5, set_cost_giou=2, set_cost_mask=2, split='valid', start_epoch=0, threshold=0.5, two_stage=False, use_checkpoint=False, visualize=False, weight_decay=0.0005, with_box_refine=True, world_size=1, ytvos_path='data/ref-youtube-vos')
Start inference
processor 0: 0% 0/30 [00:00<?, ?it/s]Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.bias', 'lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.dense.bias']

  • This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    number of params: 51394175
    D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch\nn\functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ..\c10/core/TensorImpl.h:1156.)
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
    D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
    To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ..\aten\src\ATen\native\BinaryOps.cpp:467.)
    return torch.floor_divide(self, other)
    Traceback (most recent call last):
    File "inference_davis.py", line 330, in
    main(args)
    File "inference_davis.py", line 103, in main
    p.run()
    File "D:\DevelopTools\anaconda3\envs\dlenv\lib\multiprocessing\process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
    File "inference_davis.py", line 224, in sub_processor
    outputs = model([imgs], [exp], [target])
    File "D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
    File "D:\SourceCodes\Transformers\ReferFormer\models\referformer.py", line 286, in forward
    self.transformer(srcs, text_embed, masks, poses, query_embeds)
    File "D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
    File "D:\SourceCodes\Transformers\ReferFormer\models\deformable_transformer.py", line 170, in forward
    memory = self.encoder(src_flatten, spatial_shapes, level_start_index, valid_ratios, lvl_pos_embed_flatten, mask_flatten)
    File "D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
    File "D:\SourceCodes\Transformers\ReferFormer\models\deformable_transformer.py", line 291, in forward
    output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
    File "D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
    File "D:\SourceCodes\Transformers\ReferFormer\models\deformable_transformer.py", line 261, in forward
    src = self.forward_ffn(src)
    File "D:\SourceCodes\Transformers\ReferFormer\models\deformable_transformer.py", line 248, in forward_ffn
    src2 = self.linear2(self.dropout2(self.activation(self.linear1(src))))
    File "D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
    File "D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch\nn\modules\linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
    File "D:\DevelopTools\anaconda3\envs\dlenv\lib\site-packages\torch\nn\functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
    RuntimeError: CUDA out of memory. Tried to allocate 1.43 GiB (GPU 0; 8.00 GiB total capacity; 3.75 GiB already allocated; 691.50 MiB free; 5.43 GiB reserved in total by PyTorch)
    processor 0: 0% 0/30 [00:23<?, ?it/s]

How much memory is required at minimum to run this?
Or which parameters can be modified to reduce the memory overhead?

Thanks!
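Not an official answer, but two generic PyTorch memory savers apply to an inference loop like the one in the traceback above; the printed Namespace also shows knobs such as max_size=640 and num_frames=5 that look like natural candidates to lower (an assumption, not verified against this repo):

import torch

with torch.no_grad():                      # no autograd buffers during inference
    with torch.cuda.amp.autocast():        # optional: mixed precision roughly halves activation memory
        outputs = model([imgs], [exp], [target])   # the call shown in the traceback
torch.cuda.empty_cache()                   # release cached blocks between videos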

How do you get the pretrained model?

Thanks for your great work. I would like to know how you obtained the pretrained models, like video_swin_tiny_pretrained.pth. In my understanding, this is different from joint training with the Ref-COCO/+/g datasets.

Joint training hyperparameters

Hi,

Thank you for sharing your work. I am writing to ask about the hyperparameters used for joint training. The arXiv paper states that joint training uses 32 V100 GPUs and 2 video clips per GPU.

I assume this means 32GB V100 GPUs, but I don't think it's possible to fit 2 video clips within 32GB of memory. I cannot reproduce the result using 8 V100 32GB GPUs with 1 clip per GPU. Could you give me some advice? Thank you!

When I run inference_ytvos.py, I get an error

I ran python3 inference_ytvos.py --with_box_refine --binary --freeze_text_encoder --output_dir=ytvos_dirs/swin_tiny --resume=ytvos_swin_tiny.pth --backbone swin_t_p4w7 and got the following error:

RuntimeError: Error(s) in loading state_dict for ReferFormer:

size mismatch for backbone.0.body.patch_embed.proj.weight: copying a param with shape torch.Size([96, 3, 1, 4, 4]) from checkpoint, the shape in current model is torch.Size([96, 3, 4, 4]).
size mismatch for backbone.0.body.layers.0.blocks.0.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 3]) from checkpoint, the shape in current model is torch.Size([169, 3]).
size mismatch for backbone.0.body.layers.0.blocks.0.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.0.blocks.1.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 3]) from checkpoint, the shape in current model is torch.Size([169, 3]).
size mismatch for backbone.0.body.layers.0.blocks.1.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.1.blocks.0.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 6]) from checkpoint, the shape in current model is torch.Size([169, 6]).
size mismatch for backbone.0.body.layers.1.blocks.0.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.1.blocks.1.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 6]) from checkpoint, the shape in current model is torch.Size([169, 6]).
size mismatch for backbone.0.body.layers.1.blocks.1.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.2.blocks.0.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 12]) from checkpoint, the shape in current model is torch.Size([169, 12]).
size mismatch for backbone.0.body.layers.2.blocks.0.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.2.blocks.1.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 12]) from checkpoint, the shape in current model is torch.Size([169, 12]).
size mismatch for backbone.0.body.layers.2.blocks.1.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.2.blocks.2.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 12]) from checkpoint, the shape in current model is torch.Size([169, 12]).
size mismatch for backbone.0.body.layers.2.blocks.2.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.2.blocks.3.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 12]) from checkpoint, the shape in current model is torch.Size([169, 12]).
size mismatch for backbone.0.body.layers.2.blocks.3.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.2.blocks.4.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 12]) from checkpoint, the shape in current model is torch.Size([169, 12]).
size mismatch for backbone.0.body.layers.2.blocks.4.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.2.blocks.5.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 12]) from checkpoint, the shape in current model is torch.Size([169, 12]).
size mismatch for backbone.0.body.layers.2.blocks.5.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.3.blocks.0.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 24]) from checkpoint, the shape in current model is torch.Size([169, 24]).
size mismatch for backbone.0.body.layers.3.blocks.0.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).
size mismatch for backbone.0.body.layers.3.blocks.1.attn.relative_position_bias_table: copying a param with shape torch.Size([2535, 24]) from checkpoint, the shape in current model is torch.Size([169, 24]).
size mismatch for backbone.0.body.layers.3.blocks.1.attn.relative_position_index: copying a param with shape torch.Size([392, 392]) from checkpoint, the shape in current model is torch.Size([49, 49]).

About sentence feature

Hello, the paper states that the sentence feature is obtained by pooling the text features. However, when reading your code, I saw that the sentence feature actually comes from the pooler_output of the RoBERTa model.
According to https://huggingface.co/transformers/v2.9.1/model_doc/roberta.html#robertamodel, the pooler_output has a different meaning than pooling over the text features.
Have you tried actually pooling the text features to get the sentence-level feature? Is it worse than your current approach?

Thank you
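For reference, a minimal sketch of the two sentence-feature options being compared, using the standard HuggingFace RobertaModel outputs (the example sentence is made up):

import torch
from transformers import RobertaTokenizerFast, RobertaModel

tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

inputs = tokenizer(['the person on the left riding a bike'], return_tensors='pt')
out = model(**inputs)

sent_from_pooler = out.pooler_output                    # [1, 768]: dense + tanh over the <s> token hidden state
mask = inputs['attention_mask'].unsqueeze(-1).float()
sent_from_pooling = (out.last_hidden_state * mask).sum(1) / mask.sum(1)   # masked mean over all tokens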

Finetuning on ref-davis17?

Nice work. The paper says that 'most of our experiments follow the pretrain-then-finetune process.' However, the GitHub README says 'As described in the paper, we report the results using the model trained on Ref-Youtube-VOS without finetune.'

Did you finetune the pre-trained model on Ref-DAVIS17?

Vision Language Early fusion

I noticed that the paper does not mention the early-fusion module that appears in the code, where it is applied before the Transformer. I think this simple module contributes a lot to the result. Can you explain this?

Thank you!

self.fusion_module = VisionLanguageFusionModule(d_model=hidden_dim, nhead=8)

src_proj_l = self.fusion_module(tgt=src_proj_l,
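For readers, a hedged sketch of what a cross-attention fusion block of this kind can look like; this is not the authors' exact implementation, and the residual connection is an assumption:

import torch
from torch import nn

class VisionLanguageFusionSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead)

    def forward(self, tgt, memory, memory_key_padding_mask=None):
        # tgt: flattened visual features [HW, B, C]; memory: text token features [L, B, C]
        out, _ = self.attn(query=tgt, key=memory, value=memory,
                           key_padding_mask=memory_key_padding_mask)
        return tgt + out  # residual connection (assumption)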

Evaluation issue on A2D-Sentences

Hi,

I think the mAP calculation on A2D-Sentences in this repo has an issue.

When evaluating the model on A2D-Sentences, all five (i.e., num_queries) predictions per frame are saved to calculate mAP (55.0 mAP with Video Swin-B). But saving all predictions seems unreasonable, as only the best-score prediction is the referring object mask. When I save only the best-score prediction, I get just ~51.9 mAP.
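A minimal sketch of the alternative evaluation the comment describes, keeping only the highest-scoring query per frame before computing mAP (tensor shapes are assumptions):

import torch

def keep_best_query(pred_scores, pred_masks):
    # pred_scores: [num_queries]; pred_masks: [num_queries, H, W]
    best = pred_scores.argmax()
    return pred_scores[best:best + 1], pred_masks[best:best + 1]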

Is the CUDA operator compilation necessary?

Dear author:
Thank you for your great work. I have encountered some problems in compiling the CUDA operator. Is it necessary for running the code? Thank you in advance! Your reply is much appreciated!

No meta.json in ref-youtube-vos dataset

Hi, I downloaded and unzipped the youtube_vos datasets as instructed, but there is no meta.json in the data/ref-youtube-vos/train folder, and the code needs it. What should I do? Thanks.

How to improve the results of inference/training

Since I use our campus server, the CUDA version is 10.1 and cannot be changed. Everything else follows the installation instructions. I ran the following command for inference:

srun -u --gres=gpu:1 -c 4 python3 inference_ytvos.py --with_box_refine --binary --freeze_text_encoder --output_dir ytvos_dirs/resnet_50 --resume r50_refytvos_finetune.pth --backbone resnet50 --visualize --ytvos_path /nfs/data3/shuaicong/ref-youtube-vos --batch_size 1

The results look bad: even in the valid folder, most of the outputs are white images. Also, for some videos 4 or 5 folders were generated, while for others there were maybe just two. As a result, the evaluation scores are undoubtedly very low: J was 0.106 and F was 0.087.
Q1: I want to confirm that the values in the paper and on the competition server correspond: the paper reports two-digit numbers while the evaluation gives three decimal places, so the paper's numbers are the same values expressed as percentages, right?
Q2: Please check whether the command is correct, and are there any ways to improve the inference performance? I have no idea how to improve the results if both the installation and the inference command are okay...
Q3: Should I use --resume r50_refytvos_finetune.pth or --resume r50_pretrain.pth? I ran both and got similar results.
Thank you in advance!

Frame-based training

Hi authors~ I noticed that the paper mentions putting the temporal dimension into the batch dimension and extracting visual features independently. So I think the whole model is essentially frame-based during training, except when using Video Swin as the backbone. Is my understanding right?
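As a reference for the reshaping described above, a small sketch of folding the temporal dimension into the batch dimension for per-frame feature extraction (backbone2d is a hypothetical image backbone, not a function from this repo):

import torch

video = torch.randn(2, 5, 3, 360, 640)          # [B, T, C, H, W]
b, t = video.shape[:2]
frames = video.flatten(0, 1)                    # [B*T, C, H, W]: the backbone sees frames independently
feats = backbone2d(frames)                      # hypothetical per-frame backbone call
feats = feats.view(b, t, *feats.shape[1:])      # restore the temporal dimension afterwards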

Pretraining on RefCOCO Dataset

Hi Jonas,

Thanks for your contribution! I noticed that this work supports pretraining on the RefCOCO dataset. From the code, it seems that a COCO-format RefCOCO dataset is used for data loading. Could you provide some information about the pretraining process and the dataset conversion tools for the RefCOCO dataset? Thanks in advance!
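In case it helps other readers, a hedged sketch of one possible RefCOCO-to-COCO-format conversion using the original refer toolkit (https://github.com/lichengunc/refer); the output field layout is an assumption and not necessarily the format this repo expects:

import json
from refer import REFER  # from the refer toolkit

refer = REFER('data/refcoco', dataset='refcoco', splitBy='unc')
for split in ('train', 'val'):
    images, annotations = [], []
    for ref in refer.loadRefs(refer.getRefIds(split=split)):
        ann = refer.loadAnns(ref['ann_id'])[0]
        img = refer.loadImgs(ref['image_id'])[0]
        for sent in ref['sentences']:
            # one image entry per expression, mirroring the coco-format refexp style (assumption)
            images.append({'id': len(images), 'file_name': img['file_name'],
                           'height': img['height'], 'width': img['width'],
                           'caption': sent['sent'], 'original_id': img['id']})
            annotations.append({'id': len(annotations), 'image_id': len(images) - 1,
                                'bbox': ann['bbox'], 'segmentation': ann['segmentation'],
                                'area': ann['area'], 'category_id': 1, 'iscrowd': 0})
    with open(f'refcoco_{split}.json', 'w') as f:
        json.dump({'images': images, 'annotations': annotations,
                   'categories': [{'id': 1, 'name': 'object'}]}, f)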

[Error about finetuning on Ref-Youtube-VOS] No "dataset_name" key in the "targets"

Hello. First of all, I am very grateful that you shared your wonderful work on GitHub.

I was trying to fine-tune your model on the Ref-Youtube-VOS dataset with the video_swin_t_p4w7 backbone.
I followed the sample script you mentioned in the following link:
(./scripts/dist_train_test_ytvos.sh ytvos_dirs/video_swin_tiny pretrained_weights/video_swin_tiny_pretrained.pth --backbone video_swin_t_p4w7)

I bumped into the following error:
(screenshot of the error traceback)

This error comes from this part:

(screenshot of the code where the error is raised)

That's because there is no key named "dataset_name" in the "target" variable defined in the __getitem__ function (see https://github.com/wjn922/ReferFormer/blob/9c8f237adc260c512a1c5ecfc7aee81b8282649a/datasets/ytvos.py#L166).

Here is the debugging output of the target value at the error prompt:

(screenshot of the debugging output)

Am I missing something?
I could not find a way to solve this error.
I would really appreciate any hints.
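Not the authors' fix, but a sketch of the two obvious workarounds for the missing key (the 'ytvos' default value is an assumption):

# either add the key when the target dict is built in datasets/ytvos.py __getitem__ ...
target['dataset_name'] = 'ytvos'
# ... or read it defensively at the line where the error is raised:
dataset_name = target.get('dataset_name', 'ytvos')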

Initialized weight for video-swin-tiny pretraining

Hi,

I encountered some issues when pretraining the video-swin-tiny backbone. Do you use the Kinetics-pretrained weights to initialize video-swin-tiny for pretraining on RefCOCO/+/g? You mentioned in another issue that you use an ImageNet-pretrained model, but I found that the ImageNet-pretrained model does not fit the current video-swin-tiny setting. Could you give me some advice? Thank you!

Training on DAVIS dataset

Hi, I am having a CUDA OOM issue when training on the DAVIS dataset. May I know what could solve it? The following is the error output:

Size of attn: torch.Size([120, 24, 49, 49])
Size of relative_position_bias.unsqueeze(0): torch.Size([1, 24, 49, 49])

Size of attn: torch.Size([120, 24, 49, 49])
Size of v: torch.Size([120, 24, 49, 32])

Size of attn: torch.Size([120, 24, 49, 49])
Size of v: torch.Size([120, 24, 49, 32])
Traceback (most recent call last):
File "main.py", line 250, in
main(args)
File "main.py", line 200, in main
train_stats = train_one_epoch(
File "/home/fyp-student/PycharmProjects/ReferringImageSegmentationTraining/ReferFormer/engine.py", line 43, in train_one_epoch
outputs = model(samples, captions, targets) #forward() in Referformer Class under referformer.py
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fyp-student/PycharmProjects/ReferringImageSegmentationTraining/ReferFormer/models/referformer.py", line 207, in forward
features, pos = self.backbone(samples)
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fyp-student/PycharmProjects/ReferringImageSegmentationTraining/ReferFormer/models/swin_transformer.py", line 680, in forward
xs = self[0](tensor_list)
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fyp-student/PycharmProjects/ReferringImageSegmentationTraining/ReferFormer/models/swin_transformer.py", line 644, in forward
xs = self.body(tensor_list.tensors)
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fyp-student/PycharmProjects/ReferringImageSegmentationTraining/ReferFormer/models/swin_transformer.py", line 621, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fyp-student/PycharmProjects/ReferringImageSegmentationTraining/ReferFormer/models/swin_transformer.py", line 406, in forward
x = blk(x, attn_mask)
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fyp-student/PycharmProjects/ReferringImageSegmentationTraining/ReferFormer/models/swin_transformer.py", line 248, in forward
attn_windows = self.attn(x_windows, mask=attn_mask)  # nW*B, window_size*window_size, C
File "/home/fyp-student/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fyp-student/PycharmProjects/ReferringImageSegmentationTraining/ReferFormer/models/swin_transformer.py", line 162, in forward
attn_v = torch.matmul(attn, v)
RuntimeError: CUDA out of memory. Tried to allocate 18.00 MiB (GPU 0; 10.75 GiB total capacity; 8.61 GiB already allocated; 21.69 MiB free; 8.97 GiB reserved in total by PyTorch)

I tried to run the line: ./scripts/dist_train_test_davis.sh davis_dirs/swin_large pretrained_weights/swin_large_pretrained.pth --backbone swin_l_p4w7.

The CUDA OOM error pops up after a few iterations of the matrix multiplication between attn and v. May I know what could solve this issue? Thank you.

@iFighting @wjn922

Performance of joint train with Video-Swin-B

Hi! We ran the following command with the scripts/dist_joint_train_ytvos.sh you published for joint training:
./scripts/dist_joint_train_ytvos.sh ytvos_dirs/output --backbone video_swin_b_p4w7 --backbone_pretrained pretrained_weights/swin_base_patch244_window877_kinetics400_22k.pth --batch_size 1
but we still cannot reach the corresponding performance reported in the paper (64.9). Could you tell us what might cause this?

About HMDB51 Dataset

Thank you for your great work.
Could you provide the HMDB51 dataset? I can't download it from the official channels. Thanks.

None

Could you please provide a link to DAVIS-2017_semantics-480p.zip? I haven't found it.

CPU Memory increasing when training

Hi! I'm running the pretraining process on the RefCOCO* datasets,
but the CPU memory usage increases from 18GB to 22GB after 1 epoch.
I'm worried that after a few more epochs the CPU memory will hit OOM (my machine has 32GB of RAM).
Do you have any idea about this?

Where to get the test meta file?

Hi,

thanks for releasing the great code!
I understand that there have been some changes in the split of the validation and test set, and it seems that the meta file for the test set is also no longer available for download on CodaLab.
Hence, I get an error in this line:

test_meta_file = os.path.join(root, "meta_expressions", "test", "meta_expressions.json")

What is the best way to deal with this? Maybe you could please share the json meta file?

Best,

Paul

RefCOCO/+/g json files

Thanks for your great work!
Can you provide the RefCOCO/+/g json files? I didn't find them in this repository.


What does "Pretrain" mean in the released checkpoints?

Hi,

There are two columns for each result, i.e., "Pretrain" and "Model". The model denotes the final weights after pretraining and main training. May I know what "Pretrain" means? Does it mean the model weights after pretraining on the Ref-COCO datasets?

Thanks.

Backbone | J&F | CFBI J&F | Pretrain | Model | Submission | CFBI Submission

youtube_vos: No test meta_expressions.json

Hi, I downloaded the youtube_vos datasets from the provided website link, but there is no test meta_expressions.json in the meta_expressions folder, and the code needs it. What should I do? Thanks.

The performance of Video Swin Transformer Base as a backbone on the Ref-Youtube-VOS dataset without pretraining on RefCOCO

Thank you for sharing such excellent work. I would like to ask if you have tested the Video Swin Transformer Base as a backbone on the Ref-Youtube-VOS dataset without pretraining on RefCOCO? The results I obtained using your code seem to be similar to those with Video Swin Tiny.

I'm unsure of the cause. It's possible there are some bugs, or the Ref-Youtube-VOS dataset might be too small for effectively fine-tuning the Video Swin Transformer Base.

Thank you for your attention!

How to use CFBI as post-process

I saw you are using CFBI as post-processing. Could you provide more implementation details? For example, are all of the predicted results involved in post-processing, and how are they loaded into CFBI?

Issue w.r.t pretraining models

Thank you for releasing the codes of ReferFormer and the following update of the pretraining code.

Could you please also release the scripts for the pre-training process? I have tried to use the hyperparameters mentioned in the paper (like the multi-step LR scheduler). However, the released code uses a step LR scheduler rather than a multi-step one, and my run got stuck and failed. So I'm wondering whether the released pretraining code needs a special setup. Pretraining consumes a lot of compute, and I don't want to waste any GPU hours. It would be appreciated if you could help with this.

Thanks in advance.
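For context, a minimal sketch of the multi-step schedule mentioned above, using the stock PyTorch scheduler together with the lr, weight_decay, and lr_drop values visible in the argument dump earlier on this page; whether this matches the authors' actual pretraining setup is an assumption:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=5e-4)  # model built as in main.py
lr_scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6, 8], gamma=0.1)  # gamma is an assumption

for epoch in range(0, 10):
    # train_one_epoch(model, criterion, data_loader, optimizer, device, epoch, ...)
    lr_scheduler.step()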

High variance of the reproduced results

Thanks for your good work.
I tested both environments (cu113+torch1.10 and cu111+torch1.8.1), but the reproduced results of the ResNet-50 ReferFormer with the provided pretrained model suffer from high variance. Could you please share any ideas?

Problem with image pretraining during evaluation

Thank you for your great work.
When I ran image pretraining, line 85 in engine.py raised an error during evaluation:
res = {target['image_id'].item(): output for target, output in zip(targets, results)},
because targets = utils.targets_to(targets, device) filtered out the image_id key.
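A minimal sketch of a targets_to variant that keeps non-tensor entries such as image_id (this is not the repo's utils.targets_to, just an illustration of the kind of change being described):

import torch

def targets_to(targets, device):
    # Move tensor values to the device and keep everything else (e.g. image_id) untouched.
    return [{k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in t.items()}
            for t in targets]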

Mismatch between implementation and conceptual explanation

Hi All,

I have read your paper and find it quite interesting. However, I have a couple of questions about ReferFormer for better understanding. In referformer.py, lines 235 to 280 compute cross-modal attention before feeding the features to the deformable transformer, but this module is missing in Figure 2. What is its purpose? Also, the deformable transformer contains a transformer encoder network. Is this the same transformer encoder block (blue) shown in Figure 2? From the code, it looks like there are two transformer encoders. Please clarify.

Thank You,
Raj

can you provide log files of pretraining on refcoco/g/+ dataset?

Hi, thank you for your wonderful work! I am trying to pretrain ReferFormer using your main_pretrain.py on the refcoco/+/g datasets, but the results don't seem to converge (the values of the major metrics are very low). Could you release some logs of the pretraining process with resnet101 or a Swin Transformer backbone?

Joint Training Settings

Thank you for your great work!
I ran into a problem when reproducing your joint-training result. Do any backbone weights need to be loaded during joint training, e.g. video-swin-b-kinetics400-22k? I noticed that the published joint-training script has no option to load any weights. Thanks!

Question about Ref-DAVIS evaluation

Hi, thanks for sharing the great work. I have a question about the Ref-DAVIS evaluation. After running the evaluation script, I see that a global_results-val.csv file is generated for each annotator. How do you get the metrics for the whole dataset as reported in the paper? Do you average the numbers of the four annotators? Thank you!
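For reference, a small sketch of the averaging the question describes, reading each annotator's global_results-val.csv with pandas (the directory layout in the glob pattern is an assumption):

import glob
import pandas as pd

csv_paths = sorted(glob.glob('davis_dirs/resnet50/*/global_results-val.csv'))  # one file per annotator (path is an assumption)
tables = [pd.read_csv(p) for p in csv_paths]
mean_jf = sum(t['J&F-Mean'].iloc[0] for t in tables) / len(tables)  # 'J&F-Mean' is the davis2017-evaluation column name
print('average J&F over annotators:', mean_jf)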
