
mask2former's Introduction

Mask2Former: Masked-attention Mask Transformer for Universal Image Segmentation (CVPR 2022)

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

[arXiv] [Project] [BibTeX]


Features

  • A single architecture for panoptic, instance and semantic segmentation.
  • Support for major segmentation datasets: ADE20K, Cityscapes, COCO, Mapillary Vistas.

Updates

  • Add Google Colab demo.
  • Video instance segmentation is now supported! Please check our tech report for more details.

Installation

See installation instructions.

Getting Started

See Preparing Datasets for Mask2Former.

See Getting Started with Mask2Former.

Run our demo using Colab: Open In Colab

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo: Hugging Face Spaces

A Replicate web demo and Docker image are available here: Replicate

Advanced usage

See Advanced Usage of Mask2Former.

Model Zoo and Baselines

We provide a large set of baseline results and trained models available for download in the Mask2Former Model Zoo.

License

Shield: License: MIT

The majority of Mask2Former is licensed under an MIT License.

However, portions of the project are available under separate license terms: Swin-Transformer-Semantic-Segmentation is licensed under the MIT license, and Deformable-DETR is licensed under the Apache-2.0 License.

Citing Mask2Former

If you use Mask2Former in your research or wish to refer to the baseline results published in the Model Zoo, please use the following BibTeX entry.

@inproceedings{cheng2021mask2former,
  title={Masked-attention Mask Transformer for Universal Image Segmentation},
  author={Bowen Cheng and Ishan Misra and Alexander G. Schwing and Alexander Kirillov and Rohit Girdhar},
  booktitle={CVPR},
  year={2022}
}

If you find the code useful, please also consider the following BibTeX entry.

@inproceedings{cheng2021maskformer,
  title={Per-Pixel Classification is Not All You Need for Semantic Segmentation},
  author={Bowen Cheng and Alexander G. Schwing and Alexander Kirillov},
  booktitle={NeurIPS},
  year={2021}
}

Acknowledgement

Code is largely based on MaskFormer (https://github.com/facebookresearch/MaskFormer).

mask2former's People

Contributors

ak391, bowenc0221, ccoulombe, chenxwh, erik-sovereign, huliang2016, rohitgirdhar


mask2former's Issues

Why do we need to compile the CUDA kernel for MSDeformAttn?

Can you please explain why we need to compile the CUDA kernel for MSDeformAttn?
I see there is a Python file for it, so I don't understand why the compilation is needed.
Sorry, I am not very familiar with why it would not work if we just used the Python functions without compiling.
In what scenarios should one generally compile a CUDA kernel? I never paid attention to it before. :/

I would really appreciate it if you could explain the reasoning behind it.
Thanks a ton!
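
A brief note on the "why" (an editor's sketch, not an official answer): the Python file is only the autograd wrapper around a compiled MultiScaleDeformableAttention extension, so without building it the import fails outright (see the Colab error further down this page). The build step that error message points to is:

    cd mask2former/modeling/pixel_decoder/ops
    sh make.sh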

Is it possible to compare detection results with models based on bounding box AP?

Hi, first, thank you for your work; it works really well on my custom dataset, and the robustness to occlusion is impressive!

I have a question regarding object detection precision. Mask R-CNN reports both AP bb and AP segm. For the same dataset, the AP segm obtained with Mask2Former is better than the one from Mask R-CNN; however, the AP bb of Mask R-CNN is higher than the AP segm of Mask2Former.

My questions are:

  1. Does this mean detection precision is better with Mask R-CNN?
  2. Did you try deriving bounding boxes from the mask predictions, computing AP bb, and comparing it with other models? (See the sketch right after this list.)
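
On question 2: this can be prototyped without retraining by deriving a tight box from each predicted mask and scoring those boxes with the usual COCO bbox evaluator. A minimal sketch (plain NumPy, not part of the Mask2Former codebase; assumes the masks are already binary arrays):

    import numpy as np

    def mask_to_xyxy(mask):
        """Tight [x0, y0, x1, y1] box around a binary mask, or None if the mask is empty."""
        ys, xs = np.nonzero(mask)
        if xs.size == 0:
            return None
        return [float(xs.min()), float(ys.min()), float(xs.max()) + 1.0, float(ys.max()) + 1.0]

    # e.g. boxes = [mask_to_xyxy(m) for m in instances.pred_masks.cpu().numpy()]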

Request for training logs on COCO

Hi,

Could you please share the training logs for the models as well? It's common to share them now (DeiT does, for example), and it would help with debugging and reproducing the results. Even just for, say, the COCO R50 model.

Best,
Kartik

Question about the length of training time

Hello, thank you for sharing the source code so promptly.
However, I found that training is very slow on my server.
Could you please tell me how long training takes on 8 V100s for these configs?

configs/coco/instance-segmentation/maskformer2_R50_bs16_50ep.yaml
configs/coco/instance-segmentation/maskformer2_R101_bs16_50ep.yaml

How to install with pip?

I tried to run pip install git+https://github.com/facebookresearch/Mask2Former, but the terminal throws a bunch of errors (probably because the repo has no setup.py).
I really don't like using conda, so is there any way to install it with pip, or should I build it from source?
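
Not an official recipe, but since the repo has no setup.py, a pip-only workflow usually amounts to installing detectron2 and the listed dependencies by hand and then building the CUDA op in place. A rough sketch, assuming a working CUDA/PyTorch setup and that the repo's requirements.txt covers the rest:

    pip install 'git+https://github.com/facebookresearch/detectron2.git'
    git clone https://github.com/facebookresearch/Mask2Former.git
    cd Mask2Former
    pip install -r requirements.txt
    cd mask2former/modeling/pixel_decoder/ops && sh make.sh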

Issues training the ytvis_2019 model

Hi Bowen,

Thanks for your excellent work and code! I am retraining the video instance segmentation model on the YouTube VIS 2019 dataset. I managed to train the model, but the quantitative result on CodaLab turns out to be very low (only about 40).

The command I used for training is:

python3 train_net_video.py \
    --config-file configs/youtubevis_2019/swin/video_maskformer2_swin_base_IN21k_384_bs16_8ep.yaml \
    --num-gpus 8 \
    MODEL.WEIGHTS swin_base_patch4_window12_384_22k.pkl

The backbone weights were obtained from:

wget https://github.com/SwinTransformer/storage/releases/download/v1.0.0/swin_base_patch4_window12_384_22k.pth
python tools/convert-pretrained-swin-model-to-d2.py swin_base_patch4_window12_384_22k.pth swin_base_patch4_window12_384_22k.pkl

Then I submitted output/inference/results.json after training to CodaLab, but only got about 40. I also reran the evaluation using output/model_final.pth, and the results are almost the same.

The config files are untouched. I am able to reproduce the correct result using the pretrained model, so I assume my environment and dataset setup are fine. I also checked the TensorBoard output, and the loss curve looks good. Could you help me check whether there is anything wrong with my training process? Thanks!

Colab demo doesn't work

I would love to try out this model, but I am struggling with the installation. The Colab demo does not work either and gives the following error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/content/Mask2Former/mask2former/modeling/pixel_decoder/ops/functions/ms_deform_attn_func.py in <module>()
     21 try:
---> 22     import MultiScaleDeformableAttention as MSDA
     23 except ModuleNotFoundError as e:

ModuleNotFoundError: No module named 'MultiScaleDeformableAttention'

During handling of the above exception, another exception occurred:

ModuleNotFoundError                       Traceback (most recent call last)
7 frames
/content/Mask2Former/mask2former/modeling/pixel_decoder/ops/functions/ms_deform_attn_func.py in <module>()
     27         "\t`sh make.sh`\n"
     28     )
---> 29     raise ModuleNotFoundError(info_string)
     30 
     31 

ModuleNotFoundError: 

Please compile MultiScaleDeformableAttention CUDA op with the following commands:
	`cd mask2former/modeling/pixel_decoder/ops`
	`sh make.sh`


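
In other words, the notebook needs the CUDA op compiled before mask2former is imported. A sketch of the missing setup step, assuming the notebook has already cloned the repo into /content/Mask2Former (run these as shell commands in a cell):

    cd /content/Mask2Former/mask2former/modeling/pixel_decoder/ops
    sh make.sh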

How to map category_id to object label in panoptic_seg

I am running inference using the COCO config. How can I get the object label of each class, the way it is displayed on the visualized output image? Basically, I am looking for a mapping from category_id to label.
During inference I got category_ids greater than 91, so I assume the standard COCO mapping won't work.
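
A sketch of one way to do the mapping with detectron2's metadata (my reading of what the Visualizer does, not an official answer; cfg and predictions are assumed to come from the config and predictor you already have). The ids above 91 are expected: the model outputs contiguous training ids over the full set of panoptic classes (things + stuff), not the raw COCO json ids.

    from detectron2.data import MetadataCatalog

    # The metadata of the dataset the config was built for carries the class names.
    meta = MetadataCatalog.get(cfg.DATASETS.TEST[0])

    panoptic_seg, segments_info = predictions["panoptic_seg"]
    for seg in segments_info:
        cid = seg["category_id"]  # contiguous id, not the raw COCO category id
        name = meta.thing_classes[cid] if seg["isthing"] else meta.stuff_classes[cid]
        print(seg["id"], name)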

Question about the need of bounding boxes in the labels

Hi, thank you for sharing your work; it works great on my dataset.

I don't understand whether the model uses the ground-truth bounding boxes:
for example, if we were working with the COCO dataset, would it change anything in the training phase if we replaced every annotations[n].segments_info[m].bbox with [0,0,0,0]?

gpu information

Hi,
Do you have any information on GPU usage during inference?

Thank you

[Inference] AssertionError: Non-existent key: --output

Hello! Thank you for sharing.
I found an error in the inference code (demo.py):
it complains about a non-existent key, --output.

I modified the code to make saving work; as it stands, you have to modify the code yourself to save the output.

(detectron) hello96min@rvi-node001:~/minseok/Mask2Former/demo$ python3 demo.py --config-file /home/hello96min/minseok/Mask2Former/configs/coco/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_100ep.yaml --input ~/minseok/image-inpainting/datasets_dgm/sample_data/images/images_0.png --opts MODEL.WEIGHTS /home/hello96min/minseok/Mask2Former/configs/coco/panoptic-segmentation/model_final_f07440.pkl --output ./1.png
[12/14 22:37:14 detectron2]: Arguments: Namespace(confidence_threshold=0.5, config_file='/home/hello96min/minseok/Mask2Former/configs/coco/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_100ep.yaml', input=['/home/hello96min/minseok/image-inpainting/datasets_dgm/sample_data/images/images_0.png'], opts=['MODEL.WEIGHTS', '/home/hello96min/minseok/Mask2Former/configs/coco/panoptic-segmentation/model_final_f07440.pkl', '--output', './1.png'], output=None, video_input=None, webcam=False)
Traceback (most recent call last):
  File "demo.py", line 106, in <module>
    cfg = setup_cfg(args)
  File "demo.py", line 40, in setup_cfg
    cfg.merge_from_list(args.opts)
  File "/home/hello96min/yes/envs/detectron/lib/python3.8/site-packages/fvcore/common/config.py", line 143, in merge_from_list
    return super().merge_from_list(cfg_list)
  File "/home/hello96min/yes/envs/detectron/lib/python3.8/site-packages/yacs/config.py", line 243, in merge_from_list
    _assert_with_logging(subkey in d, "Non-existent key: {}".format(full_key))
  File "/home/hello96min/yes/envs/detectron/lib/python3.8/site-packages/yacs/config.py", line 545, in _assert_with_logging
    assert cond, msg
AssertionError: Non-existent key: --output
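
A note on the likely cause (my reading of the pasted Namespace, not an official fix): everything after --opts is forwarded to the yacs config merger, which is why '--output' shows up inside opts and triggers the non-existent-key assertion. Placing --output before --opts lets argparse pick it up (paths abbreviated here):

    python3 demo.py \
      --config-file configs/coco/panoptic-segmentation/swin/maskformer2_swin_large_IN21k_384_bs16_100ep.yaml \
      --input images_0.png \
      --output ./1.png \
      --opts MODEL.WEIGHTS model_final_f07440.pkl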

Issue reproduce COCO training

Hello all, I am quite confused by the entry "panoptic_{train,val}2017/ # png annotations" in the COCO folder structure. When I downloaded the COCO dataset, I couldn't find this folder. Could you please tell me how to get or generate it? I know there are panoptic annotations, but how exactly do I generate the folder? Thank you.
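
For what it's worth (my understanding, not an official answer): the panoptic PNGs are not part of the detection zips; they come from COCO's separate panoptic annotations archive, which contains the JSON files plus nested zips of PNGs. Roughly, with the exact zip layout to be double-checked:

    cd datasets/coco
    wget http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip
    unzip panoptic_annotations_trainval2017.zip      # annotations/panoptic_{train,val}2017.json + nested PNG zips
    unzip annotations/panoptic_train2017.zip         # -> panoptic_train2017/ (the PNG annotations)
    unzip annotations/panoptic_val2017.zip           # -> panoptic_val2017/

Any derived folders (e.g. the panoptic semantic-segmentation annotations) are then generated by the prepare scripts referenced in the datasets README.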

Question for input data

Hi, I really enjoyed reading Mask2Former paper.

Could 3D images be used for training if appropriate modifications were made to the code?

Regards,
Tae

Train on custom dataset for instance segmentation

The custom dataset only has one class, so I set MODEL.ROI_HEADS.NUM_CLASSES and MODEL.RETINANET.NUM_CLASSES both to 1. However, when I evaluate the trained model, this error occurs:

  File "train_net.py", line 411, in main
    res = Trainer.test(cfg, model)
  File "/home/chengzhi/PROGRAM/COW_GAME/detectron2/detectron2/engine/defaults.py", line 617, in test
    results_i = inference_on_dataset(model, data_loader, evaluator)
  File "/home/chengzhi/PROGRAM/COW_GAME/detectron2/detectron2/evaluation/evaluator.py", line 205, in inference_on_dataset
    results = evaluator.evaluate()
  File "/home/chengzhi/PROGRAM/COW_GAME/detectron2/detectron2/evaluation/coco_evaluation.py", line 206, in evaluate
    self._eval_predictions(predictions, img_ids=img_ids)
  File "/home/chengzhi/PROGRAM/COW_GAME/detectron2/detectron2/evaluation/coco_evaluation.py", line 241, in _eval_predictions
    f"A prediction has class={category_id}, "
AssertionError: A prediction has class=24, but the dataset only has 1 classes and predicted class id should be in [0, 0].
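
A possible cause (my reading of the configs, not a confirmed answer): Mask2Former does not use the ROI_HEADS/RETINANET heads at all; the number of predicted classes comes from MODEL.SEM_SEG_HEAD.NUM_CLASSES (the YouTubeVIS config further down this page sets it to 40, for example), so leaving it at the COCO default keeps 80 output classes. A sketch of overriding it, with placeholder dataset names:

    python train_net.py --num-gpus 1 \
      --config-file configs/coco/instance-segmentation/maskformer2_R50_bs16_50ep.yaml \
      MODEL.SEM_SEG_HEAD.NUM_CLASSES 1 \
      DATASETS.TRAIN '("my_dataset_train",)' DATASETS.TEST '("my_dataset_val",)'   # placeholder registered datasets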

About evaluating Mask2Former on YouTubeVIS-2021

The original data downloaded from the link is organized as:

ytvis_2021/
  {train/valid/test}/
    JPEGImages/
    instance.json

There is no Annotations folder, as in the structure you describe below:

ytvis_2021/
  {train,valid,test}.json
  {train,valid,test}/
    Annotations/
    JPEGImages/

How do you evaluate Mask2Former on YouTubeVIS-2021?

Scaling the lr when using a different sampling frame num

Hi,

Thank you for sharing such good work! I have a question regarding the loss: I found that if I use more frames during training, the loss goes very high. Do I need to scale down the lr linearly according to the sampling frame num? Thank you.
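
Not an authors' recommendation, just the generic heuristic an editor would try first: treat the extra sampled frames like a larger effective batch of mask targets, start from a proportionally reduced learning rate, and re-tune while watching the loss. Purely illustrative arithmetic:

    # Illustrative starting-point guess only -- not a rule from the paper:
    # keep BASE_LR * SAMPLING_FRAME_NUM roughly constant, then re-tune.
    base_lr = 1e-4      # SOLVER.BASE_LR of the reference config
    ref_frames = 2      # INPUT.SAMPLING_FRAME_NUM in the reference config
    new_frames = 5      # hypothetical longer clip
    scaled_lr = base_lr * ref_frames / new_frames
    print(f"starting-point guess: SOLVER.BASE_LR = {scaled_lr:.1e}")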

Error when training on the COCO dataset

I followed the instructions at https://github.com/facebookresearch/Mask2Former/blob/main/datasets/README.md and prepared the COCO datasets.

I have already run the demo successfully, but the error occurs when I run the training script:
python train_net.py --num-gpus 8 --config-file configs/coco/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml

The output is as follows:
[02/01 14:43:34 mask2former.data.dataset_mappers.coco_panoptic_new_baseline_dataset_mapper]: [COCOPanopticNewBaselineDatasetMapper] Full TransformGens used in training: [RandomFlip(), ResizeScale(min_scale=0.1, max_scale=2.0, target_height=1024, target_width=1024), FixedSizeCrop(crop_size=(1024, 1024))]
[02/01 14:43:41 d2.data.build]: Using training sampler TrainingSampler
[02/01 14:43:41 d2.data.common]: Serializing 118287 elements to byte tensors and concatenating them all ...
[02/01 14:43:42 d2.data.common]: Serialized dataset takes 78.29 MiB
[02/01 14:43:51 fvcore.common.checkpoint]: [Checkpointer] Loading from model_final_94dc52.pkl ...
[02/01 14:43:51 fvcore.common.checkpoint]: Reading a file from 'MaskFormer Model Zoo'
WARNING [02/01 14:43:51 mask2former.modeling.transformer_decoder.mask2former_transformer_decoder]: Weight format of MultiScaleMaskedTransformerDecoder have changed! Please upgrade your models. Applying automatic conversion now ...
[02/01 14:43:51 d2.engine.train_loop]: Starting training from iteration 0
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/conda/lib/python3.9/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/opt/conda/lib/python3.9/multiprocessing/spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
File "/home/xt.xie/workspace/code/Mask2Former-main/train_net.py", line 321, in
launch(
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/launch.py", line 67, in launch
mp.spawn(
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 4 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
main_func(*args)
File "/home/xt.xie/workspace/code/Mask2Former-main/train_net.py", line 315, in main
return trainer.train()
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 484, in train
super().train(self.start_iter, self.max_iter)
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
self.run_step()
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 395, in run_step
loss_dict = self.model(data)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xt.xie/workspace/code/Mask2Former-main/mask2former/maskformer_model.py", line 209, in forward
losses = self.criterion(outputs, targets)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/xt.xie/workspace/code/Mask2Former-main/mask2former/modeling/criterion.py", line 222, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/xt.xie/workspace/code/Mask2Former-main/mask2former/modeling/matcher.py", line 179, in forward
return self.memory_efficient_forward(outputs, targets)
File "/opt/conda/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/xt.xie/workspace/code/Mask2Former-main/mask2former/modeling/matcher.py", line 122, in memory_efficient_forward
tgt_mask = point_sample(
File "/home/xt.xie/.local/lib/python3.9/site-packages/detectron2/projects/point_rend/point_features.py", line 39, in point_sample
output = F.grid_sample(input, 2.0 * point_coords - 1.0, **kwargs)
File "/opt/conda/lib/python3.9/site-packages/torch/nn/functional.py", line 3836, in grid_sample
return torch.grid_sampler(input, grid, mode_enum, padding_mode_enum, align_corners)
RuntimeError: grid_sampler(): expected input and grid to have same dtype, but input has c10::Half and grid has float
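
One detail that stands out in the trace (an observation, not a confirmed fix): the crash happens inside the matcher's point_sample under mixed precision, where a half-precision input meets a float32 grid. A quick way to check whether AMP is the trigger is to rerun with it disabled through the existing config key:

    python train_net.py --num-gpus 8 \
      --config-file configs/coco/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml \
      SOLVER.AMP.ENABLED False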

The result of swin-small backbone on ADE

Hi,

I ran Mask2Former on ADE20K (maskformer2_swin_small_bs16_160k.yaml) with 4 16GB V100 GPUs. However, I can only achieve 49.6%, which is much worse than the reported result (51.3%). Could you provide the training log so I can analyze the gap?

Thanks

How to visualize the VIS results?

Hi,

Thanks for your wonderful work and repo.

Could you please provide the instructions on how to visualize the video instance segmentation results on images or videos? Thanks!

Can the model be converted to TorchScript?

Hi, can the model be converted to TorchScript?
I tried, but I got the following error:

RuntimeError: 
Could not export Python function call 'MSDeformAttnFunction'. Remove calls to Python functions before export. Did you forget to add @script or @script_method annotation? If this is a nn.ModuleList, add it to __constants__:
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/ops/modules/ms_deform_attn.py(117): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/msdeformattn.py(124): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/msdeformattn.py(159): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/msdeformattn.py(87): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/pixel_decoder/msdeformattn.py(324): forward_features
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/autocast_mode.py(198): decorate_autocast
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/meta_arch/mask_former_head.py(119): layers
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/modeling/meta_arch/mask_former_head.py(116): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/mask2former/maskformer_model.py(198): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/detectron2/detectron2/export/flatten.py(259): <lambda>
/home/ubuntu/PycharmProjects/mask2former/venv/detectron2/detectron2/export/flatten.py(294): forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/jit/_trace.py(965): trace_module
/home/ubuntu/PycharmProjects/mask2former/venv/lib/python3.6/site-packages/torch/jit/_trace.py(750): trace
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/toTorchScript.py(44): export_tracing
/home/ubuntu/PycharmProjects/mask2former/venv/Mask2Former/toTorchScript.py(115): <module>

Converting a PyTorch model to an ONNX model

Hi, thanks for your great work.
Currently I'm using the COCO maskformer2_swin_large_IN21k_384_bs16_100ep.yaml configuration with the pretrained model.
I'm trying to convert this model to ONNX format, but it gives me a segmentation fault.

Could you please share a converted model or explain how to do it?

Validation accuracy always 100?

Hello,

Thank you for this project and the code. I'm running a custom semantic segmentation training job (based on this config) with one class (a custom class 'AT'), and for some reason my validation accuracy after an epoch is always 100:

[12/21 21:20:26 d2.evaluation.evaluator]: Total inference time: 0:00:44.703675 (0.065935 s / iter per device, on 1 devices)
[12/21 21:20:26 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:37 (0.055430 s / iter per device, on 1 devices)
[12/21 21:20:26 d2.evaluation.sem_seg_evaluation]: OrderedDict([('sem_seg', {'mIoU': 100.0, 'fwIoU': 100.0, 'IoU-AT': 100.0, 'mACC': 100.0, 'pACC': 100.0, 'ACC-AT': 100.0})])
[12/21 21:20:26 d2.engine.defaults]: Evaluation results for ade20k_full_sem_seg_val in csv format:
[12/21 21:20:26 d2.evaluation.testing]: copypaste: Task: sem_seg
[12/21 21:20:26 d2.evaluation.testing]: copypaste: mIoU,fwIoU,mACC,pACC
[12/21 21:20:26 d2.evaluation.testing]: copypaste: 100.0000,100.0000,100.0000,100.0000

When I view the loss curves in TensorBoard, it seems like the model is learning, so I'm not sure what's going on:

(screenshot of the TensorBoard loss curves attached)

Here's the full config:

config.zip

Any ideas?

Thank you

error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device

using /data
Preparation done. Between equal marks is user's output:
/root/conda/bin/python
running build
running build_py
running build_ext
building 'MultiScaleDeformableAttention' extension
Emitting ninja build file /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
g++ -pthread -shared -B /root/conda/compiler_compat -L/root/conda/lib -Wl,-rpath=/root/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/vision.o /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.o /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.o -L/root/conda/lib/python3.7/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-3.7/MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so
running install
running bdist_egg
running egg_info
writing MultiScaleDeformableAttention.egg-info/PKG-INFO
writing dependency_links to MultiScaleDeformableAttention.egg-info/dependency_links.txt
writing top-level names to MultiScaleDeformableAttention.egg-info/top_level.txt
reading manifest file 'MultiScaleDeformableAttention.egg-info/SOURCES.txt'
writing manifest file 'MultiScaleDeformableAttention.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/functions
copying build/lib.linux-x86_64-3.7/functions/init.py -> build/bdist.linux-x86_64/egg/functions
copying build/lib.linux-x86_64-3.7/functions/ms_deform_attn_func.py -> build/bdist.linux-x86_64/egg/functions
copying build/lib.linux-x86_64-3.7/MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/modules
copying build/lib.linux-x86_64-3.7/modules/ms_deform_attn.py -> build/bdist.linux-x86_64/egg/modules
copying build/lib.linux-x86_64-3.7/modules/init.py -> build/bdist.linux-x86_64/egg/modules
byte-compiling build/bdist.linux-x86_64/egg/functions/init.py to init.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/functions/ms_deform_attn_func.py to ms_deform_attn_func.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/modules/ms_deform_attn.py to ms_deform_attn.cpython-37.pyc
byte-compiling build/bdist.linux-x86_64/egg/modules/init.py to init.cpython-37.pyc
creating stub loader for MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so
byte-compiling build/bdist.linux-x86_64/egg/MultiScaleDeformableAttention.py to MultiScaleDeformableAttention.cpython-37.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying MultiScaleDeformableAttention.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MultiScaleDeformableAttention.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MultiScaleDeformableAttention.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying MultiScaleDeformableAttention.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt
zip_safe flag not set; analyzing archive contents...
pycache.MultiScaleDeformableAttention.cpython-37: module references file
creating 'dist/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg
removing '/root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg' (and everything under it)
creating /root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg
Extracting MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg to /root/conda/lib/python3.7/site-packages
MultiScaleDeformableAttention 1.0 is already the active version in easy-install.pth

Installed /root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg
Processing dependencies for MultiScaleDeformableAttention==1.0
Finished processing dependencies for MultiScaleDeformableAttention==1.0
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
Command Line Args: Namespace(config_file='configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False)
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
[02/22 03:41:38 detectron2]: Rank of current process: 0. World size: 8
[02/22 03:41:40 detectron2]: Environment info:


sys.platform linux
Python 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
numpy 1.19.2
detectron2 0.6 @/root/conda/lib/python3.7/site-packages/detectron2
Compiler GCC 7.3
CUDA compiler CUDA 11.1
detectron2 arch flags 3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6
DETECTRON2_ENV_MODULE
PyTorch 1.9.0 @/root/conda/lib/python3.7/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0,1,2,3,4,5,6,7 GeForce RTX 3090 (arch=8.6)
Driver version 460.73.01
CUDA_HOME /usr/local/cuda
TORCH_CUDA_ARCH_LIST 6.0;6.1;6.2;7.0;7.5
Pillow 8.0.1
torchvision 0.10.0 @/root/conda/lib/python3.7/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20220212
iopath 0.1.9
cv2 4.1.2


PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

[02/22 03:41:40 detectron2]: Command line arguments: Namespace(config_file='configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False)
[02/22 03:41:40 detectron2]: Contents of args.config_file=configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml:
_BASE_: Base-YouTubeVIS-VideoInstanceSegmentation.yaml
MODEL:
  WEIGHTS: "model_final_3c8ec9.pkl"
  META_ARCHITECTURE: "VideoMaskFormer"
  SEM_SEG_HEAD:
    NAME: "MaskFormerHead"
    IGNORE_VALUE: 255
    NUM_CLASSES: 40
    LOSS_WEIGHT: 1.0
    CONVS_DIM: 256
    MASK_DIM: 256
    NORM: "GN"
    # pixel decoder
    PIXEL_DECODER_NAME: "MSDeformAttnPixelDecoder"
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
    DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: ["res3", "res4", "res5"]
    COMMON_STRIDE: 4
    TRANSFORMER_ENC_LAYERS: 6
  MASK_FORMER:
    TRANSFORMER_DECODER_NAME: "VideoMultiScaleMaskedTransformerDecoder"
    TRANSFORMER_IN_FEATURE: "multi_scale_pixel_decoder"
    DEEP_SUPERVISION: True
    NO_OBJECT_WEIGHT: 0.1
    CLASS_WEIGHT: 2.0
    MASK_WEIGHT: 5.0
    DICE_WEIGHT: 5.0
    HIDDEN_DIM: 256
    NUM_OBJECT_QUERIES: 100
    NHEADS: 8
    DROPOUT: 0.0
    DIM_FEEDFORWARD: 2048
    ENC_LAYERS: 0
    PRE_NORM: False
    ENFORCE_INPUT_PROJ: False
    SIZE_DIVISIBILITY: 32
    DEC_LAYERS: 10  # 9 decoder layers, add one for the loss on learnable query
    TRAIN_NUM_POINTS: 12544
    OVERSAMPLE_RATIO: 3.0
    IMPORTANCE_SAMPLE_RATIO: 0.75
    TEST:
      SEMANTIC_ON: False
      INSTANCE_ON: True
      PANOPTIC_ON: False
      OVERLAP_THRESHOLD: 0.8
      OBJECT_MASK_THRESHOLD: 0.8

[02/22 03:41:40 detectron2]: Running with full config:
CUDNN_BENCHMARK: false
DATALOADER:
ASPECT_RATIO_GROUPING: true
FILTER_EMPTY_ANNOTATIONS: false
NUM_WORKERS: 4
REPEAT_THRESHOLD: 0.0
SAMPLER_TRAIN: TrainingSampler
DATASETS:
PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
PROPOSAL_FILES_TEST: []
PROPOSAL_FILES_TRAIN: []
TEST:

  • ytvis_2019_val
    TRAIN:
  • ytvis_2019_train
    GLOBAL:
    HACK: 1.0
    INPUT:
    AUGMENTATIONS: []
    COLOR_AUG_SSD: false
    CROP:
    ENABLED: false
    SINGLE_CATEGORY_MAX_AREA: 1.0
    SIZE:
    • 600
    • 720
      TYPE: absolute_range
      DATASET_MAPPER_NAME: mask_former_semantic
      FORMAT: RGB
      IMAGE_SIZE: 1024
      MASK_FORMAT: polygon
      MAX_SCALE: 2.0
      MAX_SIZE_TEST: 1333
      MAX_SIZE_TRAIN: 1333
      MIN_SCALE: 0.1
      MIN_SIZE_TEST: 360
      MIN_SIZE_TRAIN:
  • 360
  • 480
    MIN_SIZE_TRAIN_SAMPLING: choice_by_clip
    RANDOM_FLIP: flip_by_clip
    SAMPLING_FRAME_NUM: 2
    SAMPLING_FRAME_RANGE: 20
    SAMPLING_FRAME_SHUFFLE: false
    SIZE_DIVISIBILITY: -1
    MODEL:
    ANCHOR_GENERATOR:
    ANGLES:
      • -90
      • 0
      • 90
        ASPECT_RATIOS:
      • 0.5
      • 1.0
      • 2.0
        NAME: DefaultAnchorGenerator
        OFFSET: 0.0
        SIZES:
      • 32
      • 64
      • 128
      • 256
      • 512
        BACKBONE:
        FREEZE_AT: 0
        NAME: build_resnet_backbone
        DEVICE: cuda
        FPN:
        FUSE_TYPE: sum
        IN_FEATURES: []
        NORM: ''
        OUT_CHANNELS: 256
        KEYPOINT_ON: false
        LOAD_PROPOSALS: false
        MASK_FORMER:
        CLASS_WEIGHT: 2.0
        DEC_LAYERS: 10
        DEEP_SUPERVISION: true
        DICE_WEIGHT: 5.0
        DIM_FEEDFORWARD: 2048
        DROPOUT: 0.0
        ENC_LAYERS: 0
        ENFORCE_INPUT_PROJ: false
        HIDDEN_DIM: 256
        IMPORTANCE_SAMPLE_RATIO: 0.75
        MASK_WEIGHT: 5.0
        NHEADS: 8
        NO_OBJECT_WEIGHT: 0.1
        NUM_OBJECT_QUERIES: 100
        OVERSAMPLE_RATIO: 3.0
        PRE_NORM: false
        SIZE_DIVISIBILITY: 32
        TEST:
        INSTANCE_ON: true
        OBJECT_MASK_THRESHOLD: 0.8
        OVERLAP_THRESHOLD: 0.8
        PANOPTIC_ON: false
        SEMANTIC_ON: false
        SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE: false
        TRAIN_NUM_POINTS: 12544
        TRANSFORMER_DECODER_NAME: VideoMultiScaleMaskedTransformerDecoder
        TRANSFORMER_IN_FEATURE: multi_scale_pixel_decoder
        MASK_ON: true
        META_ARCHITECTURE: VideoMaskFormer
        PANOPTIC_FPN:
        COMBINE:
        ENABLED: true
        INSTANCES_CONFIDENCE_THRESH: 0.5
        OVERLAP_THRESH: 0.5
        STUFF_AREA_LIMIT: 4096
        INSTANCE_LOSS_WEIGHT: 1.0
        PIXEL_MEAN:
  • 123.675
  • 116.28
  • 103.53
    PIXEL_STD:
  • 58.395
  • 57.12
  • 57.375
    PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
    RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
    • false
    • false
    • false
    • false
      DEPTH: 50
      NORM: FrozenBN
      NUM_GROUPS: 1
      OUT_FEATURES:
    • res2
    • res3
    • res4
    • res5
      RES2_OUT_CHANNELS: 256
      RES4_DILATION: 1
      RES5_DILATION: 1
      RES5_MULTI_GRID:
    • 1
    • 1
    • 1
      STEM_OUT_CHANNELS: 64
      STEM_TYPE: basic
      STRIDE_IN_1X1: false
      WIDTH_PER_GROUP: 64
      RETINANET:
      BBOX_REG_LOSS_TYPE: smooth_l1
      BBOX_REG_WEIGHTS: &id001
    • 1.0
    • 1.0
    • 1.0
    • 1.0
      FOCAL_LOSS_ALPHA: 0.25
      FOCAL_LOSS_GAMMA: 2.0
      IN_FEATURES:
    • p3
    • p4
    • p5
    • p6
    • p7
      IOU_LABELS:
    • 0
    • -1
    • 1
      IOU_THRESHOLDS:
    • 0.4
    • 0.5
      NMS_THRESH_TEST: 0.5
      NORM: ''
      NUM_CLASSES: 80
      NUM_CONVS: 4
      PRIOR_PROB: 0.01
      SCORE_THRESH_TEST: 0.05
      SMOOTH_L1_LOSS_BETA: 0.1
      TOPK_CANDIDATES_TEST: 1000
      ROI_BOX_CASCADE_HEAD:
      BBOX_REG_WEIGHTS:
      • 10.0
      • 10.0
      • 5.0
      • 5.0
      • 20.0
      • 20.0
      • 10.0
      • 10.0
      • 30.0
      • 30.0
      • 15.0
      • 15.0
        IOUS:
    • 0.5
    • 0.6
    • 0.7
      ROI_BOX_HEAD:
      BBOX_REG_LOSS_TYPE: smooth_l1
      BBOX_REG_LOSS_WEIGHT: 1.0
      BBOX_REG_WEIGHTS:
    • 10.0
    • 10.0
    • 5.0
    • 5.0
      CLS_AGNOSTIC_BBOX_REG: false
      CONV_DIM: 256
      FC_DIM: 1024
      NAME: ''
      NORM: ''
      NUM_CONV: 0
      NUM_FC: 0
      POOLER_RESOLUTION: 14
      POOLER_SAMPLING_RATIO: 0
      POOLER_TYPE: ROIAlignV2
      SMOOTH_L1_BETA: 0.0
      TRAIN_ON_PRED_BOXES: false
      ROI_HEADS:
      BATCH_SIZE_PER_IMAGE: 512
      IN_FEATURES:
    • res4
      IOU_LABELS:
    • 0
    • 1
      IOU_THRESHOLDS:
    • 0.5
      NAME: Res5ROIHeads
      NMS_THRESH_TEST: 0.5
      NUM_CLASSES: 80
      POSITIVE_FRACTION: 0.25
      PROPOSAL_APPEND_GT: true
      SCORE_THRESH_TEST: 0.05
      ROI_KEYPOINT_HEAD:
      CONV_DIMS:
    • 512
    • 512
    • 512
    • 512
    • 512
    • 512
    • 512
    • 512
      LOSS_WEIGHT: 1.0
      MIN_KEYPOINTS_PER_IMAGE: 1
      NAME: KRCNNConvDeconvUpsampleHead
      NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true
      NUM_KEYPOINTS: 17
      POOLER_RESOLUTION: 14
      POOLER_SAMPLING_RATIO: 0
      POOLER_TYPE: ROIAlignV2
      ROI_MASK_HEAD:
      CLS_AGNOSTIC_MASK: false
      CONV_DIM: 256
      NAME: MaskRCNNConvUpsampleHead
      NORM: ''
      NUM_CONV: 0
      POOLER_RESOLUTION: 14
      POOLER_SAMPLING_RATIO: 0
      POOLER_TYPE: ROIAlignV2
      RPN:
      BATCH_SIZE_PER_IMAGE: 256
      BBOX_REG_LOSS_TYPE: smooth_l1
      BBOX_REG_LOSS_WEIGHT: 1.0
      BBOX_REG_WEIGHTS: *id001
      BOUNDARY_THRESH: -1
      CONV_DIMS:
    • -1
      HEAD_NAME: StandardRPNHead
      IN_FEATURES:
    • res4
      IOU_LABELS:
    • 0
    • -1
    • 1
      IOU_THRESHOLDS:
    • 0.3
    • 0.7
      LOSS_WEIGHT: 1.0
      NMS_THRESH: 0.7
      POSITIVE_FRACTION: 0.5
      POST_NMS_TOPK_TEST: 1000
      POST_NMS_TOPK_TRAIN: 2000
      PRE_NMS_TOPK_TEST: 6000
      PRE_NMS_TOPK_TRAIN: 12000
      SMOOTH_L1_BETA: 0.0
      SEM_SEG_HEAD:
      ASPP_CHANNELS: 256
      ASPP_DILATIONS:
    • 6
    • 12
    • 18
      ASPP_DROPOUT: 0.1
      COMMON_STRIDE: 4
      CONVS_DIM: 256
      DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES:
    • res3
    • res4
    • res5
      DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS: 8
      DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS: 4
      IGNORE_VALUE: 255
      IN_FEATURES:
    • res2
    • res3
    • res4
    • res5
      LOSS_TYPE: hard_pixel_mining
      LOSS_WEIGHT: 1.0
      MASK_DIM: 256
      NAME: MaskFormerHead
      NORM: GN
      NUM_CLASSES: 40
      PIXEL_DECODER_NAME: MSDeformAttnPixelDecoder
      PROJECT_CHANNELS:
    • 48
      PROJECT_FEATURES:
    • res2
      TRANSFORMER_ENC_LAYERS: 6
      USE_DEPTHWISE_SEPARABLE_CONV: false
      SWIN:
      APE: false
      ATTN_DROP_RATE: 0.0
      DEPTHS:
    • 2
    • 2
    • 6
    • 2
      DROP_PATH_RATE: 0.3
      DROP_RATE: 0.0
      EMBED_DIM: 96
      MLP_RATIO: 4.0
      NUM_HEADS:
    • 3
    • 6
    • 12
    • 24
      OUT_FEATURES:
    • res2
    • res3
    • res4
    • res5
      PATCH_NORM: true
      PATCH_SIZE: 4
      PRETRAIN_IMG_SIZE: 224
      QKV_BIAS: true
      QK_SCALE: null
      USE_CHECKPOINT: false
      WINDOW_SIZE: 7
      WEIGHTS: /data/bolu.ldz/PRETRAINED_WEIGHTS/mask2former/model_final_3c8ec9.pkl
      OUTPUT_DIR: /summary
      SEED: -1
      SOLVER:
      AMP:
      ENABLED: true
      BACKBONE_MULTIPLIER: 0.1
      BASE_LR: 0.0001
      BIAS_LR_FACTOR: 1.0
      CHECKPOINT_PERIOD: 5000
      CLIP_GRADIENTS:
      CLIP_TYPE: full_model
      CLIP_VALUE: 0.01
      ENABLED: true
      NORM_TYPE: 2.0
      GAMMA: 0.1
      IMS_PER_BATCH: 16
      LR_SCHEDULER_NAME: WarmupMultiStepLR
      MAX_ITER: 6000
      MOMENTUM: 0.9
      NESTEROV: false
      OPTIMIZER: ADAMW
      POLY_LR_CONSTANT_ENDING: 0.0
      POLY_LR_POWER: 0.9
      REFERENCE_WORLD_SIZE: 0
      STEPS:
  • 4000
    WARMUP_FACTOR: 1.0
    WARMUP_ITERS: 10
    WARMUP_METHOD: linear
    WEIGHT_DECAY: 0.05
    WEIGHT_DECAY_BIAS: null
    WEIGHT_DECAY_EMBED: 0.0
    WEIGHT_DECAY_NORM: 0.0
    TEST:
    AUG:
    ENABLED: false
    FLIP: true
    MAX_SIZE: 4000
    MIN_SIZES:
    • 400
    • 500
    • 600
    • 700
    • 800
    • 900
    • 1000
    • 1100
    • 1200
      DETECTIONS_PER_IMAGE: 100
      EVAL_PERIOD: 0
      EXPECTED_RESULTS: []
      KEYPOINT_OKS_SIGMAS: []
      PRECISE_BN:
      ENABLED: false
      NUM_ITER: 200
      VERSION: 2
      VIS_PERIOD: 0

[02/22 03:41:40 detectron2]: Full config saved to /summary/config.yaml
[02/22 03:41:40 d2.utils.env]: Using a generated random seed 40230477
[02/22 03:41:45 d2.engine.defaults]: Model:
VideoMaskFormer(
(backbone): ResNet(
(stem): BasicStem(
(conv1): Conv2d(
3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
)
(res2): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv1): Conv2d(
64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv2): Conv2d(
64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=64, eps=1e-05)
)
(conv3): Conv2d(
64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
)
)
(res3): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv1): Conv2d(
256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv2): Conv2d(
128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=128, eps=1e-05)
)
(conv3): Conv2d(
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
)
)
(res4): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
(conv1): Conv2d(
512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(3): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(4): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
(5): BottleneckBlock(
(conv1): Conv2d(
1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv2): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=256, eps=1e-05)
)
(conv3): Conv2d(
256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05)
)
)
)
(res5): Sequential(
(0): BottleneckBlock(
(shortcut): Conv2d(
1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
(conv1): Conv2d(
1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(1): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
(2): BottleneckBlock(
(conv1): Conv2d(
2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv2): Conv2d(
512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=512, eps=1e-05)
)
(conv3): Conv2d(
512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05)
)
)
)
)
(sem_seg_head): MaskFormerHead(
(pixel_decoder): MSDeformAttnPixelDecoder(
(input_proj): ModuleList(
(0): Sequential(
(0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(1): Sequential(
(0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(2): Sequential(
(0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1))
(1): GroupNorm(32, 256, eps=1e-05, affine=True)
)
)
(transformer): MSDeformAttnTransformerEncoderOnly(
(encoder): MSDeformAttnTransformerEncoder(
(layers): ModuleList(
(0): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(2): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(3): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(4): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(5): MSDeformAttnTransformerEncoderLayer(
(self_attn): MSDeformAttn(
(sampling_offsets): Linear(in_features=256, out_features=192, bias=True)
(attention_weights): Linear(in_features=256, out_features=96, bias=True)
(value_proj): Linear(in_features=256, out_features=256, bias=True)
(output_proj): Linear(in_features=256, out_features=256, bias=True)
)
(dropout1): Dropout(p=0.0, inplace=False)
(norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(linear1): Linear(in_features=256, out_features=1024, bias=True)
(dropout2): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=1024, out_features=256, bias=True)
(dropout3): Dropout(p=0.0, inplace=False)
(norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
)
)
(pe_layer): Positional encoding PositionEmbeddingSine
num_pos_feats: 128
temperature: 10000
normalize: True
scale: 6.283185307179586
(mask_features): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(adapter_1): Conv2d(
256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False
(norm): GroupNorm(32, 256, eps=1e-05, affine=True)
)
(layer_1): Conv2d(
256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False
(norm): GroupNorm(32, 256, eps=1e-05, affine=True)
)
)
(predictor): VideoMultiScaleMaskedTransformerDecoder(
(pe_layer): PositionEmbeddingSine3D()
(transformer_self_attention_layers): ModuleList(
(0): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(1): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(2): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(3): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(4): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(5): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(6): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(7): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(8): SelfAttentionLayer(
(self_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
)
(transformer_cross_attention_layers): ModuleList(
(0): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(1): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(2): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(3): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(4): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(5): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(6): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(7): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
(8): CrossAttentionLayer(
(multihead_attn): MultiheadAttention(
(out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True)
)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(dropout): Dropout(p=0.0, inplace=False)
)
)
(transformer_ffn_layers): ModuleList(
(0): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(1): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(2): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(3): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(4): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(5): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(6): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(7): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
(8): FFNLayer(
(linear1): Linear(in_features=256, out_features=2048, bias=True)
(dropout): Dropout(p=0.0, inplace=False)
(linear2): Linear(in_features=2048, out_features=256, bias=True)
(norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
)
)
(decoder_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
(query_feat): Embedding(100, 256)
(query_embed): Embedding(100, 256)
(level_embed): Embedding(3, 256)
(input_proj): ModuleList(
(0): Sequential()
(1): Sequential()
(2): Sequential()
)
(class_embed): Linear(in_features=256, out_features=41, bias=True)
(mask_embed): MLP(
(layers): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=256, bias=True)
(2): Linear(in_features=256, out_features=256, bias=True)
)
)
)
)
(criterion): Criterion VideoSetCriterion
matcher: Matcher VideoHungarianMatcher
cost_class: 2.0
cost_mask: 5.0
cost_dice: 5.0
losses: ['labels', 'masks']
weight_dict: {'loss_ce': 2.0, 'loss_mask': 5.0, 'loss_dice': 5.0, 'loss_ce_0': 2.0, 'loss_mask_0': 5.0, 'loss_dice_0': 5.0, 'loss_ce_1': 2.0, 'loss_mask_1': 5.0, 'loss_dice_1': 5.0, 'loss_ce_2': 2.0, 'loss_mask_2': 5.0, 'loss_dice_2': 5.0, 'loss_ce_3': 2.0, 'loss_mask_3': 5.0, 'loss_dice_3': 5.0, 'loss_ce_4': 2.0, 'loss_mask_4': 5.0, 'loss_dice_4': 5.0, 'loss_ce_5': 2.0, 'loss_mask_5': 5.0, 'loss_dice_5': 5.0, 'loss_ce_6': 2.0, 'loss_mask_6': 5.0, 'loss_dice_6': 5.0, 'loss_ce_7': 2.0, 'loss_mask_7': 5.0, 'loss_dice_7': 5.0, 'loss_ce_8': 2.0, 'loss_mask_8': 5.0, 'loss_dice_8': 5.0}
num_classes: 40
eos_coef: 0.1
num_points: 12544
oversample_ratio: 3.0
importance_sample_ratio: 0.75
)
[02/22 03:41:45 mask2former_video.data_video.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(360, 480), max_size=1333, sample_style='choice_by_clip', clip_frame_cnt=2), RandomFlip(clip_frame_cnt=2)]
[02/22 03:41:57 mask2former_video.data_video.datasets.ytvis]: Loading /data/bolu.ldz/DATASET/YoutubeVOS2019/train.json takes 12.59 seconds.
[02/22 03:41:57 mask2former_video.data_video.datasets.ytvis]: Loaded 2238 videos in YTVIS format from /data/bolu.ldz/DATASET/YoutubeVOS2019/train.json
[02/22 03:42:05 mask2former_video.data_video.build]: Using training sampler TrainingSampler
[02/22 03:42:19 d2.data.common]: Serializing 2238 elements to byte tensors and concatenating them all ...
[02/22 03:42:19 d2.data.common]: Serialized dataset takes 151.32 MiB
[02/22 03:42:20 fvcore.common.checkpoint]: [Checkpointer] Loading from /data/bolu.ldz/PRETRAINED_WEIGHTS/mask2former/model_final_3c8ec9.pkl ...
[02/22 03:42:22 fvcore.common.checkpoint]: Reading a file from 'MaskFormer Model Zoo'
WARNING [02/22 03:42:22 mask2former_video.modeling.transformer_decoder.video_mask2former_transformer_decoder]: Weight format of VideoMultiScaleMaskedTransformerDecoder have changed! Please upgrade your models. Applying automatic conversion now ...
WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'sem_seg_head.predictor.class_embed.weight' to the model due to incompatible shapes: (81, 256) in the checkpoint but (41, 256) in the model! You might want to double check if this is expected.
WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'sem_seg_head.predictor.class_embed.bias' to the model due to incompatible shapes: (81,) in the checkpoint but (41,) in the model! You might want to double check if this is expected.
WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'criterion.empty_weight' to the model due to incompatible shapes: (81,) in the checkpoint but (41,) in the model! You might want to double check if this is expected.
WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint:
criterion.empty_weight
sem_seg_head.predictor.class_embed.{bias, weight}
[02/22 03:42:22 d2.engine.train_loop]: Starting training from iteration 0
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
run on: autodrive
DETECTRON2_DATASETS: /data/bolu.ldz/DATASET
error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device
error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device
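One common cause of "no kernel image is available for execution on the device" is that the compiled MSDeformAttn op (or the PyTorch build itself) does not cover the compute capability of the GPU being used. A quick diagnostic sketch, using only standard PyTorch calls:

import torch

# Check whether the current GPU's compute capability is among the
# architectures this PyTorch build targets. If it is not, the MSDeformAttn
# extension compiled against this build is unlikely to run either, and
# rebuilding with TORCH_CUDA_ARCH_LIST set to the GPU's capability is the
# usual fix.
major, minor = torch.cuda.get_device_capability(0)   # e.g. (7, 0) for V100
print("GPU compute capability:", f"sm_{major}{minor}")
print("Architectures in this PyTorch build:", torch.cuda.get_arch_list())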

Question about the model training

Hi, thank you for your excellent work. I ran into a problem when re-running your experiments.

When I re-train the instance segmentation model with R-50 on the COCO dataset, the results are:
43.5, 23.0, 47.0, 65.1
43.2, 22.7, 46.4, 64.8
which are a bit lower than your reported numbers:
43.7, 23.4, 47.2, 64.8

I use the standard configuration file without any modification and run on 4/8 V100 cards. Is this kind of variance expected, or did you observe the same gap during training?

Does MaskFormer support multi-scale testing for VIS task?

Thanks for your excellent work! I wonder whether you have tried test-time tricks such as multi-scale testing, which may boost the final performance. Does the implementation of Mask2Former in this repo support multi-scale testing on ytvis 2019/2021?
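As far as I can tell, the repo does not ship a multi-scale testing utility. For illustration, here is a generic multi-scale test-time averaging sketch over per-pixel logits; the model call and its (1, C, H, W) output shape are assumptions here, not this repo's API:

import torch
import torch.nn.functional as F

def multi_scale_inference(model, image, scales=(0.75, 1.0, 1.25)):
    # image: (1, 3, H, W) float tensor; model is assumed to return per-pixel class logits
    _, _, h, w = image.shape
    summed = None
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
        with torch.no_grad():
            logits = model(scaled)
        # resize logits back to the original resolution before averaging
        logits = F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
        summed = logits if summed is None else summed + logits
    return summed / len(scales)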

About ytvis_2021 dataset

Hi, I am following up on your work on video instance segmentation and trying to run experiments on the ytvis_2021 dataset. The original data downloaded from the link is organized as:

{train/valid/test}/
    JPEGImages/
    instance.json

How should I convert it to the structure you use here? I copied instance.json to train/valid/test.json; evaluation runs correctly, but there are file-not-found errors during training. It looks like some videos listed in train/instance.json are not included in train/JPEGImages/. What should I do?

Thanks a lot!
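For reference, a minimal sketch that lists annotated videos whose frames are missing on disk, assuming the standard YTVIS JSON fields "videos" and "file_names"; adjust the file name and paths to match your copy of the dataset:

import json
import os

root = "train"  # dataset split directory, as in the layout above
with open(os.path.join(root, "instance.json")) as f:
    data = json.load(f)

missing = []
for video in data["videos"]:
    # each "file_names" entry is a path like "<video_id>/00000.jpg"
    first_frame = os.path.join(root, "JPEGImages", video["file_names"][0])
    if not os.path.exists(first_frame):
        missing.append(video["file_names"][0].split("/")[0])

print(f"{len(missing)} of {len(data['videos'])} annotated videos have no frames on disk")
print(missing[:10])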

A training problem about Global alloc not supported yet

I created a new environment for Mask2Former following the installation steps. Training on the COCO dataset works normally, but when I train on my own dataset I encounter the following problem.

[Screenshot attached: 2021-12-09 11-50-45]

I've been searching Google for a solution for a long time, so I'd like to ask whether you have run into a similar problem. Thank you very much for your reply.

Training of Swin-L for video instance segmentation

Hi Bowen,
I am working on a cluster with 8 V100 (32 GB) GPUs.
When I use this config for training, it still runs out of memory.

python scripts/train_net_video.py \
  --num-gpus 8 \
  --config-file configs/youtubevis_2021/swin/video_maskformer2_swin_large_IN21k_384_bs16_8ep.yaml
RuntimeError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 4; 31.75 GiB total capacity; 28.95 GiB already allocated; 11.75 MiB free; 30.10 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I would appreciate it if you could provide more information about the GPUs used for training.
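For debugging, it may help to log peak memory around a single training iteration to see how far beyond 32 GiB this configuration actually goes; a small sketch using standard PyTorch counters (where to place it in the training loop is up to you):

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one forward/backward step of the training loop here ...
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**30:.2f} GiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 2**30:.2f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))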

CUDA out of memory

When I run demo_video/demo.py to run inference on my video, it shows "CUDA out of memory". I tried reducing the input size, but it didn't help. Can you tell me how to solve this problem? Thanks!
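If lowering the configured input size is not enough, one workaround is to shrink each frame before it reaches the predictor and to keep inference under torch.no_grad(); a sketch with a hypothetical predictor callable (not the demo's exact API):

import cv2
import torch

def run_downscaled(predictor, frame, max_short_edge=480):
    # resize so the shorter image side is at most max_short_edge pixels
    h, w = frame.shape[:2]
    scale = max_short_edge / min(h, w)
    if scale < 1.0:
        frame = cv2.resize(frame, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_LINEAR)
    with torch.no_grad():
        return predictor(frame)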

A question regarding pixel decoder with Fapn

What confuses me is the pixel decoder: the paper briefly mentions in one of the experiments that "Swin-L-FaPN uses FaPN as pixel decoder". I would really like to know exactly how you use FaPN as the pixel decoder, since FaPN is itself a complete model. Did you incorporate the FaPN components (FeatureAlign and FeatureSelectionModule) into the pixel decoder of Mask2Former for that experiment?
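For context, here is a minimal channel-attention block written in the spirit of FaPN's FeatureSelectionModule as I read the paper; it is an illustration of how such a component could sit in front of each lateral connection of the pixel decoder, not the official implementation, and the names are mine:

import torch.nn as nn

class FeatureSelection(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # per-channel importance from globally pooled features
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_channels, in_channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # project to the decoder width after reweighting
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        x = x + x * self.attn(x)   # reweight channels, keep a residual path
        return self.proj(x)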

error in ms_deformable_im2col_cuda: invalid device function

Hi,

I successfully followed the installation instructions in INSTALL.md, namely:

conda create --name mask2former python=3.8 -y
conda activate mask2former
conda install pytorch==1.9.0 torchvision==0.10.0 cudatoolkit=11.1 -c pytorch -c nvidia
pip install -U opencv-python

# under your working directory
git clone [email protected]:facebookresearch/detectron2.git
cd detectron2
pip install -e .
pip install git+https://github.com/cocodataset/panopticapi.git
pip install git+https://github.com/mcordts/cityscapesScripts.git

cd ..
git clone [email protected]:facebookresearch/Mask2Former.git
cd Mask2Former
pip install -r requirements.txt
cd mask2former/modeling/pixel_decoder/ops
sh make.sh

However, when running the demo I get the following:

[02/23 09:54:12 detectron2]: Arguments: Namespace(confidence_threshold=0.5, config_file='../configs/coco/panoptic-segmentation/maskformer2_R50_bs16_50ep.yaml', input=['/home/weber/Pictures/man.png'], opts=['MODEL.WEIGHTS', '/media/weber/Ubuntu2/ubuntu2/Human_Pose/code-from-source/Mask2Former/model_final_94dc52.pkl'], output=None, video_input=None, webcam=False)
[02/23 09:54:14 fvcore.common.checkpoint]: [Checkpointer] Loading from /media/weber/Ubuntu2/ubuntu2/Human_Pose/code-from-source/Mask2Former/model_final_94dc52.pkl ...
[02/23 09:54:16 fvcore.common.checkpoint]: Reading a file from 'MaskFormer Model Zoo'
Weight format of MultiScaleMaskedTransformerDecoder have changed! Please upgrade your models. Applying automatic conversion now ...
/mnt/c7dd8318-a1d3-4622-a5fb-3fc2d8819579/CORSMAL/envs/detectron2/lib/python3.8/site-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448278899/work/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/mnt/c7dd8318-a1d3-4622-a5fb-3fc2d8819579/CORSMAL/envs/detectron2/lib/python3.8/site-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448278899/work/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
error in ms_deformable_im2col_cuda: invalid device function
[02/23 09:54:17 detectron2]: /home/weber/Pictures/man.png: detected 56 instances in 1.09s

From a web search, it seems that this error occurs when the wrong CUDA version is installed. However, I installed cudatoolkit 11.1 exactly as in the procedure above. What else could be the issue?

FYI: the demo runs fine if I run it on the CPU (using MODEL.DEVICE cpu).
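An "invalid device function" from the compiled op, while the CPU path works, usually means the MSDeformAttn extension was built for a compute capability that does not match the local GPU. A hedged rebuild sketch (TORCH_CUDA_ARCH_LIST is honored by torch.utils.cpp_extension during the build; paths are relative to the Mask2Former checkout):

import os
import subprocess
import torch

# pin the build to the local GPU's architecture, e.g. "7.5" for an RTX 20xx card
major, minor = torch.cuda.get_device_capability(0)
os.environ["TORCH_CUDA_ARCH_LIST"] = f"{major}.{minor}"

# remove any previously built artifacts (e.g. the build/ folder) beforehand
# so the op is actually recompiled, then re-run the provided build script
subprocess.run(["sh", "make.sh"], check=True, cwd="mask2former/modeling/pixel_decoder/ops")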

Mapillary instance annotation?

Hi,

Thanks for your wonderful repo. I followed the dataset preparation steps, but it seems that datasets/prepare_mapillary_vistas_ins_seg.py is not provided. Could you please check it out?

Question about Bounding Box

Thanks for your great work!

I added a bounding box head to the Mask2Former model, as in DETR
(in parallel with the mask prediction).

But the performance is not good.
Do you think the Mask2Former architecture is not well suited to predicting bounding boxes?
If you have any idea or intuition, please share it.

Thanks a lot!

Question about Bounding Box

Thanks for your great work!

To make the model more general, I think it could infer bounding boxes too.
So I have two questions about your work.

  1. Is there any reason why you didn't add a module for inferring bounding boxes?
  2. I think adding a box_embed module after the decoder, like class_embed, would enable bounding box prediction (see the sketch after this list). Have you tried it? Or did you try it and find it didn't work?
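For concreteness, a DETR-style box head is just a small MLP applied to the decoder query embeddings; a sketch (module and variable names are illustrative, not the repo's API):

import torch
import torch.nn as nn

class BoxEmbed(nn.Module):
    # 3-layer MLP predicting normalized (cx, cy, w, h) boxes per query
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(hidden_dim, hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Linear(hidden_dim, 4),
        ])

    def forward(self, x):
        # x: decoder output of shape (..., hidden_dim), e.g. (queries, batch, 256)
        for i, layer in enumerate(self.layers):
            x = layer(x) if i == len(self.layers) - 1 else torch.relu(layer(x))
        return x.sigmoid()   # normalized cxcywh, as in DETR

# usage (shapes only): boxes = BoxEmbed()(decoder_output)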

Thanks a lot!

Train on custom dataset for panoptic segmentation

I've been trying to use it for a nuclei panoptic segmentation task.
The dataset is prepared in the same way as the ADE20K panoptic dataset.
However, during evaluation the model doesn't propose any instances, even after training for a while.

File "/home/---/anaconda3/envs/mask2former/lib/python3.8/site-packages/panopticapi/evaluation.py", line 224, in pq_compute
results[name], per_class_results = pq_stat.pq_average(categories, isthing=isthing)
File "/home/---/anaconda3/envs/mask2former/lib/python3.8/site-packages/panopticapi/evaluation.py", line 73, in pq_average
return {'pq': pq / n, 'sq': sq / n, 'rq': rq / n, 'n': n}, per_class_results
ZeroDivisionError: division by zero

I assume there are several possible reasons:

  • Dataset not prepared correctly: Are the semantic and instance label image folders required for panoptic training? My labeled data is not in Detectron2 format, but I referred to prepare_ade20k_sem_seg, prepare_ade20k_ins_seg and prepare_ade20k_pan_seg, converted the labels to panoptic images (in a folder) plus a label JSON file, and commented out the line "sem_seg_file_name": sem_label_file, in dataset_dict. (A sanity-check sketch for the panoptic JSON follows this list.)
  • Configuration not adjusted properly: Another possibility is that the model has not converged. Is there any configuration analogous to Mask R-CNN's anchor sizes or ratios for panoptic segmentation? Nuclei in whole-slide images (I crop patches of size 256*256, with each nucleus around (8~16)*(8~16) pixels) are much smaller than typical objects in a natural image captured by a camera.
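Regarding the first point, a quick sanity check of the generated panoptic JSON can rule out an empty "thing" split, which is one way pq_average ends up dividing by zero; a sketch assuming the COCO-panoptic JSON layout (the file path is a placeholder):

import json
from collections import Counter

with open("annotations/panoptic_train.json") as f:   # hypothetical path
    pan = json.load(f)

categories = {c["id"]: c for c in pan["categories"]}
thing_ids = {cid for cid, c in categories.items() if c.get("isthing") == 1}

seg_counts = Counter()
for ann in pan["annotations"]:
    for seg in ann["segments_info"]:
        # every segment must reference a declared category
        assert seg["category_id"] in categories, seg
        seg_counts["thing" if seg["category_id"] in thing_ids else "stuff"] += 1

print(seg_counts)   # if "thing" is 0, the instance PQ average divides by zero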

Ade20k Panoptic Segmentation demo problem

Hi,

I have a problem trying to use the demo with Ade20k Panoptic Segmentation. The command used is:

python demo.py --config-file ../configs/ade20k/panoptic-segmentation/maskformer2_R50_bs16_160k.yaml \
  --video-input ... \
  --output ... \
  --opts MODEL.WEIGHTS ../models/model_final_5c90d4.pkl

And the stack trace is:

  File "demo.py", line 182, in <module>
    for vis_frame in tqdm.tqdm(demo.run_on_video(video), total=num_frames):
  File "/home/master/Develop/Mask2Former/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/home/master/Develop/Mask2Former/demo/predictor.py", line 130, in run_on_video
    yield process_predictions(frame, self.predictor(frame))
  File "/home/master/Develop/Mask2Former/demo/predictor.py", line 94, in process_predictions
    vis_frame = video_visualizer.draw_panoptic_seg_predictions(
  File "/home/master/Develop/Mask2Former/lib/python3.8/site-packages/detectron2/utils/video_visualizer.py", line 172, in draw_panoptic_seg_predictions
    labels = [self.metadata.thing_classes[k] for k in category_ids]
  File "/home/master/Develop/Mask2Former/lib/python3.8/site-packages/detectron2/utils/video_visualizer.py", line 172, in <listcomp>
    labels = [self.metadata.thing_classes[k] for k in category_ids]
IndexError: list index out of range

I think I have found the source of the problem in the lines

    thing_classes = [k["name"] for k in ADE20K_150_CATEGORIES if k["isthing"] == 1]
    thing_colors = [k["color"] for k in ADE20K_150_CATEGORIES if k["isthing"] == 1]

of mask2former/data/datasets/register_ade20k_panoptic.py

From my understanding, it happens because Detectron2 uses the category id as a list index, but these lines filter out some entries, so the index-to-id mapping is lost.

Changing the lines to

    thing_classes = [k["name"] for k in ADE20K_150_CATEGORIES]
    thing_colors = [k["color"] for k in ADE20K_150_CATEGORIES]

seems to work, but I don't know if there are any undesired consequences.

I installed Detectron2 with pip, but the line where the error happens also appears in the git version.

Implementation of mask2former for vis

Hi,

Thank you for sharing such good work! I have a simple question about the implementation of Mask2Former for VIS. You mentioned in the report that you use T=2 during training. Did you keep the same setting at inference time? Is there an IoU-based tracker to keep instance ids consistent, as in ViP-DeepLab?


Got the answer; I didn't read the report carefully enough.
