[CVPR 2022 Oral] Official implementation of DN-DETR

License: Apache License 2.0

Shell 0.27% Python 78.59% C++ 1.91% Cuda 19.23%

object-detection detr


DN-DETR: Accelerate DETR Training by Introducing Query DeNoising

By Feng Li*, Hao Zhang*, Shilong Liu, Jian Guo, Lionel M.Ni, and Lei Zhang.

This repository is the official implementation of DN-DETR, accepted to CVPR 2022 (score 112, Oral presentation). Code is available now. [CVPR paper link] [extended version paper link] [Chinese-language explainer]

News

[2022/12]: We released an extended version of DN-DETR on arXiv; here is the paper link. We add denoising training to the CNN-based model Faster R-CNN, the segmentation model Mask2Former, and other DETR-like models such as Anchor DETR and DETR, improving the performance of these models.

[2022/12]: Code for Mask DINO is available! Mask DINO achieves 51.7 and 59.0 box AP on COCO with ResNet-50 and SwinL backbones without extra detection data, outperforming DINO under the same setting!

[2022/11]: A DINO implementation based on DN-DETR is released in this repo. Credits to @Vallum! This optimized version with ResNet-50 can reach 50.8 ~ 51.0 AP in 36 epochs.

[2022/9]: We released detrex, a toolbox that provides many state-of-the-art Transformer-based detection algorithms. It includes DN-DETR with better performance. You are welcome to use it!

[2022/7]: Code for DINO is available here!

[2022/6]: We released Mask DINO, a unified detection and segmentation model that achieves the best results on all three segmentation tasks (54.5 AP on the COCO instance leaderboard, 59.4 PQ on the COCO panoptic leaderboard, and 60.8 mIoU on the ADE20K semantic leaderboard)! Code will be available here.

[2022/5]: Our code is available! Better performance of 49.5 AP on COCO is achieved with ResNet-50.

[2022/4]: Code for DAB-DETR is available here.

[2022/3]: We built the repo Awesome Detection Transformer to collect papers on transformers for detection and segmentation. You are welcome to take a look!

[2022/3]: DN-DETR is selected for an Oral presentation at CVPR 2022.

[2022/3]: We released another work, DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection, which for the first time establishes a DETR-like model as a SOTA model on the leaderboard. It is also based on DN. Code will be available here.

Introduction

We present a novel denoising training method to speed up DETR training and offer a deeper understanding of the slow convergence issue of DETR-like methods.
DN is only a training method and can be plugged into many DETR-like models, or even traditional models, to boost performance.
DN-DETR achieves 43.4 AP and 48.6 AP with 12 and 50 epochs of training with a ResNet-50 backbone. Compared with the baseline models under the same setting, DN-DETR achieves comparable performance with 50% of the training epochs.
Our optimized models achieve better performance. DN-Deformable-DETR achieves 49.5 AP with a ResNet-50 backbone.

Model

We build upon DAB-DETR and add a denoising part to accelerate training convergence. It adds only minimal computation and is removed at inference time. We conduct extensive experiments to validate the effectiveness of our denoising training, for example, the convergence curve comparison. Refer to our paper for more experimental results.

Model Zoo

We provide our models under DAB-DETR, DAB-Deformable-DETR(deformable encoder only), and DAB-Deformable-DETR (See DAB-DETR code and paper for more details).

You can also refer to our [model zoo in Google Drive] or [model zoo in Baidu Netdisk] (extraction code: niet).

50 epoch setting

	name	backbone	box AP	Log/Config/Checkpoint	Where in Our Paper
0	DN-DETR-R50	R50	44.4¹	Google Drive / BaiDu	Table 1
2	DN-DETR-R50-DC5	R50	46.3	Google Drive / BaiDu	Table 1
5	DN-DAB-Deformable-DETR (Deformable Encoder Only)³	R50	48.6	Google Drive / BaiDu	Table 3
6	DN-DAB-Deformable-DETR-R50-v2⁴	R50	49.5 (48.4 in 24 epochs)	Google Drive / BaiDu	Optimized implementation with deformable attention in both encoder and decoder. See DAB-DETR for more details.

12 epoch setting

	name	backbone	box AP	Log/Config/Checkpoint	Where in Our Paper
1	DN-DAB-DETR-R50-DC5(3 pat)²	R50	41.7	Google Drive / BaiDu	Table 2
4	DN-DAB-DETR-R101-DC5(3 pat)²	R101	42.8	Google Drive / BaiDu	Table 2
5	DN-DAB-Deformable-DETR (Deformable Encoder Only)³	R50	43.4	Google Drive / BaiDu	Table 2
5	DN-DAB-Deformable-DETR (Deformable Encoder Only)³	R101	44.1	Google Drive / BaiDu	Table 2

Notes:

¹: The result is higher than the one reported in our paper (44.4 vs. 44.1) because we optimized the code. We did not rerun the other models, so you can expect better performance than the numbers reported in our paper.
²: Models marked (3 pat) are trained with multiple pattern embeddings (refer to Anchor DETR or DAB-DETR for more details).
³: This model is based on DAB-Deformable-DETR (Deformable Encoder Only), a multiscale version of DAB-DETR. It requires 16 GPUs to train because it uses deformable attention only in the encoder.
⁴: This model is based on DAB-Deformable-DETR, an optimized implementation built on Deformable DETR. See DAB-DETR for more details. You are encouraged to use this deformable version, as it uses deformable attention in both encoder and decoder, which is more lightweight (it trains on 4 or 8 A100 GPUs) and converges faster (it reaches 48.4 in 24 epochs, comparable to the 50-epoch DAB-Deformable-DETR).

Usage

How to use denoising training in your own model

Our code largely follows DAB-DETR and adds components for denoising training, which are wrapped in the file dn_components.py. There are three main functions: prepare_for_dn and dn_post_process (these two are used in your detection forward function to process the dn part), and compute_dn_loss (used to calculate the dn loss). You can import these functions and add them to your own detection model. You may also compare DN-DETR with DAB-DETR to see how these functions are added.

You are also encouraged to apply it to some other DETR-like models or even traditional detection models and update results in this repo.
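
As a rough illustration of what the denoising part feeds to the decoder, the sketch below perturbs ground-truth labels and boxes in plain Python. It is not the code in dn_components.py: the function names `noise_labels` and `noise_boxes` are hypothetical, boxes are assumed to be normalized (cx, cy, w, h), and the default scales 0.2 and 0.4 are simply the `label_noise_scale` and `box_noise_scale` values that appear in the training log quoted in the issues further down.

```python
import random


def noise_labels(labels, num_classes, label_noise_scale=0.2, rng=None):
    """Replace each label with a random class with probability label_noise_scale."""
    rng = rng or random.Random()
    noised = []
    for label in labels:
        if rng.random() < label_noise_scale:
            # The random class may happen to equal the original label.
            noised.append(rng.randrange(num_classes))
        else:
            noised.append(label)
    return noised


def noise_boxes(boxes, box_noise_scale=0.4, rng=None):
    """Jitter normalized (cx, cy, w, h) boxes.

    Centers move by at most box_noise_scale * (w / 2, h / 2) and sizes change
    by at most box_noise_scale * (w, h). Results are clamped to [0, 1].
    """
    rng = rng or random.Random()
    noised = []
    for cx, cy, w, h in boxes:
        limits = (w / 2, h / 2, w, h)
        jittered = [
            value + rng.uniform(-1.0, 1.0) * limit * box_noise_scale
            for value, limit in zip((cx, cy, w, h), limits)
        ]
        noised.append(tuple(min(max(v, 0.0), 1.0) for v in jittered))
    return noised


rng = random.Random(0)
gt_labels = [3, 17, 17, 42]
gt_boxes = [
    (0.5, 0.5, 0.2, 0.4),
    (0.1, 0.2, 0.1, 0.1),
    (0.7, 0.3, 0.3, 0.2),
    (0.4, 0.8, 0.2, 0.2),
]
dn_labels = noise_labels(gt_labels, num_classes=91, rng=rng)
dn_boxes = noise_boxes(gt_boxes, rng=rng)
```

During training, the target for each noised query is the original label and box it was derived from, so the denoising part needs no bipartite matching.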

Installation

We use the DAB-DETR project as our codebase, hence no extra dependency is needed for our DN-DETR. For the DN-Deformable-DETR, you need to compile the deformable attention operator manually.

We tested our models with python=3.7.3, pytorch=1.9.0, cuda=11.1. Other versions may work as well.

Clone this repo

git clone https://github.com/IDEA-Research/DN-DETR.git
cd DN-DETR

Install PyTorch and torchvision

Follow the instructions at https://pytorch.org/get-started/locally/.

# an example:
conda install -c pytorch pytorch torchvision

Install other needed packages

pip install -r requirements.txt

Compiling CUDA operators

cd models/dn_dab_deformable_detr/ops
python setup.py build install
# unit test (should see all checking is True)
python test.py
cd ../../..

Data

Please download the COCO 2017 dataset and organize it as follows:

COCODIR/
  ├── train2017/
  ├── val2017/
  └── annotations/
  	├── instances_train2017.json
  	└── instances_val2017.json
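
Before launching a long run, it can help to confirm that the directory matches the layout above. The following is a minimal sketch using only the standard library; `missing_coco_entries` is a hypothetical helper, not part of this repo, and it only checks that the entries exist, not that the annotation files are valid.

```python
from pathlib import Path

# Entries taken from the directory tree shown above.
EXPECTED_ENTRIES = (
    "train2017",
    "val2017",
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
)


def missing_coco_entries(coco_dir):
    """Return the expected entries that are absent under coco_dir."""
    root = Path(coco_dir)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# An empty list means the layout looks complete.
print(missing_coco_entries("/path/to/your/COCODIR"))
```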

Run

We use the standard DN-DETR-R50 and DN-Deformable-DETR-R50 as examples for training and evaluation.

Eval our pretrained models

Download our DN-DETR-R50 model checkpoint from this link and run the command below. You can expect a final AP of about 44.4.

For our DN-DAB-Deformable-DETR_Deformable_Encoder_Only (download here), the expected final AP is 48.6.

For our DN-DAB-Deformable-DETR (download here), the expected final AP is 49.5.

# In all commands below, replace --coco_path with your COCO path
# and --resume with your checkpoint path.

# for dn_detr: 44.1 AP; optimized result is 44.4 AP
python main.py -m dn_dab_detr \
  --output_dir logs/dn_DABDETR/R50 \
  --batch_size 1 \
  --coco_path /path/to/your/COCODIR \
  --resume /path/to/our/checkpoint \
  --use_dn \
  --eval

# for dn_dab_deformable_detr: 49.5 AP
python main.py -m dn_dab_deformable_detr \
  --output_dir logs/dab_deformable_detr/R50 \
  --batch_size 1 \
  --coco_path /path/to/your/COCODIR \
  --resume /path/to/our/checkpoint \
  --transformer_activation relu \
  --use_dn \
  --eval

# for dn_dab_deformable_detr_deformable_encoder_only: 48.6 AP
# (--num_patterns 3 uses 3 pattern embeddings)
python main.py -m dn_dab_deformable_detr_deformable_encoder_only \
  --output_dir logs/dab_deformable_detr/R50 \
  --batch_size 1 \
  --coco_path /path/to/your/COCODIR \
  --resume /path/to/our/checkpoint \
  --transformer_activation relu \
  --num_patterns 3 \
  --use_dn \
  --eval

Training your own models

Similarly, you can train our model in a single process:

# for dn_detr
# (replace --coco_path with your COCO path)
python main.py -m dn_dab_detr \
  --output_dir logs/dn_DABDETR/R50 \
  --batch_size 1 \
  --epochs 50 \
  --lr_drop 40 \
  --coco_path /path/to/your/COCODIR \
  --use_dn

Distributed Run

However, as training is time-consuming, we suggest training the model on multiple devices.

If you plan to train the models on a cluster with Slurm, here is an example command for training:

# for dn_detr: 44.4 AP
python run_with_submitit.py \
  --timeout 3000 \
  --job_name DNDETR \
  --coco_path /path/to/your/COCODIR \
  -m dn_dab_detr \
  --job_dir logs/dn_DABDETR/R50_%j \
  --batch_size 2 \
  --ngpus 8 \
  --nodes 1 \
  --epochs 50 \
  --lr_drop 40 \
  --use_dn

# for dn_dab_deformable_detr: 49.5 AP
python run_with_submitit.py \
  --timeout 3000 \
  --job_name dn_dab_deformable_detr \
  --coco_path /path/to/your/COCODIR \
  -m dn_dab_deformable_detr \
  --transformer_activation relu \
  --job_dir logs/dn_dab_deformable_detr/R50_%j \
  --batch_size 2 \
  --ngpus 8 \
  --nodes 1 \
  --epochs 50 \
  --lr_drop 40 \
  --use_dn

# for dn_dab_deformable_detr_deformable_encoder_only: 48.6 AP
python run_with_submitit.py \
  --timeout 3000 \
  --job_name dn_dab_deformable_detr_deformable_encoder_only \
  --coco_path /path/to/your/COCODIR \
  -m dn_dab_deformable_detr_deformable_encoder_only \
  --transformer_activation relu \
  --job_dir logs/dn_dab_deformable_detr/R50_%j \
  --num_patterns 3 \
  --batch_size 1 \
  --ngpus 8 \
  --nodes 2 \
  --epochs 50 \
  --lr_drop 40 \
  --use_dn

If you want to train our DC5 version or multiple-pattern version, add

--dilation  # for the DC5 version

--num_patterns 3  # for 3 patterns

However, this requires additional training resources and memory, e.g., 16 GPUs.

The final AP should be similar to or better than ours, as our optimized results are better than the performance reported in the paper (for example, we report 44.1 for DN-DETR, but our new result reaches 44.4). Don't be surprised if you get a better result!

Our training setting is the same as DAB-DETR's except for the added argument --use_dn; you may also refer to DAB-DETR.

Notes:

The results are sensitive to the batch size. We use 16 (2 images per GPU x 8 GPUs) by default.
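
The total batch size is the product of the per-GPU batch size, the GPUs per node, and the number of nodes. The short sketch below checks this for the example commands in this README; `effective_batch_size` is a hypothetical helper, and the calculation assumes one process per GPU and no gradient accumulation.

```python
def effective_batch_size(per_gpu_batch, gpus_per_node, nodes=1):
    """Total images per optimizer step across all processes."""
    return per_gpu_batch * gpus_per_node * nodes


# --batch_size 2 --ngpus 8 --nodes 1 (DN-DETR Slurm example)
dn_detr = effective_batch_size(per_gpu_batch=2, gpus_per_node=8, nodes=1)
# --batch_size 1 --ngpus 8 --nodes 2 (deformable-encoder-only Slurm example)
encoder_only = effective_batch_size(per_gpu_batch=1, gpus_per_node=8, nodes=2)
print(dn_detr, encoder_only)  # 16 16
```

By the same arithmetic, the single-process command with --batch_size 1 trains with a total batch size of 1, so its results may differ from the reported ones.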

Or run with multiple processes on a single node:

# for dn_dab_detr: 44.4 AP
python -m torch.distributed.launch --nproc_per_node=8 \
  main.py -m dn_dab_detr \
  --output_dir logs/dn_DABDETR/R50 \
  --batch_size 2 \
  --epochs 50 \
  --lr_drop 40 \
  --coco_path /path/to/your/COCODIR \
  --use_dn

# for dn_deformable_detr: 49.5 AP
python -m torch.distributed.launch --nproc_per_node=8 \
  main.py -m dn_dab_deformable_detr \
  --output_dir logs/dn_dab_deformable_detr/R50 \
  --batch_size 2 \
  --epochs 50 \
  --lr_drop 40 \
  --transformer_activation relu \
  --coco_path /path/to/your/COCODIR \
  --use_dn

License

DN-DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Bibtex

If you find our work helpful for your research, please consider citing the following BibTeX entry.

@inproceedings{li2022dn,
  title={Dn-detr: Accelerate detr training by introducing query denoising},
  author={Li, Feng and Zhang, Hao and Liu, Shilong and Guo, Jian and Ni, Lionel M and Zhang, Lei},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={13619--13627},
  year={2022}
}

dn-detr's Issues

use pretrain model error

Hi! I'm having trouble.
I am training on my own dataset with a pre-trained model:
--pretrain_model_path checkpoint.pth
This results in the following error:
Traceback (most recent call last):
File "C:/Users/20825/Desktop/detr_code/DN-DETR-main/main.py", line 426, in <module>
main(args)
File "C:/Users/20825/Desktop/detr_code/DN-DETR-main/main.py", line 352, in main
train_stats = train_one_epoch(
File "C:\Users\20825\Desktop\detr_code\DN-DETR-main\engine.py", line 52, in train_one_epoch
outputs = model(samples)
File "C:\anaconda\envs\Detr\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\20825\Desktop\detr_code\DN-DETR-main\models\DN_DAB_DETR\DABDETR.py", line 176, in forward
prepare_for_dn(dn_args, embedweight, src.size(0), self.training, self.num_queries, self.num_classes,
File "C:\Users\20825\Desktop\detr_code\DN-DETR-main\models\DN_DAB_DETR\dn_components.py", line 61, in prepare_for_dn
targets, scalar, label_noise_scale, box_noise_scale, num_patterns = dn_args
TypeError: cannot unpack non-iterable NoneType object

Thank you for your answer
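
The last frame of the traceback unpacks `dn_args`, and the call further up is `model(samples)` with no second argument, so `dn_args` is presumably still `None` at that point. The sketch below reproduces the failure in isolation and shows a guard with a clearer message. `unpack_dn_args` is a hypothetical helper, not a function in this repo, and the hint about `--use_dn` is an inference from the README's training commands rather than a confirmed fix.

```python
def unpack_dn_args(dn_args):
    """Unpack the 5-tuple used for denoising, failing loudly if it is missing."""
    if dn_args is None:
        raise ValueError(
            "dn_args is None: denoising inputs were not passed to the model "
            "(was training started without --use_dn?)"
        )
    targets, scalar, label_noise_scale, box_noise_scale, num_patterns = dn_args
    return targets, scalar, label_noise_scale, box_noise_scale, num_patterns


# Reproduce the error from the traceback: unpacking None raises TypeError.
try:
    a, b, c, d, e = None
except TypeError as err:
    original_error = str(err)

print(original_error)
```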

When will the open source code be released?

Trained checkpoints for DN-Detr

6	DN-DAB-Deformable-DETR-R50-v2⁴	R50	49.5 (48.4 in 24 epochs)	Google Drive / BaiDu	Optimized implementation with deformable attention in both encoder and decoder. See DAB-DETR for more details.

I downloaded checkpoint0049.pth from this URL. Judging by its name, it seems to be the model trained for 50 epochs. But I got the results below when testing:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.484
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.665
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.526
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.301
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.517
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.639
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.361
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.590
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.626
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.436
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.673
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.780

Is this normal?

paper link?

How to train on my own dataset? (not COCO)

If I use my own dataset, what do I need to change? The AP I get with my own dataset is 0.0.

Denoising part

First, thanks for your excellent work, but I have a question.
Regarding Figure 3b in the paper: does it contain the denoising part, or is that part omitted in this figure?
I read your previous answer in another issue: the class label embedding is the noised label, right? But does it also include the noised box?

need zero_grad

https://github.com/IDEA-opensource/DN-DETR/blob/206fa267ba7df978fa968edda9f7dd351a4b72c1/engine.py#L84-L88

It looks like you forgot optimizer.zero_grad().
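
Whether the referenced lines of engine.py actually lack the call cannot be checked from this page, but the concern itself is easy to illustrate. PyTorch accumulates gradients across backward passes until they are reset, so a training loop that never resets them steps along a running sum of stale gradients. The toy below is plain Python, not PyTorch: `ToyParam`, `backward`, and `sgd_step` are hypothetical stand-ins that only mimic the accumulate-then-step behaviour.

```python
class ToyParam:
    """A single scalar parameter with an accumulated gradient."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(param, target):
    """Add d/dvalue of (value - target)^2 to param.grad, as autograd accumulates."""
    param.grad += 2.0 * (param.value - target)


def sgd_step(param, lr):
    param.value -= lr * param.grad


def train(steps, zero_grad, lr=0.1, target=1.0):
    param = ToyParam(0.0)
    for _ in range(steps):
        if zero_grad:
            param.grad = 0.0  # plays the role of optimizer.zero_grad()
        backward(param, target)
        sgd_step(param, lr)
    return param.value


with_reset = train(steps=20, zero_grad=True)
without_reset = train(steps=20, zero_grad=False)
print(with_reset, without_reset)
```

With the reset, the parameter approaches the target of 1.0; without it, the accumulated gradient makes the parameter overshoot and oscillate around the target instead of settling.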

how to add DN to Anchor DETR with 2D Anchors

Thanks for this amazing work!
The paper DN-DETR: Accelerate DETR Training by Introducing Query DeNoising says that DN can be added to Anchor DETR, but I couldn't find the relevant code in this project.
Looking forward to your reply!

About "inference.py"

How can I solve the following problem when running the prediction script?
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)

Segmentation head

Hi,

Please tell me how to run your code with the segmentation head. The --masks flag doesn't work.

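Why is pad_size set to int(max(known_num))?

Hello, I printed the shapes of some intermediate variables, and there is one thing I don't quite understand.
My understanding is as follows. Considering only the tgt part: the 300 entries are learnable embeddings, and the pad part holds the noised labels.

As in the figure, the batch size is 2 and the two images have 4 and 16 labels respectively, so after the noised-label tensor is repeated scalar times its size becomes 20×5=100.
But if pad_size is only set to the maximum of known_num, the pad part has size 16×5=80.
In that case the new tgt has size 380, but there are 100 noised labels, which would take up 20 slots of the non-denoising part.

Of course, with the batch_size=1 from your training arguments this problem does not occur, but a batch size of 1 is rather slow. For batch size > 1, could pad_size be set to sum(known_num) instead? Would this change affect the overall model performance? Thank you.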
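
The numbers in this question can be reconciled if the denoising slots are counted per image rather than across the whole batch. The sketch below works through that arithmetic under the assumption that the query tensor has a batch dimension and each image gets its own pad_size slots; `dn_layout` is a hypothetical helper written for this illustration, not code from dn_components.py.

```python
def dn_layout(known_num, scalar, num_queries):
    """Per-image query layout when the denoising part is padded to the largest image."""
    pad_size = max(known_num) * scalar       # denoising slots reserved per image
    used = [n * scalar for n in known_num]   # slots each image actually fills
    return {
        "pad_size": pad_size,
        "queries_per_image": pad_size + num_queries,
        "used_slots": used,
        "empty_slots": [pad_size - u for u in used],
        "total_noised_labels": sum(used),
    }


# Values from the question: two images with 4 and 16 labels, scalar = 5.
layout = dn_layout(known_num=[4, 16], scalar=5, num_queries=300)
print(layout)
# {'pad_size': 80, 'queries_per_image': 380, 'used_slots': [20, 80],
#  'empty_slots': [60, 0], 'total_noised_labels': 100}
```

Under this reading, the 100 noised labels are spread over two images (20 and 80), so neither image exceeds its 80 reserved slots and nothing spills into the 300 matching queries.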

How to use --drop_lr_now

Thank you for your excellent job! I wonder how to use and when to use --drop_lr_now?
Thank you!

dilation convolution and two-stage strategy

Thanks for your great work,
Your pre-trained model DN-DAB-Deformable-DETR-R50-v2 does not use dilated convolution or the two-stage strategy. Would using these two strategies further improve the performance?
Looking forward to your reply!

optimizer zero_grad before step

https://github.com/IDEA-opensource/DN-DETR/blob/206fa267ba7df978fa968edda9f7dd351a4b72c1/engine.py#L76

It looks like you should call zero_grad after scaler.update().

Batch size effects

On my machine I can only run with a batch size of 1. How much will this degrade the results? I ran with exactly the same parameters as your best configuration, except for the batch size, and the quality is much worse than Mask R-CNN.

How to calculate flops

Hi! thanks for your excellent work, I'm wondering how to evaluate the flops of DN-DETR model?
I can't easily use the DETR script below because of the dn_components.
facebookresearch/detr#110
Could you please share your Python script?

Why call it DN-DETR rather than DN-DAB-DETR?

Glad to see the great work and excitedly awaiting the release of the code, but still got a concern:
Why call it DN-DETR rather than DN-DAB-DETR?
Since the main method of the paper was built upon DAB-DETR and the sota numbers were achieved based on the DAB-DETR, then I guess it should be called DN-DAB-DETR. The name DN-DETR sounds like the work was based on vanilla DETR without changing the original Transformer decoder. I saw that your experiments based on Deformable-DETR were called DN-Deformable-DETR, the name of DN-DETR becomes more misleading with respect to that.
The paper also says the code will be organized as a plug-in module that can be applied to any DETR-like model, including vanilla DETR, so what should it be called when the denoising method is applied to vanilla DETR? Also, the result for that case does not seem to appear in the results table.
Many thanks

class_embed

self.class_embed = nn.Linear(hidden_dim, num_classes)

self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)

self.num_feature_levels = num_feature_levels

self.use_dab = use_dab

self.num_patterns = num_patterns

self.random_refpoints_xy = random_refpoints_xy

self.label_enc = nn.Embedding(num_classes + 1, hidden_dim - 1)

Why does class_embed use 91 classes here while label_enc uses 92? In the standard DETR, class_embed seems to be

nn.Linear(hidden_dim, num_classes+1)

Hello! Are you ready to update your code?

The normalization of sine positional embedding

Hi,
First, thanks for your great work.
While studying your code, I found a small issue with the sine positional embedding. I think the - 0.5 here should be outside the brackets, and I don't know whether this affects the temperature tuning experiment in the DAB-DETR paper.

assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

Traceback (most recent call last):
File "main.py", line 427, in <module>
main(args)
File "main.py", line 355, in main
args.clip_max_norm, wo_class_error=wo_class_error, lr_scheduler=lr_scheduler, args=args, logger=(logger if args.save_log else None))
File "/home/cxq/dp_work/objectdetection/DN-DETR/engine.py", line 50, in train_one_epoch
loss_dict = criterion(outputs, targets, mask_dict)
File "/home/cxq/.conda/envs/torch10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxq/dp_work/objectdetection/DN-DETR/models/DN_DAB_DETR/DABDETR.py", line 371, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/home/cxq/.conda/envs/torch10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxq/.conda/envs/torch10/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/cxq/dp_work/objectdetection/DN-DETR/models/DN_DAB_DETR/matcher.py", line 83, in forward
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
File "/home/cxq/dp_work/objectdetection/DN-DETR/util/box_ops.py", line 52, in generalized_box_iou
assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

I'm not sure why I got this error during training. I had already trained up to Epoch: [38] [12440/14785] eta: 0:17:40
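
The failing assertion checks that x2 >= x1 and y2 >= y1 for the first argument, which in this traceback is the predicted boxes. Any comparison with NaN is false, so a single NaN prediction is enough to trip it; that is one possible cause (an assumption, not a diagnosis of this run), alongside negative predicted widths or heights. The sketch below is plain Python rather than the repo's tensor code: `box_cxcywh_to_xyxy` is reimplemented here for single boxes, and `invalid_boxes` is a hypothetical helper for locating offending entries.

```python
import math


def box_cxcywh_to_xyxy(box):
    """Convert one (cx, cy, w, h) box to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def invalid_boxes(boxes_cxcywh):
    """Return (index, reason) for boxes that would fail x1 <= x2 and y1 <= y2."""
    problems = []
    for i, box in enumerate(boxes_cxcywh):
        if any(math.isnan(v) or math.isinf(v) for v in box):
            problems.append((i, "non-finite value"))
            continue
        x1, y1, x2, y2 = box_cxcywh_to_xyxy(box)
        if x2 < x1 or y2 < y1:
            problems.append((i, "negative width or height"))
    return problems


boxes = [
    (0.5, 0.5, 0.2, 0.2),            # valid
    (0.5, 0.5, float("nan"), 0.2),   # NaN width
    (0.3, 0.3, -0.1, 0.2),           # negative width
]
print(invalid_boxes(boxes))
# [(1, 'non-finite value'), (2, 'negative width or height')]
```

Running a check like this on the model outputs just before the matcher can show whether the predictions had already become non-finite by the time the assertion fired.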

TypeError: cannot unpack non-iterable NoneType object

D:\anaconda3.9\envs\zj\python.exe F:/1chen/DETR/jin/dn/DN-DETR/main.py
Not using distributed mode
[08/13 08:31:54.869]: git:
sha: a59a5de, status: has uncommited changes, branch: main

[08/13 08:31:54.869]: Command: F:/1chen/DETR/jin/dn/DN-DETR/main.py
[08/13 08:31:54.869]: Full config saved to log/r50\config.json
[08/13 08:31:54.869]: world size: 1
[08/13 08:31:54.869]: rank: 0
[08/13 08:31:54.869]: local_rank: 0
[08/13 08:31:54.870]: args: Namespace(amp=False, aux_loss=True, backbone='resnet50', backbone_freeze_keywords=None, batch_norm_type='FrozenBatchNorm2d', batch_size=2, bbox_loss_coef=5, box_noise_scale=0.4, clip_max_norm=0.1, cls_loss_coef=1, coco_panoptic_path=None, coco_path='COCODIR', dataset_file='coco', debug=False, dec_layers=6, dec_n_points=4, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, drop_lr_now=False, dropout=0.0, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=10, eval=False, find_unused_params=False, finetune_ignore=None, fix_size=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, label_noise_scale=0.2, local_rank=0, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, modelname='dn_dab_deformable_detr', nheads=8, note='', num_feature_levels=4, num_patterns=0, num_queries=300, num_select=300, num_workers=10, output_dir='log/r50', pe_temperatureH=20, pe_temperatureW=20, position_embedding='sine', pre_norm=False, pretrain_model_path=None, random_refpoints_xy=False, rank=0, remove_difficult=False, resume='', return_interm_layers=False, save_checkpoint_interval=10, save_log=False, save_results=False, scalar=5, seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, transformer_activation='prelu', two_stage=False, use_dn=False, weight_decay=0.0001, world_size=1)

Namespace(amp=False, aux_loss=True, backbone='resnet50', backbone_freeze_keywords=None, batch_norm_type='FrozenBatchNorm2d', batch_size=2, bbox_loss_coef=5, box_noise_scale=0.4, clip_max_norm=0.1, cls_loss_coef=1, coco_panoptic_path=None, coco_path='COCODIR', dataset_file='coco', debug=False, dec_layers=6, dec_n_points=4, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, drop_lr_now=False, dropout=0.0, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=10, eval=False, find_unused_params=False, finetune_ignore=None, fix_size=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, label_noise_scale=0.2, local_rank=0, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, modelname='dn_dab_deformable_detr', nheads=8, note='', num_feature_levels=4, num_patterns=0, num_queries=300, num_select=300, num_workers=10, output_dir='log/r50', pe_temperatureH=20, pe_temperatureW=20, position_embedding='sine', pre_norm=False, pretrain_model_path=None, random_refpoints_xy=False, rank=0, remove_difficult=False, resume='', return_interm_layers=False, save_checkpoint_interval=10, save_log=False, save_results=False, scalar=5, seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, transformer_activation='prelu', two_stage=False, use_dn=False, weight_decay=0.0001, world_size=1)
[08/13 08:31:55.431]: number of params:47206754
[08/13 08:31:55.433]: params:
{
"transformer.level_embed": 1024,
"transformer.encoder.layers.0.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.0.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.0.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.0.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.0.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.0.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.0.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.0.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.0.norm1.weight": 256,
"transformer.encoder.layers.0.norm1.bias": 256,
"transformer.encoder.layers.0.linear1.weight": 524288,
"transformer.encoder.layers.0.linear1.bias": 2048,
"transformer.encoder.layers.0.linear2.weight": 524288,
"transformer.encoder.layers.0.linear2.bias": 256,
"transformer.encoder.layers.0.norm2.weight": 256,
"transformer.encoder.layers.0.norm2.bias": 256,
"transformer.encoder.layers.1.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.1.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.1.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.1.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.1.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.1.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.1.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.1.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.1.norm1.weight": 256,
"transformer.encoder.layers.1.norm1.bias": 256,
"transformer.encoder.layers.1.linear1.weight": 524288,
"transformer.encoder.layers.1.linear1.bias": 2048,
"transformer.encoder.layers.1.linear2.weight": 524288,
"transformer.encoder.layers.1.linear2.bias": 256,
"transformer.encoder.layers.1.norm2.weight": 256,
"transformer.encoder.layers.1.norm2.bias": 256,
"transformer.encoder.layers.2.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.2.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.2.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.2.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.2.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.2.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.2.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.2.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.2.norm1.weight": 256,
"transformer.encoder.layers.2.norm1.bias": 256,
"transformer.encoder.layers.2.linear1.weight": 524288,
"transformer.encoder.layers.2.linear1.bias": 2048,
"transformer.encoder.layers.2.linear2.weight": 524288,
"transformer.encoder.layers.2.linear2.bias": 256,
"transformer.encoder.layers.2.norm2.weight": 256,
"transformer.encoder.layers.2.norm2.bias": 256,
"transformer.encoder.layers.3.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.3.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.3.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.3.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.3.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.3.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.3.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.3.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.3.norm1.weight": 256,
"transformer.encoder.layers.3.norm1.bias": 256,
"transformer.encoder.layers.3.linear1.weight": 524288,
"transformer.encoder.layers.3.linear1.bias": 2048,
"transformer.encoder.layers.3.linear2.weight": 524288,
"transformer.encoder.layers.3.linear2.bias": 256,
"transformer.encoder.layers.3.norm2.weight": 256,
"transformer.encoder.layers.3.norm2.bias": 256,
"transformer.encoder.layers.4.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.4.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.4.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.4.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.4.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.4.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.4.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.4.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.4.norm1.weight": 256,
"transformer.encoder.layers.4.norm1.bias": 256,
"transformer.encoder.layers.4.linear1.weight": 524288,
"transformer.encoder.layers.4.linear1.bias": 2048,
"transformer.encoder.layers.4.linear2.weight": 524288,
"transformer.encoder.layers.4.linear2.bias": 256,
"transformer.encoder.layers.4.norm2.weight": 256,
"transformer.encoder.layers.4.norm2.bias": 256,
"transformer.encoder.layers.5.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.5.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.5.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.5.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.5.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.5.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.5.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.5.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.5.norm1.weight": 256,
"transformer.encoder.layers.5.norm1.bias": 256,
"transformer.encoder.layers.5.linear1.weight": 524288,
"transformer.encoder.layers.5.linear1.bias": 2048,
"transformer.encoder.layers.5.linear2.weight": 524288,
"transformer.encoder.layers.5.linear2.bias": 256,
"transformer.encoder.layers.5.norm2.weight": 256,
"transformer.encoder.layers.5.norm2.bias": 256,
"transformer.decoder.layers.0.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.0.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.0.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.0.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.0.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.0.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.0.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.0.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.0.norm1.weight": 256,
"transformer.decoder.layers.0.norm1.bias": 256,
"transformer.decoder.layers.0.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.0.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.0.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.0.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.0.norm2.weight": 256,
"transformer.decoder.layers.0.norm2.bias": 256,
"transformer.decoder.layers.0.linear1.weight": 524288,
"transformer.decoder.layers.0.linear1.bias": 2048,
"transformer.decoder.layers.0.linear2.weight": 524288,
"transformer.decoder.layers.0.linear2.bias": 256,
"transformer.decoder.layers.0.norm3.weight": 256,
"transformer.decoder.layers.0.norm3.bias": 256,
"transformer.decoder.layers.1.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.1.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.1.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.1.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.1.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.1.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.1.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.1.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.1.norm1.weight": 256,
"transformer.decoder.layers.1.norm1.bias": 256,
"transformer.decoder.layers.1.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.1.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.1.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.1.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.1.norm2.weight": 256,
"transformer.decoder.layers.1.norm2.bias": 256,
"transformer.decoder.layers.1.linear1.weight": 524288,
"transformer.decoder.layers.1.linear1.bias": 2048,
"transformer.decoder.layers.1.linear2.weight": 524288,
"transformer.decoder.layers.1.linear2.bias": 256,
"transformer.decoder.layers.1.norm3.weight": 256,
"transformer.decoder.layers.1.norm3.bias": 256,
"transformer.decoder.layers.2.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.2.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.2.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.2.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.2.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.2.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.2.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.2.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.2.norm1.weight": 256,
"transformer.decoder.layers.2.norm1.bias": 256,
"transformer.decoder.layers.2.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.2.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.2.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.2.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.2.norm2.weight": 256,
"transformer.decoder.layers.2.norm2.bias": 256,
"transformer.decoder.layers.2.linear1.weight": 524288,
"transformer.decoder.layers.2.linear1.bias": 2048,
"transformer.decoder.layers.2.linear2.weight": 524288,
"transformer.decoder.layers.2.linear2.bias": 256,
"transformer.decoder.layers.2.norm3.weight": 256,
"transformer.decoder.layers.2.norm3.bias": 256,
"transformer.decoder.layers.3.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.3.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.3.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.3.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.3.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.3.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.3.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.3.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.3.norm1.weight": 256,
"transformer.decoder.layers.3.norm1.bias": 256,
"transformer.decoder.layers.3.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.3.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.3.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.3.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.3.norm2.weight": 256,
"transformer.decoder.layers.3.norm2.bias": 256,
"transformer.decoder.layers.3.linear1.weight": 524288,
"transformer.decoder.layers.3.linear1.bias": 2048,
"transformer.decoder.layers.3.linear2.weight": 524288,
"transformer.decoder.layers.3.linear2.bias": 256,
"transformer.decoder.layers.3.norm3.weight": 256,
"transformer.decoder.layers.3.norm3.bias": 256,
"transformer.decoder.layers.4.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.4.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.4.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.4.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.4.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.4.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.4.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.4.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.4.norm1.weight": 256,
"transformer.decoder.layers.4.norm1.bias": 256,
"transformer.decoder.layers.4.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.4.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.4.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.4.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.4.norm2.weight": 256,
"transformer.decoder.layers.4.norm2.bias": 256,
"transformer.decoder.layers.4.linear1.weight": 524288,
"transformer.decoder.layers.4.linear1.bias": 2048,
"transformer.decoder.layers.4.linear2.weight": 524288,
"transformer.decoder.layers.4.linear2.bias": 256,
"transformer.decoder.layers.4.norm3.weight": 256,
"transformer.decoder.layers.4.norm3.bias": 256,
"transformer.decoder.layers.5.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.5.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.5.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.5.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.5.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.5.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.5.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.5.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.5.norm1.weight": 256,
"transformer.decoder.layers.5.norm1.bias": 256,
"transformer.decoder.layers.5.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.5.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.5.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.5.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.5.norm2.weight": 256,
"transformer.decoder.layers.5.norm2.bias": 256,
"transformer.decoder.layers.5.linear1.weight": 524288,
"transformer.decoder.layers.5.linear1.bias": 2048,
"transformer.decoder.layers.5.linear2.weight": 524288,
"transformer.decoder.layers.5.linear2.bias": 256,
"transformer.decoder.layers.5.norm3.weight": 256,
"transformer.decoder.layers.5.norm3.bias": 256,
"transformer.decoder.query_scale.layers.0.weight": 65536,
"transformer.decoder.query_scale.layers.0.bias": 256,
"transformer.decoder.query_scale.layers.1.weight": 65536,
"transformer.decoder.query_scale.layers.1.bias": 256,
"transformer.decoder.ref_point_head.layers.0.weight": 131072,
"transformer.decoder.ref_point_head.layers.0.bias": 256,
"transformer.decoder.ref_point_head.layers.1.weight": 65536,
"transformer.decoder.ref_point_head.layers.1.bias": 256,
"transformer.decoder.bbox_embed.0.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.0.layers.0.bias": 256,
"transformer.decoder.bbox_embed.0.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.0.layers.1.bias": 256,
"transformer.decoder.bbox_embed.0.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.0.layers.2.bias": 4,
"transformer.decoder.bbox_embed.1.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.1.layers.0.bias": 256,
"transformer.decoder.bbox_embed.1.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.1.layers.1.bias": 256,
"transformer.decoder.bbox_embed.1.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.1.layers.2.bias": 4,
"transformer.decoder.bbox_embed.2.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.2.layers.0.bias": 256,
"transformer.decoder.bbox_embed.2.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.2.layers.1.bias": 256,
"transformer.decoder.bbox_embed.2.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.2.layers.2.bias": 4,
"transformer.decoder.bbox_embed.3.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.3.layers.0.bias": 256,
"transformer.decoder.bbox_embed.3.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.3.layers.1.bias": 256,
"transformer.decoder.bbox_embed.3.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.3.layers.2.bias": 4,
"transformer.decoder.bbox_embed.4.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.4.layers.0.bias": 256,
"transformer.decoder.bbox_embed.4.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.4.layers.1.bias": 256,
"transformer.decoder.bbox_embed.4.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.4.layers.2.bias": 4,
"transformer.decoder.bbox_embed.5.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.5.layers.0.bias": 256,
"transformer.decoder.bbox_embed.5.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.5.layers.1.bias": 256,
"transformer.decoder.bbox_embed.5.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.5.layers.2.bias": 4,
"class_embed.0.weight": 23296,
"class_embed.0.bias": 91,
"class_embed.1.weight": 23296,
"class_embed.1.bias": 91,
"class_embed.2.weight": 23296,
"class_embed.2.bias": 91,
"class_embed.3.weight": 23296,
"class_embed.3.bias": 91,
"class_embed.4.weight": 23296,
"class_embed.4.bias": 91,
"class_embed.5.weight": 23296,
"class_embed.5.bias": 91,
"label_enc.weight": 23460,
"tgt_embed.weight": 76500,
"refpoint_embed.weight": 1200,
"input_proj.0.0.weight": 131072,
"input_proj.0.0.bias": 256,
"input_proj.0.1.weight": 256,
"input_proj.0.1.bias": 256,
"input_proj.1.0.weight": 262144,
"input_proj.1.0.bias": 256,
"input_proj.1.1.weight": 256,
"input_proj.1.1.bias": 256,
"input_proj.2.0.weight": 524288,
"input_proj.2.0.bias": 256,
"input_proj.2.1.weight": 256,
"input_proj.2.1.bias": 256,
"input_proj.3.0.weight": 4718592,
"input_proj.3.0.bias": 256,
"input_proj.3.1.weight": 256,
"input_proj.3.1.bias": 256,
"backbone.0.body.layer2.0.conv1.weight": 32768,
"backbone.0.body.layer2.0.conv2.weight": 147456,
"backbone.0.body.layer2.0.conv3.weight": 65536,
"backbone.0.body.layer2.0.downsample.0.weight": 131072,
"backbone.0.body.layer2.1.conv1.weight": 65536,
"backbone.0.body.layer2.1.conv2.weight": 147456,
"backbone.0.body.layer2.1.conv3.weight": 65536,
"backbone.0.body.layer2.2.conv1.weight": 65536,
"backbone.0.body.layer2.2.conv2.weight": 147456,
"backbone.0.body.layer2.2.conv3.weight": 65536,
"backbone.0.body.layer2.3.conv1.weight": 65536,
"backbone.0.body.layer2.3.conv2.weight": 147456,
"backbone.0.body.layer2.3.conv3.weight": 65536,
"backbone.0.body.layer3.0.conv1.weight": 131072,
"backbone.0.body.layer3.0.conv2.weight": 589824,
"backbone.0.body.layer3.0.conv3.weight": 262144,
"backbone.0.body.layer3.0.downsample.0.weight": 524288,
"backbone.0.body.layer3.1.conv1.weight": 262144,
"backbone.0.body.layer3.1.conv2.weight": 589824,
"backbone.0.body.layer3.1.conv3.weight": 262144,
"backbone.0.body.layer3.2.conv1.weight": 262144,
"backbone.0.body.layer3.2.conv2.weight": 589824,
"backbone.0.body.layer3.2.conv3.weight": 262144,
"backbone.0.body.layer3.3.conv1.weight": 262144,
"backbone.0.body.layer3.3.conv2.weight": 589824,
"backbone.0.body.layer3.3.conv3.weight": 262144,
"backbone.0.body.layer3.4.conv1.weight": 262144,
"backbone.0.body.layer3.4.conv2.weight": 589824,
"backbone.0.body.layer3.4.conv3.weight": 262144,
"backbone.0.body.layer3.5.conv1.weight": 262144,
"backbone.0.body.layer3.5.conv2.weight": 589824,
"backbone.0.body.layer3.5.conv3.weight": 262144,
"backbone.0.body.layer4.0.conv1.weight": 524288,
"backbone.0.body.layer4.0.conv2.weight": 2359296,
"backbone.0.body.layer4.0.conv3.weight": 1048576,
"backbone.0.body.layer4.0.downsample.0.weight": 2097152,
"backbone.0.body.layer4.1.conv1.weight": 1048576,
"backbone.0.body.layer4.1.conv2.weight": 2359296,
"backbone.0.body.layer4.1.conv3.weight": 1048576,
"backbone.0.body.layer4.2.conv1.weight": 1048576,
"backbone.0.body.layer4.2.conv2.weight": 2359296,
"backbone.0.body.layer4.2.conv3.weight": 1048576
}
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Start training
F:\1chen\DETR\jin\dn\DN-DETR\models\dn_dab_deformable_detr\position_encoding.py:53: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
Traceback (most recent call last):
File "F:/1chen/DETR/jin/dn/DN-DETR/main.py", line 426, in <module>
main(args)
File "F:/1chen/DETR/jin/dn/DN-DETR/main.py", line 352, in main
train_stats = train_one_epoch(
File "F:\1chen\DETR\jin\dn\DN-DETR\engine.py", line 52, in train_one_epoch
outputs = model(samples)
File "D:\anaconda3.9\envs\zj\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "F:\1chen\DETR\jin\dn\DN-DETR\models\dn_dab_deformable_detr\dab_deformable_detr.py", line 206, in forward
prepare_for_dn(dn_args, tgt_all_embed, refanchor, src.size(0), self.training, self.num_queries, self.num_classes,
File "F:\1chen\DETR\jin\dn\DN-DETR\models\dn_dab_deformable_detr\dn_components.py", line 61, in prepare_for_dn
targets, scalar, label_noise_scale, box_noise_scale, num_patterns = dn_args
TypeError: cannot unpack non-iterable NoneType object

Process finished with exit code 1
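
The traceback shows `model(samples)` being called without `dn_args`, so `prepare_for_dn` receives `None` and fails on tuple unpacking. The sketch below reproduces that failure with stand-in functions and then uses the call shape that appears in another traceback on this page, `dn_args=(targets, args.scalar, args.label_noise_scale, args.box_noise_scale, args.num_patterns)`. Only the unpacking line mirrors the traceback; the function bodies and sample values are hypothetical.

```python
def prepare_for_dn(dn_args):
    """Stand-in that mirrors only the unpacking line from the traceback."""
    targets, scalar, label_noise_scale, box_noise_scale, num_patterns = dn_args
    return {"num_targets": len(targets), "scalar": scalar}


def forward(samples, dn_args=None):
    """Hypothetical forward pass: dn_args defaults to None, as the error suggests."""
    return prepare_for_dn(dn_args)


# Calling without dn_args reproduces the TypeError from the log.
try:
    forward("images")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing the five-element tuple avoids it (all values here are made up).
targets = [{"labels": [1], "boxes": [[0.5, 0.5, 0.2, 0.2]]}]
result = forward("images", dn_args=(targets, 5, 0.2, 0.4, 0))
```

If this is the cause, the call in `engine.py` would need to pass `dn_args` in the same way as the second traceback does; that has not been checked against the repository here.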

How to measure the FPS and GFLOPS?

Mismatching shape of tgt_embed and pat_embed?

In the forward part of dab_deformable_detr.py

if self.two_stage:
    assert NotImplementedError
elif self.use_dab:
    if self.num_patterns == 0:
        tgt_all_embed = tgt_embed = self.tgt_embed.weight           # nq, 256
        refanchor = self.refpoint_embed.weight      # nq, 4
        # query_embeds = torch.cat((tgt_embed, refanchor), dim=1)
    else:
        # multi patterns
        tgt_embed = self.tgt_embed.weight           # nq, 256
        pat_embed = self.patterns_embed.weight      # num_pat, 256
        tgt_embed = tgt_embed.repeat(self.num_patterns, 1) # nq*num_pat, 256
        pat_embed = pat_embed[:, None, :].repeat(1, self.num_queries, 1).flatten(0, 1) # nq*num_pat, 256
        tgt_all_embed = tgt_embed + pat_embed
        refanchor = self.refpoint_embed.weight.repeat(self.num_patterns, 1)      # nq*num_pat, 4
        # query_embeds = torch.cat((tgt_all_embed, refanchor), dim=1)
else:
    assert NotImplementedError

Isn't tgt_embed of shape (nq, hidden_dim - 1)? How can tgt_embed be added to pat_embed?
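
The parameter counts printed in the log earlier on this page are consistent with this question: `tgt_embed.weight` has 76500 elements and `refpoint_embed.weight` has 1200, which matches 300 queries with 255 and 4 columns respectively. The sketch below does that arithmetic and applies the usual trailing-dimension broadcasting rule to show why a 255-wide and a 256-wide tensor cannot be added. It assumes the log and the question refer to the same configuration and that `patterns_embed` is 256 wide, as the code comment says; `broadcastable` is a helper written for this illustration and `num_patterns = 3` is an arbitrary value.

```python
def broadcastable(shape_a, shape_b):
    """Return True if two shapes are compatible under NumPy/PyTorch broadcasting."""
    for a, b in zip(reversed(shape_a), reversed(shape_b)):
        if a != b and a != 1 and b != 1:
            return False
    return True


# Element counts taken from the parameter dump on this page.
refpoint_numel = 1200   # refpoint_embed.weight
tgt_numel = 76500       # tgt_embed.weight

num_queries = refpoint_numel // 4        # anchors have 4 coordinates
tgt_width = tgt_numel // num_queries     # width of each content query

num_patterns = 3  # hypothetical value, only needed to build the shapes
tgt_shape = (num_queries * num_patterns, tgt_width)
pat_shape = (num_queries * num_patterns, 256)
can_add = broadcastable(tgt_shape, pat_shape)
```

With these numbers `tgt_width` comes out as 255, so `tgt_embed + pat_embed` would only work if one of the two were padded or projected to the other's width.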

AP = 0

I trained dn_dab_detr on COCO 2017 for 12 epochs, but the resulting AP is 0. Could someone tell me where the problem is?
Here are the parameters and the result:
detr --coco_path ../datasets/coco2017 --use_dn --amp --dilation
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.003
Training time 6 days, 0:10:43
Now time: 2022-12-08 12:12:42.811224

Parameters from the article

Hi,

First of all, thanks for uploading your code.

Please tell me which parameters are needed to run your code and reproduce the results from the article.

About the loss

Where is the reconstruction loss? I cannot find it in DABDETR.py. Thanks.

About Introducing DN to Faster R-CNN

Excuse me, the paper (https://arxiv.org/abs/2203.01305) shows that DN training can also be used with Faster R-CNN. I wonder how to implement it. Thanks!

Object Detection and Inference Image

Hi there,

Amazing job! Thank you all.

I am wondering if you use this model for object detection. Could you release your inference code? Thanks for your help!

I think noise index is wrong

https://github.com/IDEA-opensource/DN-DETR/blob/f41c276fe0af61a8acfbd32dfdde5d00291b3cf9/models/dn_dab_deformable_detr/dn_components.py#L103-L108

I think L105 should be

diff[:, :2] = known_bbox_expand[:, :2] / 2

Am I correct?
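
For reference while checking the indices, here is a plain-Python sketch of box noising on `(cx, cy, w, h)` boxes in which the centre shift is bounded by half the box width and height, and the size change is bounded by the width and height themselves. This follows the noise bounds described in the DN-DETR paper as far as can be told without the linked lines at hand; it is not a copy of the repository code, and `noise_box` and its arguments are names chosen for this illustration.

```python
import random


def noise_box(box, box_noise_scale, rng):
    """Perturb a normalised (cx, cy, w, h) box with bounded uniform noise."""
    cx, cy, w, h = box
    # Maximum perturbation per coordinate: centre moves by at most half the
    # box size, width and height change by at most the box size (before scaling).
    diff = (w / 2, h / 2, w, h)
    noised = [
        value + (rng.random() * 2 - 1.0) * bound * box_noise_scale
        for value, bound in zip(box, diff)
    ]
    # Keep coordinates inside the normalised [0, 1] range.
    return [min(max(v, 0.0), 1.0) for v in noised]


rng = random.Random(0)
box = (0.5, 0.5, 0.2, 0.4)
noised_box = noise_box(box, box_noise_scale=0.4, rng=rng)
```

Under that reading, the bound on the centre noise comes from the width and height columns (indices 2 and 3), not from the centre coordinates themselves.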

RuntimeError: "ms_deform_attn_forward_cuda" not implemented for 'Half'

When I try to use mixed precision training, the program reports an error:
Traceback (most recent call last):
File "main.py", line 414, in <module>
main(args)
File "main.py", line 335, in main
args.clip_max_norm, wo_class_error=wo_class_error, lr_scheduler=lr_scheduler, args=args, logger=(logger if args.save_log else None))
File "/home/lyz/DN-DETR/engine.py", line 48, in train_one_epoch
outputs, mask_dict = model(samples, dn_args=(targets, args.scalar, args.label_noise_scale, args.box_noise_scale, args.num_patterns))
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/dab_deformable_detr.py", line 225, in forward
hs, init_reference, inter_references, enc_outputs_class, enc_outputs_coord_unact = self.transformer(srcs, masks, pos, query_embeds, attn_mask)
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/deformable_transformer.py", line 173, in forward
memory = self.encoder(src_flatten, spatial_shapes, level_start_index, valid_ratios, lvl_pos_embed_flatten, mask_flatten)
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/deformable_transformer.py", line 281, in forward
output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/deformable_transformer.py", line 232, in forward
src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points, src, spatial_shapes, level_start_index, padding_mask)
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/ops/modules/ms_deform_attn.py", line 113, in forward
value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/ops/functions/ms_deform_attn_func.py", line 26, in forward
value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, ctx.im2col_step)
RuntimeError: "ms_deform_attn_forward_cuda" not implemented for 'Half'
May I ask why?
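
The message says the custom CUDA kernel has no float16 implementation, so under mixed precision it receives half-precision tensors it cannot handle. One common workaround, sketched here under that assumption, is to cast the floating-point inputs of the op to float32 and cast the result back afterwards. `run_in_fp32` and `fp32_only_op` are hypothetical stand-ins; the real deformable-attention function is not called here.

```python
import torch


def fp32_only_op(value, attention_weights):
    """Stand-in for a kernel that, like the one in the traceback, rejects float16."""
    if value.dtype == torch.float16:
        raise RuntimeError("\"fp32_only_op\" not implemented for 'Half'")
    return value * attention_weights


def run_in_fp32(op, *tensors):
    """Cast floating-point tensors to float32, run op, cast the result back."""
    orig_dtype = tensors[0].dtype
    cast = [t.float() if t.is_floating_point() else t for t in tensors]
    return op(*cast).to(orig_dtype)


value = torch.ones(2, 3, dtype=torch.float16)
weights = torch.full((2, 3), 0.5, dtype=torch.float16)
out = run_in_fp32(fp32_only_op, value, weights)
```

In the repository, a cast like this would sit around the call shown in `ms_deform_attn.py`; whether that alone is sufficient for a given PyTorch version has not been verified here.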

How do you handle the number of noising queries differing within a batch?

The number of ground-truth boxes differs between images, so the number of noising queries cannot be the same for all images in a batch. How do you solve this problem?
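
One way to handle this, sketched below as an assumption rather than a statement of what the repository does, is to size the denoising part by the largest ground-truth count in the batch and leave unused slots as padding. Every image then gets `max_gt * scalar` denoising slots, where `scalar` is the number of denoising groups; the helper name and the slot layout are illustrative.

```python
def denoising_slots(gt_counts, scalar):
    """For each image, list the slot indices its noised GT copies occupy.

    gt_counts: number of ground-truth boxes per image in the batch.
    scalar: number of denoising groups (noised copies of each GT box).
    """
    single_pad = max(gt_counts) if gt_counts else 0  # slots per group
    pad_size = single_pad * scalar                   # slots per image
    slots = []
    for count in gt_counts:
        used = [
            group * single_pad + i
            for group in range(scalar)
            for i in range(count)
        ]
        slots.append(used)
    return pad_size, slots


pad_size, slots = denoising_slots([2, 3, 1], scalar=2)
```

Slots that no ground-truth box maps to simply stay as padding, so all images in the batch share one tensor shape.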

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!

I get the following error when using the inference.py code: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!
When I debug, I get the following result: input = {device: 'cuda:0'}, weight = {device: 'cpu'}
I look forward to your answer at your convenience. Thank you very much!

Inverse sigmoid

Hi, I have a question: what is the role of "Inverse Sigmoid" in your code? I notice that inverse sigmoid is used in many places in your code.
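
Inverse sigmoid (the logit function) maps a value in (0, 1) back to an unbounded real number. In DETR-style detectors with iterative box refinement, boxes are kept as normalised coordinates in [0, 1]; a common pattern is to move a reference coordinate into logit space, add a predicted offset there, and apply sigmoid again so the result stays in [0, 1]. The sketch below shows this in plain Python; the `eps` clamp and the `refine` helper are assumptions for illustration, not the repository's exact code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p, eps=1e-3):
    """Logit of p, with clamping so values at 0 or 1 do not produce infinities."""
    p = min(max(p, 0.0), 1.0)
    p1 = max(p, eps)
    p2 = max(1.0 - p, eps)
    return math.log(p1 / p2)


def refine(reference, delta):
    """Update a normalised coordinate by an unbounded offset, staying in (0, 1)."""
    return sigmoid(inverse_sigmoid(reference) + delta)


roundtrip = sigmoid(inverse_sigmoid(0.3))
refined = refine(0.9, 5.0)
```

Adding the offset directly to `0.9` would leave the valid range, whereas the logit-space update always yields a value strictly between 0 and 1.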

About the training log

Hello! Do you have a training log for DN-DETR?

How to append the indicator to the label embedding?

Thanks for your excellent work! I have two questions about the label embedding:

Does every query have its own one-hot vector with 81 dimensions (80 classes in the COCO dataset and 1 for the unknown class)?
Then can we embed the one-hot vector with an MLP to get a label embedding?
The indicator, which differentiates a denoising-part query from a matching-part query, is 1 or 0.
How is the indicator appended to the label embedding? Is the scalar just concatenated to the end of the label embedding?
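
The parameter counts in the log earlier on this page hint at one possible layout: `label_enc.weight` has 23460 elements, which equals 92 × 255, i.e. a lookup table with one 255-dimensional row per class index, leaving one dimension free for the indicator when the hidden size is 256. The sketch below assumes exactly that layout and appends the indicator as the last element. The table values are random placeholders, `embed_label` is a hypothetical helper, and which row a matching-part query uses is also an assumption.

```python
import random

hidden_dim = 256
label_enc_numel = 23460                              # from the parameter dump
num_rows = label_enc_numel // (hidden_dim - 1)       # rows in the lookup table

rng = random.Random(0)
# Placeholder table: one (hidden_dim - 1)-dimensional vector per class index.
label_table = [
    [rng.gauss(0.0, 1.0) for _ in range(hidden_dim - 1)]
    for _ in range(num_rows)
]


def embed_label(label, is_denoising):
    """Look up the label vector and append the indicator as the last element."""
    indicator = 1.0 if is_denoising else 0.0
    return label_table[label] + [indicator]


dn_query = embed_label(17, is_denoising=True)
matching_query = embed_label(num_rows - 1, is_denoising=False)
```

A table lookup is equivalent to multiplying a one-hot vector by a weight matrix, so the two views in the question describe the same operation.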

num_classes

https://github.com/IDEA-opensource/DN-DETR/blob/206fa267ba7df978fa968edda9f7dd351a4b72c1/models/DN_DAB_DETR/DABDETR.py#L484

Hi, thanks for the wonderful repo.
Is there any reason why the default is num_cls = 20 for datasets other than COCO?

About plot_logs

Hello, thanks for your wonderful work!

After training finished and I got log.txt, I wanted to visualize it using plot_logs.

But I get an error on this line:
https://github.com/IDEA-opensource/DN-DETR/blob/f41c276fe0af61a8acfbd32dfdde5d00291b3cf9/util/plot_utils.py#L65

Traceback (most recent call last):
  File "H:/yjs/code/DN-DETR-main/tmp.py", line 53, in <module>
    fig, axs = plot_logs(log_path)
  File "H:\yjs\code\DN-DETR-main\util\plot_utils.py", line 65, in plot_logs
    df.interpolate().ewm(com=ewm_col).mean().plot(
  File "H:\Anaconda\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "H:\Anaconda\lib\site-packages\pandas\core\frame.py", line 10712, in interpolate
    return super().interpolate(
  File "H:\Anaconda\lib\site-packages\pandas\core\generic.py", line 6899, in interpolate
    new_data = obj._mgr.interpolate(
  File "H:\Anaconda\lib\site-packages\pandas\core\internals\managers.py", line 377, in interpolate
    return self.apply("interpolate", **kwargs)
  File "H:\Anaconda\lib\site-packages\pandas\core\internals\managers.py", line 327, in apply
    applied = getattr(b, f)(**kwargs)
  File "H:\Anaconda\lib\site-packages\pandas\core\internals\blocks.py", line 1369, in interpolate
    new_values = values.fillna(value=fill_value, method=method, limit=limit)
  File "H:\Anaconda\lib\site-packages\pandas\core\arrays\_mixins.py", line 218, in fillna
    value, method = validate_fillna_kwargs(
  File "H:\Anaconda\lib\site-packages\pandas\util\_validators.py", line 372, in validate_fillna_kwargs
    method = clean_fill_method(method)
  File "H:\Anaconda\lib\site-packages\pandas\core\missing.py", line 120, in clean_fill_method
    raise ValueError(f"Invalid fill method. Expecting {expecting}. Got {method}")
ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear

My version of pandas is 1.3.5

I don't know whether I am using it the wrong way or whether this is a bug in pandas. How can I fix it?
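
The traceback passes through an extension-array `fillna`, which suggests that `interpolate` hit a column that is not plain numeric. One plausible cause, stated here as an assumption, is that `log.txt` contains a non-numeric field (for example a timestamp) that ends up as a column in the DataFrame. A possible workaround is to keep only the numeric columns before interpolating; the column names and values below are made up.

```python
import pandas as pd

# Toy log with one numeric column (containing a gap) and one timestamp column.
df = pd.DataFrame(
    {
        "train_loss": [1.0, None, 3.0],
        "now_time": pd.to_datetime(
            ["2022-12-08 12:00", "2022-12-08 12:10", "2022-12-08 12:20"]
        ),
    }
)

# Drop non-numeric columns, then interpolate and smooth as plot_logs does.
numeric = df.select_dtypes(include="number")
smoothed = numeric.interpolate().ewm(com=0).mean()
```

If the real log has such a column, applying the same `select_dtypes` step before the `interpolate()` call in `plot_utils.py` may avoid the error; this has not been tested against pandas 1.3.5.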

The CPU load is high when training

Hello, thanks for your wonderful work!
I modified a little code to train on my own dataset (in the same format as COCO), but the CPU load seems a little high (over 50%). When I train on the same dataset with DAB-DETR, the CPU load is very low.
Is this normal, or does it need some improvement?

Some details about implementation

Does DN-DETR add the class label embedding to the content queries (tgt in the code) only in the cross-attention module of the first decoder layer, as Conditional DETR or DAB-DETR does?

How to do visualization?

Amazing work! I want to know whether it is possible to release the model weights on Baidu or Tsinghua Cloud.

How is Known Labels Detection implemented?

Thanks for your excellent work.
Could you give more details on how Known Labels Detection is implemented? How do you let the decoder output all boxes of a specific class c using only the label embedding of class c?

Why is the attention_mask only used in self-attention, not used in cross-attention?

Hi,
Great work! I'm confused about why the attention mask is only used for self-attention. If the self-attention module were removed and only the cross-attention module kept, would information leak between the noised boxes?
https://github.com/IDEA-opensource/DN-DETR/blob/a59a5de5bf784f196e15bffed3145d05d5a9126a/models/DN_DAB_DETR/transformer.py#L125
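
In cross-attention each query attends to the image features, not to other queries, so query-to-query leakage can only happen in self-attention; that is the usual argument for masking self-attention only. A plain-Python sketch of such a self-attention mask follows. It assumes `scalar` denoising groups of `single_pad` queries placed before the matching queries, with `True` meaning "may not attend"; the layout and the helper name are assumptions for illustration.

```python
def build_dn_attn_mask(single_pad, scalar, num_queries):
    """Build a square boolean mask; mask[q][k] is True if query q may not see k."""
    pad_size = single_pad * scalar
    size = pad_size + num_queries
    mask = [[False] * size for _ in range(size)]

    # Matching queries must not see any denoising query.
    for q in range(pad_size, size):
        for k in range(pad_size):
            mask[q][k] = True

    # Each denoising group must not see the other denoising groups.
    for q in range(pad_size):
        group = q // single_pad
        for k in range(pad_size):
            if k // single_pad != group:
                mask[q][k] = True

    return mask


mask = build_dn_attn_mask(single_pad=2, scalar=2, num_queries=3)
```

In this layout denoising queries can still attend to the matching queries, since those carry no ground-truth information.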

How is class embedding implemented?

Thanks for your excellent work.
Could you give more details on how the decoder embedding is specified as a class label embedding?

How to add DN to a vanilla-DETR-like model

Thanks for this amazing work! I have some questions about adding DN to a vanilla-DETR-like model.

Could you explain more about how I can use DN with a vanilla-DETR-like algorithm?
Because vanilla DETR's object queries are not anchor-like, I don't know how to turn a denoised GT (dim = a) into an object query (dim = b, with b != a).
Will a learnable nn.Linear or another operator work?

Looking forward to your reply!

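The collate_fn function in util/misc.py

def collate_fn(batch):
    # import ipdb; ipdb.set_trace()
    batch = list(zip(*batch))
    batch[0] = nested_tensor_from_tensor_list(batch[0])
    return tuple(batch)

Here, nested_tensor_from_tensor_list pads batch[0], i.e. the training images of each batch, to the largest size in the batch so that all images in a batch have the same size. Don't the boxes need to be corrected accordingly? It seems the boxes' cx, cy, w, h still use the image coordinates from before padding.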
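
For context, a sketch of what padding does to normalised boxes: if boxes are normalised by each image's own width and height and padding is added only on the right and bottom, a box keeps its pixel position, and its coordinate in the padded frame is the original normalised coordinate times the ratio of valid size to padded size. Deformable-DETR-style models typically carry such valid ratios alongside the padding mask instead of rewriting the boxes. The helpers below are illustrative and do not reproduce `nested_tensor_from_tensor_list`.

```python
def pad_batch_info(image_sizes):
    """image_sizes: list of (height, width). Returns padded size and valid ratios."""
    max_h = max(h for h, _ in image_sizes)
    max_w = max(w for _, w in image_sizes)
    # Fraction of the padded canvas that each original image occupies.
    ratios = [(h / max_h, w / max_w) for h, w in image_sizes]
    return (max_h, max_w), ratios


def to_padded_frame(box, valid_ratio):
    """Convert a (cx, cy, w, h) box normalised by the image to the padded canvas."""
    cx, cy, w, h = box
    ratio_h, ratio_w = valid_ratio
    return (cx * ratio_w, cy * ratio_h, w * ratio_w, h * ratio_h)


padded_size, ratios = pad_batch_info([(480, 640), (600, 800)])
box_in_padded = to_padded_frame((0.5, 0.5, 0.2, 0.2), ratios[0])
```

Whether the targets need any change therefore depends on which frame the model's predictions are expressed in; the sketch only shows the conversion between the two.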

Adding the DINO component to DN-DETR

Hi, authors,

Thank you for open-sourcing your fantastic project.

I was very impressed by your successive projects DN-DETR and DINO,

so I have merged the DINO components into this earlier Deformable-DETR-based DN-DETR, which is a little different from the official DINO.

Would you, by any chance, be interested in merging DINO into this DN-DETR?

If so, please let me know and I will prepare to share the code.
Because you already have your own official DINO repo, you may not want to mix DN-DETR with another DINO implementation.
That's OK; in that case, I am considering another way to release my implementation.

Thanks.

Sizes of tensors must match except in dimension 1.

Traceback (most recent call last):
File "main.py", line 428, in <module>
main(args)
File "main.py", line 388, in main
wo_class_error=wo_class_error, args=args, logger=(logger if args.save_log else None)
File "/home/cxq/.conda/envs/torch10/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/cxq/dp_work/objectdetection/DN-DETR/engine.py", line 221, in evaluate
res_info = torch.cat((_res_bbox, _res_prob.unsqueeze(-1), _res_label.unsqueeze(-1)), 1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 900 but got size 300 for tensor number 1 in the list

This problem occurs when the code is tested after training.

An error occurs when using the --save_results argument

details about the implementation

Hi, thanks for bringing new insights to the DETR series. DN-DETR is really an excellent work that can get such high performance with only 12 epochs.

After reading the paper, I have several questions about the detailed implementation of DN-DETR.

About the class embedding. According to the description of the class embedding in the paper and the discussion in issue #3, the class embedding can be obtained in two different ways: (1) use a pre-trained language model to generate embeddings for the class names (the COCO classes plus an unknown class: [person], [bicycle], [car], ..., [toothbrush], [unknown]); (2) use a one-hot vector to represent each class, then use a Linear layer or an MLP to project the one-hot vector into the latent space. Could you give more details about the implementation?
About the learning rate. I notice that DN-DETR uses an initial learning rate of 1e-5 with a batch size of 16 (Sec 5.1), which differs from the one in DAB-DETR (lr: 1e-4, lr_backbone: 1e-5 with a batch size of 16). Is this a typo or intended? If the learning rate was adjusted in DN-DETR, could you kindly report the gains from adjusting it?

Looking forward to a reply. Thanks in advance!