idea-research / dn-detr
[CVPR 2022 Oral] Official implementation of DN-DETR
License: Apache License 2.0
Traceback (most recent call last):
File "main.py", line 428, in <module>
main(args)
File "main.py", line 388, in main
wo_class_error=wo_class_error, args=args, logger=(logger if args.save_log else None)
File "/home/cxq/.conda/envs/torch10/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/cxq/dp_work/objectdetection/DN-DETR/engine.py", line 221, in evaluate
res_info = torch.cat((_res_bbox, _res_prob.unsqueeze(-1), _res_label.unsqueeze(-1)), 1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 900 but got size 300 for tensor number 1 in the list
This problem occurs when the model is evaluated after training has finished.
The error is raised when the --save_results argument is used.
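The message above reflects the general rule for concatenation: every tensor must agree with the first one on all dimensions except the one being concatenated. A minimal pure-Python sketch of that rule follows. The helper `cat_mismatch` and the three shapes are hypothetical, chosen only to match the numbers 900 and 300 in the message; they are not taken from engine.py.

```python
def cat_mismatch(shapes, dim):
    """Return (tensor_index, axis, expected, got) for the first shape that
    disagrees with shapes[0] outside `dim`, or None if all shapes agree."""
    ref = shapes[0]
    for idx, shape in enumerate(shapes[1:], start=1):
        for axis, (want, got) in enumerate(zip(ref, shape)):
            if axis != dim and want != got:
                return idx, axis, want, got
    return None


# Hypothetical shapes chosen to reproduce the numbers in the error message.
res_bbox = (900, 4)
res_prob = (300, 1)   # shape after unsqueeze(-1)
res_label = (300, 1)  # shape after unsqueeze(-1)

print(cat_mismatch([res_bbox, res_prob, res_label], dim=1))
```

Printing the shapes of `_res_bbox`, `_res_prob` and `_res_label` just before the failing line would show which of them carries 900 rows and which carries 300.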
Hi,
Great work! I'm confused about why the attention mask is used only for self-attention. If the self-attention module were removed and only the cross-attention module kept, would information leak between the noised boxes?
https://github.com/IDEA-opensource/DN-DETR/blob/a59a5de5bf784f196e15bffed3145d05d5a9126a/models/DN_DAB_DETR/transformer.py#L125
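For context, decoder self-attention is the step where queries attend to each other, whereas in cross-attention each query attends to the encoder memory rather than to other queries. The sketch below builds a denoising-style self-attention mask under stated assumptions: the denoising queries come first, are split into equal-sized groups, and are followed by the matching queries. The function name and the layout are illustrative, not copied from transformer.py.

```python
def build_dn_attn_mask(num_queries, group_size, num_groups):
    """Return a square mask where mask[i][j] is True when query i must NOT
    attend to query j. Assumed layout: all denoising groups, then matching."""
    pad_size = group_size * num_groups
    total = pad_size + num_queries
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(pad_size):
            if i >= pad_size:
                # Matching queries cannot see any denoising query.
                mask[i][j] = True
            elif i // group_size != j // group_size:
                # A denoising query cannot see other denoising groups.
                mask[i][j] = True
    return mask


mask = build_dn_attn_mask(num_queries=3, group_size=2, num_groups=2)
for row in mask:
    # 'x' marks a blocked pair, '.' an allowed one.
    print("".join("x" if blocked else "." for blocked in row))
```

In this layout the denoising queries may still look at the matching queries, since those carry no ground-truth information.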
Hi there,
Amazing job! Thank you all.
I am wondering whether you use this model for object detection. Could you release your inference code? Thanks for your help!
Hi, thanks for the wonderful repo.
Is there any reason why the default is num_cls = 20 for datasets other than COCO?
Hi, thanks for bringing new insights to the DETR series. DN-DETR is really an excellent work that can get such high performance with only 12 epochs.
After reading the paper, I have several questions about the detailed implementation of DN-DETR.
Looking forward to a reply. Thanks in advance!
Thanks for your excellent work.
Could you give more details on how the decoder embedding is specified as a class label embedding?
Thanks for your great work.
Your pre-trained model DN-DAB-Deformable-DETR-R50-v24 uses neither dilated convolution nor the two-stage strategy. Would using these two strategies further improve the performance?
Looking forward to your reply!
When I try to use mixed precision training, the program reports an error:
Traceback (most recent call last):
File "main.py", line 414, in <module>
main(args)
File "main.py", line 335, in main
args.clip_max_norm, wo_class_error=wo_class_error, lr_scheduler=lr_scheduler, args=args, logger=(logger if args.save_log else None))
File "/home/lyz/DN-DETR/engine.py", line 48, in train_one_epoch
outputs, mask_dict = model(samples, dn_args=(targets, args.scalar, args.label_noise_scale, args.box_noise_scale, args.num_patterns))
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/dab_deformable_detr.py", line 225, in forward
hs, init_reference, inter_references, enc_outputs_class, enc_outputs_coord_unact = self.transformer(srcs, masks, pos, query_embeds, attn_mask)
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/deformable_transformer.py", line 173, in forward
memory = self.encoder(src_flatten, spatial_shapes, level_start_index, valid_ratios, lvl_pos_embed_flatten, mask_flatten)
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/deformable_transformer.py", line 281, in forward
output = layer(output, pos, reference_points, spatial_shapes, level_start_index, padding_mask)
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/deformable_transformer.py", line 232, in forward
src2 = self.self_attn(self.with_pos_embed(src, pos), reference_points, src, spatial_shapes, level_start_index, padding_mask)
File "/home/lyz/anaconda3/envs/pytorch-1.8.1/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/ops/modules/ms_deform_attn.py", line 113, in forward
value, input_spatial_shapes, input_level_start_index, sampling_locations, attention_weights, self.im2col_step)
File "/home/lyz/DN-DETR/models/dn_dab_deformable_detr/ops/functions/ms_deform_attn_func.py", line 26, in forward
value, value_spatial_shapes, value_level_start_index, sampling_locations, attention_weights, ctx.im2col_step)
RuntimeError: "ms_deform_attn_forward_cuda" not implemented for 'Half'
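May I ask why?
Hello, I printed the shapes of some intermediate variables and there is one thing I don't quite understand.
My understanding is as follows. Looking only at the tgt part: the 300 queries are the learnable embeddings, and the pad part holds the labels with added noise.
As in the figure, with a batch size of 2 and 4 and 16 labels in the two images respectively, the noised-label tensor has shape 20×5=100 after being repeated scalar times.
But if pad_size is set only from the maximum of known_num, the pad part has size 16×5=80.
In that case the new tgt has size 380, but there are 100 noised labels, which would take up 20 slots of the non-denoising part.
Of course, with the training parameters you provide (batch_size=1) this problem does not arise, but a batch size of 1 is rather slow. For batch sizes greater than 1, could pad_size be set to sum(known_num) instead? Would this change affect the overall performance of the model? Thank you.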
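The numbers in this question can be worked through directly. The sketch below assumes, as my reading rather than something confirmed in this thread, that the denoising part is padded per image, so each image in the batch has its own pad_size slots and only the batch-wide total reaches 100. The variable names mirror the question and are otherwise illustrative.

```python
known_num = [4, 16]   # ground-truth labels per image, from the question
scalar = 5            # number of times the labels are repeated
num_queries = 300     # learnable matching queries

total_noised = sum(known_num) * scalar             # across the whole batch
pad_size = max(known_num) * scalar                 # slots available per image
used_per_image = [n * scalar for n in known_num]   # slots each image fills
tgt_len = pad_size + num_queries                   # tgt length per image

print(total_noised, pad_size, used_per_image, tgt_len)
# Under the per-image assumption, no image needs more than pad_size slots.
print(all(used <= pad_size for used in used_per_image))
```

Under that assumption the image with 4 labels fills 20 of its 80 slots and leaves the rest as padding, so the 300 matching queries are not overwritten.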
Hi,
Please tell me how to run your code with the segmentation head. The --masks flag doesn't work.
Hello, thanks for your wonderful work!
I modified the code slightly to train on my own dataset (same format as COCO), but the CPU load seems a little high (over 50%). When I use DAB-DETR to train on the same dataset, the CPU load is very low.
Is this normal, or does it need some improvement?
Hello! Is there a training log for DN-DETR available?
tgt in the code) in cross-attention module of first decoder layer as Conditional DETR or DAB-DETR does?
self.class_embed = nn.Linear(hidden_dim, num_classes)
self.bbox_embed = MLP(hidden_dim, hidden_dim, 4, 3)
self.num_feature_levels = num_feature_levels
self.use_dab = use_dab
self.num_patterns = num_patterns
self.random_refpoints_xy = random_refpoints_xy
self.label_enc = nn.Embedding(num_classes + 1, hidden_dim - 1)
Why does class_embed use 91 classes here while label_enc uses 92? In standard DETR, class_embed appears to be
nn.Linear(hidden_dim, num_classes+1)
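One possible reading of the snippet, offered as an assumption and not as the authors' answer: the extra row in label_enc serves as an "unknown" label for the ordinary matching queries, and the embedding is one dimension short of hidden_dim because an indicator value (1 for denoising queries, 0 otherwise, as the paper describes) is appended afterwards. The pure-Python stand-in below only illustrates the dimension bookkeeping; `label_enc` here is a list of lists, not the real nn.Embedding, and the label 17 is arbitrary.

```python
import random

num_classes = 91
hidden_dim = 256

# Hypothetical stand-in for nn.Embedding(num_classes + 1, hidden_dim - 1):
# one row per class plus one extra row at index `num_classes`.
random.seed(0)
label_enc = [
    [random.gauss(0.0, 1.0) for _ in range(hidden_dim - 1)]
    for _ in range(num_classes + 1)
]


def embed(label, is_denoising):
    """Look up a label row and append a single indicator value."""
    return label_enc[label] + [1.0 if is_denoising else 0.0]


noised_query = embed(17, is_denoising=True)                # arbitrary class label
matching_query = embed(num_classes, is_denoising=False)    # the extra row
print(len(label_enc), len(noised_query), len(matching_query))
```

Whether class_embed needs a background output depends on the classification loss in use, which this sketch does not cover.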
Hi, authors,
Thank you for open-sourcing your fantastic project.
I was very impressed by your successive projects DN-DETR and DINO,
so I have merged the DINO components into this earlier Deformable-DETR-based DN-DETR, which is slightly different from the official DINO.
Would you, by any chance, be interested in merging DINO into this DN-DETR?
If so, please let me know and I will prepare the code for sharing.
Because you already have your own official DINO repo, maybe you don't want to mix DN-DETR with another DINO code base.
That's OK, and in that case I will consider another way to release my implementation.
Thanks.
First, thanks for your excellent work, but I have a question.
Does Figure 3b in the paper contain the denoising part, or is it omitted from this figure?
I read your previous answer in another issue: the class label embedding is the noised label, right? Does it also include the noised box?
When will the open source code be released?
Hi,
First of all, thanks for uploading your code.
Please tell me which parameters I need to run your code with in order to reproduce the results from the paper.
I trained dn_dab_detr for 12 epochs on COCO 2017, but the resulting AP is 0. Could someone tell me where the problem is?
Here are the parameters and the result:
detr --coco_path ../datasets/coco2017 --use_dn --amp --dilation
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.003
Training time 6 days, 0:10:43
Now time: 2022-12-08 12:12:42.811224
Thanks for your excellent work! I have two questions about the label embedding:
Thanks for your excellent work.
Could you give more details on how Known Labels Detection is implemented? How do you make the decoder output all boxes of a specific class c using only the label embedding of class c?
The number of ground-truth boxes differs between images, so the number of noised queries cannot be the same for all images in a batch. How do you solve this problem?
Thanks for this amazing work! I have some questions about adding DN to a vanilla-DETR-like model.
Could you explain in more detail how I can use DN with a vanilla-DETR-like algorithm?
Because vanilla DETR's object queries are not anchor-like, I don't know how to turn a noised GT (dim=a) into an object query (dim=b, with b != a).
Would a learnable nn.Linear or some other operator work?
Looking forward to your reply!
How can I solve the following problem, which occurs when I use the prediction script?
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper__index_select)
Hello, thanks for your wonderful work!
When I finish training and get log.txt, I want to visualize it using plot_logs, as follows:
But I get an ERROR on this line:
https://github.com/IDEA-opensource/DN-DETR/blob/f41c276fe0af61a8acfbd32dfdde5d00291b3cf9/util/plot_utils.py#L65
Traceback (most recent call last):
File "H:/yjs/code/DN-DETR-main/tmp.py", line 53, in <module>
fig, axs = plot_logs(log_path)
File "H:\yjs\code\DN-DETR-main\util\plot_utils.py", line 65, in plot_logs
df.interpolate().ewm(com=ewm_col).mean().plot(
File "H:\Anaconda\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "H:\Anaconda\lib\site-packages\pandas\core\frame.py", line 10712, in interpolate
return super().interpolate(
File "H:\Anaconda\lib\site-packages\pandas\core\generic.py", line 6899, in interpolate
new_data = obj._mgr.interpolate(
File "H:\Anaconda\lib\site-packages\pandas\core\internals\managers.py", line 377, in interpolate
return self.apply("interpolate", **kwargs)
File "H:\Anaconda\lib\site-packages\pandas\core\internals\managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "H:\Anaconda\lib\site-packages\pandas\core\internals\blocks.py", line 1369, in interpolate
new_values = values.fillna(value=fill_value, method=method, limit=limit)
File "H:\Anaconda\lib\site-packages\pandas\core\arrays\_mixins.py", line 218, in fillna
value, method = validate_fillna_kwargs(
File "H:\Anaconda\lib\site-packages\pandas\util\_validators.py", line 372, in validate_fillna_kwargs
method = clean_fill_method(method)
File "H:\Anaconda\lib\site-packages\pandas\core\missing.py", line 120, in clean_fill_method
raise ValueError(f"Invalid fill method. Expecting {expecting}. Got {method}")
ValueError: Invalid fill method. Expecting pad (ffill) or backfill (bfill). Got linear
My pandas version is 1.3.5.
I don't know whether I am using it the wrong way or whether this is a bug in pandas. How can I fix it?
Where is the reconstruction loss? I cannot find it in DABDETR.py. Thanks.
It looks like you should call zero_grad after scaler.update().
6 | DN-DAB-Deformable-DETR-R50-v24 | R50 | 49.5 (48.4 in 24 epochs) | Google Drive / BaiDu | Optimized implementation with deformable attention in both encoder and decoder. See DAB-DETR for more details. |
I downloaded checkpoint0049.pth from this URL. Judging by its name, it seems to be a model trained for 50 epochs, but I got the results below when testing:
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.484
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.665
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.526
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.301
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.517
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.639
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.361
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.590
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.626
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.436
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.673
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.780
Is this normal?
I get the following error when using the inference.py code: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0!
When I debug, I see that the input is on device 'cuda:0' while the weight is on device cpu.
I look forward to your answer at your convenience. Thank you very much!
Glad to see this great work, and I am excitedly awaiting the release of the code, but I still have a concern:
Why call it DN-DETR rather than DN-DAB-DETR?
Since the main method of the paper is built upon DAB-DETR and the SOTA numbers were achieved with DAB-DETR, I guess it should be called DN-DAB-DETR. The name DN-DETR sounds as if the work were based on vanilla DETR without changing the original Transformer decoder. I saw that your experiments based on Deformable-DETR are called DN-Deformable-DETR, which makes the name DN-DETR even more misleading.
The paper also says the code will be organized as a plug-in module that can be applied to any DETR-like model, including vanilla DETR, so what should it be called when the denoising method is applied to vanilla DETR? The result for that case also does not seem to appear in the results table.
Many thanks
If I use my own dataset, what do I need to change? The AP I get with my own dataset is 0.0.
Thanks for this amazing work!
The paper "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising" states that DN can be added to Anchor DETR, but I could not find the relevant code in this project.
Looking forward to your reply!
Hi! Thanks for your excellent work. I'm wondering how to measure the FLOPs of the DN-DETR model.
I can't easily use the DETR script below because of the dn_components.
facebookresearch/detr#110
Could you please share your Python script?
I think L105 should be
diff[:, :2] = known_bbox_expand[:, :2] / 2
Am I correct?
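For reference, the paper describes the centre shift as bounded by half the box width and height, so that the noised centre stays inside the original box. If that is the intent, the shift for (cx, cy) would scale with (w, h) / 2 rather than with (cx, cy) / 2. The sketch below implements that description for a single box in pure Python. The function name, the uniform sampling and the clamping to [0, 1] are assumptions made for illustration, not the code in dn_components.py.

```python
import random


def noise_box(box, box_noise_scale, rng=random):
    """Jitter one (cx, cy, w, h) box whose coordinates are normalized to [0, 1]."""
    cx, cy, w, h = box
    # Centre shift: at most box_noise_scale * (w / 2, h / 2).
    new_cx = cx + rng.uniform(-1.0, 1.0) * (w / 2) * box_noise_scale
    new_cy = cy + rng.uniform(-1.0, 1.0) * (h / 2) * box_noise_scale
    # Size jitter: at most box_noise_scale * (w, h).
    new_w = w + rng.uniform(-1.0, 1.0) * w * box_noise_scale
    new_h = h + rng.uniform(-1.0, 1.0) * h * box_noise_scale
    # Keep every coordinate inside the normalized range.
    return tuple(min(max(v, 0.0), 1.0) for v in (new_cx, new_cy, new_w, new_h))


rng = random.Random(0)
print(noise_box((0.5, 0.5, 0.2, 0.4), box_noise_scale=0.4, rng=rng))
```

With box_noise_scale below 1, the shifted centre cannot leave the original box, which would not hold if the shift scaled with the centre coordinates themselves.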
def collate_fn(batch):
    # import ipdb; ipdb.set_trace()
    batch = list(zip(*batch))
    batch[0] = nested_tensor_from_tensor_list(batch[0])
    return tuple(batch)
Here I see that nested_tensor_from_tensor_list pads batch[0], i.e. the training images of each batch, to the largest size so that all images in a batch have the same size. But don't the boxes need to be corrected as well? It seems that the cx, cy, w, h of the boxes still use the image coordinates from before the padding.
Thank you for your excellent work! I wonder how and when to use --drop_lr_now?
Thank you!
Hi! I've run into a problem.
I am training on my own dataset with a pre-trained model:
--pretrain_model_path checkpoint.pth
This results in the following error:
Traceback (most recent call last):
File "C:/Users/20825/Desktop/detr_code/DN-DETR-main/main.py", line 426, in <module>
main(args)
File "C:/Users/20825/Desktop/detr_code/DN-DETR-main/main.py", line 352, in main
train_stats = train_one_epoch(
File "C:\Users\20825\Desktop\detr_code\DN-DETR-main\engine.py", line 52, in train_one_epoch
outputs = model(samples)
File "C:\anaconda\envs\Detr\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "C:\Users\20825\Desktop\detr_code\DN-DETR-main\models\DN_DAB_DETR\DABDETR.py", line 176, in forward
prepare_for_dn(dn_args, embedweight, src.size(0), self.training, self.num_queries, self.num_classes,
File "C:\Users\20825\Desktop\detr_code\DN-DETR-main\models\DN_DAB_DETR\dn_components.py", line 61, in prepare_for_dn
targets, scalar, label_noise_scale, box_noise_scale, num_patterns = dn_args
TypeError: cannot unpack non-iterable NoneType object
Thank you for your answer
It looks like you forgot optimizer.zero_grad().
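To illustrate why the missing call matters: gradients are accumulated into the stored gradient on every backward pass, so without a reset each step uses the sum of all earlier gradients. The toy below is a pure-Python stand-in for that behaviour on a single scalar. It is not the repository's training loop; the function name and the objective f(w) = w² are made up for the example.

```python
def train(steps, zero_grad, lr=0.1):
    """Minimise f(w) = w**2 with plain gradient descent on one scalar."""
    w, grad = 1.0, 0.0
    for _ in range(steps):
        if zero_grad:
            grad = 0.0       # plays the role of optimizer.zero_grad()
        grad += 2.0 * w      # backward() adds to the stored gradient
        w -= lr * grad       # optimizer.step()
    return w


print(train(20, zero_grad=True))    # shrinks steadily towards 0
print(train(20, zero_grad=False))   # stale gradients keep w from settling
```

In this toy the run with the reset converges, while the run without it keeps oscillating because old gradients are never cleared.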
Traceback (most recent call last):
File "main.py", line 427, in <module>
main(args)
File "main.py", line 355, in main
args.clip_max_norm, wo_class_error=wo_class_error, lr_scheduler=lr_scheduler, args=args, logger=(logger if args.save_log else None))
File "/home/cxq/dp_work/objectdetection/DN-DETR/engine.py", line 50, in train_one_epoch
loss_dict = criterion(outputs, targets, mask_dict)
File "/home/cxq/.conda/envs/torch10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxq/dp_work/objectdetection/DN-DETR/models/DN_DAB_DETR/DABDETR.py", line 371, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/home/cxq/.conda/envs/torch10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/cxq/.conda/envs/torch10/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/cxq/dp_work/objectdetection/DN-DETR/models/DN_DAB_DETR/matcher.py", line 83, in forward
cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
File "/home/cxq/dp_work/objectdetection/DN-DETR/util/box_ops.py", line 52, in generalized_box_iou
assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
I'm not sure why this error was raised during training. Training had reached Epoch: [38] [12440/14785] eta: 0:17:40
Excuse me, the paper (https://arxiv.org/abs/2203.01305) shows that DN training can also be used with Faster R-CNN. I wonder how to implement it. Thanks!
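One possible cause, stated as an assumption and not a confirmed diagnosis of this run, is that the predicted boxes contained NaN values (for example after the loss diverged). Any comparison involving NaN is false, so the `>=` check fails even though no box is visibly inverted. The pure-Python sketch below mirrors the check on single boxes; `is_valid_xyxy` is a hypothetical helper, and `box_cxcywh_to_xyxy` is re-implemented here for one box instead of being imported from util/box_ops.py.

```python
import math


def box_cxcywh_to_xyxy(box):
    """Convert one (cx, cy, w, h) box to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def is_valid_xyxy(box):
    """Mirror of the failing check: x1 >= x0 and y1 >= y0."""
    x0, y0, x1, y1 = box
    return x1 >= x0 and y1 >= y0


good = box_cxcywh_to_xyxy((0.5, 0.5, 0.2, 0.4))
bad = box_cxcywh_to_xyxy((float("nan"), 0.5, 0.2, 0.4))

print(is_valid_xyxy(good), is_valid_xyxy(bad))
print(any(math.isnan(v) for v in bad))
```

Checking the model outputs and the loss for NaN just before the matcher is called would confirm or rule out this explanation.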
D:\anaconda3.9\envs\zj\python.exe F:/1chen/DETR/jin/dn/DN-DETR/main.py
Not using distributed mode
[08/13 08:31:54.869]: git:
sha: a59a5de, status: has uncommited changes, branch: main
[08/13 08:31:54.869]: Command: F:/1chen/DETR/jin/dn/DN-DETR/main.py
[08/13 08:31:54.869]: Full config saved to log/r50\config.json
[08/13 08:31:54.869]: world size: 1
[08/13 08:31:54.869]: rank: 0
[08/13 08:31:54.869]: local_rank: 0
[08/13 08:31:54.870]: args: Namespace(amp=False, aux_loss=True, backbone='resnet50', backbone_freeze_keywords=None, batch_norm_type='FrozenBatchNorm2d', batch_size=2, bbox_loss_coef=5, box_noise_scale=0.4, clip_max_norm=0.1, cls_loss_coef=1, coco_panoptic_path=None, coco_path='COCODIR', dataset_file='coco', debug=False, dec_layers=6, dec_n_points=4, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, drop_lr_now=False, dropout=0.0, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=10, eval=False, find_unused_params=False, finetune_ignore=None, fix_size=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, label_noise_scale=0.2, local_rank=0, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, modelname='dn_dab_deformable_detr', nheads=8, note='', num_feature_levels=4, num_patterns=0, num_queries=300, num_select=300, num_workers=10, output_dir='log/r50', pe_temperatureH=20, pe_temperatureW=20, position_embedding='sine', pre_norm=False, pretrain_model_path=None, random_refpoints_xy=False, rank=0, remove_difficult=False, resume='', return_interm_layers=False, save_checkpoint_interval=10, save_log=False, save_results=False, scalar=5, seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, transformer_activation='prelu', two_stage=False, use_dn=False, weight_decay=0.0001, world_size=1)
Namespace(amp=False, aux_loss=True, backbone='resnet50', backbone_freeze_keywords=None, batch_norm_type='FrozenBatchNorm2d', batch_size=2, bbox_loss_coef=5, box_noise_scale=0.4, clip_max_norm=0.1, cls_loss_coef=1, coco_panoptic_path=None, coco_path='COCODIR', dataset_file='coco', debug=False, dec_layers=6, dec_n_points=4, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, drop_lr_now=False, dropout=0.0, enc_layers=6, enc_n_points=4, eos_coef=0.1, epochs=10, eval=False, find_unused_params=False, finetune_ignore=None, fix_size=False, focal_alpha=0.25, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, label_noise_scale=0.2, local_rank=0, lr=0.0001, lr_backbone=1e-05, lr_drop=40, mask_loss_coef=1, masks=False, modelname='dn_dab_deformable_detr', nheads=8, note='', num_feature_levels=4, num_patterns=0, num_queries=300, num_select=300, num_workers=10, output_dir='log/r50', pe_temperatureH=20, pe_temperatureW=20, position_embedding='sine', pre_norm=False, pretrain_model_path=None, random_refpoints_xy=False, rank=0, remove_difficult=False, resume='', return_interm_layers=False, save_checkpoint_interval=10, save_log=False, save_results=False, scalar=5, seed=42, set_cost_bbox=5, set_cost_class=2, set_cost_giou=2, start_epoch=0, transformer_activation='prelu', two_stage=False, use_dn=False, weight_decay=0.0001, world_size=1)
[08/13 08:31:55.431]: number of params:47206754
[08/13 08:31:55.433]: params:
{
"transformer.level_embed": 1024,
"transformer.encoder.layers.0.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.0.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.0.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.0.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.0.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.0.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.0.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.0.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.0.norm1.weight": 256,
"transformer.encoder.layers.0.norm1.bias": 256,
"transformer.encoder.layers.0.linear1.weight": 524288,
"transformer.encoder.layers.0.linear1.bias": 2048,
"transformer.encoder.layers.0.linear2.weight": 524288,
"transformer.encoder.layers.0.linear2.bias": 256,
"transformer.encoder.layers.0.norm2.weight": 256,
"transformer.encoder.layers.0.norm2.bias": 256,
"transformer.encoder.layers.1.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.1.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.1.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.1.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.1.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.1.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.1.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.1.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.1.norm1.weight": 256,
"transformer.encoder.layers.1.norm1.bias": 256,
"transformer.encoder.layers.1.linear1.weight": 524288,
"transformer.encoder.layers.1.linear1.bias": 2048,
"transformer.encoder.layers.1.linear2.weight": 524288,
"transformer.encoder.layers.1.linear2.bias": 256,
"transformer.encoder.layers.1.norm2.weight": 256,
"transformer.encoder.layers.1.norm2.bias": 256,
"transformer.encoder.layers.2.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.2.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.2.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.2.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.2.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.2.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.2.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.2.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.2.norm1.weight": 256,
"transformer.encoder.layers.2.norm1.bias": 256,
"transformer.encoder.layers.2.linear1.weight": 524288,
"transformer.encoder.layers.2.linear1.bias": 2048,
"transformer.encoder.layers.2.linear2.weight": 524288,
"transformer.encoder.layers.2.linear2.bias": 256,
"transformer.encoder.layers.2.norm2.weight": 256,
"transformer.encoder.layers.2.norm2.bias": 256,
"transformer.encoder.layers.3.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.3.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.3.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.3.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.3.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.3.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.3.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.3.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.3.norm1.weight": 256,
"transformer.encoder.layers.3.norm1.bias": 256,
"transformer.encoder.layers.3.linear1.weight": 524288,
"transformer.encoder.layers.3.linear1.bias": 2048,
"transformer.encoder.layers.3.linear2.weight": 524288,
"transformer.encoder.layers.3.linear2.bias": 256,
"transformer.encoder.layers.3.norm2.weight": 256,
"transformer.encoder.layers.3.norm2.bias": 256,
"transformer.encoder.layers.4.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.4.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.4.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.4.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.4.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.4.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.4.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.4.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.4.norm1.weight": 256,
"transformer.encoder.layers.4.norm1.bias": 256,
"transformer.encoder.layers.4.linear1.weight": 524288,
"transformer.encoder.layers.4.linear1.bias": 2048,
"transformer.encoder.layers.4.linear2.weight": 524288,
"transformer.encoder.layers.4.linear2.bias": 256,
"transformer.encoder.layers.4.norm2.weight": 256,
"transformer.encoder.layers.4.norm2.bias": 256,
"transformer.encoder.layers.5.self_attn.sampling_offsets.weight": 65536,
"transformer.encoder.layers.5.self_attn.sampling_offsets.bias": 256,
"transformer.encoder.layers.5.self_attn.attention_weights.weight": 32768,
"transformer.encoder.layers.5.self_attn.attention_weights.bias": 128,
"transformer.encoder.layers.5.self_attn.value_proj.weight": 65536,
"transformer.encoder.layers.5.self_attn.value_proj.bias": 256,
"transformer.encoder.layers.5.self_attn.output_proj.weight": 65536,
"transformer.encoder.layers.5.self_attn.output_proj.bias": 256,
"transformer.encoder.layers.5.norm1.weight": 256,
"transformer.encoder.layers.5.norm1.bias": 256,
"transformer.encoder.layers.5.linear1.weight": 524288,
"transformer.encoder.layers.5.linear1.bias": 2048,
"transformer.encoder.layers.5.linear2.weight": 524288,
"transformer.encoder.layers.5.linear2.bias": 256,
"transformer.encoder.layers.5.norm2.weight": 256,
"transformer.encoder.layers.5.norm2.bias": 256,
"transformer.decoder.layers.0.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.0.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.0.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.0.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.0.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.0.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.0.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.0.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.0.norm1.weight": 256,
"transformer.decoder.layers.0.norm1.bias": 256,
"transformer.decoder.layers.0.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.0.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.0.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.0.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.0.norm2.weight": 256,
"transformer.decoder.layers.0.norm2.bias": 256,
"transformer.decoder.layers.0.linear1.weight": 524288,
"transformer.decoder.layers.0.linear1.bias": 2048,
"transformer.decoder.layers.0.linear2.weight": 524288,
"transformer.decoder.layers.0.linear2.bias": 256,
"transformer.decoder.layers.0.norm3.weight": 256,
"transformer.decoder.layers.0.norm3.bias": 256,
"transformer.decoder.layers.1.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.1.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.1.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.1.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.1.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.1.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.1.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.1.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.1.norm1.weight": 256,
"transformer.decoder.layers.1.norm1.bias": 256,
"transformer.decoder.layers.1.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.1.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.1.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.1.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.1.norm2.weight": 256,
"transformer.decoder.layers.1.norm2.bias": 256,
"transformer.decoder.layers.1.linear1.weight": 524288,
"transformer.decoder.layers.1.linear1.bias": 2048,
"transformer.decoder.layers.1.linear2.weight": 524288,
"transformer.decoder.layers.1.linear2.bias": 256,
"transformer.decoder.layers.1.norm3.weight": 256,
"transformer.decoder.layers.1.norm3.bias": 256,
"transformer.decoder.layers.2.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.2.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.2.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.2.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.2.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.2.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.2.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.2.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.2.norm1.weight": 256,
"transformer.decoder.layers.2.norm1.bias": 256,
"transformer.decoder.layers.2.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.2.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.2.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.2.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.2.norm2.weight": 256,
"transformer.decoder.layers.2.norm2.bias": 256,
"transformer.decoder.layers.2.linear1.weight": 524288,
"transformer.decoder.layers.2.linear1.bias": 2048,
"transformer.decoder.layers.2.linear2.weight": 524288,
"transformer.decoder.layers.2.linear2.bias": 256,
"transformer.decoder.layers.2.norm3.weight": 256,
"transformer.decoder.layers.2.norm3.bias": 256,
"transformer.decoder.layers.3.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.3.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.3.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.3.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.3.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.3.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.3.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.3.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.3.norm1.weight": 256,
"transformer.decoder.layers.3.norm1.bias": 256,
"transformer.decoder.layers.3.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.3.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.3.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.3.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.3.norm2.weight": 256,
"transformer.decoder.layers.3.norm2.bias": 256,
"transformer.decoder.layers.3.linear1.weight": 524288,
"transformer.decoder.layers.3.linear1.bias": 2048,
"transformer.decoder.layers.3.linear2.weight": 524288,
"transformer.decoder.layers.3.linear2.bias": 256,
"transformer.decoder.layers.3.norm3.weight": 256,
"transformer.decoder.layers.3.norm3.bias": 256,
"transformer.decoder.layers.4.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.4.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.4.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.4.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.4.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.4.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.4.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.4.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.4.norm1.weight": 256,
"transformer.decoder.layers.4.norm1.bias": 256,
"transformer.decoder.layers.4.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.4.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.4.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.4.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.4.norm2.weight": 256,
"transformer.decoder.layers.4.norm2.bias": 256,
"transformer.decoder.layers.4.linear1.weight": 524288,
"transformer.decoder.layers.4.linear1.bias": 2048,
"transformer.decoder.layers.4.linear2.weight": 524288,
"transformer.decoder.layers.4.linear2.bias": 256,
"transformer.decoder.layers.4.norm3.weight": 256,
"transformer.decoder.layers.4.norm3.bias": 256,
"transformer.decoder.layers.5.cross_attn.sampling_offsets.weight": 65536,
"transformer.decoder.layers.5.cross_attn.sampling_offsets.bias": 256,
"transformer.decoder.layers.5.cross_attn.attention_weights.weight": 32768,
"transformer.decoder.layers.5.cross_attn.attention_weights.bias": 128,
"transformer.decoder.layers.5.cross_attn.value_proj.weight": 65536,
"transformer.decoder.layers.5.cross_attn.value_proj.bias": 256,
"transformer.decoder.layers.5.cross_attn.output_proj.weight": 65536,
"transformer.decoder.layers.5.cross_attn.output_proj.bias": 256,
"transformer.decoder.layers.5.norm1.weight": 256,
"transformer.decoder.layers.5.norm1.bias": 256,
"transformer.decoder.layers.5.self_attn.in_proj_weight": 196608,
"transformer.decoder.layers.5.self_attn.in_proj_bias": 768,
"transformer.decoder.layers.5.self_attn.out_proj.weight": 65536,
"transformer.decoder.layers.5.self_attn.out_proj.bias": 256,
"transformer.decoder.layers.5.norm2.weight": 256,
"transformer.decoder.layers.5.norm2.bias": 256,
"transformer.decoder.layers.5.linear1.weight": 524288,
"transformer.decoder.layers.5.linear1.bias": 2048,
"transformer.decoder.layers.5.linear2.weight": 524288,
"transformer.decoder.layers.5.linear2.bias": 256,
"transformer.decoder.layers.5.norm3.weight": 256,
"transformer.decoder.layers.5.norm3.bias": 256,
"transformer.decoder.query_scale.layers.0.weight": 65536,
"transformer.decoder.query_scale.layers.0.bias": 256,
"transformer.decoder.query_scale.layers.1.weight": 65536,
"transformer.decoder.query_scale.layers.1.bias": 256,
"transformer.decoder.ref_point_head.layers.0.weight": 131072,
"transformer.decoder.ref_point_head.layers.0.bias": 256,
"transformer.decoder.ref_point_head.layers.1.weight": 65536,
"transformer.decoder.ref_point_head.layers.1.bias": 256,
"transformer.decoder.bbox_embed.0.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.0.layers.0.bias": 256,
"transformer.decoder.bbox_embed.0.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.0.layers.1.bias": 256,
"transformer.decoder.bbox_embed.0.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.0.layers.2.bias": 4,
"transformer.decoder.bbox_embed.1.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.1.layers.0.bias": 256,
"transformer.decoder.bbox_embed.1.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.1.layers.1.bias": 256,
"transformer.decoder.bbox_embed.1.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.1.layers.2.bias": 4,
"transformer.decoder.bbox_embed.2.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.2.layers.0.bias": 256,
"transformer.decoder.bbox_embed.2.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.2.layers.1.bias": 256,
"transformer.decoder.bbox_embed.2.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.2.layers.2.bias": 4,
"transformer.decoder.bbox_embed.3.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.3.layers.0.bias": 256,
"transformer.decoder.bbox_embed.3.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.3.layers.1.bias": 256,
"transformer.decoder.bbox_embed.3.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.3.layers.2.bias": 4,
"transformer.decoder.bbox_embed.4.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.4.layers.0.bias": 256,
"transformer.decoder.bbox_embed.4.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.4.layers.1.bias": 256,
"transformer.decoder.bbox_embed.4.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.4.layers.2.bias": 4,
"transformer.decoder.bbox_embed.5.layers.0.weight": 65536,
"transformer.decoder.bbox_embed.5.layers.0.bias": 256,
"transformer.decoder.bbox_embed.5.layers.1.weight": 65536,
"transformer.decoder.bbox_embed.5.layers.1.bias": 256,
"transformer.decoder.bbox_embed.5.layers.2.weight": 1024,
"transformer.decoder.bbox_embed.5.layers.2.bias": 4,
"class_embed.0.weight": 23296,
"class_embed.0.bias": 91,
"class_embed.1.weight": 23296,
"class_embed.1.bias": 91,
"class_embed.2.weight": 23296,
"class_embed.2.bias": 91,
"class_embed.3.weight": 23296,
"class_embed.3.bias": 91,
"class_embed.4.weight": 23296,
"class_embed.4.bias": 91,
"class_embed.5.weight": 23296,
"class_embed.5.bias": 91,
"label_enc.weight": 23460,
"tgt_embed.weight": 76500,
"refpoint_embed.weight": 1200,
"input_proj.0.0.weight": 131072,
"input_proj.0.0.bias": 256,
"input_proj.0.1.weight": 256,
"input_proj.0.1.bias": 256,
"input_proj.1.0.weight": 262144,
"input_proj.1.0.bias": 256,
"input_proj.1.1.weight": 256,
"input_proj.1.1.bias": 256,
"input_proj.2.0.weight": 524288,
"input_proj.2.0.bias": 256,
"input_proj.2.1.weight": 256,
"input_proj.2.1.bias": 256,
"input_proj.3.0.weight": 4718592,
"input_proj.3.0.bias": 256,
"input_proj.3.1.weight": 256,
"input_proj.3.1.bias": 256,
"backbone.0.body.layer2.0.conv1.weight": 32768,
"backbone.0.body.layer2.0.conv2.weight": 147456,
"backbone.0.body.layer2.0.conv3.weight": 65536,
"backbone.0.body.layer2.0.downsample.0.weight": 131072,
"backbone.0.body.layer2.1.conv1.weight": 65536,
"backbone.0.body.layer2.1.conv2.weight": 147456,
"backbone.0.body.layer2.1.conv3.weight": 65536,
"backbone.0.body.layer2.2.conv1.weight": 65536,
"backbone.0.body.layer2.2.conv2.weight": 147456,
"backbone.0.body.layer2.2.conv3.weight": 65536,
"backbone.0.body.layer2.3.conv1.weight": 65536,
"backbone.0.body.layer2.3.conv2.weight": 147456,
"backbone.0.body.layer2.3.conv3.weight": 65536,
"backbone.0.body.layer3.0.conv1.weight": 131072,
"backbone.0.body.layer3.0.conv2.weight": 589824,
"backbone.0.body.layer3.0.conv3.weight": 262144,
"backbone.0.body.layer3.0.downsample.0.weight": 524288,
"backbone.0.body.layer3.1.conv1.weight": 262144,
"backbone.0.body.layer3.1.conv2.weight": 589824,
"backbone.0.body.layer3.1.conv3.weight": 262144,
"backbone.0.body.layer3.2.conv1.weight": 262144,
"backbone.0.body.layer3.2.conv2.weight": 589824,
"backbone.0.body.layer3.2.conv3.weight": 262144,
"backbone.0.body.layer3.3.conv1.weight": 262144,
"backbone.0.body.layer3.3.conv2.weight": 589824,
"backbone.0.body.layer3.3.conv3.weight": 262144,
"backbone.0.body.layer3.4.conv1.weight": 262144,
"backbone.0.body.layer3.4.conv2.weight": 589824,
"backbone.0.body.layer3.4.conv3.weight": 262144,
"backbone.0.body.layer3.5.conv1.weight": 262144,
"backbone.0.body.layer3.5.conv2.weight": 589824,
"backbone.0.body.layer3.5.conv3.weight": 262144,
"backbone.0.body.layer4.0.conv1.weight": 524288,
"backbone.0.body.layer4.0.conv2.weight": 2359296,
"backbone.0.body.layer4.0.conv3.weight": 1048576,
"backbone.0.body.layer4.0.downsample.0.weight": 2097152,
"backbone.0.body.layer4.1.conv1.weight": 1048576,
"backbone.0.body.layer4.1.conv2.weight": 2359296,
"backbone.0.body.layer4.1.conv3.weight": 1048576,
"backbone.0.body.layer4.2.conv1.weight": 1048576,
"backbone.0.body.layer4.2.conv2.weight": 2359296,
"backbone.0.body.layer4.2.conv3.weight": 1048576
}
loading annotations into memory...
Done (t=0.03s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Start training
F:\1chen\DETR\jin\dn\DN-DETR\models\dn_dab_deformable_detr\position_encoding.py:53: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
Traceback (most recent call last):
File "F:/1chen/DETR/jin/dn/DN-DETR/main.py", line 426, in <module>
main(args)
File "F:/1chen/DETR/jin/dn/DN-DETR/main.py", line 352, in main
train_stats = train_one_epoch(
File "F:\1chen\DETR\jin\dn\DN-DETR\engine.py", line 52, in train_one_epoch
outputs = model(samples)
File "D:\anaconda3.9\envs\zj\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "F:\1chen\DETR\jin\dn\DN-DETR\models\dn_dab_deformable_detr\dab_deformable_detr.py", line 206, in forward
prepare_for_dn(dn_args, tgt_all_embed, refanchor, src.size(0), self.training, self.num_queries, self.num_classes,
File "F:\1chen\DETR\jin\dn\DN-DETR\models\dn_dab_deformable_detr\dn_components.py", line 61, in prepare_for_dn
targets, scalar, label_noise_scale, box_noise_scale, num_patterns = dn_args
TypeError: cannot unpack non-iterable NoneType object
Process finished with exit code 1
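The TypeError above is the message Python raises when a tuple-unpacking statement receives None. The traceback shows engine.py calling model(samples) with no second argument, so dn_args plausibly reaches prepare_for_dn as None. A minimal sketch of that failure mode follows. The function names are hypothetical stand-ins that mirror only the unpacking line from the traceback, not the repository's code, and the default of dn_args=None is an assumption.

```python
def prepare_for_dn_sketch(dn_args):
    # Hypothetical stand-in: mirrors only the unpacking line shown in the traceback.
    targets, scalar, label_noise_scale, box_noise_scale, num_patterns = dn_args
    return scalar, num_patterns


def forward_sketch(samples, dn_args=None):
    # Assumption: dn_args defaults to None when the caller omits it.
    return prepare_for_dn_sketch(dn_args)


def checked_forward_sketch(samples, dn_args=None):
    # Failing early gives a clearer message than the unpacking error.
    if dn_args is None:
        raise ValueError("dn_args must be a 5-tuple; got None")
    return prepare_for_dn_sketch(dn_args)


if __name__ == "__main__":
    try:
        forward_sketch("images")
    except TypeError as err:
        print("reproduced:", err)
    # Passing the 5-tuple explicitly avoids the error.
    print(forward_sketch("images", dn_args=([], 5, 0.2, 0.4, 0)))
```

If this is the cause, the fix is on the caller's side: the training loop has to pass the denoising arguments to the model rather than calling it with samples alone.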
In the forward part of dab_deformable_detr.py:
if self.two_stage:
    assert NotImplementedError
elif self.use_dab:
    if self.num_patterns == 0:
        tgt_all_embed = tgt_embed = self.tgt_embed.weight  # nq, 256
        refanchor = self.refpoint_embed.weight  # nq, 4
        # query_embeds = torch.cat((tgt_embed, refanchor), dim=1)
    else:
        # multi patterns
        tgt_embed = self.tgt_embed.weight  # nq, 256
        pat_embed = self.patterns_embed.weight  # num_pat, 256
        tgt_embed = tgt_embed.repeat(self.num_patterns, 1)  # nq*num_pat, 256
        pat_embed = pat_embed[:, None, :].repeat(1, self.num_queries, 1).flatten(0, 1)  # nq*num_pat, 256
        tgt_all_embed = tgt_embed + pat_embed
        refanchor = self.refpoint_embed.weight.repeat(self.num_patterns, 1)  # nq*num_pat, 4
        # query_embeds = torch.cat((tgt_all_embed, refanchor), dim=1)
else:
    assert NotImplementedError
Isn't tgt_embed of shape (nq, hidden_dim - 1)? How can tgt_embed be added to pat_embed?
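The parameter counts in the dn_dab_deformable_detr log earlier on this page can be used to check this shape question. The sketch below only does arithmetic on three numbers copied from that log. It assumes that refpoint_embed has 4 columns (as the `# nq, 4` comment says) and that hidden_dim is 256 (the size of the norm weights in the log). The helper `trailing_dims_broadcastable` is a hypothetical name that restates the usual broadcasting rule for the last dimension.

```python
# Element counts copied from the parameter dump in the log.
TGT_EMBED_NUMEL = 76500
REFPOINT_EMBED_NUMEL = 1200
LABEL_ENC_NUMEL = 23460

HIDDEN_DIM = 256  # assumption: matches the 256-element norm weights in the log

# refpoint_embed is (nq, 4), so nq follows from its element count.
num_queries = REFPOINT_EMBED_NUMEL // 4
# tgt_embed is (nq, width); recover the width from its element count.
tgt_width = TGT_EMBED_NUMEL // num_queries
# label_enc shares that width; recover its number of rows.
label_rows = LABEL_ENC_NUMEL // tgt_width


def trailing_dims_broadcastable(a: int, b: int) -> bool:
    # Elementwise addition needs equal trailing sizes, or one of them equal to 1.
    return a == b or a == 1 or b == 1


if __name__ == "__main__":
    print("nq =", num_queries)
    print("tgt_embed width =", tgt_width, "(hidden_dim - 1 =", HIDDEN_DIM - 1, ")")
    print("label_enc rows =", label_rows)
    print("255 + 256 addable:", trailing_dims_broadcastable(tgt_width, HIDDEN_DIM))
```

The counts are consistent with the reading in the question: tgt_embed comes out 255 wide, one less than hidden_dim. If patterns_embed is 256 wide, as the quoted comment states, the two would not add elementwise. Whether the multi-pattern branch is ever reached in this configuration is not shown in the log.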
Hi, I have a question: what is the role of "Inverse Sigmoid" in your code? I notice that it is used in many places.
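For context, the inverse sigmoid (logit) maps a value in (0, 1) back to the unbounded real line. In DETR-style detectors it is commonly used so that a predicted offset can be added to a normalized box coordinate in unbounded space, after which a sigmoid brings the result back into (0, 1). The sketch below is a pure-Python illustration under that assumption. It is not the repository's implementation, and the clamp value `eps` is an arbitrary choice.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def inverse_sigmoid(p: float, eps: float = 1e-5) -> float:
    # Clamp to [0, 1], then keep numerator and denominator away from zero
    # so the logarithm stays finite at the boundaries.
    p = min(max(p, 0.0), 1.0)
    numerator = max(p, eps)
    denominator = max(1.0 - p, eps)
    return math.log(numerator / denominator)


def refine_coordinate(reference: float, delta: float) -> float:
    # Add the offset in unbounded space, then squash back into (0, 1).
    return sigmoid(inverse_sigmoid(reference) + delta)


if __name__ == "__main__":
    ref = 0.3
    print(sigmoid(inverse_sigmoid(ref)))      # round trip, close to 0.3
    print(refine_coordinate(ref, delta=0.5))  # shifted, still inside (0, 1)
```

Adding the offset directly to the normalized coordinate could push it outside [0, 1]; working in logit space keeps the refined coordinate inside the valid range for any finite offset.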