I'm trying to get my custom dataset working but I can't get past 8 or so images via ge

facebookresearch,detr

Comments (21)

alcinos commented on July 2, 2024 7

Hi @lessw2020, apologies for the confusion: the class IDs need to be remapped to [0, 6]. Basically, you want tgt_ids.max() < num_classes.

EDIT: to clarify, in our case for COCO there are 80 classes with labels in [0, 90], and for simplicity we don't do remapping, so we use num_classes=91 (so that we satisfy the inequality above). It doesn't matter that some ids will never be used (it's a slight waste of parameters, but negligible in this case). In your case it won't work though: you really don't want to have a softmax over 2.9M elements, so remapping is the way to go.
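As an illustration of the remapping described above, here is a minimal sketch in plain Python. It assumes COCO-style dicts with `id` and `category_id` keys; the helper names `build_label_map` and `remap_annotations` and the sample ids are hypothetical and are not part of the DETR codebase.

```python
def build_label_map(categories):
    """Map arbitrary category ids to contiguous ids 0..K-1, ordered by original id."""
    original_ids = sorted({cat["id"] for cat in categories})
    return {orig: new for new, orig in enumerate(original_ids)}


def remap_annotations(annotations, label_map):
    """Return copies of the annotations with category_id rewritten via label_map."""
    return [{**ann, "category_id": label_map[ann["category_id"]]} for ann in annotations]


# Hypothetical categories with sparse, very large ids.
categories = [{"id": 17, "name": "a"}, {"id": 2900000, "name": "b"}, {"id": 431, "name": "c"}]
annotations = [{"image_id": 1, "category_id": 2900000}, {"image_id": 1, "category_id": 17}]

label_map = build_label_map(categories)
remapped = remap_annotations(annotations, label_map)
num_classes = len(label_map)

# The inequality from the comment above: the largest label must be < num_classes.
assert max(ann["category_id"] for ann in remapped) < num_classes
```

Keeping `label_map` around (for example, saved next to the checkpoint) makes it possible to translate predictions back to the original ids at evaluation time.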

from detr.

fmassa commented on July 2, 2024 4

@lessw2020 from debug2.txt, the error comes from

cost_class = -out_prob[:, tgt_ids]

which indicates that your class probability tensor has fewer elements than the ground-truth indices require.

If you add a print(tgt_ids.max()) in your code, you'll see that it is larger than 6, which means that there might be an issue with your dataset (as you have more classes than you thought). I believe this is probably the issue that you are facing.

As an unrelated note, I noted that you are passing --no_aux_loss to the model -- note that our best results are obtained with aux_loss. The evaluation code doesn't need aux loss because it's just evaluation and it is slightly faster, but for training in general it's better to use aux_loss.

fmassa commented on July 2, 2024 2

@lessw2020

Let's break this down in two: The device-side assert and the degenerate boxes.

Device-side assert

I think in order to properly debug the RuntimeError: CUDA error: device-side assert triggered, you'll need to run your script with CUDA_LAUNCH_BLOCKING=1 python main.py, because CUDA calls in PyTorch are asynchronous and the error is otherwise reported at a later, unrelated line.
But as a rule of thumb, this generally comes from indexing a tensor out of bounds, for example when the number of outputs in the classifier is smaller than the number of classes, which gets triggered at CrossEntropy.
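When the script is started from inside an already-running Python process (a notebook, for instance) rather than from a shell, the same environment variable can be set from Python. The sketch below assumes it runs before anything in the process has touched CUDA; if CUDA was already initialized, the process likely needs a restart for the setting to take effect.

```python
import os

# Must be set before the first CUDA call in the process
# (safest: before importing torch at all).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and launch training only after the variable is set
```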

I see from your logs though that you changed num_queries in the code to 9, but the argparse results are not changed (it still prints 100) -- can you try instead changing it in the command-line? There might be other places in the code that you forgot to change 100 to 9.

Degenerate boxes

The second (full) log that you posted seems to also indicate that you have a device-side assert being triggered, even if it seems to be pointing to the "degenerate" boxes. I think this shows that the "degenerate boxes" assert is a red herring, and the error lies elsewhere.
My first guess: make sure that, if you changed num_classes in the code, you are using the same num_classes for the SetCriterion in

detr/models/detr.py

Line 330 in 7613beb

criterion = SetCriterion(num_classes, matcher=matcher, weight_dict=weight_dict,

This could explain why you are having the device-side asserts, as we use the num_classes from the Criterion to perform indexing

detr/models/detr.py

Lines 108 to 112 in 7613beb

        target_classes = torch.full(src_logits.shape[:2], self.num_classes,
                                    dtype=torch.int64, device=src_logits.device)
        target_classes[idx] = target_classes_o

        loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)
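To see why a mismatch between the criterion's num_classes and the classifier width triggers an out-of-range index, here is a plain-Python stand-in for the snippet above. It uses lists for a single image instead of tensors, and the helper names `build_target_classes` and `check_targets` are hypothetical.

```python
def build_target_classes(num_queries, num_classes, matched):
    """matched: dict of query index -> class label for queries matched to a box."""
    target = [num_classes] * num_queries  # every query defaults to "no object"
    for query_idx, label in matched.items():
        target[query_idx] = label
    return target


def check_targets(target, num_logits):
    """Cross-entropy needs every target index to lie in [0, num_logits)."""
    return all(0 <= t < num_logits for t in target)


# Consistent setup: 6 classes, classifier with 6 + 1 outputs.
target = build_target_classes(num_queries=5, num_classes=6, matched={1: 3, 4: 0})
consistent = check_targets(target, num_logits=7)

# Mismatch: criterion still built with 91, classifier has only 7 outputs.
stale_target = build_target_classes(num_queries=5, num_classes=91, matched={1: 3, 4: 0})
mismatched = check_targets(stale_target, num_logits=7)
```

In the mismatched case the "no object" fill value (91) is itself outside the 7 logits, so the failure does not depend on the dataset labels at all.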

alcinos commented on July 2, 2024 1

To complete what @fmassa said, the canonical place to change the number of queries is the command line arg --num_queries (you shouldn't have to change the code for that). For the number of classes, you have only one line to edit here: https://github.com/facebookresearch/detr/blob/master/models/detr.py#L296-L298

fmassa commented on July 2, 2024 1

@lessw2020 next step is to run the code with CUDA_LAUNCH_BLOCKING=1 python main.py, as it will show exactly where the issue is -- the current error in the assert is a red herring because it's the first point in the code that has a sync point (due to the assertion requiring a value on the CPU).

I still think that the most likely culprit is in the Criterion. Also, can you paste the rest of the error message that is displayed? The device-side assert from CUDA generally prints a lot of repeated messages, but they indicate in which kernel the assert happened, which is helpful for debugging.

lessw2020 commented on July 2, 2024 1

Hi @fmassa - completely understand about keeping the codebase as simple as possible.

I think just having some good documentation ideally with an example walk through for training a custom dataset would be more than sufficient b/c then best practices are distilled from the start to avoid various issues like this one in the first place, and perhaps including the tgt_ids.max() < num_classes as an assert in the matcher code (which is useful for all) should be plenty?

And yes I am dealing with mAP ==0 atm now that I can train :) Any tips on that appreciated and maybe that could be added as part of the documentation as well?
If the documentation is open-sourced for user contributions, I'd be happy to contribute as I can, since I expect to be working intensively with DETR for RL medical application to replace EfficientDet.
Regardless thanks again for all the help!

fmassa commented on July 2, 2024 1

> and perhaps including the tgt_ids.max() < num_classes as an assert in the matcher code (which is useful for all) should be plenty?

Yes, I agree that this assert would be a good thing to have, although it will incur a small runtime penalty during training, but it should be fine I think.
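A minimal sketch of what such a check could look like, written in plain Python. It assumes each target is a dict with a "labels" entry, as in the DETR targets, but uses lists instead of tensors; the function name `check_label_range` is hypothetical and this is not the check that was added to the repository.

```python
def check_label_range(targets, num_classes):
    """Raise a readable error if any label falls outside [0, num_classes)."""
    for image_idx, target in enumerate(targets):
        labels = list(target["labels"])
        if not labels:
            continue  # images without annotations are fine
        lowest, highest = min(labels), max(labels)
        if lowest < 0 or highest >= num_classes:
            raise ValueError(
                f"target {image_idx}: labels span [{lowest}, {highest}] "
                f"but num_classes={num_classes} only allows [0, {num_classes - 1}]"
            )


good = [{"labels": [0, 5]}, {"labels": []}]
bad = [{"labels": [1, 7]}]

check_label_range(good, num_classes=6)  # passes silently

try:
    check_label_range(bad, num_classes=6)
except ValueError as err:
    message = str(err)
```

Running a check like this on the CPU side gives a plain Python error message instead of an opaque device-side assert.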

> Any tips on that appreciated and maybe that could be added as part of the documentation as well?

There is some information in #41

I think a new file named TROUBLESHOOTING.md (that has a link in the main README) could be a good place to have more information, in the format ->

> Regardless thanks again for all the help!

Let us know if you have further questions!

lessw2020 commented on July 2, 2024

*I'll try a different dataset tomorrow that doesn't have the one outer bounding box surrounding all the inner objects and see if that is the core issue.

alcinos commented on July 2, 2024

Hi,
I'd have to see the full backtrace to be 100% sure, but generally in this function boxes1 corresponds to the predicted boxes, not the target ones, so your dataset is likely not to blame here (see e.g. https://github.com/facebookresearch/detr/blob/master/models/detr.py#L150)
Off the top of my head, I can think of mainly two things that can trigger this:

1. NaNs. Do you have a higher LR than the defaults? Maybe your training is just diverging.
2. Lack of clamping. In https://github.com/facebookresearch/detr/blob/master/models/detr.py#L66, we have a sigmoid to force a prediction in [0,1], thus preventing degenerate boxes. Did you remove it, by any chance?
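Both causes can be illustrated with a small plain-Python sketch. `box_cxcywh_to_xyxy` mirrors the conversion shown in the tracebacks in this thread, while `is_valid_xyxy` and `sigmoid` are stand-in helpers written for this example and operate on a single box rather than a tensor.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def box_cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def is_valid_xyxy(box):
    """Same condition as the degenerate-box assert: x1 >= x0 and y1 >= y0."""
    x0, y0, x1, y1 = box
    return x1 >= x0 and y1 >= y0


raw = (0.3, -1.2, -0.5, 2.0)  # hypothetical raw head output, negative width

# Without the sigmoid the negative width flips x0 and x1: degenerate box.
unclamped_ok = is_valid_xyxy(box_cxcywh_to_xyxy(raw))

# With the sigmoid, width and height land in (0, 1), so the box is always valid.
clamped_ok = is_valid_xyxy(box_cxcywh_to_xyxy(tuple(sigmoid(v) for v in raw)))

# Any NaN makes every comparison False, so the check also fails on divergence.
nan_ok = is_valid_xyxy(box_cxcywh_to_xyxy((float("nan"), 0.5, 0.2, 0.2)))
```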

Best of luck.

lessw2020 commented on July 2, 2024

Hi @alcinos - thanks for saving my stress levels - I was poring over the bboxes trying to figure out how it was flagging them as incorrect.

You are right though, now I see from your link that it is the predicted boxes and not the dataset loaded ones.
Re: questions:
1 - I didn't change the LR nor the clamping params. (I'm trying to make as few adjustments and just get it training first).
2 - However, I think the issue may be that I forgot there is no adjustment for classes in the main.py script (I had first adjusted num_queries and then reset it to 100 when I started hitting this issue). I'm training for 6 classes (or 6 + 1 for background) and I realize now it is likely predicting for 70+... so that may be why it asserts after just a few batches, with NaNs for the degenerate predicted boxes?

Let me try to remap the class count and create a --num_classes param and see if that fixes this!

lessw2020 commented on July 2, 2024

ugh well no luck - I changed the classes to 6 +1. Depending on the number of queries, I get various failures in the loss matching via CUDA assert ala below. Running with 100 (default) I go right back to the degenerate bbox issue as before.
--I'm running in Jupyter with this launch:

%run main.py --batch_size 2 --no_aux_loss --coco_path uw-dev7

Here's the error with queries = 9 (the --> arrows are my debugging prints so I can verify the model being created has expected num_queries and classes):

Not using distributed mode
git:
  sha: 7613beb10a530ca0ab836f2c8845d0501f5bf063, status: has uncommited changes, branch: master

Namespace(aux_loss=False, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='uw-dev7', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
_____------> num_classes = 91
___====> self.class_embed = Linear(in_features=256, out_features=7, bias=True)
___=====> self.query_embed = Embedding(9, 256)
number of params: 41257227
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Start training
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/detr/main.py in <module>
    246     if args.output_dir:
    247         Path(args.output_dir).mkdir(parents=True, exist_ok=True)
--> 248     main(args)

~/detr/main.py in main(args)
    196         train_stats = train_one_epoch(
    197             model, criterion, data_loader_train, optimizer, device, epoch,
--> 198             args.clip_max_norm)
    199         lr_scheduler.step()
    200         if args.output_dir:

~/detr/engine.py in train_one_epoch(model, criterion, data_loader, optimizer, device, epoch, max_norm)
     31 
     32         outputs = model(samples)
---> 33         loss_dict = criterion(outputs, targets)
     34         weight_dict = criterion.weight_dict
     35         losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/detr/models/detr.py in forward(self, outputs, targets)
    220 
    221         # Retrieve the matching between the outputs of the last layer and the targets
--> 222         indices = self.matcher(outputs_without_aux, targets)
    223 
    224         # Compute the average number of target boxes accross all nodes, for normalization purposes

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_no_grad(*args, **kwargs)
     47         def decorate_no_grad(*args, **kwargs):
     48             with self:
---> 49                 return func(*args, **kwargs)
     50         return decorate_no_grad
     51 

~/detr/models/matcher.py in forward(self, outputs, targets)
     72 
     73         # Compute the giou cost betwen boxes
---> 74         cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
     75 
     76         # Final cost matrix

~/detr/util/box_ops.py in box_cxcywh_to_xyxy(x)
      9 def box_cxcywh_to_xyxy(x):
     10     x_c, y_c, w, h = x.unbind(-1)
---> 11     b = [(x_c - 0.5 * w), (y_c - 0.5 * h),
     12          (x_c + 0.5 * w), (y_c + 0.5 * h)]
     13     return torch.stack(b, dim=-1)

RuntimeError: CUDA error: device-side assert triggered

And reverting to queries=100, I get back into the original degenerate bbox issue as before:

Not using distributed mode
git:
  sha: 7613beb10a530ca0ab836f2c8845d0501f5bf063, status: has uncommited changes, branch: master

Namespace(aux_loss=False, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='uw-dev7', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
_____------> num_classes = 91
___====> self.class_embed = Linear(in_features=256, out_features=7, bias=True)
___=====> self.query_embed = Embedding(100, 256)
number of params: 41280523
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Start training
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/detr/main.py in <module>
    246     if args.output_dir:
    247         Path(args.output_dir).mkdir(parents=True, exist_ok=True)
--> 248     main(args)

~/detr/main.py in main(args)
    196         train_stats = train_one_epoch(
    197             model, criterion, data_loader_train, optimizer, device, epoch,
--> 198             args.clip_max_norm)
    199         lr_scheduler.step()
    200         if args.output_dir:

~/detr/engine.py in train_one_epoch(model, criterion, data_loader, optimizer, device, epoch, max_norm)
     31 
     32         outputs = model(samples)
---> 33         loss_dict = criterion(outputs, targets)
     34         weight_dict = criterion.weight_dict
     35         losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/detr/models/detr.py in forward(self, outputs, targets)
    220 
    221         # Retrieve the matching between the outputs of the last layer and the targets
--> 222         indices = self.matcher(outputs_without_aux, targets)
    223 
    224         # Compute the average number of target boxes accross all nodes, for normalization purposes

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_no_grad(*args, **kwargs)
     47         def decorate_no_grad(*args, **kwargs):
     48             with self:
---> 49                 return func(*args, **kwargs)
     50         return decorate_no_grad
     51 

~/detr/models/matcher.py in forward(self, outputs, targets)
     72 
     73         # Compute the giou cost betwen boxes
---> 74         cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
     75 
     76         # Final cost matrix

~/detr/util/box_ops.py in generalized_box_iou(boxes1, boxes2)
     49     # degenerate boxes gives inf / nan results
     50     # so do an early check
---> 51     assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
     52     assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
     53     iou, union = box_iou(boxes1, boxes2)

~/anaconda3/lib/python3.7/site-packages/torch/tensor.py in wrapped(*args, **kwargs)
     26     def wrapped(*args, **kwargs):
     27         try:
---> 28             return f(*args, **kwargs)
     29         except TypeError:
     30             return NotImplemented

RuntimeError: CUDA error: device-side assert triggered

lessw2020 commented on July 2, 2024

*note - I'll try fine tuning tomorrow as a backup plan (via --resume and the checkpoint linear layer restart).

lessw2020 commented on July 2, 2024

Hi @alcinos and @fmassa - thanks very much to both of you for the detailed info!

I've reset my code changes, updating the classes per the above and verifying SetCriterion, and updating the queries via command line arg.
Unfortunately the problem persists - back to the degenerate bbox assert.

I'll try reverting to the default of 100 queries, and continue trying to pin it down further. For reference, I can run COCO eval on this server with 42 mAP, so the config seems functional.

Here's my current results - I added a --num_classes arg to simplify (which adjusts at spot @alcinos pointed out), I have a print check for SetCriterion per @fmassa as well.
I've printed the model, postprocessor, criterion in the results below as well as verified the class_embed looks correct ala num_classes+1:
(class_embed): Linear(in_features=256, out_features=7, bias=True)
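For readers who want to reproduce a similar setup, a command-line option of this kind could be sketched with argparse as below. The `--num_classes` flag is the custom addition described in this comment, not an option of the original main.py at this commit; the default of 91 follows the COCO value mentioned earlier in the thread, and the default of 100 queries matches the logged Namespace.

```python
import argparse


def get_args_parser():
    parser = argparse.ArgumentParser("custom dataset options (sketch)", add_help=False)
    # Present in the logged Namespace above with a default of 100.
    parser.add_argument("--num_queries", default=100, type=int,
                        help="Number of query slots")
    # Hypothetical flag: labels must satisfy tgt_ids.max() < num_classes.
    parser.add_argument("--num_classes", default=91, type=int,
                        help="Number of object classes, excluding 'no object'")
    return parser


args = get_args_parser().parse_args(["--num_queries", "12", "--num_classes", "6"])

# The classifier gets one extra output for the "no object" class,
# which matches the out_features=7 printed for class_embed.
num_logits = args.num_classes + 1
```

Passing the same `args.num_classes` value to both the model and the criterion avoids the two getting out of sync.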

Here's my launch command:
%run main.py --batch_size 2 --no_aux_loss --num_queries 12 --num_classes 6 --coco_path uw-dev7 --dataset_file coco --output_dir ./output

And results

Not using distributed mode
git: sha: 7613beb10a530ca0ab836f2c8845d0501f5bf063, status: has uncommited changes, branch: master
Namespace(aux_loss=False, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='uw-dev7', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=False, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_classes=6, num_queries=12, num_workers=2, output_dir='./output', position_embedding='sine', pre_norm=False, remove_difficult=False, resume='', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
num_classes = 6
*** custom classes and queries ****
---> num classes = 6, num queries = 12
detr.py::SetCriterion.__init__ self.num_classes = 6
DETR(
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
        (1): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
        (2): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
        (3): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
        (4): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
        (5): TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (1): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (2): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (3): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (4): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
        (5): TransformerDecoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (multihead_attn): MultiheadAttention(
            (out_proj): Linear(in_features=256, out_features=256, bias=True)
          )
          (linear1): Linear(in_features=256, out_features=2048, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=2048, out_features=256, bias=True)
          (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (norm3): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
          (dropout3): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
    )
  )
  (class_embed): Linear(in_features=256, out_features=7, bias=True)
  (bbox_embed): MLP(
    (layers): ModuleList(
      (0): Linear(in_features=256, out_features=256, bias=True)
      (1): Linear(in_features=256, out_features=256, bias=True)
      (2): Linear(in_features=256, out_features=4, bias=True)
    )
  )
  (query_embed): Embedding(12, 256)
  (input_proj): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1))
  (backbone): Joiner(
    (0): Backbone(
      (body): IntermediateLayerGetter(
        (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
        (bn1): FrozenBatchNorm2d()
        (relu): ReLU(inplace=True)
        (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
        (layer1): Sequential(
          (0): Bottleneck(
            (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
            (downsample): Sequential(
              (0): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): FrozenBatchNorm2d()
            )
          )
          (1): Bottleneck(
            (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
          (2): Bottleneck(
            (conv1): Conv2d(256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
        )
        (layer2): Sequential(
          (0): Bottleneck(
            (conv1): Conv2d(256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
            (downsample): Sequential(
              (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False)
              (1): FrozenBatchNorm2d()
            )
          )
          (1): Bottleneck(
            (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
          (2): Bottleneck(
            (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
          (3): Bottleneck(
            (conv1): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
        )
        (layer3): Sequential(
          (0): Bottleneck(
            (conv1): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
            (downsample): Sequential(
              (0): Conv2d(512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False)
              (1): FrozenBatchNorm2d()
            )
          )
          (1): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
          (2): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
          (3): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
          (4): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
          (5): Bottleneck(
            (conv1): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
        )
        (layer4): Sequential(
          (0): Bottleneck(
            (conv1): Conv2d(1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
            (downsample): Sequential(
              (0): Conv2d(1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False)
              (1): FrozenBatchNorm2d()
            )
          )
          (1): Bottleneck(
            (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
          (2): Bottleneck(
            (conv1): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn1): FrozenBatchNorm2d()
            (conv2): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
            (bn2): FrozenBatchNorm2d()
            (conv3): Conv2d(512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False)
            (bn3): FrozenBatchNorm2d()
            (relu): ReLU(inplace=True)
          )
        )
      )
    )
    (1): PositionEmbeddingSine()
  )
)
SetCriterion(
  (matcher): HungarianMatcher()
)
{'bbox': PostProcess()}
number of params: 41257995
loading annotations into memory...
Done (t=0.02s)
creating index...
index created!
loading annotations into memory...
Done (t=0.00s)
creating index...
index created!
Start training
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/detr/main.py in <module>
    252     if args.output_dir:
    253         Path(args.output_dir).mkdir(parents=True, exist_ok=True)
--> 254     main(args)

~/detr/main.py in main(args)
    202         train_stats = train_one_epoch(
    203             model, criterion, data_loader_train, optimizer, device, epoch,
--> 204             args.clip_max_norm)
    205         lr_scheduler.step()
    206         if args.output_dir:

~/detr/engine.py in train_one_epoch(model, criterion, data_loader, optimizer, device, epoch, max_norm)
     31 
     32         outputs = model(samples)
---> 33         loss_dict = criterion(outputs, targets)
     34         weight_dict = criterion.weight_dict
     35         losses = sum(loss_dict[k] * weight_dict[k] for k in loss_dict.keys() if k in weight_dict)

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/detr/models/detr.py in forward(self, outputs, targets)
    217 
    218         # Retrieve the matching between the outputs of the last layer and the targets
--> 219         indices = self.matcher(outputs_without_aux, targets)
    220 
    221         # Compute the average number of target boxes accross all nodes, for normalization purposes

~/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    530             result = self._slow_forward(*input, **kwargs)
    531         else:
--> 532             result = self.forward(*input, **kwargs)
    533         for hook in self._forward_hooks.values():
    534             hook_result = hook(self, input, result)

~/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_no_grad(*args, **kwargs)
     47         def decorate_no_grad(*args, **kwargs):
     48             with self:
---> 49                 return func(*args, **kwargs)
     50         return decorate_no_grad
     51 

~/detr/models/matcher.py in forward(self, outputs, targets)
     72 
     73         # Compute the giou cost betwen boxes
---> 74         cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
     75 
     76         # Final cost matrix

~/detr/util/box_ops.py in generalized_box_iou(boxes1, boxes2)
     49     # degenerate boxes gives inf / nan results
     50     # so do an early check
---> 51     assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
     52     assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
     53     iou, union = box_iou(boxes1, boxes2)

RuntimeError: CUDA error: device-side assert triggered
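
A note on reading this traceback: CUDA kernels run asynchronously, so a device-side assert raised by an earlier kernel can be reported at a later line such as the one above. To rule out boxes that are genuinely degenerate, the same check can be run on the CPU. This is a minimal sketch in plain Python, assuming boxes are given as (cx, cy, w, h) tuples; the helper names are illustrative and not part of DETR.

```python
import math

def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def find_degenerate(boxes):
    """Return indices of boxes that would fail the x1 >= x0, y1 >= y0 check."""
    bad = []
    for i, box in enumerate(boxes):
        x0, y0, x1, y1 = cxcywh_to_xyxy(box)
        # NaN coordinates also fail the comparison-based assert.
        has_nan = any(math.isnan(v) for v in (x0, y0, x1, y1))
        if has_nan or x1 < x0 or y1 < y0:
            bad.append(i)
    return bad

# Hypothetical sample: one valid box, one with negative width, one with NaN.
boxes = [(0.5, 0.5, 0.2, 0.2), (0.5, 0.5, -0.1, 0.2), (float("nan"), 0.5, 0.1, 0.1)]
print(find_degenerate(boxes))
```

If this returns an empty list for every target in the dataset, the boxes themselves are probably fine and the assert likely originates elsewhere.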

from detr.

lessw2020 commented on July 2, 2024

Hi @fmassa - thanks again for the help.
I switched out from Jupyter to terminal and then was able to see the full CUDA assert info (about 32 of them).
I'm attaching debug.txt, which is the standard output (model loading, start of training), and more importantly debug2.txt, which contains the specific CUDA asserts generated when running with CUDA_LAUNCH_BLOCKING=1.
Thanks for the help on this and hope this additional CUDA info helps pin it down!
debug2.txt
debug.txt
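
For reference, the variable can also be set from inside Python, as long as that happens before CUDA is initialized. A minimal sketch, assuming the training code runs in the same process:

```python
import os

# Must be set before the first CUDA call (safest: before importing torch),
# so that kernel launches become synchronous and the traceback points at
# the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

The shell equivalent is to prefix the launch command, e.g. `CUDA_LAUNCH_BLOCKING=1 python main.py` followed by the usual arguments.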

from detr.

lessw2020 commented on July 2, 2024

I should add your intuition was quite correct as the core CUDA issue is an index out of bounds:
/opt/conda/conda-bld/pytorch_1579040055865/work/aten/src/ATen/native/cuda/IndexKernel.cu:60: lambda [](int)->auto::operator()(int)->auto: block: [0,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

from detr.

lessw2020 commented on July 2, 2024

Hi @fmassa - thanks for the updates! (and really appreciate you helping trace my issue through).

1 - appreciate the tip re: --no_aux_loss, will drop that as we definitely want best results here (this is for malaria and covid diagnostics, so accuracy is paramount).

2 - Re: more classes than expected - never say never, but overall I don't believe that is possible (double checks shown below).
I did do the print statement and that may be helpful - it shows 11 targets instead of 12? There are only 6 unique class ids in that target tensor though:

--> HungarianMatcher::tgt_ids.max() = 2905442
and tgt_ids tensor([2905419, 2905420, 2905442, 2905422, 2905421, 2905419, 2905442, 2905422,
        2905421, 2905420, 2905418], device='cuda:0')

2905418 2905419 2905420 2905421 2905422 2905442 = 6 unique targets

I've double checked at the source labeling process and the master label process only knows about 6 classes (see attached)... I've also trained this same dataset in EfficientDet and no issues in terms of aberrant classes.

In addition, from the bbox prints I showed above you can see that the get_item is only returning six bboxes per image.

If it helps, here is some additional print info from the matcher. The out_prob tensor matches what I'd expect: 24 rows (12 queries * batch size 2), each with 7 entries:

--> HungarianMatcher::bs = 2, num_queries = 12

--> HungarianMatcher::out_prob = tensor([
[0.1429, 0.1524, 0.0360, 0.3441, 0.0595, 0.1435, 0.1216],
[0.1272, 0.1756, 0.0668, 0.3013, 0.0830, 0.1012, 0.1450],
[0.1289, 0.2092, 0.0606, 0.2196, 0.0773, 0.1041, 0.2003],
[0.1707, 0.1649, 0.0612, 0.3110, 0.0605, 0.0686, 0.1631],
[0.1422, 0.1706, 0.0585, 0.3027, 0.0632, 0.1221, 0.1407],
[0.1525, 0.1247, 0.0755, 0.3634, 0.0449, 0.0984, 0.1406],
[0.1743, 0.1759, 0.0831, 0.2489, 0.0527, 0.0978, 0.1673],
[0.1781, 0.1521, 0.0790, 0.3080, 0.0660, 0.1194, 0.0975],
[0.1112, 0.1441, 0.0579, 0.4215, 0.0434, 0.1162, 0.1056],
[0.1850, 0.1077, 0.0532, 0.4358, 0.0441, 0.0603, 0.1138],
[0.1347, 0.1865, 0.0617, 0.2846, 0.0529, 0.1071, 0.1725],
[0.1269, 0.1257, 0.0557, 0.3342, 0.0883, 0.0871, 0.1820],
[0.1542, 0.1182, 0.0392, 0.3187, 0.0728, 0.1875, 0.1094],
[0.0895, 0.1262, 0.0507, 0.3688, 0.0578, 0.1258, 0.1812],
[0.1555, 0.0875, 0.0241, 0.3864, 0.0619, 0.1670, 0.1175],
[0.1452, 0.1161, 0.0651, 0.3215, 0.0478, 0.1602, 0.1442],
[0.1515, 0.0955, 0.0361, 0.4037, 0.0529, 0.1357, 0.1246],
[0.1484, 0.1239, 0.0483, 0.3102, 0.0846, 0.1653, 0.1193],
[0.1504, 0.1253, 0.0385, 0.2777, 0.0588, 0.1858, 0.1635],
[0.1305, 0.0958, 0.0304, 0.4392, 0.0672, 0.1001, 0.1368],
[0.1302, 0.1838, 0.0467, 0.2408, 0.0975, 0.1746, 0.1265],
[0.1314, 0.1010, 0.0646, 0.3697, 0.0854, 0.1541, 0.0939],
[0.1565, 0.0833, 0.0494, 0.3646, 0.0443, 0.1281, 0.1738],
[0.1233, 0.0938, 0.0386, 0.4156, 0.0898, 0.1282, 0.1107]],
device='cuda:0')
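
The two printouts can be compared directly. A minimal sketch in plain Python, using the ids and the 7-entry row width copied from the output above; it only illustrates the `tgt_ids.max() < num_classes` condition and is not DETR code:

```python
# Target ids copied from the tgt_ids printout above.
tgt_ids = [2905419, 2905420, 2905442, 2905422, 2905421, 2905419,
           2905442, 2905422, 2905421, 2905420, 2905418]
num_columns = 7  # number of entries in each out_prob row above

unique_ids = sorted(set(tgt_ids))
print(len(unique_ids))             # how many distinct classes appear
print(max(tgt_ids) < num_columns)  # whether the ids are valid column indices
```

The number of distinct ids is small, but the ids themselves are used as column indices into out_prob, so their magnitude is what matters.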

from detr.

lessw2020 commented on July 2, 2024

Hi @alcinos and @fmassa -
Oh- thanks for clarifying this!
I see what you mean here - I incorrectly thought that mapping was being auto-handled via the inheritance from torchvision.datasets.CocoDetection (which I've never used before).... and then all the bbox asserts etc sent me down a round-about path.

Anyway, I learned some nice info on debugging CUDA asserts and thanks for all the help.

Definitely agree re: a softmax over 2.9M elements :) I'm putting together a custom dataset class for DETR that does the class remapping, based on what I used for EfficientDet while staying close to your implementation, and will confirm once I'm training.

Thanks again!

from detr.

lessw2020 commented on July 2, 2024

I've got it all remapped and working - will try to train tomorrow.
Because the CocoDetection class uses the separate class ConvertCocoPolysToMask, I figured the cleanest point to remap was right before returning the target: I need access to self.coco, which that class doesn't have, and I have to wait for it to filter out any errant bboxes first. I made a remap_labels function that does the remapping in place.
GitHub seems to be stripping some of the code formatting, but any feedback is welcome if there's a better spot to remap.

    # Methods of the custom CocoDetection subclass (needs `import torch` at module level)
    def __getitem__(self, idx):
        img, target = super().__getitem__(idx)
        image_id = self.ids[idx]
        target = {'image_id': image_id, 'annotations': target}
        img, target = self.prepare(img, target)
        if self._transforms is not None:
            img, target = self._transforms(img, target)
        # Remap target['labels'] in place to my contiguous labels
        self.remap_labels(target)
        return img, target

    def remap_labels(self, target):
        # Replace each original COCO category id with its contiguous label
        labels = target['labels'].tolist()
        new_labels = [self.coco_label_to_my_label(item) for item in labels]
        target['labels'] = torch.tensor(new_labels, dtype=torch.int64)

    def coco_label_to_my_label(self, coco_label):
        return self.coco_labels_inverse[coco_label]

    def my_label_to_coco_label(self, label):
        return self.coco_labels[label]
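
The snippet above relies on two lookup tables, `coco_labels` and `coco_labels_inverse`, whose construction is not shown. One possible way to build them, sketched in plain Python under the assumption that the original category ids are available as a list (the function name `build_label_maps` is hypothetical):

```python
def build_label_maps(category_ids):
    """Map arbitrary category ids to contiguous labels 0..N-1 and back."""
    sorted_ids = sorted(set(category_ids))
    # contiguous label -> original category id
    coco_labels = {label: cat_id for label, cat_id in enumerate(sorted_ids)}
    # original category id -> contiguous label
    coco_labels_inverse = {cat_id: label for label, cat_id in coco_labels.items()}
    return coco_labels, coco_labels_inverse

# Ids taken from the tgt_ids printout earlier in the thread.
coco_labels, coco_labels_inverse = build_label_maps(
    [2905418, 2905419, 2905420, 2905421, 2905422, 2905442]
)
print(coco_labels_inverse[2905442])
```

With six classes the remapped labels fall in [0, 5], which satisfies the `tgt_ids.max() < num_classes` condition for `num_classes = 6`.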

from detr.

lessw2020 commented on July 2, 2024

Hi @alcinos and @fmassa - I'm up and training successfully now.
Just wanted to say thanks again for the help!
I made a custom_coco.py and modified main.py and __init__.py in datasets to keep everything as closely aligned with upstream as I could, while allowing custom class counts and handling the remapping for future updates.
I can open a PR with custom_coco.py if that would be useful to others; otherwise this issue is resolved and can be closed.
Thanks again!

from detr.

fmassa commented on July 2, 2024

@lessw2020 great that you managed to make it work!

Initial versions of the code had a class to remap the COCO categories, but we removed it because it was no longer used and it made evaluation a bit more complicated. Evaluation is something you also need to pay attention to; otherwise your mAP will be zero.

I think this is a good record to keep in mind for improving the documentation, but I'm not sure what the best way to do it would be while keeping things as easy as possible. It would involve adding a few more abstractions, like the ones in torchvision, to make it work, and I believe we would prefer to keep the codebase as simple as possible.

Maybe @szagoruyko or @alcinos can comment on this, but I think a note somewhere explaining how to do it would maybe be preferable.
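
To illustrate the evaluation point: COCO-style evaluation matches predictions to ground truth by the original category ids, so remapped labels have to be mapped back before results are scored. A minimal sketch, assuming predictions are plain dicts and a `contiguous_to_original` table like the one used for remapping (all names here are hypothetical, not DETR's API):

```python
def to_original_ids(predictions, contiguous_to_original):
    """Return copies of the predictions with labels mapped back to dataset ids."""
    restored = []
    for pred in predictions:
        fixed = dict(pred)  # shallow copy so the input is left untouched
        fixed["category_id"] = contiguous_to_original[pred["category_id"]]
        restored.append(fixed)
    return restored

# Hypothetical mapping and a single prediction with a contiguous label.
contiguous_to_original = {0: 2905418, 1: 2905419, 2: 2905420}
preds = [{"image_id": 1, "category_id": 2, "score": 0.9}]
print(to_original_ids(preds, contiguous_to_original))
```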

from detr.

Mashood3624 commented on July 2, 2024

Hi, I may be wrong, but what I have understood is that @lessw2020 had custom class labels with values above 2M (like 2905419). By remapping them to the integers 0, 1, 2, etc., the error got resolved, right? The max class label must be smaller than the total number of classes; for example, if we have 4 classes in our dataset, then our labels should be 0, 1, 2 and 3. I am facing the same assert error even though I have labelled my classes correctly. Please guide. Thanks.

from detr.

custom training asserts with "degenerate bboxes" over and over - but bboxes look correct, any debugging insight? about detr HOT 21 CLOSED


	target_classes = torch.full(src_logits.shape[:2], self.num_classes,
	dtype=torch.int64, device=src_logits.device)
	target_classes[idx] = target_classes_o

	loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)