Training runs fine for every step of epoch 1. At the end of the epoch, when the checkpoint is saved, GPU memory usage suddenly jumps from about 8-9 GB to 18 GB, and the run eventually fails when it hits the 24 GB limit. According to the traceback below, the crash happens in the evaluation step that runs right after the checkpoint is saved.
2023-05-16 14:05:55,895 - mmdet - INFO - workflow: [('train', 1)], max: 33 epochs
2023-05-16 14:05:55,895 - mmdet - INFO - Checkpoints will be saved to /mmdetection3d/tools/work_dirs/td3d_is_s3dis-3d-5class by HardDiskBackend.
/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiSparseTensor.py:298: UserWarning: coordinates implicitly converted to torch.IntTensor. To remove this warning, use `.int()` to convert the coords into an torch.IntTensor
+ "coords into an torch.IntTensor"
2023-05-16 14:06:42,654 - mmdet - INFO - Epoch [1][50/663] lr: 1.000e-03, eta: 5:40:12, time: 0.935, data_time: 0.290, memory: 7883, bbox_loss: 0.8248, cls_loss: 0.7217, inst_loss: 0.7607, loss: 2.3071, grad_norm: 1.5957
2023-05-16 14:07:15,048 - mmdet - INFO - Epoch [1][100/663] lr: 1.000e-03, eta: 4:47:17, time: 0.648, data_time: 0.016, memory: 7883, bbox_loss: 0.7341, cls_loss: 0.3994, inst_loss: 0.6373, loss: 1.7708, grad_norm: 0.9642
2023-05-16 14:07:50,540 - mmdet - INFO - Epoch [1][150/663] lr: 1.000e-03, eta: 4:36:46, time: 0.710, data_time: 0.035, memory: 8027, bbox_loss: 0.7062, cls_loss: 0.3581, inst_loss: 0.6261, loss: 1.6904, grad_norm: 1.1923
2023-05-16 14:08:25,190 - mmdet - INFO - Epoch [1][200/663] lr: 1.000e-03, eta: 4:29:42, time: 0.693, data_time: 0.014, memory: 8027, bbox_loss: 0.6692, cls_loss: 0.3358, inst_loss: 0.6145, loss: 1.6194, grad_norm: 1.0767
2023-05-16 14:09:01,773 - mmdet - INFO - Epoch [1][250/663] lr: 1.000e-03, eta: 4:28:00, time: 0.732, data_time: 0.023, memory: 8027, bbox_loss: 0.6513, cls_loss: 0.3226, inst_loss: 0.6042, loss: 1.5781, grad_norm: 1.2070
2023-05-16 14:09:39,756 - mmdet - INFO - Epoch [1][300/663] lr: 1.000e-03, eta: 4:28:21, time: 0.760, data_time: 0.015, memory: 8027, bbox_loss: 0.6300, cls_loss: 0.3100, inst_loss: 0.5524, loss: 1.4923, grad_norm: 1.2423
2023-05-16 14:10:18,196 - mmdet - INFO - Epoch [1][350/663] lr: 1.000e-03, eta: 4:28:53, time: 0.769, data_time: 0.015, memory: 8027, bbox_loss: 0.6168, cls_loss: 0.3033, inst_loss: 0.5165, loss: 1.4367, grad_norm: 1.2490
2023-05-16 14:11:00,874 - mmdet - INFO - Epoch [1][400/663] lr: 1.000e-03, eta: 4:32:56, time: 0.854, data_time: 0.056, memory: 8638, bbox_loss: 0.6106, cls_loss: 0.2944, inst_loss: 0.5128, loss: 1.4178, grad_norm: 1.3136
2023-05-16 14:11:40,923 - mmdet - INFO - Epoch [1][450/663] lr: 1.000e-03, eta: 4:33:49, time: 0.801, data_time: 0.017, memory: 8638, bbox_loss: 0.6041, cls_loss: 0.2857, inst_loss: 0.4876, loss: 1.3774, grad_norm: 1.3142
2023-05-16 14:12:23,333 - mmdet - INFO - Epoch [1][500/663] lr: 1.000e-03, eta: 4:36:05, time: 0.848, data_time: 0.021, memory: 8638, bbox_loss: 0.5784, cls_loss: 0.2747, inst_loss: 0.4711, loss: 1.3242, grad_norm: 1.2854
2023-05-16 14:13:04,558 - mmdet - INFO - Epoch [1][550/663] lr: 1.000e-03, eta: 4:37:03, time: 0.824, data_time: 0.014, memory: 8638, bbox_loss: 0.5704, cls_loss: 0.2632, inst_loss: 0.4488, loss: 1.2824, grad_norm: 1.2698
2023-05-16 14:13:47,635 - mmdet - INFO - Epoch [1][600/663] lr: 1.000e-03, eta: 4:38:49, time: 0.862, data_time: 0.025, memory: 8638, bbox_loss: 0.5713, cls_loss: 0.2618, inst_loss: 0.4385, loss: 1.2715, grad_norm: 1.3437
2023-05-16 14:14:31,222 - mmdet - INFO - Epoch [1][650/663] lr: 1.000e-03, eta: 4:40:30, time: 0.872, data_time: 0.055, memory: 8638, bbox_loss: 0.5565, cls_loss: 0.2557, inst_loss: 0.4479, loss: 1.2601, grad_norm: 1.3006
2023-05-16 14:14:41,589 - mmdet - INFO - Saving checkpoint at 1 epochs
[>>>>>> ] 9/68, 1.0 task/s, elapsed: 9s, ETA: 60s
Traceback (most recent call last):
File "train.py", line 263, in <module>
main()
File "train.py", line 259, in main
meta=meta)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 351, in train_model
meta=meta)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/apis/train.py", line 319, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 136, in run
epoch_runner(data_loaders[i], **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/epoch_based_runner.py", line 58, in train
self.call_hook('after_train_epoch')
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/base_runner.py", line 317, in call_hook
getattr(hook, fn_name)(self)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/hooks/evaluation.py", line 271, in after_train_epoch
self._do_evaluate(runner)
File "/usr/local/lib/python3.7/dist-packages/mmdet/core/evaluation/eval_hooks.py", line 56, in _do_evaluate
results = single_gpu_test(runner.model, self.dataloader, show=False)
File "/usr/local/lib/python3.7/dist-packages/mmdet/apis/test.py", line 29, in single_gpu_test
result = model(return_loss=False, rescale=True, **data)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/parallel/data_parallel.py", line 51, in forward
return super().forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
return self.module(*inputs[0], **kwargs[0])
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/base.py", line 62, in forward
return self.forward_test(**kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/base.py", line 43, in forward_test
return self.simple_test(points[0], img_metas[0], img[0], **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/detectors/td3d_instance_segmentor.py", line 122, in simple_test
instances = self.head.forward_test(x, field, img_metas)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 556, in forward_test
cls_preds, idxs, v2r, r2scene, rois, scores, labels = self._forward_second(x[0], src_idxs, bbox_list)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/decode_heads/td3d_instance_head.py", line 222, in _forward_second
preds = self.unet(feats).features
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/mmdet3d/models/backbones/mink_unet.py", line 225, in forward
out = self.conv0p1s1(x)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 321, in forward
input._manager,
File "/usr/local/lib/python3.7/dist-packages/MinkowskiEngine/MinkowskiConvolution.py", line 84, in forward
coordinate_manager._manager,
MemoryError: std::bad_alloc: cudaErrorMemoryAllocation: out of memory
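
To double-check the "~8/9 GB during training" figure, the `memory:` field can be pulled out of the log lines above with a few lines of Python. This is only a minimal sketch: it assumes the log format shown in this report, the helper name `memory_by_step` is made up for illustration, and as far as I understand the value is the peak allocated GPU memory in MB as reported by the text logger. The two sample lines are copied from the log above.

```python
import re

# Matches the epoch/iteration counter and the "memory: <int>" field of a training log line.
MEMORY_FIELD = re.compile(r"Epoch \[(\d+)\]\[(\d+)/(\d+)\].*?memory: (\d+)")


def memory_by_step(log_text):
    """Return (epoch, iteration, memory) tuples for every matching log line."""
    rows = []
    for line in log_text.splitlines():
        match = MEMORY_FIELD.search(line)
        if match:
            epoch, step, _total, memory = map(int, match.groups())
            rows.append((epoch, step, memory))
    return rows


# Two lines taken verbatim from the log in this report.
sample = """\
2023-05-16 14:06:42,654 - mmdet - INFO - Epoch [1][50/663] lr: 1.000e-03, eta: 5:40:12, time: 0.935, data_time: 0.290, memory: 7883, bbox_loss: 0.8248, cls_loss: 0.7217, inst_loss: 0.7607, loss: 2.3071, grad_norm: 1.5957
2023-05-16 14:11:00,874 - mmdet - INFO - Epoch [1][400/663] lr: 1.000e-03, eta: 4:32:56, time: 0.854, data_time: 0.056, memory: 8638, bbox_loss: 0.6106, cls_loss: 0.2944, inst_loss: 0.5128, loss: 1.4178, grad_norm: 1.3136
"""

rows = memory_by_step(sample)
peak = max(memory for _epoch, _step, memory in rows)
print(rows)
print("peak memory during training:", peak)
```

Run over the full log, this gives the peak seen during the training steps only; the jump to 18 GB and beyond happens afterwards, during evaluation, so it does not show up in these values.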