Hello! I am using the SD dataset to train the line model. The command I run is:
./tools/dist_train.sh ./projects/configs/vma_res152_e80_line.py 1
The error is:
KeyError: Caught KeyError in DataLoader worker process 0
I regenerated the SD line dataset following docs/prepare_dataset.md, but the problem still occurs. How can I track down the cause? Thank you!
The full error log is:
2023-09-11 16:55:43,906 - mmdet - INFO - Saving checkpoint at 5 epochs
[ ] 0/7, elapsed: 0s, ETA:Traceback (most recent call last):
File "./tools/train.py", line 261, in <module>
main()
File "./tools/train.py", line 250, in main
custom_train_model(
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/train.py", line 27, in custom_train_model
custom_train_detector(
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/mmdet_train.py", line 212, in custom_train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
self.call_hook('after_train_epoch')
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/base_runner.py", line 307, in call_hook
getattr(hook, fn_name)(self)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/mmcv/runner/hooks/evaluation.py", line 267, in after_train_epoch
self._do_evaluate(runner)
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/core/evaluation/eval_hooks.py", line 78, in _do_evaluate
results = custom_multi_gpu_test(
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/bevformer/apis/test.py", line 71, in custom_multi_gpu_test
for i, data in enumerate(data_loader):
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
return self.collate_fn(data)
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/datasets/builder.py", line 173, in iCurb_collate
data['seq'] = [x[0] for x in batch]
File "/home/pc01/code/VMA/projects/mmdet3d_plugin/datasets/builder.py", line 173, in <listcomp>
data['seq'] = [x[0] for x in batch]
KeyError: 0
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 82570) of binary: /home/pc01/anaconda3/envs/vma/bin/python
Traceback (most recent call last):
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/pc01/anaconda3/envs/vma/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================
Root Cause:
[0]:
time: 2023-09-11_16:55:52
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 82570)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
<NO_OTHER_FAILURES>
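
From the log, the failure happens in the evaluation step that runs right after the epoch 5 checkpoint is saved, and the final `KeyError: 0` comes from `x[0]` inside `iCurb_collate`. That is the error Python raises when `x` is a dict with no key `0`, so one possible reading is that the samples reaching the collate function during evaluation are dicts rather than tuples or lists. I have not verified this. The sketch below only reproduces the symptom in isolation; the sample contents and the helper names `describe_batch` and `first_elements` are hypothetical and not part of VMA.

```python
from collections.abc import Mapping, Sequence


def describe_batch(batch):
    """Summarise each sample so a collate function's assumptions can be checked."""
    report = []
    for i, sample in enumerate(batch):
        if isinstance(sample, Mapping):
            # Dict-like sample: list its keys.
            report.append((i, "mapping", sorted(map(str, sample.keys()))))
        elif isinstance(sample, Sequence) and not isinstance(sample, (str, bytes)):
            # Tuple/list-like sample: report its length.
            report.append((i, "sequence", len(sample)))
        else:
            report.append((i, type(sample).__name__, None))
    return report


def first_elements(batch):
    """Mimics the failing line from the traceback: [x[0] for x in batch]."""
    return [x[0] for x in batch]


# Hypothetical samples (assumption): tuple-style works, dict-style fails.
tuple_batch = [("seq_a", "img_a"), ("seq_b", "img_b")]
dict_batch = [{"img": "img_a", "img_metas": {}}]

print(first_elements(tuple_batch))
print(describe_batch(dict_batch))
try:
    first_elements(dict_batch)
except KeyError as err:
    print("KeyError:", err)
```

Calling something like `describe_batch(batch)` at the top of the collate function, with the DataLoader temporarily set to zero workers so the output is easy to read, should show whether the evaluation samples have the shape the collate function expects.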