
Comments (13)

jihanyang avatar jihanyang commented on August 12, 2024 1

Hello, I guess the "EPOCH 3" comes from this line:

epoch_id = num_list[-1] if num_list.__len__() > 0 else 'no_number'

together with the name ST3D, whose "3" is presumably picked up as the epoch number. Meanwhile, I downloaded the checkpoint from the link and tested it, but got:

[2021-09-07 14:06:10,520  detector3d_template.py 325  INFO]  ==> Loading parameters from checkpoint ../output/model_zoo/ckpt.pth to GPU
[2021-09-07 14:06:10,713  detector3d_template.py 331  INFO]  ==> Checkpoint trained from version: pcdet+0.2.0+ee0831b+pyab7b158
[2021-09-07 14:06:10,854  detector3d_template.py 350  INFO]  ==> Done (loaded 189/189)
[2021-09-07 14:06:10,864  eval_utils.py 40  INFO]  *************** EPOCH no_number EVALUATION *****************

You can try renaming ckpt.pth to checkpoint_epoch_100.pth; I guess it will then become an epoch 100 evaluation.
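A minimal sketch of why the reported epoch depends on the file name. It assumes that num_list is built by collecting every run of digits in the checkpoint path with re.findall; the thread only quotes the epoch_id line, so that part is an assumption, and the helper name and the second example path are made up for illustration.

```python
import re


def epoch_id_from_ckpt(ckpt_path):
    # Assumption: num_list holds every run of digits found in the path.
    num_list = re.findall(r"\d+", ckpt_path) if ckpt_path is not None else []
    # The line quoted in this comment: the last number wins.
    epoch_id = num_list[-1] if num_list.__len__() > 0 else 'no_number'
    return epoch_id


# No digits anywhere in the path.
print(epoch_id_from_ckpt("../output/model_zoo/ckpt.pth"))
# Hypothetical path that contains "ST3D": its "3" is the only number.
print(epoch_id_from_ckpt("ST3D/output/model_zoo/ckpt.pth"))
# After the suggested rename, the last number is 100.
print(epoch_id_from_ckpt("../output/model_zoo/checkpoint_epoch_100.pth"))
```

Under that assumption, the number in the log says nothing about how long the checkpoint was trained; it only reflects the last digits in the path passed to the evaluation script.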

The st3d training takes 30 epochs. I will add it to model zoo soon.

from st3d.

jihanyang avatar jihanyang commented on August 12, 2024

Hello, I trained it for around 50 epochs. I will revise the config to add this term. Thank you.


darrenjkt avatar darrenjkt commented on August 12, 2024

Hi, thanks for your response.

Just to clarify: are the 50 epochs for the pre-training with the nuscenes-kitti secondiou_old_anchor.yaml? When I evaluated the model_zoo checkpoint.pth file, it said "EPOCH 3 EVALUATION". Also, out of curiosity, how long does one epoch take for you, and what hardware are you using for training?

According to the OpenPCDet cfg file for second/pointpillars, they train for only 20 epochs (though their model zoo checkpoint.pth also reports something vastly different in evaluation).


jihanyang avatar jihanyang commented on August 12, 2024

Hello, your concern makes sense. We decided to use 50 epochs for the nus->kitti setting (with over 20,000 frames) since we use 30 epochs for the waymo->kitti setting (with over 70,000 frames). Meanwhile, may I know which checkpoint you used for evaluation? Can you reproduce the result with that checkpoint? The epoch numbers of the two pretrained secondiou models for nus->kitti should be 33 and 27 respectively. Maybe I uploaded the wrong checkpoint.
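As a rough reading of the epoch choice above: with fewer frames per epoch, more epochs are needed to process a comparable number of training samples. The sketch below uses the lower bounds quoted in this comment ("over 20,000" and "over 70,000" frames), not exact dataset sizes, and the helper name is hypothetical.

```python
def samples_seen(num_frames, num_epochs):
    """Total training samples processed, assuming one full pass per epoch."""
    return num_frames * num_epochs


# Lower-bound frame counts taken from the comment; real sizes are larger.
nus_kitti = samples_seen(num_frames=20000, num_epochs=50)
waymo_kitti = samples_seen(num_frames=70000, num_epochs=30)

print("nus->kitti  :", nus_kitti)
print("waymo->kitti:", waymo_kitti)
```

Even at 50 epochs, the nus->kitti schedule processes fewer samples in total than the 30-epoch waymo->kitti schedule under these numbers.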


darrenjkt avatar darrenjkt commented on August 12, 2024

If the 50 epochs are for the nuscenes-kitti pre-training, how many epochs are used for the nusc-kitti ST3D training afterwards? Could you also provide that checkpoint in the model zoo?

I downloaded the model ckpt.pth from your model zoo in the nuScenes -> KITTI TASK, first row in the table for SECOND-IoU ROS. I've copied the download link you provided here.

When evaluating the model, this is what I see in the logs. It says EPOCH 3 EVALUATION and hence my confusion.
[image: screenshot of the evaluation log]

Regarding reproducing from the checkpoint, that is no issue. I just can't seem to reproduce it when I train it myself to 3 epochs. However, if you're saying that you train the full 28130 nuscenes samples to 50 epochs for this same ckpt.pth then that makes a bit more sense.


darrenjkt avatar darrenjkt commented on August 12, 2024

Thanks for the clarification. Yes please, it'll help with reproducing the results if the epochs are added for all the steps. Appreciate your work.


weijiawang96 avatar weijiawang96 commented on August 12, 2024

Hi @darrenjkt and @jihanyang, I tried to run the training command like this:
python train.py --cfg_file cfgs/da-nuscenes-kitti_models/secondiou/secondiou_old_anchor_ros.yaml --batch_size 1 --epochs 50 --extra_tag st3d_infos
I'm using batch size = 1, since any number above 1 gives me CUDA out of memory error. Just wish to double check, will setting the batch size to 1 affect the training results? Thx.


jihanyang avatar jihanyang commented on August 12, 2024

May I know how much GPU memory it occupies when batch_size=1?


weijiawang96 avatar weijiawang96 commented on August 12, 2024

Dear @jihanyang,

Thank you so much for your reply. The command I ran was for PVRCNN_ST3D:
python train.py --cfg_file ./cfgs/da-nuscenes-kitti_models/pvrcnn_st3d/pvrcnn_st3d.yaml --batch_size 1 --epochs 50 --extra_tag st3d_infos
And here's my nvidia-smi output while the above command runs:

[image: nvidia-smi screenshot from 2021-09-29 14-01-00]

So the program uses around 63% of my GPU memory. The training loop can run, but I just don't know if setting batch_size=1 will affect the training results.

I'm using a single Nvidia GeForce RTX 3080 GPU on my computer.

By the way, here's the CUDA out of memory error when I set batch_size=4:
Command:
python train.py --cfg_file ./cfgs/da-nuscenes-kitti_models/pvrcnn_st3d/pvrcnn_st3d.yaml --batch_size 4 --epochs 50 --extra_tag st3d_infos

Error (reporting required CUDA memory):
RuntimeError: CUDA out of memory. Tried to allocate 864.00 MiB (GPU 0; 9.78 GiB total capacity; 7.84 GiB already allocated; 196.12 MiB free; 7.99 GiB reserved in total by PyTorch)
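To make the numbers in that message easier to compare, here is a small sketch that pulls the sizes out of the error text and converts them to MiB. The message is copied from above; the parse_oom helper and its regular expressions are illustrative only and are not part of PyTorch.

```python
import re

msg = ("RuntimeError: CUDA out of memory. Tried to allocate 864.00 MiB "
       "(GPU 0; 9.78 GiB total capacity; 7.84 GiB already allocated; "
       "196.12 MiB free; 7.99 GiB reserved in total by PyTorch)")

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in the OOM message, all converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        value, unit = re.search(pattern, message).groups()
        sizes[name] = float(value) * UNIT_TO_MIB[unit]
    return sizes


sizes = parse_oom(msg)
shortfall = sizes["requested"] - sizes["free"]
print("requested: %.2f MiB, free: %.2f MiB, short by: %.2f MiB"
      % (sizes["requested"], sizes["free"], shortfall))
```

The single 864 MiB request is several times larger than the 196.12 MiB reported as free, which is why the step fails at batch_size=4.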


jihanyang avatar jihanyang commented on August 12, 2024

Hello, according to the cfg:

We use batch size = 2 by default, and the GPU memory usage seems normal for pvrcnn. All experiments were run on a 1080 Ti with 11 GB of GPU memory.


weijiawang96 avatar weijiawang96 commented on August 12, 2024

Thank you for sharing. It seems I can only train with batch_size = 1 on my current GPU. Strangely, though, I came across a RecursionError when I ran the PVRCNN command. The command I ran is this:
python train.py --cfg_file ./cfgs/da-nuscenes-kitti_models/pvrcnn_st3d/pvrcnn_st3d.yaml --batch_size 1 --epochs 50 --extra_tag st3d_infos

And after 2283 out of 3712 iterations in the train loop, the program stops with this RecursionError:
RecursionError: maximum recursion depth exceeded while calling a Python object

It seems these 2 lines of code call each other alternately and indefinitely:
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/datasets/dataset.py", line 253, in prepare_data
    return self.__getitem__(new_index)
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/datasets/kitti/kitti_dataset.py", line 421, in __getitem__
    data_dict = self.prepare_data(data_dict=input_dict)

Here's the error:
[2021-09-29 22:51:14,537 train.py 174 INFO] Start training cfgs/da-nuscenes-kitti_models/pvrcnn_st3d/pvrcnn_st3d(st3d_infos)
[2021-09-29 22:51:14,935 train_st_utils.py 103 INFO] ==> Loading pseudo labels from /home/wangweijia/Desktop/UDA/ST3D/output/cfgs/da-nuscenes-kitti_models/pvrcnn_st3d/pvrcnn_st3d/st3d_infos/ps_label/ps_label_e0.pkl
epochs: 0%| | 0/50 [00:00<?, ?it/s]
/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/models/roi_heads/pvrcnn_head.py:135: UserWarning: This overload of nonzero is deprecated: | 0/3712 [00:00<?, ?it/s]
nonzero()
Consider using one of the following signatures instead:
nonzero(*, bool as_tuple) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:882.)
dense_idx = faked_features.nonzero() # (N, 3) [x_idx, y_idx, z_idx]
generate_ps_e0: 100%|████████████████████████████████████████████████████████████████████████████████████████| 3712/3712 [16:46<00:00, 3.69it/s, pos_ps_box=20.000(20.001), ign_ps_box=0.000(0.001)]
epochs: 0%| | 0/50 [1:25:13<?, ?it/s, st_loss=2.664(4.211), pos_ps_box=19.8, ign_ps_box=0]
Traceback (most recent call last):█████████████████████████████████████████▎ | 2283/3712 [1:07:44<25:31, 1.07s/it, total_it=2283, pos_ps_box=19, ign_ps_box=0]
  File "train.py", line 205, in <module>
    main()
  File "train.py", line 197, in main
    ema_model=None
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/train_utils/train_st_utils.py", line 157, in train_model_st
    dataloader_iter=dataloader_iter, ema_model=ema_model
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/train_utils/train_st_utils.py", line 42, in train_one_epoch_st
    target_batch = next(dataloader_iter)
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RecursionError: Caught RecursionError in DataLoader worker process 3.
Original Traceback (most recent call last):
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/datasets/kitti/kitti_dataset.py", line 421, in __getitem__
    data_dict = self.prepare_data(data_dict=input_dict)
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/datasets/dataset.py", line 253, in prepare_data
    return self.__getitem__(new_index)
(the above 2 lines repeat LOTS of times)
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/datasets/kitti/kitti_dataset.py", line 421, in __getitem__
    data_dict = self.prepare_data(data_dict=input_dict)
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/datasets/dataset.py", line 248, in prepare_data
    data_dict=data_dict
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/datasets/processor/data_processor.py", line 134, in forward
    data_dict = cur_processor(data_dict=data_dict)
  File "/home/wangweijia/Desktop/UDA/ST3D/tools/../pcdet/datasets/processor/data_processor.py", line 74, in transform_points_to_voxels
    voxel_output = voxel_generator.generate(points)
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/spconv/utils/__init__.py", line 258, in generate
    self.height_high_threshold)
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/spconv/utils/__init__.py", line 75, in points_to_voxel
    voxelmap_shape = tuple(np.round(voxelmap_shape).astype(np.int32).tolist())
  File "<__array_function__ internals>", line 6, in round
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 3637, in round
    return around(a, decimals=decimals, out=out)
  File "<__array_function__ internals>", line 6, in around
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 3262, in around
    return _wrapfunc(a, 'round', decimals=decimals, out=out)
  File "/home/wangweijia/anaconda3/envs/st3d/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 58, in _wrapfunc
    return bound(*args, **kwds)
RecursionError: maximum recursion depth exceeded while calling a Python object

I'll continue looking into this, but please let me know if you've come across similar issues. By the way, I'm using ST3D's version with OpenPCDet 0.3.
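For illustration, the pattern in the traceback (prepare_data calling __getitem__ with a new index, which calls prepare_data again) can be contrasted with a bounded loop. This is a toy sketch, not the pcdet code: ToyDataset, its methods, and the rule "a sample with zero boxes triggers a re-sample" are all assumptions, since the thread does not show which condition leads to new_index.

```python
import random


class ToyDataset:
    """Hypothetical stand-in for a detection dataset (not the pcdet class)."""

    def __init__(self, boxes_per_sample, max_retries=100, seed=0):
        self.boxes_per_sample = boxes_per_sample  # e.g. [0, 0, 3, 0]
        self.max_retries = max_retries
        self.rng = random.Random(seed)

    def __len__(self):
        return len(self.boxes_per_sample)

    def _load(self, index):
        return {"index": index, "num_boxes": self.boxes_per_sample[index]}

    def getitem_recursive(self, index):
        # Re-enter on a new random index; each retry adds stack frames.
        data = self._load(index)
        if data["num_boxes"] == 0:
            return self.getitem_recursive(self.rng.randrange(len(self)))
        return data

    def getitem_iterative(self, index):
        # Same re-sampling policy as a bounded loop: no stack growth,
        # and an explicit error instead of a RecursionError.
        for _ in range(self.max_retries + 1):
            data = self._load(index)
            if data["num_boxes"] > 0:
                return data
            index = self.rng.randrange(len(self))
        raise RuntimeError(
            "no non-empty sample found after %d retries" % self.max_retries)


mixed = ToyDataset([0, 0, 3, 0])
print(mixed.getitem_iterative(0))

all_empty = ToyDataset([0, 0, 0])
try:
    all_empty.getitem_recursive(0)
except RecursionError as err:
    print("recursive version:", type(err).__name__)
```

If most samples are rejected, the recursive form can exhaust the interpreter's recursion limit even though each individual call is cheap; the loop fails with a clear message instead. Whether that is what happens here would need to be checked against the actual prepare_data code and the pseudo labels.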


jihanyang avatar jihanyang commented on August 12, 2024

It seems that some other issues have discussed this situation; you can refer to them. I have found some errors in ST3D with OpenPCDet v0.3.0, so I suggest you re-pull the repo and install it with OpenPCDet v0.2.0 for reproduction.


weijiawang96 avatar weijiawang96 commented on August 12, 2024
