Comments (13)

tianweiy commented on September 13, 2024

Can you tell me your torch, CUDA, and spconv versions? Also, what other changes (if any) did you make to the code? Unfortunately, I can't reproduce this error. (I guess it occurs a few hours into training?)
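One way to gather the version information asked for here is a short script like the one below. It is only a sketch: it assumes Python 3.8 or newer (for importlib.metadata) and that spconv was installed under the distribution name "spconv"; a source build registered under another name would show up as None. The helper names are made up for this example.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_versions(packages=("torch", "spconv")):
    """Gather the Python version plus the version of each named package."""
    info = {"python": platform.python_version()}
    for name in packages:
        info[name] = package_version(name)
    return info


def cuda_summary():
    """Report the CUDA version torch was built with and the visible GPUs."""
    try:
        import torch
    except ImportError:
        # torch is not installed, so nothing can be reported
        return {"cuda": None, "gpus": []}
    gpus = []
    if torch.cuda.is_available():
        gpus = [torch.cuda.get_device_name(i)
                for i in range(torch.cuda.device_count())]
    return {"cuda": torch.version.cuda, "gpus": gpus}


if __name__ == "__main__":
    for key, value in {**collect_versions(), **cuda_summary()}.items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue gives the environment details in one place.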

tianweiy commented on September 13, 2024

Regarding "cuda execution failed with error 2": I am not 100 percent sure, but CUDA error 2 seems to mean that you are out of memory.

DeepakVellampalli commented on September 13, 2024

Sorry for the late reply.
I was using torch version 1.1 and spconv version 1.0.
I replicated the setup described in the installation instructions.
I ran your PointPillars config successfully without any hurdles.
But this config file uses the spconv module, and the spconv module is crashing.
Moreover, I am training with sweeps=1, so I commented out lines 87-97 in https://github.com/tianweiy/CenterPoint/blob/master/det3d/datasets/pipelines/loading.py

Apart from this, there are no changes to the code.
Kindly help.

tianweiy commented on September 13, 2024

You don't need to comment out the loading code. Just change the nsweep field in the config to 1. Also, from the error log I suspect this is a GPU out-of-memory issue. Can you check this?
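A rough way to check for memory pressure is to poll nvidia-smi while training is running. The sketch below assumes nvidia-smi is on the PATH and uses its --query-gpu CSV output; the function names are made up for this example, and it is not part of CenterPoint.

```python
import shutil
import subprocess

QUERY = "index,name,memory.used,memory.total"


def parse_gpu_memory(csv_text):
    """Parse `--format=csv,noheader,nounits` output into dicts (values in MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        index, name, used, total = [field.strip() for field in line.split(",")]
        rows.append({"index": int(index), "name": name,
                     "used_mib": int(used), "total_mib": int(total)})
    return rows


def query_gpu_memory():
    """Run nvidia-smi if it is on PATH; return [] when it is not available."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_memory(out)


if __name__ == "__main__":
    for gpu in query_gpu_memory():
        share = 100.0 * gpu["used_mib"] / gpu["total_mib"]
        print(f"GPU {gpu['index']} ({gpu['name']}): "
              f"{gpu['used_mib']}/{gpu['total_mib']} MiB ({share:.0f}%)")
```

If the used figure climbs close to the total shortly before the crash, that supports the out-of-memory explanation. Because the output is per GPU, it also shows whether one card is loaded more heavily than the others.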

AbdeslemSmahi commented on September 13, 2024

How can I reduce memory usage in the test phase?

tianweiy commented on September 13, 2024

@AbdeslemSmahi The simplest way is to add the --speed_test flag during testing. By default, this uses batch size 1. I am not sure how to go beyond this.

AbdeslemSmahi commented on September 13, 2024

Even that didn't work.

tianweiy commented on September 13, 2024

You probably need a larger GPU then, or try the PointPillars model, which takes less memory.

tianweiy commented on September 13, 2024

Closing for now. Feel free to reopen if you still have questions.

ZiyuXiong commented on September 13, 2024

@AbdeslemSmahi @tianweiy
Hi, I encountered the same problem with the config nusc_centerpoint_voxelnet_dcn_0075voxel_flip_circle_nms.py, but it works fine with the config nusc_centerpoint_pp_dcn_02voxel_circle_nms.py. Have you solved this problem?
All training was done on 2 Titan V GPUs (I also tested 4 Titan V GPUs, which failed as well), and I noticed that the first GPU seems to use more GPU memory than the second.
Is there any chance that the distributed launch assigns data loading only to the first GPU?

tianweiy commented on September 13, 2024

Hi, the 0075 VoxelNet will definitely take much more memory than PP. Can you train the 0.1 voxel size model? You can also decrease the batch size a bit; I don't think this matters much for performance.
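To get a feel for why the smaller voxel size costs more memory, here is a back-of-the-envelope sketch that counts bird's-eye-view grid cells for the two voxel sizes. The 102.4 m extent is a made-up value for illustration, not taken from the configs. Sparse convolution memory depends on how many voxels are actually occupied, so treat this only as a rough indicator for the dense BEV feature maps.

```python
def bev_cells(extent_x_m, extent_y_m, voxel_size_m):
    """Number of bird's-eye-view grid cells for a square voxel footprint."""
    nx = round(extent_x_m / voxel_size_m)
    ny = round(extent_y_m / voxel_size_m)
    return nx * ny


EXTENT_M = 102.4  # hypothetical detection range per axis, for illustration only

cells_fine = bev_cells(EXTENT_M, EXTENT_M, 0.075)
cells_coarse = bev_cells(EXTENT_M, EXTENT_M, 0.1)

if __name__ == "__main__":
    print(f"0.075 m voxels: {cells_fine} cells")
    print(f"0.1 m voxels:   {cells_coarse} cells")
    print(f"ratio: {cells_fine / cells_coarse:.2f}")
```

Over the same area, going from 0.1 m to 0.075 m voxels increases the cell count by roughly (0.1 / 0.075)^2, a bit under 1.8 times. This fits with a batch size that works at 0.1 no longer fitting at 0.075.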

As for distributed data parallel, does your model work with a single GPU?

Also, spconv (VoxelNet) seems to behave quite strangely on Titan V. I have tried training VoxelNet on Titan Xp, Titan RTX, 2070/2080, V100, and Titan V. All the other GPUs work, but on Titan V I can't even use batch size 2 for a KITTI model. I feel this is a bug in spconv. Do let me know if your Titan V works well with spconv.

ZiyuXiong commented on September 13, 2024

@tianweiy Thank you for your reply. I followed your advice and the results are:

  1. voxel_size=0.1, batch_size=4, Titan V, nproc_per_node=2: failed (cuda execution failed with error 2)
  2. voxel_size=0.1, batch_size=4, single Titan V: failed (cuda execution failed with error 2)
  3. voxel_size=0.075 (nusc_centerpoint_voxelnet_dcn_0075voxel_flip_circle_nms.py), batch_size=4, Titan Xp, nproc_per_node=2: failed (GPU out of memory)
  4. voxel_size=0.075 (nusc_centerpoint_voxelnet_dcn_0075voxel_flip_circle_nms.py), batch_size=4, Titan Xp, nproc_per_node=2: succeeded

It seems that spconv does not work on Titan V (when VoxelNet is involved), and running the config with the small voxel size does indeed take a large amount of memory. I have now reduced the batch size to 2 and it works, so nothing weird is happening for the moment.
Thank you again for your timely and detailed reply!

tianweiy commented on September 13, 2024

Sure, good luck with your project.
