torchpack's People

Contributors

hanrui-wang, ralphmao, raunakdoesdev, zhijian-liu

torchpack's Issues

KeyError: 'MASTER_HOST'

Hi @zhijian-liu, when I use SPVCNN for another task and try to run it on multiple GPUs on a cloud server machine, I get KeyError: 'MASTER_HOST'.
Environment:

ubuntu 18.04
python 3.7.15
mpi4py 3.1.4
torchpack 0.3.1

The error is shown below:

--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            gz-tqr5m-35554-worker-0
  Device name:           mlx5_2
  Device vendor ID:      0x02c9
  Device vendor part ID: 4123

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   gz-tqr5m-35554-worker-0
  Local device: mlx5_2
--------------------------------------------------------------------------
Traceback (most recent call last):
  File "monoscene/scripts/train_monoscene.py", line 54, in main
    dist.init()
  File "/data/packages/anaconda3/envs/monoscene/lib/python3.7/site-packages/torchpack/distributed/context.py", line 23, in init
    master_host = 'tcp://' + os.environ['MASTER_HOST']
  File "/data/packages/anaconda3/envs/monoscene/lib/python3.7/os.py", line 681, in __getitem__
    raise KeyError(key) from None
KeyError: 'MASTER_HOST'
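
The traceback shows that init() reads os.environ['MASTER_HOST'] directly, so a missing variable surfaces as a bare KeyError. A minimal sketch of a more explicit check follows. The helper name resolve_master_host is hypothetical, and the hint in the error message assumes (not confirmed by this issue) that the variable is normally exported by the `torchpack dist-run` launcher rather than by the training script.

```python
import os
from typing import Mapping, Optional


def resolve_master_host(environ: Optional[Mapping[str, str]] = None) -> str:
    """Build the 'tcp://...' address from MASTER_HOST, failing with a clear message.

    Hypothetical helper: it mirrors the line shown in the traceback
    ('tcp://' + os.environ['MASTER_HOST']) but explains what is missing.
    """
    environ = os.environ if environ is None else environ
    host = environ.get("MASTER_HOST")
    if not host:
        # Assumption: the launcher is what normally sets this variable.
        raise RuntimeError(
            "MASTER_HOST is not set. Was the script started through "
            "'torchpack dist-run' (or with MASTER_HOST exported manually)?"
        )
    return "tcp://" + host
```

Calling resolve_master_host() before dist.init() would turn the KeyError into a message that points at the launch method.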

Assertion Error caused by tqdm.auto

I get the following AssertionError while training with torchpack. Someone says it is caused by tqdm.auto, and that the line from tqdm.auto import tqdm in progress.py should be changed to from tqdm import tqdm. Is this right?

Traceback (most recent call last):
  File "miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
    self._shutdown_workers()
  File "miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
    if w.is_alive():
  File "miniconda3/envs/torch/lib/python3.9/multiprocessing/process.py", line 160, in is_alive
    assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process

Importing torchpack.callbacks raises a "Failed to import tensorflow" warning

Hi Zhijian,

I found that when I import torchpack.callbacks with the following code:

from torchpack.callbacks import Callback

def main():
    print('main called')
    
if __name__=='__main__':
    main()

a warning is raised:

Failed to import tensorflow.
main called

I checked the callback classes, but didn't find where tensorflow is imported, so I am confused why such warning is raised. Any help would be appreciated!

Best regards,
Liancheng Fang

Multi Node training

Can you suggest how to implement multi-GPU, multi-node training with torchpack?

I have set -H ip1:gpus,ip2:gpus and launched training from both nodes, but they do not seem to find one another. What am I missing here?

`comm.py` should maybe consider backend-specific support of different devices

Depending on the backend, distributed communication may only be supported on either CPU or GPU, see table here.

Right now, in comm.py communication is always done on the GPU, see here e.g.:

# serialize
if context.rank() == src:
    tensor = _serialize(obj).cuda()

I would suggest considering the backend-specific device support for both allgather() and broadcast() to ensure the functions are usable across multiple backends.

torch.distributed.broadcast_object_list and torch.distributed.all_gather_object might be useful starting points for this.
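
A rough sketch of the suggestion above: pick the device from the backend name instead of always calling .cuda(). The function device_for_backend is hypothetical and not part of torchpack. The mapping is an assumption based on the usual PyTorch guidance: NCCL communicates CUDA tensors, while Gloo and MPI are used with CPU tensors here (GPU support for MPI depends on a CUDA-aware build).

```python
def device_for_backend(backend: str) -> str:
    """Return the device type on which tensors should be placed before communication.

    Hypothetical helper; the mapping is a conservative assumption, not torchpack's logic.
    """
    name = backend.lower()
    if name == "nccl":
        # NCCL collectives operate on CUDA tensors.
        return "cuda"
    if name in ("gloo", "mpi"):
        # Conservative choice: keep tensors on the CPU for these backends.
        return "cpu"
    raise ValueError(f"unknown distributed backend: {backend!r}")
```

In comm.py this could replace the hard-coded call, for example _serialize(obj).to(device_for_backend(torch.distributed.get_backend())), assuming the process group has already been initialized.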

Is there any built-in visualization support for the losses?

I am using the e3d repo to train SPVCNN with torchpack.
Is there any built-in support for visualizing the losses?
If not, please point me to an example of using a visualization framework (e.g. TensorBoard) with torchpack.

Thank you.

Is it possible to use without MPI?

I would like to avoid any MPI setup.
I'm using only 1 machine, so I guess that I won't get any performance boost by using MPI, right?
Is there any configuration parameter or similar that we can use to avoid needing MPI?
Thanks!

save and load

Hi, I'm trying to use torchpack to simplify my model, but I got confused while using the save and load part.
Now I can save the .pt file under the checkpoints dir using 'Saver()', but I don't know how to load the .pt file. Could you please give me some advice? Thank you!

No output

When I run the torchpack command like this:
torchpack dist-run -np 1 python train.py
it has been running for several days, but there is no output. Can you please help me understand why?
Even with batch_size=1 there is still no output.

AMP Support

Hi, I wonder if we could add in AMP support now that torchsparse supports mixed precision. I think it would just require addition of GradScaler and an amp.autocast block.

Package not initialized

Hi Zhijian, I found that you might have forgotten to import all the packages in __init__.py.
[image]

I imported them manually in the __init__.py file.
[image]

Or is there a reason why you left the packages unimported?
Anyway, thanks for the good work!

Address already used

[image]
[image]
I printed the TCP address. From the results, setting the port does not seem to have any effect.

ModuleNotFoundError: No module named 'torchpack.utils.tqdm'

While attempting to run BevFusion (https://github.com/mit-han-lab/bevfusion) for visualization purposes, I encountered a ModuleNotFoundError. Trying different versions of torchpack, such as 0.3.1 and 0.3.0, did not resolve the issue. I believe the problem is not related to the BevFusion code itself, since its training and evaluation components work correctly. The complete error message is as follows:

Traceback (most recent call last):
  File "tools/visualize.py", line 13, in <module>
    from torchpack.utils.tqdm import tqdm
ModuleNotFoundError: No module named 'torchpack.utils.tqdm'

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[32545,1],0]
Exit code: 1

Any suggestion is appreciated.

Distributed train problem

When I run the command 'torchpack dist-run -np 3 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml', I get the following error. Could you please tell me how I can resolve this problem? Thanks very much!

ssh: Could not resolve hostname localhost:3: Name or service not known

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

  • not finding the required libraries and/or binaries on
    one or more nodes. Please check your PATH and LD_LIBRARY_PATH
    settings, or configure OMPI with --enable-orterun-prefix-by-default

  • lack of authority to execute on one or more specified nodes.
    Please verify your allocation and authorities.

  • the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
    Please check with your sys admin to determine the correct location to use.

  • compilation of the orted with dynamic libraries when static are required
    (e.g., on Cray). Please check your configure cmd line and consider using
    one of the contrib/platform definitions for your system type.

  • an inability to create a connection back to mpirun due to a
    lack of common network interfaces and/or no route found between
    them. Please check network connectivity (including firewalls
    and network routing requirements).


Warning messages while trying to use ZeroRedundancyOptimizer

I am training the SPVCNN model built using torchsparse and trained using torchpack wrapper.
While trying to use ZeroRedundancyOptimizer as follows

optimizer = ZeroRedundancyOptimizer(params=model.parameters(),
                                    optim=torch.optim.SGD,
                                    lr=configs.optimizer.lr,
                                    momentum=configs.optimizer.momentum,
                                    weight_decay=configs.optimizer.weight_decay,
                                    nesterov=configs.optimizer.nesterov)

and running training using the command
torchpack dist-run -np 1 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml --run-dir runs/test

I see the following warnings right before the checkpoints are saved
WARNING:root:Optimizer state has not been consolidated. Returning the local state
WARNING:root:Please call consolidate_state_dict() beforehand if you meant to save the global state

  • GCC: 8.4.0
  • NVCC: 10.2.89
  • PyTorch: 1.8.1+cu102
  • PyTorch CUDA: 10.2
  • TorchSparse: 1.2.0
  • Torchpack: 0.3.0

support configs which include lists in `config.update`

We should probably eventually support lists as well here since this might lead to some unexpected behaviors if we load yaml files that include lists of dicts.

def update(self, other: Dict) -> None:
    for key, value in other.items():
        if isinstance(value, dict):
            if key not in self or not isinstance(self[key], Config):
                self[key] = Config()
            self[key].update(value)
        else:
            self[key] = value
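
One possible way to handle lists, sketched under assumptions: the Config class below is a minimal dict-based stand-in written for illustration, not torchpack's actual class, and _convert is a hypothetical helper. Dicts nested inside lists are converted to Config objects, while the list itself replaces any existing value rather than being merged element by element.

```python
from typing import Any, Dict


class Config(dict):
    """Minimal stand-in for torchpack's Config, for illustration only."""

    def update(self, other: Dict) -> None:
        for key, value in other.items():
            if isinstance(value, dict):
                # Same merge behavior as the snippet in the issue.
                if key not in self or not isinstance(self[key], Config):
                    self[key] = Config()
                self[key].update(value)
            else:
                # New: convert dicts that are nested inside lists.
                self[key] = _convert(value)


def _convert(value: Any) -> Any:
    """Recursively turn dicts (including those inside lists) into Config objects."""
    if isinstance(value, dict):
        config = Config()
        config.update(value)
        return config
    if isinstance(value, list):
        return [_convert(item) for item in value]
    return value
```

With this sketch, loading {"layers": [{"width": 32}, {"width": 64}]} yields a list of Config objects instead of a list of plain dicts.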

Installing Torchpack removing older version of Torch

Collecting torch>=1.5.0
Using cached torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl (750.6 MB)


Attempting uninstall: torch
Found existing installation: torch 1.10.2
Uninstalling torch-1.10.2:
Successfully uninstalled torch-1.10.2
Successfully installed ** torch-1.11.0 **

command error

Hi! When I run torchpack dist-run -np 2 python train.py configs/s3dis/from_scratch.yaml --run-dir ./s3dis_out, there is an error:
<[mpiexec@cappuccino-Super-Server] match_arg (utils/args/args.c:163): unrecognized argument allow-run-as-root
[mpiexec@cappuccino-Super-Server] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@cappuccino-Super-Server] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@cappuccino-Super-Server] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
[mpiexec@cappuccino-Super-Server] main (ui/mpich/mpiexec.c:148): error parsing parameters>

CUDA error from torchpack while saving models

While running training using torchpack:
torchpack dist-run -np 2 python train.py configs/spvcnn/cr0p64.yaml --run-dir runs/static_model

Environment details

Package       Version
Torchpack     0.3.0
Pytorch       1.7.0
Torchsparse   1.1.0
Cuda          10.0

Getting an error while saving the model using the Saver callback. The error is thrown from the torchpack/distributed/comm.py file as mentioned in the traceback.

Traceback
Traceback (most recent call last):
  File "train.py", line 141, in <module>
    main()
  File "train.py", line 136, in main
    Saver(max_to_keep=10),
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/train/trainer.py", line 39, in train_with_defaults
    callbacks=callbacks)
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/train/trainer.py", line 88, in train
    self.trigger_epoch()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/train/trainer.py", line 156, in trigger_epoch
    self.callbacks.trigger_epoch()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 90, in trigger_epoch
    self._trigger_epoch()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 308, in _trigger_epoch
    callback.trigger_epoch()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 90, in trigger_epoch
    self._trigger_epoch()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/inference.py", line 29, in _trigger_epoch
    self._trigger()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/inference.py", line 41, in _trigger
    self.callbacks.after_epoch()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 80, in after_epoch
    self._after_epoch()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 304, in _after_epoch
    callback.after_epoch()
  File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 80, in after_epoch
    self._after_epoch()
  File "/home/SandeepMenon/e3d_pcd/spvnas/core/callbacks.py", line 54, in _after_epoch
    self.total_seen[i] = dist.allreduce(self.total_seen[i], reduction='sum')
  File "/home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/distributed/comm.py", line 13, in allreduce
    data = allgather(data)
  File "/home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/distributed/comm.py", line 32, in allgather
    sizes = [int(size.item()) for size in sizes]
  File "/home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/distributed/comm.py", line 32, in <listcomp>
    sizes = [int(size.item()) for size in sizes]
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f40abc888b2 in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f40b5f7d952 in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f40abc73b7d in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fec3a (0x7f40b6bb4c3a in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fece6 (0x7f40b6bb4ce6 in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: python() [0x54edf6]
frame #6: python() [0x588fd8]
frame #7: python() [0x5add78]
frame #8: python() [0x5add8e]
frame #9: python() [0x5add8e]
frame #10: python() [0x5add8e]
frame #11: python() [0x5add8e]
frame #12: python() [0x5add8e]
frame #13: python() [0x5add8e]
frame #14: python() [0x5add8e]
frame #15: python() [0x5add8e]
frame #16: python() [0x5add8e]
frame #17: python() [0x5add8e]
frame #18: python() [0x5add8e]
frame #19: python() [0x5add8e]
frame #20: python() [0x5add8e]
frame #21: python() [0x5add8e]
frame #22: python() [0x5add8e]
frame #23: python() [0x56b606]
<omitting python frames>
frame #29: __libc_start_main + 0xe7 (0x7f40bc6c1bf7 in /lib/x86_64-linux-gnu/libc.so.6)

[deepen-Z11PA-U12-Series:24283] *** Process received signal ***
[deepen-Z11PA-U12-Series:24283] Signal: Aborted (6)
[deepen-Z11PA-U12-Series:24283] Signal code:  (-6)
[deepen-Z11PA-U12-Series:24283] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f40bc6df040]
[deepen-Z11PA-U12-Series:24283] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f40bc6defb7]
[deepen-Z11PA-U12-Series:24283] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f40bc6e0921]
[deepen-Z11PA-U12-Series:24283] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f40b7b42957]
[deepen-Z11PA-U12-Series:24283] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7f40b7b48ae6]
[deepen-Z11PA-U12-Series:24283] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x91b49)[0x7f40b7b47b49]
[deepen-Z11PA-U12-Series:24283] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a8)[0x7f40b7b484b8]
[deepen-Z11PA-U12-Series:24283] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10573)[0x7f40b78ae573]
[deepen-Z11PA-U12-Series:24283] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x125)[0x7f40b78aedf5]
[deepen-Z11PA-U12-Series:24283] [ 9] /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10_cuda.so(_ZN3c104cuda20CUDACachingAllocator10raw_deleteEPv+0x9e9)[0x7f40b5f7d869]
[deepen-Z11PA-U12-Series:24283] [10] /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImpl17release_resourcesEv+0x4d)[0x7f40abc73b7d]
[deepen-Z11PA-U12-Series:24283] [11] /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x5fec3a)[0x7f40b6bb4c3a]
[deepen-Z11PA-U12-Series:24283] [12] /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x5fece6)[0x7f40b6bb4ce6]
[deepen-Z11PA-U12-Series:24283] [13] python[0x54edf6]
[deepen-Z11PA-U12-Series:24283] [14] python[0x588fd8]
[deepen-Z11PA-U12-Series:24283] [15] python[0x5add78]
[deepen-Z11PA-U12-Series:24283] [16] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [17] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [18] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [19] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [20] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [21] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [22] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [23] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [24] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [25] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [26] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [27] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [28] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [29] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[63105,1],1]
  Exit code:    134

torchpack stalls on startup

Hi! I have a strange problem with torchpack: it stalls on startup. I am evaluating SPVCNN models, and when I launch the script it sometimes starts, but sometimes it just stalls and does nothing.

torchpack dist-run -np 1 python evaluate.py configs.yaml --name run-1fb33cb7

There are no errors, python evaluate.py process is started but there is no CPU or GPU load. I will then hit ctrl+C and start the same command again and it will run eventually (at least after a few tries). But it is very annoying. Some kind of deadlock? Do you have any suggestions on how to debug this and find the cause? Are there any debug parameters for torchpack?

config.load() can lead to infinite loop in recursive mode

I observed that the load() function from torchpack/utils/config.py can be stuck in an infinite loop when recursive mode is enabled. In particular this happens when the path to the config is given as an absolute path instead of a relative path. When the top-level directory "/" is reached, os.path.dirname("/") will again return "/" and the while loop never ends.
The example below shows how the variable fpath evolves in load():

/code/configs/config.yaml
/code/configs
/code
/
/
... (infinite)
/

The program eventually crashes with a memory error, because paths are appended to the list fpaths without bound.
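
A minimal sketch of a walk over parent directories that terminates at the root. The function ancestor_paths is hypothetical and only illustrates the stopping condition; it does not reproduce what load() does with each path. The key check is that os.path.dirname returns its input unchanged at the root (and an empty string for a bare relative name), which signals that there is no further parent.

```python
import os
from typing import List


def ancestor_paths(fpath: str) -> List[str]:
    """Return fpath followed by each parent directory, stopping at the top."""
    paths = []
    current = os.path.normpath(fpath)
    while True:
        paths.append(current)
        parent = os.path.dirname(current)
        # At "/" dirname returns "/" again; for a bare relative name it returns "".
        if parent == current or parent == "":
            break
        current = parent
    return paths
```

For the absolute path in the example above, the walk visits /code/configs/config.yaml, /code/configs, /code and / exactly once each and then stops.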

torchpack without command line

I have been using torchpack for my experiments. Now I need to integrate my model training code into another codebase built on Bazel.
I am not able to use the torchpack command line tool and run like

torchpack dist-run -np 1 python train.py

I have installed torchpack using pip. Is there a way to replicate the same behavior through the Python executable
(similar to python3 -m torch.distributed.launch train.py)?

how to set port

When I run bevfusion, the error tells me that the address is already in use.
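
When debugging an "address already in use" error, it can help to confirm whether the port is actually taken on the machine. The sketch below uses only the Python standard library. The function names are hypothetical, and how a chosen port is passed on to torchpack is not shown in this issue, so that part is deliberately left out.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the operating system for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS pick a free port
        return sock.getsockname()[1]
```

Note that a port reported as free can still be claimed by another process before it is used, so this is a diagnostic aid rather than a guarantee.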

Hyperparameter Callback

Design a callback to enable updating hyperparameters/the config object based off of a score over time.

No module named torchpack.launch.assets

I used pip3 to install torchpack. I have both python2 and python3 on my system, and I prefer that /usr/bin/python points to python2. How do I tell torchpack to look for python3?

FileNotFoundError: [Errno 2] No such file or directory: '/bin/sh'

Hello, when I run bevfusion from MIT I get:

torchpack\launch\launchers\drunner.py", line 104, in main
    os.execve('/bin/sh', ['/bin/sh', '-c', command], environ)
FileNotFoundError: [Errno 2] No such file or directory: '/bin/sh'

Windows does not have this file. How should os.execve('/bin/sh', ['/bin/sh', '-c', command], environ) be changed to work on Windows?
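
One portable alternative, sketched here as an assumption rather than as torchpack's own fix, is to let subprocess choose the platform's default shell (/bin/sh on POSIX, the command interpreter on Windows). The function run_shell_command is hypothetical. Unlike os.execve, it does not replace the current process; it starts a child and returns its exit code.

```python
import subprocess
from typing import Mapping, Optional


def run_shell_command(command: str, environ: Optional[Mapping[str, str]] = None) -> int:
    """Run `command` through the platform's default shell and return its exit code.

    If `environ` is None the child inherits the current environment.
    """
    env = None if environ is None else dict(environ)
    completed = subprocess.run(command, shell=True, env=env)
    return completed.returncode
```

A launcher could then finish with sys.exit(run_shell_command(command, environ)) to propagate the child's exit status. Whether the rest of the launcher works on Windows is a separate question that this sketch does not address.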

torchpack usage on multiple nodes on slurm cluster

Thanks for providing this package. I am successfully able to use the torchpack dist-run -np ${_ngpu} command on a slurm cluster when using only 1 node.
Could you please explain how to use this with multiple nodes. I assume it involves setting the --hosts parameter but I'm not able to figure out how to identify the allocated nodes from the slurm script.
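
A small sketch of how the host list could be assembled. It assumes the host:slots format quoted in the multi-node issue above (-H ip1:gpus,ip2:gpus), and that the node names are available one per line, as printed by scontrol show hostnames "$SLURM_JOB_NODELIST" inside a job script. Both function names are hypothetical.

```python
from typing import Iterable, List


def parse_hostnames(scontrol_output: str) -> List[str]:
    """Split one-hostname-per-line text into a list, skipping blank lines."""
    return [line.strip() for line in scontrol_output.splitlines() if line.strip()]


def build_hosts_arg(hostnames: Iterable[str], gpus_per_node: int) -> str:
    """Format hostnames as 'host1:N,host2:N' for a hosts-style argument."""
    if gpus_per_node < 1:
        raise ValueError("gpus_per_node must be at least 1")
    return ",".join(f"{name}:{gpus_per_node}" for name in hostnames)
```

For two nodes with four GPUs each, the result would be passed as the hosts argument, with -np set to the total number of processes (8 in that case).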

YAML Formatting

YAML lists should be formatted compactly as [1, 2, 3, 4] rather than

- 1
- 2
- 3
- 4
