zhijian-liu / torchpack
A neural network training interface based on PyTorch, with a focus on flexibility
Home Page: https://pypi.org/project/torchpack/
License: MIT License
Hi @zhijian-liu, when I use SPVCNN for another task and try to run it on multiple GPUs on a cloud server machine, I get KeyError: 'MASTER_HOST'.
environment:
ubuntu 18.04
python 3.7.15
mpi4py 3.1.4
torchpack 0.3.1
The error is shown below:
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: gz-tqr5m-35554-worker-0
Device name: mlx5_2
Device vendor ID: 0x02c9
Device vendor part ID: 4123
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: gz-tqr5m-35554-worker-0
Local device: mlx5_2
--------------------------------------------------------------------------
Traceback (most recent call last):
File "monoscene/scripts/train_monoscene.py", line 54, in main
dist.init()
File "/data/packages/anaconda3/envs/monoscene/lib/python3.7/site-packages/torchpack/distributed/context.py", line 23, in init
master_host = 'tcp://' + os.environ['MASTER_HOST']
File "/data/packages/anaconda3/envs/monoscene/lib/python3.7/os.py", line 681, in __getitem__
raise KeyError(key) from None
KeyError: 'MASTER_HOST'
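The traceback above shows dist.init() reading os.environ['MASTER_HOST'] directly, so the call fails whenever that variable has not been set before the script starts. A minimal sketch of a pre-flight check that produces a clearer message follows. The helper names are hypothetical and not part of torchpack, and MASTER_HOST is the only variable name confirmed by the traceback.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `required` that are absent from the environment."""
    env = os.environ if environ is None else environ
    return [name for name in required if name not in env]


def require_env(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> None:
    """Raise a descriptive error instead of a bare KeyError."""
    missing = missing_env_vars(required, environ)
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing)
            + ". Was the script started through the distributed launcher?"
        )


# Only MASTER_HOST is known from the traceback; other names would be guesses.
REQUIRED = ("MASTER_HOST",)
```

Calling require_env(REQUIRED) just before dist.init() would turn the KeyError into an explicit hint about how the process was launched.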
I get the following AssertionError while training with torchpack. Someone says it is caused by tqdm.auto, and that the line from tqdm.auto import tqdm in progress.py should be changed to from tqdm import tqdm. Is this right?
Traceback (most recent call last):
File "miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
self._shutdown_workers()
File "miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1320, in _shutdown_workers
if w.is_alive():
File "miniconda3/envs/torch/lib/python3.9/multiprocessing/process.py", line 160, in is_alive
assert self._parent_pid == os.getpid(), 'can only test a child process'
AssertionError: can only test a child process
When I run the bevfusion code, I get the error "No module named tqdm". Why?
Hi Zhijian,
I found when I import torchpack.callbacks with the following code:
from torchpack.callbacks import Callback

def main():
    print('main called')

if __name__ == '__main__':
    main()
A warning is raised:
Failed to import tensorflow.
main called
I checked the callback classes, but didn't find where tensorflow is imported, so I am confused why such warning is raised. Any help would be appreciated!
Best regards,
Liancheng Fang
Can you suggest how to set up multi-GPU, multi-node training with torchpack?
I have set -H ip1:gpus,ip2:gpus and launched the training from both nodes, but they don't seem to be able to find one another. What am I missing here?
Depending on the backend, distributed communication may only be supported on either CPU or GPU, see table here.
Right now, in comm.py, communication is always done on the GPU, see here e.g.:
torchpack/torchpack/distributed/comm.py
Lines 32 to 34 in d3fda52
I would suggest taking the backend-specific device support into account for both allgather() and broadcast() to ensure the functions are usable across multiple backends.
torch.distributed.broadcast_object_list and torch.distributed.all_gather_object might be useful starting points for this.
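The suggestion above amounts to choosing the staging device from the active backend rather than hard-coding the GPU. The sketch below shows one conservative way to express that choice. The function name is hypothetical, and the mapping is an assumption: NCCL is treated as GPU-only, while Gloo and MPI are treated as CPU-only even though Gloo supports some GPU collectives and MPI supports GPU tensors in CUDA-aware builds.

```python
def comm_device(backend: str) -> str:
    """Pick the device type on which to stage tensors for a collective.

    The mapping is deliberately conservative (an assumption, not torchpack's
    behavior): it only uses the device every listed backend is known to support.
    """
    devices = {"nccl": "cuda", "gloo": "cpu", "mpi": "cpu"}
    try:
        return devices[backend.lower()]
    except KeyError:
        raise ValueError(f"unknown backend: {backend!r}") from None
```

In real code the backend string would come from torch.distributed.get_backend(), and the returned device type would replace the fixed GPU placement before the collective is issued.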
I am using the e3d repo to train SPVCNN using torchpack.
Is there any built-in support for visualizing the losses?
If not, please point to an example of using a visualization framework (e.g. TensorBoard) along with torchpack.
Thank you.
I would like to avoid any MPI setup.
I'm using only 1 machine, so I guess that I won't get any performance boost by using MPI, right?
Is there any configuration parameter or similar that we can use to avoid needing MPI?
Thanks!
Hi, I'm trying to use torchpack to simplify my model, but I got confused while using the save and load part.
Now I can save the .pt file under the checkpoints dir using 'Saver()', but I don't know how to load the .pt file. Could you please give me some advice? Thank you!
When I run the torchpack command like this:
torchpack dist-run -np 1 python train.py
It has been running for several days, but there is no output. Can you please help me understand why?
Even with batch_size=1 there is still no output.
Hi, I wonder if we could add AMP support now that torchsparse supports mixed precision. I think it would just require adding a GradScaler and an amp.autocast block.
What license is this released under?
Just FYI source code without a license can't be used by anyone, is that what you intend?
While attempting to run BevFusion (available at bevFusion GitHub(https://github.com/mit-han-lab/bevfusion)) for visualization purposes, I encountered a ModuleNotFoundError. Despite trying different versions of torchpack, such as 0.3.1 and 0.3.0, the issue remained unresolved. However, I believe the problem is not related to the BevFusion code itself, since the training and evaluation components of the code are working correctly. The complete error message is as follows:
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[32545,1],0]
Exit code: 1
Any suggestion is appreciated.
When I run the command 'torchpack dist-run -np 3 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml', I get the following error. Could you please tell me how I can resolve this problem? Thanks very much!
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
I am training the SPVCNN model built using torchsparse and trained using torchpack wrapper.
While trying to use ZeroRedundancyOptimizer as follows
optimizer = ZeroRedundancyOptimizer(params=model.parameters(), optim=torch.optim.SGD,
lr=configs.optimizer.lr,
momentum=configs.optimizer.momentum,
weight_decay=configs.optimizer.weight_decay,
nesterov=configs.optimizer.nesterov)
and running training using the command
torchpack dist-run -np 1 python train.py configs/semantic_kitti/spvcnn/cr0p5.yaml --run-dir runs/test
I see the following warnings right before the checkpoints are saved
WARNING:root:Optimizer state has not been consolidated. Returning the local state
WARNING:root:Please call consolidate_state_dict() beforehand if you meant to save the global state
We should probably eventually support lists as well here since this might lead to some unexpected behaviors if we load yaml files that include lists of dicts.
torchpack/torchpack/utils/config.py
Lines 45 to 52 in d3fda52
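To illustrate the concern about lists of dicts, here is a minimal sketch of a recursive conversion that also descends into lists and tuples. AttrDict and wrap are hypothetical stand-ins and do not reproduce the code in torchpack/utils/config.py; they only show the general idea of wrapping every nested dict, wherever it appears.

```python
from typing import Any


class AttrDict(dict):
    """Hypothetical stand-in for a config object with attribute access."""

    def __getattr__(self, key: str) -> Any:
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None


def wrap(value: Any) -> Any:
    """Recursively convert dicts, including dicts nested in lists or tuples."""
    if isinstance(value, dict):
        return AttrDict({k: wrap(v) for k, v in value.items()})
    if isinstance(value, (list, tuple)):
        # Rebuild the same sequence type with every element converted.
        return type(value)(wrap(v) for v in value)
    return value


cfg = wrap({"stages": [{"lr": 0.1}, {"lr": 0.01}]})
print(cfg.stages[1].lr)  # prints 0.01
```

Without the list branch, cfg.stages[1] would remain a plain dict and attribute access on it would fail, which is the kind of unexpected behavior the issue describes.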
Collecting torch>=1.5.0
Using cached torch-1.11.0-cp38-cp38-manylinux1_x86_64.whl (750.6 MB)
Attempting uninstall: torch
Found existing installation: torch 1.10.2
Uninstalling torch-1.10.2:
Successfully uninstalled torch-1.10.2
Successfully installed ** torch-1.11.0 **
When my program contains the line from torchpack.callbacks import Callbacks, SaverRestore, it crashes with Segmentation fault (core dumped). Does anyone know why? Thanks.
Hi! When I run torchpack dist-run -np 2 python train.py configs/s3dis/from_scratch.yaml --run-dir ./s3dis_out, there is an error:
<[mpiexec@cappuccino-Super-Server] match_arg (utils/args/args.c:163): unrecognized argument allow-run-as-root
[mpiexec@cappuccino-Super-Server] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@cappuccino-Super-Server] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@cappuccino-Super-Server] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
[mpiexec@cappuccino-Super-Server] main (ui/mpich/mpiexec.c:148): error parsing parameters>
While running training using torchpack
torchpack dist-run -np 2 python train.py configs/spvcnn/cr0p64.yaml --run-dir runs/static_model
Environment details
Package | Version |
---|---|
Torchpack | 0.3.0 |
Pytorch | 1.7.0 |
Torchsparse | 1.1.0 |
Cuda | 10.0 |
Getting an error while saving the model using the Saver callback. The error is thrown from the torchpack/distributed/comm.py file as mentioned in the traceback.
Traceback (most recent call last):
File "train.py", line 141, in <module>
main()
File "train.py", line 136, in main
Saver(max_to_keep=10),
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/train/trainer.py", line 39, in train_with_defaults
callbacks=callbacks)
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/train/trainer.py", line 88, in train
self.trigger_epoch()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/train/trainer.py", line 156, in trigger_epoch
self.callbacks.trigger_epoch()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 90, in trigger_epoch
self._trigger_epoch()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 308, in _trigger_epoch
callback.trigger_epoch()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 90, in trigger_epoch
self._trigger_epoch()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/inference.py", line 29, in _trigger_epoch
self._trigger()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/inference.py", line 41, in _trigger
self.callbacks.after_epoch()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 80, in after_epoch
self._after_epoch()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 304, in _after_epoch
callback.after_epoch()
File "/home/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/callbacks/callback.py", line 80, in after_epoch
self._after_epoch()
File "/home/SandeepMenon/e3d_pcd/spvnas/core/callbacks.py", line 54, in _after_epoch
self.total_seen[i] = dist.allreduce(self.total_seen[i], reduction='sum')
File "/home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/distributed/comm.py", line 13, in allreduce
data = allgather(data)
File "/home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/distributed/comm.py", line 32, in allgather
sizes = [int(size.item()) for size in sizes]
File "/home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torchpack/distributed/comm.py", line 32, in <listcomp>
sizes = [int(size.item()) for size in sizes]
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f40abc888b2 in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f40b5f7d952 in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f40abc73b7d in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fec3a (0x7f40b6bb4c3a in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fece6 (0x7f40b6bb4ce6 in /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: python() [0x54edf6]
frame #6: python() [0x588fd8]
frame #7: python() [0x5add78]
frame #8: python() [0x5add8e]
frame #9: python() [0x5add8e]
frame #10: python() [0x5add8e]
frame #11: python() [0x5add8e]
frame #12: python() [0x5add8e]
frame #13: python() [0x5add8e]
frame #14: python() [0x5add8e]
frame #15: python() [0x5add8e]
frame #16: python() [0x5add8e]
frame #17: python() [0x5add8e]
frame #18: python() [0x5add8e]
frame #19: python() [0x5add8e]
frame #20: python() [0x5add8e]
frame #21: python() [0x5add8e]
frame #22: python() [0x5add8e]
frame #23: python() [0x56b606]
<omitting python frames>
frame #29: __libc_start_main + 0xe7 (0x7f40bc6c1bf7 in /lib/x86_64-linux-gnu/libc.so.6)
[deepen-Z11PA-U12-Series:24283] *** Process received signal ***
[deepen-Z11PA-U12-Series:24283] Signal: Aborted (6)
[deepen-Z11PA-U12-Series:24283] Signal code: (-6)
[deepen-Z11PA-U12-Series:24283] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3f040)[0x7f40bc6df040]
[deepen-Z11PA-U12-Series:24283] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f40bc6defb7]
[deepen-Z11PA-U12-Series:24283] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f40bc6e0921]
[deepen-Z11PA-U12-Series:24283] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f40b7b42957]
[deepen-Z11PA-U12-Series:24283] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ae6)[0x7f40b7b48ae6]
[deepen-Z11PA-U12-Series:24283] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x91b49)[0x7f40b7b47b49]
[deepen-Z11PA-U12-Series:24283] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a8)[0x7f40b7b484b8]
[deepen-Z11PA-U12-Series:24283] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10573)[0x7f40b78ae573]
[deepen-Z11PA-U12-Series:24283] [ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x125)[0x7f40b78aedf5]
[deepen-Z11PA-U12-Series:24283] [ 9] /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10_cuda.so(_ZN3c104cuda20CUDACachingAllocator10raw_deleteEPv+0x9e9)[0x7f40b5f7d869]
[deepen-Z11PA-U12-Series:24283] [10] /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libc10.so(_ZN3c1010TensorImpl17release_resourcesEv+0x4d)[0x7f40abc73b7d]
[deepen-Z11PA-U12-Series:24283] [11] /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x5fec3a)[0x7f40b6bb4c3a]
[deepen-Z11PA-U12-Series:24283] [12] /home/deepen/SandeepMenon/venv-e3d/lib/python3.6/site-packages/torch/lib/libtorch_python.so(+0x5fece6)[0x7f40b6bb4ce6]
[deepen-Z11PA-U12-Series:24283] [13] python[0x54edf6]
[deepen-Z11PA-U12-Series:24283] [14] python[0x588fd8]
[deepen-Z11PA-U12-Series:24283] [15] python[0x5add78]
[deepen-Z11PA-U12-Series:24283] [16] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [17] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [18] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [19] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [20] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [21] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [22] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [23] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [24] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [25] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [26] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [27] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [28] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] [29] python[0x5add8e]
[deepen-Z11PA-U12-Series:24283] *** End of error message ***
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[63105,1],1]
Exit code: 134
Hi! I have a strange problem with torchpack: it stalls on startup. I am evaluating SPVCNN models, and when I launch the script it sometimes starts, but sometimes it just stalls and does nothing.
torchpack dist-run -np 1 python evaluate.py configs.yaml --name run-1fb33cb7
There are no errors. The python evaluate.py process is started, but there is no CPU or GPU load. I then hit Ctrl+C and start the same command again, and it eventually runs (at least after a few tries). But it is very annoying. Is it some kind of deadlock? Do you have any suggestions on how to debug this and find the cause? Are there any debug parameters for torchpack?
How can I debug code launched with torchpack in an IDE such as VSCode?
Thx.
As far as I can see, tqdm prints from non-master nodes are being printed by torchpack. This is probably not the desired behavior.
I observed that the load() function from torchpack/utils/config.py can get stuck in an infinite loop when recursive mode is enabled. In particular, this happens when the path to the config is given as an absolute path instead of a relative path. When the top-level directory "/" is reached, os.path.dirname("/") again returns "/", and the while loop never ends.
See the example below for how the variable fpath evolves in load():
/code/configs/config.yaml
/code/configs
/code
/
/
... (infinite)
/
The program eventually crashes with a memory error, because paths are appended to the list fpaths without bound.
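The loop described above terminates if it stops once the parent directory equals the current path (the filesystem root) or becomes empty (the end of a relative path). The sketch below shows that termination condition in isolation. The function name is hypothetical and this is not the actual load() implementation; posixpath is used so that the output matches the example regardless of platform.

```python
import posixpath
from typing import List


def ancestor_paths(fpath: str) -> List[str]:
    """List `fpath` followed by each of its parent directories, then stop."""
    fpaths = [fpath]
    while True:
        parent = posixpath.dirname(fpath)
        # dirname("/") == "/" and dirname("configs") == "": both mean "done".
        if parent == fpath or parent == "":
            break
        fpath = parent
        fpaths.append(fpath)
    return fpaths


print(ancestor_paths("/code/configs/config.yaml"))
# prints ['/code/configs/config.yaml', '/code/configs', '/code', '/']
print(ancestor_paths("configs/config.yaml"))
# prints ['configs/config.yaml', 'configs']
```

The same guard works for both absolute and relative paths, so the list of candidate paths stays finite in either case.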
I have been using torchpack for my experiments. Now I need to integrate my model training code into another codebase built on Bazel.
I am not able to use the torchpack command-line tool there, i.e. to run
torchpack dist-run -np 1 python train.py
I have installed torchpack using pip. Is there a way to replicate the same behavior through the Python executable (similar to python3 -m torch.distributed.launch train.py)?
When I run bevfusion, I get an error saying that the address is already in use.
Design a callback to enable updating hyperparameters/the config object based off of a score over time.
I used pip3 to install torchpack. I have both python2 and python3 on my system, and I prefer that /usr/bin/python points to python2. How do I tell torchpack to look for python3?
Hello, when I run bevfusion from MIT, I get:
torchpack\launch\launchers\drunner.py", line 104, in main
os.execve('/bin/sh', ['/bin/sh', '-c', command], environ)
FileNotFoundError: [Errno 2] No such file or directory: '/bin/sh'
Windows does not have this file. How can I change os.execve('/bin/sh', ['/bin/sh', '-c', command], environ) so that it works under Windows?
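One portable alternative to hard-coding /bin/sh is to let Python pick the platform's default shell through the standard subprocess module. The sketch below is only an illustration with a hypothetical function name, not a patch for drunner.py. Note the behavioral difference: os.execve replaces the current process, whereas subprocess.run starts a child process and waits for it.

```python
import subprocess
from typing import Mapping


def run_in_shell(command: str, environ: Mapping[str, str]) -> int:
    """Run `command` through the platform's default shell.

    With shell=True, Python uses /bin/sh on POSIX systems and the
    interpreter named by COMSPEC (normally cmd.exe) on Windows.
    """
    completed = subprocess.run(command, shell=True, env=dict(environ))
    return completed.returncode
```

A caller that needs the original semantics could pass the returned code to sys.exit() so that the launcher exits with the same status as the command.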
Thanks for providing this package. I am able to use the torchpack dist-run -np ${_ngpu} command successfully on a Slurm cluster when using only 1 node.
Could you please explain how to use this with multiple nodes? I assume it involves setting the --hosts parameter, but I'm not able to figure out how to identify the allocated nodes from the Slurm script.
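As a starting point for the question above, the sketch below builds a host list string from node names. It assumes the host:slots format seen in the -H ip1:gpus,ip2:gpus usage from another issue on this page. The function name and the example node names are hypothetical. On a Slurm cluster the allocated node names can usually be obtained with scontrol show hostnames "$SLURM_JOB_NODELIST".

```python
from typing import Iterable


def format_hosts(nodes: Iterable[str], slots_per_node: int) -> str:
    """Join node names into a 'host:slots,host:slots' string."""
    if slots_per_node < 1:
        raise ValueError("slots_per_node must be at least 1")
    return ",".join(f"{node}:{slots_per_node}" for node in nodes)


# Example node names are made up; real ones come from the Slurm allocation.
print(format_hosts(["node01", "node02"], 4))  # prints node01:4,node02:4
```

Whether the launcher accepts this string unchanged on a given cluster is not confirmed by the issue, so treat it as something to verify.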
YAML lists should be formatted compactly as [1, 2, 3, 4]
rather than
- 1
- 2
- 3
- 4