Comments (5)
I now realize that the zombie subproc is definitely not one of the MPD workers, as you can see that all the others are complete.
The first unnamed subproc is multiprocessing/resource_tracker.
The second unnamed subproc is multiprocessing/managers.
The main proc hangs while starting the subproc for the PyTorch _MultiProcessingDataLoaderIter. So this is the subproc which raised the exception, then crashed, and now hangs in zombie state.
The corresponding code from the main proc in the PyTorch dataloader _MultiProcessingDataLoaderIter (torch/utils/data/dataloader.py:1039) is:
    w = multiprocessing_context.Process(
        target=_utils.worker._worker_loop,
        args=(self._dataset_kind, self._dataset, index_queue,
              self._worker_result_queue, self._workers_done_event,
              self._auto_collation, self._collate_fn, self._drop_last,
              self._base_seed, self._worker_init_fn, i, self._num_workers,
              self._persistent_workers, self._shared_seed))
    w.daemon = True
    # NB: Process.start() actually take some time as it needs to
    #     start a process and pass the arguments over via a pipe.
    #     Therefore, we only add a worker to self._workers list after
    #     it started, so that we do not call .join() if program dies
    #     before it starts, and __del__ tries to join but will get:
    #     AssertionError: can only join a started process.
    w.start()  # <---- hang here
The w is a NonDaemonicSpawnProcess, and it has an instance of _DataLoaderWorkerPreInitFunc for pre_init_func. The unpickling of this fails.
The failure happens within our custom pre_init_func handling logic (not directly there, but we can catch it there). So we will catch the exception there and try to do some proper cleanup, such that the parent properly gets the SIGPIPE.
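To illustrate why the cleanup matters for the parent, here is a minimal POSIX-only sketch of the underlying pipe behavior. It is not the RETURNN or CPython code, and both function names are hypothetical. A write to a pipe fails fast only once every read end is closed. As long as some read end is still open and nobody reads, the pipe just fills up, and a blocking write would then hang.

```python
import os


def write_after_reader_closed() -> str:
    """Write to a pipe whose only read end is already closed."""
    r, w = os.pipe()
    os.close(r)  # no read end left, like a crashed reader with no stray copies
    try:
        os.write(w, b"x" * 1024)
        return "written"
    except BrokenPipeError:
        # CPython ignores SIGPIPE by default, so EPIPE surfaces as an exception.
        return "broken pipe"
    finally:
        os.close(w)


def bytes_until_pipe_full() -> int:
    """Count how much fits into a pipe that nobody reads from."""
    r, w = os.pipe()
    os.set_blocking(w, False)  # non-blocking, so a full pipe raises instead of hanging
    total = 0
    try:
        while True:
            total += os.write(w, b"x" * 4096)
    except BlockingIOError:
        # With a blocking descriptor, this is the point where the writer would hang.
        return total
    finally:
        os.close(r)
        os.close(w)
```

The second function uses a non-blocking descriptor only so that the sketch terminates. The actual capacity it reports depends on the platform.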
I found the bug in CPython itself, reported here: python/cpython#118981
I'm not sure that we can really find a good workaround for this. We could try to delay the exception. I need to think more about it.
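One possible reading of "delay the exception" is sketched below. This is an assumption for illustration, not what RETURNN actually does, and the class name `DelayedUnpickle` is hypothetical. The idea is to ship the problematic object as raw bytes inside a wrapper, so that unpickling the wrapper itself never raises. Any error from the nested unpickling is recorded and only raised later, when the object is first used.

```python
import pickle


class DelayedUnpickle:
    """Hypothetical wrapper that postpones unpickling errors until first use."""

    def __init__(self, obj):
        self._payload = pickle.dumps(obj)  # nested pickle, kept as raw bytes
        self._obj = obj
        self._error = None

    def __getstate__(self):
        # Only the raw bytes travel to the other process.
        return {"payload": self._payload}

    def __setstate__(self, state):
        self._payload = state["payload"]
        self._obj = None
        self._error = None
        try:
            self._obj = pickle.loads(self._payload)
        except Exception as exc:  # remember the failure instead of raising now
            self._error = exc

    def get(self):
        """Return the wrapped object, or raise the error recorded earlier."""
        if self._error is not None:
            raise self._error
        return self._obj
```

With such a wrapper, the receiving side would finish reading and unpickling the outer object first, and the failure would surface only at the `get()` call. Whether that is enough to avoid the hang in practice is not verified here.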
(Note: PR #1515 just adds a workaround for the hang, so it does not hang anymore. The CPython PR python/cpython#118982 should also fix the hang, in a cleaner way. But we still get the UnicodeDecodeError here.)
(Note: the UnicodeDecodeError was fixed in aa41c4a. But it's really unrelated to the hang.)