
Comments (5)

albertz commented on August 11, 2024

I realize now that the zombie subproc is definitely not one of the MPD workers, since, as you can see, all the others are complete.
The first unnamed subproc is multiprocessing/resource_tracker.
The second unnamed subproc is multiprocessing/managers.
The main proc hangs while starting the subproc for the PyTorch _MultiProcessingDataLoaderIter. So this is the subproc that raised the exception, crashed, and is now stuck in the zombie state.

The corresponding code in the main proc, from the PyTorch dataloader _MultiProcessingDataLoaderIter (torch/utils/data/dataloader.py:1039), is:

            w = multiprocessing_context.Process(
                target=_utils.worker._worker_loop,
                args=(self._dataset_kind, self._dataset, index_queue,
                      self._worker_result_queue, self._workers_done_event,
                      self._auto_collation, self._collate_fn, self._drop_last,
                      self._base_seed, self._worker_init_fn, i, self._num_workers,
                      self._persistent_workers, self._shared_seed))
            w.daemon = True
            # NB: Process.start() actually take some time as it needs to
            #     start a process and pass the arguments over via a pipe.
            #     Therefore, we only add a worker to self._workers list after
            #     it started, so that we do not call .join() if program dies
            #     before it starts, and __del__ tries to join but will get:
            #     AssertionError: can only join a started process.
            w.start()   # <---- hang here

Here, w is a NonDaemonicSpawnProcess, and its pre_init_func is an instance of _DataLoaderWorkerPreInitFunc. Unpickling this instance fails.
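As a minimal in-process sketch of this kind of failure (this is not the RETURNN code; FailsOnUnpickle is a hypothetical stand-in for the pre_init_func object, and the UnicodeDecodeError is only chosen to mirror the error mentioned in this thread), an object can pickle fine on the sending side and only raise once it is unpickled:

```python
import codecs
import pickle


class FailsOnUnpickle:
    """Hypothetical stand-in: pickles fine, but raises when unpickled."""

    def __reduce__(self):
        # pickle stores "call codecs.decode(b'\xff', 'utf-8')" as the recipe
        # to rebuild this object. That call is only executed by pickle.loads.
        return (codecs.decode, (b"\xff", "utf-8"))


# Pickling (what the parent does when passing arguments) succeeds.
payload = pickle.dumps(FailsOnUnpickle())

# Unpickling (what a spawned child does on startup) raises.
caught = None
try:
    pickle.loads(payload)
except UnicodeDecodeError as exc:
    caught = exc

print(type(caught).__name__)
```

With the spawn start method, the unpickling step runs in the child, so the parent only learns about such an error indirectly.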

from returnn.

albertz commented on August 11, 2024

Since the failure occurs within our custom pre_init_func handling logic (not directly there, but we can catch it there), we will catch the exception at that point and try to clean up properly, so that the parent gets the SIGPIPE as it should.

albertz commented on August 11, 2024

I found the bug in CPython itself, reported here: python/cpython#118981

I'm not sure that we can really find a good workaround for this. We could try to delay the exception. I need to think more about it.
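One possible reading of "delay the exception", as a rough sketch only (DelayedError and rebuild_or_delay are hypothetical names, not part of RETURNN, PyTorch or CPython): catch the error during reconstruction, keep it, and raise it only when the object is actually used.

```python
import codecs


class DelayedError:
    """Hypothetical placeholder: re-raises a stored exception when called."""

    def __init__(self, exc):
        self.exc = exc

    def __call__(self, *args, **kwargs):
        raise self.exc


def rebuild_or_delay(rebuild, *args):
    """Run the real reconstruction; on failure, return a placeholder instead of raising."""
    try:
        return rebuild(*args)
    except Exception as exc:
        return DelayedError(exc)


# Reconstruction fails here (invalid UTF-8), but no exception escapes yet.
func = rebuild_or_delay(codecs.decode, b"\xff", "utf-8")

# The error only surfaces once the object is actually used.
try:
    func()
except UnicodeDecodeError as exc:
    delayed = exc
else:
    delayed = None
```

In a pickling setup, a `__reduce__` method could return `(rebuild_or_delay, (real_rebuild, ...))` so that unpickling itself never raises. Whether that alone would avoid the hang described here is not established by this sketch.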

albertz commented on August 11, 2024

(Note: PR #1515 just adds a workaround for the hang, so it no longer hangs. The CPython PR python/cpython#118982 should also fix the hang, in a cleaner way. But we still get the UnicodeDecodeError here.)

albertz commented on August 11, 2024

(Note: the UnicodeDecodeError was fixed in aa41c4a. But it is really unrelated to the hang.)
