Comments (6)

albertz commented on August 11, 2024

Also, I wonder a bit about the --multiprocessing-fork flag, which indicates that this is a fork, although it is actually started via the spawn method?

This is just misleading. It does not mean that the process was started via the fork start method; it just means that it is a subprocess started via the multiprocessing logic. It may even be used specifically only for the spawn method.
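
For illustration (my own check, not from the original thread): CPython builds the child's command line in multiprocessing.spawn.get_command_line(), which appends --multiprocessing-fork unconditionally, even though spawn_main is the entry point for the spawn start method. The tracker_fd/pipe_handle values below are arbitrary:

```python
from multiprocessing import spawn

# CPython appends "--multiprocessing-fork" to the child command line
# regardless of the start method in use; spawn_main is the spawn entry point.
cmd = spawn.get_command_line(tracker_fd=8, pipe_handle=90)
print(cmd[-1])  # --multiprocessing-fork
print(cmd[-2])  # the "from multiprocessing.spawn import spawn_main; ..." snippet
```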

from returnn.

albertz commented on August 11, 2024

Looking at the stacktrace of the TDL worker, we see that it waits in a multiprocessing Queue _finalize_join.

albertz commented on August 11, 2024

Then I manually sent a SIGINT, and now the log continued:

As this worked, we could also extend our NonDaemonicSpawnProcess atexit handler logic to not just send a single SIGINT, but instead send one, wait a bit, check waitpid (with timeout), and if the process has not finished yet, send SIGINT again, etc. Maybe after trying this N times, or after T seconds, escalate to SIGTERM, then SIGKILL, and at some point just give up.
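
A sketch of that escalation loop (my own illustration, not the actual RETURNN code; the helper name and the timings are made up):

```python
import os
import signal
import time

def terminate_with_escalation(pid, tries_per_signal=3, wait=0.5):
    """Hypothetical sketch: send SIGINT a few times, then SIGTERM, then
    SIGKILL, polling waitpid non-blockingly between attempts. Returns True
    if the child was reaped, False if we gave up."""
    for sig in (signal.SIGINT, signal.SIGTERM, signal.SIGKILL):
        for _ in range(tries_per_signal):
            try:
                os.kill(pid, sig)
            except ProcessLookupError:
                return True  # process already gone
            deadline = time.monotonic() + wait
            while time.monotonic() < deadline:
                try:
                    wpid, _status = os.waitpid(pid, os.WNOHANG)
                except ChildProcessError:
                    return True  # already reaped elsewhere
                if wpid == pid:
                    return True  # child exited and was reaped
                time.sleep(0.05)
    return False  # give up; leave the hanging child to the caller
```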

But we should still investigate why it hangs in the Queue _finalize_join.

albertz commented on August 11, 2024

Looking at the stacktrace of the TDL worker, we see that it waits in a multiprocessing Queue _finalize_join.

So, it waits for the Queue feed thread to finish. It should already have sent the sentinel to it. But we see from the py-spy output that the Queue feed thread hangs here:

Thread 2882887 (idle): "QueueFeederThread"
    _send (multiprocessing/connection.py:367)
    _send_bytes (multiprocessing/connection.py:404)
    send_bytes (multiprocessing/connection.py:199)
    _feed (multiprocessing/queues.py:250)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

Probably it tries to send some data but hangs.

I wonder a bit why this hangs if the relevant subprocess has already quit. Shouldn't that lead to a SIGPIPE? Or is the relevant subprocess still alive? It did not look like that. Also, from the TDL worker py-spy output, we see that the main thread is in _exit_function, past the point where it terminated and joined all the procs. This implies that there are no live subprocesses anymore.
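
A side note on the SIGPIPE question: a write to a pipe only raises SIGPIPE/EPIPE once the read end is closed; as long as some process still holds the read end open but never drains it, the write simply blocks once the kernel pipe buffer (64 KiB by default on Linux) is full. A small illustration, using a non-blocking pipe so the demo errors out instead of hanging:

```python
import os

r, w = os.pipe()
os.set_blocking(w, False)  # non-blocking, so we get an error instead of a hang

total = 0
try:
    while True:
        # With the read end open but never drained, writes succeed only
        # until the kernel pipe buffer fills up.
        total += os.write(w, b"x" * 4096)
except BlockingIOError:
    pass  # a blocking writer (like the queue feeder thread) would hang here

print(total)  # typically 65536 on Linux (the default pipe capacity)
os.close(r)
os.close(w)
```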

albertz commented on August 11, 2024

Ok, the proc continued, and then when it hit the next OOM, it hung again at exit:

...
ep 159 train, step 349, ctc_4 1.890, ctc_8 1.582, ce 1.492, fer 0.275, num_seqs 22, max_size:time 279752, max_size:out-spatial 65, mem_usage:cuda 19.0GB, 0.809 sec/step
ep 159 train, step 350, ctc_4 1.854, ctc_8 1.529, ce 1.333, fer 0.263, num_seqs 22, max_size:time 281512, max_size:out-spatial 67, mem_usage:cuda 19.0GB, 0.692 sec/step
ep 159 train, step 351, ctc_4 1.789, ctc_8 1.459, ce 1.227, fer 0.258, num_seqs 22, max_size:time 286264, max_size:out-spatial 67, mem_usage:cuda 19.0GB, 0.879 sec/step
ep 159 train, step 352, ctc_4 1.700, ctc_8 1.383, ce 1.160, fer 0.230, num_seqs 22, max_size:time 290576, max_size:out-spatial 64, mem_usage:cuda 19.0GB, 0.680 sec/step
OutOfMemoryError: CUDA out of memory. Tried to allocate 264.00 MiB. GPU 0 has a total capacty of 22.03 GiB of which 238.88 MiB is free. Including non-PyTorch memory, this process has 21.80 GiB memory in use. Of the allocated memory 19.31 GiB is allocated by PyTorch, and 1.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Unhandled exception <class 'torch.cuda.OutOfMemoryError'> in thread <_MainThread(MainThread, started 140688618846016)>, proc 2884751. 

...
OutOfMemoryError: CUDA out of memory. Tried to allocate 264.00 MiB. GPU 0 has a total capacty of 22.03 GiB of which 238.88 MiB is free. Including non-PyTorch memory, this process has 21.80 GiB memory in use. Of the allocated memory 19.31 GiB is allocated by PyTorch, and 1.12 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Module call stack:
(No module call frames.)
MEMORY: proc <unknown-dead>(2885470) exited, old: rss=737.9MB pss=712.4MB uss=711.7MB shared=26.2MB
MEMORY: total (main 2884751, 2024-01-20, 07:32:32, 30 procs): pss=16.4GB uss=15.4GB
[2024-01-20 07:32:33,168] INFO: Run time: 20:38:14 CPU: 0.20% RSS: 15.80GB VMS: 121.23GB
<OggZipDataset 'train' epoch=62>, epoch 62. No filter for this epoch.
<OggZipDataset 'train' epoch=63>, epoch 63. No filter for this epoch.
<OggZipDataset 'train' epoch=64>, epoch 64. No filter for this epoch.
<OggZipDataset 'train' epoch=65>, epoch 65. No filter for this epoch.
<OggZipDataset 'train' epoch=66>, epoch 66. No filter for this epoch.
<OggZipDataset 'train' epoch=67>, epoch 67. No filter for this epoch.
<OggZipDataset 'train' epoch=68>, epoch 68. No filter for this epoch.
<OggZipDataset 'train' epoch=69>, epoch 69. No filter for this epoch.
<OggZipDataset 'train' epoch=70>, epoch 70. No filter for this epoch.
<OggZipDataset 'train' epoch=71>, epoch 71. No filter for this epoch.
<OggZipDataset 'train' epoch=72>, epoch 72. No filter for this epoch.
<OggZipDataset 'train' epoch=73>, epoch 73. No filter for this epoch.
<OggZipDataset 'train' epoch=74>, epoch 74. No filter for this epoch.
<OggZipDataset 'train' epoch=75>, epoch 75. No filter for this epoch.
<OggZipDataset 'train' epoch=76>, epoch 76. No filter for this epoch.
<OggZipDataset 'train' epoch=77>, epoch 77. No filter for this epoch.
<OggZipDataset 'train' epoch=78>, epoch 78. No filter for this epoch.
<OggZipDataset 'train' epoch=79>, epoch 79. No filter for this epoch.
<OggZipDataset 'train' epoch=80>, epoch 80. No filter for this epoch.
<OggZipDataset 'train' epoch=81>, epoch 81. No filter for this epoch.
<OggZipDataset 'train' epoch=82>, epoch 82. No filter for this epoch.
<OggZipDataset 'train' epoch=83>, epoch 83. No filter for this epoch.
<OggZipDataset 'train' epoch=84>, epoch 84. No filter for this epoch.
<OggZipDataset 'train' epoch=85>, epoch 85. No filter for this epoch.
<OggZipDataset 'train' epoch=86>, epoch 86. No filter for this epoch.
<OggZipDataset 'train' epoch=87>, epoch 87. No filter for this epoch.
<OggZipDataset 'train' epoch=88>, epoch 88. No filter for this epoch.
<OggZipDataset 'train' epoch=89>, epoch 89. No filter for this epoch.
<OggZipDataset 'train' epoch=90>, epoch 90. No filter for this epoch.
<OggZipDataset 'train' epoch=91>, epoch 91. No filter for this epoch.
<OggZipDataset 'train' epoch=92>, epoch 92. No filter for this epoch.
<OggZipDataset 'train' epoch=93>, epoch 93. No filter for this epoch.
<OggZipDataset 'train' epoch=94>, epoch 94. No filter for this epoch.
<OggZipDataset 'train' epoch=95>, epoch 95. No filter for this epoch.
<OggZipDataset 'train' epoch=96>, epoch 96. No filter for this epoch.
<OggZipDataset 'train' epoch=97>, epoch 97. No filter for this epoch.
<OggZipDataset 'train' epoch=98>, epoch 98. No filter for this epoch.
<OggZipDataset 'train' epoch=99>, epoch 99. No filter for this epoch.
<OggZipDataset 'train' epoch=100>, epoch 100. No filter for this epoch.
<OggZipDataset 'train' epoch=101>, epoch 101. No filter for this epoch.
<OggZipDataset 'train' epoch=102>, epoch 102. No filter for this epoch.
<OggZipDataset 'train' epoch=103>, epoch 103. No filter for this epoch.
<OggZipDataset 'train' epoch=104>, epoch 104. No filter for this epoch.
<OggZipDataset 'train' epoch=105>, epoch 105. No filter for this epoch.
<OggZipDataset 'train' epoch=106>, epoch 106. No filter for this epoch.
<OggZipDataset 'train' epoch=107>, epoch 107. No filter for this epoch.
<OggZipDataset 'train' epoch=108>, epoch 108. No filter for this epoch.
<OggZipDataset 'train' epoch=109>, epoch 109. No filter for this epoch.
<OggZipDataset 'train' epoch=110>, epoch 110. No filter for this epoch.
<OggZipDataset 'train' epoch=111>, epoch 111. No filter for this epoch.
<OggZipDataset 'train' epoch=112>, epoch 112. No filter for this epoch.
<OggZipDataset 'train' epoch=113>, epoch 113. No filter for this epoch.
<OggZipDataset 'train' epoch=114>, epoch 114. No filter for this epoch.
<OggZipDataset 'train' epoch=115>, epoch 115. No filter for this epoch.
<OggZipDataset 'train' epoch=116>, epoch 116. No filter for this epoch.
<OggZipDataset 'train' epoch=117>, epoch 117. No filter for this epoch.
<OggZipDataset 'train' epoch=118>, epoch 118. No filter for this epoch.
<OggZipDataset 'train' epoch=119>, epoch 119. No filter for this epoch.
<OggZipDataset 'train' epoch=120>, epoch 120. No filter for this epoch.
<OggZipDataset 'train' epoch=121>, epoch 121. No filter for this epoch.
<OggZipDataset 'train' epoch=122>, epoch 122. No filter for this epoch.
<OggZipDataset 'train' epoch=123>, epoch 123. No filter for this epoch.
<OggZipDataset 'train' epoch=124>, epoch 124. No filter for this epoch.
<OggZipDataset 'train' epoch=125>, epoch 125. No filter for this epoch.
<OggZipDataset 'train' epoch=126>, epoch 126. No filter for this epoch.
<OggZipDataset 'train' epoch=127>, epoch 127. No filter for this epoch.
<OggZipDataset 'train' epoch=128>, epoch 128. No filter for this epoch.
<OggZipDataset 'train' epoch=129>, epoch 129. No filter for this epoch.
<OggZipDataset 'train' epoch=130>, epoch 130. No filter for this epoch.
<OggZipDataset 'train' epoch=131>, epoch 131. No filter for this epoch.
<OggZipDataset 'train' epoch=132>, epoch 132. No filter for this epoch.
<OggZipDataset 'train' epoch=133>, epoch 133. No filter for this epoch.
<OggZipDataset 'train' epoch=134>, epoch 134. No filter for this epoch.
<OggZipDataset 'train' epoch=135>, epoch 135. No filter for this epoch.
<OggZipDataset 'train' epoch=136>, epoch 136. No filter for this epoch.
<OggZipDataset 'train' epoch=137>, epoch 137. No filter for this epoch.
<OggZipDataset 'train' epoch=138>, epoch 138. No filter for this epoch.
<OggZipDataset 'train' epoch=139>, epoch 139. No filter for this epoch.
<OggZipDataset 'train' epoch=140>, epoch 140. No filter for this epoch.
<OggZipDataset 'train' epoch=141>, epoch 141. No filter for this epoch.
<OggZipDataset 'train' epoch=142>, epoch 142. No filter for this epoch.
<OggZipDataset 'train' epoch=143>, epoch 143. No filter for this epoch.
<OggZipDataset 'train' epoch=144>, epoch 144. No filter for this epoch.
<OggZipDataset 'train' epoch=145>, epoch 145. No filter for this epoch.
<OggZipDataset 'train' epoch=146>, epoch 146. No filter for this epoch.
<OggZipDataset 'train' epoch=147>, epoch 147. No filter for this epoch.
<OggZipDataset 'train' epoch=148>, epoch 148. No filter for this epoch.
<OggZipDataset 'train' epoch=149>, epoch 149. No filter for this epoch.
<OggZipDataset 'train' epoch=150>, epoch 150. No filter for this epoch.
<OggZipDataset 'train' epoch=151>, epoch 151. No filter for this epoch.
<OggZipDataset 'train' epoch=152>, epoch 152. No filter for this epoch.
<OggZipDataset 'train' epoch=153>, epoch 153. No filter for this epoch.
<OggZipDataset 'train' epoch=154>, epoch 154. No filter for this epoch.
<OggZipDataset 'train' epoch=155>, epoch 155. No filter for this epoch.
<OggZipDataset 'train' epoch=156>, epoch 156. No filter for this epoch.
<OggZipDataset 'train' epoch=157>, epoch 157. No filter for this epoch.
<OggZipDataset 'train' epoch=158>, epoch 158. No filter for this epoch.
<OggZipDataset 'train' epoch=159>, epoch 159. No filter for this epoch.
MEMORY: proc <unknown-dead>(2885245) exited, old: rss=625.7MB pss=494.4MB uss=457.8MB shared=167.9MB
MEMORY: proc <unknown-dead>(2885378) exited, old: rss=630.0MB pss=498.7MB uss=462.2MB shared=167.8MB
MEMORY: proc <unknown-dead>(2885426) exited, old: rss=737.9MB pss=712.5MB uss=711.8MB shared=26.2MB
MEMORY: proc <unknown-dead>(2885440) exited, old: rss=741.3MB pss=715.5MB uss=714.6MB shared=26.6MB
MEMORY: proc <unknown-dead>(2885455) exited, old: rss=737.6MB pss=712.1MB uss=711.3MB shared=26.3MB
MEMORY: proc <unknown-dead>(2885294) exited, old: rss=117.4MB pss=91.8MB uss=91.0MB shared=26.4MB
MEMORY: proc <unknown-dead>(2885308) exited, old: rss=118.9MB pss=93.5MB uss=92.8MB shared=26.1MB
MEMORY: proc <unknown-dead>(2885323) exited, old: rss=118.3MB pss=92.8MB uss=92.0MB shared=26.3MB
MEMORY: proc <unknown-dead>(2885338) exited, old: rss=117.4MB pss=91.9MB uss=91.2MB shared=26.3MB
MEMORY: proc <unknown-dead>(2885063) exited, old: rss=0.9GB pss=0.8GB uss=0.8GB shared=26.4MB
MEMORY: proc <unknown-dead>(2885077) exited, old: rss=0.9GB pss=805.2MB uss=784.6MB shared=102.1MB
MEMORY: proc <unknown-dead>(2885093) exited, old: rss=0.9GB pss=805.1MB uss=784.3MB shared=102.6MB
MEMORY: proc <unknown-dead>(2885108) exited, old: rss=0.9GB pss=804.8MB uss=776.3MB shared=98.8MB
MEMORY: total (main 2884751, 2024-01-20, 07:32:37, 17 procs): pss=9.8GB uss=9.0GB
[2024-01-20 07:32:38,185] INFO: Run time: 20:38:19 CPU: 0.40% RSS: 10.59GB VMS: 85.03GB

Current running procs:

zeyer@cn-504 ~ % ps a --forest -u $(whoami) -o pid,comm       
    PID COMMAND
2884584 sshd
2884585  \_ zsh
2893980      \_ ps
2874748 slurm_script
2874753  \_ python3.11
2874766      \_ python3.11
2884751          \_ python3.11
2884767              \_ python3.11
2884768              \_ watch memory
2884803              \_ MPD worker 0
2884804              \_ MPD worker 1
2884805              \_ MPD worker 2
2884806              \_ MPD worker 3
2884870              \_ python3.11
2884876              \_ MPD worker 0
2884877              \_ MPD worker 1
2884878              \_ MPD worker 2
2884879              \_ MPD worker 3
2884941              \_ MPD worker 0
2884942              \_ MPD worker 1
2884943              \_ MPD worker 2
2884944              \_ MPD worker 3
2885015              \_ TDL worker 0
2884573 systemd
2884574  \_ (sd-pam)
   1765 agetty

Main proc:

zeyer@cn-504 ~ % py-spy dump -p 2884751
Process 2884751: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.XBD0ELoPJvFX/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)

Thread 2884751 (idle): "MainThread"
    __call__ (returnn/returnn/util/multi_proc_non_daemonic_spawn.py:166)
Thread 2885030 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)
Thread 2885174 (idle)
Thread 2885261 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)
Thread 2885393 (idle): "QueueFeederThread"
    wait (threading.py:320)
    _feed (multiprocessing/queues.py:231)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

Last TDL worker:

zeyer@cn-504 ~ % py-spy dump -p 2885015
Process 2885015: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=8, pipe_handle=90) --multiprocessing-fork
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)

Thread 2885015 (idle): "MainThread"
    _wait_for_tstate_lock (threading.py:1132)
    join (threading.py:1112)
    _finalize_join (multiprocessing/queues.py:199)
    __call__ (multiprocessing/util.py:224)
    _run_finalizers (multiprocessing/util.py:300)
    _exit_function (multiprocessing/util.py:360)
    _bootstrap (multiprocessing/process.py:317)
    _main (multiprocessing/spawn.py:133)
    spawn_main (multiprocessing/spawn.py:120)
    <module> (<string>:1)
Thread 2885172 (idle): "QueueFeederThread"
    _send (multiprocessing/connection.py:367)
    _send_bytes (multiprocessing/connection.py:404)
    send_bytes (multiprocessing/connection.py:199)
    _feed (multiprocessing/queues.py:250)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)
Thread 2885173 (idle): "Thread-1 (_serve)"
    accept (socket.py:294)
    accept (multiprocessing/connection.py:608)
    accept (multiprocessing/connection.py:462)
    _serve (multiprocessing/resource_sharer.py:138)
    run (threading.py:975)
    _bootstrap_inner (threading.py:1038)
    _bootstrap (threading.py:995)

Via pystack:

zeyer@cn-504 ~ % pystack remote 2885015 --locals                                                                                                                
Traceback for thread 2885173 (TDL worker 0) [] (most recent call last):         
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/threading.py", line 995, in _bootstrap
        self._bootstrap_inner()                                                 
      Arguments:                                                                                                                                                
        self: <Thread at 0x7fdaeefd2b50>        
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/threading.py", line 1038, in _bootstrap_inner      
        self.run()                                                              
      Arguments:                                                                                                                                                
        self: <Thread at 0x7fdaeefd2b50>                                        
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/threading.py", line 975, in run                                             
        self._target(*self._args, **self._kwargs)                               
      Arguments:                        
        self: <Thread at 0x7fdaeefd2b50>                               
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/resource_sharer.py", line 138, in _serve
        with self._listener.accept() as conn:                                                                                                                   
      Arguments:                                                                                                                                                
        self: <_ResourceSharer at 0x7fdaf2dfa590>                               
      Locals:                                                                                                                                                   
        close: <function at 0x7fdaeeb1b1a0>                                                                                                                     
        send: <function at 0x7fdaeeb1b420>                                                                                                                      
        destination_pid: 2884751                                                                                                                                
        key: 288432                                                                                                                                             
        msg: (288432, 2884751)                                                  
        conn: <Connection at 0x7fdaf1725f10>                                                                                                                    
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/connection.py", line 462, in accept
        c = self._listener.accept()
      Arguments:
        self: <Listener at 0x7fdaefa38d50>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/connection.py", line 608, in accept
        s, self._last_accepted = self._socket.accept()
      Arguments:
        self: <SocketListener at 0x7fdaeefd1c90>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/socket.py", line 294, in accept
        fd, addr = self._accept()
      Arguments:
        self: <socket at 0x7fdaeeb5bb80> 

Traceback for thread 2885172 (TDL worker 0) [] (most recent call last):
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/threading.py", line 995, in _bootstrap
        self._bootstrap_inner()
      Arguments:
        self: <Thread at 0x7fdaeefd2510> 
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
        self.run()
      Arguments:
        self: <Thread at 0x7fdaeefd2510> 
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/threading.py", line 975, in run
        self._target(*self._args, **self._kwargs)
      Arguments:
        self: <Thread at 0x7fdaeefd2510> 
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/queues.py", line 250, in _feed
        send_bytes(obj)
      Arguments:
        queue_sem: <BoundedSemaphore at 0x7fdaeeb16110>
        onerror: <function at 0x7fdaeeb18ae0>
        writer_close: <method at 0x7fdaeeb15dc0>
        reader_close: <method at 0x7fdaef3d26c0>
        writelock: <Lock at 0x7fdaeeb16050>
        ignore_epipe: False
        send_bytes: <method at 0x7fdaeeb162c0>
        notempty: <Condition at 0x7fdaeeb16290>
        buffer: <collections.deque at 0x7fdaeeb127a0>
      Locals:
        obj: <memoryview at 0x7fdaef402ec0>
        wrelease: <builtin_function_or_method at 0x7fdaeeb026b0>
        wacquire: <builtin_function_or_method at 0x7fdaeeb02660>
        sentinel: <object at 0x7fdaf05eec00>
        bpopleft: <builtin_function_or_method at 0x7fdaeeb40e50>
        nwait: <method at 0x7fdaeefd18c0>
        nrelease: <builtin_function_or_method at 0x7fdaeeb027f0>
        nacquire: <builtin_function_or_method at 0x7fdaeeb027a0>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/connection.py", line 199, in send_bytes
        self._send_bytes(m[offset:offset + size])
      Arguments:
        size: 30056
        offset: 0
        buf: <memoryview at 0x7fdaef402ec0>
        self: <Connection at 0x7fdaeeb15f90>
      Locals:
        n: 30056
        m: <memoryview at 0x7fdaef403580>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/connection.py", line 404, in _send_bytes           [43/1574]
        self._send(buf)
      Arguments:
        buf: <memoryview at 0x7fdaef403340>
        self: <Connection at 0x7fdaeeb15f90>
      Locals:
        header: b"uh"
        n: 30056
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/connection.py", line 367, in _send
        n = write(self._handle, buf)
      Arguments:
        write: <builtin_function_or_method at 0x7fdc074741d0>
        buf: <memoryview at 0x7fdaef4031c0>
        self: <Connection at 0x7fdaeeb15f90>
      Locals:
        n: 25960
        remaining: 4096

Traceback for thread 2885015 (TDL worker 0) [] (most recent call last):
    (Python) File "<string>", line 1, in <module>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/spawn.py", line 120, in spawn_main
        exitcode = _main(fd, parent_sentinel)
      Arguments:
        tracker_fd: 8
        parent_pid: None
        pipe_handle: 90
      Locals:
        resource_tracker: <module at 0x7fdc071a3e20>
        parent_sentinel: 3
        fd: 90
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/spawn.py", line 133, in _main
        return self._bootstrap(parent_sentinel)
      Arguments:
        parent_sentinel: 3
        fd: 90
      Locals:
        self: <NonDaemonicSpawnProcess at 0x7fdaefc76cd0>
        preparation_data: {"log_to_stderr": False, "authkey": <BINARY>, ...}
        from_parent: <_io.BufferedReader at 0x7fdc07189fe0>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/process.py", line 317, in _bootstrap
        util._exit_function()
      Arguments:
        parent_sentinel: 3
        self: <NonDaemonicSpawnProcess at 0x7fdaefc76cd0>
      Locals:
        exitcode: 0
        context: <module at 0x7fdc07322b10>
        util: <module at 0x7fdc07200ae0> 
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/util.py", line 360, in _exit_function
        _run_finalizers()
      Arguments:
        current_process: <function at 0x7fdc07366e80>
        active_children: <function at 0x7fdc07366fc0>
        _run_finalizers: <function at 0x7fdc0726a200>
        debug: <function at 0x7fdc07235e40>
        info: <function at 0x7fdc07236480>
      Locals:
        p: <NonDaemonicSpawnProcess at 0x7fdaeeb88090>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/util.py", line 300, in _run_finalizers
        finalizer()
      Arguments:
        minpriority: <cell at 0x7fdaeeb0fdc0>
      Locals:
        f: <cell at 0x7fdaeeb0fcd0>
        finalizer: <Finalize at 0x7fdaeeb29990>
        key: (-5, 4)
        keys: [(-5, 4), (-100, 6)]
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/util.py", line 224, in __call__
        res = self._callback(*self._args, **self._kwargs)
      Arguments:
        getpid: <builtin_function_or_method at 0x7fdc07482d40>
        sub_debug: <function at 0x7fdc071f1300>
        _finalizer_registry: {(None, 0): <Finalize at 0x7fdaf054ae90>, ...}
        wr: None
        self: <Finalize at 0x7fdaeeb29990>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/queues.py", line 199, in _finalize_join
        thread.join()
      Arguments:
        twr: <weakref.ReferenceType at 0x7fdaeeb41080>
      Locals:
        thread: <Thread at 0x7fdaeefd2510>
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/threading.py", line 1112, in join
        self._wait_for_tstate_lock()
      Arguments:
        timeout: None
        self: <Thread at 0x7fdaeefd2510> 
    (Python) File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/threading.py", line 1132, in _wait_for_tstate_lock
        if lock.acquire(block, timeout):
      Arguments:
        timeout: -1
        block: True
        self: <Thread at 0x7fdaeefd2510> 
      Locals:
        lock: <_thread.lock at 0x7fdaeefd2300>

albertz commented on August 11, 2024

As this worked, we could also extend our NonDaemonicSpawnProcess atexit handler logic to not just send a single SIGINT, but instead send one, wait a bit, check waitpid (with timeout), and if the process has not finished yet, send SIGINT again, etc. Maybe after trying this N times, or after T seconds, escalate to SIGTERM, then SIGKILL, and at some point just give up.

That's exactly what I implemented now, and it should solve the hang in the main proc here. But we should still figure out why it hangs in the subprocess; that is still not clear to me.
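
For reference, the stdlib's documented escape hatch for exactly this exit-time feeder join is Queue.cancel_join_thread(), at the price of possibly losing data still buffered in the feeder thread. A minimal sketch (my own illustration, not the RETURNN fix; it uses the fork start method only to keep the demo self-contained):

```python
import multiprocessing as mp

def worker(q):
    # Put an item larger than a typical pipe buffer (64 KiB on Linux),
    # so the queue's feeder thread is still mid-send when we return.
    q.put(b"x" * 1_000_000)
    # The documented escape hatch: do not join the feeder thread at
    # process exit -- at the risk of losing data still in the buffer.
    q.cancel_join_thread()

def run_demo():
    ctx = mp.get_context("fork")
    q = ctx.Queue()
    p = ctx.Process(target=worker, args=(q,))
    p.start()
    p.join(timeout=30)
    return p.exitcode  # 0 if the worker exited instead of hanging

if __name__ == "__main__":
    print(run_demo())
```

Without the cancel_join_thread() call, the worker would block in _finalize_join just like the TDL worker above, since nobody ever drains the queue.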

(Anyway, leaving this closed now, as the priority is low. The problem probably existed all along but always went unnoticed, because a normal daemon proc just got terminated.)
