Currently, when external tools are used in cluster or multiprocessing mode, a single u

Improve error reporting and recovery from errors about bcbio-nextgen HOT 13 CLOSED

bcbio commented on June 30, 2024

Improve error reporting and recovery from errors

from bcbio-nextgen.

Comments (13)

chapmanb commented on June 30, 2024

Luca;
Thanks for sharing your good experiences with joblib. Would integrating that to replace multiprocessing/concurrent.futures resolve these issues? I heard good reports about joblib at SciPy so am open to including that if it helps clean up the hangs and bad error reporting from multiprocessing.


lbeltrame commented on June 30, 2024

On Friday 5 July 2013 03:46:09, Brad Chapman wrote:

Thanks for sharing your good experiences with joblib. Would integrating that
to replace multiprocessing/concurrent.futures resolve these issues? I heard

I'm not sure per se (of course it will require a bit of testing), but it's such
an improvement that I would really recommend it. Even just for its
excellent error reporting and the ability to turn off MP if needed, so you can
run the code in a non-parallel way (a godsend for debugging), it would be a
worthwhile addition.


chapmanb commented on June 30, 2024

Luca;
That makes good sense. bcbio-nextgen has the feature of turning multiprocessing off for single cores, for exactly the debugging reason you mention, but using an external library for this is an improvement.
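As a rough illustration of the serial fallback described above, here is a minimal sketch using only the standard library. The `run_parallel` helper and its `cores` parameter are hypothetical names chosen for this example; they are not bcbio-nextgen's or joblib's actual API.

```python
from concurrent.futures import ProcessPoolExecutor


def run_parallel(fn, items, cores=1):
    """Apply fn to every item; run in-process when cores <= 1.

    Hypothetical helper for illustration only.
    """
    if cores <= 1:
        # Serial path: exceptions surface with an ordinary traceback and
        # debuggers such as pdb can step through fn directly.
        return [fn(item) for item in items]
    with ProcessPoolExecutor(max_workers=cores) as pool:
        # list() consumes the iterator so worker exceptions are re-raised here.
        return list(pool.map(fn, items))


def square(x):
    return x * x


if __name__ == "__main__":
    print(run_parallel(square, [1, 2, 3], cores=1))
```

With `cores=1` no worker processes are started, so a failure inside `fn` appears in the calling process rather than being relayed from a child.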

I pushed a joblib-based implementation to the development version. Let me know if this helps with the issues you were seeing. I did manage to simulate a few lockups by randomly killing processes during a run. It seems to get stuck acquiring its internal lock. In reading the code it seems strange they use a threading Lock instead of a multiprocessing one:

https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L489

but I haven't debugged further yet. Let me know how this works in your hands.


lbeltrame commented on June 30, 2024

Hmm... the one I'm seeing looks like a hang in bedtools / pybedtools instead. It passes if the pipeline is run with no data, but hangs when there is already data present (the pipeline on the earlier run had an error because gemini wasn't installed).


lbeltrame commented on June 30, 2024

This is what I got by attaching gdb to the hung processes (State: "[2013-07-08 08:47] ipython: piped_bamprep -- local; checkpoint passed")

Parent process:

Thread 1 (Thread 0x7f405ecff700 (LWP 19190)):
  41                               "{record.message}"])
  42
  43        log_dir = _get_log_dir(config)
  44        if log_dir:
  45            utils.safe_makedir(log_dir)
 >46            time.sleep(1)
  47            handlers.append(logbook.FileHandler(os.path.join(log_dir, "%s.log" % LOG_NAME),
  48                                                format_string=format_str, level="INFO",
  49                                                filter=_not_cl))
  50            handlers.append(logbook.FileHandler(os.path.join(log_dir, "%s-debug.log" % LOG_NAME),
  51                                                format_string=format_str, level="DEBUG", bubble=True,

Children are hung on ZMQ polling:

(gdb) bt
#0  0x00007f405dd890d3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f405bbe637b in zmq::epoll_t::loop (this=0x3016620) at epoll.cpp:142
#2  0x00007f405bc02ac6 in thread_routine (arg_=0x3016690) at thread.cpp:83
#3  0x00007f405e8deb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f405dd88a7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x0000000000000000 in ?? ()

Often the bedtools call also fails and hangs.

Note that this happens locally (but on an NFS-mounted volume) and only if I rerun a pipeline that was interrupted by an error (first it was snpEff, then gemini). If I start from scratch, this does not happen.
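One way to make a hanging or failing external tool easier to spot is to run it with a time limit and include its stderr in the error. The sketch below uses only the standard library; `run_tool` and `ToolError` are hypothetical names for this example, and this is not how bcbio-nextgen or pybedtools invoke bedtools.

```python
import subprocess
import sys


class ToolError(RuntimeError):
    """Raised when an external tool fails or exceeds its time limit."""


def run_tool(cmd, timeout=60):
    """Run cmd (a list of arguments) and return its stdout as text."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the child before raising TimeoutExpired.
        raise ToolError("%s timed out after %s seconds"
                        % (cmd[0], timeout)) from exc
    if proc.returncode != 0:
        # Surface the tool's own stderr instead of a bare exit status.
        raise ToolError("%s exited with %d: %s"
                        % (cmd[0], proc.returncode, proc.stderr.strip()))
    return proc.stdout


if __name__ == "__main__":
    print(run_tool([sys.executable, "-c", "print('ok')"]).strip())
```

A timeout turns a silent hang into an explicit error, at the cost of having to choose a limit that is generous enough for slow but healthy runs.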


lbeltrame commented on June 30, 2024

Another interesting observation: the process became "unstuck" after 30 minutes:

[2013-07-08 09:06]  Timing: alignment post-processing
[2013-07-08 09:06]  ipython: piped_bamprep -- local; checkpoint passed
[2013-07-08 09:39]  Timing: variant calling


chapmanb commented on June 30, 2024

Luca;
It doesn't sound like this is stuck, but rather processing to get up to the point where it can re-start processing. There is sometimes work that needs to be done to recover to the right state. Is the bcbio_nextgen process using processor, or idle?


lbeltrame commented on June 30, 2024

On Monday 8 July 2013 03:12:20, Brad Chapman wrote:

to be done to recover to the right state. Is the bcbio_nextgen process
using processor, or idle?

Funnily enough, idle. There were also no noticeable uses of I/O (checked with
iotop).

The bedtools process it spawned was also in zombie state (killing at that
point, should one do it, would cause a segfault).

Since the pipeline failed again (I keep on forgetting all those gemini
data files ;) I'll try to monitor more closely once I restart it.


chapmanb commented on June 30, 2024

Luca;
My guess is that the process is waiting to read files from NFS. I'm not sure of a good way to monitor this short of looking at network traffic. On a re-run, bedtools should only be used for reading a BED file of regions to process, so I wouldn't expect any IO/processor intensive work there. If you find it locking up, rather than being slow, I'll be happy to dig further.


lbeltrame commented on June 30, 2024

On Monday 8 July 2013 06:01:51, Brad Chapman wrote:

My guess is that the process is waiting to read files from NFS. I'm not sure
of a good way to monitor this short of looking at network traffic. On a

I wondered about that, although in this case it is "artificial NFS" (the VMs
all share the same physical hardware). Nevertheless, it may well be that.
I'll check network traffic.



lbeltrame commented on June 30, 2024

At last I was able to rerun the pipeline (it's running right now) and network traffic seems negligible. I'll let it run overnight and check tomorrow.


chapmanb commented on June 30, 2024

Luca;
Can you check if that fix resolves it for you? Apologies, I thought through this and realized I introduced a delay by adding in a defensive sleep for network issues. I think this is the issue, but let me know if not.
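For reference, a fixed defensive sleep can be replaced by a bounded wait that returns as soon as the path becomes visible, so it costs nothing when the filesystem is already consistent. This is a sketch under assumptions, not the change that was actually committed; `wait_for_path` and its parameters are hypothetical.

```python
import os
import time


def wait_for_path(path, timeout=5.0, interval=0.1):
    """Return True as soon as path exists, False if timeout elapses first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Short pause between checks; only reached while path is missing.
        time.sleep(interval)


if __name__ == "__main__":
    print(wait_for_path(os.getcwd()))
```

When the directory is already present the function returns on the first check, so repeated calls do not accumulate delay the way an unconditional `time.sleep(1)` does.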


lbeltrame commented on June 30, 2024

It is, thanks. It passed the step without a hitch now.
