
Comments (13)

chapmanb avatar chapmanb commented on June 30, 2024

Luca;
Thanks for sharing your good experiences with joblib. Would integrating that to replace multiprocessing/concurrent.futures resolve these issues? I heard good reports about joblib at SciPy so am open to including that if it helps clean up the hangs and bad error reporting from multiprocessing.

from bcbio-nextgen.

lbeltrame avatar lbeltrame commented on June 30, 2024

On Friday 5 July 2013 03:46:09, Brad Chapman wrote:

Thanks for sharing your good experiences with joblib. Would integrating that
to replace multiprocessing/concurrent.futures resolve these issues? I heard

I'm not sure per se (of course it will require a bit of testing), but it's such
an improvement that I would really recommend it. Even just for its
excellent error reporting and the ability to turn off MP if needed, so you can
look at the code in a non-parallel way (a godsend for debugging), it would be a
worthwhile addition.


chapmanb avatar chapmanb commented on June 30, 2024

Luca;
That makes good sense. bcbio-nextgen has the feature of turning multiprocessing off for single cores, for exactly the debugging reason you mention, but using an external library for this is an improvement.
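The single-core fallback described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual bcbio-nextgen or joblib code; `run_parallel` and `square` are hypothetical names introduced for the example. (joblib exposes a comparable switch through `n_jobs=1`.)

```python
from concurrent.futures import ProcessPoolExecutor


def run_parallel(fn, items, cores=1):
    """Apply fn to each item; run in-process when cores <= 1 (hypothetical helper)."""
    items = list(items)
    if cores <= 1:
        # Serial path: no worker processes are started, so exceptions
        # surface with a plain traceback and pdb works as usual.
        return [fn(item) for item in items]
    # Parallel path: fn and every item must be picklable.
    with ProcessPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(fn, items))


def square(x):
    """Toy task used only to demonstrate the helper."""
    return x * x
```

Calling `run_parallel(square, range(5), cores=1)` stays entirely in the current process, which is the mode that makes debugging easier; raising `cores` switches to worker processes without changing the call site.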

I pushed a joblib-based implementation to the development version. Let me know if this helps with the issues you were seeing. I did manage to simulate a few lockups by randomly killing processes during a run. It seems to get stuck acquiring its internal lock. In reading the code it seems strange that they use a threading Lock instead of a multiprocessing one:

https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L489

but I haven't debugged further yet. Let me know how this works in your hands.


lbeltrame avatar lbeltrame commented on June 30, 2024

Hmm... the one I'm seeing looks like a hang in bedtools / pybedtools instead. It passes if the pipeline is run with no data, but hangs when there is already data present (the pipeline on the earlier run had an error because gemini wasn't installed).


lbeltrame avatar lbeltrame commented on June 30, 2024

This is what I got by attaching gdb to the hung processes (State: "[2013-07-08 08:47] ipython: piped_bamprep -- local; checkpoint passed")

Parent process:

Thread 1 (Thread 0x7f405ecff700 (LWP 19190)):
  41                               "{record.message}"])
  42
  43        log_dir = _get_log_dir(config)
  44        if log_dir:
  45            utils.safe_makedir(log_dir)
 >46            time.sleep(1)
  47            handlers.append(logbook.FileHandler(os.path.join(log_dir, "%s.log" % LOG_NAME),
  48                                                format_string=format_str, level="INFO",
  49                                                filter=_not_cl))
  50            handlers.append(logbook.FileHandler(os.path.join(log_dir, "%s-debug.log" % LOG_NAME),
  51                                                format_string=format_str, level="DEBUG", bubble=True,

Children are hung on ZMQ polling:

(gdb) bt
#0  0x00007f405dd890d3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f405bbe637b in zmq::epoll_t::loop (this=0x3016620) at epoll.cpp:142
#2  0x00007f405bc02ac6 in thread_routine (arg_=0x3016690) at thread.cpp:83
#3  0x00007f405e8deb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4  0x00007f405dd88a7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5  0x0000000000000000 in ?? ()

Often the bedtools call also fails and hangs.

Note that this happens locally (but on an NFS-mounted volume) and only if I rerun a pipeline that was interrupted by an error (first it was snpEff, then gemini). If I start from scratch, this does not happen.
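As a complement to attaching gdb, a Python process can be asked to print its own Python-level stacks. The sketch below uses the standard library's `faulthandler` module (Python 3.3+, Unix only for `register`); it is not something bcbio-nextgen is known to do, and both function names are hypothetical.

```python
import faulthandler
import signal


def enable_stack_dump_on_signal(signum=signal.SIGUSR1):
    """Dump every thread's Python traceback to stderr when signum arrives."""
    # After this call, `kill -USR1 <pid>` prints the stacks without
    # stopping the process or attaching a debugger.
    faulthandler.register(signum, all_threads=True)


def dump_stacks_to_file(path):
    """Write the current Python tracebacks of all threads to path."""
    with open(path, "w") as handle:
        # faulthandler writes straight to the file descriptor.
        faulthandler.dump_traceback(file=handle, all_threads=True)
```

If `enable_stack_dump_on_signal()` is called early in a long-running process, a later hang can be inspected by sending the signal from another shell. This shows Python frames only; native frames such as the ZMQ `epoll_wait` above still need gdb.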


lbeltrame avatar lbeltrame commented on June 30, 2024

Another interesting observation: the process became "unstuck" after roughly 30 minutes:

[2013-07-08 09:06]  Timing: alignment post-processing
[2013-07-08 09:06]  ipython: piped_bamprep -- local; checkpoint passed
[2013-07-08 09:39]  Timing: variant calling


chapmanb avatar chapmanb commented on June 30, 2024

Luca;
It doesn't sound like this is stuck, but rather processing to get up to the point where it can re-start processing. There is sometimes work that needs to be done to recover to the right state. Is the bcbio_nextgen process using processor, or idle?


lbeltrame avatar lbeltrame commented on June 30, 2024

On Monday 8 July 2013 03:12:20, Brad Chapman wrote:

to be done to recover to the right state. Is the bcbio_nextgen process
using processor, or idle?

Funnily enough, idle. There was also no noticeable I/O activity (checked with
iotop).

The bedtools process it spawned was also in zombie state (killing at that
point, should one do it, would cause a segfault).

Since the pipeline failed again (I keep on forgetting all those gemini
data files ;) I'll try to monitor more closely once I restart it.


chapmanb avatar chapmanb commented on June 30, 2024

Luca;
My guess is that the process is waiting to read files from NFS. I'm not sure of a good way to monitor this short of looking at network traffic. On a re-run, bedtools should only be used for reading a BED file of regions to process, so I wouldn't expect any IO/processor intensive work there. If you find it locking up, rather than being slow, I'll be happy to dig further.


lbeltrame avatar lbeltrame commented on June 30, 2024

On Monday 8 July 2013 06:01:51, Brad Chapman wrote:

My guess is that the process is waiting to read files from NFS. I'm not sure
of a good way to monitor this short of looking at network traffic. On a

I wondered about that, although in this case it is "artificial NFS" (the VMs
all share the same physical hardware). Nevertheless, it may well be that.
I'll check network traffic.

Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79


lbeltrame avatar lbeltrame commented on June 30, 2024

At last I was able to rerun the pipeline (it's running right now) and network traffic seems negligible. I'll let it run overnight and check tomorrow.


chapmanb avatar chapmanb commented on June 30, 2024

Luca;
Can you check if that fix resolves it for you? Apologies, I thought through this and realized I introduced a delay by adding in a defensive sleep for network issues. I think this is the issue, but let me know if not.
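The actual fix is not shown in this thread. As a generic illustration of why an unconditional sleep adds delay, and of one common alternative, the sketch below polls for a path and returns as soon as it is visible. `wait_for_path` and its parameters are hypothetical and are not part of bcbio-nextgen.

```python
import os
import time


def wait_for_path(path, timeout=5.0, interval=0.1):
    """Return True as soon as path is visible, False once timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            # Common case: the path is already there, so no time is spent waiting.
            return True
        if time.monotonic() >= deadline:
            return False
        # Only sleep while the path is still missing (e.g. slow NFS propagation).
        time.sleep(interval)
```

Unlike a fixed `time.sleep(1)`, this costs nothing when the directory already exists, so repeated calls do not accumulate delay.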


lbeltrame avatar lbeltrame commented on June 30, 2024

It is, thanks. It passed the step without a hitch now.

