Comments (13)
Luca;
Thanks for sharing your good experiences with joblib. Would integrating that to replace multiprocessing/concurrent.futures resolve these issues? I heard good reports about joblib at SciPy so am open to including that if it helps clean up the hangs and bad error reporting from multiprocessing.
from bcbio-nextgen.
On Friday 5 July 2013 03:46:09, Brad Chapman wrote:
> Thanks for sharing your good experiences with joblib. Would integrating that
> to replace multiprocessing/concurrent.futures resolve these issues? I heard
I'm not sure per se (of course it will require a bit of testing), but it's such
an improvement that I would really recommend it. Even just for its
excellent error reporting and the ability to turn off MP when needed, so you can
step through the code in a non-parallel way (a godsend for debugging), it would
be a worthwhile addition.
Luca;
That makes good sense. bcbio-nextgen has the feature of turning multiprocessing off for single cores, for exactly the debugging reason you mention, but using an external library for this is an improvement.
I pushed a joblib-based implementation to the development version. Let me know if this helps with the issues you were seeing. I did manage to simulate a few lockups by randomly killing processes during a run. It seems to get stuck acquiring its internal lock. In reading the code it seems strange they use a threading Lock instead of a multiprocessing one:
https://github.com/joblib/joblib/blob/master/joblib/parallel.py#L489
but I haven't debugged further yet. Let me know how this works in your hands.
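The single-core fallback discussed above can be sketched as follows. This is a minimal illustration using only the standard library, not the bcbio-nextgen or joblib implementation; `run_jobs` and its `cores` parameter are hypothetical names chosen for this example.

```python
import multiprocessing


def run_jobs(fn, items, cores=1):
    """Apply fn to every item; serially when cores <= 1, otherwise in a pool.

    Hypothetical helper for illustration only.
    """
    items = list(items)
    if cores <= 1:
        # Serial path: no worker processes are started, so exceptions carry
        # ordinary tracebacks and a debugger can step straight into fn.
        return [fn(item) for item in items]
    # Parallel path: fn and the items must be picklable.
    with multiprocessing.Pool(processes=cores) as pool:
        return pool.map(fn, items)


if __name__ == "__main__":
    print(run_jobs(abs, [-1, 2, -3], cores=1))
```

Running with `cores=1` while debugging keeps the call stack in a single process; the same call with a larger `cores` value then distributes the work.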
Hmm... the one I'm seeing looks like a hang in bedtools / pybedtools instead. The step passes if the pipeline is run with no data, but hangs when there is already data present (the earlier pipeline run had failed because gemini wasn't installed).
This is what I got by attaching gdb to the hung processes (State: "[2013-07-08 08:47] ipython: piped_bamprep -- local; checkpoint passed")
Parent process:
Thread 1 (Thread 0x7f405ecff700 (LWP 19190)):
41 "{record.message}"])
42
43 log_dir = _get_log_dir(config)
44 if log_dir:
45 utils.safe_makedir(log_dir)
>46 time.sleep(1)
47 handlers.append(logbook.FileHandler(os.path.join(log_dir, "%s.log" % LOG_NAME),
48 format_string=format_str, level="INFO",
49 filter=_not_cl))
50 handlers.append(logbook.FileHandler(os.path.join(log_dir, "%s-debug.log" % LOG_NAME),
51 format_string=format_str, level="DEBUG", bubble=True,
Children are hung on ZMQ polling:
(gdb) bt
#0 0x00007f405dd890d3 in epoll_wait () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f405bbe637b in zmq::epoll_t::loop (this=0x3016620) at epoll.cpp:142
#2 0x00007f405bc02ac6 in thread_routine (arg_=0x3016690) at thread.cpp:83
#3 0x00007f405e8deb50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00007f405dd88a7d in clone () from /lib/x86_64-linux-gnu/libc.so.6
#5 0x0000000000000000 in ?? ()
Often the bedtools call also fails and hangs.
Note that this happens locally (but on an NFS-mounted volume) and only if I rerun a pipeline that was interrupted by an error (first it was snpEff, then gemini). If I start from scratch, it does not happen.
Another interesting observation: the process became "unstuck" after about 30 minutes:
[2013-07-08 09:06] Timing: alignment post-processing
[2013-07-08 09:06] ipython: piped_bamprep -- local; checkpoint passed
[2013-07-08 09:39] Timing: variant calling
Luca;
It doesn't sound like this is stuck, but rather processing to get up to the point where it can re-start processing. There is sometimes work that needs to be done to recover to the right state. Is the bcbio_nextgen process using processor, or idle?
On Monday 8 July 2013 03:12:20, Brad Chapman wrote:
> to be done to recover to the right state. Is the bcbio_nextgen process
> using processor, or idle?
Funnily enough, idle. There were also no noticeable uses of I/O (checked with
iotop).
The bedtools process it spawned was also in zombie state (killing at that
point, should one do it, would cause a segfault).
Since the pipeline failed again (I keep on forgetting all those gemini
data files ;) I'll try to monitor more closely once I restart it.
Luca;
My guess is that the process is waiting to read files from NFS. I'm not sure of a good way to monitor this short of looking at network traffic. On a re-run, bedtools should only be used for reading a BED file of regions to process, so I wouldn't expect any IO/processor intensive work there. If you find it locking up, rather than being slow, I'll be happy to dig further.
On Monday 8 July 2013 06:01:51, Brad Chapman wrote:
> My guess is that the process is waiting to read files from NFS. I'm not sure
> of a good way to monitor this short of looking at network traffic. On a
I wondered about that, although in this case it is "artificial NFS" (the VMs
all share the same physical hardware). Nevertheless, it may well be that.
I'll check network traffic.
Luca Beltrame - KDE Forums team
KDE Science supporter
GPG key ID: 6E1A4E79
At last I was able to rerun the pipeline (it's running right now) and network traffic seems negligible. I'll let it run overnight and check tomorrow.
Luca;
Can you check if that fix resolves it for you? Apologies, I thought through this and realized I introduced a delay by adding in a defensive sleep for network issues. I think this is the issue, but let me know if not.
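One way to stay defensive about slow network filesystems without paying a fixed delay on every call is to poll with a deadline. The sketch below is an assumption-laden illustration, not the fix that was actually committed; `wait_for_path`, `timeout`, and `interval` are hypothetical names.

```python
import os
import time


def wait_for_path(path, timeout=5.0, interval=0.1):
    """Return True as soon as path exists, or False once timeout seconds pass.

    Hypothetical helper: unlike an unconditional time.sleep(), it returns
    immediately when the path is already visible.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep only while the path is still missing.
        time.sleep(interval)
```

With this shape, the common case (the directory is already there) costs a single `os.path.exists` call, and the wait applies only when the filesystem is genuinely lagging.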
It does, thanks. It passed the step without a hitch now.