
bcbio-nextgen's People

Contributors

a113n, alexcoman, apastore, biocyberman, bogdang989, brainstorm, brentp, cbrueffer, chapmanb, guillermo-carrasco, hackdna, hammer, kern3020, lbeltrame, lpantano, mariogiov, matted, matthdsm, mistrm82, mjafin, naumenko-sa, peterjc, porterjamesj, roryk, skanwal, smoe, tanglingfung, tetianakh, vals, vladsavelyev

bcbio-nextgen's Issues

Increase timeout for hub connection (or --timeout does nothing?)

I have a cluster of 34 nodes running SGE, 32 of which are completely diskless over NFS: while most of the heavy-IO paths are in RAM, there is still a noticeable delay in getting an IPython cluster set up, because all the nodes access the NFS folder holding the work directory.

This is particularly evident when I want to use all the cores available (700+).

bcbio-nextgen keeps failing with:

IPython.parallel.error.TimeoutError: Hub connection request timed out

no matter how high I set the timeout option.

So, is the timeout option what I need, is this a shortcoming of IPython, or is the value hardcoded somewhere?
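
For reference, a minimal sketch of raising the hub connection timeout when constructing the parallel client directly; the profile name is made up, and whether bcbio's --timeout flag actually ends up here is exactly what I'm unsure about:

from IPython.parallel import Client

# timeout is in seconds; the default (about 10s) can be too short when engine
# registration is slow because the work directory lives on NFS.
rc = Client(profile="bcbio_sge", timeout=300)
print("%d engines connected" % len(rc.ids))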

capture performance metrics

Hi,

When variant_regions is set, the realigned BAM (realign_bam) gets 'trimmed' to those regions, so hs_metrics with hybrid_target set to the exome may not reflect the truth (e.g. off-target information would be missed).

I think the summary stats (hs_metrics) should be generated earlier (e.g. right after deduplication). What do you think?

Thanks,
Paul

installation error after xz -dc hg19-novoalign.tar.xz | tar -xvpf -

Hi Brad -

Sorry to keep hitting you with daily bug reports first thing in the morning. The cloudbiolinux fix in issue 6 (#6) allowed us to progress to the point where we were able to download the human reference, but we seem to have hit another error after the "xz -dc hg19-novoalign.tar.xz | tar -xvpf -" step. Running this command on its own seems to work, and the contents of the hg19 directory suggest the xz step succeeded:

$ ls /home/cbergman/data_directory/genomes/Hsapiens/hg19
bowtie2 bwa hg19-novoalign.tar.xz novoalign

Any ideas on what to do next?

Best regards,
Casey

[SNIP]
INFO: This is a Base Flavor - no overrides
DBG [init.py]: Minimal Edition 1.6.0
INFO: This is a minimal
INFO: Distribution ubuntu
INFO: Get local environment
INFO: Ubuntu setup
DBG [distribution.py]: Debian-shared setup
DBG [distribution.py]: Source=quantal
DBG [distribution.py]: NixPkgs: Ignored
[localhost] Login password for 'cbergman':
INFO: Now, testing connection to host...
INFO: Connection to host appears to work!
DBG [utils.py]: Expand paths
INFO: List of genomes to get (from the config file at '/home/cbergman/tmpbcbio-install/biodata.yaml'): Human (hg19), Human (GRCh37)
INFO: Downloading genome hg19 to /home/cbergman/data_directory/genomes/Hsapiens/hg19
INFO: Downloading genome hg19 to /home/cbergman/data_directory/genomes/Hsapiens/hg19
INFO: Downloading genome hg19 to /home/cbergman/data_directory/genomes/Hsapiens/hg19

Fatal error: run() received nonzero return code 2 while executing!

Requested: xz -dc hg19-novoalign.tar.xz | tar -xvpf -
Executed: /bin/bash --noprofile -i -l -c "cd /home/cbergman/data_directory/genomes/Hsapiens/hg19 && xz -dc hg19-novoalign.tar.xz | tar -xvpf -"

Aborting.
Traceback (most recent call last):
File "bcbio_nextgen_install.py", line 188, in
main(parser.parse_args())
File "bcbio_nextgen_install.py", line 43, in main
install_data(cbl["data_fabfile"], fabricrc, biodata)
File "bcbio_nextgen_install.py", line 84, in install_data
"-c", fabricrc, "install_data_s3:%s" % biodata])
File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['fab', '-f', '/home/cbergman/tmpbcbio-install/cloudbiolinux/data_fabfile.py', '-H', 'localhost', '-c', '/home/cbergman/tmpbcbio-install/fabricrc.txt', 'install_data_s3:/home/cbergman/tmpbcbio-install/biodata.yaml']' returned non-zero exit status 1

python-yaml, git and fabric dependencies

Working on a vanilla installation of BioLinux 7.0, we found some additional dependencies (python-yaml, git and fabric) whose absence prevented the bcbio_nextgen_install.py script from completing. It might be worth adding checks to see if these resources are installed and, if not, attempt to install them automatically.
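
A minimal sketch of such a pre-flight check (the helper names and the apt-get fallback are made up for illustration, not part of the installer):

import subprocess

def _have_module(name):
    # True if the Python module can be imported in the current interpreter
    try:
        __import__(name)
        return True
    except ImportError:
        return False

def _have_exe(name):
    # True if the executable is found on PATH
    return subprocess.call("which %s > /dev/null 2>&1" % name, shell=True) == 0

def ensure_dependencies():
    # map missing pieces to the Ubuntu packages providing them
    missing = []
    if not _have_module("yaml"):
        missing.append("python-yaml")
    if not _have_module("fabric"):
        missing.append("fabric")
    if not _have_exe("git"):
        missing.append("git")
    if missing:
        subprocess.check_call(["sudo", "apt-get", "install", "-y"] + missing)

ensure_dependencies()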

To get the bcbio_nextgen_install.py to run to the point where resources from CloudBioLinux were being installed, we needed to execute the following:

$ wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
$ sudo apt-get install python-yaml
$ sudo apt-get install git
$ sudo apt-get install fabric
$ python bcbio_nextgen_install.py install_directory data_directory

Our need to install python-yaml is a bit curious, since the documentation states that PyYAML is installed automatically: https://bcbio-nextgen.readthedocs.org/en/latest/contents/installation.html

sudo() received nonzero return code 1 error after GATK install

Hi Brad -

We tried the following install process on vanilla biolinux 7 but hit a new error:

sudo apt-get install python-yaml
sudo apt-get install git
wget https://raw.github.com/chapmanb/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py install_directory data_directory > bcbio_install_output_log

Any ideas?

-- Casey

[SNIP]
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml
DBG [config.py]: Using config file /home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/../config/r-libs.yaml

Fatal error: sudo() received nonzero return code 1 while executing!

Requested: ant gsalib
Executed: sudo -S -p 'sudo password:' /bin/bash --noprofile -i -l -c "cd /home/cbergman/tmp/cloudbiolinux/gatk && ant gsalib"

Aborting.
Traceback (most recent call last):
File "bcbio_nextgen_install.py", line 211, in
main(parser.parse_args())
File "bcbio_nextgen_install.py", line 43, in main
install_tools(cbl["tool_fabfile"], fabricrc, venv)
File "bcbio_nextgen_install.py", line 93, in install_tools
"install_biolinux:flavor=ngs_pipeline"])
File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/cbergman/data_directory/bcbio-nextgen-virtualenv/bin/fab', '-f', '/home/cbergman/tmpbcbio-install/cloudbiolinux/fabfile.py', '-H', 'localhost', '-c', '/home/cbergman/tmpbcbio-install/fabricrc.txt', 'install_biolinux:flavor=ngs_pipeline']' returned non-zero exit status 1

"Could not find a tag or branch '9f313b4' ..." through bcbio_nextgen_install.py

Hi, I tried the following command (I'm not in the sudoers list)

$ python bcbio_nextgen_install.py ~/workspace/usr/local ~/workspace/usr/local/share/bcbio-nextgen --nosudo --nodata --notools

and I end up with this error:

Downloading/unpacking ipython from git+https://github.com/ipython/ipython.git@9f313b4#egg=ipython (from -r https://raw.github.com/chapmanb/bcbio-nextgen/master/requirements.txt (line 14))
Cloning https://github.com/ipython/ipython.git (to 9f313b4) to /export/home/malik/mal_workspace/usr/local/share/bcbio-nextgen/bcbio-nextgen-virtualenv/build/ipython
Could not find a tag or branch '9f313b4', assuming commit.
Running setup.py egg_info for package ipython
Cannot build / install IPython with unclean submodules
Please update submodules with
python setup.py submodule
or
git submodule update
or commit any submodule changes you have made.
Complete output from command python setup.py egg_info:
Cannot build / install IPython with unclean submodules

Please update submodules with

python setup.py submodule

or

git submodule update

or commit any submodule changes you have made.


Command python setup.py egg_info failed with error code 1 in /export/home/malik/mal_workspace/usr/local/share/bcbio-nextgen/bcbio-nextgen-virtualenv/build/ipython
Storing complete log in /export/home/malik/.pip/pip.log
Traceback (most recent call last):
File "bcbio_nextgen_install.py", line 211, in
main(parser.parse_args())
File "bcbio_nextgen_install.py", line 48, in main
install_bcbio_nextgen(remotes, args.datadir, args.tooldir, args.sudo, venv)
File "bcbio_nextgen_install.py", line 80, in install_bcbio_nextgen
subprocess.check_call([venv["pip"], "install", "-r", remotes["requirements"]])
File "/usr/local/lib/python2.7/subprocess.py", line 504, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/export/home/malik/mal_workspace/usr/local/share/bcbio-nextgen/bcbio-nextgen-virtualenv/bin/pip', 'install', '-r', 'https://raw.github.com/chapmanb/bcbio-nextgen/master/requirements.txt']' returned non-zero exit status 1

Any help would be great, thanks in advance.

404 not found for cramtools-1.0.jar

When running bcbio_nextgen_install.py on a vanilla BioLinux 7.0 machine (plus python-yaml, fabric and git), the script fails when trying to install cram, apparently due to a 404 Not Found on the cramtools-1.0.jar file. It looks like cramtools-1.0.jar has been replaced by cramtools-2.0.jar on the cram GitHub site. We weren't able to identify where to make this change in the bcbio_nextgen_install.py pipeline and thought it might be easier to fix in the master branch. Error output is below:

DBG [fabfile.py]: Import cram
[localhost] sudo: mkdir -p /home/cbergman/install_directory/share/java/cram-1.0
[localhost] out: sudo password:
[localhost] out:
[localhost] sudo: ln -s /home/cbergman/install_directory/share/java/cram-1.0 /home/cbergman/install_directory/share/java/cram
[localhost] out: sudo password:
[localhost] out:
[localhost] run: echo $HOME
[localhost] out: /home/cbergman
[localhost] out:

[localhost] run: mkdir -p /home/cbergman/tmp/cloudbiolinux
[localhost] run: wget --no-check-certificate https://github.com/vadimzalunin/crammer/raw/master/cramtools-1.0.jar
[localhost] out: --2013-03-28 18:49:35-- https://github.com/vadimzalunin/crammer/raw/master/cramtools-1.0.jar
[localhost] out: Resolving github.com (github.com)... 207.97.227.239
[localhost] out: Connecting to github.com (github.com)|207.97.227.239|:443... connected.
[localhost] out: HTTP request sent, awaiting response... 404 Not Found
[localhost] out: 2013-03-28 18:49:37 ERROR 404: Not Found.
[localhost] out:
[localhost] out:

Fatal error: run() received nonzero return code 8 while executing!

Requested: wget --no-check-certificate https://github.com/vadimzalunin/crammer/raw/master/cramtools-1.0.jar
Executed: /bin/bash --noprofile -i -l -c "cd /home/cbergman/tmp/cloudbiolinux && wget --no-check-certificate https://github.com/vadimzalunin/crammer/raw/master/cramtools-1.0.jar"

Aborting.
Disconnecting from localhost... done.
Traceback (most recent call last):
File "bcbio_nextgen_install.py", line 188, in
main(parser.parse_args())
File "bcbio_nextgen_install.py", line 40, in main
install_tools(cbl["tool_fabfile"], fabricrc)
File "bcbio_nextgen_install.py", line 80, in install_tools
"install_biolinux:flavor=ngs_pipeline"])
File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['fab', '-f', '/home/cbergman/tmpbcbio-install/cloudbiolinux/fabfile.py', '-H', 'localhost', '-c', '/home/cbergman/tmpbcbio-install/fabricrc.txt', 'install_biolinux:flavor=ngs_pipeline']' returned non-zero exit status 1
cbergman@cbergman-VirtualBox[cbergman]

Document how the different flavors of bwa are used

My cluster is capable of using bwa mem without issues; however, the pipeline switches to bwa aln. Perhaps the documentation should cover:

a. how the different types of bwa alignment are chosen;
b. how to force mem or aln, for example.

variant_regions configuration ignored for many steps

I've reviewed the results produced by both my MuTect code and the generic GATK variant caller, and I noticed calls in regions outside the ones I had specified in the configuration.

Likewise, the pipeline does not seem to make use of this file during preprocessing and recalibration: with my data set (multiplexed targeted sequencing) it should, since the data set is very small and, except for alignment errors, there is no material outside the regions defined in the BED file I supplied.

Is there a way to check whether it was used or not?

Is --noprofile really necessary?

Why is it necessary to have items installed using a subshell called with the --noprofile flag? On our cluster, it is very convenient to set environment variables to isolate applications. The install script for bcbio-nextgen makes this very difficult, as it forces all dependencies (libraries, cmake, python) to be located in /usr/local or higher. Is there a way of preserving environment variables during the installation?
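
As an illustration of the kind of workaround we are after (entirely hypothetical; the paths are made up), the variables could be re-exported in front of the --noprofile login shell so they survive even though profile files are skipped:

import subprocess

# Made-up module-managed locations we would like the build to see.
extra_env = {"PATH": "/opt/apps/cmake/bin:/usr/bin:/bin",
             "LD_LIBRARY_PATH": "/opt/apps/libs/lib"}
exports = " ".join("%s=%s" % (k, v) for k, v in extra_env.items())

# Prefixing with env keeps the variables, even though --noprofile means the
# profile files that would normally set them are never read.
cmd = 'env %s /bin/bash --noprofile -l -c "cmake --version"' % exports
subprocess.check_call(cmd, shell=True)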

Getting started problem

Hello,
I'm trying to work with bcbio-nextgen again.
I'm following the example pipeline from the getting started section: https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#exome-with-validation-against-reference-materials
I've modified the paths to the tools and data/config files and started the pipeline.
In bcbio-nextgen-commands.log I found the following command:

/data/bcbio/tools/bin/novoalign -o SAM '@RG\tID:1\tPL:illumina\tPU:1_2013-04-03_methodcmp\tSM:NA12878-1' -d /data/bcbio/data/genomes/Hsapiens/GRCh37/novoalign/GRCh37 -f /data/bcbio/data/exp1/work/../input/NA12878-NGv3-LAB1360-A_1.fastq.gz /data/bcbio/data/exp1/work/../input/NA12878-NGv3-LAB1360-A_2.fastq.gz   -c 4 -r None -o FullNW -k | /usr/bin/samtools view -b -S -u - | /usr/bin/samtools sort -@ 4 -m 2G - /data/bcbio/data/exp1/work/align/NA12878-1/tx/tmpdT_so1/1_2013-04-03_methodcmp-sort

And this gives me an error:

[2013-06-10 17:48]  Found YAML samplesheet, using ../config/NA12878-exome-methodcmp.yaml instead of Galaxy API
[2013-06-10 17:48]  Preparing 1_2013-04-03_methodcmp
[2013-06-10 17:48]  Preparing 2_2013-04-03_methodcmp
[2013-06-10 17:48]  Preparing 3_2013-04-03_methodcmp
[2013-06-10 17:48]  Preparing 4_2013-04-03_methodcmp
[2013-06-10 17:48]  multiprocessing: align_prep_full
[2013-06-10 17:48]  Aligning lane 1_2013-04-03_methodcmp with novoalign aligner
[2013-06-10 17:48]  Novoalign: NA12878-1
[2013-06-10 17:48]  sort: invalid option -- '@'
[2013-06-10 17:48]  open: No such file or directory
[2013-06-10 17:48]  [bam_sort_core] fail to open file 4
[2013-06-10 17:48]  # License expires in 24 days.
[2013-06-10 17:48]  # novoalign (V3.00.02 - Build Apr  1 2013 @ 09:45:36 - A short read aligner with qualities.
[2013-06-10 17:48]  # (C) 2008,2009,2010,2011 NovoCraft Technologies Sdn Bhd.
[2013-06-10 17:48]  # License file: /data/bcbio/tools/bin/novoalign.lic
[2013-06-10 17:48]  # Licensed to Genome Assembly Algorithms Laboratory@University ITMO
[2013-06-10 17:48]  #  novoalign -o SAM @RG\tID:1\tPL:illumina\tPU:1_2013-04-03_methodcmp\tSM:NA12878-1 -d /data/bcbio/data/genomes/Hsapiens/GRCh37/novoalign/GRCh37 -f /data/bcbio/data/exp1/work/../input/NA12878-NGv3-LAB1360-A_1.fastq.gz /data/bcbio/data/exp1/work/../input/NA12878-NGv3-LAB1360-A_2.fastq.gz -c 4 -r None -o FullNW -k
[2013-06-10 17:48]  # Starting at Mon Jun 10 21:48:42 2013
[2013-06-10 17:48]  # Interpreting input files as Illumina FASTQ, Casava Pipeline 1.8 & later.
[2013-06-10 17:49]  # Index Build Version: 2.7
[2013-06-10 17:49]  # Hash length: 14
[2013-06-10 17:49]  # Step size: 2
[2013-06-10 17:49]  [samopen] SAM header is present: 84 sequences.
[2013-06-11 07:50]  #       Paired Reads: 44078425
[2013-06-11 07:50]  #      Pairs Aligned: 43418004
[2013-06-11 07:50]  #     Read Sequences: 88156850
[2013-06-11 07:50]  #            Aligned: 87427223
[2013-06-11 07:50]  #   Unique Alignment: 77731327
[2013-06-11 07:50]  #   Gapped Alignment:  1498559
[2013-06-11 07:50]  #     Quality Filter:   264149
[2013-06-11 07:50]  # Homopolymer Filter:     2311
[2013-06-11 07:50]  #       Elapsed Time: 50498.020 (secs.)
[2013-06-11 07:50]  #           CPU Time: 3326.46 (min.)
[2013-06-11 07:50]  # Fragment Length Distribution
[2013-06-11 07:50]  #   From    To  Count
[2013-06-11 07:50]  #   30  44  4
[2013-06-11 07:50]  #   45  59  3269
[2013-06-11 07:50]  #   60  74  22701
[2013-06-11 07:50]  #   75  89  72663
[2013-06-11 07:50]  #   90  104 289169
[2013-06-11 07:50]  #   105 119 781818
[2013-06-11 07:50]  #   120 134 1862926
[2013-06-11 07:50]  #   135 149 3393752
[2013-06-11 07:50]  #   150 164 4624425
[2013-06-11 07:50]  #   165 179 4959152
[2013-06-11 07:50]  #   180 194 4516583
[2013-06-11 07:50]  #   195 209 3727953
[2013-06-11 07:50]  #   210 224 2917937
[2013-06-11 07:50]  #   225 239 2225183
[2013-06-11 07:50]  #   240 254 1676576
[2013-06-11 07:50]  #   255 269 1249882
[2013-06-11 07:50]  #   270 284 932839
[2013-06-11 07:50]  #   285 299 692617
[2013-06-11 07:50]  #   300 314 517169
[2013-06-11 07:50]  #   315 329 386262
[2013-06-11 07:50]  #   330 344 290154
[2013-06-11 07:50]  #   345 359 217423
[2013-06-11 07:50]  #   360 374 165495
[2013-06-11 07:50]  #   375 389 125020
[2013-06-11 07:50]  #   390 404 94671
[2013-06-11 07:50]  #   405 419 71954
[2013-06-11 07:50]  #   420 434 54724
[2013-06-11 07:50]  #   435 449 42020
[2013-06-11 07:50]  #   450 464 31922
[2013-06-11 07:50]  #   465 479 25056
[2013-06-11 07:50]  #   480 494 19614
[2013-06-11 07:50]  #   495 509 15172
[2013-06-11 07:50]  #   510 524 12195
[2013-06-11 07:50]  #   525 539 9614
[2013-06-11 07:50]  #   540 554 4156
[2013-06-11 07:50]  #   555 569 117
[2013-06-11 07:50]  #   570 584 0
[2013-06-11 07:50]  #   585 599 0
[2013-06-11 07:50]  #   600 614 0
[2013-06-11 07:50]  #   615 629 0
[2013-06-11 07:50]  #   630 644 1
[2013-06-11 07:50]  #   645 659 1
[2013-06-11 07:50]  # Mean   197, Std Dev  57.9
[2013-06-11 07:50]  # Done at Tue Jun 11 11:50:38 2013
[2013-06-11 07:50]  Did not find non-empty output file /data/bcbio/data/exp1/work/align/NA12878-1/tx/tmpdT_so1/1_2013-04-03_methodcmp-sort.bam
[2013-06-11 07:50]  Uncaught exception occurred
Traceback (most recent call last):
  File "/data/bcbio/data/bcbio-nextgen-virtualenv/local/lib/python2.7/site-packages/bcbio/provenance/do.py", line 21, in run
    _do_run(cmd, checks)
  File "/data/bcbio/data/bcbio-nextgen-virtualenv/local/lib/python2.7/site-packages/bcbio/provenance/do.py", line 52, in _do_run
    raise IOError("External command failed")
IOError: External command failed
......

In short, it doesn't create the output file for the first stage.
It also looks like the command isn't recognized correctly:

[2013-06-10 17:48]  Novoalign: NA12878-1
[2013-06-10 17:48]  sort: invalid option -- '@'
[2013-06-10 17:48]  open: No such file or directory
[2013-06-10 17:48]  [bam_sort_core] fail to open file 4

Probably I should start with a simpler example to better understand the format of the config files and the tool-specific options.

bcbio.variation throws error

Hi,

When I try to run bcbio.variation's variant-compare, it throws the following error:

lein variant-compare variant_comp.yml

'variant-compare' is not a task. See 'lein help'

Kindly suggest what I am doing wrong.

java.lang.Exception: jar file corrupt: /share/ircf/ircfapps/share/java/jdk1.6.0_27_x64/jre/lib/ext/log4j-1.2.16.jar
at bultitude.core$namespaces_in_jar.invoke(core.clj:62)
at bultitude.core$file__GT_namespaces.invoke(core.clj:98)
at clojure.lang.AFn.applyToHelper(AFn.java:163)
at clojure.lang.AFn.applyTo(AFn.java:151)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$partial$fn__4190.doInvoke(core.clj:2396)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$map$fn__4207.invoke(core.clj:2487)
at clojure.lang.LazySeq.sval(LazySeq.java:42)
at clojure.lang.LazySeq.seq(LazySeq.java:60)
at clojure.lang.RT.seq(RT.java:484)
at clojure.core$seq.invoke(core.clj:133)
at clojure.core$apply.invoke(core.clj:617)
at clojure.core$mapcat.doInvoke(core.clj:2514)
at clojure.lang.RestFn.invoke(RestFn.java:423)
at bultitude.core$namespaces_on_classpath.doInvoke(core.clj:112)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at leiningen.core.main$tasks.invoke(main.clj:108)
at leiningen.core.main$task_not_found$fn__2127.invoke(main.clj:127)
at leiningen.core.main$task_not_found.doInvoke(main.clj:125)
at clojure.lang.RestFn.invoke(RestFn.java:410)
at clojure.lang.Var.invoke(Var.java:415)
at leiningen.core.main$resolve_task.invoke(main.clj:150)
at leiningen.core.main$resolve_task.invoke(main.clj:151)
at leiningen.core.main$apply_task.invoke(main.clj:180)
at leiningen.core.main$resolve_and_apply.invoke(main.clj:192)
at leiningen.core.main$_main$fn__2223.invoke(main.clj:256)
at leiningen.core.main$_main.doInvoke(main.clj:246)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at clojure.lang.Var.invoke(Var.java:419)
at clojure.lang.AFn.applyToHelper(AFn.java:163)
at clojure.lang.Var.applyTo(Var.java:532)
at clojure.core$apply.invoke(core.clj:617)
at clojure.main$main_opt.invoke(main.clj:335)
at clojure.main$main.doInvoke(main.clj:440)
at clojure.lang.RestFn.invoke(RestFn.java:457)
at clojure.lang.Var.invoke(Var.java:427)
at clojure.lang.AFn.applyToHelper(AFn.java:172)
at clojure.lang.Var.applyTo(Var.java:532)
at clojure.main.main(main.java:37)
Caused by: java.util.zip.ZipException: error in opening zip file
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.<init>(ZipFile.java:127)
at java.util.jar.JarFile.<init>(JarFile.java:135)
--More--(81%)

extract out jar caller stuff

It would be nice if some of the jar machinery was pulled out and made available as an external library.

Here's a start

"""
>>> c = Jar('GATK.jar')
>>> c.UnifiedGenoTyper(I='input.bam', O='output.bam')
'java -Xmx 2G -jar /home/brentp/GATK.jar UnifiedGenoTyper I=input.bam O=output.bam'

>>> c = JarDir('/usr/local/src/picard/picard-tools-1.93/')
>>> sorted(c.jars)[:5]
['AddOrReplaceReadGroups', 'BamIndexStats', 'BamToBfq', 'BuildBamIndex', 'CalculateHsMetrics']

>>> c.MarkDuplicates
JarCommand('java -Xmx 2G -jar /usr/local/src/picard/picard-tools-1.93/MarkDuplicates.jar ')

>>> c.MarkDuplicates(I="input.bam", O="output.bam")
'java -Xmx 2G -jar /usr/local/src/picard/picard-tools-1.93/MarkDuplicates.jar  I=input.bam O=output.bam'

"""
import os
from glob import glob

def expand_path(p):
    # expand ~ and environment variables, then make the path absolute
    jar = os.path.expandvars(os.path.expanduser(p))
    return os.path.abspath(jar)

class JarCommand(object):

    def __init__(self, jar, cmd):
        self.cmd = cmd
        self.jar = jar

    def __call__(self, **kwargs):
        return "%s %s %s" % (str(self.jar), self.cmd,
                " ".join(("%s=%s" % (k, v)) for k, v in kwargs.iteritems()))

    def __repr__(self):
        return "JarCommand('{jar} {cmd}')".format(jar=self.jar,
                cmd=self.cmd.strip())


class JarDir(object):
    def __init__(self, dir, mx='2G'):
        self.dir = expand_path(dir)
        self.jars = [os.path.splitext(os.path.basename(p))[0].replace("-",
            "_").replace(".", "DOT") for p in glob("%s/*.jar" % self.dir)]
        self.mx = mx

    def __getattr__(self, jar_path):
        assert jar_path in self.jars
        jar_path = os.path.join(self.dir, jar_path) + ".jar"
        jar = Jar(jar_path, self.mx)
        return JarCommand(jar, '')


class Jar(object):
    available_commands = None

    def __init__(self, jar, mx='2G'):

        self.jar = expand_path(jar)
        self.mx = "-Xmx %s" % mx

    def __str__(self):
        return "java {mx} -jar {jar}".format(jar=self.jar, mx=self.mx)

    def __getattr__(self, cmd):
        if self.available_commands:
            assert cmd in self.available_commands
        return JarCommand(self,  cmd)

class GATK(Jar):
    available_commands = ["Unified", "ASDF"]

if __name__ == "__main__":
    import doctest
    doctest.testmod()

Documentation for function parameters

Given that most of the functions are called from tasks that just use *args, it is important to know the meaning of the parameters.

For example:

def variantcall_sample(data, region=None, out_file=None):
    """Parallel entry point for doing genotyping of a region of a sample.
    """

In this case it is not clear what data is. I ran into this issue while figuring out how to add paired support for MuTect. I can probably add some of the definitions via PR if I get at least a rough idea of what they do. I'll start adding what is self-documenting.
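
As a starting point, this is the kind of docstring I have in mind; the parameter descriptions below are my guesses from reading the code, not authoritative:

def variantcall_sample(data, region=None, out_file=None):
    """Parallel entry point for doing genotyping of a region of a sample.

    data -- per-sample dictionary with the run configuration, reference files
            and the prepared alignment for this sample (my reading; needs
            confirmation)
    region -- restriction of the calling to a genomic region (a contig name or
              a (contig, start, end) tuple), or None for everything (guess)
    out_file -- path of the VCF to write; derived from the work directory
                when None (guess)
    """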

SandboxViolation during install.

I've installed previous versions of the pipeline (most recently about 3 weeks ago) using the following commands in the bcbio-nextgen directory:

python setup.py build
python setup.py install

However, there is now a

error: SandboxViolation: open('/dev/null', 'w') {}

during the install, and the error message further says that the setup script tried to modify files outside the EasyInstall build area, suggesting that I flag it as an issue. Incidentally, I can run

pip install numpy

so I am assuming the problem may lie in setup.py. I apologize in advance if this is actually an issue with my version of easy_install or some other Python package.

Thanks!

The full output is below:

running install
running bdist_egg
running egg_info
writing requirements to bcbio_nextgen.egg-info/requires.txt
writing bcbio_nextgen.egg-info/PKG-INFO
writing namespace_packages to bcbio_nextgen.egg-info/namespace_packages.txt
writing top-level names to bcbio_nextgen.egg-info/top_level.txt
writing dependency_links to bcbio_nextgen.egg-info/dependency_links.txt
reading manifest file 'bcbio_nextgen.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.md'
writing manifest file 'bcbio_nextgen.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/bcbio
copying build/lib/bcbio/utils.py -> build/bdist.linux-x86_64/egg/bcbio
copying build/lib/bcbio/init.py -> build/bdist.linux-x86_64/egg/bcbio
creating build/bdist.linux-x86_64/egg/bcbio/upload
copying build/lib/bcbio/upload/s3.py -> build/bdist.linux-x86_64/egg/bcbio/upload
copying build/lib/bcbio/upload/shared.py -> build/bdist.linux-x86_64/egg/bcbio/upload
copying build/lib/bcbio/upload/filesystem.py -> build/bdist.linux-x86_64/egg/bcbio/upload
copying build/lib/bcbio/upload/galaxy.py -> build/bdist.linux-x86_64/egg/bcbio/upload
copying build/lib/bcbio/upload/init.py -> build/bdist.linux-x86_64/egg/bcbio/upload
creating build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/ipythontasks.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/split.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/ipython.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/manage.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/transaction.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/lsf.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/multitasks.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/messaging.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/sge.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/init.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
copying build/lib/bcbio/distributed/tasks.py -> build/bdist.linux-x86_64/egg/bcbio/distributed
creating build/bdist.linux-x86_64/egg/bcbio/broad
copying build/lib/bcbio/broad/metrics.py -> build/bdist.linux-x86_64/egg/bcbio/broad
copying build/lib/bcbio/broad/picardrun.py -> build/bdist.linux-x86_64/egg/bcbio/broad
copying build/lib/bcbio/broad/init.py -> build/bdist.linux-x86_64/egg/bcbio/broad
creating build/bdist.linux-x86_64/egg/bcbio/structural
copying build/lib/bcbio/structural/hydra.py -> build/bdist.linux-x86_64/egg/bcbio/structural
copying build/lib/bcbio/structural/init.py -> build/bdist.linux-x86_64/egg/bcbio/structural
creating build/bdist.linux-x86_64/egg/bcbio/hmmer
copying build/lib/bcbio/hmmer/search.py -> build/bdist.linux-x86_64/egg/bcbio/hmmer
copying build/lib/bcbio/hmmer/init.py -> build/bdist.linux-x86_64/egg/bcbio/hmmer
creating build/bdist.linux-x86_64/egg/bcbio/log
copying build/lib/bcbio/log/init.py -> build/bdist.linux-x86_64/egg/bcbio/log
creating build/bdist.linux-x86_64/egg/bcbio/galaxy
copying build/lib/bcbio/galaxy/api.py -> build/bdist.linux-x86_64/egg/bcbio/galaxy
copying build/lib/bcbio/galaxy/init.py -> build/bdist.linux-x86_64/egg/bcbio/galaxy
creating build/bdist.linux-x86_64/egg/bcbio/ngsalign
copying build/lib/bcbio/ngsalign/tophat.py -> build/bdist.linux-x86_64/egg/bcbio/ngsalign
copying build/lib/bcbio/ngsalign/novoalign.py -> build/bdist.linux-x86_64/egg/bcbio/ngsalign
copying build/lib/bcbio/ngsalign/split.py -> build/bdist.linux-x86_64/egg/bcbio/ngsalign
copying build/lib/bcbio/ngsalign/bowtie.py -> build/bdist.linux-x86_64/egg/bcbio/ngsalign
copying build/lib/bcbio/ngsalign/bwa.py -> build/bdist.linux-x86_64/egg/bcbio/ngsalign
copying build/lib/bcbio/ngsalign/bowtie2.py -> build/bdist.linux-x86_64/egg/bcbio/ngsalign
copying build/lib/bcbio/ngsalign/mosaik.py -> build/bdist.linux-x86_64/egg/bcbio/ngsalign
copying build/lib/bcbio/ngsalign/init.py -> build/bdist.linux-x86_64/egg/bcbio/ngsalign
creating build/bdist.linux-x86_64/egg/bcbio/bam
copying build/lib/bcbio/bam/callable.py -> build/bdist.linux-x86_64/egg/bcbio/bam
copying build/lib/bcbio/bam/trim.py -> build/bdist.linux-x86_64/egg/bcbio/bam
copying build/lib/bcbio/bam/fastq.py -> build/bdist.linux-x86_64/egg/bcbio/bam
copying build/lib/bcbio/bam/cram.py -> build/bdist.linux-x86_64/egg/bcbio/bam
copying build/lib/bcbio/bam/init.py -> build/bdist.linux-x86_64/egg/bcbio/bam
copying build/lib/bcbio/bam/counts.py -> build/bdist.linux-x86_64/egg/bcbio/bam
creating build/bdist.linux-x86_64/egg/bcbio/solexa
copying build/lib/bcbio/solexa/samplesheet.py -> build/bdist.linux-x86_64/egg/bcbio/solexa
copying build/lib/bcbio/solexa/flowcell.py -> build/bdist.linux-x86_64/egg/bcbio/solexa
copying build/lib/bcbio/solexa/init.py -> build/bdist.linux-x86_64/egg/bcbio/solexa
creating build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/alignment.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/demultiplex.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/lane.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/sample.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/region.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/run_info.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/cleanbam.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/shared.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/fastq.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/variation.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/config_utils.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/qcsummary.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/init.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/main.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/storage.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/merge.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
copying build/lib/bcbio/pipeline/toplevel.py -> build/bdist.linux-x86_64/egg/bcbio/pipeline
creating build/bdist.linux-x86_64/egg/bcbio/rnaseq
copying build/lib/bcbio/rnaseq/cufflinks.py -> build/bdist.linux-x86_64/egg/bcbio/rnaseq
copying build/lib/bcbio/rnaseq/init.py -> build/bdist.linux-x86_64/egg/bcbio/rnaseq
copying build/lib/bcbio/rnaseq/count.py -> build/bdist.linux-x86_64/egg/bcbio/rnaseq
creating build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/effects.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/phasing.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/split.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/freebayes.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/ensemble.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/varscan.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/samtools.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/cortex.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/annotation.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/bamprep.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/multi.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/recalibrate.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/init.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/genotype.py -> build/bdist.linux-x86_64/egg/bcbio/variation
copying build/lib/bcbio/variation/realign.py -> build/bdist.linux-x86_64/egg/bcbio/variation
creating build/bdist.linux-x86_64/egg/bcbio/picard
copying build/lib/bcbio/picard/metrics.py -> build/bdist.linux-x86_64/egg/bcbio/picard
copying build/lib/bcbio/picard/utils.py -> build/bdist.linux-x86_64/egg/bcbio/picard
copying build/lib/bcbio/picard/init.py -> build/bdist.linux-x86_64/egg/bcbio/picard
creating build/bdist.linux-x86_64/egg/bcbio/workflow
copying build/lib/bcbio/workflow/xprize.py -> build/bdist.linux-x86_64/egg/bcbio/workflow
copying build/lib/bcbio/workflow/stormseq.py -> build/bdist.linux-x86_64/egg/bcbio/workflow
copying build/lib/bcbio/workflow/init.py -> build/bdist.linux-x86_64/egg/bcbio/workflow
byte-compiling build/bdist.linux-x86_64/egg/bcbio/utils.py to utils.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/upload/s3.py to s3.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/upload/shared.py to shared.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/upload/filesystem.py to filesystem.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/upload/galaxy.py to galaxy.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/upload/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/ipythontasks.py to ipythontasks.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/split.py to split.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/ipython.py to ipython.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/manage.py to manage.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/transaction.py to transaction.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/lsf.py to lsf.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/multitasks.py to multitasks.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/messaging.py to messaging.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/sge.py to sge.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/distributed/tasks.py to tasks.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/broad/metrics.py to metrics.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/broad/picardrun.py to picardrun.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/broad/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/structural/hydra.py to hydra.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/structural/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/hmmer/search.py to search.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/hmmer/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/log/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/galaxy/api.py to api.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/galaxy/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/ngsalign/tophat.py to tophat.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/ngsalign/novoalign.py to novoalign.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/ngsalign/split.py to split.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/ngsalign/bowtie.py to bowtie.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/ngsalign/bwa.py to bwa.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/ngsalign/bowtie2.py to bowtie2.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/ngsalign/mosaik.py to mosaik.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/ngsalign/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/bam/callable.py to callable.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/bam/trim.py to trim.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/bam/fastq.py to fastq.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/bam/cram.py to cram.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/bam/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/bam/counts.py to counts.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/solexa/samplesheet.py to samplesheet.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/solexa/flowcell.py to flowcell.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/solexa/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/alignment.py to alignment.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/demultiplex.py to demultiplex.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/lane.py to lane.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/sample.py to sample.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/region.py to region.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/run_info.py to run_info.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/cleanbam.py to cleanbam.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/shared.py to shared.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/fastq.py to fastq.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/variation.py to variation.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/config_utils.py to config_utils.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/qcsummary.py to qcsummary.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/main.py to main.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/storage.py to storage.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/merge.py to merge.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/pipeline/toplevel.py to toplevel.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/rnaseq/cufflinks.py to cufflinks.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/rnaseq/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/rnaseq/count.py to count.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/effects.py to effects.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/phasing.py to phasing.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/split.py to split.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/freebayes.py to freebayes.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/ensemble.py to ensemble.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/varscan.py to varscan.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/samtools.py to samtools.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/cortex.py to cortex.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/annotation.py to annotation.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/bamprep.py to bamprep.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/multi.py to multi.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/recalibrate.py to recalibrate.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/genotype.py to genotype.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/variation/realign.py to realign.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/picard/metrics.py to metrics.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/picard/utils.py to utils.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/picard/init.py to init.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/workflow/xprize.py to xprize.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/workflow/stormseq.py to stormseq.pyc
byte-compiling build/bdist.linux-x86_64/egg/bcbio/workflow/init.py to init.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
installing scripts to build/bdist.linux-x86_64/egg/EGG-INFO/scripts
running install_scripts
running build_scripts
creating build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/bcbio_nextgen.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/bam_to_wiggle.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/barcode_sort_trim.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/illumina_finished_msg.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/nextgen_analysis_server.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
copying build/scripts-2.7/solexa_qseq_to_fastq.py -> build/bdist.linux-x86_64/egg/EGG-INFO/scripts
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/bcbio_nextgen.py to 775
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/bam_to_wiggle.py to 775
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/barcode_sort_trim.py to 775
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/illumina_finished_msg.py to 775
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/nextgen_analysis_server.py to 775
changing mode of build/bdist.linux-x86_64/egg/EGG-INFO/scripts/solexa_qseq_to_fastq.py to 775
copying bcbio_nextgen.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bcbio_nextgen.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bcbio_nextgen.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bcbio_nextgen.egg-info/namespace_packages.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bcbio_nextgen.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying bcbio_nextgen.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
zip_safe flag not set; analyzing archive contents...
creating 'dist/bcbio_nextgen-0.6.3a-py2.7.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing bcbio_nextgen-0.6.3a-py2.7.egg
Removing /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/lib/python2.7/site-packages/bcbio_nextgen-0.6.3a-py2.7.egg
Copying bcbio_nextgen-0.6.3a-py2.7.egg to /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/lib/python2.7/site-packages
bcbio-nextgen 0.6.3a is already the active version in easy-install.pth
Installing solexa_qseq_to_fastq.py script to /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/bin
Installing bam_to_wiggle.py script to /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/bin
Installing barcode_sort_trim.py script to /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/bin
Installing illumina_finished_msg.py script to /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/bin
Installing bcbio_nextgen.py script to /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/bin
Installing nextgen_analysis_server.py script to /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/bin

Installed /mnt/iscsi_speed/devel/pipeline/all_nodes_pipeline_env/lib/python2.7/site-packages/bcbio_nextgen-0.6.3a-py2.7.egg
Processing dependencies for bcbio-nextgen==0.6.3a
Searching for numpy>=1.7.0
Reading http://pypi.python.org/simple/numpy/
Reading http://numpy.scipy.org
Reading http://sourceforge.net/project/showfiles.php?group_id=1369&package_id=175103
Reading http://www.numpy.org
Reading http://sourceforge.net/projects/numpy/files/NumPy/
Reading http://numeric.scipy.org
Best match: numpy 1.7.0
Downloading http://pypi.python.org/packages/source/n/numpy/numpy-1.7.0.zip#md5=ca27913c59393940e880fab420f985b4
Processing numpy-1.7.0.zip
Running numpy-1.7.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-83_HXo/numpy-1.7.0/egg-dist-tmp-XJTB6D
Running from numpy source directory.
error: SandboxViolation: open('/dev/null', 'w') {}

The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.

This package cannot be safely installed by EasyInstall, and may not
support alternate installation locations even if you run its setup
script by hand. Please inform the package's author and the EasyInstall
maintainers to find out if a fix or workaround is available.

piped commands that fail don't trigger an error in subprocess.check_call

With a command like this in ngsalign/bwa.py:

                cmd = ("{bwa} mem -M -t {num_cores} -R '{rg_info}' -v 1 {ref_file} "
                       "{fastq_file} {pair_file} "
                       "| {samtools} view -b -S -u - "
                       "| {samtools} sort -@ {num_cores} -m {max_mem} - {tx_out_prefix}")

If bwa mem fails, the pipe keeps running and the error only surfaces farther down the pipeline: the BAM transaction completes, and something later that expects a valid BAM file errors out. Because this part of the transaction completes, it is harder to debug. I'm not sure what is portable, but in bash we can make the check_call fail by using:

                cmd = ("set -o pipefail && "
                       "{bwa} mem -M -t {num_cores} -R '{rg_info}' -v 1 {ref_file} "
                       "{fastq_file} {pair_file} "
                       "| {samtools} view -b -S -u - "
                       "| {samtools} sort -@ {num_cores} -m {max_mem} - {tx_out_prefix}")

where we added set -o pipefail at the start.
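
A small self-contained demonstration of the behaviour with generic commands (not the actual bcbio call):

import subprocess

# Under plain bash, a pipeline reports only the last command's exit status,
# so `false | cat` exits 0 and check_call would not raise.
print(subprocess.call(["/bin/bash", "-c", "false | cat"]))                     # -> 0

# With pipefail, the failing left-hand command propagates, so check_call
# would raise CalledProcessError here.
print(subprocess.call(["/bin/bash", "-c", "set -o pipefail && false | cat"]))  # -> 1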

upload

It appears to be a good idea to upload program.txt, bcbio_sample.yaml, and project-summary.csv to the sample folder (where it may also be better to rename them with fc_date and fc_name).

installation error at install_data_s3

Brad -

Following the recipe here (#5 (comment)) on BioLinux 7, we encountered the following error during installation of the test data (install_data_s3):

[SNIP]
INFO: Custom install for 'stampy' end time: 2013-04-04 11:30:05.473528; duration: 0:00:00.295210
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Config end time: 2013-04-04 11:30:05.475118; duration: 0:12:11.246524
INFO: Distribution ubuntu
INFO: Get local environment
INFO: Ubuntu setup
DBG [distribution.py]: Debian-shared setup
DBG [distribution.py]: Source=quantal
DBG [distribution.py]: Checking target distribution ubuntu
[localhost] Login password for 'cbergman':
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/fabric/main.py", line 739, in main
*args, **kwargs
File "/usr/local/lib/python2.7/dist-packages/fabric/tasks.py", line 316, in execute
multiprocessing
File "/usr/local/lib/python2.7/dist-packages/fabric/tasks.py", line 213, in _execute
return task.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/fabric/tasks.py", line 123, in run
return self.wrapped(*args, **kwargs)
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/data_fabfile.py", line 66, in install_data_s3
setup_environment()
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/data_fabfile.py", line 41, in setup_environment
_setup_distribution_environment()
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/distribution.py", line 31, in _setup_distribution_environment
_validate_target_distribution(env.distribution, env.get('dist_name', None))
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/distribution.py", line 77, in _validate_target_distribution
if env.edition.short_name in ["minimal"]:
AttributeError: 'str' object has no attribute 'short_name'
Traceback (most recent call last):
File "bcbio_nextgen_install.py", line 188, in
main(parser.parse_args())
File "bcbio_nextgen_install.py", line 43, in main
install_data(cbl["data_fabfile"], fabricrc, biodata)
File "bcbio_nextgen_install.py", line 84, in install_data
"-c", fabricrc, "install_data_s3:%s" % biodata])
File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['fab', '-f', '/home/cbergman/tmpbcbio-install/cloudbiolinux/data_fabfile.py', '-H', 'localhost', '-c', '/home/cbergman/tmpbcbio-install/fabricrc.txt', 'install_data_s3:/home/cbergman/tmpbcbio-install/biodata.yaml']' returned non-zero exit status 1

The tail of our log file reads:

$tail -n 20 bcbio_install_output_log

[localhost] sudo: cp bin/* /home/cbergman/install_directory/bin
[localhost] out: sudo password:
[localhost] out:
[localhost] sudo: cp lib/* /home/cbergman/install_directory/lib
[localhost] out: sudo password:
[localhost] out:
[localhost] run: rm -rf /home/cbergman/tmp/cloudbiolinux

Done.
Disconnecting from localhost... done.
[localhost] Executing task 'install_data_s3'
[localhost] run: cat /proc/version
[localhost] out: Linux version 3.2.0-39-generic (buildd@lamiak) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
[localhost] out:

Disconnecting from localhost... done.
Checking required dependencies
Installing tools...
Installing data...

Thanks again for helping troubleshoot this installation; I think we are getting close to a working solution with your help!

--Casey

Clustering fails unexpectedly

After updating to the latest bcbio-nextgen and related dependencies (ipython-cluster-helper - latest released version, ipython - latest git), I find myself unable to run any cluster job: the heartbeat is lost immediately (with no error) and the pipeline falls back to extremely slow single-core operation.

flowcell id and date

Hi,

When would the pipeline generate a random fc_name? I tried to re-run the pipeline, but it generated another fc_date and fc_name that I know nothing about. I wonder where I can control that.

Thanks,
Paul

better message on missing executables

When a program (like pdflatex) is missing, execution stops with a non-informative error. The patch below makes it more obvious what is going on.

diff --git a/bcbio/broad/__init__.py b/bcbio/broad/__init__.py
index 76b2f76..2d70ea4 100644
--- a/bcbio/broad/__init__.py
+++ b/bcbio/broad/__init__.py
@@ -175,7 +175,10 @@ def _get_picard_ref(config):
       picard:
         cmd: picard-tools
     """
-    picard = config_utils.get_program("picard", config, default="notfound")
+    try:
+        picard = config_utils.get_program("picard", config, default="notfound")
+    except config_utils.CmdNotFound:
+        picard = "notfound"
     if picard == "notfound" or os.path.isdir(picard):
         picard = config_utils.get_program("picard", config, "dir")
     return picard
diff --git a/bcbio/pipeline/config_utils.py b/bcbio/pipeline/config_utils.py
index aec9d02..0034da1 100644
--- a/bcbio/pipeline/config_utils.py
+++ b/bcbio/pipeline/config_utils.py
@@ -60,6 +60,26 @@ def get_program(name, config, ptype="cmd", default=None):
     else:
         raise ValueError("Don't understand program type: %s" % ptype)

+class CmdNotFound(Exception):
+    pass
+
+import os
+
+def _get_check_program_cmd(fn):
+
+    def wrap(name, config, default):
+        program = expand_path(fn(name, config, default))
+        is_ok = lambda f: os.path.isfile(f) and os.access(f, os.X_OK)
+        if is_ok(program): return program
+
+        for adir in os.environ['PATH'].split(":"):
+            if is_ok(os.path.join(adir, program)):
+                return os.path.join(adir, program)
+        else:
+            raise CmdNotFound(" ".join(map(str, (fn.func_name, name, config, default))))
+    return wrap
+
+@_get_check_program_cmd
 def _get_program_cmd(name, config, default):
     """Retrieve commandline of a program.
     """
diff --git a/bcbio/provenance/programs.py b/bcbio/provenance/programs.py
index c2ff14c..95c7986 100644
--- a/bcbio/provenance/programs.py
+++ b/bcbio/provenance/programs.py
@@ -73,7 +73,10 @@ def _parse_from_parenflag(stdout, x):
 def _get_cl_version(p, config):
     """Retrieve version of a single commandline program.
     """
-    prog = config_utils.get_program(p["cmd"], config)
+    try:
+        prog = config_utils.get_program(p["cmd"], config)
+    except config_utils.CmdNotFound:
+        return "NA"
     args = p.get("args", "")

     cmd = "{prog} {args}"

tool-data/.loc file mismatches

I've tried to start some pipelines using https://bcbio-nextgen.readthedocs.org/en/latest/contents/testing.html#exome-with-validation-against-reference-materials example or new sample config: https://bcbio-nextgen.readthedocs.org/en/latest/contents/configuration.html#automated-sample-config
But each time I get an error:
No such file or directory: '/data/bcbio/tool-data/sam_fa_indices.loc'
As far as I can see, this file actually lives in /data/bcbio/data/galaxy/tool-data/.
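
Until the path handling is reconciled, a possible workaround is to point the expected location at the real one (a sketch, assuming the copy under galaxy/tool-data is the authoritative one):

# make the .loc files visible at the path the pipeline looks in
ln -s /data/bcbio/data/galaxy/tool-data /data/bcbio/tool-data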

Avoid installing the package as zipped egg

Currently bcbio-nextgen installs itself as a zipped Python egg. It would be much better to prevent this, because it makes debugging harder (tracebacks point to non-existent files, issues with gdb).
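
A minimal sketch of the usual setuptools fix, assuming setup.py is the place to change this (not necessarily how this repository's setup.py is currently organised):

# setup.py: mark the package as not zip-safe so it is installed unpacked
from setuptools import setup, find_packages

setup(name="bcbio-nextgen",
      packages=find_packages(),
      zip_safe=False)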

Example error

I've installed bcbio on our local cluster, and I tried to run your first example (Exome with validation against reference materials). It failed with this error:

ValueError: file header is empty (mode='rb') - is it SAM/BAM format?

The -debug log file shows:

[2013-07-10 18:08] [samopen] no @sq lines in the header.

Any ideas on what to look for?
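
As a quick sanity check on the input (a sketch; substitute the BAM the pipeline is actually reading):

samtools view -H input.bam           # should print @HD/@SQ header lines; empty output matches this error
samtools view input.bam | head -n 5  # confirm the file really contains alignment records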

installer with nosudo option runs commands that need sudo

The bcbio_nextgen_install.py script with the --nosudo option actually runs commands that need sudo rights.
My distribution is Ubuntu. I've tried running the installer both with the --distribution ubuntu option and without.
The log looks like:

...
[localhost] local: mkdir -p /home/fedotov/tmp/cloudbiolinux
[localhost] local: wget --no-check-certificate -O MACS-1.4.2.tar.gz 'https://github.com/downloads/taoliu/MACS/MACS-1.4.2.tar.gz'
...
[localhost] local: tar --pax-option='delete=SCHILY.*,delete=LIBARCHIVE.*' -xzpf MACS-1.4.2.tar.gz
[localhost] local: pip install --upgrade .
Unpacking /home/fedotov/tmp/cloudbiolinux/MACS-1.4.2
  Running setup.py egg_info for package from file:///home/fedotov/tmp/cloudbiolinux/MACS-1.4.2
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'console'
      warnings.warn(msg)
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'app'
      warnings.warn(msg)

    warning: no files found matching 'MANIFEST'
    no previously-included directories found matching 'test'
    no previously-included directories found matching 'web'
Installing collected packages: MACS
  Running setup.py install for MACS
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'console'
      warnings.warn(msg)
    /usr/lib/python2.7/distutils/dist.py:267: UserWarning: Unknown distribution option: 'app'
      warnings.warn(msg)
    changing mode of build/scripts-2.7/macs14 from 664 to 775
    changing mode of build/scripts-2.7/elandmulti2bed from 664 to 775
    changing mode of build/scripts-2.7/elandresult2bed from 664 to 775
    changing mode of build/scripts-2.7/elandexport2bed from 664 to 775
    changing mode of build/scripts-2.7/sam2bed from 664 to 775
    changing mode of build/scripts-2.7/wignorm from 664 to 775
    error: could not create '/usr/local/lib/python2.7/dist-packages/MACS14': Permission denied
...
Fatal error: local() encountered an error (return code 1) while executing 'pip install --upgrade .'

Aborting.
Disconnecting from localhost... done.
Traceback (most recent call last):
  File "bcbio_nextgen_install.py", line 211, in <module>
    main(parser.parse_args())
  File "bcbio_nextgen_install.py", line 43, in main
    install_tools(cbl["tool_fabfile"], fabricrc, venv)
  File "bcbio_nextgen_install.py", line 93, in install_tools
    "install_biolinux:flavor=ngs_pipeline"])
  File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/fedotov/soft/bcbio_tmp/data/bcbio-nextgen-virtualenv/bin/fab', '-f', '/home/fedotov/soft/bcbio_tmp/tmpbcbio-install/cloudbiolinux/fabfile.py', '-H', 'localhost', '-c', '/home/fedotov/soft/bcbio_tmp/tmpbcbio-install/fabricrc.txt', 'install_biolinux:flavor=ngs_pipeline']' returned non-zero exit status 1

ENH: simple scripts to create yaml's from list of fastqs/bams

In addition to the YAMLs in examples/, it'd be nice to have templates that would expand to cover my full list of samples, so that I could run, e.g.:

python scripts/create-yaml.py --template examples/variant2-base.yaml *.bam > my.yaml

and have it create a reasonable yaml with the current best-practices for variant calling. This could also be e.g.:

python scripts/create-yaml.py --template examples/variant2-ensembl.yaml *.bam > my.ensembl.yaml

I'm happy to try to implement this; I'm just opening this for discussion to decide the best way to proceed, e.g. how much should live in a separate template for each analysis vs. how much in the script itself.

Also things like, if the args are a list of fastqs, can we automatically pair them and create a yaml that will align them with bwa and then run variant2?

Basically, it'd be nice if there were more guidance on how to get things running given a list of input files.
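
To make the idea concrete, here is a rough sketch of what such a script could do; create-yaml.py and its behaviour are hypothetical, not something that exists in the repository yet:

# create-yaml.py (hypothetical): expand a one-sample template into one entry per input file
import copy
import sys

import yaml

def main(template_file, input_files):
    with open(template_file) as in_handle:
        config = yaml.safe_load(in_handle)
    base = config["details"][0]          # use the first entry as the per-sample template
    details = []
    for i, fname in enumerate(sorted(input_files)):
        cur = copy.deepcopy(base)
        cur["files"] = [fname]
        cur["lane"] = i + 1
        cur["description"] = "Sample%d" % (i + 1)
        details.append(cur)
    config["details"] = details
    yaml.safe_dump(config, sys.stdout, default_flow_style=False)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2:])

Pairing fastqs (R1/R2) would need an extra grouping step before building each entry, but the same expand-a-template approach should hold.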

csamtools crash when running a gatk-variant project

Hi @chapmanb, @roryk,

I'm setting the pipeline in our HPC cluster, using the test data as input files:

bcbio_nextgen.py -w template gatk-variant project1 tests/data/100326_FC6107FAAXX/7_100326_FC6107FAAXX.bam tests/data/100326_FC6107FAAXX/7_100326_FC6107FAAXX_1_fastq.txt tests/data/100326_FC6107FAAXX/7_100326_FC6107FAAXX_2_fastq.txt

cd ~/roman/dev/bcbio-nextgen/project1/work

bcbio_nextgen.py ~/config/bcbio_system.yaml ~/roman/dev/bcbio-nextgen/project1/config/project1.yaml

But pysam is not very happy about the header(s) of the test bam file:

(...)
  File "~/roman/.virtualenvs/bcbio/lib/python2.7/site-packages/bcbio_nextgen-0.7.0a-py2.7.egg/bcbio/pipeline/shared.py", line 51, in _do_work
    with closing(pysam.Samfile(bam_file, "rb")) as work_bam:
  File "csamtools.pyx", line 597, in csamtools.Samfile.__cinit__ (pysam/csamtools.c:5982)
  File "csamtools.pyx", line 760, in csamtools.Samfile._open (pysam/csamtools.c:7675)
ValueError: file header is empty (mode='rb') - is it SAM/BAM format?

Are those headers critical for the rest of the pipeline to go on smoothly?

Cheers!
Roman

@guillermo-carrasco

multiprocessing hangs at calc_callable_loci

Test system: Debian Wheezy
bcbio-nextgen version: git
Setup: 12 8-core VMs clustered using SGE

I tested the pipeline with one MiSeq run from our laboratory, and it hangs on the calc_callable_loci step.

Command line:

IPYTHONDIR=/mnt/data/software/ipython bcbio_nextgen.py -s sge -q main.q -t ipython -n 96 /mnt/data/references/bcbio_system.yaml ../config/project1.yaml

(main.q is my queue)

This is what the log says

[2013-07-04 14:35] MN-07: Combining fastq and BAM files ('', 'Sample_mysample')
[2013-07-04 14:36] MN-07: multiprocessing: calc_callable_loci

(MN-07 is one of our nodes)

pstree on the node itself shows

job_starter.sh(20704)───python(20705)─┬─java(20756)
                                      ├─python(20775)─┬─java(24104)
                                      │               └─{python}(21046)
                                      ├─python(20776)─┬─java(24118)
                                      │               └─{python}(20998)
                                      ├─python(20777)───{python}(20931)
                                      ├─python(20778)─┬─java(24047)
                                      │               └─{python}(21031)
                                      ├─python(20779)─┬─java(24028)
                                      │               └─{python}(20964)
                                      ├─python(20780)─┬─java(24085)
                                      │               └─{python}(21045)
                                      ├─python(20781)─┬─java(24066)
                                      │               └─{python}(20970)
                                      ├─python(20782)─┬─java(24002)
                                      │               └─{python}(20948)
                                      ├─{python}(20710)
                                      ├─{python}(20711)
                                      ├─{python}(20712)
                                      ├─{python}(20713)
                                      ├─{python}(20714)
                                      ├─{python}(20783)
                                      └─{python}(20784)

All the java processes are zombie processes.

Current project definition:

details:
- algorithm:
    aligner: bwa
    coverage_depth: high
    coverage_interval: regional
    mark_duplicates: false
    platform: illumina
    quality_format: Standard
    realign: gatk
    recalibrate: gatk
    variantcaller: gatk
    trim_reads: false
    write_summary: true
  analysis: variant2
  lane: 1
  description: Sample_mysample
  variant_regions: /mnt/data/projects/miSeq/regions/miSeq.bed
  metadata:
      batch: Batch1
  files:
  - /mnt/data/projects/miSeq/raw_data/Sample_mysample_R1.fastq.gz
  - /mnt/data/projects/miSeq/raw_data/Sample_mysample_R2.fastq.gz
  genome_build: hg19
- algorithm:
    aligner: bwa
    coverage_depth: high
    coverage_interval: regional
    mark_duplicates: false
    platform: illumina
    quality_format: Standard
    realign: gatk
    recalibrate: gatk
    variantcaller: gatk
    trim_reads: false
    write_summary: true
  analysis: variant2
  lane: 2
  description: Sample_mysample2
  metadata:
      batch: Batch1
  files:
  - /mnt/data/projects/miSeq/raw_data/Sample_mysample2_R1.fastq.gz
  - /mnt/data/projects/miSeq/raw_data/Sample_mysample2_R2.fastq.gz
  genome_build: hg19
  variant_regions: /mnt/data/projects/miSeq/regions/miSeq.bed
fc_date: '130703'
fc_name: project1

I can supply other logs as needed. The GATK version used is 2.5-2-gf57256b. Java is 1.7 (same version on all nodes).

Wish: clean up work directory after finishing

Currently the pipeline generates a lot of intermediate files and logs in the work directory it is launched in.

Perhaps it would be nice to have a "--clean" option to remove all work files after the run is complete.
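
As a stopgap, something along these lines can be run by hand once the final results have been copied out of the work directory (a sketch; the directory and file patterns are illustrative and would need to match what the pipeline actually leaves behind):

# prune scratch data from a finished work directory (illustrative patterns)
cd work
rm -rf tx tmp                    # transaction/temporary directories
find . -name "*.sam" -delete     # large uncompressed intermediates, if any remain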

Improve error reporting and recovery from errors

Currently, when external tools are used in cluster or multiprocessing mode, a single uncaught error can hang the whole pipeline.

I just fixed my GATK error only to find another related to bedtools (yet another hang). There should therefore be an option to surface the debug output from all tools so that errors can be identified, and perhaps a fixed timeout for processes so that they don't hang forever.
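
For the timeout part, a minimal sketch of bounding an external command in plain Python 2.7 (independent of the pipeline's own process runner):

# run an external tool and kill it if it exceeds a fixed timeout
import subprocess
import threading

def run_with_timeout(cmd, timeout):
    proc = subprocess.Popen(cmd)
    timer = threading.Timer(timeout, proc.kill)   # schedule a kill; cancelled below if the process finishes in time
    timer.start()
    try:
        retcode = proc.wait()
    finally:
        timer.cancel()
    if retcode != 0:
        raise subprocess.CalledProcessError(retcode, cmd)

run_with_timeout(["bedtools", "--version"], 60)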

population.prep_db_parallel

Line 240 in https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/pipeline/main.py

I want to add an if-statement for GEMINI, either as a flag for the user to specify whether they want to create the database, or as a check for whether 'gemini' is installed; it appears to me that validate and ensemble have similar checks in their corresponding functions, but I am not sure 1) whether this flag is better placed in main.py or in variation/population.py, and 2) whether the result in population.py is an input to another module (I couldn't find it). Please advise.
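
For reference, a sketch of the kind of guard I have in mind; the config key and call arguments are illustrative, not the pipeline's actual API:

# only build the GEMINI database when the tool exists and the user asked for it
import distutils.spawn

def _gemini_installed():
    return distutils.spawn.find_executable("gemini") is not None

if config["algorithm"].get("build_gemini_db", False) and _gemini_installed():
    samples = population.prep_db_parallel(samples, parallel_fn)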

Add licenses and copyrights to all source files

Some files lack license and copyright notices. While the MIT license in the main repository tree is good, license headers should be added to all files to prevent issues when reusing this code elsewhere.

(I'm quite involved in other non-scientific FOSS projects, where this issue surfaced already).

Consider attaching the git revision hash to non-released versions

It would make versioning better. As an example, here is some internal code I wrote that follows the approach used by the pandas library (the code is taken from there essentially as-is):

version = "0.8"
development = True

fullversion = version

if development:

    fullversion += ".dev"

    try:
        import subprocess
        try:
            pipe = subprocess.Popen(["git", "rev-parse", "--short", "HEAD"],
                                    stdout=subprocess.PIPE).stdout
        except OSError:
            # msysgit compatibility
            pipe = subprocess.Popen(["git.cmd", "rev-parse", "--short",
                                     "HEAD"], stdout=subprocess.PIPE).stdout
        rev = pipe.read().strip()
        # makes distutils blow up on Python 2.7
        if sys.version_info[0] >= 3:
            rev = rev.decode('ascii')

        fullversion += "-%s" % rev
    except:
        warnings.warn("WARNING: Couldn't get git revision")

Cmake error while bcbio-nextgen installation

Hello,

I am trying to install tools for NGS_pipeline flavor using the command:

"fab -f cloudbiolinux/fabfile.py -H localhost install_biolinux:flavor=ngs_pipeline"

and I get the following error:

(screenshot attached: "screen shot 2013-07-17 at 5 15 59 pm", showing the CMake error)

Please advise.

Thanks

Shalabh Suman

bcbio_nextgen_install.py install error after GATK download

Hi Brad -

In trying to test whether the new cramtools URL has been sorted in bcbio_nextgen_install.py, it looks like we've stumbled over another issue before the cramtools install step. (Setup is vanilla biolinux 7 with the add-ons discussed previously; bcbio_nextgen_install.py downloaded today, 3 April.)

This error arises after the GATK download step (see terminal output below from Freebayes install step to error). It appears that GATK is being downloaded:

$ ls -lrt install_directory/share/java/gatk-2.3-9-gdcdccbb/GenomeAnalysisTKLite.jar
-rw-r--r-- 1 cbergman cbergman 13646664 Jan 12 01:04 install_directory/share/java/gatk-2.3-9-gdcdccbb/GenomeAnalysisTKLite.jar

So it doesn't appear to be a missing external dependency. Any ideas on what might be causing this?

Many thanks,
Casey

[localhost] sudo: mv -f bin/bamleftalign /home/cbergman/install_directory/bin
[localhost] out: sudo password:
[localhost] sudo: mv -f bin/freebayes /home/cbergman/install_directory/bin
[localhost] out: sudo password:
[localhost] run: rm -rf /home/cbergman/tmp/cloudbiolinux
INFO: Custom install for 'freebayes' end time: 2013-04-03 12:26:45.966992; duration: 0:05:06.331671
INFO: Custom install for 'freebayes' end time: 2013-04-03 12:26:45.966992; duration: 0:05:06.331671
INFO: Custom install for 'freebayes' end time: 2013-04-03 12:26:45.966992; duration: 0:05:06.331671
INFO: Custom install for 'freebayes' end time: 2013-04-03 12:26:45.966992; duration: 0:05:06.331671
INFO: Custom install for 'freebayes' end time: 2013-04-03 12:26:45.966992; duration: 0:05:06.331671
INFO: Custom install for 'freebayes' end time: 2013-04-03 12:26:45.966992; duration: 0:05:06.331671
INFO: Custom install for 'freebayes' end time: 2013-04-03 12:26:45.966992; duration: 0:05:06.331671
INFO: Custom install for 'freebayes' end time: 2013-04-03 12:26:45.966992; duration: 0:05:06.331671
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
INFO: Custom install for 'gatk' start time: 2013-04-03 12:26:45.968072
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
DBG [fabfile.py]: Import gatk
[localhost] sudo: mkdir -p /home/cbergman/install_directory/share/java/gatk-2.3-9-gdcdccbb
[localhost] out: sudo password:
[localhost] sudo: ln -s /home/cbergman/install_directory/share/java/gatk-2.3-9-gdcdccbb /home/cbergman/install_directory/share/java/gatk
[localhost] out: sudo password:
[localhost] run: echo $HOME
[localhost] out: /home/cbergman

[localhost] run: mkdir -p /home/cbergman/tmp/cloudbiolinux
[localhost] run: wget --no-check-certificate -O GenomeAnalysisTKLite-2.3-9-gdcdccbb.tar.bz2 'ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTKLite-2.3-9-gdcdccbb.tar.bz2'
[localhost] out: --2013-04-03 13:26:47-- ftp://ftp.broadinstitute.org/pub/gsa/GenomeAnalysisTK/GenomeAnalysisTKLite-2.3-9-gdcdccbb.tar.bz2
[localhost] out: => `GenomeAnalysisTKLite-2.3-9-gdcdccbb.tar.bz2'
[localhost] out: Resolving ftp.broadinstitute.org (ftp.broadinstitute.org)... 69.173.80.251
[localhost] out: Connecting to ftp.broadinstitute.org (ftp.broadinstitute.org)|69.173.80.251|:21... connected.
[localhost] out: Logging in as anonymous ... Logged in!
[localhost] out: ==> SYST ... done. ==> PWD ... done.
[localhost] out: ==> TYPE I ... done. ==> CWD (1) /pub/gsa/GenomeAnalysisTK ... done.
[localhost] out: ==> SIZE GenomeAnalysisTKLite-2.3-9-gdcdccbb.tar.bz2 ... 12456236
[localhost] out: ==> PASV ... done. ==> RETR GenomeAnalysisTKLite-2.3-9-gdcdccbb.tar.bz2 ... done.
[localhost] out: Length: 12456236 (12M) (unauthoritative)

[localhost] out: 0% [ ] 0 --.-K
[localhost] out: 0% [ ] 7,240 27.4K
[localhost] out: 0% [ ] 22,672 46.6K
[localhost] out: 0% [ ] 45,840 64.7K
[localhost] out: 0% [ ] 76,248 81.9K
[localhost] out: 0% [ ] 124,032 107K
[localhost] out: 1% [ ] 197,880 143K
[localhost] out: 2% [ ] 297,792 185K
[localhost] out: 3% [> ] 420,872 229K
[localhost] out: 4% [> ] 590,288 286K
[localhost] out: 6% [=> ] 803,144 351K
[localhost] out: 8% [==> ] 1,058,488 421K
[localhost] out: 10% [===> ] 1,337,456 488K
[localhost] out: 13% [====> ] 1,630,448 551K
[localhost] out: 13% [====> ] 1,728,416 542K
[localhost] out: 14% [====> ] 1,858,736 545K
[localhost] out: 16% [=====> ] 2,025,256 558K
[localhost] out: 17% [=====> ] 2,235,216 581K
[localhost] out: 20% [======> ] 2,504,544 615K
[localhost] out: 22% [=======> ] 2,836,136 661K
[localhost] out: 26% [=========> ] 3,292,256 730K
[localhost] out: 31% [===========> ] 3,878,696 867K
[localhost] out: 36% [=============> ] 4,595,456 1024K
[localhost] out: 43% [===============> ] 5,377,376 1.17M
[localhost] out: 49% [==================> ] 6,159,296 1.33M
[localhost] out: 55% [====================> ] 6,941,216 1.49M
[localhost] out: 62% [=======================> ] 7,788,296 1.66M
[localhost] out: 70% [==========================> ] 8,765,696 1.86M
[localhost] out: 77% [=============================> ] 9,612,776 2.02M
[localhost] out: 85% [================================> ] 10,655,336 2.21M
[localhost] out: 92% [===================================> ] 11,567,576 2.37M
[localhost] out: 100%[======================================>] 12,456,236 2.49M/s in 6.7s

[localhost] out: 2013-04-03 13:26:55 (1.77 MB/s) - `GenomeAnalysisTKLite-2.3-9-gdcdccbb.tar.bz2' saved [12456236]

[localhost] run: tar --pax-option='delete=SCHILY.*,delete=LIBARCHIVE.*' -xjpf GenomeAnalysisTKLite-2.3-9-gdcdccbb.tar.bz2
[localhost] sudo: mv *.jar /home/cbergman/install_directory/share/java/gatk-2.3-9-gdcdccbb
[localhost] out: sudo password:
[localhost] run: rm -rf /home/cbergman/tmp/cloudbiolinux
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/fabric/main.py", line 674, in main
*args, **kwargs
File "/usr/lib/python2.7/dist-packages/fabric/tasks.py", line 229, in execute
task.run(*args, **new_kwargs)
File "/usr/lib/python2.7/dist-packages/fabric/tasks.py", line 105, in run
return self.wrapped(*args, **kwargs)
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/fabfile.py", line 70, in install_biolinux
_perform_install(target, flavor)
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/fabfile.py", line 93, in _perform_install
_custom_installs(pkg_install, custom_ignore)
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/fabfile.py", line 150, in _custom_installs
install_custom(p, True, pkg_to_group)
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/fabfile.py", line 201, in install_custom
fn(env)
File "/home/cbergman/tmpbcbio-install/cloudbiolinux/cloudbio/custom/bio_nextgen.py", line 446, in install_gatk
with quiet():
NameError: global name 'quiet' is not defined
Disconnecting from localhost... done.
Traceback (most recent call last):
File "bcbio_nextgen_install.py", line 188, in
main(parser.parse_args())
File "bcbio_nextgen_install.py", line 40, in main
install_tools(cbl["tool_fabfile"], fabricrc)
File "bcbio_nextgen_install.py", line 80, in install_tools
"install_biolinux:flavor=ngs_pipeline"])
File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['fab', '-f', '/home/cbergman/tmpbcbio-install/cloudbiolinux/fabfile.py', '-H', 'localhost', '-c', '/home/cbergman/tmpbcbio-install/fabricrc.txt', 'install_biolinux:flavor=ngs_pipeline']' returned non-zero exit status 1

Problem during installation

Hi,
I'm trying to install bcbio-nextgen, and the following error is displayed:

~$ python bcbio_nextgen_install.py /usr/local /usr/local/share/bcbio-nextgen
Checking required dependencies
git version 1.7.9.5
Installing base virtual environment...
error: None
Traceback (most recent call last):
File "bcbio_nextgen_install.py", line 216, in
main(parser.parse_args())
File "bcbio_nextgen_install.py", line 36, in main
venv = install_virtualenv_base(remotes, args.datadir)
File "bcbio_nextgen_install.py", line 67, in install_virtualenv_base
subprocess.check_call([ei_cmd, "pip==1.2.1"])
File "/usr/lib/python2.7/subprocess.py", line 511, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/share/bcbio-nextgen/bcbio-nextgen-virtualenv/bin/easy_install', 'pip==1.2.1']' returned non-zero exit status 1

What am I missing? I don't have much experience with Python, so I know this is probably a silly error. Any help will be appreciated.
Thanks

Jo

chrY

When I used the best-practice pipeline to call variants for NA12878 (who is female), genotype calls from chrY show up. This is probably a caller or aligner issue; is there any way to eliminate these artifacts?
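
Until the source of these calls is tracked down, a simple post-filter on the VCF is possible (a sketch; header lines start with '#' and are kept):

# drop chrY records from a VCF while keeping the header
awk -F'\t' '/^#/ || $1 != "chrY"' input.vcf > no_chrY.vcf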

Thanks.

line 35, effects.py

guess os.path.dirname(config_file) should be on its own

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/effects.py

def _find_snpeff_datadir(config_file):
    with open(config_file) as in_handle:
        for line in in_handle:
            if line.startswith("data_dir"):
                data_dir = config_utils.expand_path(line.split("=")[-1].strip())
                if not data_dir.startswith("/"):
                    data_dir = os.path.join(os.path.dirname(config_file), data_dir)
                return data_dir
    raise ValueError("Did not find data directory in snpEff config file: %s" % config_file)

variant vs variant2

I am not sure where the issue is, but let me know if you have any ideas.

I ran the two workflows (variant vs. variant2) on the same set of fastq files, with the same aligner and genotyper. The number of reads, near-bait bases, and off-target bases are all off (or at least lower than expected) with the variant pipeline. The insert size is reported as '1'. I am still looking into whether there is anything wrong in my settings. Any ideas are welcome.

Thanks.

variant2

So, for the variant2 pipeline, read trimming is not recommended?

Also, for some reason I have issues generating the insert size, duplicate and variant summaries; where should I look for a possible mistake on my side?

Thanks.

snpEff is called with non-existent genome "hg19.tar.bz2"

While debugging the hangs, I noticed that the snpEff annotation step was failing: when running the command standalone, I saw

 java -Xms750m -Xmx32G -jar /mnt/data/software/snpeff/snpEff.jar eff -c /mnt/data/software/snpeff/snpEff.config -1 -i vcf -o vcf hg19.tar.bz2 /mnt/data/project1/work/gatk/2_130703_project1-sort-variants-combined.vcf

java.lang.RuntimeException: Property: 'hg19.tar.bz2.genome' not found

Clearly the pipeline is picking up the filename instead of using "hg19". This occurs with snpEff 3.2.

make galaxy and s3 optional?

Not sure how embedded these are, but I am able to get things to run with this change without a proper galaxy setup:

diff --git a/bcbio/upload/__init__.py b/bcbio/upload/__init__.py
index 81cebb0..92c6487 100644
--- a/bcbio/upload/__init__.py
+++ b/bcbio/upload/__init__.py
@@ -3,11 +3,15 @@
 import datetime
 import os

-from bcbio.upload import shared, filesystem, galaxy, s3
+from bcbio.upload import shared, filesystem

-_approaches = {"filesystem": filesystem,
-               "galaxy": galaxy,
-               "s3": s3}
+_approaches = {"filesystem": filesystem}
+
+try:
+    from bcbio.upload import galaxy, s3
+    _approaches.update({"galaxy": galaxy, "s3": s3})
+except ImportError:
+    pass

 def from_sample(sample):
     """Upload results of processing from an analysis pipeline sample.

Add STAR aligner support

Tophat is by far the slowest step in the RNA-seq process and the STAR aligner has good reviews.
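
For context, a typical STAR invocation looks roughly like this (a sketch; index paths, read files and thread counts are illustrative, and the exact option set would need tuning for the pipeline):

# build the genome index once
STAR --runMode genomeGenerate --genomeDir /path/to/star_index \
     --genomeFastaFiles hg19.fa --runThreadN 8
# align a paired-end sample
STAR --genomeDir /path/to/star_index \
     --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
     --readFilesCommand zcat --runThreadN 8 \
     --outFileNamePrefix sample_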

QC metrics

(sorry, I am still using the variant pipeline instead of variant2)

The duplicate metrics are not correctly reported, since the pipeline tries to get this information from the recalibrated, realigned BAM (which has already had duplicates removed). I would actually prefer 'mark duplicates' to 'duplicate removal' in the earlier step.
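
Concretely, marking rather than removing would look something like this with Picard (a sketch; the jar location and file names are illustrative):

java -jar MarkDuplicates.jar \
    INPUT=sample-sort.bam OUTPUT=sample-dedup.bam \
    METRICS_FILE=sample-dup_metrics.txt \
    REMOVE_DUPLICATES=false   # flag duplicates but keep the reads, so downstream QC can still count them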
