
bcbb's Introduction

Scilifelab modules

Installation

Installation is as simple as

python setup.py install

If you are running several virtual environments, where one (e.g. devel) is used for development, you can install a development version by running

workon devel
python setup.py develop

Documentation

Docs are located in the doc directory. To build them, cd to doc and run

make html

Documentation output is found in the build directory.

Running the tests

The modules are shipped with a number of unit tests, located in the tests directory. To run a test, issue the command

python setup.py nosetests

or if you want to run individual tests, cd to tests and run (for example)

nosetests -v -s test_db.py

bcbb's People

Contributors

alneberg, b97pla, brainstorm, chapmanb, ewels, galithil, guillermo-carrasco, kwoklab-user, mariogiov, mayabrandi, parlundin, percyfal, peterjc, remiolsen, senthil10, skinner, vals, vezzi


bcbb's Issues

no unicode support in logbook module

from bcbio.pipeline import log
log.warn("hej höj håj häj")
Traceback (most recent call last):
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 214, in handle
self.emit(record)
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 552, in emit
self.write(self.format_and_encode(record))
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 537, in format_and_encode
rv = self.format(record) + '\n'
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 193, in format
return self.formatter(record, self)
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 363, in __call__
line = self.format_record(record, handler)
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 357, in format_record
return self._formatter.format(record=record, handler=handler)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
Logged from file , line 1
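A minimal sketch of the failure mode and a possible workaround, assuming the message arrives as UTF-8 encoded bytes (format_record below is a hypothetical helper, not part of logbook):

```python
# -*- coding: utf-8 -*-
def format_record(fmt, message):
    """Combine a format string with a message, decoding bytes explicitly.

    Hypothetical helper: in Python 2, mixing a unicode format string with a
    byte string holding non-ASCII UTF-8 text triggered an implicit ascii
    decode and the UnicodeDecodeError above; decoding first avoids it.
    """
    if isinstance(message, bytes):
        message = message.decode("utf-8")
    return fmt % message

line = format_record(u"WARNING: %s", "hej höj håj häj".encode("utf-8"))
print(line)
```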

Save manually entered comments in summary

In the project read counts summary worksheet, the information is currently regenerated when a new flowcell is added. This means that manually entered edits and comments are lost. Implement manually editable columns that will be preserved across updates.

Qseq->Fastq conversion fails on v3 illumina flowcells

$ solexa_qseq_to_fastq.py 110617_Ac00hfabxx 4
Traceback (most recent call last):
  File ".virtualenvs/production/bin/solexa_qseq_to_fastq.py", line 7, in 
    execfile(__file__)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 179, in 
    main(args[0], args[1].split(","), options.do_fail, options.outdir)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 55, in main
    write_lane(lane_prefix, out_prefix, outdir, fail_dir)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 59, in write_lane
    one_files, two_files, bc_files = _split_paired(qseq_files)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 147, in _split_paired
    cur_size = _get_qseq_seq_size(f)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 167, in _get_qseq_seq_size
    return len(parts[8])
IndexError: list index out of range
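The crash comes from indexing field 8 of a qseq row that has fewer columns. A hedged sketch of a defensive parser (qseq_seq_size is a hypothetical helper, not the shipped _get_qseq_seq_size):

```python
def qseq_seq_size(line):
    """Return the read length from one qseq line, tolerating short rows.

    Standard qseq rows carry the sequence in tab-separated field 8; some v3
    flowcell files apparently ship truncated rows, so the index is guarded
    instead of assumed (hypothetical fix, not the shipped code).
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) > 8:
        return len(parts[8])
    return 0

full = "M1\t1\t1\t1\t0\t0\t0\t1\tACGTACGT\thhhhhhhh\t1"
short = "M1\t1\t1\t1"  # truncated v3-style row
print(qseq_seq_size(full), qseq_seq_size(short))
```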

line 0: undefined variable: inside when plotting insert size metrics

[Wed Apr 06 13:28:29 CEST 2011] net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAM_FILE=/....-sort-dup-insert.pdf INPUT=/....-sort-dup.bam OUTPUT=/...-sort-dup.insert_metrics VALIDATION_STRINGENCY=SILENT TAIL_LIMIT=10000 MINIMUM_PCT=0.01 ASSUME_SORTED=true ST
OP_AFTER=0 TMP_DIR=/tmp/roman VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
INFO 2011-04-06 13:28:29 ProcessExecutor [1] "FR = red"
INFO 2011-04-06 13:28:29 ProcessExecutor null device
INFO 2011-04-06 13:28:29 ProcessExecutor 1
[Wed Apr 06 13:28:29 CEST 2011] net.sf.picard.analysis.CollectInsertSizeMetrics done.
Runtime.totalMemory()=378404864
line 0: undefined variable: inside

gnuplot> plot '/...._fastq_stats.txt' using 1:7:11:12:9 with candlesticks lt 1 lw 1 title 'Quartiles' whiskerbars, '' using 1:8:8:8:8 with candlesticks lt -1 lw 2 title 'Medians'
^
line 0: ';' expected

Which tools support compression?

Would it be safe to compress FastQ files and zcat everything (to save space)?

Which downstream tools would support compressed files right away, and which don't?

Should we use CRAM, or is it still immature?
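For the Python-side steps, transparent gzip handling could look like the sketch below (open_fastq is a hypothetical helper; whether each downstream tool accepts .gz input still has to be checked per tool):

```python
import gzip
import os
import tempfile

def open_fastq(path):
    # Hypothetical helper: .gz files go through gzip in text mode,
    # plain files are opened directly, so callers stay agnostic.
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)

# Round-trip one FastQ record through gzip to show the idea.
record = "@read1\nACGT\n+\nIIII\n"
path = os.path.join(tempfile.mkdtemp(), "test.fastq.gz")
with gzip.open(path, "wt") as fh:
    fh.write(record)
with open_fastq(path) as fh:
    print(fh.readline().strip())
```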

Adapter contamination screening

Need to include some form of adapter contamination screening before multiplexing, perhaps as part of a general screening module (together with PhiX removal, contamination screening and mate-pair linker removal).

Race condition on illumina_finished_msg finalization?

Sometimes transfer/transferred.db contains duplicated entries for a Run, but it does not get processed. After removing the duplicate entries and re-running illumina_finished_msg.py it is processed without further issues.

Mate pair linker removal

Need to include Paul's script for mate-pair linker removal - ONLY for mate-pair runs. Perhaps as part of general screening module together with PhiX, contamination + adapter screeners.

Error rates

Include the error rate information for each run in the project read count summary.

Multiple project ids on run_info.yaml

Apparently if there are multiple projects per lane, samplesheet.py only shows the first one.

This has greater implications when it comes to data delivery on a cluster facility since the data cannot be delivered automatically.

@b97pla, @hussius, feel free to put/describe some examples about this if you have those.

Fetching Uppnex id from gdocs spreadsheet

get_uppnex_project_id method in bcbb/nextgen/bcbio/google/project_data.py takes a project name (e.g. 'J.Doe_11_01') and the data structure obtained from parsing the post_process.yaml configuration file and returns the corresponding Uppnex id, or 'N/A' if none was found.

Requires the name of the spreadsheet and worksheet, where the uppnex id can be found, to be present in the configuration file along with the credentials for a google account with read permissions on this spreadsheet.

Handle unicode strings on samplesheet

"Åre svärlöng" and other Swedish characters (found in SampleID) are misinterpreted when generating the run_info.yaml:

  multiplex:
  - barcode_id: 12
    barcode_type: SampleSheet
    name: !!binary |
      MjIxMDdfooozM=
    sequence: TTAGGCA
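A hedged sketch of decoding the SampleID bytes before they reach the YAML serializer, so names stay text instead of !!binary blobs (decode_sample_id is a hypothetical helper; the encodings tried are assumptions):

```python
def decode_sample_id(raw):
    """Decode a SampleID field from an Illumina samplesheet.

    Hypothetical helper: samplesheets with Swedish characters may arrive as
    UTF-8 or latin-1 bytes; trying UTF-8 first and falling back to latin-1
    (which accepts any byte) keeps names as real text.
    """
    if isinstance(raw, bytes):
        for enc in ("utf-8", "latin-1"):
            try:
                return raw.decode(enc)
            except UnicodeDecodeError:
                continue
    return raw

print(decode_sample_id("Åre svärlöng".encode("utf-8")))
```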

Single lanes, if they have barcode, must be trimmed downstream too

A samplesheet/run.yaml containing:

- analysis: Standard
  description: '1'
  flowcell_id: 666IYTBXX
  genome_build: hg19
  lane: '1'

Must also specify the barcode imported from the samplesheet, so that barcode_sort_trim knows how and what to trim.

Alternatively it could just trim the last 6 chars that contain the barcode.
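That fallback of trimming a fixed-length 3' barcode could be sketched as (trim_barcode is a hypothetical helper, assuming the barcode sits at the 3' end):

```python
def trim_barcode(seq, qual, bc_len=6):
    """Trim a 3'-end barcode of bc_len bases from read and quality strings.

    Hypothetical helper for the fallback above: drop the last bc_len
    characters of both the sequence and the quality line.
    """
    return seq[:-bc_len], qual[:-bc_len]

seq, qual = trim_barcode("ACGTACGTTTAGGC", "IIIIIIIIHHHHHH")
print(seq, qual)
```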

unify the demultiplex read count methods

Counts for demultiplexed reads are gathered twice: once immediately after demultiplexing, when they are uploaded to gdocs, and again at the end of the analysis, when they are written to run_summary.yaml (and possibly for the automated report generation). We should look into unifying the code for this.

Handle Standard PE runs automatically (those have neither run_info.yaml nor a samplesheet)

Detect Standard PE runs and treat them the same way we do multiplexed runs with run_info.yaml, running the preliminary analysis on them too.

Detecting run type

Where a run has had the index bases correctly specified during the run set-up, the xml block you have identified in the RunInfo.xml will suffice to tell whether this is multiplexed or not - note that there must be an entry in this block.

You will also find the following block () in the run-folder/Data/Intensities/BaseCalls/config.xml file if you require another source.

<BaseCallAnalysis>
  <Run Name="BaseCalls">
    <RunParameters>
      <Barcode>
        <Cycle Use="true">31</Cycle>
        <Cycle Use="true">32</Cycle>
        <Cycle Use="true">33</Cycle>
        <Cycle Use="true">34</Cycle>
        <Cycle Use="true">35</Cycle>
        <Cycle Use="true">36</Cycle>
        <Cycle Use="true">37</Cycle>
      </Barcode>
    </RunParameters>
  </Run>
</BaseCallAnalysis>
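Detecting multiplexing from that block could be sketched with ElementTree (is_multiplexed is a hypothetical helper; the inline XML here is a trimmed stand-in for the real config.xml):

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for Data/Intensities/BaseCalls/config.xml.
config_xml = """
<BaseCallAnalysis>
  <Run Name="BaseCalls">
    <RunParameters>
      <Barcode>
        <Cycle Use="true">31</Cycle>
        <Cycle Use="true">32</Cycle>
      </Barcode>
    </RunParameters>
  </Run>
</BaseCallAnalysis>
"""

def is_multiplexed(xml_text):
    """Hypothetical helper: a run is multiplexed if the Barcode block
    contains at least one Cycle entry."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//Barcode/Cycle")) > 0

print(is_multiplexed(config_xml))
```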

Automated dataset removal on dump machine

Come up with a good, reliable dataset removal strategy:

  • Queuing system handling removal of datasets: deepmd5 of fastq folder on both ends ?
  • In case of emergency, remove intermediate files until there's enough space back: qseq, fastq, bcl
  • Lab people only need InterOp folder to report/debug to Illumina.
  • Substitute transfer/transferred.db with something that depicts the states better (by using rabbitMQ too ?)

md5deep sums reported to rabbitMQ

Using the glob on illumina_finished_msg, we should be able to checksum all the files after fastq generation and then check those on demand on other ends after the files have been transferred.

Thanks @b97pla for the discussion on this.

plaintext pw in config file

Hi there,

in nextgen/config/universe_wsgi.ini

I just searched for scilifelab to see where the rest of you where, and noted this in passing. I hope comicbookguy has a strong jail for the galaxy user. ;-)

Anyway, great initiative to put it all up here! Cheers!

// Daniel

Handle samples with no barcode

When a whole lane is used without sample multiplexing:

['fcid', '1', 'index2', 'unknown', '', 'desc', 'N', 'R1', 'DvT']

Related to issue #12

Illumina run flowcell identifier changed

Illumina has changed the flowcell identifier on the new high-density flowcell release from:

110610_SN666_3572_AB52LABXX (old)

To something like:

110617_SN666_3572_Ad00pqabxx

Glob from the script must be adapted accordingly.
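An adapted pattern accepting both identifier styles could be sketched as (the regex is an assumption based only on the two examples above):

```python
import re

# Hypothetical pattern: date_machine_runnumber_position+flowcellid, with the
# flowcell suffix allowed to be upper- or lowercase to cover both styles.
RUN_RE = re.compile(r"^\d{6}_[^_]+_\d+_[AB]\w+$", re.IGNORECASE)

for run in ("110610_SN666_3572_AB52LABXX", "110617_SN666_3572_Ad00pqabxx"):
    print(run, bool(RUN_RE.match(run)))
```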

Load run_info.yaml back into galaxy ngLIMS

After generating the run_info.yaml file from the Illumina samplesheet, its attributes can be automatically loaded back to galaxy in order to:

  1. Get better trackability of the samples.
  2. Fill most of the form fields automatically so that the user only has to fill the missing ones (bulk import).

No of samples

The number of samples that have been run on a flowcell for a particular project should be included in the project read count summary. This information should also be displayed in the delivery note (e.g. "4 samples out of the ordered 6 were run and yielded...")

Create transfer/transferred.db if not present

Add code in illumina_finished_msg.py.

Additionally, consider moving it into rabbitmq directly; that would need some user-friendly push & pop helper scripts, though, to assist occasional manual intervention, e.g.:

queued_runs.py [--drop] 

Demultiplex based on barcodes found in reads

Instead of blindly trusting the samplesheet or manually specified barcodes, cluster the barcodes found in the reads in order to detect contamination or other problems.

In other words, we should be able to detect which barcodes are there just by processing the fastq files (since we know that the barcodes are attached at the 3' end by default).

Suggested by Ellen, Max and others.
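The detection idea could be sketched by tallying 3'-end k-mers (observed_barcodes is a hypothetical helper; a real version would also cluster near-identical barcodes to absorb sequencing errors):

```python
from collections import Counter

def observed_barcodes(reads, bc_len=6, min_fraction=0.01):
    """Tally the last bc_len bases of each read and keep frequent ones.

    Hypothetical helper: instead of trusting the samplesheet, count the
    3'-end barcodes actually present and report any exceeding min_fraction
    of reads, which also surfaces contamination.
    """
    counts = Counter(read[-bc_len:] for read in reads)
    total = sum(counts.values())
    return {bc: n for bc, n in counts.items() if n / total >= min_fraction}

# 90% of reads carry one barcode, 10% another.
reads = ["ACGT" + "TTAGGC"] * 90 + ["ACGT" + "GGCTAC"] * 10
print(observed_barcodes(reads))
```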

.fastq instead of .fastq.txt

Run a test to see if these changes break the analysis:

  • Should have "SampleID" on the filename, along with FCID.
  • Should not have *.txt endings.
