
bcbb's Introduction

Scilifelab modules

Installation

Installation is as simple as

python setup.py install

If you are running several virtual environments, where one (e.g. devel) is used for development, you can install a development version by running

workon devel
python setup.py develop

Documentation

Docs are located in the doc directory. To build them, cd to doc and run

make html

Documentation output is found in the build directory.

Running the tests

The modules are shipped with a number of unit tests, located in the tests directory. To run a test, issue the command

python setup.py nosetests

or if you want to run individual tests, cd to tests and run (for example)

nosetests -v -s test_db.py

bcbb's People

Contributors

alneberg, b97pla, brainstorm, chapmanb, ewels, galithil, guillermo-carrasco, kwoklab-user, mariogiov, mayabrandi, parlundin, percyfal, peterjc, remiolsen, senthil10, skinner, vals, vezzi


bcbb's Issues

no unicode support in logbook module

from bcbio.pipeline import log
log.warn("hej höj håj häj")
Traceback (most recent call last):
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 214, in handle
self.emit(record)
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 552, in emit
self.write(self.format_and_encode(record))
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 537, in format_and_encode
rv = self.format(record) + '\n'
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 193, in format
return self.formatter(record, self)
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 363, in __call__
line = self.format_record(record, handler)
File "/bubo/home/h27/pontusla/.virtualenvs/devel/lib/python2.6/site-packages/Logbook-0.3-py2.6-linux-x86_64.egg/logbook/handlers.py", line 357, in format_record
return self._formatter.format(record=record, handler=handler)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
Logged from file , line 1
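A minimal sketch of the failure mode and a possible workaround, assuming the message arrives as UTF-8 encoded bytes (format_record below is a hypothetical helper, not part of logbook):

```python
# -*- coding: utf-8 -*-
def format_record(fmt, message):
    """Combine a format string with a message, decoding bytes explicitly.

    Hypothetical helper: in Python 2, mixing a unicode format string with a
    byte string holding non-ASCII UTF-8 text triggered an implicit ascii
    decode and the UnicodeDecodeError above; decoding first avoids it.
    """
    if isinstance(message, bytes):
        message = message.decode("utf-8")
    return fmt % message

line = format_record(u"WARNING: %s", "hej höj håj häj".encode("utf-8"))
print(line)
```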

Save manually entered comments in summary

In the project read counts summary worksheet, the information is currently regenerated when a new flowcell is added. This means that manually entered edits and comments are lost. Implement manually editable columns that will be preserved across updates.

Qseq->Fastq conversion fails on v3 illumina flowcells

$ solexa_qseq_to_fastq.py 110617_Ac00hfabxx 4
Traceback (most recent call last):
  File ".virtualenvs/production/bin/solexa_qseq_to_fastq.py", line 7, in 
    execfile(__file__)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 179, in 
    main(args[0], args[1].split(","), options.do_fail, options.outdir)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 55, in main
    write_lane(lane_prefix, out_prefix, outdir, fail_dir)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 59, in write_lane
    one_files, two_files, bc_files = _split_paired(qseq_files)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 147, in _split_paired
    cur_size = _get_qseq_seq_size(f)
  File "bcbb/nextgen/scripts/solexa_qseq_to_fastq.py", line 167, in _get_qseq_seq_size
    return len(parts[8])
IndexError: list index out of range
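The crash comes from indexing field 8 of a qseq row that has fewer columns. A hedged sketch of a defensive parser (qseq_seq_size is a hypothetical helper, not the shipped _get_qseq_seq_size):

```python
def qseq_seq_size(line):
    """Return the read length from one qseq line, tolerating short rows.

    Standard qseq rows carry the sequence in tab-separated field 8; some v3
    flowcell files apparently ship truncated rows, so the index is guarded
    instead of assumed (hypothetical fix, not the shipped code).
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) > 8:
        return len(parts[8])
    return 0

full = "M1\t1\t1\t1\t0\t0\t0\t1\tACGTACGT\thhhhhhhh\t1"
short = "M1\t1\t1\t1"  # truncated v3-style row
print(qseq_seq_size(full), qseq_seq_size(short))
```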

line 0: undefined variable: inside when plotting insert size metrics

[Wed Apr 06 13:28:29 CEST 2011] net.sf.picard.analysis.CollectInsertSizeMetrics HISTOGRAM_FILE=/....-sort-dup-insert.pdf INPUT=/....-sort-dup.bam OUTPUT=/...-sort-dup.insert_metrics VALIDATION_STRINGENCY=SILENT TAIL_LIMIT=10000 MINIMUM_PCT=0.01 ASSUME_SORTED=true ST
OP_AFTER=0 TMP_DIR=/tmp/roman VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
INFO 2011-04-06 13:28:29 ProcessExecutor [1] "FR = red"
INFO 2011-04-06 13:28:29 ProcessExecutor null device
INFO 2011-04-06 13:28:29 ProcessExecutor 1
[Wed Apr 06 13:28:29 CEST 2011] net.sf.picard.analysis.CollectInsertSizeMetrics done.
Runtime.totalMemory()=378404864
line 0: undefined variable: inside

gnuplot> plot '/...._fastq_stats.txt' using 1:7:11:12:9 with candlesticks lt 1 lw 1 title 'Quartiles' whiskerbars, '' using 1:8:8:8:8 with candlesticks lt -1 lw 2 title 'Medians'
^
line 0: ';' expected

Which tools support compression?

Would it be safe to compress FastQ files and zcat everything (to save space)?

Which downstream tools would support compressed files right away, and which don't?

Should we use CRAM, or is it still immature?
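For the Python-side steps, transparent gzip handling could look like the sketch below (open_fastq is a hypothetical helper; whether each downstream tool accepts .gz input still has to be checked per tool):

```python
import gzip
import os
import tempfile

def open_fastq(path):
    # Hypothetical helper: .gz files go through gzip in text mode,
    # plain files are opened directly, so callers stay agnostic.
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path)

# Round-trip one FastQ record through gzip to show the idea.
record = "@read1\nACGT\n+\nIIII\n"
path = os.path.join(tempfile.mkdtemp(), "test.fastq.gz")
with gzip.open(path, "wt") as fh:
    fh.write(record)
with open_fastq(path) as fh:
    print(fh.readline().strip())
```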

Adapter contamination screening

Need to include some form of adapter contamination screening before multiplexing, perhaps as part of a general screening module (together with PhiX removal, contamination screening and mate-pair linker removal).

Race condition on illumina_finished_msg finalization?

Sometimes transfer/transferred.db contains duplicated entries for a Run, but it does not get processed. After removing the duplicate entries and re-running illumina_finished_msg.py it is processed without further issues.

Mate pair linker removal

Need to include Paul's script for mate-pair linker removal - ONLY for mate-pair runs. Perhaps as part of general screening module together with PhiX, contamination + adapter screeners.

Error rates

Include the error rate information for each run in the project read count summary.

Multiple project ids on run_info.yaml

Apparently if there are multiple projects per lane, samplesheet.py only shows the first one.

This has greater implications when it comes to data delivery on a cluster facility since the data cannot be delivered automatically.

@b97pla, @hussius, feel free to put/describe some examples about this if you have those.

Fetching Uppnex id from gdocs spreadsheet

get_uppnex_project_id method in bcbb/nextgen/bcbio/google/project_data.py takes a project name (e.g. 'J.Doe_11_01') and the data structure obtained from parsing the post_process.yaml configuration file and returns the corresponding Uppnex id, or 'N/A' if none was found.

Requires the name of the spreadsheet and worksheet, where the uppnex id can be found, to be present in the configuration file along with the credentials for a google account with read permissions on this spreadsheet.

Handle unicode strings on samplesheet

"Åre svärlöng" and other Swedish characters (found in SampleID) are misinterpreted when generating the run_info.yaml:

  multiplex:
  - barcode_id: 12
    barcode_type: SampleSheet
    name: !!binary |
      MjIxMDdfooozM=
    sequence: TTAGGCA
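A hedged sketch of decoding the SampleID bytes before they reach the YAML serializer, so names stay text instead of !!binary blobs (decode_sample_id is a hypothetical helper; the encodings tried are assumptions):

```python
def decode_sample_id(raw):
    """Decode a SampleID field from an Illumina samplesheet.

    Hypothetical helper: samplesheets with Swedish characters may arrive as
    UTF-8 or latin-1 bytes; trying UTF-8 first and falling back to latin-1
    (which accepts any byte) keeps names as real text.
    """
    if isinstance(raw, bytes):
        for enc in ("utf-8", "latin-1"):
            try:
                return raw.decode(enc)
            except UnicodeDecodeError:
                continue
    return raw

print(decode_sample_id("Åre svärlöng".encode("utf-8")))
```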

Single lanes, if they have barcode, must be trimmed downstream too

A samplesheet/run.yaml containing:

- analysis: Standard
  description: '1'
  flowcell_id: 666IYTBXX
  genome_build: hg19
  lane: '1'

Must also specify the barcode imported from the samplesheet, so that barcode_sort_trim knows how and what to trim.

Alternatively it could just trim the last 6 chars that contain the barcode.
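That fallback of trimming a fixed-length 3' barcode could be sketched as (trim_barcode is a hypothetical helper, assuming the barcode sits at the 3' end):

```python
def trim_barcode(seq, qual, bc_len=6):
    """Trim a 3'-end barcode of bc_len bases from read and quality strings.

    Hypothetical helper for the fallback above: drop the last bc_len
    characters of both the sequence and the quality line.
    """
    return seq[:-bc_len], qual[:-bc_len]

seq, qual = trim_barcode("ACGTACGTTTAGGC", "IIIIIIIIHHHHHH")
print(seq, qual)
```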

unify the demultiplex read count methods

Counts for demultiplexed reads are gathered twice: once immediately after demultiplexing, when they are uploaded to gdocs, and again at the end of the analysis, when they are written to run_summary.yaml (and possibly for the automated report generation). We should look into unifying the code for this.

Handle Standard PE runs automatically (those have neither run_info.yaml nor a samplesheet)

Detect Standard PE runs and treat them the same way we do multiplexed runs with run_info.yaml, running the preliminary analysis on them too.

Detecting run type

Where a run has had the index bases correctly specified during the run set-up, the xml block you have identified in the RunInfo.xml will suffice to tell whether this is multiplexed or not - note that there must be an entry in this block.

You will also find the following block () in the run-folder/Data/Intensities/BaseCalls/config.xml file if you require another source.

<BaseCallAnalysis>
  <Run Name="BaseCalls">
    <RunParameters>
      <Barcode>
        <Cycle Use="true">31</Cycle>
        <Cycle Use="true">32</Cycle>
        <Cycle Use="true">33</Cycle>
        <Cycle Use="true">34</Cycle>
        <Cycle Use="true">35</Cycle>
        <Cycle Use="true">36</Cycle>
        <Cycle Use="true">37</Cycle>
      </Barcode>
    </RunParameters>
  </Run>
</BaseCallAnalysis>
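Detecting multiplexing from that block could be sketched with ElementTree (is_multiplexed is a hypothetical helper; the inline XML here is a trimmed stand-in for the real config.xml):

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for Data/Intensities/BaseCalls/config.xml.
config_xml = """
<BaseCallAnalysis>
  <Run Name="BaseCalls">
    <RunParameters>
      <Barcode>
        <Cycle Use="true">31</Cycle>
        <Cycle Use="true">32</Cycle>
      </Barcode>
    </RunParameters>
  </Run>
</BaseCallAnalysis>
"""

def is_multiplexed(xml_text):
    """Hypothetical helper: a run is multiplexed if the Barcode block
    contains at least one Cycle entry."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//Barcode/Cycle")) > 0

print(is_multiplexed(config_xml))
```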

Automated dataset removal on dump machine

Come up with a good, reliable dataset removal strategy:

  • Queuing system handling removal of datasets: deepmd5 of fastq folder on both ends ?
  • In case of emergency, remove intermediate files until there's enough space back: qseq, fastq, bcl
  • Lab people only need InterOp folder to report/debug to Illumina.
  • Substitute transfer/transferred.db with something that depicts the states better (by using rabbitMQ too ?)

md5deep sums reported to rabbitMQ

Using the glob on illumina_finished_msg, we should be able to checksum all the files after fastq generation and then check those on demand on other ends after the files have been transferred.

Thanks @b97pla for the discussion on this.

plaintext pw in config file

Hi there,

in nextgen/config/universe_wsgi.ini

I just searched for scilifelab to see where the rest of you where, and noted this in passing. I hope comicbookguy has a strong jail for the galaxy user. ;-)

Anyway, great initiative to put it all up here! Cheers!

// Daniel

Handle samples with no barcode

When a whole lane is used without sample multiplexing:

['fcid', '1', 'index2', 'unknown', '', 'desc', 'N', 'R1', 'DvT']

Related to issue #12

Illumina run flowcell identifier changed

Illumina has changed the flowcell identifier on the new high-density flowcell release from:

110610_SN666_3572_AB52LABXX (old)

To something like:

110617_SN666_3572_Ad00pqabxx

Glob from the script must be adapted accordingly.
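An adapted pattern accepting both identifier styles could be sketched as (the regex is an assumption based only on the two examples above):

```python
import re

# Hypothetical pattern: date_machine_runnumber_position+flowcellid, with the
# flowcell suffix allowed to be upper- or lowercase to cover both styles.
RUN_RE = re.compile(r"^\d{6}_[^_]+_\d+_[AB]\w+$", re.IGNORECASE)

for run in ("110610_SN666_3572_AB52LABXX", "110617_SN666_3572_Ad00pqabxx"):
    print(run, bool(RUN_RE.match(run)))
```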

Load run_info.yaml back into galaxy ngLIMS

After generating the run_info.yaml file from the Illumina samplesheet, its attributes can be automatically loaded back to galaxy in order to:

  1. Get better trackability of the samples.
  2. Fill most of the form fields automatically so that the user only has to fill the missing ones (bulk import).

No of samples

The number of samples that have been run on a flowcell for a particular project should be included in the project read count summary. This information should also be displayed in the delivery note (e.g. "4 samples out of the ordered 6 were run and yielded...")

Create transfer/transferred.db if not present

Add code in illumina_finished_msg.py.

Additionally, consider moving it into rabbitmq directly; that would need some user-friendly push & pop helper scripts, though, to assist occasional manual intervention, e.g.:

queued_runs.py [--drop] 

Demultiplex based on barcodes found in reads

Instead of blindly trusting the samplesheet or manually specified barcodes, cluster the barcodes found in the reads in order to detect contamination or other problems.

In other words, we should be able to detect which barcodes are there just by processing the fastq files (since we know that the barcodes are attached at the 3' end by default).

Suggested by Ellen, Max and others.
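The detection idea could be sketched by tallying 3'-end k-mers (observed_barcodes is a hypothetical helper; a real version would also cluster near-identical barcodes to absorb sequencing errors):

```python
from collections import Counter

def observed_barcodes(reads, bc_len=6, min_fraction=0.01):
    """Tally the last bc_len bases of each read and keep frequent ones.

    Hypothetical helper: instead of trusting the samplesheet, count the
    3'-end barcodes actually present and report any exceeding min_fraction
    of reads, which also surfaces contamination.
    """
    counts = Counter(read[-bc_len:] for read in reads)
    total = sum(counts.values())
    return {bc: n for bc, n in counts.items() if n / total >= min_fraction}

# 90% of reads carry one barcode, 10% another.
reads = ["ACGT" + "TTAGGC"] * 90 + ["ACGT" + "GGCTAC"] * 10
print(observed_barcodes(reads))
```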

.fastq instead of .fastq.txt

Run a test to see if these changes break the analysis:

  • Should have "SampleID" on the filename, along with FCID.
  • Should not have *.txt endings.
