dkfz-odcf / alignmentandqcworkflows

The DKFZ alignment workflow plugin originally developed at the eilslabs

Home Page: https://github.com/DKFZ-ODCF/AlignmentAndQCWorkflows/wiki

License: Other

Python 25.15% Shell 21.60% R 8.86% Perl 19.62% Awk 0.28% D 3.28% Java 8.25% Groovy 12.97%
workflow bwa wgbs quality-control alignment fastq-files roddy wes bwa-mem bam-files

alignmentandqcworkflows's Introduction

Alignment and Quality Control Plugin for Roddy

This plugin contains alignment and quality-control-related Roddy workflows:

  • PanCancer alignment workflow for whole genome (WGS) and exome (WES)
  • Bisulfite core workflow (WGBS) using methylCtools

These are essentially BWA (bwa mem) alignment workflows with numerous additional quality control steps. QC steps acting on the BAM files are mostly performed by piping the input data into multiple QC tools simultaneously to achieve high throughput. All workflows can be run per sample or on combinations of tumor and control samples.
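As an illustration of this fan-out piping pattern, a minimal sketch only (the tool selection, file names, and actual command lines of the workflow differ):

# One pass over the BAM feeds several QC tools at once via tee and process substitution.
bam=sample_merged.mdup.bam     # placeholder input

samtools view -u "$bam" \
  | tee >(samtools flagstat - > "${bam}.flagstat.txt") \
        >(samtools stats -    > "${bam}.stats.txt") \
  | samtools view -c -f 1024 - > "${bam}.duplicate_count.txt"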

Documentation

Please refer to the wiki for detailed information on all aspects of the workflow, including installation, configuration structure and interpretation of results.

German Network on Bioinformatic Infrastructure (de.NBI)

Your opinion matters! The development of this workflow is supported by the German Network for Bioinformatic Infrastructure (de.NBI). By completing this very short (30-60 seconds) survey you support our efforts to improve this tool.

Change Logs

Please see the changelogs file for details.

alignmentandqcworkflows's People

Contributors

mobilegenome, vinjana


alignmentandqcworkflows's Issues

methylCtools should write metadata into BAM attribute instead of identifier

methylCtools (fqconv) currently encodes information in a string that is appended to the read identifier. This causes two problems:

  1. newer BWA versions check whether the identifiers of a read pair are identical and (if unpatched) fail on such data
  2. if reads carry /1 or /2 suffixes, the appended string prevents BWA from stripping the read-number suffix, and downstream tools fail

fqconv should write the meta information into the FASTQ comment in a suitable SAM-spec-compatible attribute format, and BWA should be called with the -C option. bconv then needs to parse the information back from the comment instead of taking it from the extended identifier.

See https://github.com/samtools/hts-specs/blob/master/SAMtags.pdf:

e.g. X?:Z:$value
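A hedged sketch of the intended call pattern (the read name, tag name XC, and file names are illustrative placeholders, not the actual fqconv output format):

# Hypothetical FASTQ record: the conversion metadata sits in the comment field
# (SAM-style TAG:TYPE:VALUE), not appended to the read identifier:
#
#   @HWI-XXXX:104:FLOWCELL:1:1102:19656:66811 XC:Z:GA
#   ACGTACGT...
#   +
#   IIIIIIII...
#
# With -C, bwa mem copies the FASTQ comment verbatim into the SAM record as a tag,
# so bconv can read the metadata back from the alignment instead of the read name.
bwa mem -C -t 8 "$INDEX_PREFIX" converted_R1.fastq.gz converted_R2.fastq.gz > aligned.sam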

roddy.sh run fails on existing output (WGBS)

Consider the following example Roddy call:

roddy.sh run testCoBaseConfigs-Roddy-3-0.Picard.SoftwareBwa.WGBS@alignment methylationpid --usemetadatatable=wgbsTableTest1.tsv --useconfig=applicationProperties-workstation-to-lsf.ini --usefeaturetoggleconfig=featureToggles.ini --configurationDirectories=configs,AlignmentAndQCWorkflows/,AlignmentAndQCWorkflows/resources --usePluginVersion=AlignmentAndQCWorkflows:1.2.73 --useiodir=wgbsTableTest1.testCoBaseConfigs-Roddy-3-0.Picard.SoftwareBwa.WGBS.noInputDir --cvalues="INDEX_PREFIX:reference_genomes/bwa06_methylCtools_hs37d5_PhiX_Lambda/hs37d5_PhiX_Lambda.conv.fa,CHROM_SIZES_FILE:reference_genomes/bwa06_methylCtools_hs37d5_PhiX_Lambda/stats/hs37d5_PhiX_Lambda.fa.chrLenOnlyACGT.tab,CYTOSINE_POSITIONS_INDEX:reference_genomes/bwa06_methylCtools_hs37d5_PhiX_Lambda/hs37d5_PhiX_Lambda.pos.gz,CHROMOSOME_INDICES:(21 22),outputAllowAccessRightsModification:false,usedResourcesSize:xs,runFastQC:true,runFingerprinting:true,usedResourcesSize:m" --additionalImports=wgbs-standard-lsf --useRoddyVersion=3.0.9

When rerunning the WGBS analysis on existing data, the Picard merge/mark duplicates step dies with exit code 100 (99):

To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: Value was put into PairInfoMap more than once.  1: run1_id2:HWI-xxxxxxx:104:XXXXXXXXX:1:1102:19656:66811
        at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:132)
        at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
        at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
        at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:285)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:114)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:187)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:89)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:99)

This is probably because the script tries to keep an existing BAM file and instead merges in a new lane that is actually already in the BAM file (because this is a simple "run").

The logic for handling BAMs involves 3 factors:

  1. The default BAM file (FILENAME in mergeAndMarkOrRemoveDuplicatesSlim) is present
  2. A specific BAM file is provided via the bam parameter.
  3. The useOnlyExistingTargetBam parameter is set to true or false (false by default).

TODO: Clean up the logic and fix the bug.

Move alignment files from temporary to final location only after the BWA checks have finished

We had two cases where alignment jobs failed, but the alignment files had already been moved to the final location, so a rerun required finding and deleting those wrongly 'completed' alignment files.

It is not entirely clear why this happened. One likely explanation is that the files are not moved to their final destination at the very end of the script, but before a couple of final BWA checks. If one of those checks fails, the job fails, but the alignment files are no longer marked as temporary. This is unwanted, because the BWA checks are meant to indicate alignments that need to be replaced or otherwise taken care of.

Things to do in summary:

  • Alignment files should be moved to the final path at the very end of the alignment workflow
  • Are the BWA checks carried out at the end of the alignment workflow still necessary?
  • If yes, change the code for the BWA checks so that they work on the temporary alignment file paths (since we will now move the alignments only after the checks)

Check: Chromosomes in stats file need to be subset of those in genome (index)

Chromosomes in the stats file (chromosome lengths, ATCG) need to be checked to be a subset of those in the FASTA file/index. If they are not, the workflow is probably misconfigured and a descriptive error message needs to be thrown. All tools taking the CHROM_SIZES_FILE should do this check. Similarly, the CHROMOSOME_INDICES variable of the WGBS workflow should also be checked.

Failure to check may result in partial or mislabeled statistics, e.g. in the case of xenograft alignments, without any error being raised. In the worst case this remains unnoticed. Alternatively, an incorrect CHROM_SIZES_FILE may yield an unspecific message from the filter_readbins.pl script (which also occurs in other situations, e.g. if the input is truncated due to pipe errors).

This should be checked early in the job scripts, or even earlier, e.g. in checkExecutability() in the Java code. If there is no BAM file yet, the check should be done against the BWA index directly to prevent mapping against the wrong INDEX. For the BWA index the .ann file can be used. A minimal sketch of the subset check is given below.
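A minimal sketch of such a subset check on the shell level (file names are assumptions, not the workflow's configuration values):

# Every chromosome listed in the stats file must also occur in the BAM header
# (or, if no BAM exists yet, in the BWA index's .ann file).
chrom_sizes=hs37d5_PhiX_Lambda.fa.chrLenOnlyACGT.tab    # first column: chromosome name
bam=merged.mdup.bam

bam_chroms=$(samtools view -H "$bam" | awk -F'\t' '$1 == "@SQ" { sub(/^SN:/, "", $2); print $2 }' | sort -u)
stats_chroms=$(cut -f1 "$chrom_sizes" | sort -u)

missing=$(comm -23 <(echo "$stats_chroms") <(echo "$bam_chroms"))
if [[ -n "$missing" ]]; then
    echo "ERROR: chromosomes in $chrom_sizes are missing from $bam:" >&2
    echo "$missing" >&2
    exit 1
fi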

Calculate target region size from TARGET_REGION_FILE

It is a nuisance to have to specify the target region size manually, although it can be calculated simply and efficiently from the input target region file (see the sketch below):

  1. Merge overlapping regions (BAM reads are counted only once even when they overlap multiple target region features, so the effective size is the size of the covered regions).
  2. Calculate the total size of the merged regions.
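A minimal sketch, assuming a BED-formatted target region file and that bedtools is available on the node:

# Merge overlapping target features, then sum up the merged interval lengths.
sort -k1,1 -k2,2n "$TARGET_REGION_FILE" \
  | bedtools merge -i - \
  | awk '{ size += $3 - $2 } END { print size }'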

De/Compressor detection will not work properly in all cases

The current detection code is:

    for (int i = 0; i < er.resultLines.size(); i++) {
        // Look at Matthias' ExomePipeline extension. The code is mainly taken from there.
        String dCString = null
        String dRString = "gzip -c" // POSSIBLE-ERROR: No zipper set for 'cat'/ASCII
        String type = er.resultLines[i]
        if (type.startsWith("setgid "))
            type = type[7..-1].trim().split(" ")[0]
        if (type == "0") continue
        if (type == "gzip") {
            dCString = "gunzip -c"
            dRString = "gzip -c"
        } else if (type == "bzip2") {
            dCString = "bunzip2 -c -k"
            dRString = "bzip2 -c -k"
        } else if (type == "ASCII" || type == "ASCII text") {
            dCString = "cat"
            dRString = "head -n E"
        }
        allLaneFiles[i].setDecompressionString(dCString)
        allLaneFiles[i].setRecompressionString(dRString)
    }

It is unclear whether detection via file -bL ever worked properly (or whether we simply did not notice that it did not, due to other circumstances). The detection needs to recognize strings such as the following (see the sketch below):

ASCII text with very long lines
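A shell-level sketch of a more tolerant detection (the workflow's actual detection lives in the Groovy code above; the function name is illustrative). Matching by prefix instead of by exact string also covers variants such as "ASCII text, with very long lines":

detect_compression() {
    local type
    type=$(file -bL "$1")
    case "$type" in
        gzip*)  echo "gunzip -c" ;;
        bzip2*) echo "bunzip2 -c -k" ;;
        ASCII*) echo "cat" ;;
        *)      echo "unknown compression type: $type" >&2; return 1 ;;
    esac
}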

Lift methylation meta job to WFMS level

The methylationcalling meta job starts bisulfite calling processes with bash -- always in pairs of the largest (front) and smallest (back) chromosome. This has a number of disadvantages:

  • Chromosomes are processed sequentially (in pairs), which limits the throughput of the workflow.
  • The batch processing system's scheduling is not exploited; instead, manual scheduling is used!
  • There is job-management code in Bash (argh).
  • Checkpointing does not work correctly (why restart both chromosomes if only one of them needs to be restarted?).

This job should be done by the workflow manager! Move the logic from the meta job into the workflow manager!

Metadata Table access for WES/WGS

In contrast to WGBS, the WES and WGS workflows still extract execution parameters (sample, pid, etc.) from paths.

  • Centralize all path parsing code (maybe in a service)
  • Can the service be made a ParameterAccumulator that collects the data incrementally?
  • Can the execution model be changed to first read all parameters (e.g. from the paths or the metadata table) and afterwards execute?

Furthermore,

  • accept a BAM file path as input and merge into the existing BAM (incremental merging)
  • check that all read groups in the BAM are also provided in the metadata table (+ possibility to turn this check off)
  • if all FASTQs are already in the BAM, do (1) nothing, (2) mark duplicates again, or (3) only QC -- best provide all three options as configurable behaviour.

Note that path-parsing code also exists in RoddyCore, e.g. de.dkfz.roddy.core.RuntimeService#extractDataSetIDFromPath.
It may be possible to start in the AQCWF, generalize into the COWF, and later even into the RoddyCore, as was done for the MetadataTable. MetadataTable and PathExtraction could then become strategies (Strategy Pattern); other strategies might be, e.g., getting the data from a database (such as OTP).

WGBS Bug: coveragePlot.R

Bug in the R script coveragePlot.R; it finishes with the following error:

Error in axis(1, at = pretty(c(0, nrow(sampleOneCon[[i]]))), labels = paste(pretty(c(0,  :
  'at' and 'labels' lengths differ, 8 != 5
Execution halted

coveragePlot.R fix

Please fix a small bug in the coveragePlot.R script: the plotting device is only closed if twoSamples is true. The dev.off() should be moved out of the if (twoSamples) block; otherwise, the script fails with a 'Too many open devices' error. See the snippet below:

I would do it myself but there seems to be a central place where this change has to be made and committed, and I am not sure how to do that.

for( i in names(sampleOneCon)){
    
    chr.name = i
    chr.name = gsub("^0*", "", i)
    chr.name = paste("chr", chr.name, sep = "")
    
    fname = outputFile
    fname = gsub("\\.png", paste("_", chr.name, ".png", sep = ""), fname)
    
        png(fname, width=1100,height=1000)
        par(mfrow=c(3,1),cex=1.5,mar=c(2,4,2,0.1))#
        #sampleOne#
        plot(log(sampleOneCon[[i]]$reads+1,2),pch=".",main=paste(sampleOneName,"_", chr.name,sep=""),ylab=paste("log2 #",countType," per ",windowSize,"kb",sep=""),axes=FALSE)
        box()#
        axis(2)#
        axis(1,at=pretty(c(0,nrow(sampleOneCon[[i]]))),labels=paste(pretty(c(0,nrow(sampleOneCon[[i]]))/bp_scale),"Mb",sep=""))#
        #sampleTwo#
        if (twoSamples) {
          plot(log(sampleTwoCon[[i]]$reads+1,2),pch=".",main=paste(sampleTwoName,"_", chr.name,sep=""),ylab=paste("log2 #",countType," per ",windowSize,"kb",sep=""),axes=FALSE)
          box()#
          axis(2)#for( i in names(sampleOneCon)){
          axis(1,at=pretty(c(0,nrow(sampleTwoCon[[i]]))),labels=paste(pretty(c(0,nrow(sampleTwoCon[[i]]))/bp_scale),"Mb",sep=""))#
          #
          #Ratio between the 2 samples normalized to #reads all chromosomes
          plot(log(((sampleTwoCon[[i]]$reads)/sumSampleTwo)/((sampleOneCon[[i]]$reads)/sumSampleOne),2),pch=".",main=paste("log2 ratio ",sampleTwoName," vs. ", sampleOneName, " ", chr.name, sep=""),ylab=paste("#",countType," per ", windowSize, "kb normalized to total reads",sep=""),axes=FALSE)#
          box()#
          axis(2)#
          axis(1,at=pretty(c(0,nrow(sampleOneCon[[i]]))),labels=paste(pretty(c(0,nrow(sampleOneCon[[i]]))/bp_scale),"Mb",sep=""))#
          abline(h=0,col=2)#
          dev.off() # THIS LINE HAS TO BE MOVED DOWN OUT OF THE IF BLOCK, SO THAT DEVICES ARE ALWAYS CLOSED
        }
}

BAM integrity check job

If two jobs are run at the same time (by mistake, or due to a bug in an automation or batch processing system -- our scenario), their output files get mixed up and may be corrupt. For the statistics files this might be acceptable (they can be recalculated), but not for the BAM (e.g. if the FASTQs are deleted after some time -- which is our use case). The cost of such mistakes is higher than the cost of implementing and maintaining this feature.

Implement a simple job that checks the integrity of the final merged BAM file, e.g. by samtools flagstat $bam > /dev/null.
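A minimal sketch of such a check job (paths and the surrounding job wrapper are placeholders): samtools quickcheck catches truncation and missing EOF blocks cheaply, while flagstat forces a full read of the file and fails on corrupt records.

set -euo pipefail
bam="$1"

samtools quickcheck "$bam"
samtools flagstat "$bam" > /dev/null
echo "BAM integrity check passed: $bam"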

Make output MultiQC-compatible

MultiQC simplifies the visual presentation of QC data. For display in MultiQC, result files should either be in one of the already supported formats or be annotated accordingly. Check the output files and restructure their content to support MultiQC.

Allow setting the number of bases at read start that should be ignored for conversion statistics

According to this article, the biased region during PBAT may extend further than 9 base pairs. However, the current script always ignores exactly 9 bp when creating conversion statistics.

  • Make the number of bases to be ignored configurable, including the value 0 (=ignore no base) and values larger than 9.
  • Keep only the new version of the script while removing the original methylCtools script.
  • Make a pull request for methylCtools.

Tagmentation WGS and WES

Tagmentation WGS and WES data are coming up and need to be processed in a way similar to T-WGBS data, namely by first marking duplicates per library and then simply merging the library BAMs.

The following steps need to be taken:

  • allow specification of metadata table also for WES and WGS data (#19)
  • adapt the WGBS merge-job code for WGS and WES

Clean-up code around adapter trimming

The current code concerned with adapter clipping is a PITA. Variables i1, i2, o2, etc. (please use descriptive names) are defined in bwaCommonAlignmentSettings.sh. Some of them are used, some are not (o2?). The exit status of the trimmomatic call (why is there an eval?) is not caught and checked.

Clean this mess up.

Add check in checkExecutability() for correct markDuplicatesVariant.

In the method mergeAndRemoveDuplicatesSlim, checking and throwing an exception only at submission time is insufficient. Instead, there should already be a check in the workflow/plugin method checkExecutability(), before submission starts.

Implementation:

  • Extract the case statement from de.dkfz.b080.co.common.BasicCOProjectsRuntimeService#extractLibrariesFromSampleDirectory into a static method. Leave the throw where it is, because the exception is only useful here. You could also change the code into a hash { biobambam: MERGEANDMORMDUP_SLIM_BIOBAMBAM, ... }
  • Call that new method via a new check method from checkExecutability().

Document all output files

Detailed documentation of all output files is needed.

  • Create high-level documentation of all inputs, outputs, and the general workflow structure.
  • Document the principles used (components plugged together via pipes) and the requirements (throughput optimization -> piping).

Important side task: while documenting, keep track of necessary improvements to the output and open new issues.

Xenograft needs separate coverage for grafted and host genome

In xenograft samples the coverage is expected to differ between the grafted (human) genome and the host (mouse) genome. This needs to be visible in the statistics.

The workflow usually uses a single set of chromosomes also for xenografts, but either the human or the mouse chromosomes are prefixed (CHR_PREFIX). This prefix can be used to calculate separate statistics; see the sketch below.
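A minimal sketch of how the prefix could be used (the prefix value and file name are assumptions):

CHR_PREFIX="mm_"    # hypothetical prefix marking host (mouse) chromosomes

samtools idxstats merged.mdup.bam \
  | awk -v p="$CHR_PREFIX" '
      $1 == "*"         { next }                          # skip the unmapped summary line
      index($1, p) == 1 { host  += $3; hostLen  += $2 }   # host chromosomes
      index($1, p) != 1 { graft += $3; graftLen += $2 }   # grafted (human) chromosomes
      END {
          printf "graft: %d mapped reads over %d bp\n", graft, graftLen
          printf "host:  %d mapped reads over %d bp\n",  host,  hostLen
      }'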

checkExecutability(): adapter file for trimming exists

Currently, the file defined by the CLIP_INDEX variable is checked, but only if it does not start with $DIR_EXECUTION. The file should always be checked and the workflow cancelled if the file is not readable. The check should already be done in QCPipeline, since clipping can be done with all alignment data (currently it is only done in BisulfiteCoreWorkflow).

Use ExecutionContext.getExecutionDirectory for this. Alternatively, it may be sufficient to substitute $DIR_EXECUTION by ${DIR_EXECUTION}.

bwa-mem2 support

bwa-mem2 is 1.7-2 times faster at the cost of roughly 2 times more memory. For time-critical data bwa-mem2 should be used.

The bwa-mem2 command line interface (CLI) seems to be compatible with bwa mem. Therefore (see the sketch below):

  • Change the binary name and the module load
  • Adapt the conda environment
  • Is anything else necessary besides the path to the FASTA for working with bwa-mem2 reference genomes?
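A hedged sketch of the expected change (module names and paths are assumptions; the mem call itself is drop-in compatible):

# bwa-mem2 needs its own index, built once per reference genome from the same FASTA.
bwa-mem2 index "$REFERENCE_FASTA"

# The alignment call then only swaps the binary name.
bwa-mem2 mem -t 8 -R "$READ_GROUP" \
    "$REFERENCE_FASTA" reads_R1.fastq.gz reads_R2.fastq.gz > aligned.sam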

Check all read BAM files for MD5 sum on the fly

Data can (and does) degrade on disk. This can be detected by calculating the MD5 sum whenever a BAM is read and comparing the result to the MD5 sum stored in the .md5 file.

  • Whenever a non-temporary big file is written, an MD5 sum should be calculated on the fly.
  • Whenever a non-temporary big file is read, the MD5 sum should be calculated and compared to the reference.
  • Candidate files are
    • FASTQ (MD5 sums from disk or OTP)
    • BAM (MD5 sums from the alignment workflow, already implemented)

This issue would also increase compliance with the Rahmendatenschutzkonzept.
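A minimal sketch of on-the-fly checksumming with tee (file names and the downstream consumer are placeholders):

# Writing: checksum the stream while the merged BAM is written to disk.
samtools merge - lane1.bam lane2.bam \
  | tee >(md5sum | cut -d' ' -f1 > merged.bam.md5) \
  > merged.bam

# Reading: checksum while a consumer (here samtools flagstat) reads the file,
# then compare against the stored reference value.
observed=$(tee >(samtools flagstat - > merged.flagstat.txt) < merged.bam | md5sum | cut -d' ' -f1)
expected=$(cut -d' ' -f1 merged.bam.md5)
[[ "$observed" == "$expected" ]] || { echo "MD5 mismatch for merged.bam" >&2; exit 1; }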

De novo adapter detection

Currently, adapters are only trimmed if the adapter sequence is known beforehand. Implement de novo adapter detection with fastp.

Allow setting:
(a) don't trim
(b) trim with trimmomatic (should remain default when trimming -- for backwards compatibility)
(c) trim with fastp without known adapter
(d) trim with fastp including known adapter
(e) only check whether de novo (observed) adapter is compatible with provided (expected) adapter.

By default, the quality-trimming should be similar (by cutoffs) to the quality-trimming currently done with trimmomatic.

The implementation should allow for flexible configuration of fastp, in case any of the other trimming options are required.


Implementation details.

fastp \
  --thread 12 \
  --report_title $title"_"$id \
  -i $r1 \
  -I $r2 \
  -o $OUTDIR"/"$title"_"$id"_"$(baseOutput "$r1")"_R1.fastq" \
  -O $OUTDIR"/"$title"_"$id"_"$(baseOutput "$r1")"_R2.fastq"  \
  -Q \
  -l 8 \
  --detect_adapter_for_pe \
  --json $title"_"$id"_"$(baseOutput "$r1")"_fastp.json" \
  --html $title"_"$id"_"$(baseOutput "$r1")"_fastp.html" \
  2> $title"_"$id"_"$(baseOutput "$r1")"_fastp_out.log" 

This call needs to be extended to additionally do quality-based trimming.
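A hedged sketch of the extended call (the sliding-window cutoffs are assumptions and would have to be aligned with the current trimmomatic settings; $prefix stands for the output naming used above):

fastp \
  --thread 12 \
  -i "$r1" -I "$r2" \
  -o "${prefix}_R1.fastq" -O "${prefix}_R2.fastq" \
  --detect_adapter_for_pe \
  --cut_right \
  --cut_right_window_size 4 \
  --cut_right_mean_quality 20 \
  --length_required 36 \
  --json "${prefix}_fastp.json" \
  --html "${prefix}_fastp.html"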

Add for-loop to waitForRegisteredPids

The two functions waitForRegisteredPids and its _BashSucksVersion variant wait for all PIDs en bloc. It is therefore not possible to tell exactly which PID produced which error code.

  1. Use a for-loop and report the PID and maybe the index. After switching the PID registry to an associative array, report the array keys.
  2. Report all failed PIDs with their error codes. Accumulate the errors and fail if one or more jobs failed (a sketch follows below).
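A minimal sketch of such a loop (function and variable names are illustrative, not the workflow's actual registry):

declare -A registeredPids=()    # key: job label, value: PID

waitForRegisteredPids() {
    local label status failed=0
    for label in "${!registeredPids[@]}"; do
        wait "${registeredPids[$label]}"
        status=$?
        if [[ $status -ne 0 ]]; then
            echo "Background job '$label' (PID ${registeredPids[$label]}) failed with exit code $status" >&2
            failed=1
        fi
    done
    return $failed
}

# Usage: register background jobs under a label, then wait and fail if any of them failed.
sleep 1 & registeredPids[qcStep]=$!
sleep 1 & registeredPids[alignment]=$!
waitForRegisteredPids || exit 1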

Track number of reads lost by read-trimming

To be able to compare input and output read numbers when trimming is turned on (WGBS!), the number of trimmed reads should be collected from the trimmomatic output. Add the value to the qualitycontrol.json (required for OTP-internal consistency tests similar to T1263).

Example output:

Input Read Pairs: 95530459 Both Surviving: 90669916 (94.91%) Forward Only Surviving: 4023149 (4.21%) Reverse Only Surviving: 604635 (0.63%) Dropped: 232759 (0.24%)

Note that all these numbers have the unit "read pairs".
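A minimal parsing sketch for this line (the log file name and JSON key names are assumptions, not the workflow's actual schema):

# Extract the five pair counts from the trimmomatic summary line.
read -r input both fwdOnly revOnly dropped < <(
    grep '^Input Read Pairs:' trimmomatic.log \
      | sed 's/([^)]*)//g' \
      | awk '{ print $4, $7, $11, $15, $17 }'
)

# Append them to a QC JSON fragment.
cat <<EOF > trimming_qualitycontrol.json
{
  "trimmomatic": {
    "inputReadPairs": $input,
    "bothSurviving": $both,
    "forwardOnlySurviving": $fwdOnly,
    "reverseOnlySurviving": $revOnly,
    "dropped": $dropped
  }
}
EOF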

Number of non-supplementary reads in the BAM

$ samtools view -F 0x800 -c ../../replicate1_libNA_run160429_SN7001355_0065_BC7D03ACXX_41_Hf03_LiHe_Ct_WGBS_S_2_CTTGTA_L007_paired.bam.sorted.bam
181339832

This means that only those read pairs are kept that are long enough (i.e. not dropped) and where both reads are retained.

Catch PIPESTATUS for pipe expressions

The bash variable PIPESTATUS contains an array with the exit statuses of the commands in the most recently executed pipeline. For example:

$ false | true | false | true
$ echo ${PIPESTATUS[@]}
1 0 1 0

Rescue the pipestatus for all pipes in the scripts. Write the status to dedicated files, as is done for the BWA error code.

Note that PIPESTATUS is also available in functions:

$ rescuePipeStatus () {
  declare -a pipstat=("${PIPESTATUS[@]}")  # capture immediately, before another command overwrites it
  local outfile="$1"
  echo "${pipstat[@]}" > "$outfile"
}
$ (false | true | true; rescuePipeStatus pipstat) &
$ wait
$ cat pipstat
1 0 0

Handle incremental addition of new lanes when dealing with multiple sequencing libraries

Sooner or later we will have to remove the per-lane and per-library BAM files and only keep the final merged BAM file. Therefore, whenever a new lane for a specific library is sequenced, the aligned reads for that library have to be extracted from the merged BAM file, the newly aligned lane has to be merged and duplicate-marked against the library BAM file, and the result has to be merged back into the final BAM file.

Better IO checks in Perl scripts

Instead of

while (my $line = <$fh>) {
    doSomethingWithLine($line);
}

the following paradigm should be used

while (!eof($fh)) {
    my $line = readline($fh);
    if (!defined $line) {
        die "IO Error: $!";
    }
    doSomethingWithLine($line);
}

This correctly recognizes and handles IO errors during the processing (in contrast to the previous simple method).

Remove FileStage Hack from LaneFile

Search for this and correct it.

    /**
     * Hack method! Allow to decrease a file stage
     */
    public LaneFile getFSDecreasedCopy() {
        LaneFile lf = new LaneFile(this, this.getExecutionContext());
        lf.fileStageSettings = fileStageSettings.decreaseLevel();
        return lf;
    }

Safer online QC statistics on the SAM output stream for post-mortem analysis of alignment (or other processes)

Motivation: If an alignment job (or any other job) takes exceptionally long, requires exceptionally large memory resources, or shows similar anomalies, it is time-consuming to identify characteristics of the data itself that may cause the anomaly. In OTP the online statistics could be displayed and point the researcher to problems with their sample.

Goal: Compute certain QC statistics "on line" on the output (SAM) stream of the alignment (or VCF, or whatever) and secure them at regular intervals. "On line" here means that the statistics should be written for individual chunks of reads in the stream, e.g. every 10e6 reads, and/or aggregated over the full sample seen up to that moment. The online statistics need to be saved to disk repeatedly during processing and must not be deleted at the end. Currently, all statistics files remain empty while the QC scripts run, and the results are only dumped at the end of processing.

For the alignment and merging steps, the following statistics should be collected from the per-lane SAM stream at regular intervals:

  • Insert-Size Statistics (min, max, Q1, Q2, Q3) (too long fragments)
  • Fraction of read pairs aligned to different chromosomes
  • Soft-Clipping Rate (shorter reads are ambiguous to align)
  • Length distribution of remaining soft-clipped reads (shorter reads are ambiguous to align)
  • Number+Proportion of FF, FR, RR read pairs
  • Number+Proportion of reads aligned with large gaps
  • Number+Proportion of reads aligned with Smith-Waterman secondary alignment step in BWA (slower; XT attribute)
  • Number+Proportion of unaligned reads
  • Distribution parameters of suboptimal hits in BWA (X1 attribute)
  • Distribution parameters of number of best hits (X0 attribute)
  • others? (please add!)

Any statistic that may relate to an exceptionally long runtime or to otherwise failing jobs during alignment or any of its follow-up processing steps is of interest. A sketch of the chunked approach is given below.
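A minimal sketch of the chunked "on line" idea (assumes gawk for the bitwise and() function; the producer and the final consumer are placeholders, not the workflow's actual pipeline): count unmapped and reverse-strand reads per 10e6 records and append them to a stats file at regular intervals, while the SAM stream passes through unchanged to the next tool.

samtools view -h input.bam \
  | tee >(gawk -F'\t' '
        /^@/ { next }
        {
            n++
            if (and($2, 4))  unmapped++
            if (and($2, 16)) reverse++
            if (n % 10000000 == 0) {
                printf "reads=%d unmapped=%d reverse=%d\n", n, unmapped, reverse >> "online_stats.txt"
                fflush("online_stats.txt")
            }
        }
        END { printf "reads=%d unmapped=%d reverse=%d\n", n, unmapped, reverse >> "online_stats.txt" }
    ') \
  | samtools view -b -o passthrough.bam -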

Add statistics on ambiguously classified undirected WGBS reads to the QC-JSON

Currently, reads whose orientation cannot be unambiguously classified by WGBS_read_pair_reconstruction.py are simply ignored and no statistics are reported.

  1. Collect statistics and add them to the QC-JSON
  2. Optionally save the ambiguously classified reads (for debugging and rescue attempts)

Merge without duplication marking

Currently, merging of lanes is only possible together with duplication marking.

Implement plain BAM merging, without duplicate marking, in the WGS and WES workflows.

Relocate library information into the CO or Alignment Plugin

Currently, there is code in the COWorkflows plugin referring to the library; however, the library seems to be used only in the AlignmentAndQCWorkflows plugin (and there only by the WGBS pipeline). Consider relocating all that code to the AQCWF plugin.

WF should check if all lane file pairs are matching

Currently this is not checked, and the workflow will fail, e.g. in this case:

lane1.fastq
lane1.fastq.gz
lane2.fastq

=> This will fail

One could argue that this should not happen, but people tend to do funny things.

Benchmark the tool BIS-Sniper for Methylation- and SNP-Calling

According to Naveed, the tool BIS-Sniper is a very fast tool for methylation calling as well as SNP calling. It could be worthwhile to replace our current methylation caller with BIS-Sniper. Another advantage of this tool is that we get SNP calls for free!

Process data from object storage

Preferentially S3.

  • Stream directly into pipes, e.g. via s3cmd; combine with decompression
  • How to determine the compression of the input data? Add an extra compression-determination step?
  • Write all output data to a target bucket

Problems:

  • Roddy cannot yet accept remote files. (Are they present? Roddy currently checks them on the filesystem, but there is no support for remote file checking in S3.)

We may first have to implement some basic support in Roddy. See TheRoddyWMS/Roddy#331
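For the streaming bullets above, a minimal sketch (bucket and object names are placeholders; `s3cmd get ... -` writes the object to stdout):

# Input side: stream an object from S3 straight into the decompression/alignment pipe.
s3cmd get "s3://my-bucket/run1/sample_R1.fastq.gz" - \
  | gunzip -c \
  | head -n 8        # stand-in for the actual consumer (e.g. the bwa input pipe)

# Output side: upload results to the target bucket once a file is complete.
s3cmd put merged.mdup.bam "s3://my-bucket/results/merged.mdup.bam"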

Check content of CHROM_SIZES_FILE and CHROMOSOME_INDICES against BAM file (maybe also the BWA index?)

In the WGBS workflow, the CHROMOSOME_INDICES variable is not cross-checked against the content of the BAM file. An incorrect CHROM_SIZES_FILE may yield an unspecific message from the filter_readbins.pl script (which also occurs in other situations, e.g. if the input is truncated due to pipe errors).

This is a nuisance every time a new assembly is included.

The contents of CHROMOSOME_INDICES and CHROM_SIZES_FILE must be subsets of what is found in the BAM header. This should be checked early in the job scripts, or even earlier, e.g. in checkExecutability() in the Java code. If there is no BAM file yet, the check should be done against the BWA index directly to prevent mapping against the wrong INDEX. For the BWA index the .ann file can be used.
