
blr's Introduction


NB! For the latest version of this pipeline go to: AfshinLab/BLR

Barcode-Linked Reads Analysis


Usage

1. Setup an analysis folder

Activate your Conda environment.

conda activate blr

Choose a name for the analysis. It will be output_folder in this example. Create the analysis directory.

blr init --reads1=path/to/sample.R1.fastq.gz path/to/output_folder

Note that BLR expects paired-end reads. However, only the path to the R1 file needs to be provided. The R2 file will be found automatically.

To use the other blr commands, make sure your working directory is your newly created analysis folder.

cd path/to/output_folder

Then, you may need to edit the configuration file blr.yaml, in particular to enter the path to your reference genome.

blr set --genome_reference=path/to/GRCh38.fasta

To see what other configurations can be altered, read the documentation in the blr.yaml file.
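
For orientation, a minimal sketch of what the relevant entries in blr.yaml could look like, based only on the settings mentioned in this document (exact key names, defaults and allowed values may differ; check the comments in the generated blr.yaml):

genome_reference: path/to/GRCh38.fasta
variant_caller: deepvariant     # requires the separate DeepVariant install described below (Linux only)
reference_variants: null        # if left unset, variants are called by the pipeline instead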

2. Running an analysis

Change working directory to your analysis folder

cd path/to/output_folder

The pipeline automatically runs all steps.

blr run

For more options, see the documentation.

blr run -h

One-time installation

1. Prerequisite: Conda

If Conda is not yet installed, you can try copy-pasting the following into your terminal. This will download Miniconda, install it into your $HOME folder and enable the bioconda channel.

if [[ $OSTYPE = "linux-gnu" ]]; then 
    wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
elif [[ $OSTYPE = "darwin"* ]]; then 
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh
fi
bash miniconda.sh -b -p $HOME/miniconda
source $HOME/miniconda/etc/profile.d/conda.sh
conda config --add channels bioconda

2. Install

Clone the git repository.

  git clone https://github.com/FrickTobias/BLR.git

Create & activate a new Conda environment, in which all dependencies will be installed.

  conda env create -n blr -f environment.yml
  conda activate blr

Install blr into the environment in "editable install" mode.

  pip install -e .

This will install blr in such a way that you can still modify the source code and get any changes immediately without re-installing.

2.1 Linux users (not macOS)

To enable DeepVariant, install it separately into your environment.

conda activate blr
conda install deepvariant

This will enable the variant_caller: deepvariant option in the analysis config file.

3. Updating

Change working directory to your blr git folder and update.

cd path/to/BLR
git pull

Old version

To run the analysis described in High throughput barcoding method for genome-scale phasing, see the stable branch of this git repository.

That version of BLR Analysis is also available at OMICtools.

blr's People

Contributors

fricktobias, marcelm, pontushojer


blr's Issues

Implement a "blr run" command or similar

It should not be required for the user to explicitly run snakemake (and to know the location of the Snakefile). Similar to the way IgDiscover works, we can have some kind of blr init command that initializes an analysis directory with a configuration file containing sane defaults (see #94). Then, calling blr run from within the directory would run the pipeline.

Compress barcodes.fasta after use.

barcodes.fasta, containing the uncorrected barcodes, is currently left uncompressed after the workflow. In the latest dataset it took up ~31 GB of space. After it has been used for clustering and for tagging the FASTQ files, it should therefore be compressed to save space.
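
A minimal sketch of the extra step, assuming pigz (already part of the environment) is used on the file name from this issue:

pigz barcodes.fasta     # replaces barcodes.fasta with barcodes.fasta.gz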

Integrate WhatsHap for phasing

HapCUT2 was preferred over WhatsHap because it was unclear how to use the barcode tags with WhatsHap. Check whether 10X data is actually supported by WhatsHap and, if so, add it to the pipeline as an alternative to HapCUT2.

Performance improvements

I am collecting here some ideas for how to improve the performance of the pipeline.

  • On UPPMAX, /scratch/ should be used for temporary files. It is faster than writing to the network file system. (See #115)
  • More streaming should be employed. Currently, many huge intermediate FASTQ files are generated, some uncompressed, some compressed. This is slow: when writing them uncompressed, file I/O is likely the bottleneck; when writing them compressed, a lot of CPU hours are spent. Instead, one program can write uncompressed output and send it to the next program via a pipe, which avoids writing the intermediate file to disk entirely (see the sketch after this list).
  • The bc_extract.py script should be rewritten so that it can work in a streaming fashion.
  • It should be possible to combine the last two Cutadapt commands into one.
  • Check whether it is faster to run multiple CD-HITs in parallel (instead of giving each CD-HIT the full number of cores).
  • Check whether a different clustering method (perhaps a custom one) is faster than CD-HIT (see #109)
  • Avoid the intermediate samtools view step (SAM can be piped directly into samtools sort).
  • Use FIFO tricks to run samtools index at the same time as generating the BAM that is being indexed. Again, this avoids I/O.
  • Test other read mappers (BWA-MEM, minimap2) (See #111 and #112)
  • Test alternatives to MarkDuplicates (sambamba markdup, samblaster) (Moved to #113)
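
As a sketch of the piping idea, here is roughly what streaming the mapper output straight into samtools sort could look like (file names and thread counts are placeholders, and the actual Bowtie2 options are omitted):

bowtie2 -x reference -1 trimmed.1.fastq.gz -2 trimmed.2.fastq.gz -p 16 | samtools sort -@ 4 -o mapped.sorted.bam -

No intermediate SAM/BAM is written to disk, and no separate samtools view step is needed.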

Maximum available threads requested from piped rules.

I have noticed a recurring error which I think relates to the piped rules I introduced in our rules/trim.smk file. Since there are three rules connected through the pipe, Snakemake requires a minimum of three threads to run the trimming. The problem is that Snakemake does not automatically adjust this across the three rules: if I set threads: 20 for each of the three rules but provide only three threads, the first rule will take all three and Snakemake raises a WorkflowError requesting five threads (one more for each of the other rules). This has already been noted in an existing Snakemake issue on Bitbucket.

As a workaround I used the same approach as in that issue: cap the number of threads requested by each rule using the available_cpu_count function from snakemake.utils, with some modifications. Currently it looks as follows:

rule trim_r1_handle_pipe:
    ...
    threads: ceil((available_cpu_count()-1)/2)
    ...

rule extract_barcodes_pipe:
    ...
    # only runs one thread
    ...

rule final_trim_pipe:
    ...
    threads: int((available_cpu_count()-1)/2)
    ...

I thought this function would give the actual number of threads given on the command line (snakemake -j), but instead it is based on the total number of available CPUs on the local machine. Thus it requires that you run the pipeline with the full number of CPUs, or it will raise an error such as this (from our local server with 24 CPUs):

WorkflowError:
Job needs threads=24 but only threads=20 are available. This is likely because two jobs are connected via a pipe and have to run simultaneously. Consider providing more resources (e.g. via --cores).

I have tested reading the arguments given on the command line, i.e. taking whatever comes after -j in the sys.argv list, but this seems like a bad solution. Alternatively, I could move from using the Snakemake pipe function on three rules to a single rule that uses regular Unix pipes (i.e. |).
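
A minimal sketch of that single-rule alternative, with hypothetical rule/file names and adapter sequences standing in for the real trim.smk contents:

rule trim_pipe:
    input:
        "reads.1.fastq.gz"
    output:
        "trimmed.1.fastq.gz"
    threads: 20
    shell:
        # one ordinary Unix pipeline; Snakemake sees a single job, so only this
        # rule's thread request has to be satisfied
        "cutadapt -j {threads} -g AGATGTGTATAAGAGACAG -o - {input}"
        " | cutadapt -j 1 -a CTGTCTCTTATACACATCT -o {output} -"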

Tag reads with barcode sequence

Currently reads are tagged with barcode cluster IDs. This comes from when we used CD-HIT and got a cluster index rather than a canonical sequence for the cluster. Using the barcode sequence carries more valuable information and is also more similar to the longranger format (like it or not, this will probably be the basis of many tools we want to apply). We may also want to include a GEM group there.

Add athena-meta pipeline

The meta branch now contains snakemake rules to run athena-meta. To try it out

  • update your environment to install megahit and bwa
  • run the test script (bash tests/run.sh) to generate the trimmed files and barcode cluster file
  • run the metagenomic part:

snakemake -s src/blr/rules/meta.smk --use-conda -np outdir/athena.done

(remove the -n to trigger the actual run)

Everything seems to work fine (on Linux) up until the Flye step of athena-meta, where it crashes with something seemingly related to this issue. Note: running with just one thread does not fix the problem, and running with the athena-meta-flye-docker Docker image does not work either.

I'll try to look into the problem further.

tagfastq: cannot allocate memory

Problem

I'm trying to run a library with 440 M read pairs and I'm getting a traceback due to being unable to allocate memory in the tagfastq step (see the log file at the bottom).

Since Cutadapt works fine with the given number of reads, I'm a bit confused as to what might be giving this error. Are we misusing dnaio, or could this be because we've already allocated a lot of memory to holding the barcode information?

If I remember correctly, Cutadapt threads by assigning a specific core for reading. Could this be related, since the traceback says it cannot allocate memory (rather than that it is running out of it)?

I'm running the pipeline with 20 cores and 128 GB RAM.

Related code snippets

Barcode reading functions

def parse_corrected_barcodes(open_file):
    """
    Parse starcode cluster output and return a dictionary with raw sequences pointing to a
    corrected canonical sequence
    :param open_file: starcode tabular output file.
    :return: dict: raw sequences pointing to a corrected canonical sequence.
    """
    corrected_barcodes = dict()
    for cluster in tqdm(open_file, desc="Clusters processed"):
        canonical_seq, _, cluster_seqs = cluster.strip().split("\t", maxsplit=3)
        corrected_barcodes.update({raw_seq: canonical_seq for raw_seq in cluster_seqs.split(",")})
    return corrected_barcodes
def parse_raw_barcodes(open_file):
    """
    Parse FASTA/FASTQ containing barcode sequences and return a dictionary with entry headers
    pointing to a raw barcodes sequence.
    :param open_file: dnaio object.
    :return: dict: entry headers pointing to a raw barcodes sequence
    """
    raw_barcodes = dict()
    for barcode in tqdm(open_file, desc="Raw barcodes processed"):
        header, _ = barcode.name.split(maxsplit=1)
        raw_barcodes[header] = barcode.sequence
    return raw_barcodes

Loop at traceback

 with dnaio.open(args.input1, file2=args.input2, interleaved=in_interleaved, mode="r") as reader, \
            dnaio.open(args.output1, file2=args.output2, interleaved=out_interleaved, mode="w") as writer:
        for read1, read2 in tqdm(reader, desc="Read pairs processed"):
            # Header parsing
            name_and_pos_r1, read_and_index_r1 = read1.name.split(maxsplit=1)
            name_and_pos_r2, read_and_index_r2 = read2.name.split(maxsplit=1)

            try:
                raw_barcode_seq = raw_barcodes[name_and_pos_r1]
            except KeyError:
                raw_barcode_seq = None
                reads_missing_barcode += 1

            if raw_barcode_seq:
                corr_barcode_seq = corrected_barcodes[raw_barcode_seq]

                raw_barcode_id = f"{args.sequence_tag}:Z:{raw_barcode_seq}"
                corr_barcode_id = f"{args.barcode_tag}:Z:{corr_barcode_seq}"

                # Create new name with barcode information.
                new_name = separator.join([name_and_pos_r1, raw_barcode_id, corr_barcode_id])

                # Save header to read instances
                read1.name = " ".join([new_name, read_and_index_r1])
                read2.name = " ".join([new_name, read_and_index_r2])

            # Write to out
            writer.write(read1, read2)

Log file

SETTINGS FOR: tagfastq
 raw_barcodes: barcodes.fasta
 corrected_barcodes: barcodes.clstr
 input1: trimmed.1.fastq.gz
 input2: trimmed.2.fastq.gz
 output1: trimmed_barcoded.1.fastq.gz
 output2: trimmed_barcoded.2.fastq.gz
 barcode_tag: BX
 sequence_tag: RX
 sep: _
tagfastq - INFO: Starting
Clusters processed: 7774460it [00:21, 362636.59it/s]
Raw barcodes processed: 433169093it [27:54, 258621.92it/s]
tagfastq - INFO: Input detected as paired FASTQ.
tagfastq - INFO: Output detected as paired FASTQ.
Traceback (most recent call last):
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/blr", line 11, in <module>
    load_entry_point('blr', 'console_scripts', 'blr')()
  File "/crex/proj/uppstore2018173/private/scripts/BLR/src/blr/__main__.py", line 45, in main
    module.main(args)
  File "/crex/proj/uppstore2018173/private/scripts/BLR/src/blr/cli/tagfastq.py", line 57, in main
    with dnaio.open(args.input1, file2=args.input2, interleaved=in_interleaved, mode="r") as reader, \
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/dnaio/__init__.py", line 83, in open
    return PairedSequenceReader(file1, file2, fileformat)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/dnaio/__init__.py", line 225, in __init__
    self.reader1 = stack.enter_context(_open_single(file1, fileformat=fileformat))
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/dnaio/__init__.py", line 130, in _open_single
    file = xopen(path, mode + 'b')
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/xopen/__init__.py", line 404, in xopen
    return _open_gz(filename, mode, compresslevel, threads)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/xopen/__init__.py", line 353, in _open_gz
    return open_with_threads()
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/xopen/__init__.py", line 339, in open_with_threads
    return PipedGzipReader(filename, mode, threads=threads)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/xopen/__init__.py", line 218, in __init__
    self.process = Popen(pigz_args, stdout=PIPE, stderr=PIPE)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/subprocess.py", line 1275, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

Leaving out reference_variants configuration crashes

WorkflowError in line 6 of .../Snakefile:
Error validating config file.
ValidationError: None is not of type 'string'

Failed validating 'type' in schema['properties']['reference_variants']:
    OrderedDict([('type', 'string'),
                 ('description',
                  'Path to reference variants, if not provided then '
                  'variant will be called by freebayes'),
                 ('default', None)])

On instance['reference_variants']:
    None

The comment says that FreeBayes will be run if reference_variants is left out, but doing so leads to the above error.

example doesn't complete

Hello,

when I run
bash ../BLR_automation.sh -r testdata_read1.fastq.gz testdata_read2.fastq.gz testdata_analysis

the script only runs until step 3 and doesn't complete. Do you have any idea why this is?
bowtie2 version: 2.3.4.3

Thanks for your help.

Below: command line output and content of 3_map.log

  1. Argparsing & options
    Read 1: testdata_read1.fastq.gz
    Read 2: testdata_read2.fastq.gz
    Output: testdata_analysis

Threads: 1
Starts at step: 1
End after step: 4

Wed Feb 27 16:46:07 CET 2019 ANALYSIS STARTING

  1. Demultiplexing
    Wed Feb 27 16:46:07 CET 2019 1st adaptor removal
    Wed Feb 27 16:46:08 CET 2019 1st adaptor removal done
    Wed Feb 27 16:46:08 CET 2019 Barcode extraction
    Wed Feb 27 16:46:08 CET 2019 Barcode extraction done
    Wed Feb 27 16:46:08 CET 2019 2nd adaptor removal
    Wed Feb 27 16:46:09 CET 2019 2nd adaptor removal done
    Wed Feb 27 16:46:09 CET 2019 3' trimming
    Wed Feb 27 16:46:09 CET 2019 3' trimming done
    Wed Feb 27 16:46:09 CET 2019 Intact reads: 0.973162 %

  2. Clustering
    Wed Feb 27 16:46:09 CET 2019 Barcode fasta generation
    Wed Feb 27 16:46:09 CET 2019 Barcode fasta generation done
    Wed Feb 27 16:46:09 CET 2019 Barcode clustering
    Wed Feb 27 16:46:10 CET 2019 Barcode clustering done

  3. Mapping
    Wed Feb 27 16:46:10 CET 2019 Mapping

The content of 3_map.log file:
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[main_samview] fail to read the header from "-".

Phase all reads into phase blocks

Currently only reads covering heterozygous sites are phased, but numerous LSV callers require reads to be tagged with HP (haplotype) tags. In order to test and evaluate LSV callers we therefore need to phase all reads.

This should be done by phasing the molecules according to the hetSNV sites they cover.

Implement a "blr init" command

While #92 deals with blr run, we should also have a blr init command that initializes an analysis directory.

Considerations:

  • Splitting up initialization and running in this way makes sense because we want to give the user the opportunity to configure the pipeline before running it. The init command can create a nicely commented template configuration in which it should be mostly self-explanatory what the settings do. This is easier to use than requiring the user to provide a configuration file from scratch.

  • BLR will have different types of pipelines (e.g., whole-genome haplotyping, metagenomic analysis). How does the user choose which pipeline to run? We can have either different commands (blr wgh and blr meta) or options (blr init --wgh, blr init --meta).

Include frameshift constructs

We have started testing frameshift constructs for the BLR method. This means that the first handle might be preceded by 0, 1, 2 or 3 nucleotides. We should investigate how this affects the trimming and if we have to edit the trimming somehow to allow for this new setup.

Common summary class

Issue from discussion in #76

I am also wondering whether we should have a standard summary class for these scripts that we import into all of them. It could use something other than the predefined variables used currently. I like using collections.Counter for most of these things; it stores the counts in a dictionary instead of requiring separate variables to be defined.
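
A minimal sketch of what such a shared class might look like (the class and method names here are made up, not existing BLR code):

from collections import Counter


class Summary(Counter):
    """Collect arbitrary named counts and print them as a summary."""

    def print_stats(self, tool_name):
        print(f"SUMMARY FOR: {tool_name}")
        for stat, value in sorted(self.items()):
            print(f" {stat}: {value}")


# usage in a script
summary = Summary()
summary["reads_without_tag"] += 1
summary.print_stats("tagfastq")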

.yml not compatible with macOS

Problem

Trying to install dependencies doesn't seem to work on macOS.

conda env create -n blr -f ~/PycharmProjects/BLR/environment.yml 
Collecting package metadata: done
Solving environment: failed

ResolvePackageNotFound: 
  - tbb==2019.7=hc9558a2_0
  - llvm-openmp==8.0.0=hc9558a2_0
  - readline==7.0=h7b6447c_5
  - libcurl==7.64.1=h20c2e04_0
  - cryptography==2.7=py36h72c5cf5_0
  - bowtie2==2.3.5=py36he860b03_0
  - bcftools==1.9=ha228f0b_4
  - ncurses==6.1=he6710b0_1
  - sqlite==3.28.0=h7b6447c_0
  - curl==7.64.1=hbc83047_0
  - libstdcxx-ng==8.2.0=hdf63c60_1
  - libgcc-ng==8.2.0=hdf63c60_1
  - cutadapt==2.3=py36h14c3975_0
  - libssh2==1.8.2=h1ba5d50_0
  - bzip2==1.0.6=h14c3975_5
  - libffi==3.2.1=hd88cf55_4
  - pyrsistent==0.15.2=py36h516909a_0
  - pysam==0.15.2=py36h4b7d16d_3
  - cffi==1.12.3=py36h8022711_0
  - libdeflate==1.0=h14c3975_1
  - yaml==0.1.7=h14c3975_1001
  - wrapt==1.11.1=py36h516909a_0
  - samtools==1.9=h8571acd_11
  - cd-hit==4.8.1=hdbcaa40_2
  - pyyaml==5.1=py36h14c3975_0
  - psutil==5.6.2=py36h516909a_0
  - zlib==1.2.11=h7b6447c_3
  - xz==5.2.4=h14c3975_4
  - openjdk==8.0.152=h46b5887_1
  - tk==8.6.8=hbc83047_0
  - openssl==1.1.1b=h14c3975_1
  - pigz==2.4=h84994c4_0
  - krb5==1.16.1=h173b8e3_7
  - dnaio==0.3=py36h14c3975_1
  - perl==5.26.2=h14c3975_0
  - htslib==1.9=ha228f0b_7
  - libedit==3.1.20181209=hc058e9b_0
  - python==3.6.8=h0371630_0
  - datrie==0.7.1=py36h14c3975_0

Similar problem

I found a similar situation where they moved dependencies into a pip section. I'm not sure how the .yml file is generated, but could we implement a pip section in it, and would it still work with Linux?

buildmolecules not outputting stats_file

The buildmolecules script has a -s/--stats-file option to output barcode/molecule statistics files. This is currently not the default in the Snakemake pipeline, but it should be.

Allow BWA-MEM as an alternative mapper

I’ve benchmarked mapping 20 million reads against GRCh38 both with Bowtie2 and BWA-MEM.

These are the wall-clock runtimes (each mapper was run three times):

  • BWA-MEM 0.7.17 (from Bioconda): 33m 30s, 33m 23s, 33m 31s
  • Bowtie2 2.3.5 (from Bioconda): 43m 33s, 43m 22s, 43m 28s

CPU time was 492 minutes vs. 687 minutes.

I’d like to add BWA-MEM as an alternative mapper to the Snakefile. We can make it configurable, but given the numbers, BWA-MEM should probably be the default.
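
A sketch of what the BWA-MEM invocation could look like in the Snakefile's shell command (paths and thread counts are placeholders; the reference must be bwa-indexed first), streaming straight into samtools sort:

bwa mem -t 20 GRCh38.fasta trimmed.1.fastq.gz trimmed.2.fastq.gz | samtools sort -@ 4 -o mapped.sorted.bam -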

Common fetch_bc function

Issue from discussion in #76

One thing that I thought about was maybe moving the fetch_tag function into utils. The function is used quite similarly in both filterclusters and buildmolecules and could possibly be useful for other programs. They differ a bit at the moment, but I think you could combine them like this:

def fetch_bc(pysam_read, tag, summary=None, default=None):
    """
    Fetch the barcode from a BAM tag; return `default` if the read isn't tagged.
    """
    try:
        result = pysam_read.get_tag(tag)
    except KeyError:
        result = default
        if summary:
            summary.reads_without_tag += 1
    return result

Evaluate where more streaming can be used

More streaming can perhaps be employed. Currently, some large intermediate files are generated, some uncompressed, some compressed. This is slow: when writing them uncompressed, file I/O is likely the bottleneck; when writing them compressed, a lot of CPU hours are spent. Instead, one program can write uncompressed output and send it to the next program via a pipe, which avoids writing the intermediate file to disk entirely.

Perhaps this can be used for some of the intermediate BAM files. (And the other non-final ones should be marked as temp() as suggested in #115).

Print DAG functionality

Add the ability to print out the DAG flowchart for the Snakefile. We could do something similar to what I did in the DBS-Pro run script.
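
For reference, Snakemake can already emit the DAG in Graphviz dot format, so a wrapper would essentially run something like this (assumes Graphviz is installed; the output file name is just an example):

snakemake --dag | dot -Tpdf > dag.pdf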

SamToFastq fails because of unpaired mates

When running the pipeline (on the test data), during the very last step when SamToFastq is used to convert the final BAM to FASTQ, the following error occurs:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Found 493 unpaired mates

I found a forum post by the author of Picard somewhere (sorry, lost the link) stating that this validation step occurs after the output FASTQ has already been written, which means that the error doesn’t become apparent unless one looks at the log file or checks the exit code.

The problem is that MarkDuplicates is run with REMOVE_DUPLICATES=true, which apparently deletes single reads out of pairs.

This can be “fixed” by adding VALIDATION_STRINGENCY=silent to SamToFastq, but that’s just hiding the problem. Usually, MarkDuplicates is indeed used to only mark duplicates, not to really remove them, in order to avoid losing data. The idea is that one can then (in theory) keep only a single BAM file and be able to reconstruct the full input FASTQ file from it. My preferred course of action would be to follow this best practice, but I don’t know how the created FASTQ and BAM files are actually used later.

Non-diploid calls

When running the current pipeline without a reference VCF file I am getting non-diploid calls. These then cause problems for HapCUT2 when trying to phase them.

However, according to the FreeBayes documentation, it seems like we should only get diploid calls:

"Call variants assuming a diploid sample"

freebayes -f ref.fa aln.bam >var.vcf

Current command used in Snakefile for variant calling

"freebayes"
" -f {config[reference]}.fasta"
" {input.bam} 1> {output.vcf} 2> {log}"

filtercluster stats for output molecules_per_bc is wrong

I just noticed that the statistics file output by the filtercluster script is currently wrong. Running on the test files produces the following output:

(blr) PontusHMBA:BLR pontus.hojer$ cat outdir/cluster_stats/x2.stats.molecules_per_bc
11112212121222111112221221211112211111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

I could not see any immediate errors in the script, so maybe we have to look deeper into it, if it's a file we would like to keep.

MD5 sum comparisons should be removed as they are hard to diagnose

I've noticed some test failures because the MD5 sum of a BAM file didn't match. These problems are hard to diagnose since it is not clear exactly what the difference is. Also, it appears that the MD5 sometimes changes even if nothing was done that should influence it.

filter_clusters Cleanup

filter_clusters was written a long time ago and has a lot of clumsy code which could be significantly improved and simplified to make it more readable.

Test whether Picard SamToFastq should write gzip directly

SamToFastq can write gzipped FASTQ files directly. However, since pigz is currently used in a separate compression step, it might actually be slower to let SamToFastq do the compression.

Alternatively, to avoid the intermediate uncompressed FASTQs, test whether it is possible to let SamToFastq write to two FIFOs and let pigz read from the FIFOs and write to the final output.
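
A rough, untested sketch of the FIFO variant (the Picard option names are the documented SamToFastq ones; the file names are invented):

mkfifo r1.fifo r2.fifo
pigz -c < r1.fifo > final.1.fastq.gz &
pigz -c < r2.fifo > final.2.fastq.gz &
picard SamToFastq I=final.bam FASTQ=r1.fifo SECOND_END_FASTQ=r2.fifo
wait
rm r1.fifo r2.fifo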

Random MD5 sum mismatches

When running the tests, the MD5 sum of the mapped.sorted.tag.mkdup.bcmerge.mol.filt.bam sometimes mismatches. This appears to be random and happens perhaps one out of 10 times.

cluster_rmdup cleanup

cluster_rmdup was written a long time ago and has a lot of clumsy code which could be significantly improved and simplified to make it more readable.

Consider which files to mark as temporary

Snakemake allows marking intermediate files as temporary with temp(). These files are then deleted as soon as they are no longer required by subsequent rules. This makes sense for large files in particular; for example, we may want to use it for the intermediate BAM files, keeping only the final one.
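
For illustration, a rule with a temporary output would look something like this (hypothetical rule and file names):

rule sort_bam:
    input:
        "mapped.bam"
    output:
        temp("mapped.sorted.bam")  # deleted by Snakemake once no remaining rule needs it
    threads: 4
    shell:
        "samtools sort -@ {threads} -o {output} {input}"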

Create a nice summary report

After the pipeline has done all the computations, it should create a report. We can use MultiQC for this. It may be necessary to write our own MultiQC plugin for the custom BLR-specific parts, but we can already use its existing modules for Bowtie2, Cutadapt, MarkDuplicates etc.
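
Generating the report itself would then be a single command over the analysis directory once the tool logs are in place (the directory name is just an example):

multiqc path/to/output_folder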

Dynamic filter

Issue from discussion in #76

Currently the filtering is based on the number of molecules in the DBS cluster and on an assumed average size of 30 kbp. Instead, we could take into account the actual size of the molecules and how much of the genome is covered by the DBS cluster. Then we could set a probability threshold or similar for which clusters to remove/mask.

Consistency in argparse and comments

In line with what @marcelm pointed out in #109, we should also apply these conventions in other scripts whenever applicable. The ones I have found we currently do not follow are:

Argparser

  • help: %(default)s instead of hardcoded strings
  • help: Write BAM rather than .bam
  • Omit type=str in (it is default)
  • Use double dash -- for multi-character option names (not only their long names)

Comments

  • Write abbreviations without colon (e.g. ID:s => IDs)

Simplify how binaries are found

The pipeline currently requires that the paths to Picard and HapCUT2 are set in the configuration file. This should not be necessary. Instead, we should simply expect that the programs are installed and available on the $PATH. Since we recommend that users install BLR with Conda, this will work.

This affects the picard_command and hapcut2 configuration settings. For HapCUT2, we may need a Conda recipe first (#90).

Implement LSV consolidation

Following the testing of the different LSV callers (see #131), we could try implementing an SV consolidation tool to leverage multiple callers. These combine results from multiple callers "to increase sensitivity and reduce false-positive calls", according to an article I found.

These could also be useful in testing the different SV callers.

They mention the tools:
