
blr's Introduction


NB! For the latest version of this pipeline go to: AfshinLab/BLR

Barcode-Linked Reads Analysis


Usage

1. Setup an analysis folder

Activate your Conda environment.

conda activate blr

Choose a name for the analysis. It will be output_folder in this example. Create the analysis directory.

blr init --reads1=path/to/sample.R1.fastq.gz path/to/output_folder

Note that BLR expects paired-end reads. However, only the path to the R1 file needs to be provided. The R2 file will be found automatically.

To use the other blr commands, make sure your working directory is your newly created analysis folder.

cd path/to/output_folder

Then, you may need to edit the configuration file blr.yaml, in particular to enter the path to your reference genome.

blr set --genome_reference=path/to/GRCh38.fasta

To see what other configurations can be altered, read the documentation in the blr.yaml file.
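
For orientation, a minimal sketch of what the relevant entries in blr.yaml could look like, based only on the settings mentioned in this document (exact key names, defaults and allowed values may differ; check the comments in the generated blr.yaml):

genome_reference: path/to/GRCh38.fasta
variant_caller: deepvariant     # requires the separate DeepVariant install described below (Linux only)
reference_variants: null        # if left unset, variants are called by the pipeline instead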

2. Running an analysis

Change working directory to your analysis folder

cd path/to/output_folder

The pipeline automatically runs all steps.

blr run

For more options, see the documentation.

blr run -h

One-time installation

1. Prerequisite: Conda

If Conda is not yet installed, you can try copy-pasting the following into your terminal. This will download Miniconda, install it into your $HOME folder and enable the bioconda channel.

if [[ $OSTYPE = "linux-gnu" ]]; then 
    wget http://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
elif [[ $OSTYPE = "darwin"* ]]; then 
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh
fi
bash miniconda.sh -b -p $HOME/miniconda
source $HOME/miniconda/etc/profile.d/conda.sh
conda config --add channels bioconda

2. Install

Clone the git repository.

  git clone https://github.com/FrickTobias/BLR.git

Create & activate a new Conda environment, in which all dependencies will be installed.

  conda env create -n blr -f environment.yml
  conda activate blr

Install blr into the environment in "editable install" mode.

  pip install -e .

This will install blr in such a way that you can still modify the source code and get any changes immediately without re-installing.

2.1 Linux users (not macOS)

To enable DeepVariant, install it separately into your environment.

conda activate blr
conda install deepvariant

This will enable the variant_caller: deepvariant option in the analysis config file.

3. Updating

Change working directory to your blr git folder and update.

cd path/to/BLR
git pull

Old version

To run the analysis described in High throughput barcoding method for genome-scale phasing, see the stable branch of this git repository.

That version of BLR Analysis is also available at OMICtools.

blr's People

Contributors

fricktobias, marcelm, pontushojer


blr's Issues

Implement a "blr run" command or similar

It should not be required for the user to explicitly run snakemake (and to know the location of the Snakefile). Similar to the way IgDiscover works, we can have some kind of blr init command that initializes an analysis directory with a configuration file containing sane defaults (see #94). Then, calling blr run from within the directory would run the pipeline.

Compress barcodes.fasta after use.

barcodes.fasta, containing the uncorrected barcodes, is currently left uncompressed after the workflow. In the latest dataset it took up ~31 GB of space. After it has been used for clustering and for tagging the FASTQ files, it should therefore be compressed to save space.
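
A minimal sketch of the extra step, assuming pigz (already part of the environment) is used on the file name from this issue:

pigz barcodes.fasta     # replaces barcodes.fasta with barcodes.fasta.gz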

Integrate WhatsHap for phasing

HapCUT2 was preferred over WhatsHap because it was unclear how to use the barcode tags with WhatsHap. Check whether 10X data is actually supported by WhatsHap and, if so, add it to the pipeline as an alternative to HapCUT2.

Performance improvements

I am collecting here some ideas for how to improve the performance of the pipeline.

  • On UPPMAX, /scratch/ should be used for temporary files. It is faster than writing to the network file system. (See #115)
  • More streaming should be employed. Currently, many huge intermediate FASTQ files are generated, some uncompressed, some compressed. This is slow: when writing them uncompressed, file I/O is likely the bottleneck; when writing them compressed, a lot of CPU hours are spent. Instead, one program can write uncompressed output and send it to the next program via a pipe, which avoids writing the intermediate file to disk entirely (see the sketch after this list).
  • The bc_extract.py script should be rewritten so that it can work in a streaming fashion.
  • It should be possible to combine the last two Cutadapt commands into one.
  • Check whether it is faster to run multiple CD-HITs in parallel (instead of giving each CD-HIT the full number of cores).
  • Check whether a different clustering method (perhaps a custom one) is faster than CD-HIT (see #109)
  • Avoid the intermediate samtools view step (SAM can be piped directly into samtools sort).
  • Use FIFO tricks to run samtools index at the same time as generating the BAM that is being indexed. Again, this avoids I/O.
  • Test other read mappers (BWA-MEM, minimap2) (See #111 and #112)
  • Test alternatives to MarkDuplicates (sambamba markdup, samblaster) (Moved to #113)
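
As a sketch of the piping idea, here is roughly what streaming the mapper output straight into samtools sort could look like (file names and thread counts are placeholders, and the actual Bowtie2 options are omitted):

bowtie2 -x reference -1 trimmed.1.fastq.gz -2 trimmed.2.fastq.gz -p 16 | samtools sort -@ 4 -o mapped.sorted.bam -

No intermediate SAM/BAM is written to disk, and no separate samtools view step is needed.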

Maximum available threads requested from piped rules.

I have noticed a recurring error which I think relates to the piped rules I introduced in our rules/trim.smk file. Since there are three rules connected through the pipe, Snakemake requires a minimum of three threads to run the trimming. The problem is that Snakemake does not automatically adjust this across the three rules: if I set threads: 20 for each of the three rules but provide only three threads, the first rule will take all three and Snakemake raises a WorkflowError requesting five threads (one more for each of the other rules). This has already been noted in an existing Snakemake issue on Bitbucket.

As a workaround I used the same approach as in that issue: cap the number of threads requested by each rule using the available_cpu_count function from snakemake.utils, with some modifications. Currently it looks as follows:

rule trim_r1_handle_pipe:
    ...
    threads: ceil((available_cpu_count()-1)/2)
    ...

rule extract_barcodes_pipe:
    ...
    # only runs one thread
    ...

rule final_trim_pipe:
    ...
    threads: int((available_cpu_count()-1)/2)
    ...

I thought this function would give the actual number of threads given on the command line (snakemake -j), but instead it is based on the total number of available CPUs on the local machine. Thus it requires that you run the pipeline with the full number of CPUs, or it will raise an error such as this (from our local server with 24 CPUs):

WorkflowError:
Job needs threads=24 but only threads=20 are available. This is likely because two jobs are connected via a pipe and have to run simultaneously. Consider providing more resources (e.g. via --cores).

I have tested reading the arguments given on the command line, i.e. taking whatever comes after -j in the sys.argv list, but this seems like a bad solution. Alternatively, I could move from using the Snakemake pipe function on three rules to a single rule that uses regular Unix pipes (i.e. |).
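
A minimal sketch of that single-rule alternative, with hypothetical rule/file names and adapter sequences standing in for the real trim.smk contents:

rule trim_pipe:
    input:
        "reads.1.fastq.gz"
    output:
        "trimmed.1.fastq.gz"
    threads: 20
    shell:
        # one ordinary Unix pipeline; Snakemake sees a single job, so only this
        # rule's thread request has to be satisfied
        "cutadapt -j {threads} -g AGATGTGTATAAGAGACAG -o - {input}"
        " | cutadapt -j 1 -a CTGTCTCTTATACACATCT -o {output} -"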

Tag reads with barcode sequence

Currently reads are tagged with barcode cluster IDs. This comes from when we used CD-HIT and got a cluster index rather than a canonical sequence for the cluster. Using the barcode sequence carries more valuable information and is also more similar to the longranger format (like it or not, this will probably be the basis of many tools we want to apply). We may also want to include a GEM group there.

Add athena-meta pipeline

The meta branch now contains snakemake rules to run athena-meta. To try it out

  • update your environment to install megahit and bwa
  • run the test script (bash tests/run.sh) to generate the trimmed files and barcode cluster file
  • run the metagenomic part:

snakemake -s src/blr/rules/meta.smk --use-conda -np outdir/athena.done

(remove the -n to trigger the actual run)

Everything seems to work fine (on Linux) up until the Flye step of athena-meta, where it crashes with something seemingly related to this issue. Note: running with just one thread does not fix the problem, and running with the athena-meta-flye-docker Docker image does not work either.

I'll try to look into the problem further.

tagfastq: cannot allocate memory

Problem

I'm trying to run a library with 440 M read pairs and I'm getting a traceback due to being unable to allocate memory in the tagfastq step (see the log file at the bottom).

Since Cutadapt works fine with the given number of reads, I'm a bit confused as to what might be giving this error. Are we misusing dnaio, or could this be because we've already allocated a lot of memory to holding the barcode information?

If I remember correctly, Cutadapt threads by assigning a specific core for reading. Could this be related, since the traceback says it cannot allocate memory (rather than that it is running out of it)?

I'm running the pipeline with 20 cores and 128 GB RAM.

Related code snippets

Barcode reading functions

def parse_corrected_barcodes(open_file):
    """
    Parse starcode cluster output and return a dictionary with raw sequences pointing to a
    corrected canonical sequence
    :param open_file: starcode tabular output file.
    :return: dict: raw sequences pointing to a corrected canonical sequence.
    """
    corrected_barcodes = dict()
    for cluster in tqdm(open_file, desc="Clusters processed"):
        canonical_seq, _, cluster_seqs = cluster.strip().split("\t", maxsplit=3)
        corrected_barcodes.update({raw_seq: canonical_seq for raw_seq in cluster_seqs.split(",")})
    return corrected_barcodes
def parse_raw_barcodes(open_file):
    """
    Parse FASTA/FASTQ containing barcode sequences and return a dictionary with entry headers
    pointing to a raw barcodes sequence.
    :param open_file: dnaio object.
    :return: dict: entry headers pointing to a raw barcodes sequence
    """
    raw_barcodes = dict()
    for barcode in tqdm(open_file, desc="Raw barcodes processed"):
        header, _ = barcode.name.split(maxsplit=1)
        raw_barcodes[header] = barcode.sequence
    return raw_barcodes

Loop at traceback

 with dnaio.open(args.input1, file2=args.input2, interleaved=in_interleaved, mode="r") as reader, \
            dnaio.open(args.output1, file2=args.output2, interleaved=out_interleaved, mode="w") as writer:
        for read1, read2 in tqdm(reader, desc="Read pairs processed"):
            # Header parsing
            name_and_pos_r1, read_and_index_r1 = read1.name.split(maxsplit=1)
            name_and_pos_r2, read_and_index_r2 = read2.name.split(maxsplit=1)

            try:
                raw_barcode_seq = raw_barcodes[name_and_pos_r1]
            except KeyError:
                raw_barcode_seq = None
                reads_missing_barcode += 1

            if raw_barcode_seq:
                corr_barcode_seq = corrected_barcodes[raw_barcode_seq]

                raw_barcode_id = f"{args.sequence_tag}:Z:{raw_barcode_seq}"
                corr_barcode_id = f"{args.barcode_tag}:Z:{corr_barcode_seq}"

                # Create new name with barcode information.
                new_name = separator.join([name_and_pos_r1, raw_barcode_id, corr_barcode_id])

                # Save header to read instances
                read1.name = " ".join([new_name, read_and_index_r1])
                read2.name = " ".join([new_name, read_and_index_r2])

            # Write to out
            writer.write(read1, read2)

Log file

SETTINGS FOR: tagfastq
 raw_barcodes: barcodes.fasta
 corrected_barcodes: barcodes.clstr
 input1: trimmed.1.fastq.gz
 input2: trimmed.2.fastq.gz
 output1: trimmed_barcoded.1.fastq.gz
 output2: trimmed_barcoded.2.fastq.gz
 barcode_tag: BX
 sequence_tag: RX
 sep: _
tagfastq - INFO: Starting
Clusters processed: 7774460it [00:21, 362636.59it/s]
Raw barcodes processed: 433169093it [27:54, 258621.92it/s]
tagfastq - INFO: Input detected as paired FASTQ.
tagfastq - INFO: Output detected as paired FASTQ.
Traceback (most recent call last):
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/bin/blr", line 11, in <module>
    load_entry_point('blr', 'console_scripts', 'blr')()
  File "/crex/proj/uppstore2018173/private/scripts/BLR/src/blr/__main__.py", line 45, in main
    module.main(args)
  File "/crex/proj/uppstore2018173/private/scripts/BLR/src/blr/cli/tagfastq.py", line 57, in main
    with dnaio.open(args.input1, file2=args.input2, interleaved=in_interleaved, mode="r") as reader, \
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/dnaio/__init__.py", line 83, in open
    return PairedSequenceReader(file1, file2, fileformat)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/dnaio/__init__.py", line 225, in __init__
    self.reader1 = stack.enter_context(_open_single(file1, fileformat=fileformat))
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/dnaio/__init__.py", line 130, in _open_single
    file = xopen(path, mode + 'b')
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/xopen/__init__.py", line 404, in xopen
    return _open_gz(filename, mode, compresslevel, threads)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/xopen/__init__.py", line 353, in _open_gz
    return open_with_threads()
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/xopen/__init__.py", line 339, in open_with_threads
    return PipedGzipReader(filename, mode, threads=threads)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/site-packages/xopen/__init__.py", line 218, in __init__
    self.process = Popen(pigz_args, stdout=PIPE, stderr=PIPE)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "/proj/uppstore2018173/private/external/nbis-meta/envs/blr-master/lib/python3.6/subprocess.py", line 1275, in _execute_child
    restore_signals, start_new_session, preexec_fn)
OSError: [Errno 12] Cannot allocate memory

Leaving out reference_variants configuration crashes

WorkflowError in line 6 of .../Snakefile:
Error validating config file.
ValidationError: None is not of type 'string'

Failed validating 'type' in schema['properties']['reference_variants']:
    OrderedDict([('type', 'string'),
                 ('description',
                  'Path to reference variants, if not provided then '
                  'variant will be called by freebayes'),
                 ('default', None)])

On instance['reference_variants']:
    None

The comment says that FreeBayes will be run if reference_variants is left out, but doing so leads to the above error.

example doesn't complete

Hello,

when I run
bash ../BLR_automation.sh -r testdata_read1.fastq.gz testdata_read2.fastq.gz testdata_analysis

the script only runs until step 3 and doesn't complete. Do you have any idea why this is?
bowtie2 version: 2.3.4.3

Thanks for your help.

Below: command line output and content of 3_map.log

  1. Argparsing & options
    Read 1: testdata_read1.fastq.gz
    Read 2: testdata_read2.fastq.gz
    Output: testdata_analysis

Threads: 1
Starts at step: 1
End after step: 4

Wed Feb 27 16:46:07 CET 2019 ANALYSIS STARTING

  1. Demultiplexing
    Wed Feb 27 16:46:07 CET 2019 1st adaptor removal
    Wed Feb 27 16:46:08 CET 2019 1st adaptor removal done
    Wed Feb 27 16:46:08 CET 2019 Barcode extraction
    Wed Feb 27 16:46:08 CET 2019 Barcode extraction done
    Wed Feb 27 16:46:08 CET 2019 2nd adaptor removal
    Wed Feb 27 16:46:09 CET 2019 2nd adaptor removal done
    Wed Feb 27 16:46:09 CET 2019 3' trimming
    Wed Feb 27 16:46:09 CET 2019 3' trimming done
    Wed Feb 27 16:46:09 CET 2019 Intact reads: 0.973162 %

  2. Clustering
    Wed Feb 27 16:46:09 CET 2019 Barcode fasta generation
    Wed Feb 27 16:46:09 CET 2019 Barcode fasta generation done
    Wed Feb 27 16:46:09 CET 2019 Barcode clustering
    Wed Feb 27 16:46:10 CET 2019 Barcode clustering done

  3. Mapping
    Wed Feb 27 16:46:10 CET 2019 Mapping

The content of 3_map.log file:
[bam_header_read] EOF marker is absent. The input is probably truncated.
[bam_header_read] invalid BAM binary header (this is not a BAM file).
[main_samview] fail to read the header from "-".

Phase all reads into phase blocks

Currently only reads covering heterozygous sites are phased, but numerous LSV callers require reads to be tagged with HP (haplotype) tags. In order to test and evaluate LSV callers we therefore need to phase all reads.

This should be done by phasing the molecules according to the hetSNV sites they cover.

Implement a "blr init" command

While #92 deals with blr run, we should also have a blr init command that initializes an analysis directory.

Considerations:

  • Splitting up initialization and running in this way makes sense because we want to give the user the opportunity to configure the pipeline before running it. The init command can create a nicely commented template configuration in which it should be mostly self-explanatory what the settings do. This is easier to use than requiring the user to provide a configuration file from scratch.

  • BLR will have different types of pipelines (e.g., whole-genome haplotyping, metagenomic analysis). How does the user choose which pipeline to run? We can have either different commands (blr wgh and blr meta) or options (blr init --wgh, blr init --meta).

Include frameshift constructs

We have started testing frameshift constructs for the BLR method. This means that the first handle might be preceded by 0, 1, 2 or 3 nucleotides. We should investigate how this affects the trimming and if we have to edit the trimming somehow to allow for this new setup.

Common summary class

Issue from discussion in #76

I am also wondering whether we should have a standard summary class for these scripts that we import into all of them. It could use something other than the predefined variables used currently. I like using collections.Counter for most of these things; it stores the counts in a dictionary instead of requiring separate variables to be defined.
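
A minimal sketch of what such a shared class might look like (the class and method names here are made up, not existing BLR code):

from collections import Counter


class Summary(Counter):
    """Collect arbitrary named counts and print them as a summary."""

    def print_stats(self, tool_name):
        print(f"SUMMARY FOR: {tool_name}")
        for stat, value in sorted(self.items()):
            print(f" {stat}: {value}")


# usage in a script
summary = Summary()
summary["reads_without_tag"] += 1
summary.print_stats("tagfastq")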

.yml not compatible with macOS

Problem

Trying to install dependencies doesn't seem to work on macOS.

conda env create -n blr -f ~/PycharmProjects/BLR/environment.yml 
Collecting package metadata: done
Solving environment: failed

ResolvePackageNotFound: 
  - tbb==2019.7=hc9558a2_0
  - llvm-openmp==8.0.0=hc9558a2_0
  - readline==7.0=h7b6447c_5
  - libcurl==7.64.1=h20c2e04_0
  - cryptography==2.7=py36h72c5cf5_0
  - bowtie2==2.3.5=py36he860b03_0
  - bcftools==1.9=ha228f0b_4
  - ncurses==6.1=he6710b0_1
  - sqlite==3.28.0=h7b6447c_0
  - curl==7.64.1=hbc83047_0
  - libstdcxx-ng==8.2.0=hdf63c60_1
  - libgcc-ng==8.2.0=hdf63c60_1
  - cutadapt==2.3=py36h14c3975_0
  - libssh2==1.8.2=h1ba5d50_0
  - bzip2==1.0.6=h14c3975_5
  - libffi==3.2.1=hd88cf55_4
  - pyrsistent==0.15.2=py36h516909a_0
  - pysam==0.15.2=py36h4b7d16d_3
  - cffi==1.12.3=py36h8022711_0
  - libdeflate==1.0=h14c3975_1
  - yaml==0.1.7=h14c3975_1001
  - wrapt==1.11.1=py36h516909a_0
  - samtools==1.9=h8571acd_11
  - cd-hit==4.8.1=hdbcaa40_2
  - pyyaml==5.1=py36h14c3975_0
  - psutil==5.6.2=py36h516909a_0
  - zlib==1.2.11=h7b6447c_3
  - xz==5.2.4=h14c3975_4
  - openjdk==8.0.152=h46b5887_1
  - tk==8.6.8=hbc83047_0
  - openssl==1.1.1b=h14c3975_1
  - pigz==2.4=h84994c4_0
  - krb5==1.16.1=h173b8e3_7
  - dnaio==0.3=py36h14c3975_1
  - perl==5.26.2=h14c3975_0
  - htslib==1.9=ha228f0b_7
  - libedit==3.1.20181209=hc058e9b_0
  - python==3.6.8=h0371630_0
  - datrie==0.7.1=py36h14c3975_0

Similar problem

I found a similar situation where they moved dependencies into a pip section. I'm not sure how the .yml file is generated, but could we implement a pip section in it, and would it still work with Linux?

buildmolecules not outputting stats_file

The buildmolecules script has a -s/--stats-file option to output barcode/molecule statistics files. This is currently not the default in the Snakemake pipeline, but it should be.

Allow BWA-MEM as an alternative mapper

I’ve benchmarked mapping 20 million reads against GRCh38 both with Bowtie2 and BWA-MEM.

These are the wall-clock runtimes (each mapper was run three times):

  • BWA-MEM 0.7.17 (from Bioconda): 33m 30s, 33m 23s, 33m 31s
  • Bowtie2 2.3.5 (from Bioconda): 43m 33s, 43m 22s, 43m 28s

CPU time was 492 minutes vs. 687 minutes.

I’d like to add BWA-MEM as an alternative mapper to the Snakefile. We can make it configurable, but given the numbers, BWA-MEM should probably be the default.
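
A sketch of what the BWA-MEM invocation could look like in the Snakefile's shell command (paths and thread counts are placeholders; the reference must be bwa-indexed first), streaming straight into samtools sort:

bwa mem -t 20 GRCh38.fasta trimmed.1.fastq.gz trimmed.2.fastq.gz | samtools sort -@ 4 -o mapped.sorted.bam -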

Common fetch_bc function

Issue from discussion in #76

One thing that I thought about was maybe moving the fetch_tag function into utils. The function is used quite similarly in both filterclusters and buildmolecules and could possibly be useful for other programs. They differ a bit at the moment, but I think you could combine them like this:

def fetch_bc(pysam_read, tag, summary=None, default=None):
    """
    Fetch the barcode from a BAM tag; return `default` if the read isn't tagged.
    """
    try:
        result = pysam_read.get_tag(tag)
    except KeyError:
        result = default
        if summary:
            summary.reads_without_tag += 1
    return result

Evaluate where more streaming can be used

More streaming can perhaps be employed. Currently, some large intermediate files are generated, some uncompressed, some compressed. This is slow: when writing them uncompressed, file I/O is likely the bottleneck; when writing them compressed, a lot of CPU hours are spent. Instead, one program can write uncompressed output and send it to the next program via a pipe, which avoids writing the intermediate file to disk entirely.

Perhaps this can be used for some of the intermediate BAM files. (And the other non-final ones should be marked as temp() as suggested in #115).

Print DAG functionality

Add the ability to print out the DAG flowchart for the Snakefile. We could do something similar to what I did in the DBS-Pro run script.
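
For reference, Snakemake can already emit the DAG in Graphviz dot format, so a wrapper would essentially run something like this (assumes Graphviz is installed; the output file name is just an example):

snakemake --dag | dot -Tpdf > dag.pdf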

SamToFastq fails because of unpaired mates

When running the pipeline (on the test data), during the very last step when SamToFastq is used to convert the final BAM to FASTQ, the following error occurs:

Exception in thread "main" htsjdk.samtools.SAMFormatException: SAM validation error: ERROR: Found 493 unpaired mates

I found a forum post by the author of Picard somewhere (sorry, lost the link) stating that this validation step occurs after the output FASTQ has already been written, which means that the error doesn’t become apparent unless one looks at the log file or checks the exit code.

The problem is that MarkDuplicates is run with REMOVE_DUPLICATES=true, which apparently deletes single reads out of pairs.

This can be “fixed” by adding VALIDATION_STRINGENCY=silent to SamToFastq, but that’s just hiding the problem. Usually, MarkDuplicates is indeed used to only mark duplicates, not to really remove them, in order to avoid losing data. The idea is that one can then (in theory) keep only a single BAM file and be able to reconstruct the full input FASTQ file from it. My preferred course of action would be to follow this best practice, but I don’t know how the created FASTQ and BAM files are actually used later.

Non-diploid calls

When running the current pipeline without a reference VCF file I am getting non-diploid calls. These then cause problems for HapCUT2 when trying to phase them.

However, according to the FreeBayes documentation, it seems like we should only get diploid calls:

"Call variants assuming a diploid sample"

freebayes -f ref.fa aln.bam >var.vcf

Current command used in Snakefile for variant calling

"freebayes"
" -f {config[reference]}.fasta"
" {input.bam} 1> {output.vcf} 2> {log}"

filtercluster stats for output molecules_per_bc is wrong

I just noticed that the statistics file output by the filtercluster script is currently wrong. Running on the test files produces the following output:

(blr) PontusHMBA:BLR pontus.hojer$ cat outdir/cluster_stats/x2.stats.molecules_per_bc
11112212121222111112221221211112211111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111

I could not see any immediate errors in the script, so maybe we have to look deeper into it, if it's a file we would like to keep.

MD5 sum comparisons should be removed as they are hard to diagnose

I've noticed some test failures because the MD5 sum of a BAM file didn't match. These problems are hard to diagnose since it is not clear exactly what the difference is. Also, it appears that the MD5 sometimes changes even if nothing was done that should influence it.

filter_clusters Cleanup

filter_clusters was written a long time ago and has a lot of clumsy code which could be significantly improved and simplified to make it more readable.

Test whether Picard SamToFastq should write gzip directly

SamToFastq can write gzipped FASTQ files directly. However, since pigz is currently used in a separate compression step, it might actually be slower to let SamToFastq do the compression.

Alternatively, to avoid the intermediate uncompressed FASTQs, test whether it is possible to let SamToFastq write to two FIFOs and let pigz read from the FIFOs and write to the final output.
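
A rough, untested sketch of the FIFO variant (the Picard option names are the documented SamToFastq ones; the file names are invented):

mkfifo r1.fifo r2.fifo
pigz -c < r1.fifo > final.1.fastq.gz &
pigz -c < r2.fifo > final.2.fastq.gz &
picard SamToFastq I=final.bam FASTQ=r1.fifo SECOND_END_FASTQ=r2.fifo
wait
rm r1.fifo r2.fifo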

Random MD5 sum mismatches

When running the tests, the MD5 sum of the mapped.sorted.tag.mkdup.bcmerge.mol.filt.bam sometimes mismatches. This appears to be random and happens perhaps one out of 10 times.

cluster_rmdup cleanup

cluster_rmdup was written a long time ago and has a lot of clumsy code which could be significantly improved and simplified to make it more readable.

Consider which files to mark as temporary

Snakemake allows marking intermediate files as temporary with temp(). These files are then deleted as soon as they are no longer required by subsequent rules. This makes sense for large files in particular; for example, we may want to use it for the intermediate BAM files, keeping only the final one.
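
For illustration, a rule with a temporary output would look something like this (hypothetical rule and file names):

rule sort_bam:
    input:
        "mapped.bam"
    output:
        temp("mapped.sorted.bam")  # deleted by Snakemake once no remaining rule needs it
    threads: 4
    shell:
        "samtools sort -@ {threads} -o {output} {input}"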

Create a nice summary report

After the pipeline has done all the computations, it should create a report. We can use MultiQC for this. It may be necessary to write our own MultiQC plugin for the custom BLR-specific parts, but we can already use its existing modules for Bowtie2, Cutadapt, MarkDuplicates etc.
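
Generating the report itself would then be a single command over the analysis directory once the tool logs are in place (the directory name is just an example):

multiqc path/to/output_folder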

Dynamic filter

Issue from discussion in #76

Currently the filtering is based on the number of molecules in the DBS cluster and on an assumed average size of 30 kbp. Instead, we could take into account the actual size of the molecules and how much of the genome is covered by the DBS cluster. Then we could set a probability threshold or similar for which clusters to remove/mask.

Consistency in argparse and comments

In line with what @marcelm pointed out in #109, we should also apply these conventions in other scripts whenever applicable. The ones I have found we currently do not follow are:

Argparser

  • help: %(default)s instead of hardcoded strings
  • help: Write BAM rather than .bam
  • Omit type=str in (it is default)
  • Use double dash -- for multi-character option names (not only their long names)

Comments

  • Write abbreviations without colon (e.g. ID:s => IDs)

Simplify how binaries are found

The pipeline currently requires that the paths to Picard and HapCUT2 are set in the configuration file. This should not be necessary. Instead, we should simply expect that the programs are installed and available on the $PATH. Since we recommend that users install BLR with Conda, this will work.

This affects the picard_command and hapcut2 configuration settings. For HapCUT2, we may need a Conda recipe first (#90).

Implement LSV consolidation

Following the testing of the different LSV callers (see #131), we could try implementing an SV consolidation tool to leverage multiple callers. These combine results from multiple callers "to increase sensitivity and reduce false-positive calls", according to an article I found.

These could also be useful in testing the different SV callers.

They mention the tools:
