
mycovista's Introduction

Mycovista - a pipeline to assemble (highly repetitive) bacterial genomes

Installation

In order to run Mycovista, you'll need to install:

Usage

To use Mycovista, edit the config.yaml file:

  • specify the pipeline you want to use: long or hybrid mode
  • insert additional information:
    • path to input short reads
    • path to input long reads
    • path to output folder
    • names of the bacterial strains

Note: The raw read files must contain the name of the strain in the file name.
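The naming requirement in the note can be checked before starting the pipeline. The following is a minimal sketch, not part of Mycovista: the function name `strains_without_reads`, the strain names and the `raw_data` folder used in the example are assumptions for illustration.

```python
from pathlib import Path


def strains_without_reads(read_dir, strains):
    """Return the strains for which no file name in read_dir contains the strain name."""
    names = [p.name for p in Path(read_dir).iterdir()]
    return [s for s in strains if not any(s in name for name in names)]


if __name__ == "__main__":
    # Hypothetical values; use the strain names and read paths from your config.yaml.
    strains = ["strainA", "strainB"]
    read_dir = "raw_data"
    if Path(read_dir).is_dir():
        for strain in strains_without_reads(read_dir, strains):
            print(f"no read file found for strain: {strain}")
```

An empty result means every strain name occurs in at least one file name in the folder.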

A python script helps you to generate all folders required for the output. Just run:

python scripts/create.py

Afterwards, all required data should be linked in the raw_data/ folder. Please check this!
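One way to do this check is to look for links whose targets do not exist. This is a sketch under the assumption that scripts/create.py creates symbolic links inside raw_data/; the helper `broken_links` is hypothetical and not shipped with Mycovista.

```python
import os


def broken_links(folder):
    """Return all symbolic links below folder whose targets do not exist."""
    broken = []
    for root, dirs, files in os.walk(folder):
        for name in dirs + files:
            path = os.path.join(root, name)
            # islink() is true for any symlink; exists() follows it to the target.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


if __name__ == "__main__":
    for path in broken_links("raw_data"):
        print(f"broken link: {path}")
```

If nothing is printed, every link in raw_data/ points to an existing file.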

Now you can get started and assemble your bacterial genomes. Start Mycovista with:

snakemake -s <mode> -c <#threads> --use-conda

  • mode - assembly mode file (long or hybrid)

Tip: You can use -n for a dry run to check if snakemake will start all required rules.
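If you launch the pipeline from a script, the same call can be assembled programmatically. The sketch below only builds the argument list from the flags shown above (-s, -c, --use-conda, -n); the function name `snakemake_command` and the example values are assumptions.

```python
import shlex


def snakemake_command(mode_file, threads, dry_run=False):
    """Build the snakemake call for the given mode file and thread count."""
    cmd = ["snakemake", "-s", mode_file, "-c", str(threads), "--use-conda"]
    if dry_run:
        # -n asks snakemake to list the planned rules without running them.
        cmd.append("-n")
    return cmd


if __name__ == "__main__":
    # Hypothetical example: hybrid mode, 8 threads, dry run first.
    print(shlex.join(snakemake_command("hybrid", 8, dry_run=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the dry run looks right.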

Tools

When using Mycovista, please cite all incorporated tools as without them this pipeline wouldn't exist.

FastQC and NanoPlot are used for quality checks of the reads (raw and preprocessed). Short reads are processed by fastp for adapter clipping and by Trimmomatic for quality trimming. Long reads are filtered by length using Filtlong. Flye assembles the long reads first. The assembly is then polished with long reads by Racon, using minimap2 as the mapper in between. Afterwards, medaka is applied as an additional polishing step with long reads. In hybrid mode, the assembly is postprocessed further with Racon and minimap2 using short reads. The final assembly is annotated by Prokka, and general assembly statistics are calculated by QUAST.
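The order of the steps described above can be summarized as a small data structure. This is only an illustrative sketch, not the actual Snakemake rule graph: the step labels and the function `pipeline_steps` are made up here, and it assumes that the short-read preprocessing (fastp, Trimmomatic) runs only in hybrid mode.

```python
# Steps shared by both modes, in the order given in the description.
LONG_READ_STEPS = [
    "Filtlong (length filtering of long reads)",
    "Flye (long-read assembly)",
    "minimap2 + Racon (polishing with long reads)",
    "medaka (polishing with long reads)",
]
# Assumption: these steps are only needed when short reads are available.
SHORT_READ_PREPROCESSING = [
    "fastp (adapter clipping)",
    "Trimmomatic (quality trimming)",
]
SHORT_READ_POLISHING = ["minimap2 + Racon (polishing with short reads)"]
FINAL_STEPS = ["Prokka (annotation)", "QUAST (assembly statistics)"]


def pipeline_steps(mode):
    """Return the ordered list of steps for 'long' or 'hybrid' mode."""
    if mode == "long":
        return LONG_READ_STEPS + FINAL_STEPS
    if mode == "hybrid":
        return (SHORT_READ_PREPROCESSING + LONG_READ_STEPS
                + SHORT_READ_POLISHING + FINAL_STEPS)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for number, step in enumerate(pipeline_steps("hybrid"), start=1):
        print(number, step)
```

The read quality checks with FastQC and NanoPlot are left out of the list because they accompany both the raw and the preprocessed reads rather than forming a single step.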

Downstream analysis of our M. bovis assembly panel included a pangenome analysis followed by a genome-wide association study (GWAS). The R script for the GWAS is provided in scripts/gwas.R.

Citations
  • fastp

    • Chen S, Zhou Y, Chen Y, Gu J (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17):i884–i890
  • Trimmomatic

    • Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120
  • Filtlong

    • Wick R (2018) Filtlong. Available: https://github.com/rrwick/Filtlong
  • Flye

    • Kolmogorov M, Yuan J, Lin Y, Pevzner PA (2019) Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology 37(5):540–546
  • Racon

    • Vaser R, Sović I, Nagarajan N, Šikić M (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27:737–746
  • minimap2

    • Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094–3100
  • medaka

    • Oxford Nanopore Technologies Ltd. (2018) medaka: Sequence correction provided by ONT Research. Available: https://github.com/nanoporetech/medaka
  • FastQC

    • Andrews S, et al. (2012) FastQC. Babraham Institute. Available: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • NanoPlot

    • De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C (2018) NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34(15):2666–2669
  • QUAST

    • Gurevich A, Saveliev V, Vyahhi N, Tesler G (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8):1072–1075
  • Prokka

    • Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30(14):2068–2069
  • R

    • R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Pipeline Overview

mycovista's People

Contributors

sandratriebel


mycovista's Issues

"assembly best practice" -- additional flye option and SPAdes trusted contigs option

@sandraTriebel see here a nice guide for Nanopore/ hybrid genome assembly:

https://achri.blogspot.com/2019/12/nanopore-bacterial-genome-assemblies.html?m=1

Although the focus there is fast execution using a GPU, the tools and the pipeline are interesting and not so different from what you already implemented.

[1]
flye --nano-raw barcode06.fastq --threads 8 --iterations 2 --plasmids -g 3m --out-dir barcode06
This is the flye command used here. Interesting: --iterations parameter that already seems to do some kind of polishing. Maybe we also want to have this.

[2]
The other really interesting part in my eyes:

Instead of only using the short reads for polishing, they are integrated into the assembly process again, with the long-read-only assembly serving as a real backbone. For this, the author uses SPAdes with the --trusted-contigs option and passes the long-read polished contigs as a trusted set of sequences. They then use Pilon to polish the SPAdes result with the short reads. I think you also tried Pilon at some point?

spades.py -o spades --trusted-contigs medaka/consensus.fasta -1 /path/to/illumina/sample_R1_001.fastq.gz  -2 /path/to/illumina/sample_R2_001.fastq.gz

The question is: do we really need this in our case? Or: how difficult would it be for you to also implement a SPAdes rule that uses the Nanopore assembly with the error-corrected short reads as an input? So that we can compare?

racon via conda

Is it possible to run Racon via conda instead of a Docker container?

Circlator

Integration of a circularization tool as final step of the assembly pipeline? Does M. bovis circularize?

polishing procedure

Please correct me if I am wrong, @sandraTriebel

At the moment we only polish using the short reads? So we do not polish the Flye assembly with the long reads, right?

If so, I think we should do this, according to:
https://achri.blogspot.com/2019/12/nanopore-bacterial-genome-assemblies.html?m=1

flye >> 4x racon >> 1x medaka

If I remember correctly, this is what you already do, just that you use the short reads, right?

Maybe the benefit is not that high, but something like

flye >> 4x racon w/ LR >> 1x medaka w/ LR >> 4x racon w/ SR >> 1x medaka w/ SR

would be the most accurate way. Besides, we could also test what I have written here #10

flye >> 4x racon w/ LR >> 1x medaka w/ LR >> SPAdes w/ trusted contig option

flye --nano-raw barcode06.fastq --threads 8 --iterations 2 --plasmids -g 3m --out-dir barcode06

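# Four rounds of Racon polishing with the long reads.
# Assumption: racon0.fasta is a copy of the Flye assembly and already exists.
for n in 1 2 3 4; do
  prev=$((n - 1))
  # Map the long reads to the result of the previous round.
  minimap2 racon${prev}.fasta ../barcode06.fastq > minimap.paf
  # Polish the previous round with these mappings.
  racon ../barcode06.fastq minimap.paf racon${prev}.fasta -e 0.15 -t 8 -m 8 -x -6 -g -8 -w 500 > racon${n}.fasta
done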

medaka_consensus -i ../barcode06.fastq -d racon4.fasta -o medaka -t 8 
