rnajena / mycovista

Mycovista - M. bovis assembly pipeline

License: GNU General Public License v3.0

Python 58.55% R 41.45%

mycovista's Introduction

`Mycovista` - a pipeline to assemble (highly repetitive) bacterial genomes

Installation

In order to run Mycovista, you'll need to install:

Usage

To use Mycovista, edit the config.yaml file:

- specify the pipeline you want to use: long or hybrid mode
- insert additional information:
  - path to input short reads
  - path to input long reads
  - path to output folder
  - names of the bacterial strains

Note: The raw read files must contain the name of the strain in the file name.
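
The naming rule above can be checked before starting the pipeline. The following is a minimal sketch and not part of Mycovista: the function names, the directory argument, and the example strain names are hypothetical, and it only tests whether a strain name occurs as a substring of a file name.

```python
from pathlib import Path


def find_reads_per_strain(read_dir, strains):
    """Map each strain name to the read files whose file name contains it."""
    files = sorted(p for p in Path(read_dir).iterdir() if p.is_file())
    return {s: [f.name for f in files if s in f.name] for s in strains}


def missing_strains(read_dir, strains):
    """Return the strains for which no matching read file was found."""
    matches = find_reads_per_strain(read_dir, strains)
    return [s for s, hits in matches.items() if not hits]
```

Calling `missing_strains("path/to/reads", ["strainA", "strainB"])` (hypothetical values) returns the strains that have no matching read file; an empty list means every strain was found.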

A python script helps you to generate all folders required for the output. Just run:

python scripts/create.py

Afterwards, all required data should be linked in the raw_data/ folder. Please check this!

Now you can get started and assemble your bacterial genomes. Start Mycovista with:

snakemake -s <mode> -c <#threads> --use-conda

mode - assembly mode file (long or hybrid)

Tip: You can use -n for a dry run to check if snakemake will start all required rules.
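
If you launch the pipeline from a script, the call can be assembled as an argument list so that a dry run and the real run differ only in the `-n` flag. This is a small sketch under assumptions: the helper name is hypothetical, and `"hybrid"` stands in for whatever file you pass as `<mode>`.

```python
import shlex


def snakemake_command(mode_file, threads, dry_run=False):
    """Build the snakemake call from the Usage section as an argument list."""
    cmd = ["snakemake", "-s", mode_file, "-c", str(threads), "--use-conda"]
    if dry_run:
        # -n only lists the rules that would run, nothing is executed
        cmd.append("-n")
    return cmd


# "hybrid" is a placeholder for the <mode> file, 8 for the number of threads
print(shlex.join(snakemake_command("hybrid", 8, dry_run=True)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the dry run looks right.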

Tools

When using Mycovista, please cite all incorporated tools as without them this pipeline wouldn't exist.

FastQC and NanoPlot are used for quality checks of the reads (raw and preprocessed). Short reads are processed with fastp for adapter clipping and Trimmomatic for quality trimming. Long reads are filtered by length using Filtlong. Flye assembles the long reads first. The assembly is then polished with long reads by Racon, using minimap2 as the mapper in between. Afterwards, medaka is applied as an additional polishing step with long reads. In hybrid mode, the assembly is post-processed further with Racon and minimap2 using short reads. The final assembly is annotated by Prokka, and general assembly statistics are calculated by QUAST.
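
The order of the steps described above can be summarized per mode. The sketch below is only an illustration of that description and not the actual Snakemake rule graph: the function name and step labels are made up, and it assumes that short-read preprocessing runs in hybrid mode only.

```python
def pipeline_steps(mode):
    """Return (step, tools) pairs in the order described in the Tools section."""
    if mode not in ("long", "hybrid"):
        raise ValueError("mode must be 'long' or 'hybrid'")
    steps = [
        ("quality check of raw and preprocessed reads", ["FastQC", "NanoPlot"]),
        ("long-read length filtering", ["Filtlong"]),
        ("long-read assembly", ["Flye"]),
        ("long-read polishing", ["minimap2", "Racon"]),
        ("additional long-read polishing", ["medaka"]),
    ]
    if mode == "hybrid":
        # Assumption: short reads are only preprocessed when they are used
        steps.insert(1, ("short-read adapter clipping and quality trimming",
                         ["fastp", "Trimmomatic"]))
        steps.append(("short-read polishing", ["minimap2", "Racon"]))
    steps.append(("annotation", ["Prokka"]))
    steps.append(("assembly statistics", ["QUAST"]))
    return steps


for step, tools in pipeline_steps("hybrid"):
    print(f"{step}: {', '.join(tools)}")
```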

Downstream analysis of our M. bovis assembly panel included a pangenome analysis followed by a genome-wide association study (GWAS). We provide the R script for the GWAS in scripts/gwas.R.

Citations

fastp
- Chen S, Zhou Y, Chen Y, Gu J (2018) fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34(17):i884–i890
Trimmomatic
- Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30(15):2114–2120
Filtlong
- Wick R (2018) Filtlong. Available: https://github.com/rrwick/Filtlong
Flye
- Kolmogorov M, Yuan J, Lin Y, Pevzner PA (2019) Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology 37(5):540–546
Racon
- Vaser R, Sović I, Nagarajan N, Šikić M (2017) Fast and accurate de novo genome assembly from long uncorrected reads. Genome Research 27:737–746
minimap2
- Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34(18):3094–3100
medaka
- Ltd. ONT (2018) medaka: Sequence correction provided by ONT Research. Available: https://github.com/nanoporetech/medaka
FastQC
- Andrews S, et al. (2012) FastQC. Babraham Institute. Available: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
NanoPlot
- De Coster W, D’Hert S, Schultz DT, Cruts M, Van Broeckhoven C (2018) NanoPack: visualizing and processing long-read sequencing data. Bioinformatics 34(15):2666–2669
QUAST
- Gurevich A, Saveliev V, Vyahhi N, Tesler G (2013) QUAST: quality assessment tool for genome assemblies. Bioinformatics 29(8):1072–1075
Prokka
- Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics 30(14):2068–2069
R
- R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Pipeline Overview

mycovista's Issues

qcat with trimming option

@sandraTriebel are you aware that qcat also has a trimming option that we can use?

conda install qcat
conda install future
cat *.fastq | qcat --trim -b demultiplexed

"assembly best practice" -- additional flye option and SPAdes trusted contigs option

@sandraTriebel see here a nice guide for Nanopore/ hybrid genome assembly:

https://achri.blogspot.com/2019/12/nanopore-bacterial-genome-assemblies.html?m=1

Whereas here the focus is fast execution using a GPU, the tools and pipeline are interesting and not so different from what you already implemented.

[1]
flye --nano-raw barcode06.fastq --threads 8 --iterations 2 --plasmids -g 3m --out-dir barcode06
This is the flye command used here. Interesting: --iterations parameter that already seems to do some kind of polishing. Maybe we also want to have this.

[2]
The other really interesting part in my eyes:

Not only are the short reads used for polishing; they are also integrated into the assembly process again, with the long-read-only assembly serving as the backbone. For this, the author uses SPAdes with the --trusted-contigs option and passes the long-read polished contigs as a trusted set of sequences. The SPAdes result is then polished with pilon using the short reads. I think you also tried pilon at some point?

spades.py -o spades --trusted-contigs medaka/consensus.fasta -1 /path/to/illumina/sample_R1_001.fastq.gz  -2 /path/to/illumina/sample_R2_001.fastq.gz

The question is: do we really need this in our case? Or: how difficult would it be for you to also implement a SPAdes rule that uses the Nanopore assembly with the error-corrected short reads as an input? So that we can compare?

Filtering long reads

Filtlong, --min_length 1000 nt
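
To make the effect of the threshold concrete, here is a minimal sketch of a plain length filter. It is not how Filtlong is implemented; it assumes unwrapped four-line FASTQ records, and the function name is hypothetical.

```python
def filter_fastq_by_length(lines, min_length=1000):
    """Yield four-line FASTQ records whose sequence has at least min_length bases."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record[1] is the sequence line of the FASTQ record
            if len(record[1]) >= min_length:
                yield tuple(record)
            record = []
```

Passing an open file handle, for example `filter_fastq_by_length(open("reads.fastq"))` with a hypothetical file name, streams the reads without loading the whole file into memory.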

racon via conda

is it possible to run racon via conda instead of a docker container?

Polish snakefile

automatization
config file

Guppy fresh basecalling

POLCA polishing

New tool:
https://www.biorxiv.org/content/10.1101/2019.12.17.864991v1

tldr;

same accuracy as pilon/racon
... but much faster

Idea: we could add a rule for POLCA and compare the outcome.

If so, I think we should do this, according to:
https://achri.blogspot.com/2019/12/nanopore-bacterial-genome-assemblies.html?m=1

flye >> 4x racon >> 1x medaka

If I remember correctly, this is what you already do, just that you use the short reads, right?

Maybe the benefit is not that high, but something like

flye >> 4x racon w/ LR >> 1x medaka w/ LR >> 4x racon w/ SR >> 1x medaka w/ SR

would be the most accurate way. Besides, we could also test what I have written here #10

flye >> 4x racon w/ LR >> 1x medaka w/ LR >> SPAdes w/ trusted contig option

flye --nano-raw barcode06.fastq --threads 8 --iterations 2 --plasmids -g 3m --out-dir barcode06

for n in 1 2 3 4; do
    minimap2 racon$((n - 1)).fasta ../barcode06.fastq > minimap.paf
    racon ../barcode06.fastq minimap.paf racon$((n - 1)).fasta -e 0.15 -t 8 -m 8 -x -6 -g -8 -w 500 > racon$n.fasta
done

medaka_consensus -i ../barcode06.fastq -d racon4.fasta -o medaka -t 8
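
For a Snakemake rule or a comparison run, the same chain (several racon rounds followed by one medaka round) could be generated programmatically. The sketch below only builds the command strings from the commands quoted above. The function name is hypothetical, and the starting assembly is passed in explicitly rather than read from `racon0.fasta`; the `barcode06/assembly.fasta` path in the example assumes Flye's default output name.

```python
def polishing_commands(reads, assembly, rounds=4, threads=8):
    """Build the shell commands for N racon rounds followed by one medaka round."""
    cmds = []
    current = assembly
    for n in range(1, rounds + 1):
        polished = f"racon{n}.fasta"
        # Map the reads to the current assembly, then polish it with racon
        cmds.append(f"minimap2 {current} {reads} > minimap.paf")
        cmds.append(
            f"racon {reads} minimap.paf {current} "
            f"-e 0.15 -t {threads} -m 8 -x -6 -g -8 -w 500 > {polished}"
        )
        current = polished
    # medaka runs once on the output of the last racon round
    cmds.append(f"medaka_consensus -i {reads} -d {current} -o medaka -t {threads}")
    return cmds


for cmd in polishing_commands("../barcode06.fastq", "barcode06/assembly.fasta"):
    print(cmd)
```

Changing `rounds` makes it easy to compare, for example, one versus four racon rounds before medaka.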