score-assemblies

A Snakemake wrapper for evaluating de novo bacterial isolate genome assemblies, e.g. from Oxford Nanopore (ONT) or Illumina sequencing, using multiple programs. The results are summarized in an HTML report.

The workflow is published as "Snakemake workflows for long-read bacterial genome assembly and evaluation" in GigaByte.

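The following programs are included in the workflow (each is described under Modules below): assess_assembly and assess_homopolymers from pomoxis, BUSCO, dnadiff, NucDiff, QUAST, ideel, and bakta.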

Installation

Clone repository, for example:

git clone https://github.com/pmenzel/score-assemblies.git /opt/software/score-assemblies

Create a new conda environment containing all necessary programs:

conda env create -n score-assemblies --file /opt/software/score-assemblies/env/environment.yaml

and activate the environment:

conda activate score-assemblies

Usage

First, prepare a data folder, which must contain a subfolder assemblies/ holding the assemblies.
Additionally, the subfolders references/ and references-protein/ can contain reference genomes and reference proteins against which the assemblies and predicted proteins will be compared.
For example:

.
├── assemblies
│   ├── example-mtb_flyehq4.fa
│   ├── example-mtb_flyehq4+medaka.fa
│   ├── example-mtb_flyehq.fa
│   ├── example-mtb_flyehq+racon4.fa
│   ├── example-mtb_flyehq+racon4+medaka.fa
│   ├── example-mtb_raven4.fa
│   ├── example-mtb_raven4+medaka.fa
│   ├── example-mtb_raven4+medaka+pilon.fa
│   └── example-mtb_unicycler.fa
├── references
│   └── AL123456.3.fa
└── references-protein
    └── AL123456.3.faa

NB: The assembly and reference FASTA files need to have the .fa extension and protein reference FASTA files need to have the extension .faa.
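The layout and extension rules above can be checked before starting a run. The following is a minimal sketch only: check_data_folder is a hypothetical helper written for illustration and is not part of score-assemblies.

```python
from pathlib import Path

# Hypothetical helper, not part of score-assemblies: it checks the folder
# layout and file extensions described in this README.
# Each entry maps a subfolder name to (expected extension, is it required?).
EXPECTED = {
    "assemblies": (".fa", True),
    "references": (".fa", False),
    "references-protein": (".faa", False),
}


def check_data_folder(root):
    """Return a list of problems found; an empty list means the layout is fine."""
    problems = []
    root = Path(root)
    for name, (ext, required) in EXPECTED.items():
        folder = root / name
        if not folder.is_dir():
            if required:
                problems.append(f"missing required folder: {name}/")
            continue
        files = sorted(p for p in folder.iterdir() if p.is_file())
        wrong = [p.name for p in files if p.suffix != ext]
        for fname in wrong:
            problems.append(f"{name}/{fname}: expected extension {ext}")
        # A required folder must hold at least one file with the right extension.
        if required and len(files) == len(wrong):
            problems.append(f"{name}/ contains no {ext} files")
    return problems
```

Calling check_data_folder(".") from inside the data folder returns an empty list when the structure matches the example tree.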

This is the same folder structure used by ont-assembly-snake, i.e. score-assemblies can be run directly in the same folder.

To run the workflow, e.g. with 20 threads, use this command:

snakemake -s /opt/software/score-assemblies/Snakefile --cores 20 --use-conda

Output files of each program will be written to various folders in score-assemblies-data/.

Modules

If no references are supplied, only ideel and BUSCO are run. Otherwise, score-assemblies will run the following programs on each assembly:

assess_assembly and assess_homopolymers

Each assembly will be compared against each reference genome using the assess_assembly and assess_homopolymers scripts from pomoxis. In addition to the tables and plots generated by these programs, summary plots for each reference genome are written to score-assemblies-data/pomoxis/<reference>_assess_assembly_all_meanQ.pdf.

BUSCO

Set the lineage via the snakemake call:

snakemake -s /opt/software/score-assemblies/Snakefile --cores 20 --config busco_lineage=bacillales

If not set, the default lineage bacteria will be used. Available datasets can be listed with busco --list-datasets.

The numbers of complete, fragmented, and missing BUSCOs per assembly are tabulated in the file score-assemblies-data/busco/all_stats.tsv and are also drawn as a dot plot in score-assemblies-data/busco/busco_stats.pdf.

dnadiff

Each assembly is compared with each reference and the output files will be located in score-assemblies-data/dnadiff/<reference>/<assembly>-dnadiff.report. The values for AvgIdentity (from 1-to-1 alignments) and TotalIndels are extracted from these files and are plotted for each reference in score-assemblies-data/dnadiff/<reference>_dnadiff_stats.pdf.
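As an illustration of the extraction step described above, here is a minimal sketch that pulls AvgIdentity (from the 1-to-1 block) and TotalIndels out of a dnadiff report. It is not the workflow's own code. It assumes the usual whitespace-separated report layout with a reference column followed by a query column and reads the first (reference) column. The function name and the sample values are invented for the example.

```python
def parse_dnadiff_report(text):
    """Return AvgIdentity (1-to-1 block) and TotalIndels from report text.

    Assumption: each data line is 'Name  <reference value>  <query value>'
    and the first value column is the one of interest.
    """
    stats = {}
    block = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("1-to-1", "M-to-M"):
            # Remember which alignment block the following lines belong to.
            block = fields[0]
        elif fields[0] == "AvgIdentity" and block == "1-to-1":
            stats["AvgIdentity"] = float(fields[1])
        elif fields[0] == "TotalIndels":
            stats["TotalIndels"] = int(fields[1])
    return stats


# Made-up excerpt in the assumed layout; the numbers are not real results.
SAMPLE_REPORT = """\
[Alignments]
1-to-1                             2                    2
AvgIdentity                    99.95                99.95

M-to-M                             3                    3
AvgIdentity                    99.90                99.90

[SNPs]
TotalIndels                       42                   42
"""

print(parse_dnadiff_report(SAMPLE_REPORT))
```

Tracking the current block matters because AvgIdentity appears once under 1-to-1 and again under M-to-M, and only the former is wanted here.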

NucDiff

Each assembly is compared with each reference and the output files will be located in the folder score-assemblies-data/nucdiff/<reference>/<assembly>-nucdiff/. The values for Insertions, Deletions, and Substitutions are extracted from the file results/nucdiff_stat.out and are drawn for each reference in score-assemblies-data/nucdiff/<reference>_nucdiff_stats.pdf.

QUAST

One QUAST report is generated for each reference genome, containing the results for all assemblies. The report files are located in score-assemblies-data/quast/<reference>/report.html. The main report file score-assemblies-report.html also links to these individual reports.

ideel

Open reading frames (ORFs) are predicted from each assembly with Prodigal and are searched against the Uniprot sprot database with diamond, retaining the best alignment for each ORF. For each assembly, the distribution of the ratio between the length of the ORF and the length of the matching database sequence is plotted to ideel/ideel_uniprot_histograms.pdf and ideel/ideel_uniprot_boxplots.pdf.

Additionally, diamond alignments are done between the predicted ORFs and the supplied reference proteins and ratios are plotted to score-assemblies-data/ideel/<reference>_ideel_histograms.pdf and score-assemblies-data/ideel/<reference>_ideel_boxplots.pdf.
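The length ratio that ideel plots can be sketched as follows. This is an illustration under stated assumptions, not the workflow's implementation. It assumes one best hit per ORF, given as a tab-separated line whose first two columns are the ORF (query) length and the database sequence (subject) length. That column layout, the function names, and the 5% tolerance are all hypothetical.

```python
def length_ratios(rows):
    """Compute ORF length / database sequence length for each best hit.

    Assumption: every non-empty row is 'query_length<TAB>subject_length'.
    """
    ratios = []
    for row in rows:
        row = row.strip()
        if not row:
            continue
        qlen, slen = (int(x) for x in row.split("\t")[:2])
        ratios.append(qlen / slen)
    return ratios


def fraction_near_full_length(ratios, tolerance=0.05):
    """Share of ORFs whose length is within the tolerance of their best hit."""
    if not ratios:
        return 0.0
    hits = sum(1 for r in ratios if abs(r - 1.0) <= tolerance)
    return hits / len(ratios)


# Invented example: one full-length ORF and one ORF half as long as its hit.
example_rows = ["300\t300", "150\t300"]
ratios = length_ratios(example_rows)
print(ratios, fraction_near_full_length(ratios))
```

A distribution concentrated at 1 indicates that most predicted ORFs are as long as their closest database match, whereas many ratios well below 1 typically point to ORFs truncated by insertion or deletion errors in the assembly.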

bakta

bakta is only run when specified as an extra config argument in the snakemake call:

snakemake -s /opt/software/score-assemblies/Snakefile --cores 20 --use-conda --config bakta=1

The bakta output files are written to the folder score-assemblies-data/bakta/<assembly>/.

NB: It takes a long time to download the bakta database and run bakta on all assemblies.
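The config options shown in this README (busco_lineage and bakta) can be passed together after a single --config flag. The sketch below assembles such a call from Python. build_snakemake_command is a hypothetical helper, and it only uses the flags that already appear in the commands above.

```python
import shlex


def build_snakemake_command(snakefile, cores, config=None, use_conda=True):
    """Assemble the snakemake call as an argument list (hypothetical helper)."""
    cmd = ["snakemake", "-s", snakefile, "--cores", str(cores)]
    if use_conda:
        cmd.append("--use-conda")
    if config:
        # All key=value pairs follow one --config flag.
        cmd.append("--config")
        cmd.extend(f"{key}={value}" for key, value in config.items())
    return cmd


cmd = build_snakemake_command(
    "/opt/software/score-assemblies/Snakefile",
    20,
    config={"busco_lineage": "bacillales", "bakta": 1},
)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run(cmd, check=True) from inside the data folder, which avoids shell quoting issues.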

Summary report

All measurements are summarized in an HTML page, score-assemblies-report.html.

Example report

score-assemblies's Issues

errors in generating report.html

Hello!

Very good job!

However, I ran into two errors when generating the final report.
The first is in lines 95 and 459 of report.Rmd:
list_breaks <- seq(min(values), second_highest_value, length.out = 10)
which fails at runtime with: Error in seq.default(1,data_length) : 'to' must be of length 1
I assume "second_highest_value" is NULL, so I changed sort(unique(values))[n-1] to sort(unique(values))[n] to make the script work, but I am not sure whether it is fine to do so.

The second is that the QUAST report cannot be opened from the final report.html.
Instead, I need to download the QUAST report.html from the quast directory as a renamed file.

Could you please fix these small issues? Thanks!

error with dnadiff shebang

Running the dnadiff rule results in this error in the log:

/usr/bin/env: ‘perl -w’: No such file or directory
/usr/bin/env: use -[v]S to pass options in shebang lines

dnadiff -p {out_dir}/dnadiff/{wildcards.ref}/{wildcards.id}-dnadiff {input.reference} {input.assembly} >{log} 2>&1

Python Regular Expression

Hi, this is a very helpful pipeline for evaluating assemblies. I have trouble understanding your wildcard constraint.
Can you please help?
[^/\\\\] - A set of characters no slash and no backslash (escaped)

But why is it necessary to have two escaped backslashes in a Python regular expression?

Best, Michael
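One likely explanation, sketched under the assumption that the constraint is written as an ordinary (non-raw) Python string literal in the Snakefile: the backslash is unescaped twice, first by Python's string parser and then by the regular expression engine, so four backslashes in the source end up as one literal backslash in the character set.

```python
import re

# The constraint as typed in a normal (non-raw) Python string literal.
pattern_text = "[^/\\\\]"

# Step 1: the string parser turns each "\\" into one backslash, so the
# regex engine receives six characters: [ ^ / \ \ ]
print(len(pattern_text))  # 6

# The raw-string spelling with two backslashes is the same pattern.
same_as_raw = pattern_text == r"[^/\\]"

# Step 2: the regex engine reads "\\" inside the set as one literal
# backslash, so the set excludes "/" and "\".
segment = re.compile(pattern_text + "+")
plain_ok = segment.fullmatch("example-mtb_flyehq4+medaka") is not None
slash_rejected = segment.fullmatch("sub/dir") is None
backslash_rejected = segment.fullmatch("sub\\dir") is None

# With only two backslashes in a normal literal, the regex engine sees
# [^/\] where "\]" escapes the bracket, leaving the set unterminated.
try:
    re.compile("[^/\\]")
    two_backslashes_fail = False
except re.error:
    two_backslashes_fail = True
```

Writing the constraint as a raw string, r"[^/\\]", would express the same pattern with two backslashes instead of four.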

NameError

Hi,
I am getting the following error when running snakemake -k -s score-assemblies/Snakefile --cores 8

NameError in line 69 of /scratch/score-assemblies/Snakefile:
name 'list_nucdiff_stat' is not defined
  File "/scratch/score-assemblies/Snakefile", line 69, in <module>
