vocal's Introduction

VOCAL: Variant Of Concern ALert and prioritization

The goal of VOCAL is to detect emerging SARS-CoV-2 variants from collections of genomes before they are annotated by phylogenetic analysis. It does so by parsing SARS-CoV-2 genomes and detecting amino acid mutations in the spike protein that can be associated with a phenotypic change. The phenotypic changes are annotated according to the knowledge accumulated on previous variants. Owing to the limited size of the genome, convergent evolution is expected to take place, so mutations already characterized in earlier variants are likely to reappear in new ones.

Documentation.

Explore the docs »

Getting Started

⚠️Note: 🔌 Right now, VOCAL has been tested on Linux and macOS systems only 💻

Installation

Clone this repository:

git clone https://github.com/rki-mf1/vocal.git

You can easily install all dependencies with conda:

cd vocal
conda env create -n vocal -f environment.yml
conda activate vocal

Running Vocal

... in three steps.

Step1: Annotate mutations in the Spike protein

python vocal/vocal.py -i test-data/sample-test.fasta -o results/variant_table.tsv

By default, this creates a variant_table.tsv file containing all detected mutations.
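Conceptually, this step compares each query spike protein with the reference and records the differences. The sketch below is not VOCAL's implementation; it only illustrates the idea on two amino-acid strings that are already aligned. The function name `call_substitutions` and the conventions for gaps (`-`) and unknown residues (`X`) are assumptions made for the example.

```python
from __future__ import annotations


def call_substitutions(ref_aligned: str, query_aligned: str) -> list[str]:
    """Return substitutions such as 'D614G' from two gapped, equal-length sequences."""
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have the same length")

    mutations = []
    ref_pos = 0  # 1-based position in the ungapped reference
    for ref_aa, query_aa in zip(ref_aligned, query_aligned):
        if ref_aa != "-":
            ref_pos += 1
        # Ignore insertions, deletions, and unknown residues in this sketch.
        if ref_aa == "-" or query_aa == "-" or query_aa == "X":
            continue
        if ref_aa != query_aa:
            mutations.append(f"{ref_aa}{ref_pos}{query_aa}")
    return mutations
```

Positions are counted on the reference only, so an insertion in the query does not shift the numbering of later substitutions.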

⚠️Note: when VOCAL is run without options, it realigns each query sequence to the Wuhan reference sequence NC_045512 using the pairwise alignment function of the Biopython library.

🐌 Slow? The default alignment in VOCAL uses a Biopython pairwise aligner and can be relatively slow. It is therefore recommended to generate an alignment file of all the sequences before running the VOCAL mutation annotation. The alignment file (in PSL format) can be created with the tool pblat, which can be downloaded separately or simply installed through our provided conda environment.

👀 To speed up the alignment by using precomputed whole-genome alignments of the FASTA file as a PSL file (--PSL option), see the section below; otherwise, continue to Step2.

To generate a PSL file with alignments

Example command to generate a file in PSL format:

pblat test-data/ref.fna test-data/sample-test.fasta -threads=4 results/output.psl

To run VOCAL with a PSL file:

python vocal/vocal.py -i test-data/sample-test.fasta --PSL results/output.psl -o results/variant_table.tsv
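Before passing a PSL file to VOCAL, it can be useful to check that every query sequence aligned over most of its length. The following is a minimal sketch that parses the standard 21-column PSL layout; the helper names `parse_psl_line` and `query_coverage` are made up for this example and are not part of VOCAL or pblat.

```python
PSL_FIELDS = [
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
]
TEXT_FIELDS = {"strand", "qName", "tName"}
LIST_FIELDS = {"blockSizes", "qStarts", "tStarts"}


def parse_psl_line(line):
    """Parse one PSL alignment line into a dict; return None for header lines."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != len(PSL_FIELDS) or not cols[0].isdigit():
        return None  # header, separator, or blank line
    record = {}
    for name, value in zip(PSL_FIELDS, cols):
        if name in LIST_FIELDS:
            # Comma-separated lists end with a trailing comma.
            record[name] = [int(x) for x in value.split(",") if x]
        elif name in TEXT_FIELDS:
            record[name] = value
        else:
            record[name] = int(value)
    return record


def query_coverage(record):
    """Fraction of the query sequence spanned by the alignment."""
    return (record["qEnd"] - record["qStart"]) / record["qSize"]
```

A query with low coverage, or one with several alignment lines, may deserve a closer look before its mutations are interpreted.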

Step2: Annotate mutation phenotypes

python vocal/Mutations2Function.py -i results/variant_table.tsv -a data/table_cov2_mutations_annotation.tsv -o results/variants_with_phenotypes.tsv 

By default, this step creates a consolidated table ("variants_with_phenotypes.tsv") of mutations with their phenotype annotations.

Step3: Detect/Alert emerging variants

Rscript --vanilla "vocal/Script_VOCAL_unified.R" \
-f results/variants_with_phenotypes.tsv \
-o results/ 

To include a metadata file, use the -a option:

Rscript --vanilla "vocal/Script_VOCAL_unified.R" \
-f results/variants_with_phenotypes.tsv \
-a test-data/meta.tsv \
-o results/ 

⚠️Note: the metadata file must contain the following information:

  • ID column (matching the sample ID in the FASTA file)
  • LINEAGE column (e.g., B.1.1.7, BA.1)
  • SAMPLING DATE column (the date on which the sample was collected, in YYYY-mm-dd format)
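A metadata file can be checked against these requirements before running Step3. The sketch below assumes a tab-separated file whose header names are exactly `ID`, `LINEAGE`, and `SAMPLING_DATE`; those header spellings and the function name `validate_metadata` are assumptions for illustration, so adjust them to match the actual file.

```python
import csv
import io
from datetime import datetime

# Assumed header names; adapt to the real metadata file.
REQUIRED_COLUMNS = ("ID", "LINEAGE", "SAMPLING_DATE")


def validate_metadata(handle):
    """Return a list of human-readable problems found in a TSV metadata stream."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if not row["ID"]:
            problems.append(f"line {line_no}: empty ID")
        date_text = row["SAMPLING_DATE"] or ""
        try:
            parsed = datetime.strptime(date_text, "%Y-%m-%d")
            # strptime accepts '2021-3-1'; require zero-padded YYYY-mm-dd.
            if parsed.strftime("%Y-%m-%d") != date_text:
                raise ValueError(date_text)
        except ValueError:
            problems.append(f"line {line_no}: bad SAMPLING_DATE {date_text!r}")
    return problems


example = io.StringIO("ID\tLINEAGE\tSAMPLING_DATE\ns1\tB.1.1.7\t2021-03-01\n")
print(validate_metadata(example))
```

For a real file, pass an open file object instead of the `io.StringIO` example, for instance `open("test-data/meta.tsv", newline="")`.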

Finally, we can generate a report in HTML format at the end of the analysis.

python  vocal/Reporter.py  \
        -s results/vocal-alerts-samples-all.csv \
        -c results/vocal-alerts-clusters-summaries-all.csv \
        -o results/vocal-report.html 
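The commands shown in this section can be chained in a small driver script. The sketch below only assembles the command lines listed above and runs them in order; `build_commands` and `run_pipeline` are hypothetical helper names, and the script assumes it is started from the repository root with the conda environment active. It also assumes that the R script writes its alert files into the chosen output directory under the file names used in the report command.

```python
import subprocess

ANNOTATION_TABLE = "data/table_cov2_mutations_annotation.tsv"


def build_commands(fasta, outdir="results", psl=None, metadata=None):
    """Return the argument lists for the pipeline steps, in execution order."""
    variants = f"{outdir}/variant_table.tsv"
    phenotypes = f"{outdir}/variants_with_phenotypes.tsv"

    step1 = ["python", "vocal/vocal.py", "-i", fasta, "-o", variants]
    if psl:
        step1 += ["--PSL", psl]  # reuse a precomputed pblat alignment

    step2 = ["python", "vocal/Mutations2Function.py",
             "-i", variants, "-a", ANNOTATION_TABLE, "-o", phenotypes]

    step3 = ["Rscript", "--vanilla", "vocal/Script_VOCAL_unified.R",
             "-f", phenotypes]
    if metadata:
        step3 += ["-a", metadata]
    step3 += ["-o", f"{outdir}/"]

    report = ["python", "vocal/Reporter.py",
              "-s", f"{outdir}/vocal-alerts-samples-all.csv",
              "-c", f"{outdir}/vocal-alerts-clusters-summaries-all.csv",
              "-o", f"{outdir}/vocal-report.html"]
    return [step1, step2, step3, report]


def run_pipeline(fasta, **options):
    """Run every step and stop at the first one that fails."""
    for command in build_commands(fasta, **options):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("test-data/sample-test.fasta", metadata="test-data/meta.tsv")` would then reproduce the example run with metadata.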

For more details, please visit the documentation (Explore the docs »).

How to interpret the results

VOCAL outputs an alert level in five different colours, which are grouped into three impact ratings.

| Alert color | Description | Impact |
|---|---|---|
| Pink | Variant is a known VOC/VOI and contains MOC or new mutations. | HIGH |
| Red | Not a VOC/VOI, but contains a high number of MOC or ROI and a new mutation (likely to cause problems or to be newly dangerous). | HIGH |
| Orange | Variant contains a moderate number of mutations, or may be considered either a VUM or a de-escalated variant. | MODERATE |
| Lila | Mostly harmless variant (near-zero number of MOC or ROI mutations). | LOW |
| Grey | No evidence of impact (either no MOC or no ROI). | LOW |
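When post-processing the alerts, the colour-to-impact relationship in the table can be encoded as a simple lookup. This is a minimal sketch: the names `ALERT_IMPACT`, `impact_of`, and `summarise` are invented here, and the exact spelling of the colours in VOCAL's output files is an assumption.

```python
from collections import Counter

# Mapping taken from the table above; colour spellings are assumed.
ALERT_IMPACT = {
    "pink": "HIGH",
    "red": "HIGH",
    "orange": "MODERATE",
    "lila": "LOW",
    "grey": "LOW",
}


def impact_of(colour):
    """Return the impact rating for an alert colour (case-insensitive)."""
    return ALERT_IMPACT[colour.strip().lower()]


def summarise(colours):
    """Count how many alerts fall into each impact rating."""
    return Counter(impact_of(colour) for colour in colours)
```

An unknown colour raises a `KeyError`, which makes unexpected values visible instead of silently counting them as low impact.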

Contact

Did you find a bug? 🐛 Do you have a suggestion, feedback, or a feature request? 👨‍💻 Please visit GitHub Issues.

For business inquiries or professional support requests 🍺 please contact Dr. Hölzer, Martin ([email protected]) or Dr. Richard, Hugues ([email protected]).

Acknowledgments

  • Original Idea: SC2 Evolution Working group at the Robert Koch Institute in Berlin

  • Funding: Supported by the European Centre for Disease Prevention and Control [grant number ECDC GRANT/2021/008 ECD.12222].

vocal's People

Contributors

huguesrichard, matthuska, silenus092

vocal's Issues

Previously circulating VOC and VOI to "De-escalated"

I noticed that this project uses the table of variants that I keep up to date.

Yesterday I made a change to reflect how the WHO now refers to variants that were VOC or VOI but are no longer in circulation.
There are now two additional categories: PVOC (Previously circulating VOC) and PVOI (Previously circulating VOI).

WHO definition: "A designated VOC or VOI which has demonstrated to no longer pose a major added risk to global public health compared to other circulating SARS-CoV-2 variants, can be designated as previously circulating VOCs or VOIs." source: https://www.who.int/activities/tracking-SARS-CoV-2-variants#PageContent_C296_Col00

You can now add PVOC and PVOI to the replacement criteria so that those variants are mapped to "De-escalated".
Or, if you want to continue working as before, you can restore the previous values by replacing PVOC with VOC and PVOI with VOI.

sc2_anno_df["type"].replace([np.nan, "FMV"], "De-escalated", inplace=True)

Regards.
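The two options described in this issue can be sketched in plain Python as follows. This is an illustration only, not code from VOCAL: the function name `normalize_variant_type` and the `keep_previous_as_active` flag are hypothetical. With pandas, the first option would amount to adding "PVOC" and "PVOI" to the list passed to `replace` in the line quoted above.

```python
import math

DE_ESCALATED = "De-escalated"


def normalize_variant_type(value, keep_previous_as_active=False):
    """Map a variant type label from the annotation table to the label used downstream."""
    # Missing values (None or NaN) and "FMV" are de-escalated, as in the quoted line.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return DE_ESCALATED
    if value == "FMV":
        return DE_ESCALATED
    if value in ("PVOC", "PVOI"):
        if keep_previous_as_active:
            return value[1:]  # "PVOC" -> "VOC", "PVOI" -> "VOI"
        return DE_ESCALATED
    return value
```

Keeping the choice behind a single flag makes it easy to switch between the old behaviour and the WHO's newer terminology.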

Special case for sequences with the pango lineage "Unassigned"

Recombinant sequences are currently a bit difficult for the software that we use to assign lineages to our SARS-CoV-2 sequences. In some cases, lots of these sequences are assigned the lineage "Unassigned." For vocal, they look like very worrisome sequences, because the background of them being a known VOC (or recombinant of several VOCs) is not taken into account. Sequences can also be legitimately unassigned though, perhaps when they are very different from all known sequences.

Because of this, we would like the Unassigned sequences to be displayed separately and treated specially compared to all other sequences with proper pango lineages. They shouldn't be removed, but they should probably just be shown in a separate part of the final report.
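The requested behaviour could look roughly like the sketch below, which separates samples with the lineage "Unassigned" from the rest so that they can be reported in their own section. Nothing in the source says this has been implemented; the function name `split_unassigned`, the record layout, and the `lineage` key are assumptions for illustration.

```python
def split_unassigned(samples, lineage_key="lineage"):
    """Split sample records into (assigned, unassigned) without dropping any."""
    assigned, unassigned = [], []
    for sample in samples:
        if sample.get(lineage_key) == "Unassigned":
            unassigned.append(sample)
        else:
            assigned.append(sample)
    return assigned, unassigned


samples = [
    {"id": "s1", "lineage": "BA.1"},
    {"id": "s2", "lineage": "Unassigned"},
    {"id": "s3", "lineage": "B.1.1.7"},
]
assigned, unassigned = split_unassigned(samples)
print(len(assigned), len(unassigned))
```

Both lists keep the original sample order, so the two report sections stay comparable with the input.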
