SARS-CoV-2-HPDA-evolutionary-analysis

@Author: Arnaud N'Guessan

Overview

This repository contains a script for analyzing SARS-CoV-2 evolution in epitopes during the first two waves of the COVID-19 pandemic. The immunological data come from a high protein density array analysis of SARS-CoV-2 epitopes in 15 patients (N'Guessan A. et al., 2022). This script and the related data can be updated manually to integrate data from other waves or other sets of epitopes.

Dependencies

R (version 3.5.2+) packages: "ggplot2", "seqinr", "grid", "RColorBrewer", "randomcoloR", "gplots", "lmPerm", "ggpubr", "gridExtra", "indicspecies", "tidyr", "Cairo", "parallel", "foreach", "doParallel", "infotheo", "VennDiagram", "Biostrings", "session"

The script

a) Inputs:

-->OUTPUT_WORKSPACE: The absolute path of the "SARS-CoV-2-HPDA-evolutionary-analysis/" directory on your system. This directory must contain a sub-directory named "depth_report_NCBI_SRA_amplicon/" that holds the depth-of-coverage file of every sample, with one file per sample. Each depth report is either a .csv file generated by "samtools depth" or a csv file with 3 columns/fields: the sample, the position of the site in the reference genome MN908947_3, and the depth at that site. An example depth report is provided in "SARS-CoV-2-HPDA-evolutionary-analysis/depth_report_NCBI_SRA_amplicon/" to show the expected layout. The "SARS-CoV-2-HPDA-evolutionary-analysis/" directory must also contain the script (high_confidence_epitopes_analysis.r) and the related data (Epitopes_mapped.csv, MN908947_3.fasta, Table_signature_mutations.csv, df_high_confidence_epitope_metrics.rds, df_sars_cov_2_epitopes.rds, df_variants_SRA_amplicon_first_wave.rds and df_variants_SRA_amplicon_second_wave.rds).

-->NB_CPUS: The number of CPUs to use for the analyses that run in parallel (R doParallel).
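
Before launching the analysis, it can help to confirm that the workspace has the layout described above. The following is only a sketch: `check_workspace` is a hypothetical helper that is not part of the repository, and it only tests that the listed files and the depth report sub-directory exist, not that their contents are valid.

```shell
# Hypothetical helper (not shipped with the repository):
# report which required inputs are missing from the workspace.
check_workspace() {
  local ws="${1%/}"   # drop a trailing slash, if any
  local missing=0
  local f
  # Script and data files that the README lists as required.
  for f in \
    high_confidence_epitopes_analysis.r \
    Epitopes_mapped.csv \
    MN908947_3.fasta \
    Table_signature_mutations.csv \
    df_high_confidence_epitope_metrics.rds \
    df_sars_cov_2_epitopes.rds \
    df_variants_SRA_amplicon_first_wave.rds \
    df_variants_SRA_amplicon_second_wave.rds
  do
    if [ ! -f "$ws/$f" ]; then
      echo "missing file: $f" >&2
      missing=1
    fi
  done
  # Sub-directory that holds one depth report per sample.
  if [ ! -d "$ws/depth_report_NCBI_SRA_amplicon" ]; then
    echo "missing directory: depth_report_NCBI_SRA_amplicon/" >&2
    missing=1
  fi
  return "$missing"
}
```

Calling `check_workspace "$OUTPUT_WORKSPACE"` returns 0 when everything is in place and prints one line per missing item otherwise.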

b) Outputs: Various plots showing the evolutionary profile of SARS-CoV-2 epitopes during waves 1 and 2 + comparisons between lineages / variants.

c) Running the script: To run the script from a terminal (command line), you must have R (version 3.5.2+) installed or loaded (slurm module). Then run the command: Rscript high_confidence_epitopes_analysis.r $OUTPUT_WORKSPACE $NB_CPUS
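
The invocation can be wrapped in a small function that checks the two arguments first. This is a sketch under assumptions: `run_analysis` and the `DRY_RUN` variable are hypothetical conveniences, and changing into the workspace before calling Rscript assumes the script sits there, as described under Inputs.

```shell
# Hypothetical wrapper around the documented command; not part of the repository.
run_analysis() {
  local ws="$1" cpus="$2"
  # The README asks for an absolute path.
  case "$ws" in
    /*) ;;
    *) echo "OUTPUT_WORKSPACE must be an absolute path: $ws" >&2; return 1 ;;
  esac
  # NB_CPUS must be a whole number of at least 1.
  case "$cpus" in
    ''|*[!0-9]*) echo "NB_CPUS must be a positive integer: $cpus" >&2; return 1 ;;
  esac
  if [ "$cpus" -lt 1 ]; then
    echo "NB_CPUS must be at least 1" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Assumed convenience flag: print the command instead of running it.
    echo "Rscript high_confidence_epitopes_analysis.r $ws $cpus"
  else
    # Run from the workspace, where the script is expected to live.
    (cd "$ws" && Rscript high_confidence_epitopes_analysis.r "$ws" "$cpus")
  fi
}
```

For example, `run_analysis "/absolute/path/to/SARS-CoV-2-HPDA-evolutionary-analysis/" 4` would start the analysis on 4 CPUs; the path shown is a placeholder.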

d) Updating lineage signature mutation data: To update or edit the lineage signature mutation data, open the file "Table_signature_mutations.csv" in Excel and add the new signature mutation and its lineage as a new entry in the table. Only these 2 columns are mandatory. You can set the other fields/columns to "NA" or leave them empty. Don't forget to save the table as a .csv file. You can also make the edits in your favorite text editor (the new line for a signature mutation X of lineage Z would be: X,Z,NA,NA,NA OR X,Z,"","","" OR X,Z,,,). The signature mutation name must follow the format ORF_name:Old_amino_acidResidue_position_in_ORF_protein_sequenceNew_amino_acid, that is, the ORF name, a colon, the old amino acid, the residue position in the ORF protein sequence, and the new amino acid, with nothing between the last three parts (e.g. ORF8:L84S).
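
The text-editor route can also be scripted. The sketch below uses a hypothetical helper, `add_signature_mutation`, that checks the mutation name against the format above and appends an X,Z,NA,NA,NA line. The regular expression is an assumption: it accepts single-letter amino acids only, so a stop-codon symbol (whose notation is not specified here) would be rejected.

```shell
# Hypothetical helper: append "mutation,lineage,NA,NA,NA" to the table
# after checking the mutation name. Not part of the repository.
add_signature_mutation() {
  local table="$1" mutation="$2" lineage="$3"
  # ORF name, colon, old amino acid, position, new amino acid (e.g. ORF8:L84S).
  if ! printf '%s\n' "$mutation" | grep -Eq '^[A-Za-z0-9]+:[A-Z][0-9]+[A-Z]$'; then
    echo "invalid mutation name: $mutation" >&2
    return 1
  fi
  # If the file does not end with a newline, add one so the new entry
  # starts on its own line.
  if [ -s "$table" ] && [ -n "$(tail -c 1 "$table")" ]; then
    printf '\n' >> "$table"
  fi
  printf '%s,%s,NA,NA,NA\n' "$mutation" "$lineage" >> "$table"
}
```

A call such as `add_signature_mutation Table_signature_mutations.csv "X" "Z"`, with X and Z replaced by a real mutation name and its lineage, adds one entry and leaves the existing lines untouched.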

References

We defined the signature mutations of each variant (see "Table_signature_mutations.csv") as substitutions that are present in >=90% of the sequences assigned to that lineage. We calculated the prevalence of substitutions in thousands of publicly available consensus sequences collected from NCBI during 2020, and added data from CoV-Spectrum for lineages that are under-represented in that database or that emerged during 2021 (Chen et al., 2021). The signature mutation dataset is therefore a mix of mutation prevalence data from our own NCBI consensus sequence database (for the earlier lineages) and GISAID data obtained from CoV-Spectrum (for more recent lineages like Omicron). Thus, multiple PANGO versions are involved (v.2.1.7 for the earliest 2020 lineages and v.3.1.20 for recent variants like Omicron). The signature mutation prevalence dataset is provided here as a json file named "Database_Missense_and_Nonsense_signature_mutations_prevalence_in_SC2_lineages_consensus_sequences_as_of_2021_01_16_plus_VOCs.json".
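
The >=90% rule can be illustrated with a short filter. This is a sketch only: it does not read the json file above, whose structure is not described here. Instead it assumes a hypothetical csv with a header row and the columns lineage, mutation, count_with_mutation, count_total, and the function name `signature_mutations` is also made up for the example.

```shell
# Hypothetical filter: print "mutation,lineage" for every substitution whose
# prevalence in its lineage reaches the threshold (default: 90 percent).
# Assumed input columns: lineage,mutation,count_with_mutation,count_total
signature_mutations() {
  local counts_csv="$1" threshold_pct="${2:-90}"
  awk -F',' -v thr="$threshold_pct" '
    # Skip the header; keep rows where count_with / count_total >= thr percent.
    NR > 1 && $4 > 0 && ($3 * 100) >= (thr * $4) { print $2 "," $1 }
  ' "$counts_csv"
}
```

The comparison is done on whole numbers (count times 100 against threshold times total) so that a prevalence of exactly 90% is kept without floating-point rounding concerns. The output order, mutation then lineage, matches the two mandatory columns of "Table_signature_mutations.csv".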

Chen, C., Nadeau, S., Yared, M., Voinov, P., Ning, X., Roemer, C. & Stadler, T. "CoV-Spectrum: Analysis of globally shared SARS-CoV-2 data to Identify and Characterize New Variants" Bioinformatics (2021); doi: 10.1093/bioinformatics/btab856.
