This build performs a full Nextstrain analysis of Enterovirus A71. You can choose to either run a >=600 base pair VP1 run or a >=6400 base pair whole genome run.
If you are unfamiliar with or haven't installed Nextstrain you can find an introduction and full documentation here.
You can read the master's thesis by Simon Grimm, based on this build, here.
This build could be extended in the future to do several additional things:
- Including additional metadata like patient age, granular spatial data or clinical outcomes.
- Automating updates of the build with the newest available sequences. See Emma Hodcroft's Enterovirus D68 build for some efforts to implement this with a closely related virus.
Data used for this build can be downloaded from viprbrc.org. I've added instructions for how to download sequences manually at the end of this README.
To learn more about Enterovirus A71, I recommend this very well written review article by Solomon et al.
This repo contains the following folders and files:
scripts
contains custom python scripts which are being called from the snakefile
.
snakefile
contains the entire computational pipeline. This file uses the Snakemake workflow management system, which allows elegant, reproducible biocomputational analyses. You can find snakemake's documentation here. If you want to change some part of the analysis or call your own scripts, you need to edit this file.
ev_a71/vp1
contains sequences and config files used for the >=600 bp VP1 run.
ev_a71/whole_genome
contains sequences and config files used for the >=6400 bp whole genome run.
In the folder ev_a71/vp1/config
and ev_a71/whole_genome/config
respectively, you can find configuration files required for running nextstrain:
- coloring scheme
colors.tsv
- geographical locations
geo_regions.tsv
- latitude data
lat_longs.tsv
- dropped strains
dropped_strains.txt
- virus clade assignments
clades_genome.tsv
- reference sequence
reference_sequence.gb
The reference sequence used for this build can be found online. It was sequenced in 1970, is called BrCr, and its accession number is U22521.
To run this repository you need to install the Nextstrain environment. You can find detailed install instructions here.
Before running a build, you need to initialize nextstrain by executing
conda activate nextstrain
Following this you can create a vp1 build and a whole genome build simply by executing
snakemake --cores 1
If you only want one of those builds, you can either create a vp1 build by executing
snakemake ev_a71/vp1/auspice/ev_a71_vp1.json --cores 1
or you can create a whole genome build by executing
snakemake ev_a71/whole_genome/auspice/ev_a71_whole_genome.json --cores 1
If everything worked out, you can now visualize your build using auspice (which is contained within nextstrain).
For the vp1 build do this via
auspice view --datasetDir ev_a71/vp1/auspice
For the whole genome build do this via
auspice view --datasetDir ev_a71/whole_genome/auspice
You might need to run the command export PORT=4001
if you want to run two auspice visualizations simultaneously.
You can download up-to-date sequences. This can be done via viprbrc.org. On the landing page, pick Enterovirus (you should find this under the header "Featured Viruses").
Within the Enterovirus Taxonomy Browser, pick Enterovirus A. On the Genome Search page, click on "Search Criteria". There you can select Enterovirus A71 sequences. As of January 2022, there should be ~13'000 sequences. You do NOT need to specify sequence length, as subsampling by length is included in this build.
Sequences should be downloaded in "Genome FASTA" format. Under Format for FASTA file definition line pick Custom format, adding ALL metadata fields. You can now download the sequences.
Save the resulting file as vipr.fasta
in the folder ev_a71/whole_genome/data
and ev_a71/vp1/data
.
If you have any questions or comments feel free to reach out via github, twitter (@Simon__Grimm) or via simon(dot)grimm(at)unibas(dot)ch.