vadr's Introduction

VADR - Viral Annotation DefineR

Version 1.6.3; December 2023

VADR is a suite of tools for classifying and analyzing sequences homologous to a set of reference models of viral genomes or gene families. It has been mainly tested for analysis of Norovirus, Dengue, and SARS-CoV-2 virus sequences in preparation for submission to the GenBank database.

The VADR v-annotate.pl script is used to classify a sequence, by determining which in a set of reference models it is most similar to, and then annotate that sequence based on that most similar model. Example usage of v-annotate.pl can be found here. Another VADR script, v-build.pl, is used to create the models from NCBI RefSeq sequences or from input multiple sequence alignments, potentially with secondary structure annotation. v-build.pl stores the RefSeq feature annotation in the model, and v-annotate.pl maps that annotation (e.g. CDS coordinates) onto the sequences it annotates.

VADR includes 205 prebuilt models of Flaviviridae and Caliciviridae viral RefSeq genomes, created with a process similar to the one described here. Example usage of v-build.pl can be found here. An advanced tutorial on building VADR models using RSV as an example can be found here. To use v-annotate.pl with viruses other than the default set of 205, see 'Available VADR models'. For instructions on using VADR for SARS-CoV-2 annotation see this page.

v-annotate.pl identifies unexpected or divergent attributes of the sequences it annotates (e.g. invalid or early stop codons in CDS features) and reports them to the user in the form of alerts. A subset of alerts are fatal and cause a sequence to fail. A sequence passes if zero fatal alerts are reported for it. VADR is used by GenBank staff to evaluate incoming sequence submissions of some viruses (currently Norovirus, Dengue virus, and SARS-CoV-2). Submitted Norovirus, Dengue virus and SARS-CoV-2 sequences that pass v-annotate.pl are accepted into GenBank.
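
The pass/fail rule described above can be sketched in a few lines. This is only an illustration: the `Alert` class, its fields, and the example alert codes are assumptions made for this sketch and do not reflect VADR's internal data structures.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    code: str    # short alert code (example values below are illustrative)
    fatal: bool  # True if this alert type causes a sequence to fail


def sequence_passes(alerts):
    """A sequence passes if zero fatal alerts were reported for it."""
    return not any(alert.fatal for alert in alerts)


# Hypothetical alert lists for two sequences; the codes and their
# fatal/non-fatal status are assumed here purely for illustration.
clean = [Alert("example_nonfatal", fatal=False)]
flagged = [Alert("example_nonfatal", fatal=False), Alert("example_fatal", fatal=True)]

print(sequence_passes(clean))
print(sequence_passes(flagged))
```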

The homology search and alignment components of VADR scripts, the most computationally expensive steps, are performed by the Infernal, HMMER, FASTA, Minimap2 and BLAST software packages, which are downloaded and installed as part of the VADR installation.


SARS-CoV-2 annotation using VADR

The v-annotate.pl script includes some special options specifically developed for SARS-CoV-2 annotation that increase speed (-s and --glsearch options) and provide better annotation for sequences with stretches of Ns (-r option). See this page for more information on using VADR to annotate SARS-CoV-2 sequences.


Available VADR models

VADR installation includes a default set of Caliciviridae models, including Norovirus. The installation also includes a set of Flaviviridae models, including Dengue virus. You can download additional pre-built models to validate and annotate other sequences, including SARS-CoV-2, RSV, and cox1 genes. Importantly, to use a set of models other than the default Caliciviridae set, you will need to use either the --mdir and --mkey options, or the -m, -i, -x and possibly -n options as described here.
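
As a rough sketch of the --mdir and --mkey form, the snippet below assembles a v-annotate.pl command line in Python. The option order mirrors a command that appears in one of the issues further down this page; the helper name, the model directory, and the file names are placeholders assumed for illustration.

```python
import shlex


def build_annotate_command(model_dir, model_key, fasta_path, output_dir):
    """Assemble a v-annotate.pl command that uses a non-default model set."""
    return [
        "v-annotate.pl",
        "--mkey", model_key,   # prefix of the model files, e.g. "flavi"
        "--mdir", model_dir,   # directory that holds the model files
        fasta_path,            # input sequence file
        output_dir,            # directory to create for the output
    ]


# Placeholder paths; substitute the locations on your own system.
command = build_annotate_command(
    "/path/to/vadr-models-flavi", "flavi", "sequences.fa", "annotation-out"
)
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)` once VADR is installed and on the `PATH`.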

See this page for a list of all available models and additional information.


VADR documentation


Contributors

  • VADR includes contributions and input from current and former colleagues at NCBI, including:

    Rodney Brister

    Vince Calhoun

    Sergiy Gotvyanskyy

    Eneida Hatcher

    Sophia Hu

    Ilene Karsch-Mizrachi

    Rich McVeigh

    Susan Schafer

    Alejandro Schäffer

    Lara Shonkwiler

    Beverly Underwood

    Yuri Wolf

    Linda Yankie


Reference

  • The recommended citation for using VADR for SARS-CoV-2 analysis: Eric P Nawrocki; Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. NAR Genom Bioinform. 2023 Jan 20;5(1):lqad002. https://doi.org/10.1093/nargab/lqad002

  • The recommended citation for non-SARS-CoV-2 use of VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3


Questions, comments or feature requests? Send a mail to [email protected].

vadr's People

Contributors

nawrockie

vadr's Issues

Docker or singularity image

Given you have no plans for bioconda I was wondering if you have Docker or Singularity containers ready to go?

Consider adding a script to download models

I might have missed it, but it would be convenient if there was an official script from VADR for downloading the models.

For example, this script could keep track of which models are currently available and put them in the appropriate place. I think going forward this would make it easier to maintain the docker and conda recipes.

Force annotation of maturation proteins?

Hi Eric,

I hope this finds you well. I tried running VADR on the SARS-CoV-2 reference genome from NCBI, and against a distantly-related CoV genome that we found:

https://github.com/ababaian/serratus/wiki/Fr4NK

Below please see a screenshot showing the FTR output file for Fr4NK (above), and SARS-CoV-2 (below):

image

I realize that because Fr4NK is so divergent from the SARS-CoV-2 VADR model, it probably doesn't bother calling the maturation proteins. It's important for us to isolate the domains of these maturation proteins, so that we can build trees from our new CoV genomes and determine species & genus boundaries, as explained by the ICTV:

https://talk.ictvonline.org/ictv-reports/ictv_9th_report/positive-sense-rna-viruses-2011/w/posrna_viruses/222/coronaviridae

Is there a way to tell VADR to call the maturation proteins, even if the polyprotein "fails"? Is there a way to get the raw output from the infernal / hmmer3 / blastx calls? Or should I just use the SARS-CoV-2 Pfam models on the polyproteins called by VADR? Thanks in advance for your help!

Double-entries in `.tbl` file

I've noticed that there are often double-entries in the generated tbl file:

<1	1358	misc_feature
			note	similar to nsp3
<1	1358	misc_feature
			note	similar to nsp3
1359	2678	misc_feature
			note	similar to nsp4
1359	2678	misc_feature
			note	similar to nsp4
2679	3391	misc_feature
			note	similar to 3C-like proteinase
2679	3391	misc_feature
			note	similar to 3C-like proteinase

See the attached file (had to append the .txt suffix to attach to the GitHub issue):
SRR8389791.vadr.fail.tbl.txt
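
Until the cause is found, a post-processing workaround could drop a feature block that exactly repeats the block before it. The sketch below assumes tab-delimited feature table lines like the ones shown above; the function name is hypothetical and this is not part of VADR.

```python
def drop_duplicate_features(lines):
    """Remove feature blocks that exactly repeat the preceding block."""
    blocks = []
    for line in lines:
        # A new feature starts on an unindented line with three tab-separated
        # columns: start, stop, feature key.
        starts_feature = (
            bool(line) and not line[0].isspace() and len(line.split("\t")) == 3
        )
        if starts_feature or not blocks:
            blocks.append([line])
        else:
            blocks[-1].append(line)  # qualifier or extra-interval line

    kept = []
    previous = None
    for block in blocks:
        if block != previous:
            kept.extend(block)
        previous = block
    return kept


example = [
    "<1\t1358\tmisc_feature",
    "\t\t\tnote\tsimilar to nsp3",
    "<1\t1358\tmisc_feature",
    "\t\t\tnote\tsimilar to nsp3",
    "1359\t2678\tmisc_feature",
    "\t\t\tnote\tsimilar to nsp4",
]
print("\n".join(drop_duplicate_features(example)))
```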

Please add a license statement

Hi,
the Debian Med team is considering maintaining COVID-19 relevant software inside main Debian. However, each package needs a license, and I have not found one for vadr. Please add a license statement.
Thanks a lot, Andreas.

Feature request: GFF output

When validating an annotation, I find it very convenient to load the annotation in GFF format into a genome browser like JBrowse, to see how the alignment works. Unfortunately I didn't find a file in the VADR output directory that could be used directly for this purpose. I hacked together a script (happy to share if it would be of interest) that generates a GFF file from the *.tbl files, and used it to visualize the VADR annotation of a novel CoV genome distantly related to any existing CoV genomes in RefSeq:

image

So my enhancement request is that VADR generates a GFF version 3.0 file which is ready for uploading into genome browsers. Thanks for your consideration!
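
A conversion along these lines might look like the sketch below. It is not the script mentioned above and not VADR functionality: it assumes tab-delimited five-column feature table input, copies feature keys as-is into the GFF3 type column (a real converter would map them to Sequence Ontology terms), drops the `<` and `>` partial markers, and leaves the phase column as `.` even for CDS rows.

```python
from urllib.parse import quote


def tbl_to_gff3(tbl_text, source="VADR"):
    """Convert a five-column feature table to simple GFF3 text."""
    features = []
    seqid = "unknown"  # used if no ">Feature" line precedes the features
    for line in tbl_text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            seqid = line.split()[1]
            continue
        fields = line.split("\t")
        if fields[0]:
            # Interval line: start, stop, and optionally a feature key.
            interval = (int(fields[0].lstrip("<>")), int(fields[1].lstrip("<>")))
            if len(fields) >= 3 and fields[2]:
                features.append(
                    {"seqid": seqid, "type": fields[2],
                     "intervals": [interval], "quals": []}
                )
            else:
                features[-1]["intervals"].append(interval)
        elif len(fields) >= 5:
            # Qualifier line: three empty columns, then key and value.
            features[-1]["quals"].append((fields[3], fields[4]))

    rows = ["##gff-version 3"]
    for number, feature in enumerate(features, start=1):
        attrs = [f"ID=feature{number}"]
        attrs += [f"{key}={quote(value, safe=' ')}" for key, value in feature["quals"]]
        for start, end in feature["intervals"]:
            strand = "+" if start <= end else "-"
            low, high = min(start, end), max(start, end)
            rows.append("\t".join([
                feature["seqid"], source, feature["type"],
                str(low), str(high), ".", strand, ".", ";".join(attrs),
            ]))
    return "\n".join(rows) + "\n"


sample = ">Feature seq1\n<1\t1358\tmisc_feature\n\t\t\tnote\tsimilar to nsp3\n"
print(tbl_to_gff3(sample), end="")
```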

Documentation: vadr-map-model-coords.pl command in wiki

Hello!

Very excited to try 1.3, being able to run vadr locally has saved me so much time.

I think the wiki page for SARS-CoV-2 annotation has a small typo in Step 5.

The wiki says the pattern to run step 5 is <script> <mmap-path> <alt.list-file> <model>

$VADRSCRIPTSDIR/miniscripts/vadr-map-model-coords.pl <sarscov2-model-dir-path>/sarscov2.mmap <output-directory>/<output-alt-list-file> NC_045512

I think it should be <script> <alt.list-file> <mmap-path> <model>

$VADRSCRIPTSDIR/miniscripts/vadr-map-model-coords.pl <output-directory>/<output-alt-list-file> <sarscov2-model-dir-path>/sarscov2.mmap  NC_045512

Thanks for all your work on the new version,
Jake

Bioconda recipe

Do you have plans to build a bioconda recipe to ease installation and aid reproducibility?

There is a community who may be able to assist with this, but it may require some patching of the code base.

`ERROR in output_feature_table`

Trying to process a novel coronavirus genome using Docker image taltman/vadr:1.3, and I get this error message. Any hints as to what is going wrong?

root@9c6bb46cfb29:~# date; time darth.sh SRR6788790 /output/inputs/SRR6788790.epsy.fa none /root/data /output/outputs 8; echo $?; date
date; time darth.sh SRR6788790 /output/inputs/SRR6788790.epsy.fa none /root/data /output/outputs 8; echo $?; date
Mon Dec 13 18:14:52 UTC 2021
Translate nucleic acid sequences
Reverse and complement a nucleotide sequence
Reverse and complement a nucleotide sequence
# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.3 (Aug 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Mon Dec 13 18:14:58 2021
# $VADRBIOEASELDIR:  /root/vadr/Bio-Easel
# $VADRBLASTDIR:     /root/vadr/ncbi-blast/bin
# $VADREASELDIR:     /root/vadr/infernal/binaries
# $VADRINFERNALDIR:  /root/vadr/infernal/binaries
# $VADRMODELDIR:     /root/vadr/vadr-models-calici
# $VADRSCRIPTSDIR:   /root/vadr/vadr
#
# sequence file:                                                                  /output/outputs/transeq/canonical.fna
# output directory:                                                               /output/outputs/SRR6788790
# force directory overwrite:                                                      yes [-f]
# leaving intermediate files on disk:                                             yes [--keep]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  corona [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /root/data/vadr-models-corona-1.3-2 [--mdir]
# set max allowed memory for cmalign to <n> Mb:                                   64000 [--mxsize]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Validating input                                                                        ... done. [    0.1 seconds]
# Classifying sequences (2 seqs)                                                          ... done. [  122.1 seconds]
# Determining sequence coverage (NC_006577: 1 seq)                                        ... done. [   14.0 seconds]
# Aligning sequences (NC_006577: 1 seq)                                                   ... done. [  102.8 seconds]
# Determining annotation                                                                  ... done. [    0.7 seconds]
# Validating proteins with blastx (NC_006577: 1 seq)                                      ... done. [    0.7 seconds]
# Generating feature table output                                                         ...
ERROR in output_feature_table, unable to split alert_line for output: INDEFINITE_ANNOTATION_END: (mat_peptide:nsp4 (TM2)) S:7081..7081:+; M:10207..10207:+; alignment to homology model has low confidence at 3' boundary for feature that does not match a CDS [0.20<0.60 (mat_peptide feature)]

VADR doesn't detect annotated segments

Hi dear team,

I executed VADR on 119 genome accessions of species from the Flaviviridae family. Out of the 82 for which VADR found reliable annotation, in 5 accessions the polyprotein CDS annotation, which I expected to find in all accessions, wasn't apparent (AY632536, KY523073, MN649260, MT276210, NC_038912). When examining one of these accessions manually (AY632536), I saw that the polyprotein CDS was in fact annotated in it.

Please find below the input on which vadr was executed and the produced output.
vadr.zip

The command I used was:
v-annotate.pl --mkey flavi --mdir $VADRMODELDIR/vadr-models-flavi-1.2-1 unaligned.fasta flaviviridae_genomes_annotation

Additional info:
I am using version 1.4, installed with conda
I am using the vadr flavi models version 1.2-1 (as seen in the command above)

I wonder if in such a case I should complement the vadr annotation with manually annotated data. What would you suggest?

Many thanks!

Error in v-build.pl on example command

Hi dear team,

I tried to apply v-build.pl on your example accession and received an error.

Command: v-build.pl NC_039897 NC_039897

Error: ERROR in vdr_EutilsFetchToFile, problem fetching NC_039897 (undefined)

Additional details:
I used conda to install vadr version 1.4.
Output files: NC_039897.zip

Best,
Keren

cdsstopp can be reported for stop codon that overlaps with replaced region (due to -r)

Because -r causes Ns in some N-rich regions to be replaced with expected nucleotides from the reference model sequence, it is possible for an in-frame stop codon to be introduced that doesn't exist in the original sequence. This will usually happen only if one of the nucleotides in the stop codon is a non-N but if there is a frameshift prior to the stop then three consecutive Ns could be replaced by an in-frame stop.

If an in-frame stop is introduced by N-replacement, a cdsstopp alert can be reported in the blastx protein validation stage. No cdsstopn alerts are reported because those are detected on the non-replaced sequence.

This was unanticipated, and I think it is better if cdsstopp alerts are not reported for early stops introduced in regions that overlap with replaced regions.
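
The proposed filter amounts to an interval-overlap test between the stop codon and the regions where Ns were replaced. A minimal sketch, assuming 1-based inclusive sequence coordinates and hypothetical function names (this is not the code in v-annotate.pl):

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-based inclusive intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def stop_overlaps_replaced_region(stop_start, replaced_regions):
    """True if the 3-nt stop codon starting at stop_start touches any
    region in which Ns were replaced."""
    stop_end = stop_start + 2
    return any(
        overlaps(stop_start, stop_end, region_start, region_end)
        for region_start, region_end in replaced_regions
    )


# Hypothetical replaced region covering positions 100..150.
replaced = [(100, 150)]
print(stop_overlaps_replaced_region(98, replaced))  # codon 98..100 touches position 100
print(stop_overlaps_replaced_region(97, replaced))  # codon 97..99 ends before the region
```

Under this rule, a cdsstopp alert would be suppressed only when the first call returns True for the stop codon that blastx reports.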

Problems in v-test.pl file

Hello,
I have a problem installing vadr. When I run the following command:
$VADRSCRIPTSDIR/testfiles/do-all-tests.sh
the error is the following:
BEGIN failed - compilation aborted at /home/user/Desktop/vadr-install-dir/vadr/v-test.pl line 7.
FAIL: at least one test failed [do-issue-tests.sh]
FAIL: at least one test failed [do-all-tests.sh]
I hope you can help me solve it. Thank you.

VADR produces qualifier with invalid value

VADR produces features with the /exception qualifier, to specify ribosomal slippage:

>Feature NODE_1_length_19663_cov_257.252269
<1      12777   gene
                        gene    ORF1ab
<1      5687    CDS
5687    12777
                        product ORF1ab polyprotein
                        exception       ribosomal slippage
                        codon_start     3
                        protein_id      NODE_1_length_19663_cov_257.252269_1

But according to the INSDC specs, this is an invalid value:

https://www.insdc.org/documents/feature_table.html#7.3

Qualifier       /exception=
Definition      indicates that the coding region cannot be translated using
                standard biological rules
...
                - must not be used for ribosomal slippage, instead use join operator, 
                  e.g.: CDS   join(486..1784,1787..4810)
                              /note="ribosomal slip on tttt sequence at 1784..1787"

This causes problems when trying to submit genomes annotated with VADR to ENA.
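
For reference, the join form that the INSDC text asks for can be produced from the two intervals of the CDS above. The helper below is a hypothetical sketch that handles forward-strand segments only; it does not reflect how VADR writes its output.

```python
def insdc_location(segments):
    """Format forward-strand (start, end) segments as an INSDC location string."""
    parts = [f"{start}..{end}" for start, end in segments]
    if len(parts) == 1:
        return parts[0]
    return "join(" + ",".join(parts) + ")"


# The two intervals of the ORF1ab CDS shown above; "<1" marks a partial 5' end.
print(insdc_location([("<1", 5687), (5687, 12777)]))  # join(<1..5687,5687..12777)
```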

Understanding `misc_feature` with no qualifiers

In the previous issue that I filed about double-entries, I noticed the following at the top of the fail.tbl file:

Feature NODE_1_length_19663_cov_257.252269
<1      12777   gene
                        gene    ORF1ab
<1      5687    misc_feature
5687    12777
                        note    similar to ORF1ab polyprotein
<1      5694    misc_feature
                        note    similar to ORF1a polyprotein

My interpretation of these lines is as follows:

  • There's a gene from <1 to 12,777
  • The region of <1 to 5694 is similar to the ORF1a polyprotein
  • The region of 5687 to 12,777 is similar to the ORF1ab polyprotein

I don't understand how to interpret the misc_feature from <1 to 5687, as it has no note qualifier that might explain what it is.
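
One possible reading: in the five-column feature table format, a line with only start and stop columns adds another interval to the feature above it. On that reading, the `<1 5687` and `5687 12777` lines form a single two-interval misc_feature, and the "similar to ORF1ab polyprotein" note belongs to that whole feature. The sketch below groups the lines that way; it assumes tab-delimited input and uses a hypothetical function name.

```python
def group_features(lines):
    """Group feature table lines into features with intervals and qualifiers."""
    features = []
    for line in lines:
        fields = line.split("\t")
        if fields[0]:
            interval = (fields[0], fields[1])
            if len(fields) > 2 and fields[2]:
                # start, stop, key: a new feature begins here
                features.append(
                    {"key": fields[2], "intervals": [interval], "qualifiers": {}}
                )
            else:
                # start, stop only: another interval of the current feature
                features[-1]["intervals"].append(interval)
        elif len(fields) >= 5:
            features[-1]["qualifiers"][fields[3]] = fields[4]
    return features


table = [
    "<1\t12777\tgene",
    "\t\t\tgene\tORF1ab",
    "<1\t5687\tmisc_feature",
    "5687\t12777",
    "\t\t\tnote\tsimilar to ORF1ab polyprotein",
    "<1\t5694\tmisc_feature",
    "\t\t\tnote\tsimilar to ORF1a polyprotein",
]
for feature in group_features(table):
    print(feature["key"], feature["intervals"], feature["qualifiers"])
```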

ERROR in utl_RunCommand() while executing testfile

Hi,

Installing and testing VADR I got this output. Apparently it has something to do with how I'm executing the binaries.

Thanks for any answer you could give me.

# v-test.pl :: test VADR scripts [TEST SCRIPT]
# VADR 1.1.1 (July 2020)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:    Wed Aug 26 02:41:29 2020
#
# test file:                                                         /home/marlen/Programas/vadr_install_dir/vadr/testfiles/noro.r10.local.testin
# output directory:                                                  vt-n10-local
# forcing directory overwrite:                                       yes [-f]
# if output files listed in testin file already exist, remove them:  yes [--rmout]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Parsing test file                                  ... done. [    0.0 seconds]
##teamcity[testStarted name='annotate-noro-10-local' captureStandardOutput='true']
# Running command  1 [annotate-noro-10-local]        ... /home/marlen/Programas/vadr_install_dir/infernal/binaries/esl-seqstat: 1: /home/marlen/Programas/vadr_install_dir/infernal/binaries/esl-seqstat: Syntax error: "(" unexpected

ERROR in utl_RunCommand(), the following command failed:
/home/marlen/Programas/vadr_install_dir/infernal/binaries/esl-seqstat --dna -a va-noro.r10/va-noro.r10.vadr.in.fa > va-noro.r10/va-noro.r10.vadr.seqstat

done. [    3.0 seconds]
#	checking va-noro.r10/va-noro.r10.vadr.pass.tbl                                                                ... FAIL [output file does not exist]
#	checking va-noro.r10/va-noro.r10.vadr.fail.tbl                                                                ... FAIL [output file does not exist]
#	checking va-noro.r10/va-noro.r10.vadr.sqa                                                                     ... FAIL [output file does not exist]
#	checking va-noro.r10/va-noro.r10.vadr.sqc                                                                     ... FAIL [output file does not exist]
#	checking va-noro.r10/va-noro.r10.vadr.ftr                                                                     ... FAIL [output file does not exist]
#	checking va-noro.r10/va-noro.r10.vadr.sgm                                                                     ... FAIL [output file does not exist]
#	checking va-noro.r10/va-noro.r10.vadr.mdl                                                                     ... FAIL [output file does not exist]
#	checking va-noro.r10/va-noro.r10.vadr.alt                                                                     ... FAIL [output file does not exist]
#	checking va-noro.r10/va-noro.r10.vadr.alc                                                                     ... FAIL [output file does not exist]
##teamcity[testFailed name='annotate-noro-10-local' message='v-test.pl failure']
##teamcity[testFinished name='annotate-noro-10-local']
#
#
# FAIL: 9 of 9 files were not created correctly.
#
# Output printed to screen saved in:                   vt-n10-local.vadr.log
# List of executed commands saved in:                  vt-n10-local.vadr.cmd
# List and description of all output files saved in:   vt-n10-local.vadr.list
#
# All output files created in directory ./vt-n10-local/
#
# Elapsed time:  00:00:03.02
#                hh:mm:ss
# 
[FAIL]

Regards,

Exec format error

sh: 1: /directory/to/infernal/binaries/esl-seqstat: Exec format error

This occurs when running "v-annotate.pl" on Ubuntu 18.04.
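
One common cause of an "Exec format error" is a binary built for a different operating system or CPU architecture than the one it is run on, although this report does not confirm that this is what happened here. A quick way to check is to look at the first bytes of the file. The sketch below is a generic diagnostic, not part of VADR, and recognizes only a few well-known formats.

```python
# Leading "magic" bytes of a few executable formats.
MAGIC = {
    b"\x7fELF": "ELF (Linux and most other Unix systems)",
    b"\xcf\xfa\xed\xfe": "Mach-O 64-bit (macOS)",
    b"\xce\xfa\xed\xfe": "Mach-O 32-bit (macOS)",
}


def executable_format(path):
    """Guess the executable format of a file from its first four bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head[:2] == b"#!":
        return "script with an interpreter line"
    return MAGIC.get(head, "unknown")
```

Calling `executable_format()` on the esl-seqstat binary named in the error message and comparing the result with the host system would show whether the installed binaries match the platform. The ELF and Mach-O formats also encode the CPU architecture, which this sketch does not inspect.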

Error: summed psi of split states in node 32146 not 1.0 but : 1.001000

Hi developers,
I would like to use VADR to annotate some mastadenovirus genomes. Because there are no official models for mastadenovirus, I manually collated a list of mastadenovirus genomes to build my own model using v-build.pl.

However, when I built the model according to the build instructions, I always encountered the error below:

$ v-annotate.pl --mdir my-vadr-model-mastadv --mkey mastadv fnas/LC068714.fasta test/LC068714 -f
# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.2.1 (June 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Tue Jan 18 10:01:28 2022
# $VADRBIOEASELDIR:  /data/amax81t/packages/Bio-Easel
# $VADRBLASTDIR:     /home/zjl/anaconda3/envs/py38env/bin
# $VADREASELDIR:     /home/zjl/anaconda3/bin
# $VADRINFERNALDIR:  /home/zjl/anaconda3/envs/py38env/bin
# $VADRMODELDIR:     /data/amax81t/packages/vadr/vadr-models-flavi
# $VADRSCRIPTSDIR:   /data/amax81t/packages/vadr
#
# sequence file:                                                                  fnas/LC068714.fasta
# output directory:                                                               test/LC068714
# force directory overwrite:                                                      yes [-f]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  mastadv [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         my-vadr-model-mastadv [--mdir]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Validating input                                                                        ... done. [    0.3 seconds]
# Classifying sequences (1 seq)                                                           ... 
Error: summed psi of split states in node 32146 not 1.0 but : 1.001000


ERROR in utl_RunCommand(), the following command failed:
/home/zjl/anaconda3/envs/py38env/bin/cmsearch  -T -10 --cpu 1 --trmF3 --noali --hmmonly --tblout test/LC068714/LC068714.vadr.std.cls.s0.tblout my-vadr-model-mastadv/mastadv.cm test/LC068714/LC068714.vadr.in.fa > test/LC068714/LC068714.vadr.std.cls.s0.stdout

Alternatively, I grouped the v-build.pl results by mastadenovirus species, and encountered the same error.

$ v-annotate.pl --mdir vadr_models_mastadv --mkey Human_adenovirus_C fnas/LC068714.fasta test/LC068714 -f
# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.2.1 (June 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Tue Jan 18 09:44:08 2022
# $VADRBIOEASELDIR:  /data/amax81t/packages/Bio-Easel
# $VADRBLASTDIR:     /home/zjl/anaconda3/envs/py38env/bin
# $VADREASELDIR:     /home/zjl/anaconda3/bin
# $VADRINFERNALDIR:  /home/zjl/anaconda3/envs/py38env/bin
# $VADRMODELDIR:     /data/amax81t/packages/vadr/vadr-models-flavi
# $VADRSCRIPTSDIR:   /data/amax81t/packages/vadr
#
# sequence file:                                                                  fnas/LC068714.fasta
# output directory:                                                               test/LC068714
# force directory overwrite:                                                      yes [-f]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  Human_adenovirus_C [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         vadr_models_mastadv [--mdir]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Validating input                                                                        ... done. [    0.0 seconds]
# Classifying sequences (1 seq)                                                           ... 
Error: summed psi of split states in node 32146 not 1.0 but : 1.001000


ERROR in utl_RunCommand(), the following command failed:
/home/zjl/anaconda3/envs/py38env/bin/cmsearch  -T -10 --cpu 1 --trmF3 --noali --hmmonly --tblout test/LC068714/LC068714.vadr.std.cls.s0.tblout vadr_models_mastadv/Human_adenovirus_C.cm test/LC068714/LC068714.vadr.in.fa > test/LC068714/LC068714.vadr.std.cls.s0.stdout

Could you please help me? And will you release official mastadenovirus models for VADR? Thanks.

VADR predicted nested genes, prevents submission to ENA

This seemed to anger the validation guards at ENA:

19094   20750   gene
                        gene    N
19094   20750   CDS
                        product nucleocapsid phosphoprotein
                        protein_id      NODE_1_length_10623_cov_925.238_7
19115   19838   gene
                        gene    N2
19115   19838   CDS
                        product nucleocapsid phosphoprotein 2
                        protein_id      NODE_1_length_10623_cov_925.238_8

Is this desired behavior by VADR?

Strange handling of input sequence

I'm working on a SARS-CoV-2 workflow that includes VADR and I'm testing how it handles indels. I've tested on some assemblies of real samples with known indels, and VADR performs as expected. However, for unit testing purposes I've been creating mock contigs with simulated indels, and VADR's performance has been inconsistent.

Here's one case: it's the Wuhan Hu reference with a 1 bp insertion at position 14321 and a 3 bp deletion (CAT) at position 29124.

wuhan-hu-2indels.txt

VADR gets stuck at the following alignment step and consumes a lot of memory.

...
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Validating input                                                                        ... done. [    0.1 seconds]
# Preprocessing for N replacement: blastn classification (1 seq)                          ... done. [    0.3 seconds]
# Preprocessing for N replacement: coverage determination from blastn results (1 seq)     ... done. [    0.0 seconds]
# Replacing Ns based on results of blastn-based pre-processing                            ... done. [    0.0 seconds]
# Classifying sequences with blastn (1 seq)                                               ... done. [    0.1 seconds]
# Determining sequence coverage from blastn results (1 seq)                               ... done. [    0.0 seconds]
# Aligning sequences (NC_045512: 2 seqs)                                                  ...

I notice that there are two subsequences being aligned. There is a lot of missing sequence, and it doesn't seem to be related at all to the variants I introduced.

$ grep -c '^>' workdir/vadr/*.fa                                                                       
workdir/vadr/vadr.vadr.blastn.fa:1
workdir/vadr/vadr.vadr.cp.NC_045512.fa:1
workdir/vadr/vadr.vadr.in.fa:1
workdir/vadr/vadr.vadr.NC_045512.a.fa:1
workdir/vadr/vadr.vadr.NC_045512.a.subseq.fa:2
workdir/vadr/vadr.vadr.NC_045512.fa:1
workdir/vadr/vadr.vadr.rpn.sub.fa:0
$ grep '^>' workdir/vadr/vadr.vadr.NC_045512.a.subseq.fa 
>genome/1-14422
>genome/29026-29907

Any idea what's going on here?


I'm using the staphb-vadr Docker image with the following command.

v-annotate.pl -r -s --nomisc --lowsimterm 2 --mxsize 2000 --mkey NC_045512 -f --fstlowthr 0.0 --alt_fail lowscore,fsthicnf,fstlocnf --lowsc 0.75 wuhan-hu-2indels.txt vadr

Feature request: pre-built VADR APT package

Hello,

Thank you for providing VADR. The installation script recommended in the instructions requires online access, an intricate compilation process, and uses a number of external dependencies. For ease of distribution, it would be ideal if VADR could be packaged as an APT package with its dependencies vendored or expressed as Debian/Ubuntu package dependencies (such as ncbi-blast+, hmmer, and infernal).

Error in install from Inline module

I tried to download and install VADR for the first time. I got this message when running ./vadr-install.sh linux (I have also attached 1install.log):
1install.log

Warning: prerequisite Inline 0.51 not found.
Warning: prerequisite Inline::MakeMaker 0.45 not found.
Generating a Unix-style Makefile
Writing Makefile for Bio::Easel
Writing MYMETA.yml and MYMETA.json
cp lib/Bio/Easel/Random.pm blib/lib/Bio/Easel/Random.pm
cp lib/Bio/Easel/SqFile.c blib/lib/Bio/Easel/SqFile.c
cp lib/Bio/Easel/SqFile.typemap blib/lib/Bio/Easel/SqFile.typemap
cp lib/Bio/Easel/Random.c blib/lib/Bio/Easel/Random.c
cp lib/Bio/Easel/SqFile.pm blib/lib/Bio/Easel/SqFile.pm
cp lib/Bio/Easel/Random.typemap blib/lib/Bio/Easel/Random.typemap
cp lib/Bio/Easel.pm blib/lib/Bio/Easel.pm
cp lib/Bio/Easel/MSA.c blib/lib/Bio/Easel/MSA.c
cp lib/Bio/Easel/MSA.typemap blib/lib/Bio/Easel/MSA.typemap
cp lib/Bio/Easel/MSA.pm blib/lib/Bio/Easel/MSA.pm
cp scripts/esl-alidepair.pl blib/script/esl-alidepair.pl
"/home/kt/anaconda3/envs/kt/bin/perl" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/esl-alidepair.pl
cp scripts/esl-ssplit.pl blib/script/esl-ssplit.pl
"/home/kt/anaconda3/envs/kt/bin/perl" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/esl-ssplit.pl
"/home/kt/anaconda3/envs/kt/bin/perl" -Mblib -MInline=NOISY,INSTALL -MBio::Easel::MSA -e1 0.01 blib/arch
Can't locate Inline.pm in @inc (you may need to install the Inline module) (@inc contains: /media/kt/data/TRUNG/2.Tool/vadr/Bio-Easel/blib/arch /media/kt/data/TRUNG/2.Tool/vadr/Bio-Easel/blib/lib /home/kt/anaconda3/envs/kt/lib/site_perl/5.26.2/x86_64-linux-thread-multi /home/kt/anaconda3/envs/kt/lib/site_perl/5.26.2 /home/kt/anaconda3/envs/kt/lib/5.26.2/x86_64-linux-thread-multi /home/kt/anaconda3/envs/kt/lib/5.26.2 .).
BEGIN failed--compilation aborted.
make: *** [Makefile:963: MSA.inl] Error 2
Then I installed Inline with conda install -c bioconda perl-inline
and ran vadr-install.sh build to patch, which gave another error. I have attached it in the file install-build.log.

vadr-install-build.log
Please fix this soon. Thank you.

--version

Could you add a flag to print the version and exit 0? Thank you!

Installation issue?

I tried to install vadr on my Ubuntu 16 LTS desktop computer, but the installation ends with the following error:

`ga@ga-214:~/Documents/tools/vadr-master$ sh ./vadr-install.sh linux
./vadr-install.sh: 51: [: linux: unexpected operator
./vadr-install.sh: 54: [: linux: unexpected operator
./vadr-install.sh: 57: [: t: unexpected operator

DOWNLOADING AND BUILDING VADR 1.1


IMPORTANT: BEFORE YOU WILL BE ABLE TO RUN VADR SCRIPTS,
YOU NEED TO FOLLOW THE INSTRUCTIONS OUTPUT AT THE END
OF THIS SCRIPT TO UPDATE YOUR ENVIRONMENT VARIABLES.


Determining current directory ...
Set VADRINSTALLDIR as current directory (/home/ga/Documents/tools/vadr-master).

Downloading vadr ...
./vadr-install.sh: 82: ./vadr-install.sh: curl: not found
`
Kindly help me to fix this issue.

VADR breaks processing AY394999.1: "Use of uninitialized value $uapos"

Hi Eric,

I was trying to process a GenBank virus genome using VADR, and I got an odd break. Output below. Please let me know if I can provide any further information to help you debug this issue.

Thanks!

~Tomer

(base) ubuntu@ip-172-31-65-128:~/repos/darth$ time make test-taxon-prot-gen
mkdir -p test/taxon-prots
cd test/taxon-prots
for acc in AY394999.1
do
echo "Processing genome $acc:"
mkdir -p $acc
pushd $acc
if esl-sfetch /home/ubuntu/repos/darth/data/cov3ma.fa $acc > $acc.fa
then
sudo docker run -it --rm -m 13GB -v `pwd`:/output taltman/darth:maul \
                        darth.sh \
                                $acc \
                                /output/$acc.fa \
                                none \
                                /root/data \
                                /output \
                                2
fi
popd
echo "... complete!"
done
Processing genome AY394999.1:
~/repos/darth/test/taxon-prots/AY394999.1 ~/repos/darth/test/taxon-prots
WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
# v-annotate.pl :: classify and annotate sequences using a CM library
# VADR 1.1 (May 2020)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Fri Jul 17 08:06:40 2020
# $VADRBIOEASELDIR:  /root/vadr/Bio-Easel
# $VADRBLASTDIR:     /root/vadr/ncbi-blast/bin
# $VADREASELDIR:     /root/vadr/infernal/binaries
# $VADRINFERNALDIR:  /root/vadr/infernal/binaries
# $VADRMODELDIR:     /root/vadr/vadr-models
# $VADRSCRIPTSDIR:   /root/vadr/vadr
#
# sequence file:                                                                  /output/AY394999.1.fa
# output directory:                                                               /output/AY394999.1
# force directory overwrite:                                                      yes [-f]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  corona [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /root/data/vadr-models-corona-1.1-1 [--mdir]
# set max allowed memory for cmalign to <n> Mb:                                   64000 [--mxsize]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Validating input                                                                        ... done. [    6.6 seconds]
# Classifying sequences (1 seq)                                                           ... done. [  129.4 seconds]
# Determining sequence coverage (NC_004718: 1 seq)                                        ... done. [  255.4 seconds]
# Aligning sequences (NC_004718: 1 seq)                                                   ... done. [  586.9 seconds]
# Determining annotation                                                                  ... Use of uninitialized value $uapos in concatenation (.) or string at /root/vadr/vadr/v-annotate.pl line 3757.
done. [    0.3 seconds]
# Validating proteins with blastx (NC_004718: 1 seq)                                      ...
ERROR in seq_Overlap start1 > end1 (27702 > 27701)

Makefile:278: recipe for target 'test-taxon-prot-gen' failed
make: *** [test-taxon-prot-gen] Error 1

real    16m21.518s
user    0m0.107s
sys     0m0.053s

install issue: mv: cannot move ‘vadr-vadr-1.2’ to ‘vadr/vadr-vadr-1.2’: Directory not empty

When I run sh ./vadr-install.sh linux, I get the following error:
inflating: vadr-vadr-1.2/vadr-install.sh
inflating: vadr-vadr-1.2/vadr.pm
inflating: vadr-vadr-1.2/vadr.qsubinfo
inflating: vadr-vadr-1.2/vadr_seed.pm
mv: cannot move ‘vadr-vadr-1.2’ to ‘vadr/vadr-vadr-1.2’: Directory not empty

Does anyone know how to solve it? I manually ran mv vadr-vadr-1.2 vadr/vadr-vadr-1.2 and then reran sh ./vadr-install.sh linux, but got the same error.

Installation errors: missing prereqs and Perl Inline issue

I've been trying to follow the instructions for installing VADR, but I'm hitting a few issues.
Some of the errors I've been able to figure out. For example, for installing on an Amazon Linux instance on EC2, I've had to perform the following prerequisite steps:

	sudo yum install -y autoconf.noarch
	sudo cpanm Inline

Also, I find that the installer doesn't clean up after itself, so it is necessary to blow away the repo completely after each error: rm -r vadr.

I finally hit the following issue, which I cannot figure out:

make[1]: Entering directory `/home/taltman/repos/darth/third-party/vadr/Bio-Easel'
cp lib/Bio/Easel.pm blib/lib/Bio/Easel.pm
cp lib/Bio/Easel/MSA.pm blib/lib/Bio/Easel/MSA.pm
cp lib/Bio/Easel/SqFile.c blib/lib/Bio/Easel/SqFile.c
cp lib/Bio/Easel/SqFile.pm blib/lib/Bio/Easel/SqFile.pm
cp lib/Bio/Easel/Random.c blib/lib/Bio/Easel/Random.c
cp lib/Bio/Easel/SqFile.typemap blib/lib/Bio/Easel/SqFile.typemap
cp lib/Bio/Easel/Random.typemap blib/lib/Bio/Easel/Random.typemap
cp lib/Bio/Easel/Random.pm blib/lib/Bio/Easel/Random.pm
cp lib/Bio/Easel/MSA.typemap blib/lib/Bio/Easel/MSA.typemap
cp lib/Bio/Easel/MSA.c blib/lib/Bio/Easel/MSA.c
cp scripts/esl-alidepair.pl blib/script/esl-alidepair.pl
/usr/bin/perl -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/esl-alidepair.pl
cp scripts/esl-ssplit.pl blib/script/esl-ssplit.pl
/usr/bin/perl -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/esl-ssplit.pl
/usr/bin/perl -Mblib -MInline=NOISY,_INSTALL_ -MBio::Easel::SqFile -e1 0.01 blib/arch
Error. You have specified 'C' as an Inline programming language.

I currently only know about the following languages:
    Foo, foo

If you have installed a support module for this language, try deleting the
config-x86_64-linux-thread-multi-5.016003 file from the following Inline DIRECTORY, and run again:

    /home/taltman/repos/darth/third-party/vadr/Bio-Easel/_Inline

(And if that works, please file a bug report.)

 at /home/taltman/repos/darth/third-party/vadr/Bio-Easel/blib/lib/Bio/Easel/SqFile.pm line 68.
BEGIN failed--compilation aborted at /home/taltman/repos/darth/third-party/vadr/Bio-Easel/blib/lib/Bio/Easel/SqFile.pm line 74.
Compilation failed in require.
BEGIN failed--compilation aborted.
make[1]: *** [SqFile.inl] Error 255
make[1]: Leaving directory `/home/taltman/repos/darth/third-party/vadr/Bio-Easel'

I have no idea what this "language foo" stuff is about. I tried following the instructions to delete the specified config file for "language foo", but it doesn't change the error issued when re-running the last command.

Possibly I need to run some extra steps to load this _Inline Perl package with language definitions? Any help in figuring this out would be greatly appreciated! Thanks in advance for your time!

> 1 CPU for v-annotate.pl

I'm interested in speeding up v-annotate.pl by using > 1 CPU. In commit 023bdab I saw that you commented out the -n arg, and in commit 0aa1d0f you explicitly made blast run with -num_threads 1, so perhaps you observed issues with > 1? If it's only due to users requesting more CPUs than are available (e.g., confusing num threads with num CPUs), Torsten has a nice function that handles it well here.

Would you consider exposing CPUs as an option, where default remains 1, but allow users to request more?
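
A minimal sketch of the kind of clamping described above, written in Python purely for illustration (VADR itself is Perl). The helper name clamp_cpus and the rule that a non-positive request means "use all available CPUs" are assumptions made for this sketch; they are not taken from VADR or from the function linked above.

```python
import os


def clamp_cpus(requested, available=None):
    """Return a usable CPU count: at least 1, at most the number available.

    A non-positive `requested` is treated as "use everything available".
    That convention is an assumption of this sketch, not VADR behavior.
    """
    if available is None:
        # os.cpu_count() may return None if the count cannot be determined.
        available = os.cpu_count() or 1
    if requested <= 0:
        return available
    return min(requested, available)
```

With a default of 1, existing behavior would be unchanged, and a request larger than the machine can serve is silently reduced instead of oversubscribing it.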

With -r, some N-rich regions are not replaced because overlapping blastn hits are thrown out

N replacement with -r relies on identifying regions between blastn hits and checking them for N content. In v-annotate.pl::parse_cdt_tblout_file_and_replace_ns(), the code for selecting the blastn hits to check between throws out overlapping (w.r.t sequence) hits after sorting them by score. If two blastn hits overlap due to chance similarity at the ends after a deletion of a model region, then one will be thrown out and it messes up the check for Ns in missing regions.

Proposed fix is to allow overlaps as long as they're not complete (100% length of smaller hit), and then modify downstream code that analyzes each region for N content to relax the assumption that overlaps are impossible.
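
The proposed rule can be sketched as follows. This is an illustration in Python, not the code in v-annotate.pl: the Hit type, select_hits and gaps_between are hypothetical names, and coordinates are assumed to be 1-based and inclusive on the sequence.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    start: int    # 1-based, inclusive sequence coordinate
    end: int      # 1-based, inclusive, end >= start
    score: float


def overlap_len(a: Hit, b: Hit) -> int:
    """Number of sequence positions shared by two hits (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def select_hits(hits: List[Hit]) -> List[Hit]:
    """Keep hits in score order; drop a hit only if it overlaps a kept hit
    completely (overlap equals 100% of the length of the smaller hit)."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        hit_len = hit.end - hit.start + 1
        complete = any(
            overlap_len(hit, k) == min(hit_len, k.end - k.start + 1)
            for k in kept
        )
        if not complete:
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


def gaps_between(kept: List[Hit]) -> List[Tuple[int, int]]:
    """Regions between hits (sorted by start) to check for N content.
    Partial overlaps are tolerated by tracking the furthest covered position."""
    gaps: List[Tuple[int, int]] = []
    covered_to = kept[0].end if kept else 0
    for h in kept[1:]:
        if h.start > covered_to + 1:
            gaps.append((covered_to + 1, h.start - 1))
        covered_to = max(covered_to, h.end)
    return gaps
```

For example, hits 1..100 and 95..300 overlap by only 6 positions, so both are kept, while a lower-scoring hit 120..200 lying entirely inside 95..300 is dropped.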

Example SARS-CoV-2 sequences are MW967213 and MW967242. The problem can be reproduced with vadr-models-sarscov2-1.2-1, VADR 1.2 and this command:

v-annotate.pl --r_minfract 0.1 --keep --mdir vadr-models-sarscov2-1.2-1 -s -r --nomisc --mkey sarscov2 --lowsim5term 2 --lowsim3term 2 --alt_fail lowscore,fstukcnf,insertnn,deletinn --glsearch -f <fasta> <outdir>

Cannot install using vadr-install.sh

When attempting to install vadr using command ./vadr-install.sh macosx on macOS Catalina 10.15.6, the following error message came up:
-bash: ./vadr-install.sh: Permission denied

How can I solve the problem? Please let me know if you need extra information.

VADR exits due to limited memory, but doesn't report this to the user

When processing a Coronavirus genome, I encountered the following error. Any hints?

# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.3 (Aug 2021)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date:              Tue Dec 14 07:10:37 2021
# $VADRBIOEASELDIR:  /root/vadr/Bio-Easel
# $VADRBLASTDIR:     /root/vadr/ncbi-blast/bin
# $VADREASELDIR:     /root/vadr/infernal/binaries
# $VADRINFERNALDIR:  /root/vadr/infernal/binaries
# $VADRMODELDIR:     /root/vadr/vadr-models-calici
# $VADRSCRIPTSDIR:   /root/vadr/vadr
#
# sequence file:                                                                  /darth/outputs/MalbNV/transeq/canonical.fna
# output directory:                                                               /darth/outputs/MalbNV/SRR10402291
# force directory overwrite:                                                      yes [-f]
# leaving intermediate files on disk:                                             yes [--keep]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr':  corona [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR:                         /root/data/vadr-models-corona-1.3-3 [--mdir]
# set max allowed memory for cmalign to <n> Mb:                                   64000 [--mxsize]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Validating input                                                                        ... done. [    0.1 seconds]
# Classifying sequences (1 seq)                                                           ... done. [  151.2 seconds]
# Determining sequence coverage (NC_045512: 1 seq)                                        ... done. [   18.6 seconds]
# Aligning sequences (NC_045512: 1 seq)                                                   ... Killed

ERROR in vdr_CmalignCheckStdOutput, cmalign /darth/outputs/MalbNV/SRR10402291/SRR10402291.vadr.NC_045512.align.r1.s0.stdout exists but is empty
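
One way a wrapper could report this more directly is to inspect how the child process ended. The sketch below is Python for illustration only and says nothing about how VADR's Perl code launches cmalign; run_and_explain is a hypothetical helper. It assumes a POSIX system, where Python reports death by signal as a negative return code, and it treats SIGKILL only as a hint of memory pressure, not as proof.

```python
import signal
import subprocess
import sys


def run_and_explain(cmd):
    """Run `cmd` (a list of strings) and return (ok, message).

    On POSIX, subprocess reports a child killed by signal N as returncode -N.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if proc.returncode == 0:
        return True, "ok"
    if proc.returncode < 0:
        sig = signal.Signals(-proc.returncode).name
        msg = f"{cmd[0]} was terminated by {sig}"
        if sig == "SIGKILL":
            # SIGKILL is what an out-of-memory killer sends, but other
            # senders are possible, so this is only a hint.
            msg += " (possibly killed for exceeding available memory)"
        return False, msg
    return False, f"{cmd[0]} exited with status {proc.returncode}"


if __name__ == "__main__":
    # Demo: a child that kills itself stands in for a killed aligner.
    demo = [sys.executable, "-c",
            "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    print(run_and_explain(demo)[1])
```

Checking the exit status before looking at the (empty) output file would let the error message mention the kill instead of only the empty file.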

Feature request: avoid dependency on C/C++ build toolchain

Thanks again for providing VADR. In packaging VADR for a production system, I have noticed that it transitively depends on Inline::C. Compiling C/C++ code at runtime requires a build toolchain to be present at runtime, and introduces potential reproducibility and maintainability issues. While I haven't used Inline::C myself, I understand from reading the documentation that it's possible to compile the Inline::C modules once and distribute them as XS modules. It would be nice if this could be done in the course of packaging VADR, so that only dynamic libraries and Perl modules are required, instead of a full compiler toolchain.

VADR doesn't annotate second segment of segmented CoV genome

The Serratus Project expanded the set of known CoV/nidovirus genomes, including segmented ones. An example of a segmented nidovirus similar to the ones that we found is the Pacific salmon nidovirus (MK611985.1). Please see Figure 3 of our preprint for more context:
https://www.biorxiv.org/content/10.1101/2020.08.07.241729v2

When I try to annotate the AmexNV genome, with two segments in the input FASTA file, VADR 1.3 annotates the first segment, and then reports the following for the second one:

>Feature NODE_11_length_12596_cov_95.354468

Additional note(s) to submitter:
ERROR: NO_ANNOTATION: (*sequence*) no significant similarity detected [-]; seq-coords:-; mdl-coords:-; mdl:-;

Yet, when I concatenate the two contigs with a run of 16 Ns, I get additional annotations (see below). Is there a way for VADR to recognize the multiple segments and annotate them individually? (See below for the input files used.)

Additional annotations:

22167   27672   gene
                        gene    S
22167   27672   CDS
                        product spike glycoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_3
27717   28212   gene
                        gene    orf4
27717   28212   CDS
                        product non-structural protein
                        protein_id      NODE_3_length_19124_cov_65.568632_4
28193   28627   gene
                        gene    E
28193   28627   CDS
                        product small membrane protein
                        protein_id      NODE_3_length_19124_cov_65.568632_5
28639   29602   gene
                        gene    M
28639   29602   CDS
                        product membrane glycoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_6
29646   31439   gene
                        gene    N
29646   31439   CDS
                        product nucleocapsid phosphoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_7
29665   30389   gene
                        gene    N2
29665   30389   CDS
                        product nucleocapsid phosphoprotein 2
                        protein_id      NODE_3_length_19124_cov_65.568632_8

Original FASTA file with two segments:
SRR6788790.epsy.fa.txt

Modified FASTA with the two segments concatenated:
AmexNV-one-contig-test.fa.txt
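
For reference, the concatenation workaround described above can be sketched like this. It is a small Python illustration, not the script actually used to produce the attached file; the function names are hypothetical and the 16-N spacer is taken from the description above.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def concatenate(records, name, spacer_len=16):
    """Join all sequences into one record, separated by a run of Ns."""
    spacer = "N" * spacer_len
    return name, spacer.join(seq for _, seq in records)


def format_fasta(name, seq, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"
```

Note that in the joined record, coordinates in the second segment are offset by the length of the first contig plus the spacer, so any annotation reported there has to be shifted back to refer to the original contig.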

5' UTR identification?

In comparing VADR vs. the NCBI annotation for SARS-CoV-2, I noticed that the NCBI annotation displays the 5' UTR, but it is missing from the VADR output:

image

Perhaps I'm not parsing out the content correctly. Does VADR have a model for the 5' UTR?

Issue in installation - reg

I am trying to install vadr.

Command: sh ./vadr-install.sh linux

ERROR:
cp scripts/esl-alidepair.pl blib/script/esl-alidepair.pl
"/home/navajeet/miniconda3/bin/perl" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/esl-alidepair.pl
cp scripts/esl-ssplit.pl blib/script/esl-ssplit.pl
"/home/navajeet/miniconda3/bin/perl" -MExtUtils::MY -e 'MY->fixin(shift)' -- blib/script/esl-ssplit.pl
"/home/navajeet/miniconda3/bin/perl" -Mblib -MInline=NOISY,INSTALL -MBio::Easel::Random -e1 0.01 blib/arch
Error. You have specified 'C' as an Inline programming language.

I currently only know about the following languages:
Foo, foo

If you have installed a support module for this language, try deleting the
config-x86_64-linux-thread-multi-5.026002 file from the following Inline DIRECTORY, and run again:

/home/navajeet/vadr-install-dir/Bio-Easel/_Inline

(And if that works, please file a bug report.)

at /home/navajeet/vadr-install-dir/Bio-Easel/blib/lib/Bio/Easel/Random.pm line 66.
BEGIN failed--compilation aborted at /home/navajeet/vadr-install-dir/Bio-Easel/blib/lib/Bio/Easel/Random.pm line 72.
Compilation failed in require.
BEGIN failed--compilation aborted.
make: *** [Makefile:963: Random.inl] Error 255

Can you please help me.

Regards,
Navajeet

Long sequence descriptions invalidate glsearch output format

vdr_GlsearchFormat3And9CToStockholmAndInsertFile() in vadr.pm relies on sequence length being in the glsearch -m 3,9c output line (1000 nt in example below):

1>>>seq1 this is the description - 1000 nt (forward-only)

But if the description is above a length threshold (about 470 characters), the sequence length won't be printed:

1>>>seq1 extremly long description more than 470 characters

This causes v-annotate.pl to fail with an error.
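
A tolerant parser would treat the length as optional and fall back to a length obtained elsewhere (for example, from the input FASTA). The sketch below is Python for illustration, not the Perl in vadr.pm. The line format is inferred only from the two example lines above, and parse_header and known_lengths are hypothetical names.

```python
import re

# "<index>>>><name> <description...>"; the description part is optional.
HEADER_RE = re.compile(r"^\d+>>>(?P<name>\S+)(?:\s+(?P<rest>.*))?$")
# Trailing "- <length> nt", optionally followed by a parenthesized note.
LENGTH_RE = re.compile(r"(?:^|\s)-\s+(?P<length>\d+)\s+nt(?:\s+\([^)]*\))?\s*$")


def parse_header(line, known_lengths=None):
    """Return (name, length) from a header line.

    If the length is missing from the line, look the name up in
    `known_lengths` (a dict of name -> length); return None for the
    length if it cannot be determined either way.
    """
    m = HEADER_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError("not a header line: %r" % line)
    name = m.group("name")
    rest = m.group("rest") or ""
    lm = LENGTH_RE.search(rest)
    if lm is not None:
        return name, int(lm.group("length"))
    if known_lengths is not None and name in known_lengths:
        return name, known_lengths[name]
    return name, None
```

Because the sequence name still appears on the truncated line, a lookup keyed on the name is enough to recover the length without depending on the description fitting within the threshold.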
