I noticed today (after a user complaint) that apparently version 4.3 of SnpEff, which is used in the Variation workflows, fails to parse the NC_045512.2 genbank file correctly at the snpeff build step.
This leads to:
- missing annotations of effects on the mature peptides produced from the ORF1ab precursor
- wrong reporting of the ORF1ab AA length as 7101 instead of 7096 in snpeff's EFF field
- shifted amino acid numbers at ORF1ab positions past the ribosomal slippage site.
All reported AA positions for variants downstream of this site seem to be shifted by 5 compared to the reference sequence.
None of this affects the tabular reports produced by SnpSift because this one does not provide these bits of info, but it means that all VCFs produced by the workflows suffer from the annotation error in 3. (though 2. provides a hint at what happened).
The underlying bug in SnpEff has been fixed in the recent 4.5covid19 release of snpeff, which also provides a built-in NC_045512.2 genome file. Annotations with either that built-in genome or a genome rebuilt from the genbank file with the 4.5covid19 snpeff version, yield identical and expected results.
Compare this output snippet of snpeff v4.3 for SRR10971381 taken from https://usegalaxy.org/datasets/bbd44e69cb8906b5ace4f1e83ab23509/display/?preview=True:
NC_045512 | 14268 | . | G | A | 103.0 | PASS | DP=473;AF=0.019027;SB=1;DP4=179,284,3,7;EFF=SYNONYMOUS_CODING(LOW\|SILENT\|acG/acA\|T4673\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|A)
NC_045512 | 14272 | . | G | T | 71.0 | PASS | DP=374;AF=0.016043;SB=3;DP4=141,226,4,3;EFF=STOP_GAINED(HIGH\|NONSENSE\|Gag/Tag\|E4675*\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|T)
NC_045512 | 14309 | . | G | A | 79.0 | PASS | DP=334;AF=0.020958;SB=1;DP4=119,208,3,4;EFF=STOP_GAINED(HIGH\|NONSENSE\|tGg/tAg\|W4687*\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|A)
NC_045512 | 14310 | . | G | A | 86.0 | PASS | DP=303;AF=0.023102;SB=13;DP4=114,181,0,7;EFF=STOP_GAINED(HIGH\|NONSENSE\|tgG/tgA\|W4687*\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|A)
NC_045512 | 14313 | . | T | C | 317.0 | PASS | DP=301;AF=0.073090;SB=21;DP4=109,158,3,20;EFF=SYNONYMOUS_CODING(LOW\|SILENT\|gaT/gaC\|D4688\|7101\|orf1ab\|protein_coding\|CODING\|GU280_gp01\|2\|C)
to the corresponding bit of output from v4.5covid19:
NC_045512.2 14268 . G A 103.0 PASS DP=473;AF=0.019027;SB=1;DP4=179,284,3,7;EFF=SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T4668|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),SYNONYMOUS_CODING(LOW|SILENT|acG/acA|T276|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2 14272 . G T 71.0 PASS DP=374;AF=0.016043;SB=3;DP4=141,226,4,3;EFF=STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E4670*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|T),STOP_GAINED(HIGH|NONSENSE|Gag/Tag|E278*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|T|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2 14309 . G A 79.0 PASS DP=334;AF=0.020958;SB=1;DP4=119,208,3,4;EFF=STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W4682*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),STOP_GAINED(HIGH|NONSENSE|tGg/tAg|W290*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2 14310 . G A 86.0 PASS DP=303;AF=0.023102;SB=13;DP4=114,181,0,7;EFF=STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W4682*|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|A),STOP_GAINED(HIGH|NONSENSE|tgG/tgA|W290*|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|A|WARNING_TRANSCRIPT_NO_START_CODON)
NC_045512.2 14313 . T C 317.0 PASS DP=301;AF=0.073090;SB=21;DP4=109,158,3,20;EFF=SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D4683|7096|ORF1ab|protein_coding|CODING|GU280_gp01|2|C),SYNONYMOUS_CODING(LOW|SILENT|gaT/gaC|D291|931|ORF1ab|protein_coding|CODING|YP_009725307.1|2|C|WARNING_TRANSCRIPT_NO_START_CODON)
Unfortunately, v4.5covid19 of snpeff is somewhat ill-defined: the conda-packaged version and the version on sourceforge differ in that the sourceforge version seems to be a complete SnpEff version (with covid19-related genomes included), while the bioconda version has only the covid19-related genomes, but lacks standard ones.
If we want to fix this annotation issue, @bgruening and me can think of four possibilities:
- Tweak the genbank file that we feed into snpeff build to have even v4.3 parse it correctly
We haven't tested how much effort that would be, and it would mean that the WFs would be bound to this specific input, which feels a bit absurd.
- We take the built-in genome file from 4.5covid19, backport it to 4.3 (which means rewriting the version info inside the genome file) and offer it as a cached genome file on usegalaxy.* instances).
I confirmed already that this is working, but, of course, you cannot port the WFs to an instance without the genome files then.
- We fix the conda version of snpeff v4.5covid19, then release a new wrapper version of snpeff and have the WFs use that one.
This is a bit more work and could cause ambiguities in the snpeff versioning and when you installed it from conda.
- We create a new covid19-specific snpeff wrapper using the current bioconda package and add it to the toolshed alongside the current one.
@bgruening has already prepared such a wrapper and could provide the PR for it.
If you're interested please add your thoughts, ideas, alternative suggestions here!