Comments (8)

linzhi2013 avatar linzhi2013 commented on July 24, 2024

Hi dleopold,

Thank you very much for reporting the issue. I will check it when I have more time.

I agree with your point, "It would seem more appropriate to stop at the same position as the reference if possible, particularly if the reference is highly similar", provided the reference annotation is correct (though in most cases we don't know the truth).

MitoZ checks the initial CDS ranges from the file DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position (https://github.com/linzhi2013/MitoZ/wiki/Tutorial#3-the-annotation-step), and tries to find the stop codons or T/TA endings.

So we can look at the file DM01_DM01.megahit.mitogenome.fa_mitoscaf.fa.cds.position to confirm that the annotation hits the reference mitogenome (NC_030597). If it hits a different reference, that would explain the result.

Secondly, if it does hit the reference genome (NC_030597), gene length may be affecting the annotation (the COX2 on NC_030597 might be too short). You can compare the COX2 on NC_030597 with the COX2 of other closely related species: do a codon-based multiple sequence alignment and see whether all species end at the same position.
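
As a rough illustration of the "do all species end at the same position" check, here is a minimal Python sketch. It assumes the sequences have already been aligned by some external tool and are gap-padded to equal length; the function names and the toy rows are hypothetical and are not real COX2 data.

```python
def last_residue_column(aligned_seq, gap="-"):
    """Return the 0-based index of the last non-gap column (-1 if all gaps)."""
    return len(aligned_seq.rstrip(gap)) - 1


def end_columns(alignment):
    """Map each sequence name to the alignment column where it ends."""
    if len({len(seq) for seq in alignment.values()}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    return {name: last_residue_column(seq) for name, seq in alignment.items()}


# Toy, made-up alignment rows (hypothetical, for illustration only).
toy_alignment = {
    "taxon_A": "ATGGCACATGCTT--",  # ends in a single T
    "taxon_B": "ATGGCACATGCTTA-",  # ends in TA
    "taxon_C": "ATGGCTCATGCTT--",  # ends in a single T
}

ends = end_columns(toy_alignment)
spread = max(ends.values()) - min(ends.values())
print(ends, "spread:", spread)
```

A spread of 0 or 1 column would be what you expect when the genes differ only in whether the ending is annotated as T or TA; a larger spread points to an annotation that stops somewhere else.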

from mitoz.

dleopold avatar dleopold commented on July 24, 2024

In this case, the "...cds.position" file shows that the sequence being annotated is hitting itself in the reference database for COX2. Here is the relevant part of the file:

>gi_NC_030597_COX2_Sargocentron_spiniferum_230_aa_Ref1:230aa	[mRNA]	locus=NC_030597.1:7206:7895:+
ATGGCACATCCCTCTCAACTAGGATTCCAAGATGCGGCTTCACCCGTTAT
AGAAGAGCTCCTTCACTTCCACGACCACGCTTTAATAATCGTCTTTCTAA
TTAGCACACTAGTTCTTTACATTATTGTGGCGATAGTCTCCACTAAACTA
ACCAACAAATATATCCTCGACTCCCAAGAAATCGAAATTATCTGAACAGT
ACTCCCTGCAGTAATTCTTATCCTAATTGCCCTCCCCTCACTACGAATTC
TTTATCTTATGGATGAAATTAATGACCCACACCTAACTATTAAAGCAATA
GGACACCAATGATACTGAAGCTACGAATATACTGATTACGAGGATCTTGG
CTTCGACTCTTATATAATTCCTACCCAAGACCTTACCCCAGGACAATTCC
GCCTCCTAGAAGCAGACCATCGAATAGTTATCCCAATTGAATCCCCTATT
CGTGTTCTAGTCTCAGCCGAAGACGTCCTACACTCATGAGCAGTTCCAGC
ACTAGGCGTTAAAATAGACGCAGTGCCTGGCCGACTAAACCAAACAGCCT
TTATTACATCCCGCCCAGGTGTATTCTACGGTCAATGCTCCGAAATCTGC
GGCGCAAACCACAGCTTTATACCCATCGTCGTTGAAGCTGTCCCACTAGA
ACACTTTGAAAACTGATCCTCTATAATACTTGAAGACGCT

However, although the alignment occurs at positions 7206:7895, the final annotation of the gene is 7206:7916, extending 19bp into the tRNA-Lys despite a T/TA in the 'correct' stop position. As far as I can tell this is not due to an issue with length - I downloaded mitogenomes for 4 similar taxa and they all align very well (codon or nucleotide based alignment) and end at the same T or TA. They are also all the same length, only differing by 1 bp if the stop codon was identified as T or TA.
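
For anyone who wants to reproduce the coordinate arithmetic, here is a small Python sketch. It assumes the `locus=` field in the header above uses 1-based inclusive coordinates (the 690 nt span matching 230 aa supports this, but it is still an assumption), and `parse_locus` is a hypothetical helper, not part of MitoZ.

```python
import re

# Header copied from the cds.position excerpt above.
header = (
    ">gi_NC_030597_COX2_Sargocentron_spiniferum_230_aa_Ref1:230aa\t[mRNA]\t"
    "locus=NC_030597.1:7206:7895:+"
)


def parse_locus(header_line):
    """Extract (seqid, start, end, strand) from a 'locus=' field."""
    match = re.search(r"locus=([^:\s]+):(\d+):(\d+):([+-])", header_line)
    if match is None:
        raise ValueError("no locus= field found")
    seqid, start, end, strand = match.groups()
    return seqid, int(start), int(end), strand


seqid, start, end, strand = parse_locus(header)

aligned_len = end - start + 1             # assumes 1-based inclusive coordinates
aligned_codons = aligned_len // 3

final_end = 7916                          # end of the final MitoZ annotation
extra_nt = final_end - end                # bases added past the aligned end
final_len = final_end - start + 1

print(seqid, aligned_len, aligned_codons, extra_nt, final_len)
# → NC_030597.1 690 230 21 711
```

The aligned span is 690 nt (230 complete codons, no stop), and the final annotation adds 21 nt for a total of 711 nt (237 codons). If the TA ending occupies 7896-7897, the remaining 19 nt would be the overlap with tRNA-Lys; that split is my reading of the numbers, not something taken from the files.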

I do realize that we often do not know the "correct" position when working with novel taxa, but I chose this Ref Seq mitogenome as a reproducible example of something that I am seeing quite frequently in my data (~800 fish mitogenomes). To me, this seems like a clear case where the stop position in the original annotation should be properly re-identified by MitoZ. Perhaps I am missing something about how length enters into the delineation of the start/stop positions? I did not see anything about that in the original paper or documentation, so I am just trying to understand why the annotation is incorrect.

linzhi2013 avatar linzhi2013 commented on July 24, 2024

image

Could you please tell me the accessions for the "4 similar taxa"? thanks! I would like to check them further.

linzhi2013 avatar linzhi2013 commented on July 24, 2024

When I check this COX2 of NC_030597.1, I do find a standard stop codon after the extension (the first line below). We can check (at the protein level) whether this is also true for the other 4 similar taxa. If all taxa show a similar pattern (conserved protein sequence in the extended region), then we can probably say that the NCBI Ref annotation for this gene is wrong.
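
To make the two competing endings concrete, here is a minimal Python sketch that looks for an incomplete T/TA ending right after the last full codon and, separately, for the first in-frame standard stop codon further downstream. It is not MitoZ's actual code; the function names and the toy sequence are hypothetical, and the stop codon set is the vertebrate mitochondrial code (NCBI translation table 2).

```python
# Stop codons of the vertebrate mitochondrial code (NCBI translation table 2).
STOP_CODONS = {"TAA", "TAG", "AGA", "AGG"}


def incomplete_stop(seq, pos):
    """Classify the bases starting at 0-based `pos` as 'TA', 'T', or None."""
    nxt = seq[pos:pos + 2].upper()
    if nxt == "TA":
        return "TA"
    if nxt[:1] == "T":
        return "T"
    return None


def first_inframe_stop(seq, cds_start, search_from):
    """Return the 0-based start of the first in-frame stop codon at or after
    `search_from`, reading in the frame set by `cds_start`; None if absent."""
    offset = (search_from - cds_start) % 3
    pos = search_from if offset == 0 else search_from + (3 - offset)
    while pos + 3 <= len(seq):
        if seq[pos:pos + 3].upper() in STOP_CODONS:
            return pos
        pos += 3
    return None


# Toy sequence (made up): 3 sense codons, then TA, then more sequence.
toy = "ATGGCAGCT" + "TAC" + "GGA" + "TAA" + "CC"
after_last_codon = 9  # 0-based index just past the last aligned codon

short_end = incomplete_stop(toy, after_last_codon)
stop_at = first_inframe_stop(toy, 0, after_last_codon)
extension_nt = stop_at + 3 - after_last_codon

print(short_end, stop_at, extension_nt)
# → TA 15 9
```

In the toy case both endings are available: a TA immediately after the last codon, and a complete TAA 9 nt further on. Which one an annotator should prefer is exactly the question in this thread.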

image

linzhi2013 avatar linzhi2013 commented on July 24, 2024

image

linzhi2013 avatar linzhi2013 commented on July 24, 2024

Perhaps in the future we can add an option that tells MitoZ not to do such a greedy extension when users don't want it, so that the annotation result is more similar to the reference gene (although the reference is not necessarily correct).

dleopold avatar dleopold commented on July 24, 2024

It looks like you compared the results with a similar set of reference sequences as the ones I used. Thank you for looking into it. I don't have any additional data to determine which is the 'correct' annotation. However, it appears that for this gene, in the Holocentridae, the NCBI annotations are fairly consistent, all ending before the neighboring tRNA. This is also consistent with annotations produced by other annotation software (e.g. Mitos2, MitoAnnotator). I have also had many of my NCBI submissions rejected when the annotation differs from all of the annotations of currently accepted submissions in this way (for this and other taxonomic groups / genes). So, whether or not there is uncertainty about which is correct, there does seem to be a prevailing consensus.

linzhi2013 avatar linzhi2013 commented on July 24, 2024

Thanks for the information.

Maybe we can first determine the boundaries of the tRNA genes and then use them to constrain the boundaries of the PCGs, to avoid overlap between PCGs and tRNAs. (However, I remember that in some clades at least some genes do overlap, for example ATP6 and ATP8, right?)

But anyway, if the goal of a study is phylogenetic analysis using PCGs, or gene order rearrangement analysis, this kind of issue doesn't matter at all as far as I can see.
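
A minimal Python sketch of that idea, assuming forward-strand genes and 1-based inclusive coordinates. `clip_pcg_end` and its `max_overlap` parameter are hypothetical (this is not an existing MitoZ option), and the tRNA-Lys start of 7898 is inferred from the numbers quoted earlier in this thread rather than read from an annotation.

```python
def clip_pcg_end(pcg_start, pcg_end, trna_starts, max_overlap=0):
    """Clip a forward-strand PCG end so it overlaps the nearest downstream
    tRNA by at most `max_overlap` bases (1-based inclusive coordinates)."""
    # tRNAs that start inside the current PCG span
    inside = [s for s in trna_starts if pcg_start < s <= pcg_end]
    if not inside:
        return pcg_end
    limit = min(inside) - 1 + max_overlap
    return min(pcg_end, limit)


# COX2 coordinates from this thread; the tRNA-Lys start (7898) is an assumption.
strict = clip_pcg_end(7206, 7916, [7898])                   # no overlap allowed
lenient = clip_pcg_end(7206, 7916, [7898], max_overlap=19)  # overlap tolerated
untouched = clip_pcg_end(7206, 7916, [])                    # no tRNA known

print(strict, lenient, untouched)
# → 7897 7916 7916
```

With no overlap allowed the gene would end at 7897, which under the assumption above is the TA ending. Making the tolerated overlap a parameter leaves room for clades where genes really do overlap.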
