Git Product home page Git Product logo

Comments (10)

gpertea avatar gpertea commented on June 26, 2024 2

To clarify : if a query transcript matches a reference transcript (using class code =) and also overlaps other reference transcript(s) (using class code j, which means at least one intron-exon junction matches), obviously the output files produced by gffcompare will show only the = matching reference transcript (due to the class code priorities, = always overrides j ).

However if a query transcript does not "match" any reference transcript (no class code =) but just overlaps multiple reference transcripts with the same class code j, unfortunately gffcompare is currently using only the amount of matching bases (overlap length) to choose as the "closest" reference, so it will pick the reference transcript that has "more bases in common with" the query transcript to show in the output files. Now I do see how that is a rather poor/weak criterion -- because it can be argued that the number of matching junctions should always take precedence over the number of common bases, right? And I can agree to that, perhaps I should implement a better scoring scheme for these cases. Some of the more interesting isoform variations are already covered by other class codes (c or k for containment, m or n for retained introns etc.) which all have a higher priority over the bland j class code.. But when it comes to just multiple j code assignments, indeed gffcompare could be found guilty of using only the number of common bases to pick the "closest" reference/annotation transcript.

If you are interested in a more thorough analysis of all the junction overlaps between isoforms and reference transcripts you could use the trmap utility which is distributed with gffcompare and that will output all the overlaps (with their class code) found between every query transcripts vs. a set of reference transcripts. The output of that utility includes all the exon coordinates (for both query and reference transcripts) that could be easily used for further analysis of those overlaps, so one could write a script to parse that output and pick a better "representative" reference transcript for a query that only has j code overlaps with the reference transcripts.

from gffcompare.

gpertea avatar gpertea commented on June 26, 2024

Good point. Besides the priority assigned to each "class code" , gffcompare also keeps track of the "overlap length" (or "matching bases" as you suggested). So between multiple references with the same 'class code' , the one with the largest overlap will be chosen. Now for the "matching" references (class code =), recently I've added the concept of "overlap score", which is the overlap length minus the length of the "overhangs" (shown in red in the attached picture for two example transcripts (in blue) matching the same reference transcript (shown in black) ).
match_overhangs

The situation exemplified in this picture is actually a bit tricky if you consider the problem from the point of view of the reference - which one of the query transcripts are the "closest", or "best match" for that reference transcript? (the .refmap file produced by gffcompare is supposed to answer this question). It is hard to answer because the "overlap scores" are very close in this case, because the first transcript has a smaller overlap there for the first exon, despite having shorter overhangs.

Getting back to your question, to summarize: if a query transcript overlaps multiple references, for simple non-match overlaps the better overlap length is the deciding factor, except for class code '=' where the better overlap score is deciding, where overlap_score = overlap_len - overhangs_len.
If the overlap_score is the same, then it's a toss-up - the "matching" reference to use for annotation is chosen randomly. Note that in the annotation data usually there are not multiple "matching" references in the same location, and gffcompare will actually discard most of those as "duplicates" (though it's quite strict about their structural matching when it does that).

from gffcompare.

ljw90607 avatar ljw90607 commented on June 26, 2024

Dear @gpertea

Thank you very much for your wonderful comment. It is really helpful.

The steps you have explained does make sense for the "matching" transcripts.
But what if there are some isoform candidate transcripts that have internal exon variation which can be called as "junction match", and have similar reference transcripts for their match?
Would there be any deciding factor for this situation? (or have overlap score?)

In my understanding, the "matching" class transcripts are only called when all the internal exon boundaries do match with the reference. But I am having a quite difficult time figuring out the logic of which "junction match" transcripts are annotated from which reference transcripts.

Thank you so much again for your help!

Jungwoo

from gffcompare.

ljw90607 avatar ljw90607 commented on June 26, 2024

Thank you very much again for wonderful comment.
I will definitely try the suggestion.

from gffcompare.

ljw90607 avatar ljw90607 commented on June 26, 2024

Dear @gpertea,

I would like to ask a quick question about the previous topic that we have discussed.
I found out that the sample transcript that shows the missing exon (around 50bp) compare to the reference shown below, was labeled with the coverage at 100% percentage using "trmap" -S option.

trnascript_pic_1

I even tried annotating the reference to the sample and still observed the same result (in this case, transcript A at 100% to the sample transcript B)
I am not quite sure whether the algorithm ignores some of the short bases containing exons?

If you have any idea of what this could be, please let me know!
Thank you for your wonderful help!

Jungwoo

from gffcompare.

gpertea avatar gpertea commented on June 26, 2024

Sorry to admit I forgot why the -S option even exists, but I guess the original intent was to provide a value showing how much of each (query) transcript length is covered ("confirmed") by "annotated" transcription bases (i.e. reference transcripts' exons). If that was the original intent then it seems that the 100% value there is correct, because all the bases of Transcript B seem to be covered by reference transcript(s) bases, right? Not the other way around, of course. The fact that there is no actual structural match between transcript B and transcript A does not seem to matter for this -S output. If you really want to be more precise about structural matching/overlap between sample and reference then perhaps you should not use the -S option and use the default output of trmap instead (and compute your own coverage for the = class overlaps)

But you also say that when you swapped query and reference you still got 100% coverage for transcript A by transcript B (only by transcript B? or were there other transcripts in the sample overlapping that short exon ? ). Looking at the proportions of exons in the figure you have there I think it is possible for this to be a silly rounding issue, as that exon seems very small relative to the entire length of the transcript (assuming there are many more large exons not shown in the truncated figure). What is the total exonic length of transcript A ?

from gffcompare.

ljw90607 avatar ljw90607 commented on June 26, 2024

I see. I thought that the value intended the coverage of bases referring to that specific transcript.

Here I am providing the length for both transcript A & B.
Transcript A : 1549bp
Transcript B : 1535bp

Although the overall length different is only 14bp, the missing exon in Transcript B is 24bp because the exon next to it (towards the 5' end) is 10bp longer in Transcript B than A.

from gffcompare.

gpertea avatar gpertea commented on June 26, 2024

A late update to this: I realized that trmap with -S option was actually not computing exon coverage, but just plain full-range interval coverage (start..end of the transcripts). That was not very useful, I suppose. The latest release actually changed this so trmap -S now reports specifically exon coverage.

from gffcompare.

clsteam avatar clsteam commented on June 26, 2024

Dear @gpertea ,
Thank you for your wonderful comment. I have some questions for the follow-up to this question:
What is the priority principle of class code? The result file obtained by trmap contains all overlaps. As you said above, the result of gffcompare is = if overlaps contains =, but if it contains multiple other types of ’class code' (such as y, i, m, j ..., without the =), how does it work in the end? Is overlap_score used to judge?

Give a simple example:
If I now need to screen candidate transcripts of lncrna based on the results of trmap, should I put transcripts that include uxijo and = into the candidate set?

Thank you,
Yao

from gffcompare.

gpertea avatar gpertea commented on June 26, 2024

Currently the class codes have a priority of each class code is somewhat reflected in the table found on the documentation page (https://ccb.jhu.edu/software/stringtie/gffcompare.shtml#transfrag-class-codes), though some of those code are "tied" in terms of priority level and then that tie is broken by the overlap length as you assumed.
Some codes have no direct overlap (zero overlap length) - e.g. i and y are by definition not overlapping any exons. Those codes have an absolute code-level priority set lower than the actually overlapping codes.

from gffcompare.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.