Comments (10)
To clarify : if a query transcript matches a reference transcript (using class code =
) and also overlaps other reference transcript(s) (using class code j
, which means at least one intron-exon junction matches), obviously the output files produced by gffcompare will show only the =
matching reference transcript (due to the class code priorities, =
always overrides j
).
However if a query transcript does not "match" any reference transcript (no class code =
) but just overlaps multiple reference transcripts with the same class code j
, unfortunately gffcompare is currently using only the amount of matching bases (overlap length) to choose as the "closest" reference, so it will pick the reference transcript that has "more bases in common with" the query transcript to show in the output files. Now I do see how that is a rather poor/weak criterion -- because it can be argued that the number of matching junctions should always take precedence over the number of common bases, right? And I can agree to that, perhaps I should implement a better scoring scheme for these cases. Some of the more interesting isoform variations are already covered by other class codes (c
or k
for containment, m
or n
for retained introns etc.) which all have a higher priority over the bland j
class code.. But when it comes to just multiple j
code assignments, indeed gffcompare could be found guilty of using only the number of common bases to pick the "closest" reference/annotation transcript.
If you are interested in a more thorough analysis of all the junction overlaps between isoforms and reference transcripts you could use the trmap utility which is distributed with gffcompare and that will output all the overlaps (with their class code) found between every query transcripts vs. a set of reference transcripts. The output of that utility includes all the exon coordinates (for both query and reference transcripts) that could be easily used for further analysis of those overlaps, so one could write a script to parse that output and pick a better "representative" reference transcript for a query that only has j
code overlaps with the reference transcripts.
from gffcompare.
Good point. Besides the priority assigned to each "class code" , gffcompare also keeps track of the "overlap length" (or "matching bases" as you suggested). So between multiple references with the same 'class code' , the one with the largest overlap will be chosen. Now for the "matching" references (class code =
), recently I've added the concept of "overlap score", which is the overlap length minus the length of the "overhangs" (shown in red in the attached picture for two example transcripts (in blue) matching the same reference transcript (shown in black) ).
The situation exemplified in this picture is actually a bit tricky if you consider the problem from the point of view of the reference - which one of the query transcripts are the "closest", or "best match" for that reference transcript? (the .refmap
file produced by gffcompare is supposed to answer this question). It is hard to answer because the "overlap scores" are very close in this case, because the first transcript has a smaller overlap there for the first exon, despite having shorter overhangs.
Getting back to your question, to summarize: if a query transcript overlaps multiple references, for simple non-match overlaps the better overlap length is the deciding factor, except for class code '=' where the better overlap score is deciding, where overlap_score = overlap_len - overhangs_len
.
If the overlap_score is the same, then it's a toss-up - the "matching" reference to use for annotation is chosen randomly. Note that in the annotation data usually there are not multiple "matching" references in the same location, and gffcompare will actually discard most of those as "duplicates" (though it's quite strict about their structural matching when it does that).
from gffcompare.
Dear @gpertea
Thank you very much for your wonderful comment. It is really helpful.
The steps you have explained does make sense for the "matching" transcripts.
But what if there are some isoform candidate transcripts that have internal exon variation which can be called as "junction match", and have similar reference transcripts for their match?
Would there be any deciding factor for this situation? (or have overlap score?)
In my understanding, the "matching" class transcripts are only called when all the internal exon boundaries do match with the reference. But I am having a quite difficult time figuring out the logic of which "junction match" transcripts are annotated from which reference transcripts.
Thank you so much again for your help!
Jungwoo
from gffcompare.
Thank you very much again for wonderful comment.
I will definitely try the suggestion.
from gffcompare.
Dear @gpertea,
I would like to ask a quick question about the previous topic that we have discussed.
I found out that the sample transcript that shows the missing exon (around 50bp) compare to the reference shown below, was labeled with the coverage at 100% percentage using "trmap" -S option.
I even tried annotating the reference to the sample and still observed the same result (in this case, transcript A at 100% to the sample transcript B)
I am not quite sure whether the algorithm ignores some of the short bases containing exons?
If you have any idea of what this could be, please let me know!
Thank you for your wonderful help!
Jungwoo
from gffcompare.
Sorry to admit I forgot why the -S
option even exists, but I guess the original intent was to provide a value showing how much of each (query) transcript length is covered ("confirmed") by "annotated" transcription bases (i.e. reference transcripts' exons). If that was the original intent then it seems that the 100% value there is correct, because all the bases of Transcript B seem to be covered by reference transcript(s) bases, right? Not the other way around, of course. The fact that there is no actual structural match between transcript B and transcript A does not seem to matter for this -S
output. If you really want to be more precise about structural matching/overlap between sample and reference then perhaps you should not use the -S
option and use the default output of trmap instead (and compute your own coverage for the =
class overlaps)
But you also say that when you swapped query and reference you still got 100% coverage for transcript A by transcript B (only by transcript B? or were there other transcripts in the sample overlapping that short exon ? ). Looking at the proportions of exons in the figure you have there I think it is possible for this to be a silly rounding issue, as that exon seems very small relative to the entire length of the transcript (assuming there are many more large exons not shown in the truncated figure). What is the total exonic length of transcript A ?
from gffcompare.
I see. I thought that the value intended the coverage of bases referring to that specific transcript.
Here I am providing the length for both transcript A & B.
Transcript A : 1549bp
Transcript B : 1535bp
Although the overall length different is only 14bp, the missing exon in Transcript B is 24bp because the exon next to it (towards the 5' end) is 10bp longer in Transcript B than A.
from gffcompare.
A late update to this: I realized that trmap with -S
option was actually not computing exon coverage, but just plain full-range interval coverage (start..end of the transcripts). That was not very useful, I suppose. The latest release actually changed this so trmap -S
now reports specifically exon coverage.
from gffcompare.
Dear @gpertea ,
Thank you for your wonderful comment. I have some questions for the follow-up to this question:
What is the priority principle of class code? The result file obtained by trmap
contains all overlaps. As you said above, the result of gffcompare
is =
if overlaps contains =
, but if it contains multiple other types of ’class code' (such as y, i, m, j
..., without the =
), how does it work in the end? Is overlap_score used to judge?
Give a simple example:
If I now need to screen candidate transcripts of lncrna based on the results of trmap
, should I put transcripts that include uxijo
and =
into the candidate set?
Thank you,
Yao
from gffcompare.
Currently the class codes have a priority of each class code is somewhat reflected in the table found on the documentation page (https://ccb.jhu.edu/software/stringtie/gffcompare.shtml#transfrag-class-codes), though some of those code are "tied" in terms of priority level and then that tie is broken by the overlap length as you assumed.
Some codes have no direct overlap (zero overlap length) - e.g. i
and y
are by definition not overlapping any exons. Those codes have an absolute code-level priority set lower than the actually overlapping codes.
from gffcompare.
Related Issues (20)
- Extract transposon HOT 1
- Missing tmap and refmap files in query gff directory + missing statistics in stats file
- How to keep the CDS info for each transcript? HOT 2
- Transcript classification codes ? HOT 1
- class_code 'u' in all transcripts HOT 1
- Merging different gtf files
- How is class code X is identified in stranded data?
- There is no CDS information in the generated file!
- add option to assemble transfrags in the combined.gtf output HOT 1
- A question on gffcompare
- -p should rename transcripts even when running in "annotation mode" (single input file, no merge options)
- all input transcript attributes and features should be preserved in annotation mode
- No sensitivity or precision stats reported if multiple gffs given as input
- Number of loci and mRNAs in reference keeps changing HOT 1
- GFFCompare ".tracking" file not containing all transcript IDs of all the provided annotation files
- "Warning: merging adjacent/overlapping segments" when using GFFCompare
- How can one visualize GFF files coverage, FPKM ; TPM values in a genome browser?
- missing stringTie ID in gffcomapre tmap file
- Should I filter the transcript with class code "s"?
- No Sensitivity/Accuracy in the .stats output file
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from gffcompare.