gpertea / gffcompare Goto Github PK
View Code? Open in Web Editor NEWclassify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
License: MIT License
classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
License: MIT License
Hello,
I mapped my RNA-seq reads with HISAT2 and assembled them with stringtie while using the Salmo salar whole genome as a reference in both stages. I understand that "Sensitivity is the proportion of coding nucleotides that have been correctly predicted as coding, and Specificity is the proportion of noncoding nucleotides that have been correctly predicted as noncoding." (From Burset and Guigo 1996)
Is Precision synonymous with Specificity in this case?
What could lead to the discrepancy between Sensitivity and Precision seen in my data? Thanks!
gffcompare v0.9.8 | Command line was:
./gffcompare -r /media/genetics/Data1/Jeff_workstation_1/subread-1.5.0-p1/GCF_000233375.1_ICSASG_v2_genomic.gff -o /media/genetics/Data1/Jeff_workstation_1/gffcompare-0.9.8.Linux_x86_64/compare/8-B7877_S4_1a_mRNA-e-disabled /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled-e-disabled/8-B7877_S4_1a_mRNA-e-disabled.gtf
= Summary for dataset: /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled-e-disabled/8-B7877_S4_1a_mRNA-e-disabled.gtf
Query mRNAs : 20484 in 19076 loci (13136 multi-exon transcripts)
(998 multi-transcript loci, ~1.1 transcripts per locus)
Reference mRNAs : 136039 in 81267 loci (111148 multi-exon)
Super-loci w/ reference transcripts: 12245
-----------------| Sensitivity | Precision |
Base level: 12.2 | 86.9 |
Exon level: 8.9 | 73.9 |
Intron level: 9.5 | 96.7 |
Intron chain level: 5.7 | 48.6 |
Transcript level: 5.6 | 37.3 |
Locus level: 8.3 | 35.4 |
Matching intron chains: 6378
Matching transcripts: 7647
Matching loci: 6744
Missed exons: 498437/562650 ( 88.6%)
Novel exons: 5960/66229 ( 9.0%)
Missed introns: 421469/471248 ( 89.4%)
Novel introns: 935/46255 ( 2.0%)
Missed loci: 68711/81267 ( 84.5%)
Novel loci: 4714/19076 ( 24.7%)
Total union super-loci across all input datasets: 17075
hello, sorry to botter you. I am a beginner.I wonder what these numbers mean in stats file.
What does the number after 'Novel exons' represent?
Missed exons: 70627/364162 ( 19.4%)
Novel exons: 3595/181415 ( 2.0%)
Missed introns: 36856/226049 ( 16.3%)
Novel introns: 1791/139213 ( 1.3%)
Missed loci: 0/13156 ( 0.0%)
Novel loci: 92/14996 ( 0.6%)
thank you .
Hi,
I am taking a few stringtie assemblies of human single-cell RNAseq data and annotating them back against Gencode v27 gtf files using gffcompare. I have noticed some differences between the total number of transcripts in the one input.gtf and the output file of .tmap after I used the many default modes of gffcompare.
The command is used as follows:
gffcompare-0.10.1.Linux_x86_64/gffcompare -R -r /ReferenceAnnotation/gencode.v27.annotation.gtf -G
-o gffResultPrefix stringtie_result/SRR.gtf
I have got some information about the differences between the "GffCompare" and "CuffCompare" from the website at https://github.com/gpertea/gffcompare, as follows:
Another important difference is that the input transcripts are no longer discarded when they are found to be "intron redundant", i.e. contained within other, longer isoforms. CuffCompare had the -G option to prevent collapsing of such intron redundant isoforms into their longer "containers", but GffCompare has made this the default mode of operation (hence the -G option is no longer needed and is simply ignored when given).
Maybe I am missing something obvious, so could you please give advice on how I can run the newest version of gffcompare to preserve the missing transcripts?
Hi:
I can't find the class code k in cuffcompare.So,what's the mean of class code 'k'?
Hello, I've been following the protocol published on Nature and running gffcompare
on two RNAseq replicates that I have. I removed the -G
flag as (from what I understand) it's now the default behaviour.
Still, when I run
gffcompare -r hg19.gtf merged.gtf
I get the following error:
Error allocating memory. Aborted (core dumped)
When I run the tool in debugging mode with -D
I see a lot of
GFF Warning: unusual segment inclusion: exon(X-X) within CDS(X-X) (ID=X)
and of
GFF warning: merging adjacent/overlapping segments of X on X (X-X, X-X)
and then:
Kept X ref transcripts out of Y
W duplicate reference transcripts discarded.
Prefix for output files: gffcmp
Loading query file #1: merged.gtf
Error allocating memory.
Aborted (core dumped)
Any idea on what could be causing this?
I am working on an Ubuntu 16.04 machine and I tried both with the precompiled binaries and with my own build (following instruction in the README file).
Many thanks in advance!
Cheers
Hello,
Recently, I was running the GFFcompare and found some confusing results.
The query file has one single-exon transcript which is slightly different from the reference file at the boundary, and the query file has also one multiple-exon transcript which is slightly different from the reference file at the boundary. When I using GFFCompare with "--strict-match -e 0 -d 0", I was expected that the transcript level of two data should be 0% due to the error at the boundary, but the results have shown 100%. What are the possible reasons this might happens?
Thank you
grassking100
# gffcompare v0.11.5 | Command line was:
#gffcompare --strict-match --chr-stats --debug -T -e 0 -d 0 --no-merge -r subanswer.gff3 subpredict.gff3
#
#> Genomic sequence: region_0385
# Query mRNAs : 1 in 1 loci (0 multi-exon transcripts)
# (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 1 in 1 loci (0 multi-exon)
# Super-loci w/ reference transcripts: 1
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 99.8 |
Exon level: 0.0 | 0.0 |
Transcript level: 100.0 | 100.0 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 0
Matching transcripts: 1
Matching loci: 1
Missed exons: 0/1 ( 0.0%)
Novel exons: 0/1 ( 0.0%)
Missed loci: 0/1 ( 0.0%)
Novel loci: 0/1 ( 0.0%)
#> Genomic sequence: region_0333
# Query mRNAs : 1 in 1 loci (1 multi-exon transcripts)
# (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 1 in 1 loci (1 multi-exon)
# Super-loci w/ reference transcripts: 1
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 99.9 |
Exon level: 66.7 | 66.7 |
Intron level: 100.0 | 100.0 |
Intron chain level: 100.0 | 100.0 |
Transcript level: 100.0 | 100.0 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 1
Matching transcripts: 1
Matching loci: 1
Missed exons: 0/3 ( 0.0%)
Novel exons: 0/3 ( 0.0%)
Missed introns: 0/2 ( 0.0%)
Novel introns: 0/2 ( 0.0%)
Missed loci: 0/1 ( 0.0%)
Novel loci: 0/1 ( 0.0%)
#= Summary for dataset: subpredict.gff3
# Query mRNAs : 2 in 2 loci (1 multi-exon transcripts)
# (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs : 2 in 2 loci (1 multi-exon)
# Super-loci w/ reference transcripts: 2
#-----------------| Sensitivity | Precision |
Base level: 100.0 | 99.8 |
Exon level: 50.0 | 50.0 |
Intron level: 100.0 | 100.0 |
Intron chain level: 100.0 | 100.0 |
Transcript level: 100.0 | 100.0 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 1
Matching transcripts: 2
Matching loci: 2
Missed exons: 0/4 ( 0.0%)
Novel exons: 0/4 ( 0.0%)
Missed introns: 0/2 ( 0.0%)
Novel introns: 0/2 ( 0.0%)
Missed loci: 0/2 ( 0.0%)
Novel loci: 0/2 ( 0.0%)
Total union super-loci across all input datasets: 2
2 out of 2 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)
##gff-version 3
region_0385 . gene 500 1432 . + . ID=region_0769_gene_1;Parent=region_0769
region_0385 . mRNA 500 1432 . + . ID=region_0769_gene_1_mRNA;Parent=region_0769_gene_1
region_0385 . exon 500 1432 . + . ID=region_0769_gene_1_mRNA_exon_1;Parent=region_0769_gene_1_mRNA
###
region_0333 . gene 501 2878 4 - . ID=AT2G26870_gene
region_0333 . mRNA 501 2878 4 - . ID=AT2G26870;Parent=AT2G26870_gene
region_0333 . exon 501 1315 4 - . Parent=AT2G26870
region_0333 . exon 1409 1781 4 - . Parent=AT2G26870
region_0333 . exon 2383 2878 4 - . Parent=AT2G26870
##gff-version 3
region_0385 . gene 500 1434 . + . ID=region_0769_gene_1;Parent=region_0769
region_0385 . mRNA 500 1434 . + . ID=region_0769_gene_1_mRNA;Parent=region_0769_gene_1
region_0385 . exon 500 1434 . + . ID=region_0769_gene_1_mRNA_exon_1;Parent=region_0769_gene_1_mRNA
###
region_0333 . gene 499 2878 . - . ID=region_0666_gene_1;Parent=region_0666
region_0333 . mRNA 499 2878 . - . ID=region_0666_gene_1_mRNA;Parent=region_0666_gene_1
region_0333 . exon 499 1315 . - . ID=region_0666_gene_1_mRNA_exon_3;Parent=region_0666_gene_1_mRNA
region_0333 . exon 1409 1781 . - . ID=region_0666_gene_1_mRNA_exon_2;Parent=region_0666_gene_1_mRNA
region_0333 . exon 2383 2878 . - . ID=region_0666_gene_1_mRNA_exon_1;Parent=region_0666_gene_1_mRNA
I have run stringtie merge and I then use stringtie_merged.gtf to compare to the latest version of the gencode annotation with gffcompare. All of the files should be formatted the same (i.e. both files have 'chr1') and I mapped the data with the same gencode annotation file.
gffcompare -r gencode.v27.primary_assembly.annotation.gtf -o gff stringtie_merged.gtf
There are ~ 1000 (<1%) of transcripts that are not being given a class code.
For example when I create a transcript count matrix (with prepDE.py) transcript ENST00000640668.1 has read counts for all samples in the transcript count matrix but has no class code. When I search for this transcript in stringtie merged.gtf and look at each exon of this transcript and compare it to the genocde.gtf all of the co-ordinates exactly match so should gffcompare be giving it an '=' class code???
I have attached my stringtie_merged.gtf if this help to troubleshoot.
stringtie_merged.gtf.gz
make release
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFastaIndex.cpp -o ../gclib/GFastaIndex.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFaSeqGet.cpp -o ../gclib/GFaSeqGet.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gff.cpp -o ../gclib/gff.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gdna.cpp -o ../gclib/gdna.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/codons.cpp -o ../gclib/codons.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GBase.cpp -o ../gclib/GBase.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GStr.cpp -o ../gclib/GStr.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GArgs.cpp -o ../gclib/GArgs.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c gtf_tracking.cpp -o gtf_tracking.o
gtf_tracking.cpp: In function ‘void read_transcripts(FILE*, GList&, bool)’:
gtf_tracking.cpp:606:7: error: ‘class GffReader’ has no member named ‘keepingAttrs’; did you mean ‘keepAttrs’?
gffr.keepingAttrs(keepAttrs, true);
^~~~~~~~~~~~
gtf_tracking.cpp:607:10: error: ‘class GffReader’ has no member named ‘mergingCloseExons’; did you mean ‘mergeCloseExons’?
gffr.mergingCloseExons(true);
^~~~~~~~~~~~~~~~~
gtf_tracking.cpp: In function ‘void read_mRNAs(FILE*, GList&, GList, bool, int, const char, bool)’:
gtf_tracking.cpp:640:8: error: ‘class GffReader’ has no member named ‘mergingCloseExons’; did you mean ‘mergeCloseExons’?
gffr->mergingCloseExons(true);
^~~~~~~~~~~~~~~~~
gtf_tracking.cpp:641:8: error: ‘class GffReader’ has no member named ‘keepingAttrs’; did you mean ‘keepAttrs’?
gffr->keepingAttrs(!isRefData, isRefData || gtf_tracking_largeScale );
^~~~~~~~~~~~
make: *** [gtf_tracking.o] Error 1
Any suggestions?
Hi,
What is the definition of being correct of a single-exon transcript?
Thanks!
Hello,
Thank you very much for this helpful tool. But I'd like to ask how are sensitivity and precision at different levels (bases, exons, transcripts) calculated? In other words, how do you define a "truth positive" or a "match"? Can you provide more details here?
Thanks!
It would be great if you could identify tagged releases of gffcompare and gclib that will work to build this, as I don't want to use an untagged version on our compute cluster. Also, for what it's worth, adding a flag to identify where gclib is would definitely beat hard coding "../gclib".
Deleted, put comment in under wrong username
Hi Geo,
Thanks for a great tool!
I'm using this to compare various short and long read "assemblies" and it would be great if there was an option to allow a "fuzz" distance between intron and exon starts and ends.
I started looking at the code and noticed that you had something for this already:
Line 104 in 4ed82e0
and also later in the file where you do the actual matching.
I'm guessing there was a good reason for commenting it out and was wondering what that might be?
I'd be interested in making my own changes to re-enable it for my work, but wanted to find out the gotchas that you might have encountered before diving in.
Thanks,
Chris
Error while executing the below command
gffcompare -V -R -r merged.gtf -o gffcompare_output -i gtf_list
Error happen: Warning: reference transcript has undetermined strand, discarded.
merged.gtf comes form the output of stringtie --merge
gtf_list is a list of gtf files of 12 samples which are input file of stringtie --merge
This option is listed both in the current README.md as an example, as well as in Box 1 of the Nature Protocols paper (though with different meanings, it seems). However, gffcompare v0.9.7
does not list -G
in the usage statement and does not accept it as input.
An explanation (and/or alternative) in the README would be appreciated!
Also (off topic) it's not instantly clear what's in the 6 different output files. A quick primer would save a new user a lot of time.
Apart from that - where has this tool been all my life?! So useful, thanks!
Hi,
I am taking a few stringtie merged assemblies of mouse RNAseq data and annotating them back against Gencode vM12 gtf files using gffcompare. I have noticed some differences between older versions of gffcompare and the newer versions where the arguments and default modes of gffcompare have been changed.
With version v0.9.6f, I would annotate the merged gtf file against the gencode gtf with the -G -C options which were defined as doing the following:
-G generic GFF input file(s): do not assume Cufflinks/Stringtie GTF input, (do not discard intron-redundant transfrags)
-C include the "contained" transcripts in the .combined.gtf file
In the newest version cloned from github, v0.9.9d, -G is now the default and -C appears to do the opposite of what -C was defined as doing in the older versions, though I could be misinterpreting this.
-C discard the "contained" transfrags in the .combined.gtf (i.e. collapse intron-redundant transfrags across all query files)
So, I figured I would want to run the newer versions of gffcompare without -G and -C in order to mimic the behavior of the older versions of gffcompare with the -G and -C options. However, I am seeing cases where transcripts are being treated as duplicates, even when they are not the same as each other and then getting merged together. So, I tried running with -C and -C -K options, but generally, no matter what I do, I haven't been able to identify the right combination of arguments to mimic the behavior of older versions of gffcompare with -G -C. I also noticed that whenever I run with -C option, that I get a .combined.gtf file instead of a .annotated.gtf file when running without the -C option.
I've attached an image of IGV showing gencode annotatoin for a single exon gene with two transcripts of differing lengths in the gencode annotation. I also show the stringtie assembly and then the older version of gffcompare with -G -C and the newest version of gffcompare with multiple different options. You can see that in the newer version, the contained transcript is removed. This is a simple example, but I see this in multi exon genes as well.
Maybe I am missing something obvious, so could you please give advice on how I can run the newest version of gffcompare to preserve the contained transcripts?
I have ~ 2000 transcripts that don't get given a class code when running gff compare. I am comparing the gencode.gtf with my stringtie_merged.gtf.
I'm confused as when I produce a transcript count matrix (via Stringtie/prepDE.py) there are transcripts that have lots of counts but supposedly aren't present in stringtie_merged.gtf.
eg ENST00000632434.1 has lots of read counts for all samples and has 'NA' class code.
When I grep ENST00000632434.1 from stringtie_merged.gtf I get an MSTRG id:
chr5 StringTie transcript 116085026 116293286 1000 + . gene_id "MSTRG.32862"; transcript_id "ENST00000632434.1"; gene_name "COMMD10"; ref_gene_id "ENSG00000145781.8";
When I then look at the MSTRG.32862 transcript in my count matrix is has very low counts.
If ENST00000632434.1 is present in all samples (as indicated by the transcript count matrix) then why isn't it being given a class code??
Hi
The .stats file, mentions a number of novel exons and a number of novel introns identified. Is there an easy way to get the coordinates of these regions?
e.g.
Novel exons: 16055/631658 ( 2.5%)
Novel introns: 5926/381131 ( 1.6%)
Thank you in advance
Foivos
I am also new to RNA-SEQ analysis. I also followed this same protocol for my dataset. And i am having some problem for identifying novel transcripts and genes from the below gffcompare ".stats" file
i was running the following command $ gffcompare –r chrX_data/genes/chrX.gtf –G –o merged stringtie_merged.gtf. and i got the result file as
= Summary for dataset: stringtie_merged.gtf
Query mRNAs : 253181 in 70635 loci (216728 multi-exon transcripts)
(23264 multi-transcript loci, ~3.6 transcripts per locus)
Reference mRNAs : 216257 in 60158 loci (189357 multi-exon)
Super-loci w/ reference transcripts: 51791
-----------------| Sensitivity | Precision |
Base level: 100.0 | 93.0 |
Exon level: 99.9 | 93.6 |
Intron level: 99.4 | 94.0 |
Intron chain level: 99.7 | 87.1 | Transcript level: 99.8 | 85.2 | Locus level: 100.0 | 84.4 |
Matching intron chains: 188838
Matching transcripts: 215729
Matching loci: 60158
Missed exons: 0/623537 ( 0.0%)
Novel exons: 23741/674273 ( 3.5%)
Missed introns: 2160/383827 ( 0.6%)
Novel introns: 4747/405880 ( 1.2%)
Missed loci: 0/60158 ( 0.0%)
Novel loci: 10239/70635 ( 14.5%)
Total union super-loci across all input datasets: 70632 253181 out of 253181 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant).
from this result where i can find the novel genes and transcripts?
Hi,
I am trying to use gffcompare to compare my assembled transcriptome to a reference gtf that contains information about small open reading frames (sORFs). The reference gtf was obtained by processing downloaded data from several sORF databases. The reference gtf is called all_38.gtf
and is attached at the end.
mytranscriptome.gtf is generated with reference to ensembl hg38 v99 reference.
I used the command: /home/e0470749/gffcompare/gffcompare -r /gpfs/eplab/Nathaniel/norf_replicate/all_38.gtf -o /gpfs/eplab/Nathaniel/norf_replicate/norfs mytranscriptome.gtf
.
This outputs a message file:
0 reference transcripts loaded.
237788 query transfrags loaded.
2714 duplicate query transfrags discarded.
All the expected output files are generated except for the .refmap file which I need for downstream analysis. I assume this is because 0 reference transcripts were loaded into the program. Is there something wrong with my custom reference gtf file?
For the stats produced by gffcompare. Can you explain how they are calculated?
If, 𝑷𝒓𝒆𝒄𝒊𝒔𝒊𝒐𝒏=𝑻𝑷/(𝑻𝑷+𝑭𝑷)
What is a false positive?
And how do you calculate this on the 'base level'?
#-----------------| Sensitivity | Precision |
Base level: 82.4 | 76.5 |
Exon level: 81.2 | 82.9 |
Intron level: 86.1 | 94.8 |
Intron chain level: 56.9 | 52.4 |
Transcript level: 55.2 | 38.9 |
Locus level: 70.1 | 48.0 |
Hello,
I used gffcompare to merge two gtf files with the following command: gffcompare -i gtffiles.txt
However, it seems that there are a falsely merged gene regions that are in fact two different genes. Do you have suggestions to deal with this?
Is there a way to specify to only merge after they share at least 2 exons or at least X amount of overlapping base pairs?
I would be gratefull for your reply,
Kind regards,
Ruth
Hello, I am currently working on the isoform detection analysis using gffcompare.
I was wondering if there are some cases on the reference (Gencode for instance) that have quite a similar isoform structure variation where it might be difficult for the input transcript to be discriminated, how would the input transcript be decided for its annotation? If the input transcript gets annotated to one of the closest matching reference transcripts, what would be the deciding factor? (more matching bases?)
I could not get a clear answer that I wanted from the utility page. If you help with this issue, I would really appreciate it!
Jungwoo.
Here is my testing code.
gffcompare -r test.gff test1.gff
The two gff files are identical, but I got this summary.
#= Summary for dataset: test1.gff
# Query mRNAs : 89544 in 56270 loci (57969 multi-exon transcripts)
# (10996 multi-transcript loci, ~1.6 transcripts per locus)
# Reference mRNAs : 57969 in 25225 loci (57969 multi-exon)
...
When i run this order with -M, I got this summary. It works good.
#= Summary for dataset: test1.gff
# Query mRNAs : 57969 in 25225 loci (57969 multi-exon transcripts)
# (10569 multi-transcript loci, ~2.3 transcripts per locus)
# Reference mRNAs : 57969 in 25225 loci (57969 multi-exon)
Brifly, single-exon reference transcripts are discarded even without -N option.
I am working on plant species transcriptome. Its draft genome is known. We want to identify some new transcripts in genome-guided assembly.
I need your help to remove overlapping sequencing by comparing already reported CDS gff file with assembled transcript gff file using StringTie.
Can we also remove the redundancy of transcript which mapping to the same location on the genome by using gff file.
Thanks
Error while executing the below command
gffcompare -R –r Reference.gtf –o merged stringtie_merged.gtf
Error: cannot locate input file: –r
Reference.gtf is available in the same folder with full permission
Version: gffcompare v0.9.7
Hi there,
I'm doing something weird but I can't find my mistake. I'm running gff compare v 0.9.8.Linux_x86_64.
I have 16 stringtie assemblies, but when I run gff compare on the individual assemblies, I get identical results in the .stats files. However, the file paths indicate that I have correctly provided different .gtf files. If I run multiple assemblies at once, the .stats file only contains the command run.
The reference genome is Atlantic salmon (Salmo salar). I'm also very skeptical that I have 100% sensitivity and precision.
Here are the contents of a stats file:
gffcompare v0.9.8 | Command line was:
./gffcompare -r /media/genetics/Data1/Jeff_workstation_1/subread-1.5.0-p1/GCF_000233375.1_ICSASG_v2_genomic.gff -o /media/genetics/Data1/Jeff_workstation_1/gffcompare-0.9.8.Linux_x86_64/compare/comp /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled/11-B5690_S3_mRNA_ARGU1_NGSLib3b.gtf
= Summary for dataset: /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled/11-B5690_S3_mRNA_ARGU1_NGSLib3b.gtf
Query mRNAs : 136272 in 81266 loci (111377 multi-exon transcripts)
(20690 multi-transcript loci, ~1.7 transcripts per locus)
Reference mRNAs : 136039 in 81267 loci (111148 multi-exon)
Super-loci w/ reference transcripts: 80037
-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 100.0 | 100.0 |
Intron level: 99.5 | 99.5 |
Intron chain level: 100.0 | 99.8 |
Transcript level: 100.0 | 99.8 |
Locus level: 100.0 | 100.0 |
Matching intron chains: 111148
Matching transcripts: 136039
Matching loci: 81267
Missed exons: 0/562650 ( 0.0%)
Novel exons: 0/562955 ( 0.0%)
Missed introns: 2190/471248 ( 0.5%)
Novel introns: 0/471248 ( 0.0%)
Missed loci: 0/81267 ( 0.0%)
Novel loci: 0/81266 ( 0.0%)
Total union super-loci across all input datasets: 81266
I always has one doubt for the code "=", if the query is an exact match of intron chain, but the last exon is longer than the reference, why is the query not an alternative splice.
This is my gtf file for test, saved as 'Test.gtf':
1 StringTie transcript 9182204 9192089 1000 + . gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "3.691087"; FPKM "0.354073"; TPM "0.688700";
1 StringTie exon 9182204 9182316 1000 + . gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "1"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "0.261039";
1 StringTie exon 9183633 9183807 1000 + . gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "2"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "4.634286";
1 StringTie exon 9191393 9192089 1000 + . gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "3"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "4.010363";
1 StringTie transcript 9182752 9182952 1000 . . gene_id "origin.STRG.179"; transcript_id "origin.STRG.179.1"; cov "6.796020"; FPKM "0.651919"; TPM "1.268033";
1 StringTie exon 9182752 9182952 1000 . . gene_id "origin.STRG.179"; transcript_id "origin.STRG.179.1"; exon_number "1"; cov "6.796020";
I typed following command:
gffcompare -r Test.gtf Test.gtf
and this is gffcmp.tracking
:
TCONS_00000001 XLOC_000001 origin.STRG.178|origin.STRG.178.1 = q1:origin.STRG.178|origin.STRG.178.1|3|0.354073|0.688700|3.691087|985
TCONS_00000002 XLOC_000002 origin.STRG.178|origin.STRG.178.1 i q1:origin.STRG.179|origin.STRG.179.1|1|0.651919|1.268033|6.796020|201
See, I compare this file with itself, and two transcripts were not marked two '=', instead, origin.STRG.179
was marked i
compared with origin.STRG.178
. Is there any explanation?
I am trying to find novel transcripts from an RNA-seq database. I tried to follow the protocol provided in the Nature Protocols paper that described the use of Stringtie; the paper also suggested using GFFcompare for comparing the assembled transcripts with the reference, to quantify and find the novel transcripts.
I followed the entire protocol, and this was the output of GFFcompare when comparing the reference GTF with that of one of the samples after mapping -
#= Summary for dataset: ./hisat2/ERR188044_chrX.gtf
#-----------------| Sensitivity | Precision |
Base level: 51.7 | 79.5 |
Exon level: 46.7 | 85.2 |
Intron level: 47.2 | 93.9 |
Intron chain level: 31.2 | 64.4 |
Transcript level: 31.6 | 52.3 |
Locus level: 36.6 | 50.1 |
Matching intron chains: 580
Matching transcripts: 664
Matching loci: 397
Missed exons: 4395/8804 ( 49.9%)
Novel exons: 426/4874 ( 8.7%)
Missed introns: 3832/7946 ( 48.2%)
Novel introns: 83/3992 ( 2.1%)
Missed loci: 610/1086 ( 56.2%)
Novel loci: 273/795 ( 34.3%)
Total union super-loci across all input datasets: 749
1270 out of 1270 consensus transcripts written in 188044_only.annotated.gtf (0 discarded as redundant)
In the paper however, the number of novel genes and transcripts is clearly mentioned -
So, I was wondering how they were able to calculate the number of novel transcripts and genes, from the data that GFFcompare provides (they even mention in the paper that they got it from GFFcompare). I see the numbers related to novel exons and novel introns, but how do I translate it to the number and identities of the novel transcripts/genes?
Currently the *.combined.gtf
file renames the transcripts as they are "merged". There should be a new "annotation mode" option which would simply keep the original transcript names (without trying to rename them), using the longer isoform's transcript ID in case of merging (which shouldn't happen at all if that file is the output of stringtie --merge
) or, optionally, even disabling the merging of compatible/equivalent isoforms (separate option?). Sure the latter option would make the name of this file a bit awkward (in which case perhaps it should be changed to *.annotated.gtf or *.class.gtf ?).
I used cuffcompare to process 1000 files and get an error below:
Error: too many input files (limit set to 500 at compile time)
(if you need to raise this limit set a new value for
MAX_QFILES in gtf_tracking.h and recompile)
But I can't find the "MAX_QFILES" option in the gtf_tracking.h
How can I raise the limit of the input files?
Hi,
This is only a very small issue, but I wanted to tell about it.
It was not clear to me that the refmap and tmap is not written to the current directory, but in the directory of the GFF files.
It is not a big deal, but I asked myself for about 2 hours why gffcompare generates all other files like .stats, .tracking and so on, but no tmap or refmap.
Sorry for bothering with this issue.
Cheers
Chris
Hi
I tried version 0.11.2 (from anaconda) and 0.11.4 (precompiled binary) as following:
gffcompare \
-r reference.gtf \
-s .genome.fa \
-o prefix \
merged_annotation.gtf
where merged_annotation.gtf is the output of stringtie merge. After this run I get the following files:
but I am missing the .tmap and .refmap
Am I doing something wrong?
Thank you in advance
Foivos
I am trying to run the following command:
gffcompare -r ref.gff -G -o merged stringtie-merged.gtf
ref.gff - downloaded from NCBI
string-merged.gtf - obtained from stringtie --merge command
Error encountered : GFF Error: overlapping duplicate transcript feature (ID=gene29892)
When I grep "gene29892" from both the ref.gff and merged stringtie-merged.gtf
From ref.gff
NC_007957.1 RefSeq gene 74631 74744 . + . ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=1/2
NC_007957.1 RefSeq gene 146276 147073 . + . ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=2/2
NC_007957.1 RefSeq CDS 74631 74744 . + 0 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq CDS 146276 146507 . + 0 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq CDS 147048 147073 . + 2 ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1 RefSeq exon 146276 146507 . + . ID=id318095;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=1;gbkey=exon;gene=rps12
NC_007957.1 RefSeq exon 147048 147073 . + . ID=id318096;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=2;gbkey=exon;gene=rps12
From stringtie-merged.gtf
NC_007957.1 StringTie transcript 74631 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 74631 74744 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 146276 146507 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "2"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 147048 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "3"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie transcript 146276 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892";
NC_007957.1 StringTie exon 146276 147073 1000 + . gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892";
What could be possibly wrong as I cannot see any duplicate values!
I'm using the command below to compare a merged transcript GTF file from StringTie with annotated UCSC transcripts. gffcompare (v0.9.9) completes but only produces one output file, test_gffcompare.stats
, which is empty except for 2 header lines of information about the command call. Also, my input merged_transcripts.gtf
was overwritten with the correct gene names, so it appears that gffcompare is working in terms of reference annotation matching. Is overwriting the input file correct behavior? Where are the other output files (e.g. test_gffcompare.combined.gtf
, test_gffcompare.loci
)? I basically followed the steps outlined here, and I cloned the gffcompare repo just this morning.
Thanks :]
Mallory
gffcompare -r UCSC_Main_Human_refGene.gtf -T -V -o test_gffcompare merged_transcripts.gtf
Hello,
I've run gffcompare with the following command:
gffcompare -r reference.gtf -R -C -K -o sample1 query.gff
It generated sample1.combined.gtf, sample1.redundant.gtf, sample1.tracking, sample1.loci, sampe1.stats . However, according to this page, shouldn't I also get sample1.refmap and sample1.tmap? I can't find them, even when I drop "-C" and "-K" option.
The reference gtf file I used looks like this:
chr1 BestRefSeq exon 11874 12227 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 12613 12721 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 13221 14409 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 14362 14829 . - . transcript_id "NR_024540"; gene_id "653635";
and the sample1.gff looks like this:
chr1 pinfish mRNA 14421 195419 0 - . gene_id "847e65b5-7e7b-428b-936e-6a376f435050"; transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 14421 14829 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 14970 15038 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 186317 186469 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
Should I modify my reference or query file to get the refmap/tmap? Any advice would be really helpful!
Hi,
I am running stringtie merge and then gffcompare:
gffcompare -r gencode.gtf -G -o gff stringtie_merged.gtf
When I look at the gff.stringtie_merged.gtf.tmap all values are zero for FPKM,TPM, and coverage for all transcripts??
Hello!
I have been following the procedures provided in the article (HISAT2, Pertea et al., 2016) and successfully merged 34 single-end RNA-seq samples into a stringtie_merged.gtf. I am having trouble running the gffcompare program where it simply crashes out. The error looks like this:
gffcompare -r /home/slab/s6ma/archive/bin/hisat_indices/gene_annotations/Homo_sapiens.GRCh38.91.gtf -G -o merged stringtie_merged.gtf
*** glibc detected *** gffcompare: malloc(): memory corruption: 0x000000000c3ea820 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76e46)[0x7f69f4576e46]
/lib/x86_64-linux-gnu/libc.so.6(+0x79eb3)[0x7f69f4579eb3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x70)[0x7f69f457bcd0]
gffcompare[0x422440]
gffcompare[0x42a379]
gffcompare[0x4406fc]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f69f451eeed]
gffcompare[0x401fe9]
======= Memory map: ========
00400000-00465000 r-xp 00000000 00:27 61384051 /home/slab/s6ma/archive/src/gffcompare-0.10.2.Linux_x86_64/gffcompare
00664000-00666000 rw-p 00064000 00:27 61384051 /home/slab/s6ma/archive/src/gffcompare-0.10.2.Linux_x86_64/gffcompare
00666000-00680000 rw-p 00000000 00:00 0
020ae000-119d8000 rw-p 00000000 00:00 0 [heap]
7f69f0000000-7f69f0021000 rw-p 00000000 00:00 0
7f69f0021000-7f69f4000000 ---p 00000000 00:00 0
7f69f42e8000-7f69f42fd000 r-xp 00000000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f42fd000-7f69f44fd000 ---p 00015000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f44fd000-7f69f44fe000 rw-p 00015000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f4500000-7f69f4682000 r-xp 00000000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4682000-7f69f4881000 ---p 00182000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4881000-7f69f4885000 r--p 00181000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4885000-7f69f4886000 rw-p 00185000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4886000-7f69f488b000 rw-p 00000000 00:00 0
7f69f4890000-7f69f4911000 r-xp 00000000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4911000-7f69f4b10000 ---p 00081000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b10000-7f69f4b11000 r--p 00080000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b11000-7f69f4b12000 rw-p 00081000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b18000-7f69f4b38000 r-xp 00000000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d17000-7f69f4d37000 rw-p 00000000 00:00 0
7f69f4d37000-7f69f4d38000 r--p 0001f000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d38000-7f69f4d39000 rw-p 00020000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d39000-7f69f4d3a000 rw-p 00000000 00:00 0
7f69f4d3c000-7f69f4d41000 rw-p 00000000 00:00 0
7fffd4764000-7fffd4785000 rw-p 00000000 00:00 0 [stack]
7fffd4800000-7fffd4801000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
Does anyone know how to fix this?
I have tried to re-install the gffcompare program but this error still occurs. I have also tried gffcompare on unmerged gtf files but still would not work. So I am highly certain that the issue is within the gffcompare program. Thank you
Hello,
I have a problem of empty output for the
Each set of paired end sequence data were aligned using hisat2 -- according to the published protocol for string tie. And then stringtie run on them with a command similar to:
stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseq1.bam -o rnaseq1.gtf
stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseq2.bam -o rnaseq2.gtf
...
stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseqn.bam -o rnaseqn.gtf
putting these files in a text file I call stringtiefilelist
rnaseq1.gtf
rnaseq2.gtf
...
rnaseqn.gtf
I run stringtie merge as follows:
stringtie --merge -G Homo_sapiens.GRCh38.88.gtf -o stringtie_merged.gtf stringtiefilelist
Each of these steps produces expected output. But then when I run gffcompare
gffcompare -r gencode.v.25.annotation.gtf -o stringtie_merged.txt stringtie_merged.gtf
I get empty output so to speak
#= Summary for dataset: stringtie_merged.gtf
# Query mRNAs : 238442 in 34534 loci (238442 multi-exon transcripts)
# (21005 multi-transcript loci, ~6.9 transcripts per locus)
# Reference mRNAs : 171986 in 33152 loci (171986 multi-exon)
# Super-loci w/ reference transcripts: 0
#-----------------| Sensitivity | Precision |
Base level: 0.0 | 0.0 |
Exon level: 0.0 | 0.0 |
Intron level: 0.0 | 0.0 |
Intron chain level: 0.0 | 0.0 |
Transcript level: 0.0 | 0.0 |
Locus level: 0.0 | 0.0 |
Matching intron chains: 0
Matching transcripts: 0
Matching loci: 0
Missed exons: 544338/544338 (100.0%)
Novel exons: 634910/634910 (100.0%)
Missed introns: 350157/350157 (100.0%)
Novel introns: 398915/398915 (100.0%)
Missed loci: 33152/33152 (100.0%)
Novel loci: 34534/34534 (100.0%)
Is there perhaps a problem in the parameters I am using?
Do I need to convert the gtf file to a gff file? A successful run I had used as input a gff file. Perhaps this is the reason.
Hi,
I used CuffCompare to compare the genome guided assembled GTF file with the genome annotation. Now I want to extract transcripts with class codes annotated with “u”, “x,” and “i” using gffread program.
Please share command line to extract these transcripts.
Thanks
I've merged couple of gff files with stringtie merge tool. The resulting gff has lower sensitivity than each of single gff files used to create merged file. I'm comparing all gff files to the same reference (human, ensembl). Both tools: stringtie and gffcompare are called with default options. The difference is quite high: 30 - 50. Am I missing something? How is that possible?
Update: merged set created with gffcompare (with -i option) behaves according to my expectations: sensitivity is higher than for each component files. It seems more like a stringtie issue now...
The HISAT index provided by http://ccb.jhu.edu/software/hisat2/index.shtml does not have "chr" in the chromosome names, therefore I modified the gencode standard gtf to remove the "chr" before each chromosome as well as renaming "chrM" to "MT" to match the HISAT index.
However, after assembly with Stringtie, I attempt to use gffcompare to compare to the modified gtf, and I find that gffcompare discards every transcript except the unaligned contigs. In the stats file I see
# Reference mRNAs : 271 in 262 loci (27 multi-exon)
Running again with "chr" prepended to the transcript names in the gtf all of the transcripts are kept
# Reference mRNAs : 142428 in 22607 loci (135360 multi-exon)
I use command:
gffread -x sample.CDS.fa -y sample.CDS.protein.fa -g GRCh38.fasta sample.gtf
and No errors appear,but both sample.CDS.fa and sample.CDS.protein.fa are empty file,Why is this happening?
the head of the sample.gtf:
1 StringTie transcript 356184 357780 1000 + . gene_id "MSTRG.49"; transcript_id "MSTRG.49.1";
1 StringTie exon 356184 356438 1000 + . gene_id "MSTRG.49"; transcript_id "MSTRG.49.1"; exon_number "1";
1 StringTie exon 357540 357780 1000 + . gene_id "MSTRG.49"; transcript_id "MSTRG.49.1"; exon_number "2";
1 StringTie transcript 1488502 1572128 1000 - . gene_id "MSTRG.70"; transcript_id "MSTRG.70.1";
1 StringTie exon 1488502 1488533 1000 - . gene_id "MSTRG.70"; transcript_id "MSTRG.70.1"; exon_number "1";
1 StringTie exon 1565130 1572128 1000 - . gene_id "MSTRG.70"; transcript_id "MSTRG.70.1"; exon_number "2";
1 StringTie transcript 2147579 2148044 1000 - . gene_id "MSTRG.79"; transcript_id "MSTRG.79.1";
1 StringTie exon 2147579 2147793 1000 - . gene_id "MSTRG.79"; transcript_id "MSTRG.79.1"; exon_number "1";
1 StringTie exon 2147992 2148044 1000 - . gene_id "MSTRG.79"; transcript_id "MSTRG.79.1"; exon_number "2";
Hi @gpertea ,
I followed the instructions to install gffcompare and I got the error below with make release. Could you provide some help? Thanks!
g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFastaIndex.cpp -o ../gclib/GFastaIndex.o
g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFaSeqGet.cpp -o ../gclib/GFaSeqGet.o
g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gff.cpp -o ../gclib/gff.o
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)’:
../gclib/gff.cpp:1988: error: template argument for ‘template class GVec’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDS’
../gclib/gff.cpp:1988: error: trying to instantiate ‘template class GVec’
../gclib/gff.cpp:1989: error: template argument for ‘template class GArray’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GSegMatch’
../gclib/gff.cpp:1989: error: trying to instantiate ‘template class GArray’
../gclib/gff.cpp: In member function ‘void GffObj::processGeneSegments(GffReader*)::GeneCDSChain::addCDS(int, uint, uint)’:
../gclib/gff.cpp:1998: error: request for member ‘Add’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘void GffObj::processGeneSegments(GffReader*)::GeneCDSChain::addMatch(int, int, int)’:
../gclib/gff.cpp:2003: error: request for member ‘Add’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::singleExonCDSMatch(uint, uint, int&)’:
../gclib/gff.cpp:2009: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2011: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2012: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2012: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::singleCDStoExon(GffObj&, int&)’:
../gclib/gff.cpp:2022: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2023: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2024: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2026: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2026: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::multiCDStoExon(GffObj&, int&)’:
../gclib/gff.cpp:2043: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2047: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2048: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2050: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2050: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2063: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2064: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2074: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2076: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::containedBy(GffObj&, int&)’:
../gclib/gff.cpp:2097: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)’:
../gclib/gff.cpp:2105: error: template argument for ‘template class GHash’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDSChain’
../gclib/gff.cpp:2105: error: trying to instantiate ‘template class GHash’
../gclib/gff.cpp:2105: error: invalid type in declaration before ‘(’ token
../gclib/gff.cpp:2106: error: template argument for ‘template class GPVec’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDSChain’
../gclib/gff.cpp:2106: error: trying to instantiate ‘template class GPVec’
../gclib/gff.cpp:2106: error: invalid type in declaration before ‘;’ token
../gclib/gff.cpp:2128: error: request for member ‘Find’ in ‘cdsChainById’, which is of non-class type ‘int’
../gclib/gff.cpp:2133: error: request for member ‘Add’ in ‘cdsChains’, which is of non-class type ‘int’
../gclib/gff.cpp:2134: error: request for member ‘shkAdd’ in ‘cdsChainById’, which is of non-class type ‘int’
../gclib/gff.cpp:2143: error: request for member ‘Count’ in ‘cdsChains’, which is of non-class type ‘int’
../gclib/gff.cpp:2144: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2153: error: request for member ‘Count’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp:2158: error: request for member ‘First’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp:2159: error: request for member ‘Count’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2160: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2161: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2165: error: request for member ‘First’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
make: *** [../gclib/gff.o] Error 1
We're installing gffcompare on our cluster and it would help us to quickly find and update to the latest stable release if you could keep the tags up-to-date. The latest tag says 0.9.8, but the version string in gffcompare.cpp says 0.9.9d.
Many thanks for your consideration
Hello,
What is the minimum overlap required on the opposite strand for the class code "x" to be assigned? I am curious if this can be altered at all within the script as I am interested in search for antisense transcription in my dataset but would like to know how that is being classified using gffcompare. I've looked in your papers and in the cuffcompare repository/website but can't seem to find this information.
Any help is appreciated!
Thank you,
Diego
I am having the following statistic from gffcompare
Missed exons: 81439/302312 ( 26.9%)
Novel exons: 2536/146326 ( 1.7%)
Missed introns: 38931/183784 ( 21.2%)
Novel introns: 556/100652 ( 0.6%)
Missed loci: 0/11150 ( 0.0%)
Novel loci: 166/10328 ( 1.6%)
I can see there is 2536 novel exons in my input file. Is there any way I can extract this exons?
Thanks
Hello,
I read the Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.
then I download the file chrX_data.tar.gz to practice.
I have a problem for novel transcripts. I use one simple,ERR188044,to practice ,like this:
gffcompare -r /disk/chrX_data/genes/chrX.gtf -G -o /disk/luping/chrX_data/gffcompare/ERR188044 /disk/chrX_data/gtf/ERR188044_chrX.gtf.
#= Summary for dataset: /disk/chrX_data/gtf/ERR188044_chrX.gtf
#-----------------| Sensitivity | Precision |
Base level: 51.6 | 80.3 |
Exon level: 46.7 | 85.4 |
Intron level: 47.2 | 93.8 |
Intron chain level: 31.2 | 65.2 |
Transcript level: 31.6 | 53.8 |
Locus level: 36.6 | 50.6 |
Matching intron chains: 580
Matching transcripts: 664
Matching loci: 397
Missed exons: 4398/8804 ( 50.0%)
Novel exons: 418/4863 ( 8.6%)
Missed introns: 3832/7946 ( 48.2%)
Novel introns: 79/3995 ( 2.0%)
Missed loci: 581/1086 ( 53.5%)
Novel loci: 260/786 ( 33.1%)
Total union super-loci across all input datasets: 741
1234 out of 1234 consensus transcripts written in /disk/luping/chrX_data/gffcompare/ERR188044.annotated.gtf (0 discarded as redundant)
then I statistics the novel transcripts using class_code "j",but this is different from the results of table7
ERR188044 Novel transcripts 615 Did I have any mistake?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.