gpertea / gffcompare Goto Github PK

View Code? Open in Web Editor NEW

181.0 181.0 32.0 452 KB

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF

License: MIT License

Makefile 1.22% C++ 96.91% Shell 1.04% Perl 0.84%

gffcompare's People

Contributors

Stargazers

Watchers

gffcompare's Issues

Interpreting sensitivity and precision

Hello,

I mapped my RNA-seq reads with HISAT2 and assembled them with stringtie while using the Salmo salar whole genome as a reference in both stages. I understand that "Sensitivity is the proportion of coding nucleotides that have been correctly predicted as coding, and Specificity is the proportion of noncoding nucleotides that have been correctly predicted as noncoding." (From Burset and Guigo 1996)

Is Precision synonymous with Specificity in this case?

What could lead to the discrepancy between Sensitivity and Precision seen in my data? Thanks!

gffcompare v0.9.8 | Command line was:
./gffcompare -r /media/genetics/Data1/Jeff_workstation_1/subread-1.5.0-p1/GCF_000233375.1_ICSASG_v2_genomic.gff -o /media/genetics/Data1/Jeff_workstation_1/gffcompare-0.9.8.Linux_x86_64/compare/8-B7877_S4_1a_mRNA-e-disabled /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled-e-disabled/8-B7877_S4_1a_mRNA-e-disabled.gtf

= Summary for dataset: /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled-e-disabled/8-B7877_S4_1a_mRNA-e-disabled.gtf
Query mRNAs : 20484 in 19076 loci (13136 multi-exon transcripts)
(998 multi-transcript loci, ~1.1 transcripts per locus)
Reference mRNAs : 136039 in 81267 loci (111148 multi-exon)
Super-loci w/ reference transcripts: 12245
-----------------| Sensitivity | Precision |
Base level: 12.2 | 86.9 |
Exon level: 8.9 | 73.9 |
Intron level: 9.5 | 96.7 |
Intron chain level: 5.7 | 48.6 |
Transcript level: 5.6 | 37.3 |
Locus level: 8.3 | 35.4 |

 Matching intron chains:    6378
   Matching transcripts:    7647
          Matching loci:    6744

      Missed exons:  498437/562650	( 88.6%)
       Novel exons:    5960/66229	(  9.0%)
    Missed introns:  421469/471248	( 89.4%)
     Novel introns:     935/46255	(  2.0%)
       Missed loci:   68711/81267	( 84.5%)
        Novel loci:    4714/19076	( 24.7%)

Total union super-loci across all input datasets: 17075

how to unnderstand the *.stats file?

hello, sorry to botter you. I am a beginner.I wonder what these numbers mean in stats file.
What does the number after 'Novel exons' represent?

            Missed exons:   70627/364162  ( 19.4%)
           Novel exons:    3595/181415  (  2.0%)
        Missed introns:   36856/226049  ( 16.3%)
         Novel introns:    1791/139213  (  1.3%)
           Missed loci:       0/13156   (  0.0%)
            Novel loci:      92/14996   (  0.6%)

thank you .

Why the total transcripts in the .tmp file was less than the total transcripts in the input .gtf file

Hi,
I am taking a few stringtie assemblies of human single-cell RNAseq data and annotating them back against Gencode v27 gtf files using gffcompare. I have noticed some differences between the total number of transcripts in the one input.gtf and the output file of .tmap after I used the many default modes of gffcompare.
The command is used as follows:
gffcompare-0.10.1.Linux_x86_64/gffcompare -R -r /ReferenceAnnotation/gencode.v27.annotation.gtf -G
-o gffResultPrefix stringtie_result/SRR.gtf
I have got some information about the differences between the "GffCompare" and "CuffCompare" from the website at https://github.com/gpertea/gffcompare, as follows:
Another important difference is that the input transcripts are no longer discarded when they are found to be "intron redundant", i.e. contained within other, longer isoforms. CuffCompare had the -G option to prevent collapsing of such intron redundant isoforms into their longer "containers", but GffCompare has made this the default mode of operation (hence the -G option is no longer needed and is simply ignored when given).
Maybe I am missing something obvious, so could you please give advice on how I can run the newest version of gffcompare to preserve the missing transcripts?

What's the mean of class code 'k'

Hi:
I can't find the class code k in cuffcompare.So,what's the mean of class code 'k'?

Error allocating memory

Hello, I've been following the protocol published on Nature and running gffcompare on two RNAseq replicates that I have. I removed the -G flag as (from what I understand) it's now the default behaviour.

Still, when I run
gffcompare -r hg19.gtf merged.gtf
I get the following error:
Error allocating memory. Aborted (core dumped)

When I run the tool in debugging mode with -D I see a lot of
GFF Warning: unusual segment inclusion: exon(X-X) within CDS(X-X) (ID=X)
and of
GFF warning: merging adjacent/overlapping segments of X on X (X-X, X-X)
and then:

Kept X ref transcripts out of Y
 W duplicate reference transcripts discarded.
Prefix for output files: gffcmp
Loading query file #1: merged.gtf
Error allocating memory.
Aborted (core dumped)

Any idea on what could be causing this?
I am working on an Ubuntu 16.04 machine and I tried both with the precompiled binaries and with my own build (following instruction in the README file).

Many thanks in advance!
Cheers

gffcompare shown wrong result when using strict-match

Hello,
Recently, I was running the GFFcompare and found some confusing results.
The query file has one single-exon transcript which is slightly different from the reference file at the boundary, and the query file has also one multiple-exon transcript which is slightly different from the reference file at the boundary. When I using GFFCompare with "--strict-match -e 0 -d 0", I was expected that the transcript level of two data should be 0% due to the error at the boundary, but the results have shown 100%. What are the possible reasons this might happens?

Thank you
grassking100

gffcmp.stats

# gffcompare v0.11.5 | Command line was:
#gffcompare --strict-match --chr-stats --debug -T -e 0 -d 0 --no-merge -r subanswer.gff3 subpredict.gff3
#
#> Genomic sequence: region_0385 
#     Query mRNAs :       1 in       1 loci  (0 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       1 in       1 loci  (0 multi-exon)
# Super-loci w/ reference transcripts:        1
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.8    |
        Exon level:     0.0     |     0.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       0
       Matching transcripts:       1
              Matching loci:       1

          Missed exons:       0/1	(  0.0%)
           Novel exons:       0/1	(  0.0%)
           Missed loci:       0/1	(  0.0%)
            Novel loci:       0/1	(  0.0%)
#> Genomic sequence: region_0333 
#     Query mRNAs :       1 in       1 loci  (1 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       1 in       1 loci  (1 multi-exon)
# Super-loci w/ reference transcripts:        1
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.9    |
        Exon level:    66.7     |    66.7    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       1
       Matching transcripts:       1
              Matching loci:       1

          Missed exons:       0/3	(  0.0%)
           Novel exons:       0/3	(  0.0%)
        Missed introns:       0/2	(  0.0%)
         Novel introns:       0/2	(  0.0%)
           Missed loci:       0/1	(  0.0%)
            Novel loci:       0/1	(  0.0%)

#= Summary for dataset: subpredict.gff3 
#     Query mRNAs :       2 in       2 loci  (1 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       2 in       2 loci  (1 multi-exon)
# Super-loci w/ reference transcripts:        2
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.8    |
        Exon level:    50.0     |    50.0    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       1
       Matching transcripts:       2
              Matching loci:       2

          Missed exons:       0/4	(  0.0%)
           Novel exons:       0/4	(  0.0%)
        Missed introns:       0/2	(  0.0%)
         Novel introns:       0/2	(  0.0%)
           Missed loci:       0/2	(  0.0%)
            Novel loci:       0/2	(  0.0%)

 Total union super-loci across all input datasets: 2 
2 out of 2 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)

subanswer.gff3

##gff-version 3
region_0385	.	gene	500	1432	.	+	.	ID=region_0769_gene_1;Parent=region_0769
region_0385	.	mRNA	500	1432	.	+	.	ID=region_0769_gene_1_mRNA;Parent=region_0769_gene_1
region_0385	.	exon	500	1432	.	+	.	ID=region_0769_gene_1_mRNA_exon_1;Parent=region_0769_gene_1_mRNA
###
region_0333	.	gene	501	2878	4	-	.	ID=AT2G26870_gene
region_0333	.	mRNA	501	2878	4	-	.	ID=AT2G26870;Parent=AT2G26870_gene
region_0333	.	exon	501	1315	4	-	.	Parent=AT2G26870
region_0333	.	exon	1409	1781	4	-	.	Parent=AT2G26870
region_0333	.	exon	2383	2878	4	-	.	Parent=AT2G26870

subpredict.gff3

##gff-version 3
region_0385	.	gene	500	1434	.	+	.	ID=region_0769_gene_1;Parent=region_0769
region_0385	.	mRNA	500	1434	.	+	.	ID=region_0769_gene_1_mRNA;Parent=region_0769_gene_1
region_0385	.	exon	500	1434	.	+	.	ID=region_0769_gene_1_mRNA_exon_1;Parent=region_0769_gene_1_mRNA
###
region_0333	.	gene	499	2878	.	-	.	ID=region_0666_gene_1;Parent=region_0666
region_0333	.	mRNA	499	2878	.	-	.	ID=region_0666_gene_1_mRNA;Parent=region_0666_gene_1
region_0333	.	exon	499	1315	.	-	.	ID=region_0666_gene_1_mRNA_exon_3;Parent=region_0666_gene_1_mRNA
region_0333	.	exon	1409	1781	.	-	.	ID=region_0666_gene_1_mRNA_exon_2;Parent=region_0666_gene_1_mRNA
region_0333	.	exon	2383	2878	.	-	.	ID=region_0666_gene_1_mRNA_exon_1;Parent=region_0666_gene_1_mRNA

Class code 'NA' when it should be '='?

I have run stringtie merge and I then use stringtie_merged.gtf to compare to the latest version of the gencode annotation with gffcompare. All of the files should be formatted the same (i.e. both files have 'chr1') and I mapped the data with the same gencode annotation file.
gffcompare -r gencode.v27.primary_assembly.annotation.gtf -o gff stringtie_merged.gtf
There are ~ 1000 (<1%) of transcripts that are not being given a class code.
For example when I create a transcript count matrix (with prepDE.py) transcript ENST00000640668.1 has read counts for all samples in the transcript count matrix but has no class code. When I search for this transcript in stringtie merged.gtf and look at each exon of this transcript and compare it to the genocde.gtf all of the co-ordinates exactly match so should gffcompare be giving it an '=' class code???
I have attached my stringtie_merged.gtf if this help to troubleshoot.
stringtie_merged.gtf.gz

Build fails

make release
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFastaIndex.cpp -o ../gclib/GFastaIndex.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFaSeqGet.cpp -o ../gclib/GFaSeqGet.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gff.cpp -o ../gclib/gff.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gdna.cpp -o ../gclib/gdna.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/codons.cpp -o ../gclib/codons.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GBase.cpp -o ../gclib/GBase.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GStr.cpp -o ../gclib/GStr.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GArgs.cpp -o ../gclib/GArgs.o
/software/packages/spack/opt/spack/linux-centos7-x86_64/gcc-8.2.0/gcc-6.3.0-ceghynmzmemy2imu275hslrgvwoh4pch/bin/g++ -O3 -DNDEBUG -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c gtf_tracking.cpp -o gtf_tracking.o
gtf_tracking.cpp: In function ‘void read_transcripts(FILE*, GList&, bool)’:
gtf_tracking.cpp:606:7: error: ‘class GffReader’ has no member named ‘keepingAttrs’; did you mean ‘keepAttrs’?
gffr.keepingAttrs(keepAttrs, true);
^~~~~~~~~~~~
gtf_tracking.cpp:607:10: error: ‘class GffReader’ has no member named ‘mergingCloseExons’; did you mean ‘mergeCloseExons’?
gffr.mergingCloseExons(true);
^~~~~~~~~~~~~~~~~
gtf_tracking.cpp: In function ‘void read_mRNAs(FILE*, GList&, GList, bool, int, const char, bool)’:
gtf_tracking.cpp:640:8: error: ‘class GffReader’ has no member named ‘mergingCloseExons’; did you mean ‘mergeCloseExons’?
gffr->mergingCloseExons(true);
^~~~~~~~~~~~~~~~~
gtf_tracking.cpp:641:8: error: ‘class GffReader’ has no member named ‘keepingAttrs’; did you mean ‘keepAttrs’?
gffr->keepingAttrs(!isRefData, isRefData || gtf_tracking_largeScale );
^~~~~~~~~~~~
make: *** [gtf_tracking.o] Error 1

Any suggestions?

definition for single-exon transcript

Hi,

What is the definition of being correct of a single-exon transcript?

Thanks!

How to interpret sensitivity and precision?

Hello,

Thank you very much for this helpful tool. But I'd like to ask how are sensitivity and precision at different levels (bases, exons, transcripts) calculated? In other words, how do you define a "truth positive" or a "match"? Can you provide more details here?

Thanks!

Tagged Versions?

It would be great if you could identify tagged releases of gffcompare and gclib that will work to build this, as I don't want to use an untagged version on our compute cluster. Also, for what it's worth, adding a flag to identify where gclib is would definitely beat hard coding "../gclib".

Re: Deleted

Deleted, put comment in under wrong username

Fuzz matching of intron chains and exons

Hi Geo,

Thanks for a great tool!

I'm using this to compare various short and long read "assemblies" and it would be great if there was an option to allow a "fuzz" distance between intron and exon starts and ends.

I started looking at the code and noticed that you had something for this already:

gffcompare/gffcompare.cpp

Line 104 in 4ed82e0

//strictMatching=false; // really match *all* exon coords for '=' class code!

and also later in the file where you do the actual matching.

I'm guessing there was a good reason for commenting it out and was wondering what that might be?

I'd be interested in making my own changes to re-enable it for my work, but wanted to find out the gotchas that you might have encountered before diving in.

Thanks,
Chris

Warning: reference transcript has undetermined strand, discarded.

Error while executing the below command
gffcompare -V -R -r merged.gtf -o gffcompare_output -i gtf_list
Error happen: Warning: reference transcript has undetermined strand, discarded.

merged.gtf comes form the output of stringtie --merge
gtf_list is a list of gtf files of 12 samples which are input file of stringtie --merge

-G option no longer exists

This option is listed both in the current README.md as an example, as well as in Box 1 of the Nature Protocols paper (though with different meanings, it seems). However, gffcompare v0.9.7 does not list -G in the usage statement and does not accept it as input.

An explanation (and/or alternative) in the README would be appreciated!

Also (off topic) it's not instantly clear what's in the 6 different output files. A quick primer would save a new user a lot of time.

Apart from that - where has this tool been all my life?! So useful, thanks!

Difficulty matching behavior between older and newer gffcompare

Hi,

I am taking a few stringtie merged assemblies of mouse RNAseq data and annotating them back against Gencode vM12 gtf files using gffcompare. I have noticed some differences between older versions of gffcompare and the newer versions where the arguments and default modes of gffcompare have been changed.

With version v0.9.6f, I would annotate the merged gtf file against the gencode gtf with the -G -C options which were defined as doing the following:

-G generic GFF input file(s): do not assume Cufflinks/Stringtie GTF input, (do not discard intron-redundant transfrags)
-C include the "contained" transcripts in the .combined.gtf file

In the newest version cloned from github, v0.9.9d, -G is now the default and -C appears to do the opposite of what -C was defined as doing in the older versions, though I could be misinterpreting this.

-C discard the "contained" transfrags in the .combined.gtf (i.e. collapse intron-redundant transfrags across all query files)

So, I figured I would want to run the newer versions of gffcompare without -G and -C in order to mimic the behavior of the older versions of gffcompare with the -G and -C options. However, I am seeing cases where transcripts are being treated as duplicates, even when they are not the same as each other and then getting merged together. So, I tried running with -C and -C -K options, but generally, no matter what I do, I haven't been able to identify the right combination of arguments to mimic the behavior of older versions of gffcompare with -G -C. I also noticed that whenever I run with -C option, that I get a .combined.gtf file instead of a .annotated.gtf file when running without the -C option.

I've attached an image of IGV showing gencode annotatoin for a single exon gene with two transcripts of differing lengths in the gencode annotation. I also show the stringtie assembly and then the older version of gffcompare with -G -C and the newest version of gffcompare with multiple different options. You can see that in the newer version, the contained transcript is removed. This is a simple example, but I see this in multi exon genes as well.

Maybe I am missing something obvious, so could you please give advice on how I can run the newest version of gffcompare to preserve the contained transcripts?

gffcompare giving 'NA's

I have ~ 2000 transcripts that don't get given a class code when running gff compare. I am comparing the gencode.gtf with my stringtie_merged.gtf.

I'm confused as when I produce a transcript count matrix (via Stringtie/prepDE.py) there are transcripts that have lots of counts but supposedly aren't present in stringtie_merged.gtf.
eg ENST00000632434.1 has lots of read counts for all samples and has 'NA' class code.
When I grep ENST00000632434.1 from stringtie_merged.gtf I get an MSTRG id:
chr5 StringTie transcript 116085026 116293286 1000 + . gene_id "MSTRG.32862"; transcript_id "ENST00000632434.1"; gene_name "COMMD10"; ref_gene_id "ENSG00000145781.8";

When I then look at the MSTRG.32862 transcript in my count matrix is has very low counts.
If ENST00000632434.1 is present in all samples (as indicated by the transcript count matrix) then why isn't it being given a class code??

Exons and introns coordinates

The .stats file, mentions a number of novel exons and a number of novel introns identified. Is there an easy way to get the coordinates of these regions?

e.g.
Novel exons:   16055/631658  (  2.5%)
Novel introns:    5926/381131  (  1.6%)

Thank you in advance
Foivos

how to find novel genes and transcripts in gffcompare?

I am also new to RNA-SEQ analysis. I also followed this same protocol for my dataset. And i am having some problem for identifying novel transcripts and genes from the below gffcompare ".stats" file

i was running the following command $ gffcompare –r chrX_data/genes/chrX.gtf –G –o merged stringtie_merged.gtf. and i got the result file as
= Summary for dataset: stringtie_merged.gtf
Query mRNAs : 253181 in 70635 loci (216728 multi-exon transcripts)
(23264 multi-transcript loci, ~3.6 transcripts per locus)
Reference mRNAs : 216257 in 60158 loci (189357 multi-exon)
Super-loci w/ reference transcripts: 51791
-----------------| Sensitivity | Precision |

Base level: 100.0 | 93.0 |
Exon level: 99.9 | 93.6 |

Intron level: 99.4 | 94.0 |

Intron chain level: 99.7 | 87.1 | Transcript level: 99.8 | 85.2 | Locus level: 100.0 | 84.4 |

Matching intron chains: 188838
Matching transcripts: 215729
Matching loci: 60158

Missed exons: 0/623537 ( 0.0%)
Novel exons: 23741/674273 ( 3.5%)
Missed introns: 2160/383827 ( 0.6%)
Novel introns: 4747/405880 ( 1.2%)
Missed loci: 0/60158 ( 0.0%)
Novel loci: 10239/70635 ( 14.5%)

Total union super-loci across all input datasets: 70632 253181 out of 253181 consensus transcripts written in merged.annotated.gtf (0 discarded as redundant).

from this result where i can find the novel genes and transcripts?

0 reference transcripts loaded.

Hi,

I am trying to use gffcompare to compare my assembled transcriptome to a reference gtf that contains information about small open reading frames (sORFs). The reference gtf was obtained by processing downloaded data from several sORF databases. The reference gtf is called all_38.gtf and is attached at the end.

mytranscriptome.gtf is generated with reference to ensembl hg38 v99 reference.

I used the command: /home/e0470749/gffcompare/gffcompare -r /gpfs/eplab/Nathaniel/norf_replicate/all_38.gtf -o /gpfs/eplab/Nathaniel/norf_replicate/norfs mytranscriptome.gtf.

This outputs a message file:

0 reference transcripts loaded.
  237788 query transfrags loaded.
  2714 duplicate query transfrags discarded.

All the expected output files are generated except for the .refmap file which I need for downstream analysis. I assume this is because 0 reference transcripts were loaded into the program. Is there something wrong with my custom reference gtf file?

all_38.gtf.zip

Sensitivity/Precision Stats

For the stats produced by gffcompare. Can you explain how they are calculated?
If, 𝑷𝒓𝒆𝒄𝒊𝒔𝒊𝒐𝒏=𝑻𝑷/(𝑻𝑷+𝑭𝑷)
What is a false positive?
And how do you calculate this on the 'base level'?

 #-----------------| Sensitivity | Precision  |      
 Base level:    82.4     |    76.5    |
        Exon level:    81.2     |    82.9    |
      Intron level:    86.1     |    94.8    |
Intron chain level:    56.9     |    52.4    |
  Transcript level:    55.2     |    38.9    |
       Locus level:    70.1     |    48.0    |

how to deal with falsely merged gene regions into one super-locus?

Hello,
I used gffcompare to merge two gtf files with the following command: gffcompare -i gtffiles.txt

However, it seems that there are a falsely merged gene regions that are in fact two different genes. Do you have suggestions to deal with this?
Is there a way to specify to only merge after they share at least 2 exons or at least X amount of overlapping base pairs?

I would be gratefull for your reply,
Kind regards,
Ruth

Question on gffcompare output

Hello, I am currently working on the isoform detection analysis using gffcompare.
I was wondering if there are some cases on the reference (Gencode for instance) that have quite a similar isoform structure variation where it might be difficult for the input transcript to be discriminated, how would the input transcript be decided for its annotation? If the input transcript gets annotated to one of the closest matching reference transcripts, what would be the deciding factor? (more matching bases?)
I could not get a clear answer that I wanted from the utility page. If you help with this issue, I would really appreciate it!

Jungwoo.

single-exon transcript in reference

Here is my testing code.
gffcompare -r test.gff test1.gff
The two gff files are identical, but I got this summary.

#= Summary for dataset: test1.gff
#     Query mRNAs :   89544 in   56270 loci  (57969 multi-exon transcripts)
#            (10996 multi-transcript loci, ~1.6 transcripts per locus)
# Reference mRNAs :   57969 in   25225 loci  (57969 multi-exon)
...

When i run this order with -M, I got this summary. It works good.

#= Summary for dataset: test1.gff
#     Query mRNAs :   57969 in   25225 loci  (57969 multi-exon transcripts)
#            (10569 multi-transcript loci, ~2.3 transcripts per locus)
# Reference mRNAs :   57969 in   25225 loci  (57969 multi-exon)

Brifly, single-exon reference transcripts are discarded even without -N option.

How to compare CDS gff with genome guided assembled gff

I am working on plant species transcriptome. Its draft genome is known. We want to identify some new transcripts in genome-guided assembly.

I need your help to remove overlapping sequencing by comparing already reported CDS gff file with assembled transcript gff file using StringTie.

Can we also remove the redundancy of transcript which mapping to the same location on the genome by using gff file.

Thanks

cannot locate input file: –r [likely solution]

Error while executing the below command

gffcompare -R –r Reference.gtf –o merged stringtie_merged.gtf
Error: cannot locate input file: –r

Reference.gtf is available in the same folder with full permission

Version: gffcompare v0.9.7

gff compare gives identical results for multiple files

Hi there,

I'm doing something weird but I can't find my mistake. I'm running gff compare v 0.9.8.Linux_x86_64.
I have 16 stringtie assemblies, but when I run gff compare on the individual assemblies, I get identical results in the .stats files. However, the file paths indicate that I have correctly provided different .gtf files. If I run multiple assemblies at once, the .stats file only contains the command run.

The reference genome is Atlantic salmon (Salmo salar). I'm also very skeptical that I have 100% sensitivity and precision.

Here are the contents of a stats file:

gffcompare v0.9.8 | Command line was:
./gffcompare -r /media/genetics/Data1/Jeff_workstation_1/subread-1.5.0-p1/GCF_000233375.1_ICSASG_v2_genomic.gff -o /media/genetics/Data1/Jeff_workstation_1/gffcompare-0.9.8.Linux_x86_64/compare/comp /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled/11-B5690_S3_mRNA_ARGU1_NGSLib3b.gtf

= Summary for dataset: /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled/11-B5690_S3_mRNA_ARGU1_NGSLib3b.gtf
Query mRNAs : 136272 in 81266 loci (111377 multi-exon transcripts)
(20690 multi-transcript loci, ~1.7 transcripts per locus)
Reference mRNAs : 136039 in 81267 loci (111148 multi-exon)
Super-loci w/ reference transcripts: 80037
-----------------| Sensitivity | Precision |
Base level: 100.0 | 100.0 |
Exon level: 100.0 | 100.0 |
Intron level: 99.5 | 99.5 |
Intron chain level: 100.0 | 99.8 |
Transcript level: 100.0 | 99.8 |
Locus level: 100.0 | 100.0 |

 Matching intron chains:  111148
   Matching transcripts:  136039
          Matching loci:   81267

      Missed exons:       0/562650	(  0.0%)
       Novel exons:       0/562955	(  0.0%)
    Missed introns:    2190/471248	(  0.5%)
     Novel introns:       0/471248	(  0.0%)
       Missed loci:       0/81267	(  0.0%)
        Novel loci:       0/81266	(  0.0%)

Total union super-loci across all input datasets: 81266

question about code "="

I always has one doubt for the code "=", if the query is an exact match of intron chain, but the last exon is longer than the reference, why is the query not an alternative splice.

Why not two '='

This is my gtf file for test, saved as 'Test.gtf':

1       StringTie       transcript      9182204 9192089 1000    +       .       gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "3.691087"; FPKM "0.354073"; TPM "0.688700";
1       StringTie       exon    9182204 9182316 1000    +       .       gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "1"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "0.261039";
1       StringTie       exon    9183633 9183807 1000    +       .       gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "2"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "4.634286";
1       StringTie       exon    9191393 9192089 1000    +       .       gene_id "origin.STRG.178"; transcript_id "origin.STRG.178.1"; exon_number "3"; reference_id "ENST00000412639"; ref_gene_id "ENSG00000234546"; ref_gene_name "LINC01759"; cov "4.010363";
1       StringTie       transcript      9182752 9182952 1000    .       .       gene_id "origin.STRG.179"; transcript_id "origin.STRG.179.1"; cov "6.796020"; FPKM "0.651919"; TPM "1.268033";
1       StringTie       exon    9182752 9182952 1000    .       .       gene_id "origin.STRG.179"; transcript_id "origin.STRG.179.1"; exon_number "1"; cov "6.796020";

I typed following command:

gffcompare -r Test.gtf Test.gtf

and this is gffcmp.tracking:

TCONS_00000001  XLOC_000001     origin.STRG.178|origin.STRG.178.1       =       q1:origin.STRG.178|origin.STRG.178.1|3|0.354073|0.688700|3.691087|985
TCONS_00000002  XLOC_000002     origin.STRG.178|origin.STRG.178.1       i       q1:origin.STRG.179|origin.STRG.179.1|1|0.651919|1.268033|6.796020|201

See, I compare this file with itself, and two transcripts were not marked two '=', instead, origin.STRG.179 was marked i compared with origin.STRG.178. Is there any explanation?

Question: How do you find novel transcripts using GFFcompare?

I am trying to find novel transcripts from an RNA-seq database. I tried to follow the protocol provided in the Nature Protocols paper that described the use of Stringtie; the paper also suggested using GFFcompare for comparing the assembled transcripts with the reference, to quantify and find the novel transcripts.

I followed the entire protocol, and this was the output of GFFcompare when comparing the reference GTF with that of one of the samples after mapping -

#= Summary for dataset: ./hisat2/ERR188044_chrX.gtf 
#-----------------| Sensitivity | Precision  |
        Base level:    51.7     |    79.5    |
        Exon level:    46.7     |    85.2    |
      Intron level:    47.2     |    93.9    |
Intron chain level:    31.2     |    64.4    |
  Transcript level:    31.6     |    52.3    |
       Locus level:    36.6     |    50.1    |

     Matching intron chains:     580
       Matching transcripts:     664
              Matching loci:     397

          Missed exons:    4395/8804	( 49.9%)
           Novel exons:     426/4874	(  8.7%)
        Missed introns:    3832/7946	( 48.2%)
         Novel introns:      83/3992	(  2.1%)
           Missed loci:     610/1086	( 56.2%)
            Novel loci:     273/795	( 34.3%)

 Total union super-loci across all input datasets: 749 
1270 out of 1270 consensus transcripts written in 188044_only.annotated.gtf (0 discarded as redundant)

In the paper however, the number of novel genes and transcripts is clearly mentioned -

So, I was wondering how they were able to calculate the number of novel transcripts and genes, from the data that GFFcompare provides (they even mention in the paper that they got it from GFFcompare). I see the numbers related to novel exons and novel introns, but how do I translate it to the number and identities of the novel transcripts/genes?

straight annotation mode for single input file

Currently the *.combined.gtf file renames the transcripts as they are "merged". There should be a new "annotation mode" option which would simply keep the original transcript names (without trying to rename them), using the longer isoform's transcript ID in case of merging (which shouldn't happen at all if that file is the output of stringtie --merge) or, optionally, even disabling the merging of compatible/equivalent isoforms (separate option?). Sure the latter option would make the name of this file a bit awkward (in which case perhaps it should be changed to *.annotated.gtf or *.class.gtf ?).

too many input files (limit set to 500 at compile time)

I used cuffcompare to process 1000 files and get an error below:

Error: too many input files (limit set to 500 at compile time)
(if you need to raise this limit set a new value for
MAX_QFILES in gtf_tracking.h and recompile)

But I can't find the "MAX_QFILES" option in the gtf_tracking.h

How can I raise the limit of the input files?

Output path of refmap and tmap

Hi,

This is only a very small issue, but I wanted to tell about it.
It was not clear to me that the refmap and tmap is not written to the current directory, but in the directory of the GFF files.
It is not a big deal, but I asked myself for about 2 hours why gffcompare generates all other files like .stats, .tracking and so on, but no tmap or refmap.

Sorry for bothering with this issue.

Cheers
Chris

Missing tmap and refmap files

I tried version 0.11.2 (from anaconda) and 0.11.4 (precompiled binary) as following:

gffcompare \
-r reference.gtf \
-s .genome.fa \
-o prefix \
merged_annotation.gtf

where merged_annotation.gtf is the output of stringtie merge. After this run I get the following files:

.tracking
.annotated.gtf
.loci
.stats

but I am missing the .tmap and .refmap

Am I doing something wrong?

Thank you in advance
Foivos

Error running gffcompare

I am trying to run the following command:

gffcompare -r ref.gff -G -o merged stringtie-merged.gtf

ref.gff - downloaded from NCBI

string-merged.gtf - obtained from stringtie --merge command

Error encountered : GFF Error: overlapping duplicate transcript feature (ID=gene29892)

When I grep "gene29892" from both the ref.gff and merged stringtie-merged.gtf

From ref.gff

NC_007957.1	RefSeq	gene	74631	74744	.	+	.	ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=1/2
NC_007957.1	RefSeq	gene	146276	147073	.	+	.	ID=gene29892;Dbxref=GeneID:4025012;Name=rps12;exception=trans-splicing;gbkey=Gene;gene=rps12;gene_biotype=protein_coding;locus_tag=ViviCp045;part=2/2
NC_007957.1	RefSeq	CDS	74631	74744	.	+	0	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	CDS	146276	146507	.	+	0	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	CDS	147048	147073	.	+	2	ID=cds41168;Parent=gene29892;Dbxref=Genbank:YP_567100.1,GeneID:4025012;Name=YP_567100.1;exception=trans-splicing;gbkey=CDS;gene=rps12;product=ribosomal protein S12;protein_id=YP_567100.1;transl_table=11
NC_007957.1	RefSeq	exon	146276	146507	.	+	.	ID=id318095;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=1;gbkey=exon;gene=rps12
NC_007957.1	RefSeq	exon	147048	147073	.	+	.	ID=id318096;Parent=gene29892;Dbxref=GeneID:4025012;exon_number=2;gbkey=exon;gene=rps12

From stringtie-merged.gtf

NC_007957.1	StringTie	transcript	74631	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	74631	74744	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	146276	146507	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "2"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	147048	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "3"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	transcript	146276	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; gene_name "rps12"; ref_gene_id "gene29892"; 
NC_007957.1	StringTie	exon	146276	147073	1000	+	.	gene_id "MSTRG.117"; transcript_id "gene29892"; exon_number "1"; gene_name "rps12"; ref_gene_id "gene29892";

What could be possibly wrong as I cannot see any duplicate values!

Only 1 output file, input GTF is overwritten (v0.9.9)

I'm using the command below to compare a merged transcript GTF file from StringTie with annotated UCSC transcripts. gffcompare (v0.9.9) completes but only produces one output file, test_gffcompare.stats, which is empty except for 2 header lines of information about the command call. Also, my input merged_transcripts.gtf was overwritten with the correct gene names, so it appears that gffcompare is working in terms of reference annotation matching. Is overwriting the input file correct behavior? Where are the other output files (e.g. test_gffcompare.combined.gtf, test_gffcompare.loci)? I basically followed the steps outlined here, and I cloned the gffcompare repo just this morning.

Thanks :]
Mallory

gffcompare -r UCSC_Main_Human_refGene.gtf -T -V -o test_gffcompare merged_transcripts.gtf

no refmap/tmap generated from gffcompare?

Hello,

I've run gffcompare with the following command:
gffcompare -r reference.gtf -R -C -K -o sample1 query.gff

It generated sample1.combined.gtf, sample1.redundant.gtf, sample1.tracking, sample1.loci, sampe1.stats . However, according to this page, shouldn't I also get sample1.refmap and sample1.tmap? I can't find them, even when I drop "-C" and "-K" option.

The reference gtf file I used looks like this:

chr1 BestRefSeq exon 11874 12227 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 12613 12721 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 13221 14409 . + . transcript_id "NR_046018"; gene_id "100287102";
chr1 BestRefSeq exon 14362 14829 . - . transcript_id "NR_024540"; gene_id "653635";

and the sample1.gff looks like this:

chr1 pinfish mRNA 14421 195419 0 - . gene_id "847e65b5-7e7b-428b-936e-6a376f435050"; transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 14421 14829 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 14970 15038 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";
chr1 pinfish exon 186317 186469 0 - . transcript_id "1afec315-0f47-448d-bcf1-fee886bdfd83|3";

Should I modify my reference or query file to get the refmap/tmap? Any advice would be really helpful!

No coverage in .tmap file

Hi,
I am running stringtie merge and then gffcompare:
gffcompare -r gencode.gtf -G -o gff stringtie_merged.gtf
When I look at the gff.stringtie_merged.gtf.tmap all values are zero for FPKM,TPM, and coverage for all transcripts??

Memory corruption while running gffcompare

Hello!

I have been following the procedures provided in the article (HISAT2, Pertea et al., 2016) and successfully merged 34 single-end RNA-seq samples into a stringtie_merged.gtf. I am having trouble running the gffcompare program where it simply crashes out. The error looks like this:

gffcompare -r /home/slab/s6ma/archive/bin/hisat_indices/gene_annotations/Homo_sapiens.GRCh38.91.gtf -G -o merged stringtie_merged.gtf
*** glibc detected *** gffcompare: malloc(): memory corruption: 0x000000000c3ea820 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76e46)[0x7f69f4576e46]
/lib/x86_64-linux-gnu/libc.so.6(+0x79eb3)[0x7f69f4579eb3]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x70)[0x7f69f457bcd0]
gffcompare[0x422440]
gffcompare[0x42a379]
gffcompare[0x4406fc]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xfd)[0x7f69f451eeed]
gffcompare[0x401fe9]
======= Memory map: ========
00400000-00465000 r-xp 00000000 00:27 61384051 /home/slab/s6ma/archive/src/gffcompare-0.10.2.Linux_x86_64/gffcompare
00664000-00666000 rw-p 00064000 00:27 61384051 /home/slab/s6ma/archive/src/gffcompare-0.10.2.Linux_x86_64/gffcompare
00666000-00680000 rw-p 00000000 00:00 0
020ae000-119d8000 rw-p 00000000 00:00 0 [heap]
7f69f0000000-7f69f0021000 rw-p 00000000 00:00 0
7f69f0021000-7f69f4000000 ---p 00000000 00:00 0
7f69f42e8000-7f69f42fd000 r-xp 00000000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f42fd000-7f69f44fd000 ---p 00015000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f44fd000-7f69f44fe000 rw-p 00015000 00:11 1060115 /lib/x86_64-linux-gnu/libgcc_s.so.1
7f69f4500000-7f69f4682000 r-xp 00000000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4682000-7f69f4881000 ---p 00182000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4881000-7f69f4885000 r--p 00181000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4885000-7f69f4886000 rw-p 00185000 00:11 3164071 /lib/x86_64-linux-gnu/libc-2.13.so
7f69f4886000-7f69f488b000 rw-p 00000000 00:00 0
7f69f4890000-7f69f4911000 r-xp 00000000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4911000-7f69f4b10000 ---p 00081000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b10000-7f69f4b11000 r--p 00080000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b11000-7f69f4b12000 rw-p 00081000 00:11 3164075 /lib/x86_64-linux-gnu/libm-2.13.so
7f69f4b18000-7f69f4b38000 r-xp 00000000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d17000-7f69f4d37000 rw-p 00000000 00:00 0
7f69f4d37000-7f69f4d38000 r--p 0001f000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d38000-7f69f4d39000 rw-p 00020000 00:11 3164068 /lib/x86_64-linux-gnu/ld-2.13.so
7f69f4d39000-7f69f4d3a000 rw-p 00000000 00:00 0
7f69f4d3c000-7f69f4d41000 rw-p 00000000 00:00 0
7fffd4764000-7fffd4785000 rw-p 00000000 00:00 0 [stack]
7fffd4800000-7fffd4801000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

Does anyone know how to fix this?
I have tried to re-install the gffcompare program but this error still occurs. I have also tried gffcompare on unmerged gtf files but still would not work. So I am highly certain that the issue is within the gffcompare program. Thank you

gffcompare run on stringtie --merge gtf file results in zeros in the summary text

Hello,

I have a problem of empty output for the
Each set of paired end sequence data were aligned using hisat2 -- according to the published protocol for string tie. And then stringtie run on them with a command similar to:

stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseq1.bam -o rnaseq1.gtf
stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseq2.bam -o rnaseq2.gtf
...
stringtie -B -G Homo_sapiens.GRCh38.88.gtf rnaseqn.bam -o rnaseqn.gtf

putting these files in a text file I call stringtiefilelist

rnaseq1.gtf
rnaseq2.gtf
...
rnaseqn.gtf

I run stringtie merge as follows:

stringtie --merge  -G Homo_sapiens.GRCh38.88.gtf  -o stringtie_merged.gtf stringtiefilelist

Each of these steps produces expected output. But then when I run gffcompare

gffcompare -r gencode.v.25.annotation.gtf -o stringtie_merged.txt stringtie_merged.gtf

I get empty output so to speak

#= Summary for dataset: stringtie_merged.gtf
#     Query mRNAs :  238442 in   34534 loci  (238442 multi-exon transcripts)
#            (21005 multi-transcript loci, ~6.9 transcripts per locus)
# Reference mRNAs :  171986 in   33152 loci  (171986 multi-exon)
# Super-loci w/ reference transcripts:        0
#-----------------| Sensitivity | Precision  |
        Base level:     0.0     |     0.0    |
        Exon level:     0.0     |     0.0    |
      Intron level:     0.0     |     0.0    |
Intron chain level:     0.0     |     0.0    |
  Transcript level:     0.0     |     0.0    |
       Locus level:     0.0     |     0.0    |

     Matching intron chains:       0
       Matching transcripts:       0
              Matching loci:       0

          Missed exons:  544338/544338  (100.0%)
           Novel exons:  634910/634910  (100.0%)
        Missed introns:  350157/350157  (100.0%)
         Novel introns:  398915/398915  (100.0%)
           Missed loci:   33152/33152   (100.0%)
            Novel loci:   34534/34534   (100.0%)

Is there perhaps a problem in the parameters I am using?

Do I need to convert the gtf file to a gff file? A successful run I had used as input a gff file. Perhaps this is the reason.

Extract transcript from assembly based on class code using gffread

Hi,

I used CuffCompare to compare the genome guided assembled GTF file with the genome annotation. Now I want to extract transcripts with class codes annotated with “u”, “x,” and “i” using gffread program.

Please share command line to extract these transcripts.

Thanks

Sensitivity lower for stringtie merged gff than for component files

I've merged couple of gff files with stringtie merge tool. The resulting gff has lower sensitivity than each of single gff files used to create merged file. I'm comparing all gff files to the same reference (human, ensembl). Both tools: stringtie and gffcompare are called with default options. The difference is quite high: 30 - 50. Am I missing something? How is that possible?
Update: merged set created with gffcompare (with -i option) behaves according to my expectations: sensitivity is higher than for each component files. It seems more like a stringtie issue now...

gffcompare fails to match gtf without chr prepended to chromosome names

The HISAT index provided by http://ccb.jhu.edu/software/hisat2/index.shtml does not have "chr" in the chromosome names, therefore I modified the gencode standard gtf to remove the "chr" before each chromosome as well as renaming "chrM" to "MT" to match the HISAT index.

However, after assembly with Stringtie, I attempt to use gffcompare to compare to the modified gtf, and I find that gffcompare discards every transcript except the unaligned contigs. In the stats file I see

# Reference mRNAs : 271 in 262 loci (27 multi-exon)

Running again with "chr" prepended to the transcript names in the gtf all of the transcripts are kept

# Reference mRNAs : 142428 in 22607 loci (135360 multi-exon)

why -x and -y option print empty file

I use command:
gffread -x sample.CDS.fa -y sample.CDS.protein.fa -g GRCh38.fasta sample.gtf
and No errors appear,but both sample.CDS.fa and sample.CDS.protein.fa are empty file,Why is this happening？

the head of the sample.gtf：

1	StringTie	transcript	356184	357780	1000	+	.	gene_id "MSTRG.49"; transcript_id "MSTRG.49.1"; 
1	StringTie	exon	356184	356438	1000	+	.	gene_id "MSTRG.49"; transcript_id "MSTRG.49.1"; exon_number "1"; 
1	StringTie	exon	357540	357780	1000	+	.	gene_id "MSTRG.49"; transcript_id "MSTRG.49.1"; exon_number "2"; 
1	StringTie	transcript	1488502	1572128	1000	-	.	gene_id "MSTRG.70"; transcript_id "MSTRG.70.1"; 
1	StringTie	exon	1488502	1488533	1000	-	.	gene_id "MSTRG.70"; transcript_id "MSTRG.70.1"; exon_number "1"; 
1	StringTie	exon	1565130	1572128	1000	-	.	gene_id "MSTRG.70"; transcript_id "MSTRG.70.1"; exon_number "2"; 
1	StringTie	transcript	2147579	2148044	1000	-	.	gene_id "MSTRG.79"; transcript_id "MSTRG.79.1"; 
1	StringTie	exon	2147579	2147793	1000	-	.	gene_id "MSTRG.79"; transcript_id "MSTRG.79.1"; exon_number "1"; 
1	StringTie	exon	2147992	2148044	1000	-	.	gene_id "MSTRG.79"; transcript_id "MSTRG.79.1"; exon_number "2";

Error building gffcompare

Hi @gpertea ,

I followed the instructions to install gffcompare and I got the error below with make release. Could you provide some help? Thanks!

g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFastaIndex.cpp -o ../gclib/GFastaIndex.o
g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/GFaSeqGet.cpp -o ../gclib/GFaSeqGet.o
g++ -O3 -DNDEBUG -std=c++0x -Wall -Wextra -I../gclib -D_REENTRANT -fno-exceptions -fno-rtti -c ../gclib/gff.cpp -o ../gclib/gff.o
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)’:
../gclib/gff.cpp:1988: error: template argument for ‘template class GVec’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDS’
../gclib/gff.cpp:1988: error: trying to instantiate ‘template class GVec’
../gclib/gff.cpp:1989: error: template argument for ‘template class GArray’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GSegMatch’
../gclib/gff.cpp:1989: error: trying to instantiate ‘template class GArray’
../gclib/gff.cpp: In member function ‘void GffObj::processGeneSegments(GffReader*)::GeneCDSChain::addCDS(int, uint, uint)’:
../gclib/gff.cpp:1998: error: request for member ‘Add’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘void GffObj::processGeneSegments(GffReader*)::GeneCDSChain::addMatch(int, int, int)’:
../gclib/gff.cpp:2003: error: request for member ‘Add’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::singleExonCDSMatch(uint, uint, int&)’:
../gclib/gff.cpp:2009: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2011: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2012: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2012: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::singleCDStoExon(GffObj&, int&)’:
../gclib/gff.cpp:2022: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2023: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2024: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2026: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2026: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::multiCDStoExon(GffObj&, int&)’:
../gclib/gff.cpp:2043: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2047: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2048: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2050: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2050: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2063: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2064: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2074: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2076: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)::GeneCDSChain::containedBy(GffObj&, int&)’:
../gclib/gff.cpp:2097: error: request for member ‘Count’ in ‘((GffObj::processGeneSegments(GffReader*)::GeneCDSChain*)this)->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp: In member function ‘bool GffObj::processGeneSegments(GffReader*)’:
../gclib/gff.cpp:2105: error: template argument for ‘template class GHash’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDSChain’
../gclib/gff.cpp:2105: error: trying to instantiate ‘template class GHash’
../gclib/gff.cpp:2105: error: invalid type in declaration before ‘(’ token
../gclib/gff.cpp:2106: error: template argument for ‘template class GPVec’ uses local type ‘GffObj::processGeneSegments(GffReader*)::GeneCDSChain’
../gclib/gff.cpp:2106: error: trying to instantiate ‘template class GPVec’
../gclib/gff.cpp:2106: error: invalid type in declaration before ‘;’ token
../gclib/gff.cpp:2128: error: request for member ‘Find’ in ‘cdsChainById’, which is of non-class type ‘int’
../gclib/gff.cpp:2133: error: request for member ‘Add’ in ‘cdsChains’, which is of non-class type ‘int’
../gclib/gff.cpp:2134: error: request for member ‘shkAdd’ in ‘cdsChainById’, which is of non-class type ‘int’
../gclib/gff.cpp:2143: error: request for member ‘Count’ in ‘cdsChains’, which is of non-class type ‘int’
../gclib/gff.cpp:2144: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2153: error: request for member ‘Count’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp:2158: error: request for member ‘First’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
../gclib/gff.cpp:2159: error: request for member ‘Count’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::cdsList’, which is of non-class type ‘int’
../gclib/gff.cpp:2160: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2161: error: invalid types ‘int[int]’ for array subscript
../gclib/gff.cpp:2165: error: request for member ‘First’ in ‘gc->GffObj::processGeneSegments(GffReader*)::GeneCDSChain::mxs’, which is of non-class type ‘int’
make: *** [../gclib/gff.o] Error 1

update release tags

We're installing gffcompare on our cluster and it would help us to quickly find and update to the latest stable release if you could keep the tags up-to-date. The latest tag says 0.9.8, but the version string in gffcompare.cpp says 0.9.9d.

Many thanks for your consideration

.tracking file shouldn't be generated for a single query file

gffcompare class code "x", what is the minimum overlap?

Hello,

What is the minimum overlap required on the opposite strand for the class code "x" to be assigned? I am curious if this can be altered at all within the script as I am interested in search for antisense transcription in my dataset but would like to know how that is being classified using gffcompare. I've looked in your papers and in the cuffcompare repository/website but can't seem to find this information.

Any help is appreciated!

Thank you,

Diego

How can I extract the novel exons?

I am having the following statistic from gffcompare

      Missed exons:   81439/302312  ( 26.9%)
       Novel exons:    2536/146326  (  1.7%)
    Missed introns:   38931/183784  ( 21.2%)
     Novel introns:     556/100652  (  0.6%)
       Missed loci:       0/11150   (  0.0%)
        Novel loci:     166/10328   (  1.6%)

I can see there is 2536 novel exons in my input file. Is there any way I can extract this exons?

Thanks

How to statistics the novel transcripts

Hello,
I read the Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown.
then I download the file chrX_data.tar.gz to practice.
I have a problem for novel transcripts. I use one simple,ERR188044,to practice ,like this:
gffcompare -r /disk/chrX_data/genes/chrX.gtf -G -o /disk/luping/chrX_data/gffcompare/ERR188044 /disk/chrX_data/gtf/ERR188044_chrX.gtf.

#= Summary for dataset: /disk/chrX_data/gtf/ERR188044_chrX.gtf

Query mRNAs : 1234 in 786 loci (890 multi-exon transcripts)

(226 multi-transcript loci, ~1.6 transcripts per locus)

Reference mRNAs : 2102 in 1086 loci (1856 multi-exon)

Super-loci w/ reference transcripts: 461

#-----------------| Sensitivity | Precision |
Base level: 51.6 | 80.3 |
Exon level: 46.7 | 85.4 |
Intron level: 47.2 | 93.8 |
Intron chain level: 31.2 | 65.2 |
Transcript level: 31.6 | 53.8 |
Locus level: 36.6 | 50.6 |

 Matching intron chains:     580
   Matching transcripts:     664
          Matching loci:     397

      Missed exons:    4398/8804	( 50.0%)
       Novel exons:     418/4863	(  8.6%)
    Missed introns:    3832/7946	( 48.2%)
     Novel introns:      79/3995	(  2.0%)
       Missed loci:     581/1086	( 53.5%)
        Novel loci:     260/786	( 33.1%)

Total union super-loci across all input datasets: 741
1234 out of 1234 consensus transcripts written in /disk/luping/chrX_data/gffcompare/ERR188044.annotated.gtf (0 discarded as redundant)

then I statistics the novel transcripts using class_code "j",but this is different from the results of table7

ERR188044 Novel transcripts 615 Did I have any mistake?

gpertea / gffcompare Goto Github PK

gffcompare's People

Contributors

Stargazers

Watchers

Forkers

gffcompare's Issues

gffcmp.stats

subanswer.gff3

subpredict.gff3

Query mRNAs : 1234 in 786 loci (890 multi-exon transcripts)

(226 multi-transcript loci, ~1.6 transcripts per locus)

Reference mRNAs : 2102 in 1086 loci (1856 multi-exon)

Super-loci w/ reference transcripts: 461

Recommend Projects

Recommend Topics

Recommend Org