Comments (7)
On second though, perhaps this is because I'm not specifying a GTF file, now that I did, the process takes much much longer (still waiting for it to finish). Is it now regenerating the lengths?
Another question: Are you defining the gene lengths (or the exon lengths?) as the combined length of all exons? So a projection of all exons on one axis and then computing the combined lengths? Meaning that exotic transcripts that may hardly occur may still add to gene/combined exon lengths?
And, related, one last question: irap_raw2metric takes as input "genes.raw.rsem.tsv" (if rsem was used), so if I'm not mistaken, that means that it only uses the gene,exon,transcript option to determine what lengths to use. In a normal RNAseq case you'd say you don't want intronic reads to count so leave them out and yet, in the logs I find that "--feature gene" is used by irap_raw2metric. Shouldn't that be "--feature exon"? Because I assume RSEM will also only use exonic reads.
Highest regards.
from irap.
Hi Freek,
The help message was a bit out of date - I just pushed a commit fixing that. irap_raw2metric does not generate a tsv.gz or html file. It may save the quantification in tsv format or a matrix market format.
Yes, if you provide the GTF it will calculate the lengths.
By the default the gene length is computed as the sum of the length of all collapsed
exons of the gene.
exon1 exon2
=-----= =-----=
exon3
=-----=
Collapsed
=-------= =-----=
Length:
=-------==-----=
Regarding you last point, the --feature option is to indicate which feature was quantified (gene, exon or transcript). If you provide a file with gene quantification you will want to pass --feature gene. Hopefully it is clear from the explanation above that the gene length does not include the introns. Please let me know if this is not clear.
Cheers.
from irap.
Hi Nuno,
Thank you, this is very helpful indeed.
I do have yet another question then.
In the --help output there is only the option to --exclude biotypes from the normalization process. So I made a list of all biotypes except protein_coding and supplied it using the --exclude option. However this does not seem to alter the results, I keep getting back a TSV with all genes TPM normalized as is done in the default pipeline.
I found an option --mass_biotypes in the script itself (not part of the help), I also tried to use that to normalize on protein coding genes only but I get the error: long flag "mass_biotypes" is invalid
Perhaps it is easier if I describe what I want.
What I want is to normalize on protein_coding genes only (this makes for more stable result in the face of more or less successful rRNA removal for example). I did this by filtering the raw output of rsem (genes.raw.rsem.tsv) for only protein coding genes and then running:
irap_raw2metric --tsv genes.raw.rsem.tsv --lengths Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --feature gene --metric tpm --out default_iRAP.tsv
(By the way I supply irap_raw2metric with biotype names as they are defined in the standard ensemble GTF file, i.e. protein_coding, lincRNA, rRNA, etc)
That works. However my colleagues and I are also interest in at least 1 non-protein coding gene. Which is why I'm playing with irap_raw2metric. I was hoping that I could normalize on protein coding genes but still get TPM values for non-protein_coding genes. I guess the then it is not really called TPM anymore because the sum of the values would be above 1 million? So if you have any thoughts on this they are very much appreciated as well. A possible solution would be include the biotype of the non-protein coding gene of interest perhaps, but this would mean that I start changing values if I add more biotypes of interest later on.
I still wonder how would a working command for TPM normalization look like on only protein_coding genes for irap_raw2metric? The following does not seem to work:
source /home/genomics/testsw/irap-0.8.5p2/irap_setup.sh
irap_raw2metric --tsv genes.raw.rsem.tsv --lengths Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --feature gene --metric tpm --out exclude_iRAP.tsv --exclude IG_pseudogene,Mt_tRNA,TR_D_gene,IG_C_pseudogene,Mt_rRNA,TR_J_gene,unitary_pseudogene,miRNA,TR_V_pseudogene,sRNA,bidirectional_promoter_lncRNA,pseudogene,IG_V_gene,antisense,rRNA,snRNA,IG_V_pseudogene,transcribed_processed_pseudogene,TR_J_pseudogene,misc_RNA,processed_pseudogene,IG_J_gene,snoRNA,IG_D_gene,sense_overlapping,IG_J_pseudogene,sense_intronic,scRNA,polymorphic_pseudogene,processed_transcript,unprocessed_pseudogene,TEC,ribozyme,vaultRNA,macro_lncRNA,3prime_overlapping_ncRNA,transcribed_unitary_pseudogene,IG_C_gene,TR_C_gene,TR_V_gene,transcribed_unprocessed_pseudogene,scaRNA,lincRNA,non_coding
I hope you don't mind all these questions, I very much appreciate your work on iRAP and your time.
Highest regards,
Freek.
from irap.
Hi,
I'll try to cover all the questions.
In the --help output there is only the option to --exclude biotypes from the normalization process. So I >made a list of all biotypes except protein_coding and supplied it using the --exclude option. However this >does not seem to alter the results, I keep getting back a TSV with all genes TPM normalized as is done >in the default pipeline.
This option (--exclude) is deprecated.
I found an option --mass_biotypes in the script itself (not part of the help), I also tried to use that to >normalize on protein coding genes only but I get the error: long flag "mass_biotypes" is invalid
Please try the irap_raw2metric in the new version (1.0.0a). The help should be
informative and it should work.
(By the way I supply irap_raw2metric with biotype names as they are defined in the standard ensemble >GTF file, i.e. protein_coding, lincRNA, rRNA, etc)
Those are the only biotypes that iRAP knows about.
That works. However my colleagues and I are also interest in at least 1 non-protein coding gene. Which >is why I'm playing with irap_raw2metric. I was hoping that I could normalize on protein coding genes but >still get TPM values for non-protein_coding genes. I guess the then it is not really called TPM anymore >because the sum of the values would be above 1 million?
Yes, for the non-coding genes is a modified version of TPM.
So if you have any thoughts on this they are very much appreciated as well. A possible solution would be include the biotype of the non-protein coding gene of interest perhaps, but this would mean that I start changing values if I add more biotypes of interest later on.
Yes, including the biotype of the non-protein coding gene is the most
clean and easy to explain option. Assuming that your non-coding gene is a lincRNA, you may normalize the expression based solely on the expression of protein coding and
lincRNA by running irap_raw2metric with --mass_biotype
"protein_coding,lincRNA".
I still wonder how would a working command for TPM normalization look like on only protein_coding genes for irap_raw2metric? The following does not seem to work:
The following should work....
source /home/genomics/testsw/irap-1.0.0/irap_setup.sh
irap_raw2metric --tsv /rst1/2017-0205_illuminaseq/data/seqData/analyzed/180222_NB501997_0051_AHTFJNBGX3/0051_P2017SEQE66S18_S1/0051_P2017SEQE66S18_S1/star/rsem/genes.raw.rsem.tsv --lengths /rst1/2017-0205_illuminaseq/data/references/Reference_Genomes/GRCh38.87/Annotations/Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --feature gene --metric tpm --out /rst1/2017-0205_illuminaseq/scratch/swo-358/exclude_iRAP.tsv --mass_biotypes protein_coding
I hope you don't mind all these questions, I very much appreciate your work on iRAP and your time.
Thank you for asking ;-)
Cheers.
from irap.
Hi Nuno,
I'm still getting unexpected results.
(Note from below I removed all folder structures to make it more readable.)
First I use the standard normalization that irap_raw2metric offers:
(one possible source of error is that I use the Rdata file generated bji iRAP-0.8.5.p2 by the way)
Script:
source /home/genomics/testsw/irap-1.0.0a/irap_setup.sh
irap_raw2metric -i genes.raw.rsem.tsv --lengths Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --gtf Homo_sapiens.GRCh38.87.gtf --feature gene --metric tpm --out default_iRAP.tsv
output:
FO 28/03-14:54] Loading genes.raw.rsem.tsv done.
[INFO 28/03-14:54] Loading Homo_sapiens.GRCh38.87.gtf.lengths.Rdata
[INFO 28/03-14:54] Loading Homo_sapiens.GRCh38.87.gtf.lengths.Rdata done.
Read 2575494 rows and 9 (of 9) columns from 1.312 GB file in 00:00:20
GTF attributes ensembl90
[INFO 28/03-14:55] biotype col:gene_biotype
[INFO 28/03-14:55] GTF file loaded: Homo_sapiens.GRCh38.87.gtf 703935 entries
[INFO 28/03-14:55] Saved default_iRAP.tsv
So far so good, I get the same value as from a standard iRAP run, as expected. Now using protein_coding based normalization:
Script:
source /home/genomics/testsw/irap-1.0.0a/irap_setup.sh
irap_raw2metric -i genes.raw.rsem.tsv --lengths Homo_sapiens.GRCh38.87.gtf.lengths.Rdata --gtf Homo_sapiens.GRCh38.87.gtf --feature gene --metric tpm --out protein-coding_iRAP.tsv --mass_biotypes protein_coding
output:
[INFO 28/03-15:36] Loading genes.raw.rsem.tsv
[INFO 28/03-15:36] Loading genes.raw.rsem.tsv done.
[INFO 28/03-15:36] Loading Homo_sapiens.GRCh38.87.gtf.lengths.Rdata
[INFO 28/03-15:36] Loading Homo_sapiens.GRCh38.87.gtf.lengths.Rdata done.
Read 2575494 rows and 9 (of 9) columns from 1.312 GB file in 00:00:19
GTF attributes ensembl90
[INFO 28/03-15:37] biotype col:gene_biotype
[INFO 28/03-15:37] GTF file loaded: Homo_sapiens.GRCh38.87.gtf 703935 entries
[INFO 28/03-15:37] Found protein_coding
[INFO 28/03-15:37] #Features used to compute the mass:702663
[INFO 28/03-15:37] Saved protein-coding_iRAP.tsv
Something is wrong because the this gives exactly the same results as before (58051 genes with the same mean expression.)
The only way I have been successful in renormalizing was by feeding the script a list of only protein coding genes (~19000 genes).
Am I still doing something wrong here? Should I be using a new Rdata-file, generated with iRAP-1.0.0a? By the way, the command above reports that using --mass-biotypes requires a GTF but I don't see it using it.
Highest regards,
Freek.
from irap.
Hi Freek,
Thank you for the detailed description. Indeed the parameter mass_biotypes was being ignored when computing TPMs - for rpkm, fpkm, fpkm-uq and uq-fpkm was working as expected. I suspect that my subconscious had some objections to change the TPM formula ;-)
The code irap_raw2metric's code (in devel) was changed for the parameter mass_biotypes to work as expected while computing TPMs. A new release should come out by the end of the week. Please let me know if it works for you...or not.
Cheers.
from irap.
Closing this issue since the new version is finally out. Cheers.
from irap.
Related Issues (20)
- aux/R/irap_de.R::process.cmdline.args mis-casts --min (opt$min_count) as a string, should be integer HOT 1
- sample name do not match
- fastq_pre_barcodes.c:34:17: fatal error: bam.h: No such file or directory
- Downloads to a submodule?
- iRAP pipeline for EMBL-EBI Expression Atlas HOT 1
- Configuration file for strand specific protocol
- mapper options in config file is ignored HOT 3
- our_prefix error when analysis name is same as the raw read file name without suffix HOT 2
- failed to load GTF file HOT 3
- RSEM strandness HOT 12
- TPM values rounded to 2 decimals HOT 3
- Docker link in wiki is wrong HOT 1
- Error running Docker image HOT 2
- iRAP's Kallisto call seems to provide read length at fragment size parameter HOT 1
- sh: 1: set: Illegal option -o pipefail error in "irap_gtf_to_fasta". HOT 4
- iRAP does not recognise stage0 completion for large genomes with HISAT2 HOT 5
- docker image for v1.0.1 HOT 2
- docker image 1.0.1 not working HOT 1
- No rule to make target HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from irap.