davidaknowles / leafcutter Goto Github PK

Annotation-free quantification of RNA splicing. Yang I. Li, David A. Knowles, Jack Humphrey, Alvaro N. Barbeira, Scott P. Dickinson, Hae Kyung Im, Jonathan K. Pritchard

Home Page: http://davidaknowles.github.io/leafcutter/

License: Apache License 2.0

R 41.47% Python 14.81% Shell 1.53% Perl 4.69% C++ 36.10% Stan 1.34% Batchfile 0.07%

leafcutter's People

Contributors

Stargazers

Watchers

Forkers

xtmgah lpantano rgunning mpvillerius q-kim awalsh17 zhanxw weichengtang slowkow rmagoglia david-a-parry mgandal ddpinto anykine luna5124 scchess danielnavarrogomez yanxiasklmg jchenpku noahpieta xflicsu gmumuxi tw7649116 wangdi2014 dnousome freestatman snaqvi1990 monkeylearner koduu yiminghu tkzeng mw201608 adawon renzhonglu loganylchen mdshw5 jgmeyerucsd wenmm fengyq wz950818 chrishah gouthamatla 11649070 bfairkun bkutlu hmontenegro harel1 huangjingying zhushen3128 xjyx yx-xu oxb-dc maegsul xuyongyue amrr101 garrettjenkinson dantaki tbabola gordonaa hj1994412 lonet-n boxiangliu biobit jackhump iain-s borg-org dawei1203 jmerkin vinayakvsv standardgalactic zjardyn jianguozhou3 theron-palmer dimitrimeistermann kur1sutaru vincenttano francois-a yfarjoun meenal002 lucafusarbassini pairxbio neumoratx wook2014 karlaarz hsun3163 ethanwillis bek myheadintheclouds josephlalli tianwen0003 guliba jacarey hannahhanle fulcrumgenomics barrymoore adelpozomt prabindh nextrnatx abhilashdasari reinertanalytics

leafcutter's Issues

leafcutter_ds.R not recognizing other organism .txt.gz exon file

Hello,

I have tried converting USCS, Ensembl and gencode mm10 (mm14 for gencode) .gtf files to the .txt.gz using the gtf_to_exons.R script but when I run leafcutter_ds.R I get an error that the file does not exist even though I am passing the correct location/directory for the generate .txt.gz

Calculation of PSI

Hi there,

I've been playing with LeafCutter, and going through the code and manuscript, I can't quite figure out how the PSI values are calculated. For example, in Figure 2a in the manuscript, I can't quite figure out how the PSI values were calculated for the clustering.

I've been able to run through the example data set, and have loaded all the count data into R. Is there some function included in the package for converting the count data to PSI, or is that calculated elsewhere? Thanks!

Leafviz annotation caveats

Gtf files from ensembl usually only has chromosome number as in the chromosome column.

It seems the prepare_results.R only works with "chr" implemented annotation.

If I use directly from original ensembl gtf, it will popup this error message:

Error in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y, check_na_matches(na_matches)) :
Can't join on 'chr' x 'chr' because of incompatible types (integer / character)
Calls: %>% ... left_join -> left_join.tbl_df -> left_join_impl -> .Call
Execution halted

adding grid and gtable packages as dependencies

@davidaknowles - downloading pretty tables now uses the gtable and grid packages, can you add these as dependencies to the the R package please?

Segfault from bedtools in prepare_results.R

I'm trying to run prepare_results.R on the YRI vs EU example in /worked_out_example. At the line

all.introns_intersect <- fread(all.introns.cmd)

bedtools is segfaulting:

***** WARNING: File ../leafviz/annotation_codes/gencode_hg19/gencode_hg19_all_introns.bed.gz has inconsistent naming convention for record:
GL000220.1	154617	154624	"CH507-513H4.5"	"ENSG00000281383.1_2"	-	"ENST00000629969.1_1"	1	"lincRNA"	"OTTHUMG00000189717.1_2"

sh: line 1: 17448 Segmentation fault: 11  ( bedtools intersect -a for_leafviz/all_junctions.bed -b ../leafviz/annotation_codes/gencode_hg19/gencode_hg19_all_introns.bed.gz -wa -wb -loj -f 1 )

I think the has inconsistent naming convention for record is todo with checking for e.g. chr1 vs 1 naming of chromosomes, but nothing like that is going on there. Maybe bedtools doesn't like the GL000220.1 chromosome name for some reason? This is my all_junctions.bed:
https://www.dropbox.com/s/43eq0zdtv27owh7/all_junctions.bed?dl=0
and gencode_hg19_all_introns.bed.gz is here:
https://www.dropbox.com/s/pt1pbh5r40pjjs8/gencode_hg19_all_introns.bed.gz?dl=0

LeafViz mysteriously stalls

This is a bit of a weird one but if you rapidly click between different clusters or switch tabs a few times the Shiny session just stalls. @davidaknowles can you confirm this happens for you too? It doesn't grey out (which would be a crash) but all the reactive objects (the PCA plot, the rows of the cluster view etc) stop responding to clicks.

I think this to do with the reactivity logic I have set up. It's pretty messy so I'll try to streamline it.

License?

What is the license to this code? There is no LICENSE file and I don't see a license referenced in the README.

From Github's article on OS licensing

Generally speaking, the absence of a license means that the default copyright laws apply. This means that you retain all rights to your source code and that nobody else may reproduce, distribute, or create derivative works from your work.

Github has created a helpful website for choosing licenses.

weighting samples by group/individual

I am trying to use leafcutter to perform a joint analysis of sex bias in splicing across different species. Essentially, I am modeling Species and Sex as covariates, and then looking for an effect of Sex across species. The issue is that one species (human) comes from a different dataset (GTEx, as opposed to the other species that I've generated data for myself), has a much larger sample size, and is generally more variable (being an outbred population). The effect is that the sex bias in human "dominates" the analysis, i.e. splicing events sex-biased in humans only show up as sex-biased in the joint analysis regardless of any effect in the other species, because there are so many more samples. When looking for sex bias in gene expression, I can deal with this issue using the random effect and sample weighting functionality in the limma/voom package.

I'm wondering if it would be possible to do something similar with leafcutter. The possibility of random effect has been discussed in #44, but I'm wondering about the possibility of sample weighting. The approach I'm using for gene expression is detailed in: https://academic.oup.com/nar/article/43/15/e97/2414275#39424591
Obviously this exact approach is not possible with the DM GLM, but I'm wondering if it would be possible to supply some user-determined sample weights (I would imagine based on the N in each species) and have that be incorporated in the modeling.

Thanks
Sahin

update STAR params to reflect later versions of parameter options.

http://davidaknowles.github.io/leafcutter/articles/Usage.html (Differential splicing)

Alignment params : STAR --genomeDir hg19index/ --twopassMode --outSAMstrandField intronMotif --readFilesCommand zcat --outSAMtype BAM

No longer default in STAR STAR-2.5.3a.

Suggest: --twopassMode Basic
Suggest: --outSAMtype BAM Unsorted or --outSAMtype BAM SortedByCoordinate

Depending on that leafcutter expects from STAR.

PSI and covariates

Hello,
I was wondering if there is a way to get additional output when adjusting for covariates, so that it is possible to:

output the PSI values at the sample level after covariate fitting? Right now it only seems possible to output adjusted PSI values per group or otherwise raw PSI values per sample; and
use it for the PCA plots in leafviz.
Also, it looks like that when using covariates, the PSI values in leafviz (i.e. psi given per group for a significantly DS cluster) aren't the fitted PSI values.

Thanks!

LeafViz Gene-level vizualization: Error: nrow(myclusters) > 0 is not TRUE

Hello there,

I am encountering a couple of issues with the Gene-level visualization panel of LeafViz.

Re: Error: nrow(myclusters) > 0 is not TRUE

The nrow(myclusters) error is generated in an inconsistent manner when running run_leafviz.R with the same leafviz.RData:
A) A splicing event visualization plot is generated successfully for a gene, but a gene-level plot fails in one session, but not a second session.
B) A splicing event-plot is generated for but the gene-level plot fails, but after viewing gene-level plots for other genese, the failing gene will, in the same session, produce a plot.

prepare_results.R ran without any warnings or errors.

run_leafviz.R output with an example fail:

./run_leafviz.R $ANALYSISDIR/leafviz/$ANALYSIS"_leafviz.RData"
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
Attaching package: ‘DT’
The following objects are masked from ‘package:shiny’:
dataTableOutput, renderDataTable
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
[1] "Loading results from .../leafviz.RData"
Listening on http://......

Attaching package: ‘shinyjs’
The following object is masked from ‘package:intervals’:
show
The following object is masked from ‘package:Rcpp’:
show
The following object is masked from ‘package:shiny’:
runExample
The following objects are masked from ‘package:methods’:
removeClass, show

TableGrob (2 x 1) "arrange": 2 grobs
z cells name grob
1 1 (1-1,1-1) arrange gtable[layout]
2 2 (2-2,1-1) arrange gtable[layout]
Warning: Error in : nrow(myclusters) > 0 is not TRUE
Stack trace (innermost first):
104: stopifnot
103: make_gene_plot
102: print
101: withCallingHandlers
100: suppressWarnings
99: renderPlot [/Users/anna/bin/leafcutter/leafviz/server.R#209]
89: reactive:plotObj
78: plotObj
77: origRenderFunc
76: output$select_gene_plot
1: shiny::runApp

I can send more information if you let me know what would be useful?

leafcutter still not recognizing mm10 .txt.gz file in leafcutter_ds.R

I tried re-running with de-bugged leafcutter_cluster.py but I am still gettting this error when running leafcutter_ds.R
Loading required package: Rcpp
Error in parse_args(OptionParser(usage = "%prog [options] counts_file groups_file", :
required at most 2 positional arguments, got 3
Execution halted

I used your gtf_2_exon.R on the gencode.vM14.annotation.gtf.gz

subsequently this .txt.gz file will not work in the plotting splice junctions as using the Leafcutter Shiny App @davidaknowles @jackhump

LeafViz improvements

Hideable full width panel saying "Welcome to LeafViz" and some description of what this is, usage, and link to documentation on viewing your own data.
Default landing to an interesting gene, e.g. Rbfox1
Gene view clusters labels currently overlap e.g. for Rbfox1, maybe print diagonally? (not Diagon Alley)
leafvis -> LeafVis (on menu bar)
Make logo and LeafViz link to homepage
Add data title somewhere, e.g. instead of “Differential splicing events” have “GTEx Brain vs Heart differential splicing events” (this is available as experimentCode)

"all.introns$ensemblID" related error during preparing results using leafviz/prepare_results.R

Hi all,

Thanks for developing LeafCutter, it works great for my dataset so far! However, I have an issue with leafviz/prepare_results.R scripts for visualizing the results.

I tried both annotation files provided by you and the annotation files I've created; but I am receiving the following error anyway:

Loading required package: leafcutter
Loading required package: Rcpp

Attaching package: âdplyrâ

The following objects are masked from âpackage:data.tableâ:
between, first, last
The following objects are masked from âpackage:statsâ:
filter, lag
The following objects are masked from âpackage:baseâ:
intersect, setdiff, setequal, union
Preparing for visualisation
Results to be saved in: OLBE
Using annotation at: /home/fahri/leafcutter_RNASeq_LCL_77/leafcutter/leafviz/gencode_hg19_fk/gencode_hg19_fk
Loading counts from RNAseq_brainFC_d12938out_perind_numers.counts.gz
Loading metadata from RNASeq_brainFC_diff_intron_withCOV_d12938out.txt
Loading exons from /home/fahri/leafcutter_RNASeq_LCL_77/leafcutter/leafviz/gencode_hg19_fk/gencode_hg19_fk_all_exons.txt.gz
Read 999773 rows and 10 (of 10) columns from 0.096 GB file in 00:00:03
Read 999773 rows and 10 (of 10) columns from 0.096 GB file in 00:00:03
Read 999773 rows and 10 (of 10) columns from 0.096 GB file in 00:00:03
[1] "Annotating junctions"
[1] "Preparing results"
Warning message:
In is.na(all.introns$ensemblID) :
is.na() applied to non-(list or vector) of type 'NULL'
Error in $<-.data.frame(*tmp*, transcripts, value = ".") :
replacement has 1 row, data has 0
Calls: $<- -> $<-.data.frame
Execution halted

Is it related to the fact that the annotation files do not have a header (except the exon file), hence there is no all.introns$ensemblID ? Or something else (high likely)? Do you have any idea what the problem could be?

Thanks in advance!

+ and - clusters

Hi,
Thank you for building leafcutter! The results look promising, but it is puzzling to us why there are two distinct clusters for the same splicing event. When we stepped through the code, we realized that the '+' and '-' suffix on each cluster seemed to be indicating reads mapped on the forward or reverse strands.

Our question is:
If our RNA-sequencing isn't performed on a stranded kit, is there some way to disable the strandedness and have these identified clusters be "merged"?

Thanks,
Ellen

Does the leafcutter works for time series experiment?

Hi,
I have RNA-Seq samples from tomato fruit development stages. I want to identify sig changed AS from these stages. In the example of leafcutter, I can only make comparisons such as:
stage 1 vs stage 2
stage 2 vs stage 3
stage 1 vs stage 3.

How can I perform statistical analysis in a time-series way?
Thanks.

sQTL mapping: how to deal with 0/0

Hi David,

Thanks for making leafcutter available. I am wondering what your strategy is to deal with 0/0 in sQTL mapping. I have a few thoughts but none are quite optimal. The first one is to simply through these samples out (for the particular introns). However, neither matrix eQTL or fastqtl will be happy with missing data. Another thought is to consider them zero. However this perhaps does not reflect what is really going on. The final thought is to maybe use Laplace smoothing, making the proportion 1/N where N is the total number of introns in the cluster. The last one makes the most sense to me.

Do you have any thoughts on this?

Bosh

clustering counts

Hello, I've re-aligned the Geuvadis RNA-Seq data using STAR. I've made junc files using leafcutter, as well as clustered all the data using two settings: -m10 and the default -m50. I'm trying to understand the numbers reported in perind.counts.gz using IGV. Here are the numbers:

for -m 10: 8/24
for -m 50: 7/21

Here is a screenshot of the bam with the junction in question (20:270948-271056) highlighted in orange:

Can you help me interpret these counts based on this picture?

Thanks a lot.

[BUG] PCA from prepare_phenotype_table.py has incorrect dimension

HI @davidaknowles,

The PCA output from prepare_phenotype_table.py has N columns, where N is the number of clusters. Instead, I think one wants M columns, where M is the number of individuals. On line 149 there is matrix = np.transpose(matrix) # important bug fix, which seems to be causing the problem.

'Rcpp::loadRcppModules' is deprecated.Use 'loadModule' instead.

Hello,

Thank you for building this tool. I was running the working example and I keep getting a warning message about deprecated load used for Rcpp. Please see below :

merging 10 junction files...
merging 10 files done.
Loading required package: Rcpp
Warning messages:
1: replacing previous import ‘dplyr::filter’ by ‘stats::filter’ when loading ‘leafcutter’
2: replacing previous import ‘dplyr::lag’ by ‘stats::lag’ when loading ‘leafcutter’
3: 'Rcpp::loadRcppModules' is deprecated.
Use 'loadModule' instead.
See help("Deprecated")
Loading counts from ../example_data/testYRIvsEU_perind_numers.counts.gz
Loading metadata from ../example_data/test_diff_intron.txt
Encoding as CEU =0, YRI =1
Loading required package: doMC
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
Settings:
$output_prefix
[1] "leafcutter_ds"

$max_cluster_size
[1] Inf

$min_samples_per_intron
[1] 5

$min_samples_per_group
[1] 3

$min_coverage
[1] 20

$timeout
[1] 30

$num_threads
[1] 4

$help
[1] FALSE

Running differential splicing analysis...
Differential splicing summary:
statuses Freq
1 Success 523
2 <=1 sample with coverage>min_coverage 7
3 Not enough valid samples 71
4 <2 introns used in >=min_samples_per_intron samples 1
Saving results...
No exon_file provided.
Warning message:
In bind_rows_(x, .id) : Unequal factor levels: coercing to character
All done, exiting
Loading required package: Rcpp
Warning messages:
1: replacing previous import ‘dplyr::filter’ by ‘stats::filter’ when loading ‘leafcutter’
2: replacing previous import ‘dplyr::lag’ by ‘stats::lag’ when loading ‘leafcutter’
3: 'Rcpp::loadRcppModules' is deprecated.
Use 'loadModule' instead.
See help("Deprecated")
Loading counts from ../example_data/testYRIvsEU_perind_numers.counts.gz
Loading metadata from ../example_data/test_diff_intron.txt
Loading exons from ../leafcutter/data/gencode19_exons.txt.gz
Saving 3 plots to ds_plots.pdf
[1] "chr1:clu_11"
[1] CEU
Levels: CEU YRI

The Tool of Step 0

There is some problem so that I can't use STAR.And should I use olegoindex to build index to use olege? Could I use other tools to do step 0? Ididn't know too much of bioinfo.Could you help?

Error in covariate analysis when using smart initialization

Hello,
Running leafcutter using covariates with the latest git code gives the following error message:
Error in solve.default(t(xNull) %% xNull, t(xNull) %% y_norm): system is computationally singular: reciprocal condition number = ###

Running without covariates or with random initialization gives no errors. Any ideas what the issue may be?

Thanks!

Improved documentation: Step 1. Prepare the LeafCutter differential splicing results for visualisation

This part is unclear and an example that exactly matched the results from http://davidaknowles.github.io/leafcutter/articles/Usage.html would be really helpful.

Rscript prepare_results.R --iFolder
--oFolder
--support
--annotation_code <annotation_code>
--code --FDR \ iFolder - the folder that contains the results of the differential intron excision analysis. The script assumes all results share the same path and code.


oFolder - the folder where the results will be written.
support - the same support file as used for the previous analysis.
annotation_code - as before.
code - the same code used for the rest of the analysis, eg testYRIvsEU
FDR - the benjamini-hochberg false discovery rate with which to filter the results.

Usage.html Typo Step 3. Differential intron excision analysis leafcutter_ds.R syntax

leafcutter/docs/articles/Usage.html

Line 158 in ff05d2c

 <pre><code>../scripts/leafcutter_ds.R --num_threads 4 --exons_file=../leafcutter/data/gencode19_exons.txt.gz ../example_data/testYRIvsEU_perind_numers.counts.gz ../example_data/test_diff_intron.txt</code></pre> 

replace --exons_file with --exon_file

Re conflict: ln 13 leafcutter_ds.R : make_option(c("-e","--exon_file")

Version 2.0, January 2004, sourced 25/04/2018
git clone https://github.com/davidaknowles/leafcutter

Error: segfault from C stack overflow

Hi,

when I ran leafcutter with the following commands, I got the "Error: segfault from C stack overflow" error:

My code:
./prepare_results.R --meta_data_file diff_introns.CaseVsCont.txt --code leafcutter YRIvsEU_perind_numers.counts.10166_Exc.gz leafcutter_ds_cluster_significance.txt leafcutter_ds_effect_sizes.txt lv3_robust.trnscpt.hg38.annotation -f 0.05

I would be glad, if you could help.

Thanks.

Error when using exon file

Hi, When I use the exon file with the example or my data, I get the following error:
Saving results...
Loading exons from /leafcutter/leafcutter/data/gencode19_exons.txt.gz
Error in try({ :
unused argument (error = function(err) warning(as.character(err)))
Execution halted

I have tried changing the files to all use chr as well as removing chr from all files. Do you know what the problem could be?

Enhancement: Add stats to ds_plots.pdf

It would be useful if ds_plots.pdf was standalone and contained:

Stats for the gene displayed, counts, etc.
Table of genes (with gene/cluster mapping) with same stats.

Possible bug in clustering approach

Hello,

After reading through the leafcutter_cluster.py code, I've noticed that there may be edge cases where introns are not clustered properly. Specifically, I mean that some of the final clusters (after the refine_clusters step has been run) may not actually form well-connected graphs as described in the paper.

For example, suppose you have three overlapping introns with coordinates (1, 10), (5, 10), and (5, 20), with respective read counts of 10, 5, 10. Cluster_intervals will group these introns together since they overlap. Then, refine_linked will connect this cluster of overlapping introns into a graph since (1, 10) and (5,10) share an edge and (5,10) and (5, 20) also share an edge.

However, during the refine_cluster step, the (5, 10) intron will be purged since it has a read count < 10. Then you are left with two introns: (1, 10) and (5, 20). These will be re-clustered together at line 367 since they still overlap. Refine_cluster will proceed to then return this cluster to the parent function refine_clusters where it will be written to output as a "refined cluster". However, despite overlapping, these two introns do not share an acceptor or a donor site and thus shouldn't form a "refined_cluster".

I believe this behavior is inconsistent with the description of the method in the paper. But, please let me know if I am misinterpreting something. Hope this is helpful.

Regards,
Dan

feature request: Make leafcutter agnostic of chromosome identifiers

Hi,

It would be fantastic - essential for me ;) - if leafcutter could process non-human or non-standard chromosome identifiers. Currently, it cannot, as per this line in leafcutter_cluster.py: chromLst = ["chr%d"%x for x in range(1,23)]+['chrX','chrY']+["%d"%x for x in range(1,23)]+['X','Y']
and skipping everything else entirely ( if chrom not in chromLst: continue ).
Use cases for this request abound and include:

organisms with chromosome counts exceeding the number 23 (zebrafish, dog, cow...)
organisms with different chr. naming schemes (cat, fruit fly, nematode)
unplaced scaffolds, alternate versions (haplotypes)
alignments against genome assemblies using different identifiers (e.g. NC_000075.6 via NCBI)

Arguably, leafcutter should make no assumptions whatsoever as to how chromosomes are identified.

Chromosome IDs could be provided by the user in a file, or (much preferred!) be generated on the fly based on a digest of the .junc files generated before, as per a preprocessing step based on something like: cut -f1 *.junc | sort | uniq

Thanks much,
Sebastian

PS: Related to issue #25

Problems recreating results from the example

I ran the code of steps 1 and 2 in the Usage section of the tutorial on the provided data, however running the script leafcutter_cluster.py provides a different file compared to the tutorial:

zcat testYRIvsEU_perind_numers.counts.gz | head

RNA.NA06986_CEU.chr1.bam RNA.NA06994_CEU.chr1.bam RNA.NA18486_YRI.chr1.bam RNA.NA06985_CEU.chr1.bam RNA.NA18487_YRI.chr1.bam RNA.NA06989_CEU.chr1.bam RNA.NA06984_CEU.chr1.bam RNA.NA18488_YRI.chr1.bam RNA.NA18489_YRI.chr1.bam RNA.NA18498_YRI.chr1.bam
chr1:880180:880422:clu_1_- 38 9 56 44 23 1 28 24 14 5
chr1:880180:880437:clu_1_- 3 6 15 17 4 0 3 11 4 1
chr1:999969:1001210:clu_2_- 42 23 12 5 4 25 7 14 27 8
chr1:999969:1001329:clu_2_- 17 10 2 8 1 6 6 4 17 4
chr1:1274033:1274667:clu_3_- 10 3 9 6 4 3 6 2 11 8
chr1:1274033:1274742:clu_3_- 9 7 9 6 6 0 3 6 6 9
chr1:1310170:1310377:clu_4_- 10 3 42 48 23 0 16 6 2 0
chr1:1310170:1310534:clu_4_- 40 19 163 313 112 0 60 58 13 14
chr1:1326245:1328776:clu_5_- 6 25 21 10 6 8 7 4 11 20

versus the data from the tutorial

RNA.NA06986_CEU.chr1.bam RNA.NA06994_CEU.chr1.bam RNA.NA18486_YRI.chr1.bam RNA.NA06985_CEU.chr1.bam RNA.NA18487_YRI.chr1.bam RNA.NA06989_CEU.chr1.bam RNA.NA06984_CEU.chr1.bam RNA.NA18488_YRI.chr1.bam RNA.NA18489_YRI.chr1.bam RNA.NA18498_YRI.chr1.bam
chr1:17055:17233:clu_1 21 13 18 20 17 12 11 8 15 25
chr1:17055:17606:clu_1 4 11 12 7 2 0 5 2 4 4
chr1:17368:17606:clu_1 127 132 128 55 93 90 68 43 112 137
chr1:668593:668687:clu_2 3 11 1 3 4 4 8 1 5 16
chr1:668593:672093:clu_2 11 16 23 10 3 20 9 6 23 31

could anybody tell me the commit used to create the data from the tutorial?
Best regards
Nicolas

Installation woes

I created an Anaconda R environment and downloaded leafcutter this way:

conda create -n leafcutter --yes r-essentials
source activate leafcutter
cd ~/workspace-git/
git clone https://github.com/davidaknowles/leafcutter

I tried to install using two different ways but got errors both times. First I tried the instructions in the README:

(leafcutter) [obotvinnik@tscc-login2 workspace-git]$ cd leafcutter/
(leafcutter) [obotvinnik@tscc-login2 leafcutter]$ R CMD INSTALL --build .
Warning: invalid package ‘.’
Error: ERROR: no packages specified

Then I tried moving up a directory and installing that way:

(leafcutter) [obotvinnik@tscc-login2 leafcutter]$ cd ..
(leafcutter) [obotvinnik@tscc-login2 workspace-git]$ R CMD INSTALL --build leafcutter
Warning: invalid package ‘leafcutter’

I'm not too familiar with R so I'm not sure how to proceed.

Missing R package pre-reqs

Hi there - I was attempting to run through the worked_out_example.sh script, and encountered a few hang-ups. It looks like the optparse and doMC packages were not installed along with LeafCutter when installing the package via devtools. Installing the packages manually did the trick though.

bam2junc: Unexpected cigar string (aligned with STAR 2.5.1b)

When running bam2junc.sh on a STAR-aligned and sorted bam-file, containing also unmapped reads.

unexpected CIGAR string: 9M13425N16M25H in HISEQ:56:D1W4DACXX:7:1206:7478:89627: CCATATTCACTTTGTAGAATGCCAT
 at ../scripts/sam2bed.pl line 206
        main::samToBed('HASH(0x1c8e810)', 1) called at ../scripts/sam2bed.pl line 101
Traceback (most recent call last):
  File "../scripts/filter_cs.py", line 26, in <module>
    sys.stdout.write(ln)
IOError: [Errno 32] Broken pipe

Scaffolded genomes update?

Great software, thanks. I had a quick suggestion. If software is designed to be annotation free, it would be nice to use it on genomes other than human. Would be good to change this line in leafcutter_cluster.py to account for non-human chromosome numbers, and scaffolded genomes:

chromLst = ["chr%d"%x for x in range(1,23)]+['chrX','chrY', ]+["%d"%x for x in range(1,23)]+['X','Y']

EDIT: strike that--I see you've got lines accounting for chroms not in chromLst

KeyError: 'chrom'

Hi,

When trying to prepare the leafcutt cluster output to run a splice QTL analysis I get the following error:

Traceback (most recent call last):
File "../scripts/prepare_phenotype_table.py", line 166, in
main(args[0], int(options.npcs) )
File "../scripts/prepare_phenotype_table.py", line 58, in main
chrom = dic['chrom'].replace("chr",'')
KeyError: 'chrom'

What can cause this error?

Cheers,
Marc

Run-to-run variability

Hello,
Running the leafcutter differential splicing analysis multiple times on the same case/control dataset using the exact same parameters and cluster count inputs, gives varying loglr and p value results between runs. This happens irrespective of using covariates or not.

Could you perhaps explain why the results differ per run? Is there a permutation or estimation step that causes the variation and, if so, can it be fixed by setting a particular seed value?

Thanks!

clustering does not works on other organism

the clustering does not work one other organism due to the chromosome list only accept number and x,y
line 25 of leafcutter_cluster.py
chromLst = ["chr%d"%x for x in range(1,23)]+['chrX','chrY']+["%d"%x for x in range(1,23)]+['X','Y']
could you please update it?

Issue with leafviz shiny app

The new updates to leafcutter are great! However, while I am able to get the shiny app to work for the example data, when I try to plot my own results I get the error "object 'gene_name_df' not found" when I click on any gene under Differential splicing events. Do you have any idea what is happening?

unable to install R package

Hello,

I get a segfault when I try to install the leafcutter R package (see traceback below). I'm wondering what version of R leafcutter was developed in (I'm using 3.3) as a first pass to see if this fixes the issue.

Thanks,
Jenny

Traceback:
1: .Call(Module__classes_info, xp)
2: Module(m, pkg, mustStart = TRUE)
3: doTryCatch(return(expr), name, parentenv, handler)
4: tryCatchOne(expr, names, parentenv, handlers[[1L]])
5: tryCatchList(expr, classes, parentenv, handlers)
6: tryCatch({ mod <- Module(m, pkg, mustStart = TRUE) if (isTRUE(direct)) { populate(mod, ns) } else { forceAssignInNamespace(m, mod, ns) } assign(.moduleMetaName(m), mod, envir = ns)}, error = function(e) { stop(sprintf("failed to load module %s from package %s\n%s", m, pkg, conditionMessage(e)))})
7: Rcpp::loadRcppModules()
8: fun(libname, pkgname)
9: doTryCatch(return(expr), name, parentenv, handler)
10: tryCatchOne(expr, names, parentenv, handlers[[1L]])
11: tryCatchList(expr, classes, parentenv, handlers)
12: tryCatch(fun(libname, pkgname), error = identity)
13: runHook(".onLoad", env, package.lib, package)
14: loadNamespace(package, lib.loc)
15: doTryCatch(return(expr), name, parentenv, handler)
16: tryCatchOne(expr, names, parentenv, handlers[[1L]])
17: tryCatchList(expr, classes, parentenv, handlers)
18: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
19: try({ attr(package, "LibPath") <- which.lib.loc ns <- loadNamespace(package, lib.loc) env <- attachNamespace(ns, pos = pos, deps)})
20: library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE)
21: withCallingHandlers(expr, packageStartupMessage = function(c) invokeRestart("muffleMessage"))
22: suppressPackageStartupMessages(library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE))
23: doTryCatch(return(expr), name, parentenv, handler)
24: tryCatchOne(expr, names, parentenv, handlers[[1L]])
25: tryCatchList(expr, classes, parentenv, handlers)
26: tryCatch(expr, error = function(e) { call <- conditionCall(e) if (!is.null(call)) { if (identical(call[[1L]], quote(doTryCatch))) call <- sys.call(-4L) dcall <- deparse(call)[1L] prefix <- paste("Error in", dcall, ": ") LONG <- 75L msg <- conditionMessage(e) sm <- strsplit(msg, "\n")[[1L]] w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w") if (is.na(w)) w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b") if (w > LONG) prefix <- paste0(prefix, "\n ") } else prefix <- "Error : " msg <- paste0(prefix, conditionMessage(e), "\n") .Internal(seterrmessage(msg[1L])) if (!silent && identical(getOption("show.error.messages"), TRUE)) { cat(msg, file = stderr()) .Internal(printDeferredWarnings()) } invisible(structure(msg, class = "try-error", condition = e))})
27: try(suppressPackageStartupMessages(library(pkg_name, lib.loc = lib, character.only = TRUE, logical.return = TRUE)))
28: tools:::.test_load_package("leafcutter", "/seq/jjenny/bin/R")
An irrecoverable exception occurred. R is aborting now ...

Little typo in demo

http://davidaknowles.github.io/leafcutter/articles/Usage.html

--exon(s)_file

Took me some time to realize this typo....

feature request: strand in perind.counts

It would be great to keep the strand information in the clustered junction output. It is very hard to get post-hoc. Or can you suggest an alternative?
Thanks a lot.

sQTL analysis

Hello, there is no mention of how sQTL analysis is done (which is claimed in the paper). I have completed steps 1 & 2 to get clustered junctions for my dataset and I also have genotypes for these samples (vcf). Is there a way to get the variants associated with the creation of novel splice junctions or the loss of a junction? Thank you very much.

chrX and chrY skipped in prepare_phenotype_table.py

Hi,

I noticed that prepare_phenotype_table.py excludes the X and Y chromosome when writing PSI values by chromosome. I have temporarily 'hacked' the script by setting chrX as chr23 and chrY as chr24, but is there any chance chrX and chrY could be included in the by chromosome output?

Thanks,
Sahin

Errors when running leafcutter_ds.R and ds_plots.R

Hi,
leafcutter_ds.R
Error in { :
task 1 failed - "object of type 'closure' is not subsettable"
Calls: leaf_cutter_effect_sizes -> bind_rows -> list_or_dots -> %do% ->
Execution halted

ds_plots.R
Error in strsplit(introns, ":") : non-character argument
Calls: make_differential_splicing_plot -> -> do.call -> strsplit
Execution halted

small sample size affecting cluster significance detection?

within my data I only have 3 replicates per sample so there are 6 total samples in a comparison. I am able to generate the leafcutter_ds_effect_size.txt and leafcutter_ds_cluster_significance.txt files but when I use them in the prepare_results.R script for visualization I get the following error:Error: No significant clusters found
However, I have been using other alternative splicing tools like rMATS, MISO and SUPPA and I get 100's of significant calls. So I was wondering if my small sample size is to blame for no significant cluster detection? @jackhump @goldenflaw

modeling random effects

Hello David,
Is there a way to model individual as random effect as part of the differential splicing analysis, so that one could account for non-independent samples (duplicates) or repeated measurements for the same individual?

Running quantify_psi, with ref to issue 34

Hi David

I have successfully run leafcutter for a differential analysis of splicing events across some samples that we have. The instructions were wonderfully clear and the results are compelling - thank you! I am now trying to retrieve the raw ratios or PSI scores per sample that would give us a bit more insight beyond the delta_psi output by differential leafcutter. I followed the discussion on issue 34 and tried, unsuccessfully, to run your quantify_psi script listed there. Is this the right script to use? I used the junc_clustering_perind_numers.counts matrix (output from the clustering algorithm) as input, script runs but the result is an object with the following message repeated over and over: requires numeric/complex matrix/vector arguments
Any advice?

Thanks!

effect_sizes.txt file empty

Dear developers,
I have tried the program with only 2 files WT and KO feeding the correct bam.junc file et the leafcutter_ds_effect_sizes.txt remains empty, is it a normal behavior due to the presence of only 1 replicate for each sample to be analysed for differential splicing ?

Any help would be greatly appreaciated.
Best

Issue with PSI calculated in effect_size_table

There is a subtle non-identifiability in the model between the vector of concentration parameters and the intercept, which means that the concentration parameters may sometimes compensate for the intercept. This shouldn't effect the likelihood ratio test but it means the the PSI and deltaPSI values we were calculating could be a bit off.

enrichment of high p-values

Hello David and Yang,

This is not an issue per se but I would like to hear your thoughts and opinions.

I mapped sQTLs with normalized leafcutter ratios, and the p-value distribution look non-uniform. In particular, there is a positive slope towards 1 as shown in the figure below.

I hypothesized that the positive slope is due to introns with low reads. For instance, if a particular intron only has a few reads per sample, it will produce low p-values regardless of the effect size. I therefore plotted the mean p-value against the binned reads as shown below.

This figure shows all bins besides the last 3 have median slighted above 0.5. This means that unless I set a threshold at mean count >=e^9.72 = 16647.24 reads, I will get non-uniform p-value distributions. However, this cutoff is unreasonably high.

Did you see similar things for your sQTL mappings? If so, how did you solve the problem?

P.S. My normalization method is similar to the GTEx eQTL normalization. First quantile normalize samples against each other and then normalized the ratio of each intron to a gaussian.

davidaknowles / leafcutter Goto Github PK

leafcutter's People

Contributors

Stargazers

Watchers

Forkers

leafcutter's Issues

Hi, When I use the exon file with the example or my data, I get the following error: Saving results... Loading exons from /leafcutter/leafcutter/data/gencode19_exons.txt.gz Error in try({ : unused argument (error = function(err) warning(as.character(err))) Execution halted

Recommend Projects

Recommend Topics

Recommend Org

Hi, When I use the exon file with the example or my data, I get the following error:
Saving results...
Loading exons from /leafcutter/leafcutter/data/gencode19_exons.txt.gz
Error in try({ :
unused argument (error = function(err) warning(as.character(err)))
Execution halted