esebesty / splicingfactory

Splicing Diversity Analysis for Transcriptome Data

Home Page: https://www.bioconductor.org/packages/release/bioc/html/SplicingFactory.html

License: GNU General Public License v3.0

R 100.00%

gini-index rna-seq shannon-entropy simpson-index splicing transcriptomics

splicingfactory's Issues

Package info updates

Update package URLs, emails, citation.

Implement IHW or similar for controlling FDR

Implement IHW or a similar method to control FDR and use average gene expression across samples or the number of transcripts as informative priors.
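A minimal sketch of what this could look like, assuming the Bioconductor IHW package is used and that mean gene expression serves as the covariate. The simulated `p_values` and `mean_expr` vectors are hypothetical stand-ins for the package's real output, and the fallback to plain Benjamini-Hochberg is only there so the sketch runs without IHW installed.

```r
set.seed(1)

# Simulated inputs (hypothetical): one p-value and one covariate per gene.
n_genes <- 5000
mean_expr <- rexp(n_genes)
p_values <- runif(n_genes)

# Give some highly expressed genes small p-values, so that the covariate
# is informative about power.
high_expr <- which(mean_expr > quantile(mean_expr, 0.8))
p_values[high_expr] <- p_values[high_expr]^4

if (requireNamespace("IHW", quietly = TRUE)) {
  # Independent hypothesis weighting with mean expression as covariate.
  ihw_res <- IHW::ihw(pvalues = p_values, covariates = mean_expr, alpha = 0.05)
  adj_p <- IHW::adj_pvalues(ihw_res)
} else {
  # Fallback: unweighted Benjamini-Hochberg adjustment.
  adj_p <- p.adjust(p_values, method = "BH")
}

# Number of genes passing the 5% FDR threshold.
sum(adj_p <= 0.05)
```

The number of transcripts per gene could be passed as the covariate in the same way.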

Calculate difference - label shuffling error

I encountered the following error when running calculate_difference() with label shuffling:
Error in ecdf(shuffled[i, ]) : 'x' must have 1 or more non-missing values

It turned out that rows (genes) containing only 0 values caused the issue.
It would be nice to have a solution or recommendation for handling this problem (e.g. simply dropping genes with only 0s before the difference calculation, or something else).

Another issue related to this problem: after calculating diversity, some rows contain a few very small values (values < .Machine$double.eps) that are treated as 0s, so these rows also cause errors:

Error in if (ecdf(shuffled[i, ])(log2_fc[i]) >= 0.5) { : 
  missing value where TRUE/FALSE needed
Calls: calculate_difference -> label_shuffling
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I avoided this issue with some pre-filtering:

library(dplyr)

# Convert really small values to 0s:
diversity_data[diversity_data < .Machine$double.eps] <- 0

# Filter out genes (rows) with only zeros across the sample columns:
diversity_data_filtered <- diversity_data %>%
  mutate(rowsum = rowSums(dplyr::select(., starts_with("dataset")))) %>%
  filter(rowsum != 0) %>%
  dplyr::select(-rowsum)
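Both errors can be reproduced in base R. This is a sketch under the assumption that the shuffled log2 fold changes of an all-zero gene become NaN (log2 of 0/0); it does not call the package's internal label_shuffling code.

```r
# An all-zero gene: every shuffled fold change is log2(0 / 0) = NaN.
shuffled_row <- log2(rep(0, 5) / rep(0, 5))

# ecdf() drops missing values first, so nothing is left to build from.
err_all_missing <- tryCatch(
  ecdf(shuffled_row),
  error = function(e) conditionMessage(e)
)
print(err_all_missing)

# A valid shuffled row, but the observed log2 fold change is NaN:
# evaluating the ECDF at NaN gives a missing value, which if() cannot test.
valid_ecdf <- ecdf(c(-0.4, 0.1, 0.3, 0.8))
ecdf_at_nan <- valid_ecdf(NaN)
err_if <- tryCatch(
  if (ecdf_at_nan >= 0.5) "upper half" else "lower half",
  error = function(e) conditionMessage(e)
)
print(err_if)
```

Dropping all-zero rows before the difference calculation, as in the pre-filtering above, avoids both cases.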

Build fails due to example data error

The package build fails because the vignette build fails: the example data currently contains only 4 samples instead of 40.

Use SummarizedExperiment rather than ExpressionSet

As required by the Bioconductor developers.

SummarizedExperiment for user-facing functions

Ideally, calculate_diversity (and calculate_method) should return a SummarizedExperiment, which is then the input for calculate_difference. This way, we can store gene/transcript annotation in a DataFrame accessible through rowData(), instead of in additional columns of the data.frame that contains the expression/diversity values.

Besides SummarizedExperiment, should calculate_difference also accept a matrix and data.frame as input or do we ask users to create a SummarizedExperiment object?

calculate_difference should return a data.frame, not a SummarizedExperiment object.
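A rough sketch of the proposed layout, assuming the SummarizedExperiment package is installed. The gene names, sample names and the `condition` column are made up for illustration; this is not the package's actual return value.

```r
# Hypothetical diversity values: 3 genes x 4 samples.
diversity <- matrix(
  c(0.2, 0.8, 0.5,
    0.3, 0.7, 0.4,
    0.6, 0.1, 0.9,
    0.5, 0.2, 0.8),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("sample", 1:4))
)

if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(diversity = diversity),
    # Gene annotation lives in rowData instead of extra data.frame columns.
    rowData = S4Vectors::DataFrame(genes = rownames(diversity)),
    # Sample annotation (e.g. condition labels) lives in colData.
    colData = S4Vectors::DataFrame(
      condition = c("Normal", "Normal", "Tumor", "Tumor"),
      row.names = colnames(diversity)
    )
  )

  # Annotation and values are retrieved with dedicated accessors.
  print(SummarizedExperiment::rowData(se)$genes)
  print(SummarizedExperiment::assay(se, "diversity"))
}
```

With this layout, a downstream function can read the condition labels from colData rather than taking them as a separate argument.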

User adjustable pseudocount for Laplace (Dirichlet) entropy

Laplace adds a pseudocount of 1 to all categories. If we allow the user to set the pseudocount, the function becomes a more general Dirichlet entropy calculation, where pseudocount = 1 means Laplace, pseudocount = 1/2 means Jeffreys, etc.
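The generalization can be sketched as follows. `dirichlet_entropy` is a hypothetical helper written for illustration, not the package's existing function, and it assumes entropy is reported in bits (log base 2) for the transcript-level expression values of a single gene.

```r
# Shannon entropy (in bits) of pseudocount-smoothed proportions.
# x: non-negative transcript-level expression values for one gene.
# pseudocount: value added to every category (1 = Laplace, 0.5 = Jeffreys).
dirichlet_entropy <- function(x, pseudocount = 1) {
  stopifnot(is.numeric(x), all(x >= 0), pseudocount > 0)
  p <- (x + pseudocount) / (sum(x) + pseudocount * length(x))
  -sum(p * log2(p))
}

tx_expr <- c(90, 10, 0)

laplace_entropy  <- dirichlet_entropy(tx_expr, pseudocount = 1)
jeffreys_entropy <- dirichlet_entropy(tx_expr, pseudocount = 0.5)

print(c(laplace = laplace_entropy, jeffreys = jeffreys_entropy))
```

A larger pseudocount pulls the proportions toward uniform, so for a skewed gene like this one the Laplace estimate is higher than the Jeffreys estimate.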

Bootstrap resampling, if we have enough samples. Do we? We can calculate the bootstrapped confidence interval for the log2 fold change of the category means or medians.
Jackknife for the log2 fold changes.
Bootstrap using kallisto/salmon/sailfish bootstraps, but here we also need to aggregate the bootstrap values across samples. In the previous method, the bootstrap refers to drawing random samples, while here we draw random sets of reads (kind of) for each sample. Use bootstrap values to calculate a 95% CI for the differential diversity results.
Beta regression with a likelihood ratio or Wald test using the normalized entropy values. Something like entropy ~ condition + tpm | tpm vs entropy ~ tpm | tpm, if we want to take into account the effect of gene expression on entropy.
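The first option, a bootstrap over samples, could look roughly like this for a single gene. The diversity values are invented, the percentile interval is just one of several possible bootstrap intervals, and `log2_fc` is a hypothetical helper.

```r
set.seed(42)

# Hypothetical diversity values of one gene in two conditions.
control <- c(0.42, 0.51, 0.38, 0.47, 0.55, 0.44, 0.49, 0.40)
treated <- c(0.61, 0.70, 0.58, 0.66, 0.73, 0.59, 0.64, 0.68)

# log2 fold change of the category means.
log2_fc <- function(a, b) log2(mean(b) / mean(a))

observed <- log2_fc(control, treated)

# Resample samples with replacement within each condition.
boot_fc <- replicate(
  2000,
  log2_fc(sample(control, replace = TRUE), sample(treated, replace = TRUE))
)

# 95% percentile bootstrap confidence interval.
ci <- unname(quantile(boot_fc, c(0.025, 0.975)))

print(observed)
print(ci)
```

Replacing mean() with median() in `log2_fc` gives the median-based variant. With only a handful of samples per condition the interval will be coarse, which is the concern raised above.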

Update example dataset in documentation

Update example dataset to use a more recent dataset, with a larger number of genes, where the pre-selected genes showing differential diversity are selected based on mean difference and we use TPM.
