
scone's Introduction

SCONE

Single-Cell Overview of Normalized Expression data

SCONE (Single-Cell Overview of Normalized Expression) is a package for single-cell RNA-seq data quality control (QC) and normalization. This data-driven framework uses summaries of expression data to assess the efficacy of normalization workflows.

Install from Bioconductor

We recommend installing the package via Bioconductor.

if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("scone")

Install from Github

This is usually not recommended. To install the development version of the package, use:

BiocManager::install("YosefLab/scone")

Install for R 3.3

You can download the latest release of SCONE for R 3.3 here. This is useful only for reproducing old results.

scone's Issues

Apply weights to scoring.

Weighted PCA should be applied in evaluation step. One weighting scheme should be applied to evaluate all methods.

Update scone output

evaluation should be called metrics
metrics * scale factors should be called scores (greater = better)
and the mean of the scores is the way we sort the methods
Update manual page to reflect this.
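The proposed naming can be sketched as follows. This is only an illustration: the method names, metric values, and sign conventions below are hypothetical and not taken from the package. Metrics are multiplied by scale factors so that greater is better, and methods are sorted by their mean score.

```r
# Hypothetical metrics for two workflows (rows) and two metrics (columns).
metrics <- matrix(
  c(0.5, 0.3,   # PAM_SIL column
    0.8, 0.1),  # EXP_UV_COR column
  nrow = 2,
  dimnames = list(c("method_a", "method_b"), c("PAM_SIL", "EXP_UV_COR"))
)

# Assumed scale factors: +1 where larger is better, -1 where smaller is better.
scale_factors <- c(PAM_SIL = 1, EXP_UV_COR = -1)

# scores = metrics * scale factors (greater = better).
scores <- sweep(metrics, 2, scale_factors[colnames(metrics)], "*")

# Methods are sorted by the mean of their scores.
ranking <- sort(rowMeans(scores), decreasing = TRUE)
print(ranking)
```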

do not compare bio and batch

I am not familiar with R but I think even if nlevels are equal, you still cannot compare bio and batch.

    if(nlevels(bio)==nlevels(batch)) {
      if(all(bio==batch)) {
        stop("Biological conditions and batches are confounded. They cannot both be included in the model, please set at least one of 'adjust_bio' and 'adjust_batch' to 'no.'")
      }
    }
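To illustrate the concern: in R, comparing two factors with `==` raises an error when their level sets differ, so `all(bio==batch)` is fragile even when `nlevels` match. A minimal sketch of an alternative based on cross-tabulation follows; `is_confounded` is a hypothetical helper, not a function from the package.

```r
bio   <- factor(c("a", "a", "b", "b"))
batch <- factor(c("x", "x", "y", "y"))

# Direct comparison fails because the level sets differ.
cmp <- tryCatch(bio == batch, error = function(e) e)

# Hypothetical helper: bio and batch are fully confounded when every
# bio level maps to exactly one batch level and vice versa.
is_confounded <- function(bio, batch) {
  tab <- table(droplevels(bio), droplevels(batch)) > 0
  all(rowSums(tab) == 1) && all(colSums(tab) == 1)
}

is_confounded(bio, batch)
```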

Add argument to preserve zeroes in output matrix

This is more in the spirit of the original imputation framework proposed.

hdf5 doesn't currently work with multiple cores

For now, we should add a check that stops with an error when ncores>1 and hdf5 is used.
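Such a guard might look like the sketch below. The function name and its arguments (`use_hdf5`, `ncores`) are assumptions for illustration and do not reflect the actual scone interface.

```r
# Hypothetical guard: refuse to combine hdf5 with multiple cores.
check_hdf5_cores <- function(use_hdf5, ncores) {
  if (isTRUE(use_hdf5) && ncores > 1) {
    stop("hdf5 is not currently supported with more than one core; ",
         "use a single core or disable hdf5.")
  }
  invisible(TRUE)
}

check_hdf5_cores(use_hdf5 = TRUE, ncores = 1)  # passes silently
```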

Give user the possibility to get both scores and normalized data

Zero Likelihood in ZINB

Zero-Inflated Negative Binomial fails at initialization for Pollen et al. Data Set: (for yoseflab) /data/yosef2/Published_Data/Fluidigm/filt_expr.Rdat

fnr_obj = estimate_zinb(verbose = TRUE,Y = e)
Error in while (abs((ll_old - ll_new)/ll_old) > 1e-04 & iter < maxiter) { : 
  missing value where TRUE/FALSE needed

Investigating this error shows that ll_old = -Inf.
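The failure can be reproduced in isolation: with an infinite `ll_old`, the relative change is `NaN`, the comparison yields `NA`, and `while (NA)` stops with "missing value where TRUE/FALSE needed". The sketch below shows this and one possible guard; `not_converged` is a hypothetical helper, and treating a non-finite likelihood as "not converged" is an assumption rather than the package's fix.

```r
ll_old <- -Inf
ll_new <- -1234.5

rel_change <- abs((ll_old - ll_new) / ll_old)  # -Inf / -Inf gives NaN
cond <- rel_change > 1e-04 & TRUE              # NA, which while() rejects

# Hypothetical guard: always returns a single TRUE or FALSE.
not_converged <- function(ll_old, ll_new, tol = 1e-04) {
  if (!is.finite(ll_old)) {
    return(TRUE)  # assumption: keep iterating until maxiter is reached
  }
  isTRUE(abs((ll_old - ll_new) / ll_old) > tol)
}
```

With this guard the loop condition would read `while (not_converged(ll_old, ll_new) && iter < maxiter)`. A likelihood that stays at -Inf would still run until `maxiter`, so better initialization may also be needed.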

Better use of params

If params is passed as an argument, the user should be able to just pass param instead of passing the correct parameters again.

stability based on PAM

get_normalized wrapper function

Finishing scone pipeline

Implement biplot
Implement PAM scoring
Robustness to error in BiocParallel

Complains about confounding even though adjust_batch="no" and adjust_bio="no"

  if(!is.null(bio) & !is.null(batch)) {
    tab <- table(bio, batch)
    if(all(colSums(tab>0)==1)){
      if(nlevels(bio) == nlevels(batch)) {
        stop("Biological conditions and batches are confounded. They cannot both be included in the model, please set at least one of 'adjust_bio' and 'adjust_batch' to 'no.'")
      } else {
        nested <- TRUE
      }
    }
  }
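One possible adjustment is to raise the error only when both adjustments are actually requested. The sketch below wraps the check in a hypothetical function, `check_design`, and assumes that "no" is the value that disables an adjustment.

```r
# Hypothetical wrapper around the check above; returns the `nested` flag.
check_design <- function(bio, batch, adjust_bio = "no", adjust_batch = "no") {
  nested <- FALSE
  if (!is.null(bio) && !is.null(batch)) {
    tab <- table(bio, batch)
    if (all(colSums(tab > 0) == 1)) {
      if (nlevels(bio) == nlevels(batch)) {
        # Only an error if both terms would enter the model.
        if (adjust_bio != "no" && adjust_batch != "no") {
          stop("Biological conditions and batches are confounded.")
        }
      } else {
        nested <- TRUE
      }
    }
  }
  nested
}

bio   <- factor(c("a", "a", "b", "b"))
batch <- factor(c("x", "x", "y", "y"))
check_design(bio, batch, adjust_bio = "no", adjust_batch = "no")  # no error
```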

Simplify sample filtering + add gene filtering

I think that the current functions for sample filtering are unnecessarily complicated. We should add a simple gene filtering function.

Need to figure this out before the workshop.

Plot scores

Biplot of PCA of evaluation scores

Custom adjustment

Add a way for the user to use their adjustment function rather than our linear model (e.g., ComBat).

ZINB FNR estimation does not always converge

R CMD check gives warning

* checking for missing documentation entries ... WARNING
Undocumented data sets:
  ‘cell_cycle_Tirosh’ ‘house_keeping_mouse_TitleCase’ ‘macklis_markers’

Public data analysis

Analyze data from scRNAseq with scone

Generate figures for scone paper

Default values for imputation and scaling?

no_batch, no_uv ==> no_bio

no_batch_no_uv_no_bio and no_batch_no_uv_bio are the same!

Score to Evaluate Clustering Stability (rather than tightness), as Summary of Co-Clustering Matrix

weighted PCA

talk to Sandrine

factor_sample_filter and metric_sample_filter should not produce pdf's

I think that it would be better to have an option plot=TRUE that just prints to the current plot device. If the user wants a pdf, they can call pdf() themselves.
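A sketch of the suggested interface, using a hypothetical filter (`filter_by_metric`) that is not the package's implementation: with `plot = TRUE` the function draws on whatever device is current, and the caller decides whether that device is a pdf.

```r
# Hypothetical filter: keep samples whose metric is within `zcut` robust
# z-scores of the median. Not the actual scone filtering logic.
filter_by_metric <- function(metric, zcut = 3, plot = FALSE) {
  z <- (metric - median(metric)) / mad(metric)
  keep <- abs(z) <= zcut
  if (plot) {
    # Draws on the current device; no file is opened here.
    hist(metric, main = "Metric distribution", xlab = "metric")
    abline(v = range(metric[keep]), col = "red", lty = 2)
  }
  keep
}

metric <- c(10, 11, 9, 10, 12, 50)
keep <- filter_by_metric(metric)
```

A user who wants a file would then call `pdf("filter.pdf")` before and `dev.off()` after `filter_by_metric(metric, plot = TRUE)`.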

Bug in number of PCs used for correlations?

When computing EXP_UV_COR and EXP_WV_COR only the first eval_pcs should be used.

scone() documentation error - bio argument

In the scone function documentation it says under the bio argument: "Ignored, if adjust_bio=0."

However, setting adjust_bio=0 gives the following error: "Error in match.arg(adjust_bio) : 'arg' must be NULL or a character vector"
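The error comes from `match.arg`, which accepts only `NULL` or a character vector. A minimal reproduction, assuming a set of character choices that includes "no" (the exact choices used by `scone` are an assumption here):

```r
# Hypothetical stand-in for the argument handling in scone().
f <- function(adjust_bio = c("no", "yes", "force")) {
  match.arg(adjust_bio)
}

f("no")                                   # valid: returns "no"
res <- tryCatch(f(0), error = function(e) conditionMessage(e))
print(res)                                # numeric input is rejected
```

The documentation would therefore need to describe the character value (for example `adjust_bio="no"`) rather than `adjust_bio=0`.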

Stabilize PAM param interface

The user currently selects a range of k for pam clustering (passed to fpc::pamk) but the resulting PAM_SIL score is based on complex considerations of that range. A few options:

Have the user select one value of eval_kclust.
(Optional) automatic eval_kclust selection.
Permit a range of eval_kclust, but limit options and/or wrap pamk more effectively.

implement biplot

Update set of cell cycle genes

We currently have an irreproducible set of genes from email correspondence; we should instead use this:
http://genome.cshlp.org/content/suppl/2015/10/07/gr.192237.115.DC1/Table_S1.xlsx

Custom evaluation function

Return design matrices

This is useful if people want to do DE adding UV factors in the model

Replace weight arguments with a (weighted) projection function

Currently, the user provides weights to be used for wPCA.

The user should instead provide a function (defaulting to PCA) that performs the projection (possibly weighted) and, if needed, computes the weights internally.

vignette

input: counts + QC matrix
gene + sample filtering
scone
exploration / visualization: heatmap + biplot
extract normalized matrix

error when eval=TRUE and only one method

If there is only one normalization method to "compare", scone won't work because it tries to call apply(., 1, .) on a vector (line 325 of scone_main.R in develop).
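One common way this arises in R is that subsetting a one-row matrix drops it to a vector, after which `apply` fails. The sketch below uses hypothetical scores and shows `drop = FALSE` as a possible remedy; whether this matches the code at the cited line is an assumption.

```r
# Hypothetical evaluation matrix with a single method (one row).
scores <- matrix(c(0.2, 0.5, 0.9), nrow = 1,
                 dimnames = list("method_a", c("m1", "m2", "m3")))

dropped <- scores[1, ]                # dimensions are lost: a named vector
err <- tryCatch(apply(dropped, 1, mean), error = function(e) e)

kept <- scores[1, , drop = FALSE]     # still a 1 x 3 matrix
row_means <- apply(kept, 1, mean)
```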

I don't see any obvious reason why one should evaluate only one normalization, but it should be fixed.

Handling of ties in FQ_FN

Add an option to have ties=TRUE (perhaps it should be the default).

Alternatively, it could be a different function FQ_T_FN, to make it easier to add both to scone comparison.
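To make the difference concrete, here is a small sketch of full-quantile normalization with and without tie handling. It is not the implementation of `FQ_FN`; the function name `fq_sketch` and the interpolation scheme for tied ranks are assumptions chosen for brevity.

```r
# Genes in rows, samples in columns.
fq_sketch <- function(x, ties = TRUE) {
  # Reference distribution: mean of the sorted columns.
  ref <- rowMeans(apply(x, 2, sort))
  method <- if (ties) "average" else "first"
  apply(x, 2, function(col) {
    r <- rank(col, ties.method = method)
    # Average ranks can be fractional, so interpolate the reference.
    approx(seq_along(ref), ref, xout = r)$y
  })
}

x <- cbind(s1 = c(0, 0, 5), s2 = c(1, 2, 3))
with_ties    <- fq_sketch(x, ties = TRUE)
without_ties <- fq_sketch(x, ties = FALSE)
```

With `ties = TRUE` the two zeros in the first sample receive the same normalized value; with `ties = FALSE` they are split arbitrarily by their order.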

Any preference @mbcole ?

Variance preserved measure for DE genes

Adding RLE metric to evaluation

Copying from an email conversation with @mbcole (quotes are by @mbcole)

On a different topic, I was wondering if we should add one more evaluation metric to scone. I was helping Sandrine run scone on Russell's data, and as in the Fluidigm data, no scaling often ranks higher than FQ, TMM and DESeq.

I think this is not what we want because when you look at box plots / RLE plots, you clearly see that without a scaling step the distributions are far from aligned. What do you think of a metric that compares the medians of the count distributions of each cell and penalizes methods for which the medians are very far from each other?

This definitely sounds like a good thing to keep track of - did you get a sense of why scale-free methods were scoring higher - which scores were inflating their approval? I have only 2 concerns with adding this one in: 1) This score is tailored for normalization by the median 2) Is the median always non-zero? I know that in some of the data sets we’ve had median zero across many samples.

I think these are not big issues because 1) we usually don't consider median normalization in the comparison; 2) we can use the median of the RLE distribution rather than that of the counts.

I will implement this and see if it's useful at all. If not, we can get rid of it later.
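A rough sketch of such a metric, under the assumption that it is computed as the mean squared per-sample median of the RLE (relative log expression) values; the function name `rle_med_sketch` and this exact summary are illustrative, not the eventual implementation.

```r
# log_expr: log-scale expression, genes in rows and samples in columns.
rle_med_sketch <- function(log_expr) {
  # RLE: deviation of each gene from its median across samples.
  rle <- log_expr - apply(log_expr, 1, median)
  # Penalize samples whose RLE median is far from zero.
  mean(apply(rle, 2, median)^2)
}

genes   <- 1:5
aligned <- cbind(genes, genes, genes)      # identical distributions
shifted <- cbind(genes, genes, genes + 2)  # third sample is offset

rle_med_sketch(aligned)
rle_med_sketch(shifted)
```

Smaller values indicate better-aligned samples, so the metric would enter the scores with a negative sign.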

develop branch: R CMD check fails

R CMD check on the develop branch fails with an error:

[HB-X201]{hb}: R CMD check scone_0.0.2.tar.gz
* using log directory 'C:/Users/hb/braju.com.R/_GITHUB_forks/scone.Rcheck'
* using R version 3.2.4 Patched (2016-03-10 r70306)
* using platform: x86_64-w64-mingw32 (64-bit)
* using session charset: ISO8859-1
* checking for file 'scone/DESCRIPTION' ... OK
* this is package 'scone' version '0.0.2'
* checking CRAN incoming feasibility ... NOTE
Maintainer: 'Michael Cole <[email protected]>'
New submission

Strong dependencies not in mainstream repositories:
  scde

The Title field should be in title case, current version then in title case:
'Single Cell Overview of Normalized Expression data'
'Single Cell Overview of Normalized Expression Data'

The Description field should not start with the package name,
  'This package' or similar.

The Date field is not in ISO 8601 yyyy-mm-dd format.
* checking package namespace information ... OK
* checking package dependencies ... ERROR
Packages required but not available:
  'EDASeq' 'RUVSeq' 'diptest' 'fpc' 'mixtools' 'scde'

Namespace dependency not required: 'clusterCells'

See section 'The DESCRIPTION file' in the 'Writing R Extensions' manual.
* DONE

Status: 1 ERROR, 1 NOTE

PS. I'd like to suggest that the package version of the develop branch use the suffix -9000 (e.g. 0.0.2-9000) so that it is clear from the version itself that develop is being used. This style has gained a fair bit of use recently (Hadley-verse of course).

conditional_pam & (max(eval_kclust) >= min(table(bio)))) throws warning when bio is null

Prepare data

Cf. scRNAseq package

data processing
metadata for running scone and evaluating performance

Robustness to BiocParallel

Handling an error from a normalization method (not propagated)
Handling error in the projection

Add custom UV factors as a matrix (e.g. from sva)

bug in scone_eval when batch is NULL

if( !is.null(batch) | !any(!is.na(batch)) ){
  KNN_BATCH = mean(attributes(knn(train = proj[!is.na(batch),],test = proj[!is.na(batch),],cl = batch[!is.na(batch)], k = eval_knn,prob = TRUE))$prob)
}else{
  KNN_BATCH = NA
}

fails when batch=NULL because !any(!is.na(NULL)) is TRUE
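The logic can be checked without running `knn`. The sketch below contrasts the condition as written with one that is presumably intended (batch is non-NULL and has at least one non-missing value); both helper names are hypothetical.

```r
# Condition as written in the snippet above.
cond_buggy <- function(batch) !is.null(batch) | !any(!is.na(batch))

# Presumed intent: batch exists and has at least one non-NA entry.
cond_fixed <- function(batch) !is.null(batch) && any(!is.na(batch))

cond_buggy(NULL)  # TRUE, so the KNN_BATCH branch is entered with batch = NULL
cond_fixed(NULL)  # FALSE, so KNN_BATCH would be set to NA
```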

submat = subsampleClustering(proj, k=k)

It should be

submat = subsampleClustering(t(proj), k=k)

because the subsampleClustering documentation says it needs samples in columns.

Return values in biplot

invisibly, probably.

R version documentation

It seems that R >= 3.3 is required for the library. If this is true, it would be helpful to put that information in the install instructions.

Have a great day!