davismcc / archive-scater
An archived version of the scater repository, see https://github.com/davismcc/scater for the active version.
After reading the Salmon bioRxiv preprint I am sufficiently impressed. It would be very nice to add Salmon as an alternative to kallisto, so that the package remains relatively agnostic to quantification methods.
Update the code to accept general experimental designs for the t-test.
Especially for t-SNE and PCA plots - useful for identifying cell types from clusters and visualisations.
Using testthat framework. Will begin with testing of calculateQCMetrics etc.
Is there any point in testing the internal validity of the scater object, since presumably validObject(..) is called multiple times anyway?
Hi Davis,
I have just tried both calculateTPM() and calculateFPKM() with the same effective_length argument and got meaningful output for FPKM, but all-zero expression values for TPM. No errors or warnings were produced during the function calls. Is there any rounding in the TPM function?
Cheers,
Vlad
To make convenient, customisable use of the assayData slot of an SCESet object.
Slots in assayData can be defined manually:
example_sceset@assayData$transformed_counts <- transformed_counts_matrix
example_sceset@assayData$transformed_counts
But this is not very elegant and perhaps not intuitive. So perhaps we could have a function like:
## either
make_accessor("transformed_counts")
## or
transformed_counts <- make_accessor("transformed_counts")
so that there is now a method transformed_counts that can be used to access and assign data from/to example_sceset@assayData$transformed_counts in the SCESet object.
Fiddly things could arise if we want to make it a proper SCESet method. Accessor capability should be easy, assignment perhaps a bit trickier.
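A minimal sketch of what make_accessor could look like (the make_accessor name comes from this proposal; the closure-based implementation here is an assumption, not existing scater code):

```r
## Hypothetical sketch: build getter/setter closures for a named element
## of the assayData slot. Not part of the current scater API.
make_accessor <- function(name) {
    getter <- function(object) {
        object@assayData[[name]]
    }
    setter <- function(object, value) {
        object@assayData[[name]] <- value
        object
    }
    list(get = getter, set = setter)
}

## Usage sketch, assuming an SCESet with an assayData environment:
# transformed_counts <- make_accessor("transformed_counts")
# transformed_counts$get(example_sceset)
# example_sceset <- transformed_counts$set(example_sceset, transformed_counts_matrix)
```

Promoting these closures to proper S4 generics/methods would be the fiddly part mentioned above.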
This causes an error, presumably because some dimensions were dropped somewhere:
example(plotExplanatoryVariables)
plotExplanatoryVariables(example_sceset, variables=vars[1])
Also consider adding "density" as the y-axis label of this plot.
We may want to have more general/flexible slots in an SCESet object.
For example, can we add ATAC data to an SCESet object that can have different dimensions from the expression data?
Hard coding such a slot(s) should be possible. Making it possible for the user to create their own slots requires some more thought, as it would perhaps need "redefining" the SCESet object - we need to explore this.
Subsetting is going to be tricky. We may insist that the number of cells/samples has to be the same across slots, and provide something like a subset_atac() function for subsetting the additional data, for instance (off the top of my head).
One general concern is how large non-expression data could get: e.g. imputed genotypes for 7 million variants are not going to play nicely in R. I imagine a few hundred thousand features for a few thousand cells will be OK, but expect performance problems beyond that.
Hi Developer,
I got error messages for only two plots, not for the others. Any ideas?
Tia.
plot(sceset, block1 = "Plate", block2 = "Input", colour_by = "Approach", nfeatures = 500, exprs_values = "counts")
Error in eval(expr, envir, enclos) :
object 'Proportion_Library' not found
plotQC(sceset, type = "highest-expression")
Error in `$<-.data.frame`(`*tmp*`, "Var2", value = integer(0)) :
  replacement has 0 rows, data has 28800
Behold the weird and wondrous behaviour when environments are passed by reference:
require(scater)
example(newSCESet)
copy <- example_sceset
exprs(copy)[1,] <- 1
exprs(copy)[1,] # fine
exprs(example_sceset)[1,] # unexpected change to original object
It'd be better to switch to 'lockedEnvironment' in the assayDataNew call in newSCESet.
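A sketch of the proposed change, using Biobase's assayDataNew() (the exprs_mat name and the surrounding context are assumptions about the newSCESet internals):

```r
library(Biobase)

## Inside newSCESet, construct assayData as a locked environment so that
## modifying a copy cannot mutate the original object's shared environment.
## "lockedEnvironment" is one of the storage modes accepted by assayDataNew().
assaydata <- assayDataNew("lockedEnvironment", exprs = exprs_mat)
```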
I propose that we set up the GUI to work on an SCESet object. So we let the user build their basic SCESet object following the current approaches, and then fire up the gui like:
scater_gui(sce_set)
And this pops up the GUI in their browser. Then they can calculate QC metrics, check out all the visualisations, subset data and so on. Plotting is easy, I think. Might need to think a little about how the GUI changes/creates objects under the hood.
I think a use_for_exprs slot is a great idea, I also think it's worthwhile implementing a function like
get_exprs <- function(object) {
    ## return the matrix named by the use_for_exprs slot
    x <- NULL
    if (object@use_for_exprs == "TPM") {
        x <- tpm(object)
    } else if (object@use_for_exprs == "FPKM") {
        x <- fpkm(object)
    } else {
        x <- counts(object)
    }
    return(x)
}
Salmon has made backwards-incompatible changes to the output format (https://github.com/COMBINE-lab/salmon/releases), although these will make reading in the data cleaner. I'll fix this either tonight or tomorrow.
This will probably require adding the rjson package to the dependencies.
Both technical (e.g., ERCCs) and biological controls (e.g., mitochondrial genes) are supplied to feature_controls in calculateQCMetrics. However, in some applications (e.g., calculating the drop-out rate or the technical variance), only the technical controls are of interest. It would be nice to have some interface that extracts only the technical controls, in an automatic manner that does not require knowledge of what those technical controls were named in the call to calculateQCMetrics. This is necessary if, for example, I named my input vector Spike but the downstream code looked for fData(sce)$is_feature_control_ERCC; then the code wouldn't work.
The simplest way to do this would be to add technical_controls and biological_controls arguments to the calculateQCMetrics function call, so that it's clear what's what when the user is specifying arguments. To avoid code breakage, these new arguments can be added alongside the existing feature_controls argument. For the time being, all values supplied to feature_controls can be treated as technical controls, until users migrate to the newer arguments.
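A sketch of what the proposed interface might look like (the technical_controls/biological_controls arguments and the is_technical_control column are hypothetical, not existing scater API; ercc_idx and mito_idx are placeholder index vectors):

```r
## Hypothetical future call: controls are declared by type at specification
## time, so downstream code can retrieve technical controls without knowing
## the user-chosen names.
sce <- calculateQCMetrics(sce,
                          technical_controls = list(Spike = ercc_idx),
                          biological_controls = list(Mito = mito_idx))

## Hypothetical extractor for all technical control features, whatever
## they were named in the call above:
# tech_idx <- which(fData(sce)$is_technical_control)
```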
If the user has run kallisto externally and wants to read in the results, then kallisto_log is NULL but the code still tries to iterate over it. Happy to fix this.
Hi David,
Is it possible to repair the link to the workflow figure?
https://github.com/davismcc/scater/blob/master/inst/figures/scater_qc_workflow.png
Many thanks
Cynthia
A trend is fitted to the technical controls in plotExprsFreqVsMean to define the mean-dependent drop-out rate, above which the number of genes is counted and reported. Can this trend be shown on the plot, to facilitate the explanation of the reported value?
Also, reporting "genes above high technical dropout" is confusing. It seems to refer to genes with more dropouts, whereas being above the trend should be equivalent to having fewer dropouts.
It's really useful to be able to size or colour reduced dimension representations by the expression of a particular gene, e.g. if you're wanting to see if that pseudotime trajectory is actually correlated with a marker gene.
I'd suggest size_by/colour_by search names(pData(sce)) first, then featureNames(sce). Happy to implement if people think this is a good idea.
The package is really effective and of high quality in my opinion! However, I have a small issue: plotting PCA or t-SNE in a loop does not work.
Basically, I would like to see how the expression of some specific genes is distributed in PCA and t-SNE among samples, to be confirmed with possible clustering. Therefore I add additional columns to pData, each containing the expression of a selected gene. Then I make the plots, something like this:
names <- c("Atoh1", "Ptf1a")
scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                 colour_by = names[1], size_by = "total_features",
                 shape_by = "state", exprs_values = "cpm", rand_seed = 42)
scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                 colour_by = names[2], size_by = "total_features",
                 shape_by = "state", exprs_values = "cpm", rand_seed = 42)
This code works. However, when I put everything in a loop it fails:
for (n in names) {
    print(n)
    scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                     colour_by = n, size_by = "total_features",
                     shape_by = "state", exprs_values = "cpm", rand_seed = 42)
}
There is no error report, but no plots are created. It's not a problem when there are only 2 genes, but I have up to 50. Why could this happen? Is it possible to fix it?
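For what it's worth, one likely cause (a general fact about R/ggplot2 semantics, not confirmed in this thread): ggplot objects are only auto-printed when evaluated at the top level, so inside a for loop the plot returned by plotTSNE must be printed explicitly:

```r
## Explicitly print() the ggplot object returned by plotTSNE inside the loop;
## auto-printing does not happen for expressions evaluated inside a for loop.
for (n in names) {
    p <- scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                          colour_by = n, size_by = "total_features",
                          shape_by = "state", exprs_values = "cpm", rand_seed = 42)
    print(p)
}
```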
Add simple runnable examples to those functions missing them.
Requirement to pass BiocCheck.
To be calculated in calculateQCMetrics as the proportion of genes with expression <= lowerDetectionLimit.
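In base R this is a one-liner over the expression matrix; a sketch with a toy matrix (the pct_dropout name is an assumption, not an existing scater column):

```r
## Toy expression matrix: 3 genes x 2 cells
mat <- matrix(c(0, 5, 0,
                2, 0, 0), nrow = 3)
lowerDetectionLimit <- 0

## Per-cell percentage of genes at or below the detection limit
pct_dropout <- colMeans(mat <= lowerDetectionLimit) * 100
## Each toy cell has 2 of its 3 genes at the limit, i.e. ~66.7%
```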
plotTSNE(object, argdoesnotexist = 1)
doesn't throw any error, which would be useful if someone spells an argument wrong (I got this after typing perlexity). Not sure how common this is across other functions that pass arguments via ellipsis, or whether it's an easy fix.
Can the normalize method be defined for an SCESet? I'm thinking of something that just recalculates the cpm and exprs fields based on pre-computed size factors, rather than something that calculates the factors themselves (the latter task probably involves enough work and attention that it requires a separate function call). I'd expect something like this in the function body:
normalize.body <- function(sceset, logExprsOffset = 1, recompute.cpm = TRUE) {
    if (recompute.cpm) {
        cpm(sceset) <- cpm(counts(sceset))  # account for subsetting since initialisation?
    }
    SFs <- sizeFactors(sceset)  # assuming defined, see issue #41
    exprs(sceset) <- cpm.default(counts(sceset), prior.count = logExprsOffset,
                                 lib.size = SFs * 1e6, log = TRUE)
    return(sceset)
}
This effectively returns log-"normalized counts" in exprs, which should be interpretable on the same scale as the raw counts if the size factors are centred around unity.
Tomi Ilicic, a student in Sarah Teichmann's lab at EBI, has developed an SVM approach to predicting problematic cells. I'm working with him to integrate this nicely with scater. I think his method will be published in Genome Biology shortly.
Only required for to/fromCellDataSet
Can the colour_by argument in plotPCA and friends accept vectors (factors or numeric values) with which coloration can be performed? Currently, it seems I have to do something like this:
sce$whee <- arbitrary.values
plotPCA(sce, colour_by="whee")
... rather than the simpler one-step:
plotPCA(sce, colour_by=arbitrary.values)
Specifically:
so that import appears in NAMESPACE as well as DESCRIPTION.
Recommendation from BiocCheck.
Some "legacy" functions from previous plans for the package may remain.
Hi both,
I'd like to add two metrics calculated in calculateQCMetrics:
This is a minor issue. In the plotSCESet function, there is an argument check: exprs_values <- match.arg(exprs_values, c("exprs", "tpm", "fpkm", "counts")). However, the provided example data set also contains cpm values, so when you try to select cpm on the plot tab in shiny, it gives you an illegal-argument error. The argument list should probably be extended to include cpm.
e.g. coverage --> n_features_detected
depth --> total_feature_counts
Are there any other QC metrics or similar names that would benefit from more intuitive naming?
Reported by Quin on a Ubuntu laptop. Not replicated on EBI cluster.
Name?
Primary slot for expression values
Need to support (like ExpressionSet):
Slots:
Introduced by the new version of ggplot2 (2.0.0)
I'm a big fan of dplyr for manipulating data frames and I think the same could very easily be applied to scater, in particular implementing filter, rename, mutate, and possibly group_by.
For example, if you want to subset an SCESet, currently you need to call something like
sce <- sce[sce$pct_dropout < 70, ]
whereas under the dplyr formalism this would become
sce <- filter(sce, pct_dropout < 70)
Similarly if we want a new column in pData we could call
sce <- mutate(sce, low_dropout = pct_dropout < 70)
rather than the current
pData(sce)$low_dropout <- sce$pct_dropout < 70
Since ExpressionSets have fData too, you could easily extend this to check whether the arguments refer to pData and, if not, subset on fData, e.g.
sce <- filter(sce, n_cells_exprs > 50)
etc.
A nice consequence is that you could then override magrittr's %>% operator to chain things together, so you could do
sce <- sce %>%
    filter(n_cells_exprs > 50) %>%
    mutate(low_dropout = pct_dropout < 70) %>%
    calculateQCMetrics()
which I think would be really nice!
If people like this idea I would be happy to implement it, as it should be fairly simple, mostly involving function calls on the underlying data frames.
Make it something like n_cells / 5
Need some sensible way to output results
One for the new year.
Can the sizeFactors getter and setter methods be defined for SCESet objects? It should be a simple matter of adding a size_factors column to the pData. The same can be done for normalization factors, though it seems that size factors might be simpler to work with, given that the former depend on the library size (and thus have to be updated upon subsetting, etc.) whereas the latter do not.
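A minimal sketch of the proposed accessors, assuming the values live in a size_factors column of pData (the sizeFactors generics exist in BiocGenerics; the SCESet methods below are the proposal, not existing scater code):

```r
library(Biobase)
library(BiocGenerics)  # provides the sizeFactors generic and its replacement form

## Getter: pull size factors out of the phenotype data.
setMethod("sizeFactors", "SCESet", function(object) {
    pData(object)$size_factors
})

## Setter: store size factors as a pData column; they do not depend on the
## library size, so they survive subsetting without recomputation.
setReplaceMethod("sizeFactors", "SCESet", function(object, value) {
    pData(object)$size_factors <- value
    validObject(object)
    object
})
```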