archive-scater's People

Contributors

daskelly, davismcc, duyck, kevinrue, kieranrcampbell, lazappi, petehaitch, wikiselev

archive-scater's Issues

Add wrapper for Salmon

After reading the Salmon bioRxiv preprint I am sufficiently impressed. It would be very nice to add Salmon as an alternative to kallisto, so that the package remains relatively agnostic to quantification methods.

Add unit testing

Using the testthat framework. We will begin with tests for calculateQCMetrics etc.

Is there any point in testing the internal validity of an SCESet object, given that validObject(...) is presumably called multiple times anyway?
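
A minimal sketch of a first test file, assuming the usual tests/testthat/ layout and the example data set from the scater vignette (data set and constructor argument names are assumptions for this archived version):

library(testthat)
library(scater)

test_that("calculateQCMetrics returns a valid SCESet", {
    data("sc_example_counts", package = "scater")
    data("sc_example_cell_info", package = "scater")
    pd <- new("AnnotatedDataFrame", data = sc_example_cell_info)
    example_sceset <- newSCESet(countData = sc_example_counts, phenoData = pd)
    qc <- calculateQCMetrics(example_sceset)
    expect_s4_class(qc, "SCESet")
    expect_true(validObject(qc))
    ## at least one QC column should have been added to the cell metadata
    expect_gt(ncol(pData(qc)), ncol(pData(example_sceset)))
})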

calculateTPM() produces zero matrix

Hi Davis,

I have just tried both calculateTPM() and calculateFPKM() with the same effective_length argument and got meaningful output for FPKM, but an all-zero matrix for TPM. No errors or warnings were produced during the function calls. Is there any rounding in the TPM function?

Cheers,
Vlad

Add assayData accessor generator function

To make convenient, customisable use of the assayData slot of an SCESet object.

Slots in assayData can be defined manually:

example_sceset@assayData$transformed_counts <- transformed_counts_matrix
example_sceset@assayData$transformed_counts

But this is not very elegant and perhaps not intuitive, so we could have a function like:

## either
make_accessor("transformed_counts")
## or
transformed_counts <- make_accessor("transformed_counts")

This would create a method transformed_counts that can be used to access and assign data from/to example_sceset@assayData$transformed_counts in the SCESet object.

Fiddly things could arise if we want to make it a proper SCESet method. Accessor capability should be easy, assignment perhaps a bit trickier.
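
A rough sketch of what such a generator could look like, using Biobase's assayDataElement helper; make_accessor() itself is the proposal, not an existing function:

make_accessor <- function(slot_name) {
    ## returns a getter; an assignment method would need a separate replacement generic
    function(object) {
        Biobase::assayDataElement(object, slot_name)
    }
}

## hypothetical usage:
## transformed_counts <- make_accessor("transformed_counts")
## head(transformed_counts(example_sceset))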

Bug and extra behaviour in plotExplanatoryVariables

This causes an error, presumably because some dimensions were dropped somewhere:

example(plotExplanatoryVariables)
plotExplanatoryVariables(example_sceset, variables=vars[1])

Also consider adding "density" as the y-axis label of this plot.

Add methods for orthogonal data slots

We may want to have more general/flexible slots in an SCESet object.

For example, can we add ATAC data to an SCESet object that can have different dimensions from the expression data?

Hard-coding such a slot (or slots) should be possible. Making it possible for the user to create their own slots requires some more thought, as it would perhaps need "redefining" the SCESet object - we need to explore this.

Subsetting is going to be tricky. We may insist that the number of cells/samples has to be the same. So we can only have expression matrices of shape ngenes × ncells and ATAC (or other) matrices of shape nregions × ncells. In that way, subsetting by cells will be fine. We may need specific methods to subset the non-expression data. So normal "row" subsetting of an SCESet object will subset expression features (genes, transcripts) as now (backward compatible, default use, which I like), but to subset non-expression data there's a specific method: subset_atac() for instance (off the top of my head).

One general concern is how large non-expression data could get. e.g. imputed genotypes for 7 million variants is not going to play nicely in R. I imagine a few hundred thousand features for a few thousand cells will be OK, but expect problems with performance beyond that.
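
A purely hypothetical sketch of the proposed behaviour; none of this exists in scater, and the atac_counts slot name is made up for illustration:

## Assume the ATAC matrix (nregions x ncells) lives in its own slot, e.g.
## sce@atac_counts, since assayData requires all elements to share dimensions.
## Row subsetting of the SCESet keeps its current meaning (expression features);
## non-expression features get their own method:
subset_atac <- function(sce, regions) {
    sce@atac_counts <- sce@atac_counts[regions, , drop = FALSE]
    sce
}

## Cell (column) subsetting, by contrast, would have to drop the same cells from
## both the expression matrices and the ATAC matrix so that they stay aligned.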

I get error messages only for some plots

Hi Developer,

I get error messages for only two plots, not for the other plots. Any ideas?

Thanks in advance.

plot(sceset, block1 = "Plate", block2 = "Input", colour_by = "Approach", nfeatures = 500, exprs_values = "counts")
Error in eval(expr, envir, enclos) :
object 'Proportion_Library' not found

plotQC(sceset, type = "highest-expression")
Error in `$<-.data.frame`(`*tmp*`, "Var2", value = integer(0)) :
  replacement has 0 rows, data has 28800

Use lockedEnvironment in newSCESet

Behold the weird and wondrous behaviour when environments are passed by reference:

require(scater)
example(newSCESet)
copy <- example_sceset

exprs(copy)[1,] <- 1
exprs(copy)[1,] # fine
exprs(example_sceset)[1,] # unexpected change to original object

It'd be better to switch to 'lockedEnvironment' in the assayDataNew call in newSCESet.
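
A sketch of the proposed change inside newSCESet, assuming Biobase's assayDataNew (the argument names shown are illustrative):

assaydata <- Biobase::assayDataNew(
    storage.mode = "lockedEnvironment",  # copy on modification instead of sharing the environment
    exprs = exprsData,
    counts = countData
)

With a locked environment, exprs(copy)[1, ] <- 1 should force a copy, so example_sceset would no longer be modified behind the scenes.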

Add Shiny app GUI for workflow for less programmatically inclined users

I propose that we set up the GUI to work on an SCESet object. So we let the user build their basic SCESet object following the current approaches, and then fire up the GUI like:

scater_gui(sce_set)

And this pops up the GUI in their browser. Then they can calculate QC metrics, check out all the visualisations, subset data and so on. Plotting is easy, I think. Might need to think a little about how the GUI changes/creates objects under the hood.
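
A purely hypothetical sketch of what scater_gui() could look like, built on shiny; the function, its UI and the control names are assumptions, not existing scater code:

library(shiny)

scater_gui <- function(sce_set) {
    shinyApp(
        ui = fluidPage(
            titlePanel("scater"),
            sidebarLayout(
                sidebarPanel(
                    ## let the user pick any cell metadata column for colouring
                    selectInput("colour_by", "Colour by:", choices = varLabels(sce_set))
                ),
                mainPanel(plotOutput("pca"))
            )
        ),
        server = function(input, output) {
            output$pca <- renderPlot({
                plotPCA(sce_set, colour_by = input$colour_by)
            })
        }
    )
}

## scater_gui(example_sceset)  # opens the app in the browser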

Distinguishing between technical and biological controls

Both technical (e.g., ERCCs) and biological controls (e.g., mitochondrial genes) are supplied to feature_controls in calculateQCMetrics. However, in some applications (e.g., calculating the drop-out rate or the technical variance), only the technical controls are of interest. It would be nice to have an interface that extracts only the technical controls automatically, without requiring knowledge of what those controls were named in the call to calculateQCMetrics. For example, if I named my input vector Spike but the downstream code looked for fData(sce)$is_feature_control_ERCC, the code wouldn't work.

The simplest way to do this would be to add technical_controls and biological_controls to the calculateQCMetrics function call, such that it's clear what's what when the user is specifying arguments. To avoid code breakage, these new arguments can be added alongside the existing feature_controls argument. For the time being, all values supplied to feature_controls can be treated as technical controls, until users migrate to the newer arguments.
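
A hypothetical sketch of the proposed call; the argument names come from this issue, and the is_feature_control_technical column is likewise only a suggestion:

sce <- calculateQCMetrics(
    sce,
    technical_controls  = list(ERCC = grep("^ERCC", featureNames(sce))),
    biological_controls = list(Mito = grep("^MT-", featureNames(sce)))
)

## Downstream code could then ask for the technical controls generically, e.g. via
## fData(sce)$is_feature_control_technical, instead of hard-coding a user-chosen name.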

Show the fitted dropout line in plotExprsFreqVsMean

A trend is fitted to the technical controls in plotExprsFreqVsMean to define the mean-dependent drop-out rate, above which the number of genes is counted and reported. Can this trend be shown on the plot, to facilitate the explanation of the reported value?

Also, reporting "genes above high technical dropout" is confusing. It seems to refer to genes with more dropouts, whereas being above the trend should be equivalent to having fewer dropouts.

plotPCA/tSNE "size_by"/"color_by" to include gene names

It's really useful to be able to size or colour reduced-dimension representations by the expression of a particular gene, e.g. if you want to see whether a pseudotime trajectory is actually correlated with a marker gene.

I'd suggest that size_by/colour_by search names(pData(sce)) first, then featureNames(sce). Happy to implement if people think this is a good idea.
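
An illustrative sketch of the proposed lookup order (not existing scater code):

resolve_colour_by <- function(sce, colour_by) {
    if (colour_by %in% names(pData(sce))) {
        pData(sce)[[colour_by]]     # cell metadata takes priority
    } else if (colour_by %in% featureNames(sce)) {
        exprs(sce)[colour_by, ]     # fall back to a gene's expression values
    } else {
        stop("'", colour_by, "' not found in pData or featureNames")
    }
}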

PCA and tSNE plots are not working in a loop

The package is really effective and high quality, in my opinion! However, I have a small issue: plotting PCA or tSNE in a loop does not work.

Basically, I would like to see how the expression of some specific genes is distributed in the PCA and tSNE plots among samples, to be confirmed with possible clustering. Therefore I add extra columns to pData, each containing the expression of a selected gene. Then I make plots like this:

names <- c( "Atoh1",  "Ptf1a" )

scater::plotTSNE(res.qc[endog_genes,], ntop = 500, perplexity = 50,
                 colour_by = names[1],  size_by = "total_features", 
                 shape_by = "state", exprs_values = "cpm", rand_seed=42 )


scater::plotTSNE(res.qc[endog_genes,], ntop = 500, perplexity = 50,
                 colour_by = names[2],  size_by = "total_features", 
                 shape_by = "state", exprs_values = "cpm", rand_seed=42 )

This code works. However, when I put everything in a loop it fails:

for (n in names) {
  print(n)
  scater::plotTSNE(res.qc[endog_genes,], ntop = 500, perplexity = 50,
                 colour_by = n,  size_by = "total_features", 
                 shape_by = "state", exprs_values = "cpm", rand_seed=42 )

}

There is no error, but no plots are created. It's not a problem when there are only 2 genes, but I have up to 50. Why could this happen, and is it possible to fix it?
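
A likely explanation (not confirmed in this thread) is that plotTSNE() returns a ggplot object, and ggplot objects are only auto-printed at the top level of the console, not inside a for loop, so the plot has to be printed explicitly:

for (n in names) {
  print(n)
  p <- scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                        colour_by = n, size_by = "total_features",
                        shape_by = "state", exprs_values = "cpm", rand_seed = 42)
  print(p)  # explicit print() is required inside loops and functions
}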

Throw error if argument key not found

plotTSNE(object, argdoesnotexist = 1)
does not throw an error. An error would be useful if someone misspells an argument (I hit this after typing perlexity instead of perplexity). Not sure how common this is across other functions that pass arguments via ..., or whether it's an easy fix.
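
One possible (illustrative) guard would be to compare the names passed through ... against a whitelist before forwarding them:

check_dots <- function(..., allowed) {
    unknown <- setdiff(names(list(...)), allowed)
    if (length(unknown)) {
        stop("unknown argument(s): ", paste(unknown, collapse = ", "))
    }
}

## e.g. inside plotTSNE, before forwarding ... to Rtsne:
## check_dots(..., allowed = c("perplexity", "theta", "max_iter", "pca"))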

Defining the normalize method for SCESets

Can the normalize method be defined for an SCESet? I'm thinking of something that just recalculates the cpm and exprs fields based on pre-computed size factors, rather than something that calculates the factors themselves (the latter task probably involves enough work and attention that it requires a separate function call). I'd expect something like this in the function body:

normalize.body <- function(sceset, logExprsOffset = 1, recompute.cpm = TRUE) {
    if (recompute.cpm) {
        cpm(sceset) <- edgeR::cpm(counts(sceset)) # account for subsetting since initialization?
    }
    SFs <- sizeFactors(sceset) # assuming defined, see issue #41
    exprs(sceset) <- edgeR::cpm(counts(sceset), lib.size = SFs * 1e6,
                                prior.count = logExprsOffset, log = TRUE)
    return(sceset)
}

This effectively returns log-"normalized counts" in exprs, which should be interpretable on the same scale as the raw counts if the size factors are centred around unity.

Add integration with QC tool from Tomi Ilicic

Tomi Ilicic, a student in Sarah Teichmann's lab at EBI, has developed an SVM approach to predicting problematic cells. I'm working with him to integrate this nicely with scater. I think his method will be published in Genome Biology shortly.

Colour plots according to arbitrary cell metrics

Can the colour_by argument in plotPCA and friends accept vectors (factors or numeric values) with which coloration can be performed? Currently, it seems I have to do something like this:

sce$whee <- arbitrary.values
plotPCA(sce, colour_by="whee")

... rather than the simpler one-step:

plotPCA(sce, colour_by=arbitrary.values)

New QC metrics

Hi both,

I'd like to add two metrics to calculateQCMetrics:

  • n_detected_endogenous_features
  • pct_tpm_top_100_endogenous_features (and possibly top_500 too)

These are used in the WTCHG QC pipeline. If they are okay with everyone, I'll go ahead and add them.
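
An illustrative sketch of how the two metrics could be computed per cell, assuming tpm_mat is the TPM matrix restricted to endogenous features (genes x cells):

n_detected_endogenous_features <- colSums(tpm_mat > 0)

pct_tpm_top_100_endogenous_features <- apply(tpm_mat, 2, function(x) {
    100 * sum(head(sort(x, decreasing = TRUE), 100)) / sum(x)
})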

Argument check in plotSCESet

This is a minor issue. In the plotSCESet function, there is an argument check exprs_values <- match.arg(exprs_values, c("exprs", "tpm", "fpkm", "counts")). However, the provided example data set also contains cpm values. So, when you try to select cpm on the plot tab in shiny, it gives you an illegal argument error. The argument list should probably be extended to include cpm.
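
The fix would be a one-line change to the argument check (sketch):

exprs_values <- match.arg(exprs_values, c("exprs", "tpm", "cpm", "fpkm", "counts"))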

Define a class for single-cell data

Name?

Primary slot for expression values

Need to support (like ExpressionSet):

  • expr matrix
  • sample metadata
  • feature metadata

Slots (a rough class sketch follows the list):

  • counts
  • cpm
  • log2-cpm
  • gene-wise mean expression
  • gene variance
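
Putting the above together, a purely hypothetical setClass sketch; slot names and representations are placeholders for discussion:

setClass("SCESet",
         contains = "ExpressionSet",   # inherits the expression matrix plus sample/feature metadata
         slots = c(
             counts    = "matrix",     # raw counts
             cpm       = "matrix",     # counts per million
             logcpm    = "matrix",     # log2-CPM
             gene_mean = "numeric",    # gene-wise mean expression
             gene_var  = "numeric"     # gene-wise variance
         ))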

Dplyr-style verbs for scater

I'm a big fan of dplyr for manipulating data frames and I think the same could be very easily applied to scater, in particular implementing

  • filter
  • rename
  • mutate

and possibly group_by.

For example, if you want to subset an SCESet, currently you need to call something like

sce <- sce[sce$pct_dropout < 70, ]

whereas under the dplyr formalism this would become

sce <- filter(sce, pct_dropout < 70)

Similarly if we want a new column in pData we could call

sce <- mutate(sce, low_dropout = pct_dropout < 70)

rather than the current

pData(sce)$low_dropout <- sce$pct_dropout < 70

Since ExpressionSets have fData too, you could easily extend this to check whether the arguments refer to pData and, if not, subset on fData, e.g.

sce <- filter(sce, n_cells_exprs > 50)

etc.

A nice consequence is that you could then use magrittr's %>% operator to chain things together, so you could do

sce <- sce %>%
   filter(n_cells_exprs > 50) %>%
   mutate(low_dropout = pct_dropout < 70) %>%
   calculateQCMetrics()

which I think would be really nice!

If people like this idea I would be happy to implement it, as it should simply involve function calls on the underlying data frames.
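
A rough base-R sketch of the idea, showing that the verbs reduce to operations on the underlying pData; a real implementation would hook into dplyr's S3 generics and handle multiple conditions and the fData fallback (function names here are placeholders):

filter_cells <- function(sce, condition) {
    keep <- eval(substitute(condition), envir = pData(sce), enclos = parent.frame())
    sce[, keep]
}

mutate_cells <- function(sce, name, value_expr) {
    pData(sce)[[name]] <- eval(substitute(value_expr), envir = pData(sce),
                               enclos = parent.frame())
    sce
}

## filter_cells(sce, pct_dropout < 70)
## mutate_cells(sce, "low_dropout", pct_dropout < 70)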

Formally defined getters and setters for size factors

Can the sizeFactors getter and setter methods be defined for SCESet objects? It should be a simple matter of adding a size_factors column to the pData. The same can be done for normalization factors, though it seems that size factors might be simpler to work with, given that the former depends on the library size (and thus has to be updated upon subsetting, etc.) whereas the latter does not.
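
A rough sketch, assuming the sizeFactors generics from BiocGenerics and a size_factors column in pData (all names illustrative, not existing code):

setMethod("sizeFactors", "SCESet", function(object, ...) {
    object$size_factors
})

setReplaceMethod("sizeFactors", "SCESet", function(object, ..., value) {
    pData(object)$size_factors <- value
    object
})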
