davismcc / archive-scater
An archived version of the scater repository, see https://github.com/davismcc/scater for the active version.
After reading the Salmon bioRxiv preprint I am sufficiently impressed. It would be very nice to add Salmon as an alternative to kallisto, so that the package remains relatively agnostic to quantification methods.
Update the code to accept general experimental designs for the t-test.
Especially for t-SNE and PCA plots - useful for identifying cell types from clusters and visualisations.
Using testthat framework. Will begin with testing of calculateQCMetrics etc.
Is there any point in testing the internal validity of the scater object, since presumably validObject(..) is called multiple times anyway?
Hi Davis,
I have just tried both calculateTPM() and calculateFPKM() with the same effective_length argument and got meaningful output for FPKM, but all-zero expression values for TPM. No errors or warnings were produced during the function calls. Is there any rounding in the TPM function?
Cheers,
Vlad
To make convenient, customisable use of the assayData slot of an SCESet object.
Slots in assayData can be defined manually:
example_sceset@assayData$transformed_counts <- transformed_counts_matrix
example_sceset@assayData$transformed_counts
But this is not very elegant and perhaps not intuitive. So perhaps we could have a function like:
## either
make_accessor("transformed_counts")
## or
transformed_counts <- make_accessor("transformed_counts")
so that there is now a method transformed_counts that can be used to access and assign data from/to example_sceset@assayData$transformed_counts in the SCESet object.
Fiddly things could arise if we want to make it a proper SCESet method. Accessor capability should be easy, assignment perhaps a bit trickier.
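A minimal sketch of what make_accessor could look like (the make_accessor name comes from this proposal; the closure-based implementation here is an assumption, not existing scater code):

```r
## Hypothetical sketch: build getter/setter closures for a named element
## of the assayData slot. Not part of the current scater API.
make_accessor <- function(name) {
    getter <- function(object) {
        object@assayData[[name]]
    }
    setter <- function(object, value) {
        object@assayData[[name]] <- value
        object
    }
    list(get = getter, set = setter)
}

## Usage sketch, assuming an SCESet with an assayData environment:
# transformed_counts <- make_accessor("transformed_counts")
# transformed_counts$get(example_sceset)
# example_sceset <- transformed_counts$set(example_sceset, transformed_counts_matrix)
```

Promoting these closures to proper S4 generics/methods would be the fiddly part mentioned above.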
This causes an error, presumably because some dimensions were dropped somewhere:
example(plotExplanatoryVariables)
plotExplanatoryVariables(example_sceset, variables=vars[1])
Also consider adding "density" as the y-axis label of this plot.
We may want to have more general/flexible slots in an SCESet object.
For example, can we add ATAC data to an SCESet object that can have different dimensions from the expression data?
Hard coding such a slot(s) should be possible. Making it possible for the user to create their own slots requires some more thought, as it would perhaps need "redefining" the SCESet object - we need to explore this.
Subsetting is going to be tricky. We may insist that the number of cells/samples has to be the same across slots, and provide something like a subset_atac() function for subsetting the additional data, for instance (off the top of my head).
One general concern is how large non-expression data could get: e.g. imputed genotypes for 7 million variants are not going to play nicely in R. I imagine a few hundred thousand features for a few thousand cells will be OK, but expect performance problems beyond that.
Hi Developer,
I got error messages for only two plots, not for the others. Any ideas?
Tia.
plot(sceset, block1 = "Plate", block2 = "Input", colour_by = "Approach", nfeatures = 500, exprs_values = "counts")
Error in eval(expr, envir, enclos) :
object 'Proportion_Library' not found
plotQC(sceset, type = "highest-expression")
Error in `$<-.data.frame`(`*tmp*`, "Var2", value = integer(0)) :
  replacement has 0 rows, data has 28800
Behold the weird and wondrous behaviour when environments are passed by reference:
require(scater)
example(newSCESet)
copy <- example_sceset
exprs(copy)[1,] <- 1
exprs(copy)[1,] # fine
exprs(example_sceset)[1,] # unexpected change to original object
It'd be better to switch to 'lockedEnvironment' in the assayDataNew call in newSCESet.
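A sketch of the proposed change, using Biobase's assayDataNew() (the exprs_mat name and the surrounding context are assumptions about the newSCESet internals):

```r
library(Biobase)

## Inside newSCESet, construct assayData as a locked environment so that
## modifying a copy cannot mutate the original object's shared environment.
## "lockedEnvironment" is one of the storage modes accepted by assayDataNew().
assaydata <- assayDataNew("lockedEnvironment", exprs = exprs_mat)
```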
I propose that we set up the GUI to work on an SCESet object. So we let the user build their basic SCESet object following the current approaches, and then fire up the gui like:
scater_gui(sce_set)
And this pops up the GUI in their browser. Then they can calculate QC metrics, check out all the visualisations, subset data and so on. Plotting is easy, I think. Might need to think a little about how the GUI changes/creates objects under the hood.
I think a use_for_exprs slot is a great idea, I also think it's worthwhile implementing a function like
get_exprs <- function(object) {
    ## return the matrix named by the use_for_exprs slot
    x <- NULL
    if (object@use_for_exprs == "TPM") {
        x <- tpm(object)
    } else if (object@use_for_exprs == "FPKM") {
        x <- fpkm(object)
    } else {
        x <- counts(object)
    }
    return(x)
}
Salmon has made backwards-incompatible changes to the output format (https://github.com/COMBINE-lab/salmon/releases), although these will make reading in the data cleaner. I'll fix this either tonight or tomorrow.
This will probably require adding the rjson package to the dependencies.
Both technical (e.g., ERCCs) and biological controls (e.g., mitochondrial genes) are supplied to feature_controls in calculateQCMetrics. However, in some applications (e.g., calculating the drop-out rate or the technical variance), only the technical controls are of interest. It would be nice to have some interface that extracts only the technical controls, in an automatic manner that does not require knowledge of what those technical controls were named in the call to calculateQCMetrics. This is necessary if, for example, I named my input vector Spike but the downstream code looked for fData(sce)$is_feature_control_ERCC; then the code wouldn't work.
The simplest way to do this would be to add technical_controls and biological_controls arguments to the calculateQCMetrics function call, so that it's clear what's what when the user is specifying arguments. To avoid code breakage, these new arguments can be added alongside the existing feature_controls argument. For the time being, all values supplied to feature_controls can be treated as technical controls, until users migrate to the newer arguments.
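A sketch of what the proposed interface might look like (the technical_controls/biological_controls arguments and the is_technical_control column are hypothetical, not existing scater API; ercc_idx and mito_idx are placeholder index vectors):

```r
## Hypothetical future call: controls are declared by type at specification
## time, so downstream code can retrieve technical controls without knowing
## the user-chosen names.
sce <- calculateQCMetrics(sce,
                          technical_controls = list(Spike = ercc_idx),
                          biological_controls = list(Mito = mito_idx))

## Hypothetical extractor for all technical control features, whatever
## they were named in the call above:
# tech_idx <- which(fData(sce)$is_technical_control)
```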
If the user has run kallisto externally and wants to read in the results, then kallisto_log is NULL but the code still tries to iterate over it. Happy to fix this.
Hi David,
Is it possible to repair the link to the workflow figure?
https://github.com/davismcc/scater/blob/master/inst/figures/scater_qc_workflow.png
Many thanks
Cynthia
A trend is fitted to the technical controls in plotExprsFreqVsMean to define the mean-dependent drop-out rate, above which the number of genes is counted and reported. Can this trend be shown on the plot, to facilitate the explanation of the reported value?
Also, reporting "genes above high technical dropout" is confusing. It seems to refer to genes with more dropouts, whereas being above the trend should be equivalent to having fewer dropouts.
It's really useful to be able to size or colour reduced dimension representations by the expression of a particular gene, e.g. if you're wanting to see if that pseudotime trajectory is actually correlated with a marker gene.
I'd suggest size_by/colour_by search names(pData(sce)) first, then featureNames(sce). Happy to implement if people think this is a good idea.
The package is really effective and of high quality in my opinion! However, I have a small issue: plotting PCA or t-SNE in a loop does not work.
Basically, I would like to see how the expression of some specific genes is distributed in PCA and t-SNE among samples, to be confirmed with possible clustering. Therefore I add additional columns to pData, each containing the expression of a selected gene. Then I make the plots, something like this:
names <- c("Atoh1", "Ptf1a")
scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                 colour_by = names[1], size_by = "total_features",
                 shape_by = "state", exprs_values = "cpm", rand_seed = 42)
scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                 colour_by = names[2], size_by = "total_features",
                 shape_by = "state", exprs_values = "cpm", rand_seed = 42)
This code works. However, when I put everything in a loop it fails:
for (n in names) {
    print(n)
    scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                     colour_by = n, size_by = "total_features",
                     shape_by = "state", exprs_values = "cpm", rand_seed = 42)
}
There is no error report, but no plots are created. It's not a problem when there are only 2 genes, but I have up to 50. Why could this happen? Is it possible to fix it?
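For what it's worth, one likely cause (a general fact about R/ggplot2 semantics, not confirmed in this thread): ggplot objects are only auto-printed when evaluated at the top level, so inside a for loop the plot returned by plotTSNE must be printed explicitly:

```r
## Explicitly print() the ggplot object returned by plotTSNE inside the loop;
## auto-printing does not happen for expressions evaluated inside a for loop.
for (n in names) {
    p <- scater::plotTSNE(res.qc[endog_genes, ], ntop = 500, perplexity = 50,
                          colour_by = n, size_by = "total_features",
                          shape_by = "state", exprs_values = "cpm", rand_seed = 42)
    print(p)
}
```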
Add simple runnable examples to those functions missing them.
Requirement to pass BiocCheck.
To be calculated in calculateQCMetrics as the proportion of genes with expression <= lowerDetectionLimit.
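In base R this is a one-liner over the expression matrix; a sketch with a toy matrix (the pct_dropout name is an assumption, not an existing scater column):

```r
## Toy expression matrix: 3 genes x 2 cells
mat <- matrix(c(0, 5, 0,
                2, 0, 0), nrow = 3)
lowerDetectionLimit <- 0

## Per-cell percentage of genes at or below the detection limit
pct_dropout <- colMeans(mat <= lowerDetectionLimit) * 100
## Each toy cell has 2 of its 3 genes at the limit, i.e. ~66.7%
```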
plotTSNE(object, argdoesnotexist = 1)
doesn't throw any error, which would be useful if someone spells an argument wrong (I got this after typing perlexity). Not sure how common this is across other functions that pass arguments via ellipsis, or whether it's an easy fix.
Can the normalize method be defined for an SCESet? I'm thinking of something that just recalculates the cpm and exprs fields based on pre-computed size factors, rather than something that calculates the factors themselves (the latter task probably involves enough work and attention that it requires a separate function call). I'd expect something like this in the function body:
normalize.body <- function(sceset, logExprsOffset = 1, recompute.cpm = TRUE) {
    if (recompute.cpm) {
        cpm(sceset) <- cpm(counts(sceset))  # account for subsetting since initialisation?
    }
    SFs <- sizeFactors(sceset)  # assuming defined, see issue #41
    exprs(sceset) <- cpm.default(counts(sceset), prior.count = logExprsOffset,
                                 lib.size = SFs * 1e6, log = TRUE)
    return(sceset)
}
This effectively returns log-"normalized counts" in exprs, which should be interpretable on the same scale as the raw counts if the size factors are centred around unity.
Tomi Ilicic, a student in Sarah Teichmann's lab at EBI, has developed an SVM approach to predicting problematic cells. I'm working with him to integrate this nicely with scater. I think his method will be published in Genome Biology shortly.
Only required for to/fromCellDataSet
Can the colour_by argument in plotPCA and friends accept vectors (factors or numeric values) with which coloration can be performed? Currently, it seems I have to do something like this:
sce$whee <- arbitrary.values
plotPCA(sce, colour_by="whee")
... rather than the simpler one-step:
plotPCA(sce, colour_by=arbitrary.values)
Specifically:
so that import appears in NAMESPACE as well as DESCRIPTION.
Recommendation from BiocCheck.
Some "legacy" functions from previous plans for the package may remain.
Hi both,
I'd like to add two metrics calculated in calculateQCMetrics:
This is a minor issue. In the plotSCESet function, there is an argument check: exprs_values <- match.arg(exprs_values, c("exprs", "tpm", "fpkm", "counts")). However, the provided example data set also contains cpm values, so when you try to select cpm on the plot tab in shiny, it gives you an illegal-argument error. The argument list should probably be extended to include cpm.
e.g. coverage --> n_features_detected
depth --> total_feature_counts
Are there any other QC metrics or similar names that would benefit from more intuitive naming?
Reported by Quin on a Ubuntu laptop. Not replicated on EBI cluster.
Name?
Primary slot for expression values
Need to support (like ExpressionSet):
Slots:
Introduced by the new version of ggplot2 (2.0.0)
I'm a big fan of dplyr for manipulating data frames and I think the same could very easily be applied to scater, in particular implementing filter, rename, mutate, and possibly group_by.
For example, if you want to subset an SCESet, currently you need to call something like
sce <- sce[sce$pct_dropout < 70, ]
whereas under the dplyr formalism this would become
sce <- filter(sce, pct_dropout < 70)
Similarly if we want a new column in pData we could call
sce <- mutate(sce, low_dropout = pct_dropout < 70)
rather than the current
pData(sce)$low_dropout <- sce$pct_dropout < 70
Since ExpressionSets have fData too, you could easily extend this to check whether the arguments refer to pData and, if not, subset on fData, e.g.
sce <- filter(sce, n_cells_exprs > 50)
etc.
A nice consequence is that you could then override magrittr's %>% operator to chain things together, so you could do
sce <- sce %>%
    filter(n_cells_exprs > 50) %>%
    mutate(low_dropout = pct_dropout < 70) %>%
    calculateQCMetrics()
which I think would be really nice!
If people like this idea I would be happy to implement it, as it should be fairly simple, mostly involving function calls on the underlying data frames.
Make it something like n_cells / 5
Need some sensible way to output results
One for the new year.
Can the sizeFactors getter and setter methods be defined for SCESet objects? It should be a simple matter of adding a size_factors column to the pData. The same can be done for normalization factors, though it seems that size factors might be simpler to work with, given that the former depend on the library size (and thus have to be updated upon subsetting, etc.) whereas the latter do not.
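A minimal sketch of the proposed accessors, assuming the values live in a size_factors column of pData (the sizeFactors generics exist in BiocGenerics; the SCESet methods below are the proposal, not existing scater code):

```r
library(Biobase)
library(BiocGenerics)  # provides the sizeFactors generic and its replacement form

## Getter: pull size factors out of the phenotype data.
setMethod("sizeFactors", "SCESet", function(object) {
    pData(object)$size_factors
})

## Setter: store size factors as a pData column; they do not depend on the
## library size, so they survive subsetting without recomputation.
setReplaceMethod("sizeFactors", "SCESet", function(object, value) {
    pData(object)$size_factors <- value
    validObject(object)
    object
})
```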