felixthestudent / cellpypes Goto Github PK
View Code? Open in Web Editor NEWCell type pipes for R
License: GNU General Public License v3.0
Cell type pipes for R
License: GNU General Public License v3.0
Hi,
First of all, thank you for this wonderful tool.
I have a question about how to convert from count matrix to cp10k units.
It is said that the unit of argument threshold of the 'rule' function is cp10k. Is the way to get the unit cp10k from count matrix dividing the gene-specific UMI count by the total number of UMI in the cell and multiplying by 10,000?
Can you please tell me if this is correct? If not, what is the right way to get cp10k from count matrix?
Thank you very much in advance!
Create an object slot where the user can put existing labels (e.g. output from Louvain, reference label transfer, etc.).
Write functions to:
parent
to the rule
function and then we simply define subsets of these with cellpypes.Perhaps it makes sense to set options at beginning of pypes like this:
obj %>%
options(...) %>%
rule(...) %>% rule(...)
Parameters I can imagine:
Here's code to get you started, taken from the script playground/ms.Rmd
. It has these innovations over other plot_last functions I made in different scripts:
check_obj
to verify object has the expected slotsdrop=T
like this: obj$embed[,1, drop=T]
plot_last <- function(obj, what="rule") {
check_obj(obj)
if(what=="rule") {
last_rule <- obj$rules[nrow(obj$rules),]
boolean=evaluate_rule(obj,
last_rule$class, last_rule$feature,
last_rule$operator, last_rule$threshold)
plot_title <- paste0("Rule: ", last_rule$feature, last_rule$operator,
last_rule$threshold)
} else if (what=="class") {
last_class <- obj$rules[nrow(obj$rules), "class"]
boolean=drop(classify(obj, classes=last_class, return_logical_matrix = T))
plot_title <- paste0("Class: ", last_class)
} else { stop("plot_last argument 'what' should either be rule or class.") }
ggplot(data=data.frame(V1=obj$embed[,1, drop=T], # tbl makes drop necessary
V2=obj$embed[,2, drop=T],
last=boolean),
aes(V1, V2, col=last))+coord_fixed()+
geom_point() +
xlab( colnames(obj$embed)[1] ) +
ylab( colnames(obj$embed)[2] ) +
ggtitle(plot_title)
}
Currently all rules are combined with AND, but there might be cases where OR is useful.
Motivation
Use-case 1: For example, markers a1 and a2 might have nice but imperfect overlap, and both mark cell type A. If A has subtypes it may well be we'd want to set the subtype's parent to either a1 OR a2. For this, introducing a group called "A" as a kind of virtual class (no own rules) would help.
Use-case 2: for plotting, we might want to have all myeloid cells in the same colors. Introducing the "myeloid" class could achieve this:
obj %>%
rule("mono", "CD14", .1) %>%
rule("DC", "CD1C", .1) %>%
group_classes("myeloid"=c("mono","DC") %>%
rule("T", "CD3E", .1) %>%
plot_classes( c("myeloid", "T"))
Since mono and DC have no common parent, this can not be achieved without the group_classes function.
How to implement
obj$classes currently has "class" and "parent" columns. parent is either a class name or the special key word "..root..". I propose to create the second key word "..group.." and the slot obj$class_groups; this could be a named list (name is group name, elements are classes belonging to the group.
Some classes may have many overlapping cells, it would be nice to visualize this somehow.
In bioinformatics, the features (genes) go in rows. For the rest of statistics and data science it's the other way around :D
Here's code to generate a dot plot of marker genes, given an identity_dataframe. Perhaps add this function to cellpypes package?
Usage example:
Tsubsets_seurat = data.frame(
Th = malt$seurat_clusters %in% c(2, 5, 7),
Ttox=malt$seurat_clusters == 3
)
# The difference in purity is invisible in dot plots:
genes = c("CD4", "CD8B", "CD8A")
dot_plot(FetchData(malt, genes, slot="counts"),
malt$nCount_RNA,
Tsubsets_seurat)
Function code:
# stuff to develop the function:
# obj = malt %>% pype_from_seurat()
# identity_dataframe = data.frame(Bcell=malt$seurat_clusters==0,
# Tcell=malt$seurat_clusters==7)
# counts = FetchData(malt, c("CD3E", "CD79B"), slot = "counts")
# totalUMI = malt$nCount_RNA
dot_plot <- function(counts, totalUMI, identity_dataframe, return_data=FALSE){
res =data.frame(
celltype = NULL,
gene = NULL,
percent_expressing = NULL,
average_expression = NULL
)
for(celltype in colnames(identity_dataframe)) {
for(i in 1:ncol(counts)) {
umi = counts[, i]
relevant <-identity_dataframe[,celltype]
res = rbind(res,
data.frame(
celltype = celltype,
gene = colnames(counts)[i],
percent_expressing = 100 * mean(umi[relevant] > 0),
average_expression = 1e4*mean(umi[relevant]/totalUMI[relevant])
))
}
}
plot <- res %>%
ggplot(aes(celltype, gene, size=percent_expressing, col=average_expression)) +
geom_point() +
scale_size_continuous(name = "Percent Expressed") +
viridis::scale_color_viridis(name="Average Expression") +
cowplot::theme_cowplot()
if(return_data) return(res)
return(plot)
}`
Currently, my scripts have an obj$embed slot. I would like to change this to obj$embedding (as discussed with Simon on 9th June).
Here is a minimal example with the problem:
simulated_umis %>% rule("B", "MS4A1", ">", 1e-4) %>% rule("T", "CD3E", ">", .1e-4) %>% rule("Treg", "FOXP3", ">", .05e-4, parent="T") %>% plot_classes(c("B","T","Treg"))
As you can see, we can't plot Tregs and all other Tcells in the same plot, because all Tregs are removed as overlap this way.
Possible solutions:
Bonus feature:
..leafs..
or leafs()
. Then I could do something like this: classify(classes=c(..leafs.., "T")).Problem:
For large data set (MS data from Schirmer group, for example), the interactive commands (rule + plot_last) are permissively slow.
Solution:
Precomputing the totalUMI S would speed things up for the MS data set, since raw
is in gene-wise HDF5Array
format (so cell-wise computations take long).
I found it unnecessary to precompute K when using an NxN neighbor graph (SNN > .1), since that uses matrix multiplication which is incredibly fast. I still have to test an Nxk neighbor graph (i.e. matrix with indices of kNN for each cell), that might be slower.
Both cases could be solved with something like this:
obj <- list(raw, neighbors, embedding)
obj <- precompute(obj) # computes S
obj <- precompute(obj, genes=c("PLP1", "AQP1") # precomputes genes
Starting with Seurat objects where the raw matrices are processed with SoupX, giving counts with decimals.
Here's my script
```obj <- list(
raw = SeuratObject::GetAssayData(lympho_NKT, "counts"),
neighbors=as(lympho_NKT@graphs[["CSS_PCA"]], "dgCMatrix")>.1,
embed =FetchData(lympho_NKT, vars=c("umapCSS_1","umapCSS_1")),
totalUMI = lympho_NKT$nCount_RNA
)
pype <- obj %>%
rule("T", "CD3E", ">", 2) %>%
rule("CD8+ T", "CD8A", ">", 1, parent="T")
I get this error:
Error in check_obj(obj) : s and as.integer(s) are not equal
Mean relative difference: 0.0001429285
Using a non-existent parent gives a cryptic error message. Minimal example:
simulated_umis %>% rule("T", "CD3E", ">", 1e-4, parent="oops") %>% plot_classes()
Note that the error only occurs through plot_classes. I think that's good behavior, we would not want to create obstacles when interactively developing rules.
The error is cryptic, please fix:
Error in if (current_parent == "..root..") { : argument is of length zero
pype_from_seurat gave the following error:
Error in methods::as(seurat@graphs[[graph_choice]], "dgCMatrix") : no method or default for coercing “NULL” to “dgCMatrix”
The solution was to use the explicit object creation strategy outlined in the tutorial, like this:
obj <- list(
raw = SeuratObject::GetAssayData(midgut.combined, "counts"),
neighbors= as(midgut.combined@graphs[["integrated_nn"]], "dgCMatrix")>.1,
embed = FetchData(midgut.combined, c("UMAP_1", "UMAP_2")),
totalUMI = midgut.combined$nCount_RNA
)
Converting Seurat objects to cellpypes object is best done with a wrapper that guides the user towards reasonable choices, e.g. pype_from_seurat function. Parameters that might make sense:
Useful accessors and code:
s <- CreateSeuratObject(t(malt_raw))
wnn <- s
wnn[["ADT"]] <- CreateAssayObject(counts = t(malt_prot))
Key(wnn[["RNA"]]); Key(wnn[["ADT"]])
DefaultAssay(wnn) <- 'RNA'
wnn <- NormalizeData(wnn) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
DefaultAssay(wnn) <- 'ADT'
VariableFeatures(wnn) <- rownames(wnn[["ADT"]])
wnn <- NormalizeData(wnn, normalization.method = 'CLR', margin = 2) %>%
ScaleData() %>% RunPCA(npcs=10, reduction.name="apca", approx=F)
wnn <- FindMultiModalNeighbors(
wnn, reduction.list = list("pca", "apca"),
dims.list = list(1:30, 1:10)
)
Hi Felix,
Thank you so much for this very useful tool!
I was wondering if you were considering adding a function to backtrace the classification to individual cells of the Single-cell RNA seq data. I would love to be able to import the results classifier back to my Seurat object with the classification.
Al the best,
Peter
combine() or similar in name
deseq <- function(obj, class)
function, that returns obj again with DESeq2 results. Then one could start playing with plotting log-foldchanges against each other, e.g. for different marker gene sets or hierarchy levels.I am opening this issue as a notification because cellpypes
is listed here as a package that relies (depends/imports/suggests) on Seurat. As you may know, we recently released Seurat v5 as a beta in March of this year, with new updates for spatial, multimodal, and massively scalable analysis. For more information on updates and improvements, check out our website https://satijalab.org/seurat/.
We are now preparing to release Seurat v5 to CRAN, and plan to submit it on October 23rd. While we have tried our best to keep things backward-compatible, it is possible that updates to Seurat and SeuratObject might break your existing functionality or tests. We wanted to reach out before the new version is on CRAN, so that there's time to report issues/incompatibilities and prepare you for any changes in your code base that might be necessary.
We apologize for any disruption or inconvenience, but hope that the improvements to Seurat v5 will benefit your users going forward.
To test the upcoming release, you can install Seurat from the seurat5
branch using the instructions available on this page: https://satijalab.org/seurat/articles/install.
Thank you!
Seurat v5 team
Currently I have the "class_boolean" function, that returns a boolean vector or matrix.
I want to rename this to 'class_matrix' function to make it clear that a matrix is returned.
Simon also suggested a "classify" function, and I could name it "celltype_factor" and "celltype_matrix", not sure yet.
class_factor requires a few decisions:
I think readibility is improved if all calls to rule
function share the same format. Here is a skeleton:
# skeleton:
obj %>%
rule( "A", "gene1", ">", 1e-3 ) %>%
rule( "A", "gene1", ">", 1e-3, parent = "A") %>%
rule( "A", "gene1", ">", 1e-3, parent = "A")
Two things might be useful, please experiment with them:
Perhaps think about this more!
Plot_last creates two plots:
It's notoriously difficult to make the two plots align vertically, because they are different in width. The rel_widths ratio varies depending on how many digits the maximum expression has (as this is displayed in feat's legend).
Solution I'd like to implement:
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.