felixthestudent / cellpypes Goto Github PK

View Code? Open in Web Editor NEW

49.0 1.0 3.0 3.1 MB

Cell type pipes for R

License: GNU General Public License v3.0

R 100.00%

celltype-annotation classification-algorithm scrnaseq single-cell-rna-seq

cellpypes's People

Contributors

Stargazers

Watchers

Forkers

eegk crsky1023 animesh

cellpypes's Issues

threshold in rule function

Hi,
First of all, thank you for this wonderful tool.
I have a question about how to convert from count matrix to cp10k units.

It is said that the unit of argument threshold of the 'rule' function is cp10k. Is the way to get the unit cp10k from count matrix dividing the gene-specific UMI count by the total number of UMI in the cell and multiplying by 10,000?
Can you please tell me if this is correct? If not, what is the right way to get cp10k from count matrix?

Thank you very much in advance!

Existing labels

Create an object slot where the user can put existing labels (e.g. output from Louvain, reference label transfer, etc.).

Write functions to:

manually edit selected clusters. Specifically, a vector of existing classes could be supplied as parent to the rule function and then we simply define subsets of these with cellpypes.
rename classes
group classes together (see issue #8)

options function for cellpypes

Perhaps it makes sense to set options at beginning of pypes like this:

obj %>%
  options(...) %>%
  rule(...) %>% rule(...)

Parameters I can imagine:

threshold_units: "fractions", "permille", "per10k", "ppm". This way, I could write .1 instead of .1e-3 in all rules, increasing
readibility. Also, the feature plot function could display permille/per10k/... in its color legend.
perhaps pull in meta.data here, so that you can write rules for totalUMI, percent.mito etc.?
method: "count_pooling", "saver", or others?

plot_last function

Here's code to get you started, taken from the script playground/ms.Rmd. It has these innovations over other plot_last functions I made in different scripts:

uses check_obj to verify object has the expected slots
works if embedding is a tbl, which requires using drop=T like this: obj$embed[,1, drop=T]
that's it :)

plot_last <- function(obj, what="rule") {
  check_obj(obj)
  if(what=="rule") {
    last_rule <- obj$rules[nrow(obj$rules),]
    boolean=evaluate_rule(obj,
                          last_rule$class, last_rule$feature,
                          last_rule$operator,   last_rule$threshold)
    plot_title <- paste0("Rule: ", last_rule$feature, last_rule$operator, 
                         last_rule$threshold)
  } else if (what=="class") {
    last_class <- obj$rules[nrow(obj$rules), "class"]
    boolean=drop(classify(obj, classes=last_class, return_logical_matrix = T))
    plot_title <- paste0("Class: ", last_class)
  } else { stop("plot_last argument 'what' should either be rule or class.") }

  ggplot(data=data.frame(V1=obj$embed[,1, drop=T], # tbl makes drop necessary
                         V2=obj$embed[,2, drop=T],
                         last=boolean),
         aes(V1, V2, col=last))+coord_fixed()+
    geom_point()  +
    xlab( colnames(obj$embed)[1] ) + 
    ylab( colnames(obj$embed)[2] ) +
    ggtitle(plot_title)
    
}

Function: group_classes (OR logic for rules)

Currently all rules are combined with AND, but there might be cases where OR is useful.

Motivation
Use-case 1: For example, markers a1 and a2 might have nice but imperfect overlap, and both mark cell type A. If A has subtypes it may well be we'd want to set the subtype's parent to either a1 OR a2. For this, introducing a group called "A" as a kind of virtual class (no own rules) would help.

Use-case 2: for plotting, we might want to have all myeloid cells in the same colors. Introducing the "myeloid" class could achieve this:

obj %>%
rule("mono", "CD14", .1) %>%
rule("DC",   "CD1C", .1) %>%
group_classes("myeloid"=c("mono","DC") %>%
rule("T",    "CD3E", .1) %>%
plot_classes( c("myeloid", "T"))

Since mono and DC have no common parent, this can not be achieved without the group_classes function.

How to implement
obj$classes currently has "class" and "parent" columns. parent is either a class name or the special key word "..root..". I propose to create the second key word "..group.." and the slot obj$class_groups; this could be a named list (name is group name, elements are classes belonging to the group.

plot_overlap function

Some classes may have many overlapping cells, it would be nice to visualize this somehow.

How to handle hierarchies? Overlap between parent and child class is obviously a superset-subset relationship, so that'd be pointless. See also the issue on class_factor function.

Genes in rows (not columns)

In bioinformatics, the features (genes) go in rows. For the rest of statistics and data science it's the other way around :D

dot_plot function

Here's code to generate a dot plot of marker genes, given an identity_dataframe. Perhaps add this function to cellpypes package?

Usage example:

Tsubsets_seurat = data.frame(
  Th = malt$seurat_clusters %in% c(2, 5, 7),
  Ttox=malt$seurat_clusters == 3
)


# The difference in purity is invisible in dot plots:
genes = c("CD4", "CD8B", "CD8A")
dot_plot(FetchData(malt, genes, slot="counts"),
         malt$nCount_RNA,
         Tsubsets_seurat)

Function code:

# stuff to develop the function:
# obj = malt %>% pype_from_seurat() 
# identity_dataframe = data.frame(Bcell=malt$seurat_clusters==0,
#                                 Tcell=malt$seurat_clusters==7)
# counts   = FetchData(malt, c("CD3E", "CD79B"), slot = "counts")
# totalUMI = malt$nCount_RNA


dot_plot <- function(counts, totalUMI, identity_dataframe, return_data=FALSE){
  
  res =data.frame(
    celltype = NULL,
    gene     = NULL,
    percent_expressing = NULL,
    average_expression = NULL
  )
  
  for(celltype in colnames(identity_dataframe)) {
    for(i in 1:ncol(counts)) {
      umi = counts[, i]
      relevant <-identity_dataframe[,celltype] 
      res = rbind(res, 
                  data.frame(
                    celltype = celltype,
                    gene     = colnames(counts)[i],
                    percent_expressing = 100 * mean(umi[relevant] > 0),
                    average_expression = 1e4*mean(umi[relevant]/totalUMI[relevant]) 
                  ))
    }
  }
  
  
  plot <- res %>%
    ggplot(aes(celltype, gene, size=percent_expressing, col=average_expression)) +
    geom_point() +
    scale_size_continuous(name = "Percent Expressed") +
    viridis::scale_color_viridis(name="Average Expression") +
    cowplot::theme_cowplot()
  
  if(return_data) return(res)
  return(plot)
  
}`

embedding instead of embed

Currently, my scripts have an obj$embed slot. I would like to change this to obj$embedding (as discussed with Simon on 9th June).

plot_classes/classify: Handle overlap between parent and child

Here is a minimal example with the problem:

simulated_umis %>% rule("B", "MS4A1", ">", 1e-4) %>% rule("T", "CD3E", ">", .1e-4) %>% rule("Treg", "FOXP3", ">", .05e-4, parent="T") %>% plot_classes(c("B","T","Treg"))

As you can see, we can't plot Tregs and all other Tcells in the same plot, because all Tregs are removed as overlap this way.

Possible solutions:

don't replace overlap if it is between a (leaf) class and one of its ancestors. Instead, use leaf class, not ancestor class.
warn the user that more than 15 % of a class is overlapping with another class... ties in with plot_overlap function I am planning I guess?

Bonus feature:

perhaps introduce shortcuts that can be passed to classify's classes argument. For example ..leafs.. or leafs(). Then I could do something like this: classify(classes=c(..leafs.., "T")).

precompute function

Problem:
For large data set (MS data from Schirmer group, for example), the interactive commands (rule + plot_last) are permissively slow.

Solution:
Precomputing the totalUMI S would speed things up for the MS data set, since raw is in gene-wise HDF5Array format (so cell-wise computations take long).

I found it unnecessary to precompute K when using an NxN neighbor graph (SNN > .1), since that uses matrix multiplication which is incredibly fast. I still have to test an Nxk neighbor graph (i.e. matrix with indices of kNN for each cell), that might be slower.

Both cases could be solved with something like this:

obj <- list(raw, neighbors, embedding)
obj <- precompute(obj)  # computes S
obj <- precompute(obj, genes=c("PLP1", "AQP1") # precomputes genes

cellpypes does not handle objects processed with SoupX

Starting with Seurat objects where the raw matrices are processed with SoupX, giving counts with decimals.

Here's my script
```obj <- list(
    raw      = SeuratObject::GetAssayData(lympho_NKT, "counts"),
    neighbors=as(lympho_NKT@graphs[["CSS_PCA"]], "dgCMatrix")>.1, 
    embed    =FetchData(lympho_NKT, vars=c("umapCSS_1","umapCSS_1")),
    totalUMI = lympho_NKT$nCount_RNA
)
pype <- obj %>%
  rule("T",           "CD3E",    ">", 2)                  %>% 
  rule("CD8+ T",      "CD8A",    ">", 1,  parent="T")

I get this error:

Error in check_obj(obj) : s and as.integer(s) are not equal
Mean relative difference: 0.0001429285

parent invalid gives unclear error message

Using a non-existent parent gives a cryptic error message. Minimal example:
simulated_umis %>% rule("T", "CD3E", ">", 1e-4, parent="oops") %>% plot_classes()

Note that the error only occurs through plot_classes. I think that's good behavior, we would not want to create obstacles when interactively developing rules.

The error is cryptic, please fix:

Error in if (current_parent == "..root..") { : argument is of length zero

pype_from_seurat only works on simple cases

pype_from_seurat gave the following error:

Error in methods::as(seurat@graphs[[graph_choice]], "dgCMatrix") : no method or default for coercing “NULL” to “dgCMatrix”

The solution was to use the explicit object creation strategy outlined in the tutorial, like this:

obj <- list(
    raw      = SeuratObject::GetAssayData(midgut.combined, "counts"),
    neighbors= as(midgut.combined@graphs[["integrated_nn"]], "dgCMatrix")>.1,
    embed    = FetchData(midgut.combined, c("UMAP_1", "UMAP_2")),
    totalUMI = midgut.combined$nCount_RNA
)

pype_from_seurat function

Converting Seurat objects to cellpypes object is best done with a wrapper that guides the user towards reasonable choices, e.g. pype_from_seurat function. Parameters that might make sense:

neighborGraph="kNN/SNN/WNN/..."
assay should be RNA

Useful accessors and code:

s <- CreateSeuratObject(t(malt_raw))
wnn <- s
wnn[["ADT"]] <- CreateAssayObject(counts = t(malt_prot))
Key(wnn[["RNA"]]); Key(wnn[["ADT"]]) 

DefaultAssay(wnn) <- 'RNA'
wnn <- NormalizeData(wnn) %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA()
DefaultAssay(wnn) <- 'ADT'
VariableFeatures(wnn) <- rownames(wnn[["ADT"]])
wnn <- NormalizeData(wnn, normalization.method = 'CLR', margin = 2) %>% 
  ScaleData()  %>% RunPCA(npcs=10, reduction.name="apca", approx=F)
wnn <- FindMultiModalNeighbors(
  wnn, reduction.list = list("pca", "apca"), 
  dims.list = list(1:30, 1:10)
)

Backtrace which cells that got classified?

Hi Felix,
Thank you so much for this very useful tool!

I was wondering if you were considering adding a function to backtrace the classification to individual cells of the Single-cell RNA seq data. I would love to be able to import the results classifier back to my Seurat object with the classification.

Al the best,
Peter

Function to combine objects

combine() or similar in name

combines cellpypes objects, which is useful for pseudobulking and setting same thresholds for all samples
makes it easier to compute Differential expression
perhaps I could write a deseq <- function(obj, class) function, that returns obj again with DESeq2 results. Then one could start playing with plotting log-foldchanges against each other, e.g. for different marker gene sets or hierarchy levels.
I propose that DE genes in scRNAseq are also something to explore, not a static inference result. While problematic for multiple testing correction (how to correct for interactively playing around until results are judged as good?), I like the idea that DE is also amenable to exploration.

Prepare for upcoming Seurat v5 release

I am opening this issue as a notification because cellpypes is listed here as a package that relies (depends/imports/suggests) on Seurat. As you may know, we recently released Seurat v5 as a beta in March of this year, with new updates for spatial, multimodal, and massively scalable analysis. For more information on updates and improvements, check out our website https://satijalab.org/seurat/.
We are now preparing to release Seurat v5 to CRAN, and plan to submit it on October 23rd. While we have tried our best to keep things backward-compatible, it is possible that updates to Seurat and SeuratObject might break your existing functionality or tests. We wanted to reach out before the new version is on CRAN, so that there's time to report issues/incompatibilities and prepare you for any changes in your code base that might be necessary.

We apologize for any disruption or inconvenience, but hope that the improvements to Seurat v5 will benefit your users going forward.
To test the upcoming release, you can install Seurat from the seurat5 branch using the instructions available on this page: https://satijalab.org/seurat/articles/install.

Thank you!
Seurat v5 team

class_factor function and class_matrix function

Currently I have the "class_boolean" function, that returns a boolean vector or matrix.
I want to rename this to 'class_matrix' function to make it clear that a matrix is returned.

Simon also suggested a "classify" function, and I could name it "celltype_factor" and "celltype_matrix", not sure yet.

class_factor requires a few decisions:

How to deal with overlap? Several options. class_factor could label cells with strings such as "Unassigned" or "multiple_labels". Alternatively it could create cell type labels by concatenating the two (or more) classes for which a cell fulfills the rules (such as T_B or T_B_monocyte).
Hierarchies: by default, should only "leaves" of the hierarchy tree, i.e. classes that do not have children.

format_rules function

I think readibility is improved if all calls to rule function share the same format. Here is a skeleton:

# skeleton:
obj %>%
  rule(  "A",       "gene1",          ">",   1e-3              )             %>%
  rule(  "A",       "gene1",          ">",   1e-3, parent = "A")             %>%
  rule(  "A",       "gene1",          ">",   1e-3, parent = "A")

Two things might be useful, please experiment with them:

a function that generates above skeleton, so that I can copy-paste it into my R script and start modifying it.
Perhaps it is too annoying though to adjust white spaces every time, let's see.
a function that takes an object AFTER the rules were added, and outputs the rules formatted like this.
The user could then copy-paste the nicely formatted rules into his Rscript, so that at least after developing the
code the script looks nice.

Perhaps think about this more!

plot_last: make plots the same size

Plot_last creates two plots:

rule plot, colours indicate TRUE/FALSE
feature plot, colour indicates gene expression levels

It's notoriously difficult to make the two plots align vertically, because they are different in width. The rel_widths ratio varies depending on how many digits the maximum expression has (as this is displayed in feat's legend).

Solution I'd like to implement:

make both plots a little wider, by extending the x-axis (increasing the maximal UMAP1 coordinate value)
for feat, put the legend into this area as inset graph. Does that work?!
for rule, put TRUE / FALSE together with the corresponding number of cells there, colored same as the points.