stemangiola / sccomp Goto Github PK

Testing differences in cell type proportions from single-cell data.

Home Page: https://stemangiola.github.io/sccomp/

R 90.40% C++ 0.01% Stan 9.59%

bayesian composition cytof microbiome r-package single-cell proportions batch-correction differential-proportion unwanted-variation multilevel random-effects

sccomp's Introduction

sccomp - Outlier-aware and count-based compositional analysis of single-cell data

sccomp tests differences in cell type proportions from single-cell data. It is robust against outliers, it models continuous and discrete factors, and capable of random-effect/intercept modelling.

Please cite PNAS - sccomp: Robust differential composition and variability analysis for single-cell data

Characteristics

Complex linear models with continuous and categorical covariates
Multilevel modelling, with population fixed and random effects/intercept
Modelling data from counts
Testing differences in cell-type proportionality
Testing differences in cell-type specific variability
Cell-type information share for variability adaptive shrinkage
Testing differential variability
Probabilistic outlier identification
Cross-dataset learning (hyperpriors).

Installation

Bioconductor

if (!requireNamespace("BiocManager")) install.packages("BiocManager")
BiocManager::install("sccomp")

Github

devtools::install_github("stemangiola/sccomp")

Function	Description
`sccomp_estimate`	Fit the model onto the data, and estimate the coefficients
`sccomp_remove_outliers`	Identify outliers probabilistically based on the model fit, and exclude them from the estimation
`sccomp_test`	Calculate the probability that the coefficients are outside the H0 interval (i.e. test_composition_above_logit_fold_change)
`sccomp_replicate`	Simulate data from the model, or part of the model
`sccomp_predict`	Predicts proportions, based on the mode, or part of the model
`sccomp_remove_unwanted_variation`	Removes the variability for unwanted factors
`plot`	Plors summary plots to asses significance

Analysis

sccomp can model changes in composition and variability. By default, the formula for variability is either ~1, which assumes that the cell-group variability is independent of any covariate or ~ factor_of_interest, which assumes that the model is dependent on the factor of interest only. The variability model must be a subset of the model for composition.

Binary factor

Of the output table, the estimate columns start with the prefix c_ indicate composition, or with v_ indicate variability (when formula_variability is set).

From Seurat, SingleCellExperiment, metadata objects

sccomp_result = 
  single_cell_object |>
  sccomp_estimate( 
    formula_composition = ~ type, 
    .sample =  sample, 
    .cell_group = cell_group, 
    bimodal_mean_variability_association = TRUE,
    cores = 1 
  ) |> 
  sccomp_remove_outliers(cores = 1) |> # Optional
  sccomp_test()

From counts

sccomp_result = 
  counts_obj |>
  sccomp_estimate( 
    formula_composition = ~ type, 
    .sample = sample,
    .cell_group = cell_group,
    .count = count, 
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  ) |> 
  sccomp_remove_outliers(cores = 1, verbose = FALSE) |> # Optional
  sccomp_test()

Summary plots

plots = sccomp_result |> plot()

## Joining with `by = join_by(cell_group, sample)`
## Joining with `by = join_by(cell_group, type)`

A plot of group proportion, faceted by groups. The blue boxplots represent the posterior predictive check. If the model is likely to be descriptively adequate to the data, the blue box plot should roughly overlay with the black box plot, which represents the observed data. The outliers are coloured in red. A box plot will be returned for every (discrete) covariate present in formula_composition. The colour coding represents the significant associations for composition and/or variability.

plots$boxplot

## [[1]]

A plot of estimates of differential composition (c_) on the x-axis and differential variability (v_) on the y-axis. The error bars represent 95% credible intervals. The dashed lines represent the minimal effect that the hypothesis test is based on. An effect is labelled as significant if bigger than the minimal effect according to the 95% credible interval. Facets represent the covariates in the model.

plots$credible_intervals_1D

We can plot the relationship between abundance and variability. As we can see below, they are positively correlated, you also appreciate that this relationship is by model for single cell RNA sequencing data.

sccomp models, these relationship to obtain a shrinkage effect on the estimates of both the abundance and the variability. This shrinkage is adaptive as it is modelled jointly, thanks for Bayesian inference.

plots$credible_intervals_2D

Contrasts

seurat_obj |>
  sccomp_estimate( 
    formula_composition = ~ 0 + type, 
    .sample = sample,
    .cell_group = cell_group, 
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  ) |> 
  sccomp_test( contrasts =  c("typecancer - typehealthy", "typehealthy - typecancer"))

## # A tibble: 60 × 18
##    cell_group  parameter factor c_lower c_effect c_upper   c_pH0   c_FDR c_n_eff
##    <chr>       <chr>     <chr>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 B immature  typecanc… <NA>    -1.79    -1.24   -0.676 0       0            NA
##  2 B immature  typeheal… <NA>     0.676    1.24    1.79  0       0            NA
##  3 B mem       typecanc… <NA>    -2.44    -1.78   -1.16  0       0            NA
##  4 B mem       typeheal… <NA>     1.16     1.78    2.44  0       0            NA
##  5 CD4 cm S10… typecanc… <NA>    -1.32    -0.927  -0.535 0       0            NA
##  6 CD4 cm S10… typeheal… <NA>     0.535    0.927   1.32  0       0            NA
##  7 CD4 cm hig… typecanc… <NA>     0.780    1.73    2.71  0.00100 1.67e-4      NA
##  8 CD4 cm hig… typeheal… <NA>    -2.71    -1.73   -0.780 0.00100 1.67e-4      NA
##  9 CD4 cm rib… typecanc… <NA>     0.398    1.05    1.66  0.00200 5.00e-4      NA
## 10 CD4 cm rib… typeheal… <NA>    -1.66    -1.05   -0.398 0.00200 5.00e-4      NA
## # ℹ 50 more rows
## # ℹ 9 more variables: c_R_k_hat <dbl>, v_lower <dbl>, v_effect <dbl>,
## #   v_upper <dbl>, v_pH0 <dbl>, v_FDR <dbl>, v_n_eff <dbl>, v_R_k_hat <dbl>,
## #   count_data <list>

Categorical factor (e.g. Bayesian ANOVA)

This is achieved through model comparison with loo. In the following example, the model with association with factors better fits the data compared to the baseline model with no factor association. For comparisons check_outliers must be set to FALSE as the leave-one-out must work with the same amount of data, while outlier elimination does not guarantee it.

If elpd_diff is away from zero of > 5 se_diff difference of 5, we are confident that a model is better than the other reference. In this case, -79.9 / 11.5 = -6.9, therefore we can conclude that model one, the one with factor association, is better than model two.

library(loo)

# Fit first model
model_with_factor_association = 
  seurat_obj |>
  sccomp_estimate( 
    formula_composition = ~ type, 
    .sample =  sample, 
    .cell_group = cell_group, 
    bimodal_mean_variability_association = TRUE,
    cores = 1, 
    enable_loo = TRUE
  )

# Fit second model
model_without_association = 
  seurat_obj |>
  sccomp_estimate( 
    formula_composition = ~ 1, 
    .sample =  sample, 
    .cell_group = cell_group, 
    bimodal_mean_variability_association = TRUE,
    cores = 1 , 
    enable_loo = TRUE
  )

# Compare models
loo_compare(
  model_with_factor_association |> attr("fit") |> loo(),
  model_without_association |> attr("fit") |> loo()
)

Differential variability, binary factor

We can model the cell-group variability also dependent on the type, and so test differences in variability

res = 
  seurat_obj |>
  sccomp_estimate( 
    formula_composition = ~ type, 
    formula_variability = ~ type,
    .sample = sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1, verbose = FALSE
  )

res

## # A tibble: 60 × 14
##    cell_group        parameter factor c_lower c_effect c_upper c_n_eff c_R_k_hat
##    <chr>             <chr>     <chr>    <dbl>    <dbl>   <dbl>   <dbl>     <dbl>
##  1 B immature        (Interce… <NA>    0.225     0.634   1.03      NaN      3.17
##  2 B immature        typeheal… type    1.20      1.80    2.36      NaN      3.15
##  3 B mem             (Interce… <NA>   -1.26     -0.864  -0.466     NaN      3.14
##  4 B mem             typeheal… type    1.26      1.81    2.46      NaN      3.14
##  5 CD4 cm S100A4     (Interce… <NA>    1.48      1.74    2.02      NaN      3.09
##  6 CD4 cm S100A4     typeheal… type    0.575     0.967   1.38      NaN      3.12
##  7 CD4 cm high cyto… (Interce… <NA>   -1.17     -0.655  -0.182     NaN      3.03
##  8 CD4 cm high cyto… typeheal… type   -1.97     -1.27   -0.610     NaN      3.20
##  9 CD4 cm ribosome   (Interce… <NA>   -0.0221    0.411   0.848     NaN      3.17
## 10 CD4 cm ribosome   typeheal… type   -1.60     -0.974  -0.420     NaN      3.20
## # ℹ 50 more rows
## # ℹ 6 more variables: v_lower <dbl>, v_effect <dbl>, v_upper <dbl>,
## #   v_n_eff <dbl>, v_R_k_hat <dbl>, count_data <list>

Suggested settings

For single-cell RNA sequencing

We recommend setting bimodal_mean_variability_association = TRUE. The bimodality of the mean-variability association can be confirmed from the plots$credible_intervals_2D (see below).

For CyTOF and microbiome data

We recommend setting bimodal_mean_variability_association = FALSE (Default).

Visualisation of the MCMC chains from the posterior distribution

It is possible to directly evaluate the posterior distribution. In this example, we plot the Monte Carlo chain for the slope parameter of the first cell type. We can see that it has converged and is negative with probability 1.

res %>% attr("fit") %>% rstan::traceplot("beta[2,1]")

Plot 1D significance plot

plots = res |> sccomp_test() |> plot()

## Joining with `by = join_by(cell_group, sample)`
## Joining with `by = join_by(cell_group, type)`

plots$credible_intervals_1D

Plot 2D significance plot. Data points are cell groups. Error bars are the 95% credible interval. The dashed lines represent the default threshold fold change for which the probabilities (c_pH0, v_pH0) are calculated. pH0 of 0 represent the rejection of the null hypothesis that no effect is observed.

This plot is provided only if differential variability has been tested. The differential variability estimates are reliable only if the linear association between mean and variability for (intercept) (left-hand side facet) is satisfied. A scatterplot (besides the Intercept) is provided for each category of interest. The for each category of interest, the composition and variability effects should be generally uncorrelated.

plots$credible_intervals_2D

The old framework

The new tidy framework was introduced in 2024, two, understand the differences and improvements. Compared to the old framework, please read this blog post.

sccomp's People

Contributors

Stargazers

Watchers

Forkers

jeffreypullin andrjohns sachinkavindaa mayunlong89 animesh

sccomp's Issues

sccomp plot method -> rstan::gqs throws warning

I have not been able to prepare a self-contained reprex for this problem, and the fitted object saved as rds file is too big to share via GitHub.

Here is what I see in the console:

... # previous lines of code not relevant here
sctest <- sccomp::sccomp_test(s_collembola)
plots <- plot(sctest)

returns:

Joining with `by = join_by(taxon_id, sample_id)`
Joining with `by = join_by(taxon_id, Landgebruik_MBAG)`
Joining with `by = join_by(taxon_id, sample_id)`
Joining with `by = join_by(taxon_id, Diepte)`
Warning messages:
1: In validityMethod(object) :
  The following variables have undefined values:  . Many subsequent functions will not work correctly.
2: In validityMethod(object) :
  The following variables have undefined values:  . Many subsequent functions will not work correctly.

I debugged this and these warnings messages are thrown by rstan::gqs(). Do you know what this indicates? I think it has to do with the generation of posterior predictive checks (blue outlined boxplots). The warning maybe tells us that "undefined values" were generated? Maybe integer overflow?

Warning message: In validityMethod(object)

Hello!

I am running sccomp on single-cell protein data (like cite-seq), using the following command:

sccomp_obj = seurat_obj |> 
    sccomp::sccomp_glm( formula_composition = ~ type, 
                        formula_variability = ~ 1, 
                        percent_false_positive = 5, 
                        .sample = sample, 
                        .cell_group = cell_group)
  
  return(sccomp_obj)

I get the following warning, could it be due to the fact that is not scRNA data, or should I use another approach?

Warning message:
In validityMethod(object) :
  The following variables have undefined values:  log_lik[1],The following variables have undefined values:  log_lik[2],The following variables have undefined values:  log_lik[3],The following variables have undefined values:  log_lik[4],The following variables have undefined values:  log_lik[5],The following variables have undefined values:  log_lik[6],The following variables have undefined values:  log_lik[7],The following variables have undefined values:  log_lik[8],The following variables have undefined values:  log_lik[9],The following variables have undefined values:  log_lik[10],The following variables have undefined values:  log_lik[11],The following variables have undefined values:  log_lik[12],The following variables have undefined values:  log_lik[13],The following variables have undefined values:  log_lik[14],The following variables have undefined values:  log_lik[15],The following variables have undefined values:  log_lik[16],The following variables have undefined values:  l [... truncated]

Thank you!!

Marta

Download real-world datasets (possibly cell cluster and subject covariate labels), for benchmarking purposes.

If you dig in their github repository you will likely find the data already summarised for you in form of R script for reproducibility, so you have to avoid the 95% of work in getting raw single-cell data.

Used by this article: https://www.pnas.org/content/118/22/e2100293118

Segerstolpe A, Palasantza A, Eliasson P, Andersson EM, Andreasson AC, Sun X, Picelli S, Sabirsh A, Clausen M, Bjursell MK, Smith DM, Kasper M, Ammala C, Sandberg R. Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metab. 2016; 24(4):593–607.

RNA counts (matrix with genes for rows and cells for columns, integer transcription abundance as values, you need also the dataset for cell annotation, to link cells with subjects).
Cluster counts (matrix with categories for rows and subjects for columns, and integer count as values)

Delile J, Rayon T, Melchionda M, Edwards A, Briscoe J, Sagner A. Single cell transcriptomics reveals spatial and temporal dynamics of gene expression in the developing mouse spinal cord. Development. 2019. https://doi.org/10.1242/dev.173807.

RNA counts
Cluster counts

Used by this article: https://www.pnas.org/content/118/22/e2100293118

M. Sade-Feldman et al., Defining t cell states associated with response to checkpoint immunotherapy in melanoma.

RNA counts
Cluster counts

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120575

RNA counts
Cluster counts

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE122043

RNA counts
Cluster counts

K. Gupta et al., Single-cell analysis reveals a hair follicle dermal niche molecular differentiation trajectory that begins prior to morphogenesis. Dev. Cell 48, 17–31 (2019).

RNA counts
Cluster counts

X. Fan et al., Single cell and open chromatin analysis reveals molecular origin of epidermal cells of the skin. Dev. Cell 47, 21–37 (2018).

RNA counts
Cluster counts

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE102086

RNA counts
Cluster counts

https://ndownloader.figshare.com/files/22927382

RNA counts
Cluster counts

R. L. Chua et al., Covid-19 severity correlates with airway epithelium–immune cell interactions identified by single-cell analysis. Nat. Biotechnol. 38, 970–979 (2020).

RNA counts
Cluster counts

M. Liao et al., Single-cell landscape of bronchoalveolar immune cells in patients with covid-19. Nat. Med. 26, 842–844 (2020)

RNA counts
Cluster counts

https://cells.ucsc.edu/covid19-balf/nCoV.rds

RNA counts
Cluster counts

https://singlecell.broadinstitute.org/single_cell/study/SCP263/aging-mouse-brain#/

RNA counts
Cluster counts

M. Ximerakis et al., Single-cell transcriptomic profiling of the aging mouse brain. Nat. Neurosci. 22, 1696–1708 (2019).

RNA counts
Cluster counts

Used by this article: https://www.biorxiv.org/content/10.1101/2020.12.14.422688v1.full

https://singlecell.broadinstitute.org/single_cell/study/SCP259

RNA counts
Cluster counts

https://github.com/zhangzlab/covid_balf

RNA counts
Cluster counts

https://singlecell.broadinstitute.org/single_cell/study/SCP44

RNA counts
Cluster counts

custom ggplot theme misses theme element

The following reprex errors due to legend.text.align theme element is not defined in the element.
I'm pretty sure this used to work, but I recently updated all my packages. My guess is that something changed in ggplot2 that causes this:

library(sccomp)
s_factor <- seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type + group2__,
    .sample =  sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1
  )
#> sccomp says: estimation
#> sccomp says: the composition design matrix has columns: (Intercept), typehealthy, group2__GROUP22
#> sccomp says: the variability design matrix has columns: (Intercept)
#> 
#> SAMPLING FOR MODEL 'glm_multi_beta_binomial' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 0.000765 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 7.65 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 4300 [  0%]  (Warmup)
#> Chain 1: Iteration:  301 / 4300 [  7%]  (Sampling)
#> Chain 1: Iteration: 1300 / 4300 [ 30%]  (Sampling)
#> Chain 1: Iteration: 2300 / 4300 [ 53%]  (Sampling)
#> Chain 1: Iteration: 3300 / 4300 [ 76%]  (Sampling)
#> Chain 1: Iteration: 4300 / 4300 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 3.598 seconds (Warm-up)
#> Chain 1:                28.47 seconds (Sampling)
#> Chain 1:                32.068 seconds (Total)
#> Chain 1:
s_factor <- sccomp_test(s_factor)
plots <- plot(s_factor)
#> Joining with `by = join_by(cell_group, sample)`
#> Joining with `by = join_by(cell_group, type)`
#> Joining with `by = join_by(cell_group, sample)`
#> Joining with `by = join_by(cell_group, group2__)`
plots$credible_intervals_1D
#> Error in `plot_theme()`:
#> ! The `legend.text.align` theme element is not defined in the element
#>   hierarchy.

^{Created on 2024-02-29 with reprex v2.1.0}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Belgium.utf8
#>  ctype    Dutch_Belgium.utf8
#>  tz       Europe/Brussels
#>  date     2024-02-29
#>  pandoc   3.1.12.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package              * version   date (UTC) lib source
#>    abind                  1.4-5     2016-07-21 [1] CRAN (R 4.3.0)
#>    Biobase                2.62.0    2023-10-24 [1] Bioconductor
#>    BiocGenerics           0.48.1    2023-11-01 [1] Bioconductor
#>    bitops                 1.0-7     2021-04-24 [1] CRAN (R 4.3.0)
#>    boot                   1.3-30    2024-02-26 [2] CRAN (R 4.3.2)
#>    cli                    3.6.2     2023-12-11 [1] CRAN (R 4.3.2)
#>    codetools              0.2-19    2023-02-01 [2] CRAN (R 4.3.2)
#>    colorspace             2.1-0     2023-01-23 [1] CRAN (R 4.3.0)
#>    crayon                 1.5.2     2022-09-29 [1] CRAN (R 4.3.0)
#>    curl                   5.2.0     2023-12-08 [1] CRAN (R 4.3.2)
#>    DelayedArray           0.28.0    2023-10-24 [1] Bioconductor
#>    digest                 0.6.34    2024-01-11 [1] CRAN (R 4.3.2)
#>    dotCall64              1.1-1     2023-11-28 [1] CRAN (R 4.3.2)
#>    dplyr                  1.1.4     2023-11-17 [1] CRAN (R 4.3.2)
#>    evaluate               0.23      2023-11-01 [1] CRAN (R 4.3.2)
#>    fansi                  1.0.6     2023-12-08 [1] CRAN (R 4.3.2)
#>    fastmap                1.1.1     2023-02-24 [1] CRAN (R 4.3.0)
#>    forcats                1.0.0     2023-01-29 [1] CRAN (R 4.3.0)
#>    fs                     1.6.3     2023-07-20 [1] CRAN (R 4.3.1)
#>    future                 1.33.1    2023-12-22 [1] CRAN (R 4.3.2)
#>    future.apply           1.11.1    2023-12-21 [1] CRAN (R 4.3.2)
#>    generics               0.1.3     2022-07-05 [1] CRAN (R 4.3.0)
#>    GenomeInfoDb           1.38.6    2024-02-08 [1] Bioconductor 3.18 (R 4.3.2)
#>    GenomeInfoDbData       1.2.10    2023-10-05 [1] Bioconductor
#>    GenomicRanges          1.54.1    2023-10-29 [1] Bioconductor
#>    ggplot2                3.5.0     2024-02-23 [1] CRAN (R 4.3.2)
#>    ggrepel                0.9.5     2024-01-10 [1] CRAN (R 4.3.2)
#>    globals                0.16.2    2022-11-21 [1] CRAN (R 4.3.0)
#>    glue                   1.7.0     2024-01-09 [1] CRAN (R 4.3.2)
#>    gridExtra              2.3       2017-09-09 [1] CRAN (R 4.3.0)
#>    gtable                 0.3.4     2023-08-21 [1] CRAN (R 4.3.1)
#>    hms                    1.1.3     2023-03-21 [1] CRAN (R 4.3.0)
#>    htmltools              0.5.7     2023-11-03 [1] CRAN (R 4.3.2)
#>    inline                 0.3.19    2021-05-31 [1] CRAN (R 4.3.0)
#>    IRanges                2.36.0    2023-10-24 [1] Bioconductor
#>    jsonlite               1.8.8     2023-12-04 [1] CRAN (R 4.3.2)
#>    knitr                  1.45      2023-10-30 [1] CRAN (R 4.3.2)
#>    labeling               0.4.3     2023-08-29 [1] CRAN (R 4.3.1)
#>    lattice                0.22-5    2023-10-24 [1] CRAN (R 4.3.2)
#>    lifecycle              1.0.4     2023-11-07 [1] CRAN (R 4.3.2)
#>    listenv                0.9.1     2024-01-29 [1] CRAN (R 4.3.2)
#>    loo                    2.7.0     2024-02-24 [1] CRAN (R 4.3.2)
#>    magrittr               2.0.3     2022-03-30 [1] CRAN (R 4.3.0)
#>    Matrix                 1.6-5     2024-01-11 [1] CRAN (R 4.3.2)
#>    MatrixGenerics         1.14.0    2023-10-24 [1] Bioconductor
#>    matrixStats            1.2.0     2023-12-11 [1] CRAN (R 4.3.2)
#>    munsell                0.5.0     2018-06-12 [1] CRAN (R 4.3.0)
#>    parallelly             1.37.0    2024-02-14 [1] CRAN (R 4.3.2)
#>    patchwork              1.2.0     2024-01-08 [1] CRAN (R 4.3.2)
#>    pillar                 1.9.0     2023-03-22 [1] CRAN (R 4.3.0)
#>    pkgbuild               1.4.3     2023-12-10 [1] CRAN (R 4.3.2)
#>    pkgconfig              2.0.3     2019-09-22 [1] CRAN (R 4.3.0)
#>    progressr              0.14.0    2023-08-10 [1] CRAN (R 4.3.1)
#>    purrr                  1.0.2     2023-08-10 [1] CRAN (R 4.3.1)
#>    QuickJSR               1.1.3     2024-01-31 [1] CRAN (R 4.3.2)
#>    R.cache                0.16.0    2022-07-21 [1] CRAN (R 4.3.0)
#>    R.methodsS3            1.8.2     2022-06-13 [1] CRAN (R 4.3.0)
#>    R.oo                   1.26.0    2024-01-24 [1] CRAN (R 4.3.2)
#>    R.utils                2.12.3    2023-11-18 [1] CRAN (R 4.3.2)
#>    R6                     2.5.1     2021-08-19 [1] CRAN (R 4.3.0)
#>    RColorBrewer           1.1-3     2022-04-03 [1] CRAN (R 4.3.0)
#>    Rcpp                   1.0.12    2024-01-09 [1] CRAN (R 4.3.2)
#>  D RcppParallel           5.1.7     2023-02-27 [1] CRAN (R 4.3.0)
#>    RCurl                  1.98-1.14 2024-01-09 [1] CRAN (R 4.3.2)
#>    readr                  2.1.5     2024-01-10 [1] CRAN (R 4.3.2)
#>    reprex                 2.1.0     2024-01-11 [1] CRAN (R 4.3.2)
#>    rlang                  1.1.3     2024-01-10 [1] CRAN (R 4.3.2)
#>    rmarkdown              2.25      2023-09-18 [1] CRAN (R 4.3.1)
#>    rstan                  2.32.5    2024-01-10 [1] CRAN (R 4.3.2)
#>    rstantools             2.4.0     2024-01-31 [1] CRAN (R 4.3.2)
#>    rstudioapi             0.15.0    2023-07-07 [1] CRAN (R 4.3.1)
#>    S4Arrays               1.2.0     2023-10-24 [1] Bioconductor
#>    S4Vectors              0.40.2    2023-11-23 [1] Bioconductor
#>    scales                 1.3.0     2023-11-28 [1] CRAN (R 4.3.2)
#>    sccomp               * 1.7.5     2024-02-28 [1] Github (stemangiola/sccomp@e1b4de2)
#>    sessioninfo            1.2.2     2021-12-06 [1] CRAN (R 4.3.0)
#>    SeuratObject           5.0.1     2023-11-17 [1] CRAN (R 4.3.2)
#>    SingleCellExperiment   1.24.0    2023-10-24 [1] Bioconductor
#>    sp                     2.1-3     2024-01-30 [1] CRAN (R 4.3.2)
#>    spam                   2.10-0    2023-10-23 [1] CRAN (R 4.3.2)
#>    SparseArray            1.2.4     2024-02-11 [1] Bioconductor 3.18 (R 4.3.2)
#>    StanHeaders            2.32.5    2024-01-10 [1] CRAN (R 4.3.2)
#>    stringi                1.8.3     2023-12-11 [1] CRAN (R 4.3.2)
#>    stringr                1.5.1     2023-11-14 [1] CRAN (R 4.3.2)
#>    styler                 1.10.2    2023-08-29 [1] CRAN (R 4.3.1)
#>    SummarizedExperiment   1.32.0    2023-10-24 [1] Bioconductor
#>    tibble                 3.2.1     2023-03-20 [1] CRAN (R 4.3.0)
#>    tidyr                  1.3.1     2024-01-24 [1] CRAN (R 4.3.2)
#>    tidyselect             1.2.0     2022-10-10 [1] CRAN (R 4.3.0)
#>    tzdb                   0.4.0     2023-05-12 [1] CRAN (R 4.3.1)
#>    utf8                   1.2.4     2023-10-22 [1] CRAN (R 4.3.2)
#>    V8                     4.4.2     2024-02-15 [1] CRAN (R 4.3.2)
#>    vctrs                  0.6.5     2023-12-01 [1] CRAN (R 4.3.2)
#>    withr                  3.0.0     2024-01-16 [1] CRAN (R 4.3.2)
#>    xfun                   0.41      2023-11-01 [1] CRAN (R 4.3.2)
#>    XVector                0.42.0    2023-10-24 [1] Bioconductor
#>    yaml                   2.3.8     2023-12-11 [1] CRAN (R 4.3.2)
#>    zlibbioc               1.48.0    2023-10-24 [1] Bioconductor
#> 
#>  [1] C:/R/library
#>  [2] C:/R/R-4.3.2/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

issue creating res object - scoop

Hello!

I am trying to create res object from Seurat object with the code provided (below), but it does not seem to work. Is there an alternative option to create this res object?

res =
  seurat_obj |>
   formula_composition = ~ type, 
    formula_variability = ~ 1, 
    sample, 
    cell_group 
  )

Many thanks!
Marta

contrasts

I am not able to performe the sccomp_glm with contrasts.
Even using the "counts_obj" from this repository, the following error comes out:

"Error in sccomp_glm(counts_obj, formula_composition = ~0 + type, formula_variability = ~1, :
unused argument (contrasts = c(typecancer - typebenign, typebenign - typecancer))"

stanfit correspondence parameters variable names

Hi!

If we want to do postprocessing based on the stan-fit model, it is difficult to determine the correspondence between the names used for parameters in the stan-fit object and those used in - for instance - the print method of an sccomp object.

Here is an example taken from the README page:

res %>% attr("fit") %>% rstan::traceplot("beta[2,1]")

Is there a (maybe internal) function to relate beta[2,1] and the like to parameter names based on names of covariates used in the model?

FDR calculation

How was the FDR calculated, with the Benjamini-Hochberg procedure?
I was not able to find it in the paper.

Thanks!

Does sccomp require replicates ?

Thanks for developing such a tool.

I would like to know if replicates are mandatory or not to use sccomp.

To provide you with some context, I am doing case-control scRNAseq. I have 1 dataset that is my control cardioWT, and 1 dataset where there was a gene knock-out cardioKO.

In my code, I did

[email protected]$cell_group = [email protected]$celltypes
[email protected]$sample = [email protected]$orig.ident   # control and KO 
[email protected]$type = [email protected]$orig.ident

With this setting, I can run the sccomp analysis. However, I wonder why I have a single value per type, per cell types. (Here, my cell types are the cluster numbers, as I am still working on manual annotation).

##### Binary factor
binF <- obj |>
    sccomp_estimate( 
        formula_composition = ~ type,
        # formula_variability = ~ type,
        .sample =  sample, 
        .cell_group = cell_group, 
        bimodal_mean_variability_association = TRUE,
        cores = 1 
    ) |> 
    sccomp_remove_outliers() |> 
    sccomp_test(test_composition_above_logit_fold_change = 0.2)

binF

plots_binF = plot(binF) 

plots_binF$boxplot
ggsave("binary_factor.png")

My guess is because I don't have replicates.
Back to the questions:

are replicates mandatory for sccomp ?
if not, I feel that I am not able to work with single-point results. What would be your suggestion to overcome this situation?

Many thanks!!

Fig 3B and 3 group comparisson

Hello! thank you for developing this package!. On my dataset it produces robust results and beautiful plots. For my work I need to compare 3 groups, like in the figure 3b of your manuscript. Would you mind sharing the code for how that panel was compared?. After looking at the repository and vignettes I cant seem to find how to compare 3 groups. Thanks a lot!!

sccomp_boxplot() errors when passing a factor variable

Reprex:

library(sccomp)
s_factor <- seurat_obj |>
  sccomp_estimate(
    formula_composition = ~ type + group2__,
    .sample =  sample,
    .cell_group = cell_group,
    bimodal_mean_variability_association = TRUE,
    cores = 1
  )
#> sccomp says: estimation
#> sccomp says: the composition design matrix has columns: (Intercept), typehealthy, group2__GROUP22
#> sccomp says: the variability design matrix has columns: (Intercept)
#> 
#> SAMPLING FOR MODEL 'glm_multi_beta_binomial' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 0.000653 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 6.53 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 4300 [  0%]  (Warmup)
#> Chain 1: Iteration:  301 / 4300 [  7%]  (Sampling)
#> Chain 1: Iteration: 1300 / 4300 [ 30%]  (Sampling)
#> Chain 1: Iteration: 2300 / 4300 [ 53%]  (Sampling)
#> Chain 1: Iteration: 3300 / 4300 [ 76%]  (Sampling)
#> Chain 1: Iteration: 4300 / 4300 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 3.801 seconds (Warm-up)
#> Chain 1:                32.613 seconds (Sampling)
#> Chain 1:                36.414 seconds (Total)
#> Chain 1:
s_factor <- sccomp_test(s_factor)
sccomp_boxplot(s_factor, "type")
#> Joining with `by = join_by(cell_group, sample)`
#> Joining with `by = join_by(cell_group, type)`
#> Error in sccomp_boxplot(s_factor, "type"): object '.x' not found
rlang::last_trace()
#> Error: Can't show last error because no error was recorded yet

^{Created on 2024-02-29 with reprex v2.1.0}

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.2 (2023-10-31 ucrt)
#>  os       Windows 10 x64 (build 19045)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Belgium.utf8
#>  ctype    Dutch_Belgium.utf8
#>  tz       Europe/Brussels
#>  date     2024-02-29
#>  pandoc   3.1.12.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  ! package              * version   date (UTC) lib source
#>    abind                  1.4-5     2016-07-21 [1] CRAN (R 4.3.0)
#>    Biobase                2.62.0    2023-10-24 [1] Bioconductor
#>    BiocGenerics           0.48.1    2023-11-01 [1] Bioconductor
#>    bitops                 1.0-7     2021-04-24 [1] CRAN (R 4.3.0)
#>    boot                   1.3-30    2024-02-26 [2] CRAN (R 4.3.2)
#>    cli                    3.6.2     2023-12-11 [1] CRAN (R 4.3.2)
#>    codetools              0.2-19    2023-02-01 [2] CRAN (R 4.3.2)
#>    colorspace             2.1-0     2023-01-23 [1] CRAN (R 4.3.0)
#>    crayon                 1.5.2     2022-09-29 [1] CRAN (R 4.3.0)
#>    curl                   5.2.0     2023-12-08 [1] CRAN (R 4.3.2)
#>    DelayedArray           0.28.0    2023-10-24 [1] Bioconductor
#>    digest                 0.6.34    2024-01-11 [1] CRAN (R 4.3.2)
#>    dotCall64              1.1-1     2023-11-28 [1] CRAN (R 4.3.2)
#>    dplyr                  1.1.4     2023-11-17 [1] CRAN (R 4.3.2)
#>    evaluate               0.23      2023-11-01 [1] CRAN (R 4.3.2)
#>    fansi                  1.0.6     2023-12-08 [1] CRAN (R 4.3.2)
#>    fastmap                1.1.1     2023-02-24 [1] CRAN (R 4.3.0)
#>    forcats                1.0.0     2023-01-29 [1] CRAN (R 4.3.0)
#>    fs                     1.6.3     2023-07-20 [1] CRAN (R 4.3.1)
#>    future                 1.33.1    2023-12-22 [1] CRAN (R 4.3.2)
#>    future.apply           1.11.1    2023-12-21 [1] CRAN (R 4.3.2)
#>    generics               0.1.3     2022-07-05 [1] CRAN (R 4.3.0)
#>    GenomeInfoDb           1.38.6    2024-02-08 [1] Bioconductor 3.18 (R 4.3.2)
#>    GenomeInfoDbData       1.2.10    2023-10-05 [1] Bioconductor
#>    GenomicRanges          1.54.1    2023-10-29 [1] Bioconductor
#>    ggplot2                3.5.0     2024-02-23 [1] CRAN (R 4.3.2)
#>    ggrepel                0.9.5     2024-01-10 [1] CRAN (R 4.3.2)
#>    globals                0.16.2    2022-11-21 [1] CRAN (R 4.3.0)
#>    glue                   1.7.0     2024-01-09 [1] CRAN (R 4.3.2)
#>    gridExtra              2.3       2017-09-09 [1] CRAN (R 4.3.0)
#>    gtable                 0.3.4     2023-08-21 [1] CRAN (R 4.3.1)
#>    hms                    1.1.3     2023-03-21 [1] CRAN (R 4.3.0)
#>    htmltools              0.5.7     2023-11-03 [1] CRAN (R 4.3.2)
#>    inline                 0.3.19    2021-05-31 [1] CRAN (R 4.3.0)
#>    IRanges                2.36.0    2023-10-24 [1] Bioconductor
#>    jsonlite               1.8.8     2023-12-04 [1] CRAN (R 4.3.2)
#>    knitr                  1.45      2023-10-30 [1] CRAN (R 4.3.2)
#>    lattice                0.22-5    2023-10-24 [1] CRAN (R 4.3.2)
#>    lifecycle              1.0.4     2023-11-07 [1] CRAN (R 4.3.2)
#>    listenv                0.9.1     2024-01-29 [1] CRAN (R 4.3.2)
#>    loo                    2.7.0     2024-02-24 [1] CRAN (R 4.3.2)
#>    magrittr               2.0.3     2022-03-30 [1] CRAN (R 4.3.0)
#>    Matrix                 1.6-5     2024-01-11 [1] CRAN (R 4.3.2)
#>    MatrixGenerics         1.14.0    2023-10-24 [1] Bioconductor
#>    matrixStats            1.2.0     2023-12-11 [1] CRAN (R 4.3.2)
#>    munsell                0.5.0     2018-06-12 [1] CRAN (R 4.3.0)
#>    parallelly             1.37.0    2024-02-14 [1] CRAN (R 4.3.2)
#>    patchwork              1.2.0     2024-01-08 [1] CRAN (R 4.3.2)
#>    pillar                 1.9.0     2023-03-22 [1] CRAN (R 4.3.0)
#>    pkgbuild               1.4.3     2023-12-10 [1] CRAN (R 4.3.2)
#>    pkgconfig              2.0.3     2019-09-22 [1] CRAN (R 4.3.0)
#>    progressr              0.14.0    2023-08-10 [1] CRAN (R 4.3.1)
#>    purrr                  1.0.2     2023-08-10 [1] CRAN (R 4.3.1)
#>    QuickJSR               1.1.3     2024-01-31 [1] CRAN (R 4.3.2)
#>    R.cache                0.16.0    2022-07-21 [1] CRAN (R 4.3.0)
#>    R.methodsS3            1.8.2     2022-06-13 [1] CRAN (R 4.3.0)
#>    R.oo                   1.26.0    2024-01-24 [1] CRAN (R 4.3.2)
#>    R.utils                2.12.3    2023-11-18 [1] CRAN (R 4.3.2)
#>    R6                     2.5.1     2021-08-19 [1] CRAN (R 4.3.0)
#>    Rcpp                   1.0.12    2024-01-09 [1] CRAN (R 4.3.2)
#>  D RcppParallel           5.1.7     2023-02-27 [1] CRAN (R 4.3.0)
#>    RCurl                  1.98-1.14 2024-01-09 [1] CRAN (R 4.3.2)
#>    readr                  2.1.5     2024-01-10 [1] CRAN (R 4.3.2)
#>    reprex                 2.1.0     2024-01-11 [1] CRAN (R 4.3.2)
#>    rlang                  1.1.3     2024-01-10 [1] CRAN (R 4.3.2)
#>    rmarkdown              2.25      2023-09-18 [1] CRAN (R 4.3.1)
#>    rstan                  2.32.5    2024-01-10 [1] CRAN (R 4.3.2)
#>    rstantools             2.4.0     2024-01-31 [1] CRAN (R 4.3.2)
#>    rstudioapi             0.15.0    2023-07-07 [1] CRAN (R 4.3.1)
#>    S4Arrays               1.2.0     2023-10-24 [1] Bioconductor
#>    S4Vectors              0.40.2    2023-11-23 [1] Bioconductor
#>    scales                 1.3.0     2023-11-28 [1] CRAN (R 4.3.2)
#>    sccomp               * 1.7.5     2024-02-28 [1] Github (stemangiola/sccomp@e1b4de2)
#>    sessioninfo            1.2.2     2021-12-06 [1] CRAN (R 4.3.0)
#>    SeuratObject           5.0.1     2023-11-17 [1] CRAN (R 4.3.2)
#>    SingleCellExperiment   1.24.0    2023-10-24 [1] Bioconductor
#>    sp                     2.1-3     2024-01-30 [1] CRAN (R 4.3.2)
#>    spam                   2.10-0    2023-10-23 [1] CRAN (R 4.3.2)
#>    SparseArray            1.2.4     2024-02-11 [1] Bioconductor 3.18 (R 4.3.2)
#>    StanHeaders            2.32.5    2024-01-10 [1] CRAN (R 4.3.2)
#>    stringi                1.8.3     2023-12-11 [1] CRAN (R 4.3.2)
#>    stringr                1.5.1     2023-11-14 [1] CRAN (R 4.3.2)
#>    styler                 1.10.2    2023-08-29 [1] CRAN (R 4.3.1)
#>    SummarizedExperiment   1.32.0    2023-10-24 [1] Bioconductor
#>    tibble                 3.2.1     2023-03-20 [1] CRAN (R 4.3.0)
#>    tidyr                  1.3.1     2024-01-24 [1] CRAN (R 4.3.2)
#>    tidyselect             1.2.0     2022-10-10 [1] CRAN (R 4.3.0)
#>    tzdb                   0.4.0     2023-05-12 [1] CRAN (R 4.3.1)
#>    utf8                   1.2.4     2023-10-22 [1] CRAN (R 4.3.2)
#>    V8                     4.4.2     2024-02-15 [1] CRAN (R 4.3.2)
#>    vctrs                  0.6.5     2023-12-01 [1] CRAN (R 4.3.2)
#>    withr                  3.0.0     2024-01-16 [1] CRAN (R 4.3.2)
#>    xfun                   0.41      2023-11-01 [1] CRAN (R 4.3.2)
#>    XVector                0.42.0    2023-10-24 [1] Bioconductor
#>    yaml                   2.3.8     2023-12-11 [1] CRAN (R 4.3.2)
#>    zlibbioc               1.48.0    2023-10-24 [1] Bioconductor
#> 
#>  [1] C:/R/library
#>  [2] C:/R/R-4.3.2/library
#> 
#>  D ── DLL MD5 mismatch, broken installation.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

The reprex for some reason doesn't show trace, but here is a trace from another example where I encounter this:

<error/rlang_error>
Error in `distinct()`:
! Must use existing variables.
✖ `Landgebuik_MBAG` not found in `.data`.
---
Backtrace:
     ▆
  1. ├─sccomp::sccomp_boxplot(sctest, "Landgebuik_MBAG")
  2. │ └─sccomp:::plot_boxplot(...) at sccomp/R/methods.R:1377:3
  3. │   ├─dplyr::left_join(...) at sccomp/R/utilities.R:1709:5
  4. │   ├─dplyr:::left_join.data.frame(...)
  5. │   │ └─dplyr::auto_copy(x, y, copy = copy)
  6. │   │   ├─dplyr::same_src(x, y)
  7. │   │   └─dplyr:::same_src.data.frame(x, y)
  8. │   │     └─base::is.data.frame(y)
  9. │   └─data_proportion %>% ... at sccomp/R/utilities.R:1709:5
 10. ├─dplyr::distinct(...)
 11. └─dplyr:::distinct.data.frame(., !!as.symbol(factor_of_interest), !!.sample, !!.cell_group)
 12.   └─dplyr::distinct_prepare(...)
 13.     └─rlang::abort(bullets, call = error_call)

When debugging this, I notice this happens in the following line where you want to pull a column by name from a data.frame which does not contain that column (but it does contain a column named "factor"):

Error in pull(significance_colors, !!as.symbol(factor_of_interest)) : 
Caused by error:
! object 'Landgebruik_MBAG' not found

Test significant changes

Optimise dataframes

I think it would be beneficial to optimise the internal code to use data.table.

The package https://github.com/tidyverse/dtplyr promises to use dplyr grammar but with efficient data.table

Formula for repeated measures?

Hi Stefano,

Is it possible to use this package for repeated measures data?

Regards,
Mikhael

Warning message when duplicated samples

Hi, using scCOMP for the first time; the issue I'm running into is that for some reason cell_group and sample calling are not working appropriately from a Seurat object when running scCOMP.

Here is the code (contrasts have been pre-defined):

results1 <- seurat |> sccomp_glm( formula_composition = ~ 0 + disease_inflammation_response, contrasts = contrasts, .sample = sample, .cell_group = cell_type, bimodal_mean_variability_association = TRUE )

Here is the error message:

sccomp says: outlier identification first pass - step 1/3
Joining with by = join_by(sample)Joining with by = join_by(cell_group)error occurred during calling the sampler; sampling not done
error occurred during calling the sampler; sampling not done
error occurred during calling the sampler; sampling not done
error occurred during calling the sampler; sampling not done
error occurred during calling the sampler; sampling not done
error occurred during calling the sampler; sampling not done
here are whatever error messages were returned
[[1]]
Stan model 'glm_multi_beta_binomial' does not contain samples.

[[2]]
Stan model 'glm_multi_beta_binomial' does not contain samples.

[[3]]
Stan model 'glm_multi_beta_binomial' does not contain samples.

[[4]]
Stan model 'glm_multi_beta_binomial' does not contain samples.

[[5]]
Stan model 'glm_multi_beta_binomial' does not contain samples.

[[6]]
Stan model 'glm_multi_beta_binomial' does not contain samples.

Error in draws[, draws_colnames, drop = FALSE] :
no 'dimnames' attribute for array

Any help would be great.
Thanks!

Track benchmarks

The first big benchmark f58cd58

Multilevel model error

Hi,

Thanks for creating this tool. I'm trying to run a multilevel model, with the sample argument as animal_id. I receive the following bug. I also tried typecasting the variable as a factor but it still gives a similar error:

question about lumping or selecting compositional data

This is only a question, not a bug report.

Suppose I have many categories (components of the compositional data - the "cell group" identifier in the terminology of the package), but I am particularly interested in only a subset of them. What would you advice as the best way to deal with this in sccomp?

I'm thinking of the following options:

Lump together the not-so-interesting categories by summing the counts and use the lumped category (say "other") together with the categories of interest
Remove the not-so-interesting categories and only focus on the categories of interest
... other options?

Intuitively I would go for 1, because in that case predicting compositional proportions for the categories of interest from the model would reflect their relative share in the original composition. Whereas in case 2 the categories of interest will sum to 100% for every sample.

Best strategy for differential composition estimation in integrated Seurat object

I am working with an integrated seurat object that includes multiple datasets across multiple pathologies (6 pathological settings and one healthy control group.
I want to evaluate both the differential composition and variability of cell-types in each pathology vs controls (with contrast), in a way that accounts for the inter-subject and inter-dataset variability.
What would be the best code to do so?

Best regards and thank you for the very usefull tool.

Design and create a numerical generating data process

Design a flow diagram of the process
implement the process as an R function

Document Daniel's code

Hello @CastielZhao, in this issue you can describe your commits and ask questions about documenting @jonjamesjr code. Let's see if this issue format works.

Rather than using false_positive_percentage to set your stringency, we use a new column that I created that is called false_positive_rate.

We will test false positive rate of 0.01, 0.02, .... and plot real false positive rate vs. estimated.

For this you have to merge the changes present in dev into your branch, and install again.

For the sake of exercise you setup a fix false_positive_percentage to 1%

question: scope of the statistical model

The paper that describes the method mentions single-cell transcriptomics, CyTOF, and microbiome sequencing as applications. I have read the paper and description of the statistical model and I am applying the model to environmental DNA (eDNA) data from, e.g., soil samples (e.g. taxonomic group Annelidae). IMO the statistical model can also be applied to such data, but maybe I overlooked something and would like to ask if you can confirm that the statistical model is suitable to analyse eDNA data. Data from eDNA is identical in nature to data from microbiome studies that use 16S rRNA amplicon sequencing.

dirichlet_multinomial noise model doesn't seem to function

Hi, when I tried to change the noise model from the default beta-binomial to Dirichlet multinomial, it didn't seem to work. I wonder what could the solution be as I'm trying to benchmark sccomp against other methods.

res <- seurat_obj |>
sccomp_glm(
formula_composition = ~ type,
formula_variability = ~ type, #if you want to have variability analysis
.sample = sample,
.cell_group = cell_group,
#.count = count, only add this line for count object
bimodal_mean_variability_association = T,
noise_model = "dirichlet_multinomial",
#cores = 1
)

sccomp_glm with numerical covariates

Hi I have been testing sccomp_glm and it looks really awsome so far.
I learnt from sccomp preprint that "Sccomp is compatible with complex experimental designs, including discrete and continuous covariates." The discrete case works well, and I was follwing the scomp README on github. But I am wondering how to test the association between cell type composition and a numerical (continuous) covariate? I tried something like below, but it does not work. Would you mind sharing your ideas?

counts_obj |>
sccomp_glm(
formula_composition = ~ my_continuous_score,
.sample = sample,
.cell_group = cell_group,
.count = count,
bimodal_mean_variability_association = TRUE,
cores = 1
)

Relationship between CI, minimal effect threshold, and FDR on plots$credible_intervals_1D

Hi,

I was trying to understand the relationship between CI, minimal effect threshold, and FDR on plots$credible_intervals_1D. One example is given in the Readme:

For c_effects (on the left), "The dashed lines represent the minimal effect that the hypothesis test is based on. An effect is labelled as significant if bigger than the minimal effect according to the 95% credible interval", as you mentioned in Readme.

My understanding is that minimal effect = 0.2 by default. In general, a celltype whose CI is far away from 0 is considered as significant. That means, if a cell type has c_effect < 0, it will be signficant (c_FDR<0.05) if its c_upper < -0.2 ; if a cell type has c_effect > 0, it will be signficant (c_FDR<0.05) if its c_lower > 0.2.

However the plots$credible_intervals_1D shows differently than my understanding. i.e. I imagine NK cells in c typehealthy should be non-signficant, because its c_lower is less than 0.2 (as indicated by grey vertical line). But on the plots$credible_intervals_1D it turns out to be signficant.

Can you help me to clarify how should I understand the relationship between CI, minimal effect threshold, and FDR in general?

type parameter in formula composition

Hello!

Should the significant differences for composition and/or group-specific variability change if the order of the variables in the parameter type change? If I change the order in which they appear, e.g. 1. normal, 2. disease to 1. disease, 2. normal, the results change.

Thanks,
Marta

Interaction terms in sccomp_glm

I have been using sccomp in my daily research and you guys have always been very helpful to answer my technical questions. :-) thank you! I just got a new quesion regarding specifying interaction terms in formula. To be specific, I would like to add interaction terms to my formula like below:

sccomp_glm(
formula_composition = ~ gender:treatment
....
),

and contrast like this:
"male_untreated - male_treated"

I wonder would this be possible?

Fix an error with sccomp_glm function

With regard to the input data (frames), the sample, cell_group, as well as type needs to be in format of factor.

Use the numerical generative process to calibrate the model

Does the false positive rate we claim (e.g. 0.05) correspond to 5% of false positives given our no-association, no-outlier simulated data?

Calibration:

inference of associations. (read https://doi.org/10.1093/nargab/lqab005)
inference of outliers

FDR parameter

Hi Stefano,

Is there a way to modify the FDR parameter, or is FDR = 0.025 a fixed value? I cannot find it in the sccomp_glm function.

Thank you!
Marta

throw error if missing values due to pivoting long

Thanks for developing this package. Very interesting modelling technique which I am now trying it out on microbiome data.

The .data argument seems to implicitly require that every possible combination of .cell_group and .sample is included.
However, if the data contain counts equal to zero, they are often left out for this long data format. If they are left out, the user will receive an error saying that the Y data contain NA. These NA come from internally pivoting from long to wide format (sample x cell_group) without passing a default "fill" value of 0. If this is indeed what I think is happening, I suggest that internally the 0's are filled in for missing combinations of .cell_group and .sample.

Question about "test_composition_above_logit_fold_change" parameter in sccomp_glm

Hi sccomp team,

I got a few questions about the "test_composition_above_logit_fold_change" parameter in sccomp_glm() and I am wondering if I can get some inputs from you:

I believe this parameter represent the "fold-change threshold (0.2 by default)" as described in Method section of the PNAS paper. I think the "fold-change" referred here is in fact the foldchange of logit-transformed cell type porportion, but not the origional cell tyoe porportion value, which are bounded to [0, 1]?
I hope to identify differential abudnant cell types between two groups of samples, by looking for the cells that are 1) statisitcally significant, and 2) has a porportion fold-change bigger than 2. My understanding is that test_composition_above_logit_fold_change does not represent foldchnage of cell type porportion (which is I want), but logit-transformed porportion instead. So my ad hoc solution is first set test_composition_above_logit_fold_change = 0, then manually calculare fold change between two groups of porportion of each cell type, and select the cell types that have c_FDR < 0.05 and foldchange >2. Does this make sense? Or I should use a differnet approach?

Interpretation when both bars are highlighted versus one

Hi all,

Thank you for the great package. I'm unclear of one thing, which is how to interpret data when one bar is colored as significant vs when both are in the boxplot.

Example:

use of inv_logit vs softmax

Hi,
This is more a question than an issue, I am trying to use your code from dev/stan_models/glm_beta_binomial.stan and on line 94 I don't understand why you use the inv_logit function and not the softmax function ? As I understand from the paper, on the equation [6], this is indeed the softmax function that you want to use? I'm a bit confused.
Thanks!

Summary of meetings

Hello @CastielZhao and @heejungshim,

here Castiel will add from today the summary of our meetings detailing what has been discussed, what and how to do what we have discussed in the meetings.

Castiel, this is how to open an issue regarding the code and tagging people to get an answer

Hello @CastielZhao, this is an example of how you can tag people (e.g. @stemangiola @jonjamesjr) and ask questions on the code or commits.

@jonjamesjr Castiel is documenting each function argument and each line of code to get a deep understanding of what is doing what.

The file is this https://github.com/stemangiola/sccomp/blob/dirichlet-negative-multinomial/dev/daniel_code.R

Warings from sccomp_remove_outliers() when sample size is large

I am working on a dataset with two categories, disease and healthy.

There are 46 samples in disease groups. I noticed that when I run sccomp_remove_outliers(), I always get a warning, saying "sccomp says: the number of draws used to defined quantiles of the posterior distribution is capped to 20K. This means that for very low probability threshold the quantile could become unreliable. We suggest to limit the probability threshold between 0.1 and 0.01"

I tracked down this warnings, and it seems it is because of this part in fit_model() function:

    output_samples =
      (draws_supporting_quantile/((1-quantile)/2)) %>% # /2 because I have two tails
      max(4000) 
    
    if(output_samples > max_sampling_iterations) {
      warning("sccomp says: the number of draws used to defined quantiles of the posterior distribution is capped to 20K. This means that for very low probability threshold the quantile could become unreliable. We suggest to limit the probability threshold between 0.1 and 0.01")
      output_samples = max_sampling_iterations

Since I have 46 samples, my output_samples = 46000, which is larger than 20k. I think this is why I get this warning.

I assume this means that outlier removal may not working prroperly when sample size is large? Is this a problem for the analysis? If so, is there any fix to this warnings (e.g. increase the 20k cap to 50k? We have dataset with sample size around 100, so I am not sure if 50k is sufficient for even bigger data in future)

stemangiola / sccomp Goto Github PK

sccomp's Introduction

sccomp - Outlier-aware and count-based compositional analysis of single-cell data

Characteristics

Installation

Analysis

Binary factor

From Seurat, SingleCellExperiment, metadata objects

From counts

Summary plots

Contrasts

Categorical factor (e.g. Bayesian ANOVA)

Differential variability, binary factor

Suggested settings

For single-cell RNA sequencing

For CyTOF and microbiome data

Visualisation of the MCMC chains from the posterior distribution

The old framework

sccomp's People

Contributors

Stargazers

Watchers

Forkers

sccomp's Issues

Recommend Projects

Recommend Topics

Recommend Org