const-ae / glmgampoi Goto Github PK

View Code? Open in Web Editor NEW

102.0 6.0 14.0 2.47 MB

Fit Gamma-Poisson Generalized Linear Models Reliably

R 82.55% C++ 17.45%

r-package glm gamma-poisson negative-binomial-regression regression on-disk

glmgampoi's People

Contributors

Stargazers

Watchers

Forkers

jeffreypullin aristoteleo hjames1 ctzouanas osmansharifi meereeum rgiordan junqiangwang noobcoding lzygenomics nlubock jackkamm allenlile sukses24

glmgampoi's Issues

Too many DEGs?

Hi,

I am a fresh user of glmGamPoi and thanks for providing this awesome tool! I tried to identify DEGs in scRNA and successfully run the following code:

load('/Volumes/kfsd/GSE200903_new_combined_cluster_FibroSamples_2022-01-14.RData')
combine_cluster <- as.SingleCellExperiment(combined_cluster)

set.seed(1)
colData(combine_cluster)
# Take highly expressed genes and proper cells:
combine_cluster_subset <- combine_cluster[rowSums(counts(combine_cluster)) > 100, 
                                          which( ! is.na(combine_cluster$CellType) & combine_cluster$CellType 
                                                 %in% c("fibroblasts"))]

# add a stim2 based on stim
colData(combine_cluster_subset)$stim2 <- colsplit(colData(combine_cluster_subset)$stim, 
                                                  split = "_", names = c('cell', 'trt','time'))$trt
# Convert counts to dense matrix
counts(combine_cluster_subset) <- as.matrix(counts(combine_cluster_subset))
# Remove empty levels because glm_gp() will complain otherwise
colData(combine_cluster_subset)$stim2 <- as.factor(colData(combine_cluster_subset)$stim2)
colData(combine_cluster_subset)$stim2 <- relevel(colData(combine_cluster_subset)$stim2, ref = "normal")

fit <- glm_gp(combine_cluster_subset, design = ~ stim2 ,
              reference_level = "normal")
summary(fit)
colnames(fit$Beta)

# The contrast argument specifies what we want to compare
# We test the expression difference of stimulated and control T-cells
#
# There is no sample label in the colData, so we create it on the fly
# from `stim` and `ind` columns in colData(fit$data).
de_res <- test_de(fit, contrast = `stim2PDAC`, 
                  pseudobulk_by = paste0(stim2, "-", rownames(colData(combine_cluster_subset)))) 


# The large `lfc` values come from groups were nearly all counts are 0
# Setting them to Inf makes the plots look nicer
de_res$lfc <- ifelse(abs(de_res$lfc) > 20, sign(de_res$lfc) * Inf, de_res$lfc)

# Most different genes
de_res<-de_res[order(de_res$pval), ]

However, I identified 11752 DEGs out of 14537 genes and I found many genes has df1=1.

head(de_res)
             name pval adj_pval f_statistic df1      df2        lfc
35           Rpl7    0        0    3244.666   1 20005.31 -0.5195890
44           Pi15    0        0    2180.603   1 20005.31 -1.6835958
124         Il1r1    0        0    3883.918   1 20005.31  2.2249821
132          Fhl2    0        0    5068.944   1 20005.31  4.8176423
158           Gls    0        0    1635.089   1 20005.31  0.9511179
164 1700019D03Rik    0        0    3380.809   1 20005.31 -2.7193859

I wondered how could I address this problem? Thanks in advance!

Best,
Kun

Error in handle_design_parameter(design, data, col_data, reference_level) : Too few replicates / too many coefficients to fit model.

Hi Constantin,

Thank you very much for glmGamPoi! I've been trying to get it work on a dataset I have, where I have 9 clusters for 2 genotypes, 3 replicate for each genotype. I have tried to run:

fit <- glm_gp( bm.mat_subset,
               design = ~ genotype + seurat_clusters + genotype:seurat_clusters - 1,
               col_data = metadat,
               subsample = TRUE,
               on_disk = FALSE,
               reference_level = "WT" )

c0 <- test_de(fit,  
                  contrast = genotypeKO - genotypeWT,
                  pseudobulk_by = mouse , 
         sort_by = pval, decreasing = FALSE)

Error in handle_design_parameter(design, data, col_data, reference_level) : 
  The model_matrix has more columns (18) than the there are samples in the data matrix (6 columns).
Too few replicates / too many coefficients to fit model.
The head of the design matrix: 
     genotypeWT genotypeKO seurat_clusters1 seurat_clusters2 seurat_clusters3 seurat_clusters4 seurat_clusters5 seurat_clusters6 seurat_clusters7 seurat_clusters8

I've looked at your example and I get the same error if I don't pre-filter the data to few clusters (NK cells, B cells and T cells), as it's done in the example. The resulting fit has 16 coeficients and the data has 16 samples ( ind + stim) so it produces the same error too few replicates/too many coeficients to fit the model. Is this a bug of do we always have to prefilter the data so the number of coeficients is less than the number of samples you "pseudobulk_by"?

would you say that doing the following would be a good way to overcome the issues above?

de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
                  pseudobulk_by = paste(stim,  ind,  cell, sep="_" ))

Thank you very much for your help

Miriam


sce_subset <- sce[rowSums(counts(sce)) > 100, 
                  sample(which(! is.na(sce$cell)), 1000)]
 counts(sce_subset) <- as.matrix(counts(sce_subset))
 sce_subset$cell <- droplevels(sce_subset$cell)
fit <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
              reference_level = "NK cells")
fit
glmGamPoiFit object:
The data had 9727 rows and 1000 columns.
A model with 16 coefficient was fitted.
> de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
                  pseudobulk_by = paste0(stim, "-", ind)) 

Error in handle_design_parameter(design, data, col_data, reference_level) : 
  The model_matrix has more columns (16) than the there are samples in the data matrix (16 columns).
Too few replicates / too many coefficients to fit model.
The head of the design matrix: 
          cellNK cells cellB cells cellCD14+ Monocytes cellCD4 T cells cellCD8 T cells cellDendritic cells cellFCGR3A+ Monocytes cellMegakaryocytes stimstim cellB cells:stimstim cellCD14+ Monocytes:stimstim cellCD4 T cells:stimstim cellCD8 T cells:stimstim cellDendritic cells:stimstim cellFCGR3A+ Monocytes:stimstim cellMegakaryocytes:stimstim
 ctrl-101   0.05714286  0.11428571           0.4000000       0.2285714      0.14285714          0.00000000            0.02857143         0.02857143        0                    0                            0                        0                        0                            0                              0                           0
ctrl-1015   0.04761905  0.20000000           0.2761905       0.3333333      0.03809524          0.01904762

Research paper

Hi.
Could you help me with some references from books or research papers that glmGamPoi is built upon?
Thanks

Cannot handle data of class dgTMatrix.

I've constructed an SCE object from the components of an AnnData object (imported using the anndata R package)---

library(anndata); library(SingleCellExperiment); library(glmGamPoi);

adata <- anndata::read_h5ad(here::here("raw_data/scRNAseq/merged_mouse_fly.anndata.h5ad"))

sce <- SingleCellExperiment(
    assays      = list(raw = Matrix::t(adata$X)),
    colData     = adata$obs,
    rowData     = adata$var
    # reducedDims = list(umap = adata$obsm["X_umap"])
)
print(sce)

class: SingleCellExperiment 
dim: 16474 510193 
metadata(0):
assays(1): raw
rownames(16474): A1BG A1CF ... ZZEF1 ZZZ3
rowData names(0):
colnames(510193): 10X82_2_TCTCTCACCAGTTA- 10X82_2_TATTATCTACCAGA- ...
  TTTGTCACATGGATGG-w1118_30d_r1 TTTGTCATCCGAAGAG-w1118_30d_r1
colData names(180): Age AnalysisPool ... annotation_supp study
reducedDimNames(0):
altExpNames(0):

--- but when I input the SCE into glmGamPoi, I get the following message:

fit <- glm_gp(sce, on_disk = FALSE)

Error in handle_data_parameter(SummarizedExperiment::assay(data), on_disk) : 
  Cannot handle data of class dgTMatrix.It must be of type matrix, DelayedArray, SummarizedExperiment, or something that can be cast to a SummarizedExperiment

I thought at first this might simply be because SingleCellExperiment and/or SummarizedExperiment don't support sparse matrices, but that doesn't seem to be the case:

I dug around a bit and found out thatglm_gp does recognize "dgCMatrix". So after converting it, I get the a different error message, but one with a proposed solution.

sce@assays@data$raw <- as(sce@assays@data$raw, "dgCMatrix")
fit <- glm_gp(sce, on_disk = FALSE)

Error in handle_data_parameter(SummarizedExperiment::assay(data), on_disk) : 
  glmGamPoi does not yet support sparse input data of type 'dgCMatrix'. If your data fits into memory, please cast it to a dense matrix with 'as.matrix(data)' or if it is too big, convert it to an on-disk dataset with the 'HDF5Array

Anyway, just posting here in case others come across a similar issue.

Best,
Brian
@ Imperial College London

Session info

Click to expand

``` sessionInfo() R version 4.0.3 (2020-10-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS/LAPACK: /home/bschilder/anaconda3/envs/Cell_BLAST/lib/libopenblasp-r0.3.10.so

locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] TENxPBMCData_1.8.0 HDF5Array_1.18.0
[3] rhdf5_2.34.0 SingleCellExperiment_1.12.0
[5] DelayedMatrixStats_1.12.1 DelayedArray_0.16.0
[7] Matrix_1.3-0 SummarizedExperiment_1.20.0
[9] Biobase_2.50.0 GenomicRanges_1.42.0
[11] GenomeInfoDb_1.26.2 IRanges_2.24.1
[13] S4Vectors_0.28.1 BiocGenerics_0.36.0
[15] MatrixGenerics_1.2.0 matrixStats_0.57.0
[17] glmGamPoi_1.2.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 here_1.0.1
[3] lattice_0.20-41 rprojroot_2.0.2
[5] assertthat_0.2.1 digest_0.6.27
[7] mime_0.9 BiocFileCache_1.14.0
[9] R6_2.5.0 RSQLite_2.2.1
[11] httr_1.4.2 pillar_1.4.7
[13] sparseMatrixStats_1.2.0 zlibbioc_1.36.0
[15] rlang_0.4.9 curl_4.3
[17] blob_1.2.1 reticulate_1.18
[19] splines_4.0.3 AnnotationHub_2.22.0
[21] RCurl_1.98-1.2 bit_4.0.4
[23] shiny_1.5.0 compiler_4.0.3
[25] httpuv_1.5.4 pkgconfig_2.0.3
[27] htmltools_0.5.0 tidyselect_1.1.0
[29] tibble_3.0.4 GenomeInfoDbData_1.2.4
[31] interactiveDisplayBase_1.28.0 withr_2.3.0
[33] crayon_1.3.4 dplyr_1.0.2
[35] dbplyr_2.0.0 later_1.1.0.1
[37] rhdf5filters_1.2.0 bitops_1.0-6
[39] rappdirs_0.3.1 grid_4.0.3
[41] jsonlite_1.7.2 xtable_1.8-4
[43] lifecycle_0.2.0 DBI_1.1.0
[45] magrittr_2.0.1 anndata_0.7.5
[47] XVector_0.30.0 promises_1.1.1
[49] ellipsis_0.3.1 generics_0.1.0
[51] vctrs_0.3.6 Rhdf5lib_1.12.0
[53] tools_4.0.3 bit64_4.0.5
[55] glue_1.4.2 purrr_0.3.4
[57] BiocVersion_3.12.0 fastmap_1.0.1
[59] yaml_2.2.1 AnnotationDbi_1.52.0
[61] BiocManager_1.30.10 ExperimentHub_1.16.0
[63] memoise_1.1.0

</details>

compilation error

Hi Thanks for your great work!

I was trying to install LEMUR on R, however I have encountered this error.

ld: warning: directory not found for option '-L/usr/local/gfortran/lib/gcc/x86_64-apple-darwin18/8.2.0'
ld: warning: directory not found for option '-L/usr/local/gfortran/lib'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [glmGamPoi.so] Error 1
ERROR: compilation failed for package ‘glmGamPoi’
* removing ‘/Library/Frameworks/R.framework/Versions/4.1/Resources/library/glmGamPoi’
Warning: installation of package ‘/var/folders/jp/zptlfyzn5z3dz53d2c0hcft40000gn/T//RtmpDqImAo/file5611ee0480b/glmGamPoi_1.11.8.tar.gz’ had non-zero exit status

I have tried to re-install the x-code that you mentioned in another same issue. However, I still give this error. I am using MacBook with M1 chip, will this be a problem?

Thanks in advance!

Kind regards,
Jiayi

Design Matrix

I am just a bit confused about how design should be contracted for gym_pg.
it will be great if the more detail examples of constructing more complex design matrix involving two or more groups

Thanks.

Bridging compatibility with Seurat objects

Hi! Many thanks again for developing such an incredibly useful tool.
As most of my analysis is carried out via Seurat, I was wondering if there may be any efficient way to converting the counts dgCMatrix to something that glmGamPoi could interpet? I've attempted to convert my object to a SingleCellExperiment, then manually convert the counts to an HDF5 array or DelayedArray. Unfortunately both attempts have been unsuccessful.
Might you have any suggestions for a workaround for a large dataset? (Roughly 100k cells).

test_glm <- as.SingleCellExperiment(test_glm )
test_glm_mat<-test_glm @assays@data@listData$counts
test_glm_mat<- DelayedArray(test_glm_mat) #Try delayedarray
test_glm_mat<-as.matrix(test_glm_mat) #File was too large
writeTENxMatrix(test_glm_mat,filepath = "~/hdf5testglm.h5") #Attempted to convert to HDF5 format and use that
test_glm_mat<-h5read('~/hdf5covidonly.h5')

#Replace the counts with new format
test_glm@assays@data@listData$counts<-test_glm_mat

glmGamPoi scales cubically with the number of design matrix columns

Hi Constantin,

when the number of design matrix columns increases (and thus the number of samples for a fixed number of replicates), glmGamPoi scales cubically. Consider the following code where 3 genes are fitted with 3 replicates per condition and a varying number of conditions:

library("glmGamPoi")
library("DESeq2")
library("tidyverse")

# adapted from DESeq2
make_example_dds <- function(n_genes, n_replicates, n_conditions){

  dispMeanRel <-  function(x) 4/x + 0.1
  
  beta <- matrix(rnorm(n_genes*n_conditions, mean = 4, sd = 2), ncol = n_conditions)
  dispersion <- dispMeanRel(2^(beta[, 1]))
  colData <- DataFrame(condition = factor(rep(paste0("cond", 1:n_conditions), n_replicates)))
  x <- model.matrix.default(~colData$condition)
  mu <- t(2^(x %*% t(beta)))
  
  countData <- matrix(rnbinom(mu, mu = mu, size = 1/dispersion), ncol = ncol(mu))
  mode(countData) <- "integer"

  design <- as.formula("~ condition", env = .GlobalEnv)
  object <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = design)
  object
}

calc_runtime <- function(n_conditions){
  set.seed(1)
  n_replicates <- 3
  dds <- make_example_dds(n_genes = 3, n_replicates = n_replicates, n_conditions = n_conditions)
  time <- system.time(
    fit <- glm_gp(assay(dds), design = design(dds), col_data = colData(dds), size_factors = rep(1, n_replicates*n_conditions))
  )
  as.double(time["elapsed"])
}

n_conditions <- c(10,25,seq(50,600,50))
res_glm_gp <- tibble(
  n_conditions = n_conditions,
  runtime = map_dbl(n_conditions, calc_runtime)
)

ggplot(res_glm_gp, aes(n_conditions, runtime^(1/3))) +
    geom_point() +
    geom_smooth(method="lm", se=F)

Could this be related to the QR decomposition of the design matrix and ultimately not be improved?

Can't install the package on MacOS Monterey

Hi,

I noticed in the error message below it mentioned "big sur" which was the previous MacOS version. Is that why my installation kept failing?

Thanks!
JS

BiocManager::install("glmGamPoi")
'getOption("repos")' replaces Bioconductor standard repositories, see '?repositories' for details

replacement repositories:
CRAN: https://cran.rstudio.com/

trying URL 'https://bioconductor.org/packages/3.15/bioc/src/contrib/glmGamPoi_1.8.0.tar.gz'
Content type 'application/x-gzip' length 914728 bytes (893 KB)

downloaded 893 KB

installing source package ‘glmGamPoi’ ...
** using staged installation
** libs
clang++ -arch arm64 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/RcppArmadillo/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/beachmat/include' -I/opt/R/arm64/include -fPIC -falign-functions=64 -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -arch arm64 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/RcppArmadillo/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/beachmat/include' -I/opt/R/arm64/include -fPIC -falign-functions=64 -Wall -g -O2 -c beta_estimation.cpp -o beta_estimation.o
clang++ -arch arm64 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/RcppArmadillo/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/beachmat/include' -I/opt/R/arm64/include -fPIC -falign-functions=64 -Wall -g -O2 -c deviance.cpp -o deviance.o
clang++ -arch arm64 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/RcppArmadillo/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/beachmat/include' -I/opt/R/arm64/include -fPIC -falign-functions=64 -Wall -g -O2 -c overdispersion.cpp -o overdispersion.o
clang++ -arch arm64 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/RcppArmadillo/include' -I'/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/beachmat/include' -I/opt/R/arm64/include -fPIC -falign-functions=64 -Wall -g -O2 -c utils.cpp -o utils.o
clang++ -arch arm64 -std=gnu++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/opt/R/arm64/lib -o glmGamPoi.so RcppExports.o beta_estimation.o deviance.o overdispersion.o utils.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/opt/R/arm64/gfortran/lib/gcc/aarch64-apple-darwin20.6.0/12.0.1 -L/opt/R/arm64/gfortran/lib -lgfortran -lemutls_w -lquadmath -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
ld: warning: directory not found for option '-L/opt/R/arm64/gfortran/lib/gcc/aarch64-apple-darwin20.6.0/12.0.1'
ld: warning: directory not found for option '-L/opt/R/arm64/gfortran/lib'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [glmGamPoi.so] Error 1
ERROR: compilation failed for package ‘glmGamPoi’
removing ‘/Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/glmGamPoi’

pseudobulk_by apparently dropping columns

Hi, I am successfully using your package with scRNA-Seq data, but finding a possible bug when trying out the pseudobulk_by option. Below I show it working fine without pseudobulk_by and then failing when I add it in:

s2<-as.SingleCellExperiment(s)
design<- model.matrix(~ 0 + experimentGroup,s2@colData)
glmGamPoi.fit <- glm_gp(s2
                       ,design = design
                       ,size_factors = "deconvolution"
                       ,on_disk = FALSE
                        )

This worked fine. My grouping factor and design matrix looks like this:

 unique(s2@colData$experimentGroup)
 [1] LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 Levels: LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600

 head(design)
                        LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 homeo_AAACCTGAGAAGCCCA        1        0        0        0        0        0        0
 homeo_AAACCTGAGGAGCGTT        1        0        0        0        0        0        0
 homeo_AAACCTGCAACTGCGC        1        0        0        0        0        0        0
 homeo_AAACCTGCAAGTAGTA        1        0        0        0        0        0        0
 homeo_AAACCTGCATATGAGA        1        0        0        0        0        0        0
 homeo_AAACCTGGTGCAACTT        1        0        0        0        0        0        0

Testing for DGE with the following contrast works fine:

contrast=c('LL_N_000-LL_H_000', 'LL_N_030-LL_H_000', 'LL_N_060-LL_H_000', 'LL_N_180-LL_H_000', 'LL_N_300-LL_H_000','LL_N_600-LL_H_000')
contrast<-limma::makeContrasts(levels=l,contrasts=contrast)

glmGamPoi.test <- test_de(glmGamPoi.fit
                        , contrast = contrast
                         ##,pseudobulk_by='experimentGroup'
                          )

... the problem comes when I try to test using pseudobulk_by, as

glmGamPoi.bulk.test <- test_de(glmGamPoi.fit
                        , contrast = contrast
                         ,pseudobulk_by='experimentGroup'
                          ) %>% setDT

Error in handle_design_parameter(design, data, col_data, reference_level) :
  The model_matrix has more columns (7) than the there are samples in the data matrix (7 columns).
Too few replicates / too many coefficients to fit model.
The head of the design matrix:
         LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
LL_H_000        1        0        0        0        0        0        0
LL_N_000        0        1        0        0        0        0        0
LL_N_030        0        0        1        0        0        0        0
The head of the data:
      LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180
rpl24    21728    14048    13748    19649    14109
cep97       76       18       21        9       55
  eed      270       86       96       88      257

However by my calculations the aggregated data columns should d be there and are somehow erroneously being dropped, , viz::

 library(DelayedArray)
 grs<-colsum(s2@assays@data$counts,s2@colData$experimentGroup)
 head(grs)
         LL_H_000 LL_N_000 LL_N_030 LL_N_060 LL_N_180 LL_N_300 LL_N_600
 rpl24      21728    14048    13748    19649    14109    11293    11562
 cep97         76       18       21        9       55       32       35
 eed          270       86       96       88      257      176      101
 hikeshi      459       98      120      137      458      380      144
 tmem39a      296      134       68       82      132       63       96
 ildr1a       746      471      620      460      435      261      342

I could potentially share the s2 object via ftp if it would help you sleuth this.

Or perhaps you recognize some error on my part?

Core Usage

Hello!

Insanely useful package. Thank you so much for this!

I wanted to ask, when fitting a model with glm_gp on slightly larger datasets, during the final fitting step, all cores on the machine are hogged. Sometimes this can cause problems with other processes. Is there any way to limit core usage by gamPoi? I tried through future, and biocParallel but was unable to limit core usage.

Thanks again,
Nikolas

sorry accidentally made this issue.

Fix documentation of residuals.glmGamPoi

The equation does not render the $\mu$ which makes the documentation hard to read https://bioconductor.org/packages/devel/bioc/manuals/glmGamPoi/man/glmGamPoi.pdf

Return standard error of coefficient estimates

In thelovelab/DESeq2#29 @mschubert raised the point that glmGamPoi, unlike DESeq2 does not return the standard error for the LFC estimates.

His use-case for them is change point detection, where he plugs-in the LFC and the standard error into the ecp package to detect structural variances (e.g. stretches of the genome that are duplicated).

I have not returned the standard error, because they are mainly important for the Wald test, where-as I however, do a likelihood ratio test.

If I want to return the covariance matrix of the coefficients, I would either

need to rewrite the inner fitting routines (fisher_scoring_qr_step()) to make sure they return the covariance matrix or
Calculate the covariance matrix on demand. The formula would probably be something like: t(design_matrix) %*% design_matrix / sqrt(n)

The second option seems the easier way, but I would need to think through the math and properly test the estimator.

Tests fail on i386 (really i686): ailure: qres.gampoi can handle other weird values (@test-residuals.R#112) `a` is not strictly more than `b`. Difference: 0

Hello,

While packaging glmgampoi for Debian we noticed that the tests fail on 32bit x86

BEGIN TEST testthat.R

R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: i686-pc-linux-gnu (32-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(testthat)
> library(glmGamPoi)
> 
> test_check("glmGamPoi")
── 1. Failure: qres.gampoi can handle other weird values (@test-residuals.R#112)
`a` is not strictly more than `b`. Difference: 0

══ testthat results  ═══════════════════════════════════════════════════════════
Error: testthat unit tests failed
[ OK: 262 | SKIPPED: 3 | WARNINGS: 20 | FAILED: 1 ]
1. Failure: qres.gampoi can handle other weird values (@test-residuals.R#112) 

Execution halted

https://ci.debian.net/data/autopkgtest/testing/i386/r/r-bioc-glmgampoi/9733914/log.gz

Use for pseudobulk differential expression - advantages & logFC values

Hi,

Thank you for your very useful package. I have two questions regarding its use for pseudobulk differential expression analysis.

Firstly, could you outline the reasons why you think your model is better for pseudobulk than alternatives like a manual pseudobulk step and edgeR/DEseq, given that glmGamPoi's main use seems to be for non-pseudobulk?

Secondly, I have noted strange logFC values when performing pseudobulk differential expression analysis on a Alzheimer's Disease split by 6 cell types. The dataset has approx 50 samples, resulting in >50k cells after quality control. The logFC values can be seen in this volcano plot:

The logFC values for all cell types appears to be split into three groups and does not appear as I would expect in a volcano plot. Have you noted logFC values like this before? I have attached the table of this data for just cell type "A" (to keep the size down).
DE_analysis_odd_logFC_values.txt

Return pseudobulk samples values from test_de

Hi,

I was hoping to get the derived pseudobulk samples returned from a call to test_de(), is this available? Specifically, I use test_de to get an estimate of the uncertainty, ensuring cells from the same donor are not independent replicates. The (possible) issue is I am observing very large LFC values (in the thousands) and I want to examine the pseudobulk data generated to see if it appears reasonable.

Thank,
Alan.

Questions about pseudobulk_sce

Thanks for developing the package!

I've got couple questions regarding pseudobulk_sce function. My dataset has 14904 genes and 453154 cells. I want to fit a model with the following design:

design = ~ cell_type_L1 + dataset + cell_type_L1:age_group - 1

So, I create pseudobulk profiles like so:

aggr_sce <- pseudobulk_sce(sce, group_by = vars(sample_name, cell_type_L1, age_group, dataset))
aggr_sce

To clarify, counts for each sample_name and cell_type_L1 combination should be aggregated, but I also want to keep age_group and dataset as a covariates.

My questions are:

Does the variable order inside vars make any difference? Do I specify the group_by param correctly?
Is pseudobulk_sce supposed to take a long time? It takes > 2 hrs to create the pseudobulk profiles.

Thanks!

Upgrading from beachmat to tatami

I think you're still using the old-school beachmat interface, which is fine and will still be around.

But I wonder if you'll consider experimenting with the new tatami interface, which will hopefully struggle its way onto BioC-devel via beachmat (v2.17.3) in the next few days.

This does away with the need for integer/numeric templates, it makes it easier to parallelize at the C++ level, and some delayed operations are captured natively (i.e., some DelayedArrays are translated to their C++ equivalents). Other R matrices are still supported, of course, but are handled more slowly by calling back into DelayedArray::extract_array().

It also greatly simplifies compilation. In the coolest and most experimental part of this update, beachmat will generate external pointers to matrix instances that can be passed to glmGamPoi's shared libraries. This means that glmGamPoi doesn't need to actually need to compile support for the various matrix implementations in its own *.so file, which makes it easier to extend tatami's C++ support to, e.g., HDF5, TileDB, Zarr-backed matrices.

I've already switched SingleR to this latest paradigm, hopefully it'll pass BioC-devel in the next few days 🤞 If it works, I'd like a non-Aaron package developer to test out this new approach, and you're the lucky one.

Score and Hessian (of the likelihood function)

Hi,

Let X (nxp matrix) be the design matrix
I'm wondering if the sum of scores (sum over n, pxp matrix) and sum of hessians (sum over n, pxp matrix) are directly available.
These values seem to be computed internally while fitting the model but does not seem to be present in the glm_gp object.
I know it can be computed using the predict and additional matrix operations but thought the process was redundant since it's already computed while fitting.

Thanks.

proper input for DE

Hello,

I have a single-cell dataset with the structure below. If I would like to see the marker genes of the cluster2, how shall I get reasonable inputs into the following steps?

treatment | cell_type

sample 1 | cluster 2
sample 1 | cluster 1
sample 1 | cluster 1
sample 1 | cluster 3
sample 1 | cluster 2
sample 1 | cluster 3
sample 2 | cluster 3
sample 2 | cluster 3
sample 2 | cluster 2
sample 2 | cluster 1
sample 2 | cluster 2
sample 2 | cluster 1

fit <- glm_gp(pbmcs, design = XXX) #shall I include the treatment, cell_type, or treatment+cell_type into the 'design' paramenter?
res <- test_de(fit, contrast = cluster2- !cluster2, pseudobulk_by = treatment ) # If I have incorporated the 'treatment' in the glm_gp() above, do I still have to add the treatment into the 'pseudobulk_by' parameter?

Thank you.

Error in pseudobulk step

Hi,

I was having issues with my own dataset, so tried running the one given in the vignette and ran into the same problem.

In the vignette, starting in section 3.2 I ran the code from sce <- muscData::Kang18_8vs8() to de_res <- test_de(fit, contrast = stimstim+cellCD4 T cells:stimstim, pseudobulk_by = paste0(stim, "-", ind))

This last command resulted in the following error: Error in value[3L] :
Object 'stim' not found. Allowed variables are:

There are no allowed variables listed after the colon. Is there a way to get it to recognize the colData columns? I am running glmGamPoi version 1.8.0.

Thanks for your help!

overdispersion_shrinkage is not optional

Hi team,

I was running some tests using different configurations of glm_nb and I wanted to quantify what the effect of shrinkage is on my data. It appears that test_de will error out in the lines below when turning off overdispersion shrinkage in glm_nb (overdispersion_shrinkage = F). Is this on propose? I understand its usage is strongly recommended, but this looks like a small bug.

glmGamPoi/R/test_de.R

Lines 212 to 214 in 15bdd09

 if(is.null(fit$overdispersion_shrinkage_list)){ 

 stop("fit$overdispersion_shrinkage_list is NULL. Run 'glm_gp' with ", 

 "'overdispersion_shrinkage = TRUE'.")

Any thoughts or workarounds?
Thanks,
Alex

Regression analysis to produce DEGs

Hi,

I am hoping to use glmGamPoi to identify differentially expressed genes (DEGs) across differing cell types from a linear regression analysis (where Y is continuous rather than discrete). I have come up with the following approach and just want to validate it with you as I couldn't find a linear regression analysis example in the documentation. Note I have been using an approach for this question where Y is discrete (disease vs control cases) which has been working great and just want to verify my approach for linear rather than logistic regression.

First, imagine a dataset with the following columns:

Y - the continuous variable I want to perform the regression analysis on
celltype - A field to identify cell types
patient_ID - Patient identifier, to be used for pseudobulk analysis
sex - Patient sex, we want to account for sex in our design matrix

So we would build the model (fit) as follows, note I leave reference_level as NULL as I would usually set this to Control in a disease control comparison so since this is regression, I assumed this was the correct approach. Can you confirm?:

fit <- glmGamPoi::glm_gp(datset, 
                           design = ~ celltype:sex + celltype:Y -1,
                           reference_level = NULL)

Then I would identify DEGs per cell type as follows (running for each cell type):

de_res <- glmGamPoi::test_de(fit, contrast = celltype_i, 
                                 pseudobulk_by = paste0(patient_ID,"-",celltype),
                                 pval_adjust_method = "BH")

Error in fitBeta

Hello,

I've been trying to use glmGamPoi with SCTransform, but I've gotten an error

warning: solve(): system seems singular; attempting approx solution
Error in fitBeta_fisher_scoring(Y, model_matrix, exp(offset_matrix), dispersions,  :
  BLAS/LAPACK routine 'DLASCL' gave error code -4

I should note the matrix I'm using is pretty big, 142k cells by 30k genes.
Any ideas why this might be happening?

Problem in estimate_betas_fisher_scoring leads to "Error in estimate_overdispersions_fast"

Full error:

Error in estimate_overdispersions_fast(y, mean, model_matrix, do_cox_reid_adjustment,  : 
  Evaluation error: all(!is.na(mean)) is not TRUE.

The issue seems to root in estimate_beta_fisher_scoring, which leads to NA estimates in some cases like this one:
gist with data (2 genes)

do.call(estimate_betas_fisher_scoring, pars)
$Beta
         [,1]       [,2]    [,3]       [,4]      [,5]     [,6]       [,7]       [,8]      [,9]     [,10]       [,11]      [,12]      [,13]      [,14]       [,15]       [,16]      [,17]      [,18]       [,19]      [,20]     [,21]     [,22]      [,23]      [,24]     [,25]
[1,] 16.63781 -0.1256251 0.15645 -0.0754991 0.1646437 0.191263 0.03485183 0.03229181 0.2402343 0.3362655 -0.05429001 -0.1214602 0.04572174 0.08695161 -0.04867007 -0.08189775 -0.3380709 -0.2842984 -0.09034367 -0.3311708 0.1646174 0.1410605 0.07767562 -0.1840425 -0.175129
[2,]       NA         NA      NA         NA        NA       NA         NA         NA        NA        NA          NA         NA         NA         NA          NA          NA         NA         NA          NA         NA        NA        NA         NA         NA        NA
          [,26]      [,27]    [,28]     [,29]        [,30]      [,31]      [,32]      [,33]      [,34]      [,35]       [,36]      [,37]       [,38]       [,39]      [,40]      [,41]       [,42]       [,43]    [,44]      [,45]    [,46]       [,47]     [,48]     [,49]     [,50]
[1,] -0.3301341 -0.2993631 0.136859 0.1833964 -0.001078211 -0.1206512 -0.1775895 -0.2914664 -0.2173742 -0.1357657 -0.07481921 0.03626795 -0.02230796 -0.05025337 0.04157687 -0.2338234 -0.02899101 -0.04472247 0.182262 -0.3527863 0.129024 -0.01209915 0.2357902 0.1669566 0.2286588
[2,]         NA         NA       NA        NA           NA         NA         NA         NA         NA         NA          NA         NA          NA          NA         NA         NA          NA          NA       NA         NA       NA          NA        NA        NA        NA
           [,51]       [,52]      [,53]      [,54]     [,55]     [,56]       [,57]      [,58]        [,59]      [,60]      [,61]     [,62]     [,63]      [,64]       [,65]     [,66]      [,67]      [,68]       [,69]     [,70]     [,71]     [,72]     [,73]     [,74]     [,75]
[1,] -0.07738464 -0.01242978 0.09689969 -0.5510674 0.3017041 0.3198646 -0.04609768 -0.0157081 -0.003161268 -0.3200002 -0.1707829 0.3582172 0.5344889 -0.1207368 -0.05683072 0.1049338 0.04020218 -0.2073573 0.001379412 0.1023875 0.1685776 0.1113139 0.1662056 0.2080259 0.2338429
[2,]          NA          NA         NA         NA        NA        NA          NA         NA           NA         NA         NA        NA        NA         NA          NA        NA         NA         NA          NA        NA        NA        NA        NA        NA        NA
          [,76]      [,77]         [,78]
[1,] -0.1123109 0.02781702 -0.0009693966
[2,]         NA         NA            NA

$iterations
[1]    3 1000

$deviances
[1] 1.209236      NaN

Perhaps this hope is vain?

glmGamPoi/src/beta_estimation.cpp

Line 269 in 6c5c931

// This should not happen

Some error messages are unhelpful when truncated

glmGamPoi gives helpful error messages, for example, when the design matrix is degenerate:

> glmGamPoi::glm_gp(data = matrix(0, 20, 100), design = matrix(0, 100, 10))
Error in handle_design_parameter(design, data, col_data, reference_level) : 
  The model matrix:
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    0    0    0    0    0    0    0    0    0     0
[2,]    0    0    0    0    0    0    0    0    0     0
[3,]    0    0    0    0    0    0    0    0    0     0
[4,]    0    0    0    0    0    0    0    0    0     0
[5,]    0    0    0    0    0    0    0    0    0     0
[6,]    0    0    0    0    0    0    0    0    0     0
seems degenerate ('matrix_rank(model_matrix) < ncol(model_matrix)'). Column 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 contains only zeros.

However, when the design matrix is bigger/ has long column names the last part of the error message is truncated (without indication), hiding the relevant part:

> glmGamPoi::glm_gp(data = matrix(0, 20, 100), design = matrix(0, 100, 10, dimnames= list(rows = NULL, cols = paste0("my_long_sample_name_", seq_len(10)))))
Error in handle_design_parameter(design, data, col_data, reference_level) : 
  The model matrix:
     my_long_sample_name_1 my_long_sample_name_2 my_long_sample_name_3 my_long_sample_name_4 my_long_sample_name_5 my_long_sample_name_6 my_long_sample_name_7 my_long_sample_name_8 my_long_sample_name_9 my_long_sample_name_10
[1,]                     0                     0                     0                     0                     0                     0                     0                     0                     0                      0
[2,]                     0                     0                     0                     0                     0                     0                     0                     0                     0                      0
[3,]                     0                     0                     0                     0                     0                     0                     0                     0                     0                      0
[4,]                     0                     0                    
> #above error message gets truncated by default (see ?stop)

Please, consider printing the head of the model matrix only after helpful part.

Design with `NA` values causes nondescriptive error message

Hey Constantin,

I think I found a small bug in your package.

Using column data where some values are NA in a design throws an error with the message: Number of rows in col_data does not match number of columns of data

As an example:

library("tidyverse")
library("glmGamPoi")
d <- matrix(rpois(5*6, 10), ncol=5)
d.col <- data.frame(var=c(1,2,3,4,NA))
glm_gp(design=~var, col_data=d.col, data=d)

Throws

Error in handle_design_parameter(design, data, col_data, reference_level, : Number of rows in col_data does not match number of columns of data
Traceback:

1. glm_gp(design = ~var, col_data = d.col, data = d)
2. handle_design_parameter(design, data, col_data, reference_level, 
 .     offset)
3. stop("Number of rows in col_data does not match number of columns of data")

Obviously this should not happen as we have 5 rows in the data.frame and 5 columns in the matrix.
This can be fixed by removing the NA entries:

glm_gp(design=~var, col_data=d.col[!is.na(d.col$var),,drop=F], data=d[,!is.na(d.col$var)])

Returns

glmGamPoiFit object:
The data had 6 rows and 4 columns.
A model with 2 coefficient was fitted.

Inconsistent dispersion estimate between MLE and quasi-MLE

Hi,
I am trying to understand the difference between MLE and quasi-MLE results. So I did a simple simulation:

1000 genes x 500 cells, every gene has the same mean expression = 3 and dispersion = 5

# Simulate a matrix with 1000 genes x 500 cells per gene on the same gene with NB distribution
m = 1000
n = 500
mu = 3 # every gene has the same mean expression
size = 0.2 # inverse of the dispersion = 5
data = matrix(NA, nrow = m, ncol = n)

for(i in 1:m){
  data[i,] = rnbinom(n = n, size = size, mu = mu)
}

fit = glmGamPoi::glm_gp(as.matrix(data), size_factors = FALSE, verbose = T)
theta_mle = fit$overdispersions

m_est = rowMeans(data)

v_est = apply(data, 1, function(x)var(x))

theta_moe = (v_est-m_est)/(m_est^2)

res = data.frame(method=rep(c("moe","mle","ql_mle","ql_mle_shrunken"),each=1000),
               m_est=rep(m_est,each=4), theta_est=c(theta_moe,theta_mle,fit$overdispersion_shrinkage_list$ql_disp_estimate,fit$overdispersion_shrinkage_list$ql_disp_shrunken))

boxplot(theta_est~method, res)
abline(h = 1/size, lty = 3)
abline(h = 1, lty = 3)

However, when I checked the estimate from four methods: moment, MLE, ql_mle, ql_mle_shruken, I found that MOE and MLE provide relatively accurate estimate but the ql_mle (quasi-likelihood dispersion estimates based on the dispersion trend) and ql_mle_shruken (shrunken quasi-likelihood dispersion estimates) are always close to 1 (see plot below).

Then I further check the fit results

> summary(fit)
glmGamPoiFit object:
The data had 1000 rows and 500 columns.
A model with 1 coefficient was fitted.
The design formula is: Y~1

Beta:
            Min 1st Qu. Median 3rd Qu. Max
Intercept 0.673    1.02   1.09    1.16 1.4

deviance:
 Min 1st Qu. Median 3rd Qu. Max
 361     398    408     417 456

overdispersion:
  Min 1st Qu. Median 3rd Qu.  Max
 3.59    4.71   4.98    5.29 6.82

Shrunken quasi-likelihood overdispersion:
   Min 1st Qu. Median 3rd Qu.  Max
 0.738    0.95  0.997    1.06 1.33

size_factors: FALSE

Mu:
  Min 1st Qu. Median 3rd Qu.  Max
 1.96    2.78   2.99     3.2 4.04

I found the difference came from the

dispersion_trend

and

ql_disp_trend

> summary(fit$overdispersion_shrinkage_list$dispersion_trend)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.885   4.944   4.979   4.978   4.999   5.152 
> summary(fit$overdispersion_shrinkage_list$ql_disp_trend)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.005655 0.787480 0.926023 0.934233 1.104975 1.186770

I wonder if you have any ideas why the ql_disp_trend isn't consistent with the simulated parameter (theta = 5)? Thanks very much for your help!

regarding the "note of caution"

Regarding your remark:

A note of caution: applying test_de() to single cell data without the pseudobulk gives overly optimistic p-values. This is due to the fact that cells from the same sample are not independent replicates! It can still be fine to use the method for identifying marker genes, as long as one is aware of the difficulties interpreting the results.

... do you have any guidance as to approaches toward accounting for the overly optimistic p-values?

Naively, I suppose that without knowing how the cell-to-cell variability of expression depends upon the gene, the experimental condition, and possibly even their interaction, there is really no hope for such an accounting. Does this agree with your understanding, or, is there more that can be accomplished with the data at hand.

Support for existing size factors

Hi Constantin,

first of all congrats and thanks for this nice and very useful package.
I would like to ask two questions:

I was wondering whether there is (or you plan to add) an option to use existing size factors rather than estimating them when running glm_gp(). In my case I already have the size factors from scran::calculateSumFactors() which the logcounts are based on that I use throughput the analysis. It would probably not make too much of a difference but in order to be consistent it'd be desirable to use exactly these factors rather than re-estimating them. Depending on the size of the dataset it could also save a bit of time to use existing ones. Would that be possible?

This one is rather a general one rather than a technical request relating to the interpretation of the QL ratio test results. Is there a general consensus in your group how you deal with differential expression results that are driven by a few number of cells? For illustration, imagine you had two clusters of cells to compare, say 500 cells each but in each cluster only say 5-10% of cells actually have counts > 0 for the a given gene (see plot with dummy data below). This would tyically be a highly-significant result using either Wilcox-like tests of something like the QL ratio test due to the large number of cells per cluster. Still it is questionable whether like 5% of cells per cluster with counts > 0 are representative. Probably a bit too "philosophical" for a GitHub issue, so please don't feel obligate to reply to this one if you feel like it is inappropriate here. Wanted to catch the chance though to ask this in this context as your package would allow for routine QL-like tests even on large datasets.

Link to Violin dummy plots

Thank you and best wishes!
-Alex

size_factors = FALSE in README

In the README there is the comment + code:

# size_factors = FALSE, because only a single GLM is fitted
fit <- glm_gp(counts, design = ~ 1)
fit

I interpreted the comment as an explanation of why the following code sets size_factors = FALSE but then the code doesn't do this.

Is this an oversight or have I misinterpreted the comment?

Best!

proper input for marker selection

Hi Constantin,

When I would like to get the markers for each cluster in a scRNA-seq dataset, I tried 2 ways:

treating non-8 cluster as a group in glm_gp()

fit2 <- glm_gp(t(adata$X), design = ~ adata$obs$batch2 + (adata$obs$leiden == '8'))
de_res2 = test_de(fit2, contrast = `adata$obs$leiden == "8"TRUE`)

and the output was:

name pval adj_pval f_statistic df1 df2 lfc
1 A1BG 9.083840e-01 9.443240e-01 1.324322e-02 1 14007.96 -0.19294410
2 A1CF 4.922579e-01 6.659670e-01 4.716120e-01 1 14007.96 0.61182880
3 A2M 6.559798e-26 2.372595e-24 1.112389e+02 1 14007.96 3.38479526
4 A2ML1 9.528825e-01 9.735806e-01 3.491446e-03 1 14007.96 0.04592482
5 A3GALT2 6.607379e-01 7.841244e-01 1.926367e-01 1 14007.96 -32.55924521
6 A4GALT 2.402577e-04 1.314017e-03 1.349370e+01 1 14007.96 2.01775275
7 A4GNT 7.185802e-01 8.225581e-01 1.298624e-01 1 14007.96 0.75398461
8 AAAS 8.593856e-01 9.144625e-01 3.138554e-02 1 14007.96 -0.04804371
9 AACS 1.722583e-01 3.358798e-01 1.863370e+00 1 14007.96 -0.25751422
10 AADAC 5.411227e-01 7.032795e-01 3.734812e-01 1 14007.96 -37.17579965
11 AADACL2 4.919869e-01 6.658566e-01 4.722028e-01 1 14007.96 1.30339179
12 AADACL3 2.822849e-01 4.712254e-01 1.156146e+00 1 14007.96 -1.40612943
13 AADACL4 2.640774e-01 4.506751e-01 1.247355e+00 1 14007.96 1.58819251
14 AADAT 4.133496e-01 6.013788e-01 6.691857e-01 1 14007.96 -0.38910189
15 AAGAB 1.161384e-01 2.520346e-01 2.468946e+00 1 14007.96 0.22126741
16 AAK1 3.768931e-15 7.216806e-14 6.195588e+01 1 14007.96 0.84736921
17 AAMDC 3.412921e-01 5.328369e-01 9.056332e-01 1 14007.96 -0.36246608
18 AAMP 4.683749e-02 1.248763e-01 3.951915e+00 1 14007.96 -0.52769448
19 AANAT 5.366256e-01 6.999187e-01 3.818486e-01 1 14007.96 -0.52204354
20 AAR2 5.863448e-01 7.359875e-01 2.961025e-01 1 14007.96 0.14654954

the other was treating non-target(8) clusters individually in the glm_gp() step:

fit <- glm_gp(t(adata$X), design = ~ adata$obs$batch2 + adata$obs$leiden)
de_res = test_de(fit, contrast = `adata$obs$leiden8`)

the output was:

name pval adj_pval f_statistic df1 df2 lfc
1 A1BG 8.427675e-01 9.401642e-01 3.934607e-02 1 14007.25 -0.34293252
2 A1CF 2.504479e-02 8.752234e-02 5.021866e+00 1 14007.25 2.13935983
3 A2M 2.110166e-21 9.158640e-20 9.053395e+01 1 14007.25 4.27889600
4 A2ML1 5.859041e-01 7.751609e-01 2.968002e-01 1 14007.25 0.48178161
5 A3GALT2 9.997667e-01 1.000000e+00 8.552766e-08 1 14007.25 -0.17163815
6 A4GALT 1.263855e-06 1.337511e-05 2.349799e+01 1 14007.25 3.62761987
7 A4GNT 4.908065e-01 7.069898e-01 4.747828e-01 1 14007.25 1.84253091
8 AAAS 6.669497e-01 8.321722e-01 1.851987e-01 1 14007.25 0.13071683
9 AACS 5.923612e-01 7.796723e-01 2.866851e-01 1 14007.25 0.11325749
10 AADAC 9.999056e-01 1.000000e+00 1.400734e-08 1 14007.25 -1.49534952
11 AADACL2 6.956023e-01 8.524604e-01 1.530937e-01 1 14007.25 0.85935584
12 AADACL3 2.501211e-01 4.736049e-01 1.322737e+00 1 14007.25 -1.61088927
13 AADACL4 8.342623e-03 3.564575e-02 6.960385e+00 1 14007.25 Inf
14 AADAT 2.168300e-01 4.317719e-01 1.525366e+00 1 14007.25 -0.61582857
15 AAGAB 2.974100e-01 5.263028e-01 1.085852e+00 1 14007.25 0.15954452
16 AAK1 2.910548e-10 4.877179e-09 3.979156e+01 1 14007.25 0.74130901
17 AAMDC 3.376498e-01 5.692153e-01 9.193974e-01 1 14007.25 -0.39961874
18 AAMP 3.935166e-03 1.886862e-02 8.316256e+00 1 14007.25 -0.80653633
19 AANAT 1.930534e-01 3.978769e-01 1.694324e+00 1 14007.25 -1.14436345
20 AAR2 9.684653e-01 1.000000e+00 1.562931e-03 1 14007.25 -0.01149323

I found the output was not identical. Which input would be a more reasonable/proper way for the purpose?

Thank you.

pseudobulk does not include `NA` levels

See:

data <- data.frame(fav_food = sample(c("apple", "banana", "cherry", NA), size = 50, replace = TRUE),
                   indicator = sample(1:3, size = 50, replace = TRUE))
Y <- matrix(rnbinom(n = 100 * 50, mu = 3, size = 1/3.1), nrow = 100, ncol = 50)
rownames(Y) <- paste0("gene_", seq_len(100))
colnames(Y) <- paste0("cell_", seq_len(50))
row_dat <- data.frame(id = rownames(Y), chr = sample(1:22, nrow(Y), replace = TRUE))
sce <- SingleCellExperiment::SingleCellExperiment(list(counts = Y, logcounts = log(Y + 1)),
                                                  colData  = data, rowData = row_dat)
psce <- pseudobulk(sce, group_by = vars(fav_food))
unique(colData(sce)$fav_food)
unique(colData(psce)$fav_food)
``

Compatibility with R 3

Just wanted to check if any version of this amazing package was ever compatible/installable with R 3.6. Any help would be much appreciated.

EDITED: in case the above is true (it seems to be according to original pub [R 3.6.2, glmGamPoi 1.1.5]) and the package is usable/installable in R 3.6, which release/commitSHA should be used with devtools::install_github, remotes::install_github or some other installation tool?

Comparison of glmGamPoi models via ANOVA

Hi,

Thanks for developing this powerful tool!
I have one question: how can I convert a glmGamPoi object to a glm object so that two glmGamPoi models can be compared via ANOVA? Or is there a glmGamPoi-orientated way to apply ANOVA between two models?

Thanks a lot!

Best,
Dadi

determining the less trustworthy log2fc values

Hello, thanks very much for your package. I just want to follow up on this point from the vignette:

The large lfc values come from groups were nearly all counts are 0

It seems that depending on what my design is, the threshold to separate the "three groups" of log2fc values can be as small as 5. I also got the warnings "“encountered non-positive size factor estimates” and “singular gradient” when I was running glm_gp for the fit, I don't know if it's related. I'm assuming these are still "large lfc values" though they are < 20. Is there a better way you could recommend to separate out the genes with less trustworthy log2fc values than by looking visually?

How to do within cluster differential analysis

Hi, I have a single cell project. basically, we have two treatment conditions and the two seurat objects are merged. I would like to identify differentially expressed genes within (as opposed to between) a cluster between treatment and control samples. Since there is no replicates per treatment, there is no way to specify the pseudobulk_by parameter. I can not find an example code in your documentation. Can you tell me how to do this type of analysis? Thanks!

P-values returned for low read counts sometimes seem too significant

Hi again,

I'm using glmGamPoi_1.3.3 via DESeq2_1.31.3 for 10x sc-RNA-seq data. For one comparison, I have 4096 control cells and 6 cells in my condition (the imbalance is, unfortunately, unavoidable). For a couple of genes where my "condition" cells have no reads, I find the p-values returned are very low. For instance, my gene1 below:

# DESeq2+glmGamPoi p=1e-137
table(counts(eset2)["gene1",-which(cur$is_smp)])
#    0    1    2    3    4    5    6    7
# 1901 1346  553  202   64   20    7    3
table(counts(eset2)["gene1",which(cur$is_smp)])
# 0
# 6

Naively, I would expect that the probability of picking 6 cells with 0 reads out of the distribution of my control cells would be the fraction of zero-read cells to the power to cells, i.e.:

1901/(1901+1346+553+202+64+20+7+3)
# 0.4641113
0.4641113^6
# 0.009993854 # not significant after multiple testing correction

Now, this is of course a probability and not a p-value (although they coincide in this case), and a binomial distribution instead of a gamma mixture of poissons.

But should the significance measure be that different?

compare multiple pairs concurrently

Hi,

My dataset contains single-cell data with the following 6 treatments:

A
A+B
A+B+C
A+C
A+C+D
A+B+C+D

I would like to know the genes specifically induced by B. I wonder if it is okay and more robust to perform pairwise comparisons as below and get the aggregated output of significant genes induced by B in all conditions?
A vs A+B
A+C vs A+C+B
A+C+D vs A+C+D+B

Thanks!

Installation of regularization branch

Hi Constantin,

I've just updated my R version to 4.0.5, which is giving me trouble in re-installing the regularization branch. Do you happen to have an idea what could be causing this?

Thanks.
Best,
Koen

> library(devtools)
Loading required package: usethis
> install_github("const-ae/glmGamPoi", ref="regularization")
Downloading GitHub repo const-ae/glmGamPoi@regularization
   checking for file ‘/private/var/folders/15/gs2v829x4gxcdv0tlgk_5l4m0000gn/T/RtmpQRxk23/remotes36a36628391a/const-ae-glm✔  checking for file ‘/private/var/folders/15/gs2v829x4gxcdv0tlgk_5l4m0000gn/T/RtmpQRxk23/remotes36a36628391a/const-ae-glmGamPoi-c3a356e/DESCRIPTION’
─  preparing ‘glmGamPoi’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘glmGamPoi_1.3.6.tar.gz’
   
* installing *source* package ‘glmGamPoi’ ...
** using staged installation
** libs
/usr/local/clang4/bin/clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include/ -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include' -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppArmadillo/include' -I'/Library/Frameworks/R.framework/Versions/4.0/Resources/library/beachmat/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c RcppExports.cpp -o RcppExports.o
In file included from RcppExports.cpp:4:
In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppArmadillo/include/RcppArmadillo.h:31:
In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/RcppArmadillo/include/RcppArmadilloForward.h:26:
In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/RcppCommon.h:29:
In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/Rcpp/r/headers.h:67:
In file included from /Library/Frameworks/R.framework/Versions/4.0/Resources/library/Rcpp/include/Rcpp/platform/compiler.h:100:
In file included from /usr/local/clang4/bin/../include/c++/v1/cmath:305:
/usr/local/clang4/bin/../include/c++/v1/math.h:301:15: fatal error: 'math.h' file not found
#include_next <math.h>
              ^~~~~~~~
1 error generated.
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘glmGamPoi’
* removing ‘/Library/Frameworks/R.framework/Versions/4.0/Resources/library/glmGamPoi’
Warning message:
In i.p(...) :
  installation of package ‘/var/folders/15/gs2v829x4gxcdv0tlgk_5l4m0000gn/T//RtmpQRxk23/file36a35abd0e5b/glmGamPoi_1.3.6.tar.gz’ had non-zero exit status
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.5

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] devtools_2.3.2 usethis_2.0.1 

loaded via a namespace (and not attached):
 [1] rstudioapi_0.13   magrittr_2.0.1    pkgload_1.2.0     R6_2.5.0         
 [5] rlang_0.4.10      fastmap_1.1.0     tools_4.0.5       pkgbuild_1.2.0   
 [9] sessioninfo_1.1.1 cli_2.3.1         withr_2.4.1       ellipsis_0.3.1   
[13] remotes_2.3.0     assertthat_0.2.1  rprojroot_2.0.2   lifecycle_1.0.0  
[17] crayon_1.4.1      processx_3.5.0    purrr_0.3.4       callr_3.6.0      
[21] fs_1.5.0          ps_1.6.0          curl_4.3          testthat_3.0.2   
[25] memoise_2.0.0     glue_1.4.2        cachem_1.0.4      compiler_4.0.5   
[29] desc_1.3.0        prettyunits_1.1.1

update tag on blog

On (https://const-ae.name/post/need_for_speed-deseq2/) you recommend installing
devtools::install_github("mikelove/DESeq2@ae7c6bd") to try the speedups, since your changes have been merged & there have been some bugs in that version you might want to change that to remotes::install_github("mikelove/DESeq2").

Standard error of the overdispersion estimates

Dear Constantin,

I was wondering when estimating the overdispersion by glmGamPoi::glm_gp() function, can we extract the standard error of the estimates?

Also, when the algorithm shrinks the dispersion, would that impact the standard error?

Thank you!

Cheers,
Angli

Questions about the tutorial

Hi Constantin,

Thank you for the great tool.

Could you kindly explain the language you used in the tutorial?

fit <- glm_gp(sce_subset, design = ~ cell + stim +  stim:cell - 1,
              reference_level = "NK cells")

Why there is a "-1" in the "design"? Did it mean the "NK cells' in the "reference_level", or the default 1-coefficient fitting?
Is there any advantage of including the "reference_level" here(instead of reference_level=NULL and excluding NK cells in the dataset)?

de_res <- test_de(fit, contrast = `stimstim` + `cellCD4 T cells:stimstim`, 
                  pseudobulk_by = paste0(stim, "-", ind))

Did "contrast = stimstim + cellCD4 T cells:stimstim" indicate the contrast between stimstim of all stimulated cell types (actually normalized B and T by the reference NK cells) and cellCD4 T cells:stimstim of stimulated T cells alone?

marker_genes2 <- test_de(fit, (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - 
                               (`cellB cells` + `cellB cells:stimstim`), 
                        sort_by = pval)

Can I replace the (`cellCD4 T cells` + `cellCD4 T cells:stimstim`) - (`cellB cells` + `cellB cells:stimstim`) with (`stimstim` + `cellCD4 T cells:stimstim`) - (`stimstim` + `cellB cells:stimstim`) and reach the identical output?

Thank you

test_de failing with on_disk models

Hello Constantin,

Thank you for this code, I've been using and learning more about it for several months. I'm working with some large data sets and using glmGamPoi to perform modeling and hypothesis testing.

I've been able to create full models from HDF5Array input data sets and on_disk = TRUE. I did need to adjust the chunk size params to get this to occur in finite time. However, the on-disk version of the model output cannot be queried with test_de. Performing such test gives the error:

Error in combine_size_factors_and_offset(offset, size_factors, Y, verbose = verbose) :
length(offset) == 1 || length(offset) == n_samples is not TRUE

(seems to be coming from estimate_size_factors.R)

If I run the same model with a realized input matrix and on_disk = FALSE, everything works as expected. I wonder if test_de can't determine the right dimensions or related parameters for on_disk models and is instead throwing this error?

It could be a general bug, it could be a problem that only shows up when one alters default H5 temp file paths, etc? But I've looked, and I think the on-disk model objects have hard-coded locations for data contained within them.... Hmm.

Attached is the log file (docx). You can get a full tgz of the code and input data at the link below. I've tried to make this an MRE, but even very truncated the singlecellexperiment file is a little big.

Thanks, Nathan Siemers
mre.glmgampoi.bug.docx

tgz file: https://drive.google.com/file/d/1fYN4F7TtFpwgC6yqYZDP_rDHBiV4XU5H/view?usp=sharing

Error parsing "complex" contrast names

Hi Constantin,

I just came across an error in the handling of contrast names in test_de(). The error appears when using a model.matrix with "complex" names (see below)

Code to reproduce:

library(glmGamPoi)
Y <- matrix(rnbinom(n = 30 * 100, mu = 4, size = 0.3), nrow = 30, ncol = 100)
annot <- data.frame(sample = sample(LETTERS[1:6], size = 100, replace = TRUE),
                    grp = sample(LETTERS[1:2], size = 100, replace = TRUE))
se <- SummarizedExperiment::SummarizedExperiment(Y, colData = annot)
mm <- model.matrix(~annot$grp)
fit <- glm_gp(se, design = mm)
test_de(fit, contrast = "annot$grpB")

Error:

 Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x),  : 
  'data' must be of a vector type, was 'NULL'

I traced the Error to

glmGamPoi/R/parse_contrast.R

Line 38 in b595fd7

res <- eval(parse(text = res), envir= level_environment)

which returns NULL in this context (and not the correct column name as supposed).

The error can be circumvented by defining the model.matrix more elegantly (model.matrix(~grp, annot)), but as the matrices themselves are identical (except for the column names), I still think this might be an error you want to catch.

Best,
Tobi

Differential Expression case vs control across all cell types

Hi,

We want to test for differential expression in all cell types in a single-cell dataset for a case vs control experiment and were hoping to use glmGamPoi to do so. What would you say best practise is for setting the value of reference_level?

Thanks!

Small typo "Preprare results"

glmGamPoi/R/test_de.R

Line 255 in 89aac9d

if(verbose){message("Preprare results")}

Missing test_de Function

Hi Constantin,

Apologies for having to open another issue so soon! I've tried running test_de and it doesn't appear to be part of glmGamPoi from Bioconductor, I've tried separately adding the function via the source code (including eval_with_q and handle_design_parameter). It also appears that colData is called on a $data column, but I encounter an error that it doesn't exist when trying to run it.

Error in value[[3L]](cond) : unable to find an inherited method for function ‘colData’ for signature ‘"NULL"’
8.
stop(e$message)
7.
value[[3L]](cond)
6.
tryCatchOne(expr, names, parentenv, handlers[[1L]])
5.
tryCatchList(expr, classes, parentenv, handlers)
4.
tryCatch({ res <- eval(statement, envir = as.list(data), enclos = statement_envir) if (!is.numeric(res)) { if (is.character(res) && length(res) == 1) { ...
3.
eval_with_q(pseudobulk_by, SummarizedExperiment::colData(fit$data), env = env)
2.
test_de_q(fit, contrast = contrast_capture, reduced_design = reduced_design, full_design = full_design, subset_to = subset_to_capture, pseudobulk_by = pseudobulk_by_capture, pval_adjust_method = pval_adjust_method, sort_by = sort_by_capture, decreasing = decreasing, n_max = n_max, ...
1.
test_de(fit, reduced_design = ~1)

Thanks!
Nick

Typo in help text for overdispersion_shrinkage

In the Value section of overdispersion_shrinkage, there appears to be a mistake:

ql_disp_trend the ql_disp_estimate still might show a trend with respect to gene_means. If
ql_disp_trend = TRUE a spline is used to remove this secondary trend. If ql_disp_trend = TRUE it corresponds directly to the dispersion prior

It says ql_disp_trend = TRUE twice. I think the 2nd TRUE should be FALSE instead.

	if(is.null(fit$overdispersion_shrinkage_list)){
	stop("fit$overdispersion_shrinkage_list is NULL. Run 'glm_gp' with ",
	"'overdispersion_shrinkage = TRUE'.")

const-ae / glmgampoi Goto Github PK

glmgampoi's People

Contributors

Stargazers

Watchers

Forkers

glmgampoi's Issues

Session info

trying URL 'https://bioconductor.org/packages/3.15/bioc/src/contrib/glmGamPoi_1.8.0.tar.gz' Content type 'application/x-gzip' length 914728 bytes (893 KB)

Recommend Projects

Recommend Topics

Recommend Org

trying URL 'https://bioconductor.org/packages/3.15/bioc/src/contrib/glmGamPoi_1.8.0.tar.gz'
Content type 'application/x-gzip' length 914728 bytes (893 KB)