Comments (55)
Yes, I noticed that on my site too. I am trying to make sp_mat
object from RcppArmadillo
to handle large sparse matrix.
from baynorm.
Thanks for that. Please have a try the new version 1.5.5
. It seems that now only the memory size of the machine matters.
My laptop has memory 32G. Currently it failed on matrix of dimension 32738 * 737280 (gene * cell), but seems to be ok with 32738 * 10000.
If as.matrix
only correlates with the memory size of machine, then I probably change back to use matrix
instead of dgCMatrix
object for processing the data in the future.
from baynorm.
Thank you for trying bayNorm
!
The following code may help you to insert any normalized data, which can be obtained outside Seurat
framework, into a Seurat
object. Then you can find clusters based on that normalized data:
library(bayNorm)
library(Seurat)
data('EXAMPLE_DATA_list')
library(Seurat)
bay_out<-bayNorm(EXAMPLE_DATA_list$inputdata,mean_version = TRUE)
x.seurat <- CreateSeuratObject(counts =bay_out$Bay_out,assay = 'bayNorm')
# x.seurat <- NormalizeData(x.seurat)
x.seurat <- ScaleData(x.seurat)
x.seurat <- FindVariableFeatures(x.seurat, do.plot = FALSE)
#Specifying: assay='bayNorm'
x.seurat <- RunPCA(x.seurat, pc.genes = [email protected],
pcs.compute = 50, do.print = FALSE,assay='bayNorm')
x.seurat <- RunUMAP(x.seurat, dims = 1:10,assay='bayNorm')
#x.seurat <- JackStraw(x.seurat, prop.freq = 0.06)
x.seurat <- FindNeighbors(x.seurat, dims = 1:10)
x.seurat <- FindClusters(x.seurat, resolution = 0.5)
head(Idents(x.seurat), 5)
#It is a toy example. Here only one cluster was found
plot(x.seurat@[email protected],pch=16,col=as.factor(Idents(x.seurat)))
#Double check that the assay we used comes from bayNorm
x.seurat@[email protected]
Hope this would be helpful.
from baynorm.
Hi @kvshams ,
Maybe have a trybayNorm(Data=cbind(case, control))
(pseudocode), and then look at the clustering of this normalized data? The prior parameters would be estimated across all the cells.
Is the left panel based onHarmoney
?@WT215 Thanks for the suggestion. Just tried different ways. Why not the
GG
. Got a good alignment . Do I need to keep anything in mind while using this method?
@kvshams Thanks for the feedback! Basically, if you want to minimize difference between groups of cells, you can try GG
procedure. LL
procedure could amplify the inter-groups noisy while minimizing the intrinsic noisy within each group.
However, I don't think there is a "once for all" procedure for dealing with the scRNA-seq data. It depends on what answers you would like to find from the data, like DE detection, noisy genes detection or other goals.
In addition to the different procedures, if you have an idea about the mean capture efficiency of your data (\bar{\beta}
, a scalar ranges between 0 and 1), or even have a relevant smFISH data, then you can estimate \beta
accurately. This may lead to a better performance of bayNorm.
from baynorm.
@kvshams I think you can directly try GG at first (for any purposes), then do cluster detection to find interesting clusters which may due to KO/treatment. Then LL for any subgroups of cells if you want to look at noisy genes in that subgroup.
from baynorm.
I did not test these two factors rigorously. My suggestion is to filter out most mitochondria related genes at the very beginning (if they are not of interest in your study).
For cell cycles, the GG
procedure may mitigate the difference between cells which belong to different cell cycles. Whether this works maybe checked by visualizing the 2D umap, see if cells were clustered w.r.t to the cell cycles.
from baynorm.
I am looking at this discussion: https://stackoverflow.com/questions/40592054/large-matrices-in-rcpparmadillo-via-the-arma-64bit-word-define, but I got the issue posted here when I tried to build the package: https://stackoverflow.com/questions/55248883/why-does-rcpp-function-segfault-only-when-called-from-r-package-but-not-when-sou.
from baynorm.
@kvshams By learning from the R package liger
, I am using dgCMatrix
throughout the process. Please have a try the 1.5.9
. I can run bayNorm
on a sparse matrix of dimension 32738 genes * 737280 cells
on my laptop (32G memory, 5 cores specified in bayNorm
. Though that test was not completed, it arrived at 1% at time-consuming
part. Hence it seems to work).
The output is still of type matrix
. If it works, I may allow it to be of type dgCMatrix
too. In case that the output is too large to be stored.
The error %Error in serialize(data, node$con) : ignoring SIGPIPE signal
is probably due to that we used matrix
. So the data which was distributed to each cores is still too large, hence the core is highly likely to be terminated. Now with dgCMatrix
, it may works.
from baynorm.
Glad to hear that! I will also allow mean version to return sparse matrix later.
from baynorm.
I just updated the 1.5.13
. Now it allows dgCMatrix
throughout the whole process (Input which is not of dgCMatrix
will be converted to dgCMatrix
).
Have a try at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Prior_type="GG",out.sparse = TRUE)
. Then you will get
> class(at_GG$Bay_out)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
from baynorm.
The rerun with new version getting killed
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, parallel = TRUE,NCores = 50, Prior_type="GG", out.sparse = TRUE)
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
|======================================================================| 100%
Prior estimation has completed!
**********************************************************
***********
Killed
from baynorm.
Yes, the output is no more sparse. I don't think replacing <1 with 0 can help. It seems that there is no solution, I need to think about it.
How about filtering out some cells according to the total counts at the very beginning?
from baynorm.
I don't think replacing <1 with 0 is correct. Does '1.5.14' work without replacing <1?
from baynorm.
Yes, it works without replacing it. I am saving that object and now will see how does it sync with Seurat
from baynorm.
Basically, as far as I think, there are following ways:
- Filtering out more cells/genes.
- Saver the normalized data. Then restart R and load that data into
Seurat
. By doing so, we make sure that there is only one big data loaded in the environment (exclude the raw data). - Using
scanpy
inpython
. spam
package inR
... still look into it https://stackoverflow.com/questions/24236426/how-to-get-a-big-sparse-matrix-in-r-231-1 .
from baynorm.
I agree that it would work in other packages such as scnapy
. Even it works very well with bigmemory
(verified) in R but unfortunately the Seurat
is still at dgCMatrix
based data structure. Thanks a lot for working on it. As we have reached now on a dead end, I am closing this issue
from baynorm.
@WT215 sorry to bug you again. How easy to return
Arcsine
normalizeddgcMatrix
as an output. That would solve the memory issue as it keep the zero value as zero.
Thanks for the suggestion! This idea sounds interesting. However the normalized data is no longer sparse, and the R itself cannot handle big matrix. So I am not very sure whether that idea works, but I will have a look.
from baynorm.
Thanks @WT215 for the reply. That worked!.
I have a doubt regarding the case control data. If use the baynorm assay
for the clustering, there is huge batch effect (baynorm is done separately for case and control to preserve the structure, is there any other way ?). Please see the Inf
data from seurat-data . Harmoney
is used for integration.
Any suggestion?
from baynorm.
Hi @kvshams ,
Maybe have a try bayNorm(Data=cbind(case, control))
(pseudocode), and then look at the clustering of this normalized data? The prior parameters would be estimated across all the cells.
Is the left panel based on Harmoney
?
from baynorm.
@WT215 Was thinking to do so, but did not do by anticipating that would create a true bias in the data structure itself (or am I missing some thing?). For eg, if the case is a KO, then we would see an imputed data for the KO gene also, right?. Thats the reason I was doing the case control baynorm
separately (in this case STIM
and CTRL
). Did not test the KO scenario, but will test and get back to you with an appropriate test data.
Both the clustering above is done by Harmoney
in same way. The only difference is that the assay fed in for the process, ie RNA
and Baynorm
imputed assays respectively.
from baynorm.
@WT215, hopefully I can use the baynorm output ( ie the bay_out$Bay_out
) directly to the edgeR differential expression without any additional steps right?.
from baynorm.
@WT215, hopefully I can use the baynorm output ( ie the
bay_out$Bay_out
) directly to the edgeR differential expression without any additional steps right?.
I have not tried to applied edgeR
to the bayNorm normalized data. MAST
could be a good choice, see example here: Section: Downstream analysis: DE genes detection.
from baynorm.
Hi @kvshams ,
Maybe have a try
bayNorm(Data=cbind(case, control))
(pseudocode), and then look at the clustering of this normalized data? The prior parameters would be estimated across all the cells.Is the left panel based on
Harmoney
?
@WT215 Thanks for the suggestion. Just tried different ways. Why not the GG
. Got a good alignment . Do I need to keep anything in mind while using this method?
from baynorm.
Hi @kvshams ,
Maybe have a trybayNorm(Data=cbind(case, control))
(pseudocode), and then look at the clustering of this normalized data? The prior parameters would be estimated across all the cells.
Is the left panel based onHarmoney
?@WT215 Thanks for the suggestion. Just tried different ways. Why not the
GG
. Got a good alignment . Do I need to keep anything in mind while using this method?
@kvshams Thanks for the feedback! Basically, if you want to minimize difference between groups of cells, you can try
GG
procedure.LL
procedure could amplify the inter-groups noisy while minimizing the intrinsic noisy within each group.However, I don't think there is a "once for all" procedure for dealing with the scRNA-seq data. It depends on what answers you would like to find from the data, like DE detection, noisy genes detection or other goals.
In addition to the different procedures, if you have an idea about the mean capture efficiency of your data (
\bar{\beta}
, a scalar ranges between 0 and 1), or even have a relevant smFISH data, then you can estimate\beta
accurately. This may lead to a better performance of bayNorm.
Thanks @WT215 for the detailed and quick reply.
My main goal for the analysis is to find trajectory, DE and find any excited cells due to the KO/treatment. According to my understanding,
GG
would be best for DE & TrajectoryLL
would best for finding excited cells/signatures
Is that correct?
from baynorm.
Thank you for trying
bayNorm
!The following code may help you to insert any normalized data, which can be obtained outside
Seurat
framework, into aSeurat
object. Then you can find clusters based on that normalized data:library(bayNorm) library(Seurat) data('EXAMPLE_DATA_list') library(Seurat) bay_out<-bayNorm(EXAMPLE_DATA_list$inputdata,mean_version = TRUE) x.seurat <- CreateSeuratObject(counts =bay_out$Bay_out,assay = 'bayNorm') # x.seurat <- NormalizeData(x.seurat) x.seurat <- ScaleData(x.seurat) x.seurat <- FindVariableFeatures(x.seurat, do.plot = FALSE) #Specifying: assay='bayNorm' x.seurat <- RunPCA(x.seurat, pc.genes = [email protected], pcs.compute = 50, do.print = FALSE,assay='bayNorm') x.seurat <- RunUMAP(x.seurat, dims = 1:10,assay='bayNorm') #x.seurat <- JackStraw(x.seurat, prop.freq = 0.06) x.seurat <- FindNeighbors(x.seurat, dims = 1:10) x.seurat <- FindClusters(x.seurat, resolution = 0.5) head(Idents(x.seurat), 5) #It is a toy example. Here only one cluster was found plot(x.seurat@[email protected],pch=16,col=as.factor(Idents(x.seurat))) #Double check that the assay we used comes from bayNorm x.seurat@[email protected]
Hope this would be helpful.
Do we need to regress out % mitochondria and cell cycle or is it already account the same in any of your prior estimate?
from baynorm.
All works good with relatively small data. When I try the actual data it exit with some error
BNorm_LL_Sample <- bayNorm(raw_data@assays$RNA@counts,mean_version = TRUE,
+ [email protected]$orig.ident, Prior_type="LL")
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
BNorm_LL_Batch <- bayNorm(raw_data@assays$RNA@counts,mean_version = TRUE, [email protected]$Batch, Prior_type="
LL")
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
This data has 480K cells across 4 groups and 20 samples. As this error may be related to the memory, I tried with 3.8T RAM system also (sadly, we cant get more memory), still get the error. If I split the groups, that works fine but that would be more like to be LL
. Is there any work around for this issue?
from baynorm.
I updated the package, try install it via:
#version 1.5.4
library(devtools)
devtools::install_github("WT215/bayNorm")
Now bayNorm
support sparse matrix. Let me know if it still does not work.
BTW, how many genes do you keep?
from baynorm.
@WT215, thanks for the fix and update. It does seems works and will update how is the performance.
I use all the remnant genes based on the Seurat
default filtering (thats ~15k). Is there any other criterion for the genes to keep?
from baynorm.
@WT215, thanks for the fix and update. It does seems works and will update how is the performance.
I use all the remnant genes based on the
Seurat
default filtering (thats ~15k). Is there any other criterion for the genes to keep?
15k is a reasonable number. Seurat
is already good to be used for this purpose. Another way is, for each gene, look at the proportion of cells which have non-zero counts. Then you can set up a threshold.
I am still updating the package. Currently accessing the row of dgCMatrix
is slower than accessing the row of matrix in R https://stackoverflow.com/questions/47997184/extraction-speed-in-matrix-package-is-very-slow-compared-to-regular-matrix-class.
from baynorm.
LL
done without any error. But GG is having an error
Error in t_sp(Data): SpMat::init(): requested size is too large
Traceback:
1. bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Conditions = [email protected]$orig.ident,
. Prior_type = "GG")
2. Prior_fun(Data = do.call(cbind, DataList_sr), BETA_vec = do.call(c,
. BETAList), parallel = parallel, NCores = NCores, FIX_MU = FIX_MU,
. GR = GR, BB_SIZE = BB_SIZE, verbose = verbose)
3. t_sp(t_sp(Data)/BETA_vec)
4. t_sp(Data)
from baynorm.
how to fix this issue?
from baynorm.
I think its worth looking this thread
from baynorm.
@WT215 it still persist!. I tried the highest possible memory in GCP (n1-ultramem-160 3.8TB) :(
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Calls: bayNorm ... EstPrior -> <Anonymous> -> - -> - -> as -> asMethod
Execution halted
from baynorm.
What is the error message?
from baynorm.
What is the error message?
updated the previous comment with the error
from baynorm.
Maybe we need to debug in the following steps:
- Can
rout=Matrix::rowMeans(raw_data@assays$RNA@counts)
be run? - Can
qq=as.matrix(raw_data@assays$RNA@counts)
be run successfully on your site? - If the step 2 is ok, then how about
rout=Matrix::rowMeans(qq)
?
from baynorm.
Step1 is working fine.
step to have the error
qq=as.matrix(raw_data@assays$RNA@counts)
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
as step is not working, the step 3 cant be tested
from baynorm.
If
as.matrix
only correlates with the memory size of machine, then I probably change back to usematrix
instead ofdgCMatrix
object for processing the data in the future.
Falling to matrix
is the option here I guess?
from baynorm.
@kvshams I just updated the 1.5.7
, hopefully it works.
from baynorm.
Error still persists with rccp (even with the highest RAM system).
error: Mat::init(): requested size is too large
Error in EstPrior_rcpp(Data = Data) :
Mat::init(): requested size is too large
Calls: bayNorm -> Prior_fun -> EstPrior -> EstPrior_rcpp
Execution halted
from baynorm.
I replaced the Rcpp function with original one in 1.5.8
. You can directly try:
MME_est<-EstPrior(Data=raw_data@assays$RNA@counts,verbose=TRUE)
Can that be run?
from baynorm.
Thank you for the update. Now I get this.
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, [email protected]$orig.ident, Prior_type="GG")
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
| | 0%Error in serialize(data, node$con) : ignoring SIGPIPE signal
MME_est<-EstPrior(Data=raw_data@assays$RNA@counts,verbose=TRUE)
Priors estimation based on MME method has completed.
from baynorm.
UPDATE: now it seems working... will update the performance onece it will get complete (as it may take few hours to get the result, I keeping the log here). Thanks a lot for the quick fix!
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, [email protected]$orig.ident, Prior_type="GG")
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
|============= | 18%
from baynorm.
Yes it works now!. Thanks a lot.
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, [email protected]$orig.ident, Prior_type="GG")
Priors estimation based on MME method has completed.
Start optimization using spg from BB package.
This part may be time-consuming.
|======================================================================| 100%
Prior estimation has completed!
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
from baynorm.
I am having very hard time in making the seurat object from the baynorm
output (450k Cells across 20 samples). I think the issue is to convert all the samples to one dgCMatrix
. If I am trying to have a work around like
Mtrix_from_Baynorm <- cbind(baynorm$Bay_out_lis)
my.counts <- as(Mtrix_from_Baynorm, "dgCMatrix")
Segmentation fault
Is there any way I can put all the sample into a single dgCMatrix
in baynorm
output?
If I am making one object per sample from the Bay_out_lis
and use the Seurat merge
option, then also it breaks due to memory error.
from baynorm.
Hi @kvshams ,
The correct way for cbind
matricies in a list
is Mtrix_from_Baynorm <- do.call(cbind,baynorm$Bay_out_list)
, maybe try this at first, since you have already had that result.
If that does not solve the problem, then another way could be:
at_GG <- bayNorm(raw_data@assays$RNA@counts, mean_version = TRUE, Prior_type="GG")
By setting Conditions=NULL
, the output will be a full matrix instead stored in a list. Hence there is no need to cbind
. Since Prior_type="GG"
, specifying Conditions
is not necessary. You can call the subgroup of cells according to the cell names later. This way may save the memory.
Then, x.seurat <- CreateSeuratObject(counts =at_GG$Bay_out,assay = 'bayNorm')
. No need to convert it to dgCMatrix
, that code (CreateSeuratObject
) will convert it inside that function.
from baynorm.
Thanks @WT215 for reopening the issue and suggestions. I have already tried both cbind and cbind2 ie as in
do.call(cbind,baynorm$Bay_out_list)
In both the case seurat
fails (I presume it is in dgcmatrix
converting step as the input from baynorm
is a matrix
and the error is exactly similar that I get to convert baynorm
output matrix to the dgcmatrix
).
Error in .cbind2Csp(x, y) :
Cholmod error 'problem too large' at file ../Core/cholmod_sparse.c, line 92
Let me try the GG as you suggested and see how it goes.
from baynorm.
Just updated 1.5.14
, please have a try.
from baynorm.
@WT215 I think the issue is with the matrix. The baynorm
out is no more a true sparse one and almost all the col*rows are filled with values imputed by baynorm
(many of them are near to zero). Is'nt it worth putting all values <0 to 0 to make it memory efficient?. Is that statistically correct according to your model?
from baynorm.
In the output replacing <1 with 0 helps with the cost of time! (conversion time). And the v 1.5.14
also works now (I just re-installed the Rccp
and RcppArmadillo
). 1.5.14
is much faster and memory efficient than any previous version!.
from baynorm.
Now it gives a new error
at_GG$Bay_out
22865 x 442010 sparse Matrix of class "dgCMatrix"
Error in validityMethod(as(object, superClass)) :
long vectors not supported yet: ../../src/include/Rinlinedfuns.h:522
from baynorm.
Try not to access the whole data directly: at_GG$Bay_out
.
Can x.seurat <- CreateSeuratObject(counts =at_GG$Bay_out,assay = 'bayNorm')
be run?
from baynorm.
from baynorm.
This has been mentioned here: satijalab/seurat#1029 (comment), will look into it.
How about
qq<-unname(at_GG$Bay_out)
rm(at_GG)
x.seurat <- CreateSeuratObject(counts =qq,assay =
'bayNorm')
Or save at_GG$Bay_out
as HDF5Matrix
object. Then Read10X_h5
(still, not sure about this).
Another relevant discussion:
cole-trapnell-lab/monocle-release#138 (comment).
A final resort could be using scanpy
in python
, which is similar to Seurat
in R
.
from baynorm.
@WT215 sorry to bug you again. How easy to return Arcsine
normalized dgcMatrix
as an output. That would solve the memory issue as it keep the zero value as zero.
from baynorm.
Related Issues (2)
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from baynorm.