theislab / sccoda Goto Github PK
View Code? Open in Web Editor NEWA Bayesian model for compositional single-cell data analysis
License: BSD 3-Clause "New" or "Revised" License
A Bayesian model for compositional single-cell data analysis
License: BSD 3-Clause "New" or "Revised" License
Hi there,
I'm greatly appreciative of the way scCODA models changes in whole-sample composition versus univariate cell counts.
This in mind, if I am working with a dataset with a prevalence of stress (say, 5-10% of the cells in every sample fail QC), should I include these cells in final compositional abundance testing? Seeing as they are part of overall sample composition. I do know that stress was unfortunately more prevalent in one condition. And yes, I know avoiding this initially would have been much more preferable but here we are.
Thanks!
Kane
thanks for the development of the great scCODA.
It may be my own error. I found that the result did not change when I changed the est_fdr. Could anyone help me?
>>> sim_results.set_fdr(est_fdr=0.05)
>>> sim_results.summary()
Compositional Analysis summary:
Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group
Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532
Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663
>>> sim_results.set_fdr(est_fdr=0.00001)
>>> sim_results.summary()
Compositional Analysis summary:
Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group
Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532
Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663
>>> sim_results.set_fdr(est_fdr=0.000000000000000000000000000000000000000000000000001)
>>> sim_results.summary()
Compositional Analysis summary:
Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group
Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532
Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663
>>> sim_results.set_fdr(est_fdr=0)
>>> sim_results.summary()
Compositional Analysis summary:
Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group
Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532
Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663
>>> sim_results.set_fdr(est_fdr=0.5)
>>> sim_results.summary()
Compositional Analysis summary:
Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group
Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532
Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663
I have used scCODA in the past without any problems. Now that I am using it with new samples, I noticed that the sample labels on the bar graph don't line up with the data. In the attached graph, I know from umaps that sample 10 contains cell type 5, but that specific bar is always located on the far right and given whichever sample label is at the end of 'B'. It doesn't matter if I change the covariate_df or if I change the order of samples within my anndata, my graph always comes out with mismatched labels. Code below. Any help with this glitch would be greatly appreciated.
B=['DG_4365_WS2','DG_4363_WS3','DG_4367_WS5','DG_4369_WS6','DG_4377_WS10','DG_4365_WR2','DG_4362_WR2','DG_4371_WR6','DG_4376_WR9','DG_4382_WR12','DG_4404_AS7','DG_4408_AS9','DG_4395_AS3','DG_4401_AS5','WR_4396_AS4','DG_4390_AR2','DG_4400_AR4','DG_4405_AR6','DG_4409_AR8','DG_4402_AR5']
cov_df = pd.DataFrame({"Cond": ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20']}, index=B)
data_scanpy_1 = dat.from_scanpy(
adata,
cell_type_identifier="leiden",
sample_identifier="batches",
covariate_df=cov_df
)
viz.stacked_barplot(data_scanpy_1, feature_name="Cond",cmap=umap_cmap)
plt.xticks(rotation='vertical')
plt.title('Sample Composition')
thanks for the development of the great scCODA.
here, i wonder how to get the P value (or FDR) of the analysis.
And could you please teach me how to interpret the "Final Parameter". For example, if a cell type, "Endothelial cell", got 2.007198 as "Final Parameter".
It would be nice to have a verbose option for sample_hmc
to turn off the printing of the progress, which would be useful for using scCODA
in reports.
Dear scCODA team,
Thanks a lot for this package, its turning out very useful for my snRNAseq analysis!
I am finding the same error mentioned in this other issue #40 however I have the versions required by the scCODA package installed for tensorflow and tensorflow-probability:
In [3]: import tensorflow
In [4]: tensorflow.__version__
Out[4]: '2.8.0'
In [6]: import tensorflow_probability
In [7]: tensorflow_probability.__version__
Out[7]: '0.16.0'
Any ideas whats going on or if it has any negative effect on the results?
Thanks!
From the tutorial and the example of from_pandas
, it appears that scCODA actually uses input data in the form of cell type sizes rather than single cell UMIs data. What the function from_scanpy
does is simply calculate the size of the cell type in each sample, right?
Mouse Endocrine Enterocyte Enterocyte.Progenitor Goblet Stem TA TA.Early Tuft
0 Control_1 36 59 136 36 239 125 191 18
1 Control_2 5 46 23 20 50 11 40 5
2 Control_3 45 98 188 124 250 155 365 33
3 Control_4 26 221 198 36 131 130 196 4
4 H.poly.Day10_1 42 71 203 147 271 109 180 146
5 H.poly.Day10_2 40 57 383 170 321 244 256 71
6 H.poly.Day3_1 52 75 347 66 323 263 313 51
7 H.poly.Day3_2 65 126 115 33 65 39 129 59
8 Salm_1 37 332 113 59 90 47 132 10
9 Salm_2 32 373 116 67 117 65 168 12
The values in the table above represent the count of cells contained in a particular cell type in a given sample, correct?
This is a bit of an edge case, but I discovered that if any of the cell counts in the df
dataframe of sccoda.util.cell_composition_data.from_pandas
is NA, sample_hmc()
will return a result with a 0% acceptance rate without an error or warning. This is not likely a common issue, but can occur depending how the cell-type counts dataframe is generated. It may be helpful to add an error/warning or simply recode NAs as 0 in data.X
under the hood.
I got this Warning when running log chains. Any idea?
WARNING:tensorflow:5 out of the last 5 calls to <function CompositionalModel.sampling..sample_mcmc at 0x7f937941c7a0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
Hi,
many thanks for the package! After running ".sample_hmc()" I receive a warning:
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
After investigating the results with arviz, I observed that sample-stats -> chain only contains level 0 (I have multiple conditions, but I tried with 2 and 3, one being the reference), I furthermore could not find a 'mu' and 'tau' in the results with az.summary(all_results, var_names=['mu', 'tau'], filter_vars="regex"). Previously I ran:
mod.CompositionalAnalysis(CCM_GMEAN_SUB, formula="C(TREATMENT_COLNAME, Treatment('ELEMENT1_IN_TREATMENT_COLNAME'))",
reference_cell_type="CELL_TYPE_REFERENCE").
Thanks for your thoughts.
Hi there,
My dataset contains three conditions (A, B, C), but none of them are a control. That is, I would be interested in compositional changes between A & B, A & C, and B & C. Is there a way to configure scCODA or the formula
parameter that doesn't assign a single value to the control? (removing patsy's "treatment coding"?)
If not, an alternative I may employ (and would be grateful for feedback regarding!) is based on this response: running scCODA multiple times with one of my three conditions as the Control a third of time (correcting for the +/- results in the final summation of credible non-zero results). Would this be functional?
Thanks a lot!
Hi, I'm able to get an appropriate/accurate boxplot with viz.boxplots(). However, when I try viz.stacked.barplot() with the same input, I get the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-13-7b7aa02bf827> in <module>
----> 1 viz.stacked.barplot(data_scanpy, feature_name = "Condition")
2 plt.show()
AttributeError: module 'sccoda.util.data_visualization' has no attribute 'stacked'
Any help would be appreciated. Thanks.
Hi All,
I am using scCODA
for differential abundance analysis of a scRNA-seq dataset.
If I perform a manual reference cell type selection using e.g. B-cells as reference, and another time using dendritic cells as reference, I find that in both cases, for example, T-cells are differentially abundant between conditions, at a permissive FDR of 0.4
.
However, when looping over the different cell types to use each as a reference sequentially to check the stability of the results, I notice that the, e.g., T-cells are only significant once. How can this discrepancy be explained?
Here is the code for automatic/manual selection of these cell types (automatic selection selects another cell type than B-cells).
## automatic
cellCountData = dat.from_pandas(cellCounts, covariate_columns=["treatment"])
model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="automatic")
res = model.sample_hmc()
# no credible effect has been found for any cell population
res.summary()
# Some effect at higher FDR
res.set_fdr(est_fdr=0.4)
res.summary()
## manual
model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="B-cell")
resBCell = model.sample_hmc()
# no credible effect has been found for any cell population
resBCell.summary()
# Some effect at higher FDR
resBCell.set_fdr(est_fdr=0.4)
resBCell.summary()
To loop over the cell types, I am using the code from the vignette (slightly adapted to allow for permissive FDR).
cell_types = cellCountData.var.index
results_cycle = pd.DataFrame(index=cell_types, columns=["times_credible"]).fillna(0)
for ct in cell_types:
print(f"Reference: {ct}")
# Run inference
model_temp = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type=ct)
temp_results = model_temp.sample_hmc()
# Select credible effects
temp_results.set_fdr(est_fdr=0.4)
cred_eff = temp_results.credible_effects()
cred_eff.index = cred_eff.index.droplevel(level=0)
# add up credible effects
results_cycle["times_credible"] += cred_eff.astype("int")
Hello,
With an other member of my lab, we are starting to use scCODA for some analysis. We have encountered some puzzling results.
Here is the extended summary for a run in python 3.8.11, using "Undef" as the reference cell type:
As you can see, according to the summary it looks like that only B-cells present a significant difference in proportion. However, the credible_effects report states that there is a difference for Granulocytes too.
Now, if we repeat the analysis in a python 3.9.6 environment:
This time, what is reported in the summary and the credible_effects reports seems consistent. You can also note the difference of running time.
So, I was wondering which of these results should I trust ?
We also noticed that in Python3.9.6, the plot to look for a reference cell type is broken ( AttributeError: module 'sccoda.util.data_visualization' has no attribute 'rel_abundance_dispersion_plot' ) and it is not possible to run mod.CompositionalAnalysis() in automatic mode for the selection of the reference cell type ( TypeError: '<' not supported between instances of 'str' and 'int' )
In case this problem is data-dependent, here is the input table:
Input_scCODA.csv
If that's linked to an installation problem, I usually use conda environments:
Thank you for the help and feedback !
Hi all - I was getting the following error:
ImportError: cannot import name 'CloudPickler' from 'cloudpickle.cloudpickle'
If anyone else faces similar issue on a fresh install of scCODA
or tensorflow-probability 0.11.0
, you need to roll-back your cloudpickle
install to 1.3.0.
pip install cloudpickle==1.3.0
This resolves the issue.
Hi,
Sccoda is suggesting some cell types to be affected by the inhibitor, what further analysis to perform on sccoda results?
HI,
small comment about the API. It's really cool that it makes uses of AnnData! However, I feel that this would not make it very convenient for users with an AnnData just out of scRNA-seq analysis pipeline. For instance, the slot adata.X
would be already in use, and all the covariates (cell types, conditions) would be already stored in adata.obs
What about a simple function that takes:
and simply wrangle the adata.obs
to suit mod.CompositionalAnalysis
.
Something simple like
def wrangle_input(data: ad.AnnData, covariates: list):
comp_table = anndata.obs[covariates].groupby(covariates).size().unstack(fill_value=0).reset_index()
return comp_table
Anyway, great tool again!
Hi,
This is more of a question than a bug. If I should post it somewhere else, please let me know.
I am doing some tests with scCODA and was wondering about the relevance of the number of "populations" included in the dataset. Say I originally have 20 groups (=clusters) and 2 conditions and find no compositional difference between conditions. What influence would it make to subset my dataset to a lower number of groups (e.g. removing 2 groups of less related cells)?
Thanks,
A
Can we change the package name to scdcpy
to comply with the lowercase python convention?
I.e. import scdcpy
instead of import SCDCpy
- until we have the final name...
Hi,
I ran scCODA the following way:
counts_per_sample = pd.read_csv("counts2.csv")
counts_adata = dat.from_pandas(counts_per_sample, covariate_columns=['Condition'])
model = mod.CompositionalAnalysis(counts_adata, formula=f"C('Condition', Treatment('ctl'))", reference_cell_type='automatic')
sim_results = model.sample_hmc()
sim_results.summary()
The execution was halted due to Exit code 134. Full output and input files are attached.
Thanks,
Rachelly.
Hi here, thank you for giving us a new method for evaluating the celltype abundance. I tried this tool following tutorials but still could not get the full results for my own data.
When I tried to print, the data is too long to display completely. Can we write it into csv? so we can use it for other analysis. Thanks a lot!
Hi.
I am trying to run scCODA with the last update (0.1.3). However, when I run the function ".sample" I get the next error:
KeyError: 'Inclusion probability'
The code I used:
new_model = mod.CompositionalAnalysis(scDC_data, formula="Condition", reference_cell_type=i)
new_result = new_model.sample_hmc(num_results= 200000,num_burnin= 90000)
new_result.effect_df.to_csv("/Users/juan.henao/Documents/"+i+".csv")
No_reference=new_result.effect_df.drop(index=i,level=1)
No_reference.sort_values("Final Parameter",ascending=False).index.get_level_values("Cell Type").
The entire error I get:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2645 try:
-> 2646 return self._engine.get_loc(key)
2647 except KeyError:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'Inclusion probability'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-11-c6ce1911fcf1> in <module>
2 print(i)
3 new_model = mod.CompositionalAnalysis(scDC_data, formula="Condition", reference_cell_type=i)
----> 4 new_result = new_model.sample_hmc(num_results= 200000,num_burnin= 90000)
5 new_result.effect_df.to_csv("/Users/juan.henao/Documents/sc_T1D/CD4_compositional_analysis/"+i+".csv")
6 No_reference=new_result.effect_df.drop(index=i,level=1)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/model/scCODA_model.py in sample_hmc(self, num_results, num_burnin, num_adapt_steps, num_leapfrog_steps, step_size)
347 model_specs = {"reference": self.reference_cell_type, "formula": self.formula}
348
--> 349 return res.CAResultConverter(posterior=posterior,
350 posterior_predictive=posterior_predictive,
351 observed_data=observed_data,
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in to_result_data(self, sampling_stats, model_specs)
31 obs = self.observed_data_to_xarray()
32
---> 33 return CAResult(
34 sampling_stats, model_specs,
35 **{
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in __init__(self, sampling_stats, model_specs, **kwargs)
96 self.model_specs = model_specs
97
---> 98 intercept_df, effect_df = self.summary_prepare()
99
100 self.intercept_df = intercept_df
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in summary_prepare(self, est_fdr, *args, **kwargs)
158 # Calculation of columns that are not from az.summary
159 intercept_df = self.complete_alpha_df(intercept_df)
--> 160 effect_df = self.complete_beta_df(intercept_df, effect_df, est_fdr)
161
162 # Give nice column names, remove unnecessary columns
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in complete_beta_df(self, intercept_df, effect_df, fdr)
258 return 1., 0
259
--> 260 threshold, fdr = opt_thresh(effect_df, fdr)
261
262 self.model_specs["threshold_prob"] = threshold
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in opt_thresh(result, alpha)
246 def opt_thresh(result, alpha):
247
--> 248 incs = np.array(result.loc[result["Inclusion probability"] > 0, "Inclusion probability"])
249 incs[::-1].sort()
250
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
2798 if self.columns.nlevels > 1:
2799 return self._getitem_multilevel(key)
-> 2800 indexer = self.columns.get_loc(key)
2801 if is_integer(indexer):
2802 indexer = [indexer]
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
2646 return self._engine.get_loc(key)
2647 except KeyError:
-> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
2650 if indexer.ndim > 1 or indexer.size > 1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'Inclusion probability'
Thanks in advance,
Juan
Hi,
In the notebook:
https://github.com/theislab/scCODA/blob/master/docs/source/getting_started.ipynb
the log fold change is control vs disease, but what if I have cond 1 and cond2, how to make one of them as a reference, so we can easily interpret the sign of fold change?
Hi,
I wonder if you have any suggestion on how to sample on data with multiple timepoints.
I am running it now by using timepoint as an additional covariate, however, I was trying to understand if it would be more correct to run a different model for each timepoint.
I am currently testing the model "condition + timepoint". The reason I used all data (multiple timepoints) was that I thought it might help and speed up the inference.
Thanks!
Is there a way to have a progressbar to asses how long the sampling will take?
I have difficult time selecting reference celltypes. dispersion plot show one cell type but it has small count do you have any recommendation how many cells as minimum should be consider as a reference.
I get the following issue:
ERROR:tensorflow:
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7fef746d6b90>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2478, in while_loop_v2
return_same_structure=True) File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2757, in while_loop
return result File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_probability/python/mcmc/internal/util.py", line 389, in _body
return i + 1, state, trace_arrays File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_probability/python/mcmc/internal/util.py", line 386, in <listcomp>
a.write(i, v) for a, v in zip( File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_core/python/util/tf_should_use.py", line 237, in wrapped
error_in_function=error_in_function)
Tensorflow == 2.1.0
TF-probability == 0.9.0
I've been incorporating CoDA-valid methodologies in my workflows ever since I realized the caveats of using relative abundance. There's not many packages that implement CoDA versions of differential expression/abundance other than ALDEx[2], ANCOM[-BC], and Songbird.
Are there any assumptions about the data that would render scCODA
invalid for 1) non single-celled transcriptomics in one species; 2) metagenomics; or 3) metatranscriptomics?
Regardless, thank you for writing this software as I've been wanting to learn how to use tensorflow_probability
for quite some time and your code is very organized to follow logically. Most of the tutorials I come across are for image recognition.
Hi,
I am running a small chain to infer cell type contributions in a dataset composed by 8 samples, with 14 cell types.
I get the following summary:
Compositional Analysis summary (extended):
Data: 8 samples, 14 cell types
Reference index: 7
Formula: C(condition, Treatment("WT"))
Spike-and-slab threshold: 0.642
MCMC Sampling: Sampled 20000 chain states (5000 burnin samples) in 128.565 sec. Acceptance rate: 57.8%
where the acceptance rate in 57.8%.
How can I make sure that I explored correctly the posterior distribution and I don't need a longer chain?
Is an acceptance rate of 57.8% in line with a correct estimation or are there additional parameters that need to be tuned for this model?
Thanks!
Francesco
Hi scCODA team and thank you for making this nice tool available.
I am running into warnings when running model_scelltype_Time.sample_hmc() apparently involving beta_nonzero_mean.append() and ret.dtype.type(). The function still generate an output with meaningful results.
As I'm not an expert in Python, I wanted to know if this warning can be ignored or not and what kind of error was involved (a setup, implementation or input design issue?). Does it affect the reliability of the output?
Here is what I got as an output (I'm using the singularity image to run scCODA):
`
model_scelltype_Time = mod.CompositionalAnalysis(data_sccelltype, formula="stage + clone", reference_cell_type="automatic")
Automatic reference selection! Reference cell type set to IPC/nN
2022-05-30 14:55:56.605202: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/R/lib/R/lib::/.singularity.d/libs
2022-05-30 14:55:56.605258: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-05-30 14:55:56.605300: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (c18n01.ruddle.hpc.yale.internal): /proc/driver/nvidia/version does not exist
2022-05-30 14:55:56.607509: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Zero counts encountered in data! Added a pseudocount of 0.5.
simres_celltype_Time = model_scelltype_Time.sample_hmc()
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
2022-05-30 14:56:32.299744: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
0%| | 0/20000 [00:00<?, ?it/s]
2022-05-30 14:56:33.121718: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x2b066c011710 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-05-30 14:56:33.121764: I tensorflow/compiler/xla/service/service.cc:179] StreamExecutor device (0): Host, Default Version
2022-05-30 14:56:33.223335: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env varMLIR_CRASH_REPRODUCER_DIRECTORY
to enable.
2022-05-30 14:56:34.553831: I tensorflow/compiler/jit/xla_compilation_cache.cc:363] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20000/20000 [02:26<00:00, 136.38it/s]
MCMC sampling finished. (186.700 sec)
Acceptance rate: 80.6%
/opt/python/lib/python3.8/site-packages/sccoda/util/result_classes.py:252: RuntimeWarning: Mean of empty slice.
beta_nonzero_mean.append(beta_i_raw[beta_i_raw_nonzero].mean())
/opt/python/lib/python3.8/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)`
Another question was the importance of the Acceptance rate %? Is it an important output? how can it be interpreted and below which value is it problematic?
Thank you for your help!
Hi,
after long time, I finally managed to give it a try. I would be very interested to use this method for an organoid dataset I am currently analysing.
In short, my dataset consists of ~200k samples in ~1.5k conditions and 15 phenotypic classes (analogous of cell types in scRNAseq data). Similarly to what you would do in a scRNAseq dataset, I am interested in assessing a compositional difference in classes between the various conditions.
The "frequency" table as showed in the tutorial looks like this:
running
data = dat.from_pandas(comp_table, covariate_columns=["Compound_name"])
# Extract condition from mouse name and add it as an extra column to the covariates
data.obs["Condition"] = data.obs["Compound_name"]
model = mod.CompositionalAnalysis(data, formula="Condition", baseline_index=1460)
sim_results = model.sample_hmc()
sim_results.summary()
Finishes the MCMC sampling with
MCMC sampling finished. (847.986 sec)
Acceptance rate: 23.8%
and then throw the error below.
I will reduce the size of the dataset and try to re run it as well.
Another question I have with respect to usage is: would it be possible to add other covariates in the model, like biological replicates and the likes?
Thanks a lot in advance!
~/miniconda3/envs/scdc/lib/python3.7/site-packages/scdcdm/model/dirichlet_models.py in sample_hmc(self, num_results, n_burnin, num_leapfrog_steps, step_size)
192 observed_data=observed_data,
193 dims=dims,
--> 194 coords=coords).to_result_data(y_hat, baseline=False)
195
196 def sample_nuts(self, num_results=int(10e3), n_burnin=int(5e3), max_tree_depth=10, step_size=0.01):
~/miniconda3/envs/scdc/lib/python3.7/site-packages/scdcdm/util/result_classes.py in to_result_data(self, y_hat, baseline)
23 def to_result_data(self, y_hat, baseline):
24
---> 25 post = self.posterior_to_xarray()
26 ss = self.sample_stats_to_xarray()
27 postp = self.posterior_predictive_to_xarray()
~/miniconda3/envs/scdc/lib/python3.7/site-packages/arviz/data/base.py in wrapped(cls, *args, **kwargs)
35 if all([getattr(cls, prop_i) is None for prop_i in prop]):
36 return None
---> 37 return func(cls, *args, **kwargs)
38
39 return wrapped
~/miniconda3/envs/scdc/lib/python3.7/site-packages/arviz/data/io_dict.py in posterior_to_xarray(self)
53 )
54
---> 55 return dict_to_dataset(data, library=None, coords=self.coords, dims=self.dims)
56
57 @requires("sample_stats")
~/miniconda3/envs/scdc/lib/python3.7/site-packages/arviz/data/base.py in dict_to_dataset(data, attrs, library, coords, dims)
199 for key, values in data.items():
200 data_vars[key] = numpy_to_data_array(
--> 201 values, var_name=key, coords=coords, dims=dims.get(key)
202 )
203 return xr.Dataset(data_vars=data_vars, attrs=make_attrs(attrs=attrs, library=library))
~/miniconda3/envs/scdc/lib/python3.7/site-packages/arviz/data/base.py in numpy_to_data_array(ary, var_name, coords, dims)
164 # filter coords based on the dims
165 coords = {key: xr.IndexVariable((key,), data=coords[key]) for key in dims}
--> 166 return xr.DataArray(ary, coords=coords, dims=dims)
167
168
~/miniconda3/envs/scdc/lib/python3.7/site-packages/xarray/core/dataarray.py in init(self, data, coords, dims, name, attrs, indexes, fastpath)
342 data = _check_data_shape(data, coords, dims)
343 data = as_compatible_data(data)
--> 344 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
345 variable = Variable(dims, data, attrs, fastpath=True)
346 indexes = dict(
~/miniconda3/envs/scdc/lib/python3.7/site-packages/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
153 "conflicting sizes for dimension %r: "
154 "length %s on the data but length %s on "
--> 155 "coordinate %r" % (d, sizes[d], s, k)
156 )
157
ValueError: conflicting sizes for dimension 'cell_type_nb': length 14 on the data but length 15 on coordinate 'cell_type_nb'
</details>.
Hi,
I have ran scCODA on a few sub-clusterings, and in many of them I get zero fold-change in the summary of scCODA run, though there is clearly a non-zero fold-change between the contions.
Attached is one example of cell counts per condition per cluster and the frequency graph for this data.
What I ran is:
counts_per_sample = pd.read_csv("counts.csv")
counts_adata = dat.from_pandas(counts_per_sample, covariate_columns=['Condition'])
model = mod.CompositionalAnalysis(counts_adata, formula=f"C('Condition', Treatment('ctl'))", reference_cell_type='automatic')
sim_results = model.sample_hmc()
sim_results.summary()
The full output is also attached.
Thanks for your help!
Rachelly.
Hi scCODA team,
Line 265 in data_visualization gives an error when running boxplots with plot_faucets = True.
The issue is in the col_wrap parameter for the sns function. np.floor gives a float, not an int, which leads to an error in the FacetGrid function.
HI,
can only external condition should be consider as condition in scCODA model or should also use control-case status as a condition?
can also I use continuous variable such as treatment exposure time as a condition or scCODA only accept categorical variable?
Thank you
Hi,
I wanted to run the scdcdm model as a python script and I was wondering how I could save the simulation results such that I can inspect the results lateron. I haven't found a save/write function and using arviz.to_netcdf()
allows only to save the InferenceData
, but not the model results.
Help is much appreciated.
I installed scCODA (pip install sccoda) without any problem/warning and performed the basic analysis following the getting started tutorial, until the model setup and inference. Unfortunately I cannot import sccoda.util.comp_ana function into python. Following the tutorial:
from sccoda.util import cell_composition_data as dat
from sccoda.util import data_visualization as viz
from sccoda.util.comp_ana import comp_ana as mod
doesn't work for my python installation (conda, python 3.9), thus I used
import sccoda.util.cell_composition_data as dat
import sccoda.util.data_visualization as viz
successfully. When I try to import comp_ana using:
import sccoda.util.comp_ana as mod
The kernel crushes. If I import sccoda separately;
import sccoda
And look for the possible functions, but I only can see cell_composition_data and data_visualization, but not comp_ana
comp_ana.py is under the sccoda/util folder where it is supposed to be. I am not sure what exactly the problem is, but somehow comp_ana is 'invisible'.
I have a Macbook pro M1 with 8-core, my environment details are:
anndata 0.7.6
scanpy 1.8.1
sinfo 0.3.4
-----
PIL 8.2.0
anyio NA
appnope 0.1.2
argon2 20.1.0
attr 21.2.0
babel 2.9.1
backcall 0.2.0
brotli NA
cairo 1.19.1
certifi 2021.05.30
cffi 1.14.5
chardet 4.0.0
cloudpickle 1.6.0
cycler 0.10.0
cython_runtime NA
dask 2021.06.0
dateutil 2.8.1
decorator 4.4.2
fsspec 2021.06.0
google NA
h5py 3.1.0
idna 2.10
igraph 0.9.6
ipykernel 5.5.5
ipython_genutils 0.2.0
jedi 0.18.0
jinja2 3.0.1
joblib 1.0.1
json5 NA
jsonschema 3.2.0
jupyter_server 1.8.0
jupyterlab_server 2.6.0
kiwisolver 1.3.1
leidenalg 0.8.4
llvmlite 0.36.0
markupsafe 2.0.1
matplotlib 3.4.2
mpl_toolkits NA
natsort 7.1.1
nbclassic NA
nbformat 5.1.3
numba 0.53.1
numexpr 2.7.3
numpy 1.19.5
packaging 20.9
pandas 1.2.4
parso 0.8.2
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
prometheus_client NA
prompt_toolkit 3.0.18
psutil 5.8.0
ptyprocess 0.7.0
pvectorc NA
pyexpat NA
pygments 2.9.0
pyparsing 2.4.7
pyrsistent NA
pytz 2021.1
requests 2.25.1
sccoda 0.1.4
scipy 1.6.2
seaborn 0.11.1
send2trash NA
six 1.15.0
sklearn 0.24.2
sniffio 1.2.0
socks 1.7.1
sparse 0.12.0
statsmodels 0.12.2
storemagic NA
tables 3.6.1
tblib 1.7.0
terminado 0.10.1
texttable 1.6.3
tlz 0.11.1
toolz 0.11.1
tornado 6.1
traitlets 5.0.5
typing_extensions NA
urllib3 1.26.5
wcwidth 0.2.5
websocket 0.57.0
yaml 5.4.1
zipp NA
zmq 22.1.0
-----
IPython 7.24.1
jupyter_client 6.1.12
jupyter_core 4.7.1
jupyterlab 3.0.16
notebook 6.4.0
-----
Python 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:15) [Clang 11.1.0 ]
macOS-11.2.3-x86_64-i386-64bit
8 logical CPU cores, i386
-----
Session information updated at 2021-09-26 14:07
I have tried this on a linux server as well with identical results.
Thanks for the time
Hey there,
I have successfully used this tool on my data a couple of months ago. unfortunately I now get the following error (both for the model with ans without baseline):
MCMC sampling finished. (189.715 sec)
Acceptance rate: 45.1%
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-65-b8c3ad8ee3bd> in <module>
7 for i in range(5):
8 ca = mod.CompositionalAnalysis(a, "condition", baseline_index=baseline)
----> 9 params_mcmc = ca.sample_hmc()
10 thisgrp = [i[1] for i in params_mcmc.summary_prepare()[1][params_mcmc.summary_prepare()[1]['final_parameter'] != 0]['final_parameter'].index]
11 print('Significant:', thisgrp)
/storage/groups/ml01/workspace/leander.dony/projects/organoids_chronic/git/SCDCdm_public/scdcdm/model/dirichlet_models.py in sample_hmc(self, num_results, n_burnin, num_leapfrog_steps, step_size)
224 dims=dims,
225 coords=coords).to_result_data(sampling_stats=sampling_stats,
--> 226 model_specs=model_specs)
227
228 def sample_nuts(self, num_results=int(10e3), n_burnin=int(5e3), max_tree_depth=10, step_size=0.01):
/storage/groups/ml01/workspace/leander.dony/projects/organoids_chronic/git/SCDCdm_public/scdcdm/util/result_classes.py in to_result_data(self, sampling_stats, model_specs)
33 "sample_stats_prior": ssp,
34 "prior_predictive": prip,
---> 35 "observed_data": obs,
36 }
37 )
/storage/groups/ml01/workspace/leander.dony/projects/organoids_chronic/git/SCDCdm_public/scdcdm/util/result_classes.py in __init__(self, sampling_stats, model_specs, **kwargs)
78 self.model_specs = model_specs
79
---> 80 intercept_df, effect_df = self.summary_prepare()
81
82 self.intercept_df = intercept_df
/storage/groups/ml01/workspace/leander.dony/projects/organoids_chronic/git/SCDCdm_public/scdcdm/util/result_classes.py in summary_prepare(self, *args, **kwargs)
132 hpds_new = hpds.str.replace("hpd_", "HPD ")
133
--> 134 intercept_df = intercept_df.loc[:, ["final_parameter", hpds[0], hpds[1], "sd", "expected_sample"]]
135 intercept_df = intercept_df.rename(columns=dict(zip(
136 intercept_df.columns,
/opt/python/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
3927 if is_scalar(key):
3928 key = com.cast_scalar_indexer(key)
-> 3929 return getitem(key)
3930
3931 if isinstance(key, slice):
IndexError: index 0 is out of bounds for axis 0 with size 0
Hello,
I am working with a single cell data set with 12 clusters and 3 treatment conditions (including the control). For a couple different cell types I've tried to set as the reference, I get a division by zero error after running model_obj_iggvsnat.sample_hmc() . I am not sure why this happens when I set certain cell types as the reference, but not with others.
There have also been some instances where I have run the code based on the getting started tutorial, and received results, but then have had the "division by zero" error when reruning the same code without changing the reference cell type or anything else.
Any insights into what might be going on here would be really helpful, thank you!
Hi,
I'm using the boxplot wrapper from this package to plot differences in proportions.
I noticed that, if I want to specify the hue_order
of the feature_name
categories with:
viz.boxplots(
data, feature_name='genotype', dpi=150, cmap=['#393e41', '#ca3c25'],
figsize=(10,10), args_boxplot={'hue_order': ['Control', 'Case']},
add_dots=True,
)
the dots added by add_dots=True
do not respect the hue_order
parameter but they default to the original ordering (images attached, without and with custom hue_order
).
I'm using scCODA==0.1.2.post1
.
Thanks
Sorry, I am not sure where to post issues, so this is a duplicate.
Hi,
I am trying to use scdcdm with an existing scanpy object.
First, I tried using scdcdm.util.cell_composition_data.from_scanpy
, but it turns out that this function does not return a CompositionalData object. Instead, it returns a np.array with cell counts and a list with covariates, which cannot be directly used as an input to scdcdm.util.comp_ana.CompositionalAnalysis
. Then, I tried using scdcdm.util.cell_composition_data.from_scanpy_list
, but it returned the following error:
Traceback (most recent call last):
File "<ipython-input-26-2483c45c611d>", line 3, in <module>
covariate_key='Condition')
File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/util/cell_composition_data.py", line 77, in from_scanpy_list
covariate_data = covariate_data.append(pd.Series(covs), ignore_index=True)
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 7116, in append
other.values.reshape((1, len(other))),
AttributeError: 'Categorical' object has no attribute 'reshape'
Any tips?
Hi all, I wanted to follow the examples suggested at the end of README.md
page but rendering of html pages is gibberish (see here).
Hi, I am excited to use scCODA for my data analysis, however I am running into an issue. I am following the getting started tutorial notebook, and I get a segmentation fault when I run this line:
sim_results = model_salm.sample_hmc()
I am suspicious that I didn't install scCODA properly? I followed the guidelines as posted on the github repo for installation (fresh conda environment, Python 3.8.5). I noticed that on the scCODA github, it specifically mentioned tensorflow==2.3.2
as the version but after pip installing scCODA, I noticed that the version was 2.4.1
.
Hi
Thank you for the package! I wonder, how should I obtain the p-values for cell type abundance comparisons?
Can I use Inclusion probability
from summary_extended
for it? As far as I understand the methods, this is used for selecting the credible effects.
Also, in equation 12 in the paper shouldn't there be max
instead of min
?
Thank you
Hi sccoda team,
While trying to also run sccoda on my laptop (MacBook Air M1), and downloading all relevant packages for my project, I've come across a version discrepancy. (Note: when I installed sccoda/tensorflow on an iMac a few weeks back the issue did not arise)
When installing tensorflow:
sccoda 0.1.7 requires numpy>=1.21, but you have numpy 1.19.5 which is incompatible. sccoda 0.1.7 requires tensorflow>=2.8, but you have tensorflow 2.5.0 which is incompatible. sccoda 0.1.7 requires tensorflow-probability>=0.16.0, but you have tensorflow-probability 0.12.0 which is incompatible.
On the description for sccoda installation however, other package versions are described.
This package uses the
tensorflow (>=2.4, <2.6)
andtensorflow-probability (==0.12)
packages.
Which package versions ones should I be using?
Also, I've been having issues importing sccoda into my jupyter notebook, as the kernel stops running. Could that issue be related to the package versions? (Note: I do not have this issue using sccoda in jupyter notebook on an iMac, only on my macbook air)
Thanks!
this is a duplicate because i am not sure where to post issues.
hi,
i am trying to run scdcdm with a tiny fraction of my real data set as a test, and i am basically following the tutorial.
however, i get the error message below, which i do not understand. i would appreciate any advice!
In [110]: import scdcdm
In [111]: from scdcdm.util import cell_composition_data as dat
In [112]: from scdcdm.util import comp_ana as mod
In [113]: dat_pandas
Out[113]:
0 1 2 condition
0 1065 1001 1232 control
1 439 435 197 treatment
In [114]: dat_scdcdm = dat.from_pandas(dat_pandas,
...: covariate_columns=["condition"])
Transforming to str index.
In [115]: dat_model = mod.CompositionalAnalysis(dat_scdcdm,
...: formula="condition",
...: baseline_index=None)
In [116]: sim_results = dat_model.sample_hmc()
WARNING:tensorflow:5 out of the last 5 calls to <function CompositionalModel.sampling.<locals>.sample_mcmc at 0x7f8fe4d5e8c0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
MCMC sampling finished. (42.321 sec)
Acceptance rate: 35.0%
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:376: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[key] = _infer_fill_value(value)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:576: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self.obj[item_labels[indexer[info_axis]]] = value
Traceback (most recent call last):
File "<ipython-input-116-4c85d20e3019>", line 1, in <module>
sim_results = dat_model.sample_hmc()
File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/model/dirichlet_models.py", line 226, in sample_hmc
model_specs=model_specs)
File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/util/result_classes.py", line 35, in to_result_data
"observed_data": obs,
File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/util/result_classes.py", line 80, in __init__
intercept_df, effect_df = self.summary_prepare()
File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/util/result_classes.py", line 134, in summary_prepare
intercept_df = intercept_df.loc[:, ["final_parameter", hpds[0], hpds[1], "sd", "expected_sample"]]
File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4280, in __getitem__
return getitem(key)
IndexError: index 0 is out of bounds for axis 0 with size 0
Hi,
My data set contains 155 clusters and 2 conditions (treatment vs control). I performed 15 sample_hmc runs with the same data and the same NoBaselineModel, and the results could be broadly classified as follows:
10/15 runs: ~15-20 cell types had non-zero final parameter values, all values were negative
5/15 runs: ~50-60 cell types had non-zero final parameter values, all values were positive
Expectedly, the cell types that had negative final parameter values in some runs were not overlapping with the cell types that had positive final parameter values in the other runs. However, I am puzzled by the fact that there was not a single run in which I would have captured both positive and negative cell types at the same time.
What can it mean and how should I go about it? Should I do some kind of bootstrapping?
Thanks!
Hi scCODA team,
Thanks for creating this great algorithm. I have a question on MCMC reproducibility. I am running the following model:
model_nb = mod.CompositionalAnalysis(data_nb, formula="C(MYCN, Treatment('non-amplified'))", reference_cell_type="Endothelial")
When I run the model sampling two separate times, I get slight disagreements in what effects are credible. Below, in one run the stroma cells show up as credible and in another run they show up as not credible.
Do you have recommendations on what to do in this case? I assume these effects are at the borderline of being credible. Is there a random seed I should set to get reproducible MCMC? And do you have any suggestions on setting the false discovery rate in a systematic manner? Thank you so much! Best, Orr
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.