Light

theislab / sccoda Goto Github PK

View Code? Open in Web Editor NEW

137.0 9.0 23.0 149.07 MB

A Bayesian model for compositional single-cell data analysis

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 81.00% Python 19.00%

tensorflow compositional-changes statistical-analysis single-cell-rna-seq

sccoda's Issues

Including Bad Cells in Model

Hi there,

I'm greatly appreciative of the way scCODA models changes in whole-sample composition versus univariate cell counts.

This in mind, if I am working with a dataset with a prevalence of stress (say, 5-10% of the cells in every sample fail QC), should I include these cells in final compositional abundance testing? Seeing as they are part of overall sample composition. I do know that stress was unfortunately more prevalent in one condition. And yes, I know avoiding this initially would have been much more preferable but here we are.

Thanks!
Kane

est_fdr did not change the result

thanks for the development of the great scCODA.

It may be my own error. I found that the result did not change when I changed the est_fdr. Could anyone help me?

>>> sim_results.set_fdr(est_fdr=0.05)
>>> sim_results.summary()
Compositional Analysis summary:

Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group

Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532

Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663

>>> sim_results.set_fdr(est_fdr=0.00001)
>>> sim_results.summary()
Compositional Analysis summary:

Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group

Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532

Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663

>>> sim_results.set_fdr(est_fdr=0.000000000000000000000000000000000000000000000000001)
>>> sim_results.summary()
Compositional Analysis summary:

Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group

Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532

Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663

>>> sim_results.set_fdr(est_fdr=0)
>>> sim_results.summary()
Compositional Analysis summary:

Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group

Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532

Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663

>>> sim_results.set_fdr(est_fdr=0.5)
>>> sim_results.summary()
Compositional Analysis summary:

Data: 23 samples, 2 cell types
Reference index: 0
Formula: Group

Intercepts:
Final Parameter Expected Sample
Cell Type
GABA 2.843 63.188077
GLU 3.257 95.594532

Effects:
Final Parameter Expected Sample log2-fold change
Covariate Cell Type
Group[T.Mouse] GABA 0.00000 126.398962 1.000260
GLU -1.77579 32.383646 -1.561663

Mismatched sample labels

I have used scCODA in the past without any problems. Now that I am using it with new samples, I noticed that the sample labels on the bar graph don't line up with the data. In the attached graph, I know from umaps that sample 10 contains cell type 5, but that specific bar is always located on the far right and given whichever sample label is at the end of 'B'. It doesn't matter if I change the covariate_df or if I change the order of samples within my anndata, my graph always comes out with mismatched labels. Code below. Any help with this glitch would be greatly appreciated.

B=['DG_4365_WS2','DG_4363_WS3','DG_4367_WS5','DG_4369_WS6','DG_4377_WS10','DG_4365_WR2','DG_4362_WR2','DG_4371_WR6','DG_4376_WR9','DG_4382_WR12','DG_4404_AS7','DG_4408_AS9','DG_4395_AS3','DG_4401_AS5','WR_4396_AS4','DG_4390_AR2','DG_4400_AR4','DG_4405_AR6','DG_4409_AR8','DG_4402_AR5']
cov_df = pd.DataFrame({"Cond": ['1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20']}, index=B)

data_scanpy_1 = dat.from_scanpy(
adata,
cell_type_identifier="leiden",
sample_identifier="batches",
covariate_df=cov_df
)

viz.stacked_barplot(data_scanpy_1, feature_name="Cond",cmap=umap_cmap)
plt.xticks(rotation='vertical')
plt.title('Sample Composition')

P value (FDR) and interpretation of "Final Parameter"

thanks for the development of the great scCODA.

here, i wonder how to get the P value (or FDR) of the analysis.

And could you please teach me how to interpret the "Final Parameter". For example, if a cell type, "Endothelial cell", got 2.007198 as "Final Parameter".

verbose option for `sample_hmc`

It would be nice to have a verbose option for sample_hmc to turn off the printing of the progress, which would be useful for using scCODA in reports.

WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.

Dear scCODA team,

Thanks a lot for this package, its turning out very useful for my snRNAseq analysis!

I am finding the same error mentioned in this other issue #40 however I have the versions required by the scCODA package installed for tensorflow and tensorflow-probability:

In [3]: import tensorflow

In [4]: tensorflow.__version__
Out[4]: '2.8.0'
In [6]: import tensorflow_probability

In [7]: tensorflow_probability.__version__
Out[7]: '0.16.0'

Any ideas whats going on or if it has any negative effect on the results?

Thanks!

Is the input data expected by scCODA the size of the cell type for each sample?

From the tutorial and the example of from_pandas, it appears that scCODA actually uses input data in the form of cell type sizes rather than single cell UMIs data. What the function from_scanpy does is simply calculate the size of the cell type in each sample, right?

            Mouse  Endocrine  Enterocyte  Enterocyte.Progenitor  Goblet  Stem    TA  TA.Early  Tuft
0       Control_1         36          59                    136      36   239   125       191    18
1       Control_2          5          46                     23      20    50    11        40     5
2       Control_3         45          98                    188     124   250   155       365    33
3       Control_4         26         221                    198      36   131   130       196     4
4  H.poly.Day10_1         42          71                    203     147   271   109       180   146
5  H.poly.Day10_2         40          57                    383     170   321   244       256    71
6   H.poly.Day3_1         52          75                    347      66   323   263       313    51
7   H.poly.Day3_2         65         126                    115      33    65    39       129    59
8          Salm_1         37         332                    113      59    90    47       132    10
9          Salm_2         32         373                    116      67   117    65       168    12

The values in the table above represent the count of cells contained in a particular cell type in a given sample, correct?

Add scikit-bio as dependecies

sample_hmc() fails without error when NAs are present

This is a bit of an edge case, but I discovered that if any of the cell counts in the df dataframe of sccoda.util.cell_composition_data.from_pandas is NA, sample_hmc() will return a result with a 0% acceptance rate without an error or warning. This is not likely a common issue, but can occur depending how the cell-type counts dataframe is generated. It may be helpful to add an error/warning or simply recode NAs as 0 in data.X under the hood.

Warning triggered tf.function retracing

I got this Warning when running log chains. Any idea?

WARNING:tensorflow:5 out of the last 5 calls to <function CompositionalModel.sampling..sample_mcmc at 0x7f937941c7a0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/tutorials/customization/performance#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.

WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature

Hi,

many thanks for the package! After running ".sample_hmc()" I receive a warning:

WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.

After investigating the results with arviz, I observed that sample-stats -> chain only contains level 0 (I have multiple conditions, but I tried with 2 and 3, one being the reference), I furthermore could not find a 'mu' and 'tau' in the results with az.summary(all_results, var_names=['mu', 'tau'], filter_vars="regex"). Previously I ran:

mod.CompositionalAnalysis(CCM_GMEAN_SUB, formula="C(TREATMENT_COLNAME, Treatment('ELEMENT1_IN_TREATMENT_COLNAME'))",
reference_cell_type="CELL_TYPE_REFERENCE").

Thanks for your thoughts.

Test All Pairwise Comparisons

Hi there,

My dataset contains three conditions (A, B, C), but none of them are a control. That is, I would be interested in compositional changes between A & B, A & C, and B & C. Is there a way to configure scCODA or the formula parameter that doesn't assign a single value to the control? (removing patsy's "treatment coding"?)

If not, an alternative I may employ (and would be grateful for feedback regarding!) is based on this response: running scCODA multiple times with one of my three conditions as the Control a third of time (correcting for the +/- results in the final summation of credible non-zero results). Would this be functional?

Thanks a lot!

Stacked barplot not working

Hi, I'm able to get an appropriate/accurate boxplot with viz.boxplots(). However, when I try viz.stacked.barplot() with the same input, I get the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-7b7aa02bf827> in <module>
----> 1 viz.stacked.barplot(data_scanpy, feature_name = "Condition")
      2 plt.show()

AttributeError: module 'sccoda.util.data_visualization' has no attribute 'stacked'

Any help would be appreciated. Thanks.

Looping over cell types to use as reference: discrepancy?

Hi All,

I am using scCODA for differential abundance analysis of a scRNA-seq dataset.
If I perform a manual reference cell type selection using e.g. B-cells as reference, and another time using dendritic cells as reference, I find that in both cases, for example, T-cells are differentially abundant between conditions, at a permissive FDR of 0.4.

However, when looping over the different cell types to use each as a reference sequentially to check the stability of the results, I notice that the, e.g., T-cells are only significant once. How can this discrepancy be explained?

Here is the code for automatic/manual selection of these cell types (automatic selection selects another cell type than B-cells).

## automatic
cellCountData = dat.from_pandas(cellCounts, covariate_columns=["treatment"])
model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="automatic")
res = model.sample_hmc()
# no credible effect has been found for any cell population
res.summary()
# Some effect at higher FDR
res.set_fdr(est_fdr=0.4)
res.summary()

## manual
model = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type="B-cell")
resBCell = model.sample_hmc()
# no credible effect has been found for any cell population
resBCell.summary()
# Some effect at higher FDR
resBCell.set_fdr(est_fdr=0.4)
resBCell.summary()

To loop over the cell types, I am using the code from the vignette (slightly adapted to allow for permissive FDR).

cell_types = cellCountData.var.index
results_cycle = pd.DataFrame(index=cell_types, columns=["times_credible"]).fillna(0)

for ct in cell_types:
    print(f"Reference: {ct}")

    # Run inference
    model_temp = mod.CompositionalAnalysis(cellCountData, formula="treatment", reference_cell_type=ct)
    temp_results = model_temp.sample_hmc()

    # Select credible effects
    temp_results.set_fdr(est_fdr=0.4)
    cred_eff = temp_results.credible_effects()
    cred_eff.index = cred_eff.index.droplevel(level=0)

    # add up credible effects
    results_cycle["times_credible"] += cred_eff.astype("int")

Result reproducibility issues depending on Python version

Hello,

With an other member of my lab, we are starting to use scCODA for some analysis. We have encountered some puzzling results.
Here is the extended summary for a run in python 3.8.11, using "Undef" as the reference cell type:

As you can see, according to the summary it looks like that only B-cells present a significant difference in proportion. However, the credible_effects report states that there is a difference for Granulocytes too.

Now, if we repeat the analysis in a python 3.9.6 environment:

This time, what is reported in the summary and the credible_effects reports seems consistent. You can also note the difference of running time.

So, I was wondering which of these results should I trust ?
We also noticed that in Python3.9.6, the plot to look for a reference cell type is broken ( AttributeError: module 'sccoda.util.data_visualization' has no attribute 'rel_abundance_dispersion_plot' ) and it is not possible to run mod.CompositionalAnalysis() in automatic mode for the selection of the reference cell type ( TypeError: '<' not supported between instances of 'str' and 'int' )

In case this problem is data-dependent, here is the input table:
Input_scCODA.csv

If that's linked to an installation problem, I usually use conda environments:

creating a new environment with the desired python version
conda install -c conda-forge notebook
pip install sccoda

Thank you for the help and feedback !

Solved: ImportError: cannot import name 'CloudPickler'

Hi all - I was getting the following error:

ImportError: cannot import name 'CloudPickler' from 'cloudpickle.cloudpickle'

If anyone else faces similar issue on a fresh install of scCODA or tensorflow-probability 0.11.0, you need to roll-back your cloudpickle install to 1.3.0.

pip install cloudpickle==1.3.0

This resolves the issue.

logfoldchange always -0.430725 for non significant classes

I always get a logfodchange of -0.430725 if the composition change in that celltype was non-significant. Is this expected?

how to assess sccoda results

Hi,
Sccoda is suggesting some cell types to be affected by the inhibitor, what further analysis to perform on sccoda results?

API design comment

HI,

small comment about the API. It's really cool that it makes uses of AnnData! However, I feel that this would not make it very convenient for users with an AnnData just out of scRNA-seq analysis pipeline. For instance, the slot adata.X would be already in use, and all the covariates (cell types, conditions) would be already stored in adata.obs

What about a simple function that takes:

the anndata as is (e.g. just out of a scRNA-seq pipeline)
the covariates of choice.

and simply wrangle the adata.obs to suit mod.CompositionalAnalysis.

Something simple like

def wrangle_input(data: ad.AnnData, covariates: list):
    comp_table = anndata.obs[covariates].groupby(covariates).size().unstack(fill_value=0).reset_index()
    return comp_table

Anyway, great tool again!

Influence of subtypes present in the dataset

Hi,

This is more of a question than a bug. If I should post it somewhere else, please let me know.

I am doing some tests with scCODA and was wondering about the relevance of the number of "populations" included in the dataset. Say I originally have 20 groups (=clusters) and 2 conditions and find no compositional difference between conditions. What influence would it make to subset my dataset to a lower number of groups (e.g. removing 2 groups of less related cells)?

Thanks,

A

python convention lower case

Can we change the package name to scdcpy to comply with the lowercase python convention?
I.e. import scdcpy instead of import SCDCpy - until we have the final name...

boostrapping on posterior samples for inclusion prob variance

check literature if this correct

Exit code 134

Hi,
I ran scCODA the following way:

counts_per_sample = pd.read_csv("counts2.csv")
counts_adata = dat.from_pandas(counts_per_sample, covariate_columns=['Condition'])
model = mod.CompositionalAnalysis(counts_adata, formula=f"C('Condition', Treatment('ctl'))", reference_cell_type='automatic')
sim_results = model.sample_hmc()
sim_results.summary()

The execution was halted due to Exit code 134. Full output and input files are attached.
Thanks,
Rachelly.

counts2.csv
output2.txt.txt

How to save the results as dataframe or show the whole summary?

Hi here, thank you for giving us a new method for evaluating the celltype abundance. I tried this tool following tutorials but still could not get the full results for my own data.
When I tried to print, the data is too long to display completely. Can we write it into csv? so we can use it for other analysis. Thanks a lot!

Error in version update_0.1.3: 'Inclusion probability'

Hi.

I am trying to run scCODA with the last update (0.1.3). However, when I run the function ".sample" I get the next error:

KeyError: 'Inclusion probability'

The code I used:

    new_model = mod.CompositionalAnalysis(scDC_data, formula="Condition", reference_cell_type=i)
    new_result = new_model.sample_hmc(num_results= 200000,num_burnin= 90000)
    new_result.effect_df.to_csv("/Users/juan.henao/Documents/"+i+".csv")
    No_reference=new_result.effect_df.drop(index=i,level=1)
    No_reference.sort_values("Final Parameter",ascending=False).index.get_level_values("Cell Type").

The entire error I get:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2645             try:
-> 2646                 return self._engine.get_loc(key)
   2647             except KeyError:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Inclusion probability'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-11-c6ce1911fcf1> in <module>
      2     print(i)
      3     new_model = mod.CompositionalAnalysis(scDC_data, formula="Condition", reference_cell_type=i)
----> 4     new_result = new_model.sample_hmc(num_results= 200000,num_burnin= 90000)
      5     new_result.effect_df.to_csv("/Users/juan.henao/Documents/sc_T1D/CD4_compositional_analysis/"+i+".csv")
      6     No_reference=new_result.effect_df.drop(index=i,level=1)

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/model/scCODA_model.py in sample_hmc(self, num_results, num_burnin, num_adapt_steps, num_leapfrog_steps, step_size)
    347         model_specs = {"reference": self.reference_cell_type, "formula": self.formula}
    348 
--> 349         return res.CAResultConverter(posterior=posterior,
    350                                      posterior_predictive=posterior_predictive,
    351                                      observed_data=observed_data,

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in to_result_data(self, sampling_stats, model_specs)
     31         obs = self.observed_data_to_xarray()
     32 
---> 33         return CAResult(
     34             sampling_stats, model_specs,
     35             **{

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in __init__(self, sampling_stats, model_specs, **kwargs)
     96         self.model_specs = model_specs
     97 
---> 98         intercept_df, effect_df = self.summary_prepare()
     99 
    100         self.intercept_df = intercept_df

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in summary_prepare(self, est_fdr, *args, **kwargs)
    158         # Calculation of columns that are not from az.summary
    159         intercept_df = self.complete_alpha_df(intercept_df)
--> 160         effect_df = self.complete_beta_df(intercept_df, effect_df, est_fdr)
    161 
    162         # Give nice column names, remove unnecessary columns

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in complete_beta_df(self, intercept_df, effect_df, fdr)
    258             return 1., 0
    259 
--> 260         threshold, fdr = opt_thresh(effect_df, fdr)
    261 
    262         self.model_specs["threshold_prob"] = threshold

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/sccoda/util/result_classes.py in opt_thresh(result, alpha)
    246         def opt_thresh(result, alpha):
    247 
--> 248             incs = np.array(result.loc[result["Inclusion probability"] > 0, "Inclusion probability"])
    249             incs[::-1].sort()
    250 

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2798             if self.columns.nlevels > 1:
   2799                 return self._getitem_multilevel(key)
-> 2800             indexer = self.columns.get_loc(key)
   2801             if is_integer(indexer):
   2802                 indexer = [indexer]

/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2646                 return self._engine.get_loc(key)
   2647             except KeyError:
-> 2648                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2649         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2650         if indexer.ndim > 1 or indexer.size > 1:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Inclusion probability'

Thanks in advance,

Juan

how to make one condition as a reference

Hi,
In the notebook:
https://github.com/theislab/scCODA/blob/master/docs/source/getting_started.ipynb

the log fold change is control vs disease, but what if I have cond 1 and cond2, how to make one of them as a reference, so we can easily interpret the sign of fold change?

How to approach inference on data with multiple timepoints

Hi,

I wonder if you have any suggestion on how to sample on data with multiple timepoints.
I am running it now by using timepoint as an additional covariate, however, I was trying to understand if it would be more correct to run a different model for each timepoint.
I am currently testing the model "condition + timepoint". The reason I used all data (multiple timepoints) was that I thought it might help and speed up the inference.

Thanks!

Progessbar possible?

Is there a way to have a progressbar to asses how long the sampling will take?

how to select cell reference

I have difficult time selecting reference celltypes. dispersion plot show one cell type but it has small count do you have any recommendation how many cells as minimum should be consider as a reference.

Error: Object was never used

I get the following issue:

ERROR:tensorflow:
Object was never used (type <class 'tensorflow.python.ops.tensor_array_ops.TensorArray'>):
<tensorflow.python.ops.tensor_array_ops.TensorArray object at 0x7fef746d6b90>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2478, in while_loop_v2
    return_same_structure=True)  File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_core/python/ops/control_flow_ops.py", line 2757, in while_loop
    return result  File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_probability/python/mcmc/internal/util.py", line 389, in _body
    return i + 1, state, trace_arrays  File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_probability/python/mcmc/internal/util.py", line 386, in <listcomp>
    a.write(i, v) for a, v in zip(  File "/home/benni/miniconda3/envs/scanpy/lib/python3.7/site-packages/tensorflow_core/python/util/tf_should_use.py", line 237, in wrapped
    error_in_function=error_in_function)

Tensorflow == 2.1.0
TF-probability == 0.9.0

Can this be used for differential expression or abundance analysis in non-single celled data (e.g., metagenomics/transcriptomics)?

I've been incorporating CoDA-valid methodologies in my workflows ever since I realized the caveats of using relative abundance. There's not many packages that implement CoDA versions of differential expression/abundance other than ALDEx[2], ANCOM[-BC], and Songbird.

Are there any assumptions about the data that would render scCODA invalid for 1) non single-celled transcriptomics in one species; 2) metagenomics; or 3) metatranscriptomics?

Regardless, thank you for writing this software as I've been wanting to learn how to use tensorflow_probability for quite some time and your code is very organized to follow logically. Most of the tutorials I come across are for image recognition.

Optimal tuning of parameters for estimation

Hi,

I am running a small chain to infer cell type contributions in a dataset composed by 8 samples, with 14 cell types.
I get the following summary:

Compositional Analysis summary (extended):

Data: 8 samples, 14 cell types
Reference index: 7
Formula: C(condition, Treatment("WT"))
Spike-and-slab threshold: 0.642

MCMC Sampling: Sampled 20000 chain states (5000 burnin samples) in 128.565 sec. Acceptance rate: 57.8%

where the acceptance rate in 57.8%.
How can I make sure that I explored correctly the posterior distribution and I don't need a longer chain?
Is an acceptance rate of 57.8% in line with a correct estimation or are there additional parameters that need to be tuned for this model?

Thanks!
Francesco

Importance of warnings/errors in model_scelltype_Time.sample_hmc() output

Hi scCODA team and thank you for making this nice tool available.
I am running into warnings when running model_scelltype_Time.sample_hmc() apparently involving beta_nonzero_mean.append() and ret.dtype.type(). The function still generate an output with meaningful results.
As I'm not an expert in Python, I wanted to know if this warning can be ignored or not and what kind of error was involved (a setup, implementation or input design issue?). Does it affect the reliability of the output?
Here is what I got as an output (I'm using the singularity image to run scCODA):
`

model_scelltype_Time = mod.CompositionalAnalysis(data_sccelltype, formula="stage + clone", reference_cell_type="automatic")
Automatic reference selection! Reference cell type set to IPC/nN
2022-05-30 14:55:56.605202: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/R/lib/R/lib::/.singularity.d/libs
2022-05-30 14:55:56.605258: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-05-30 14:55:56.605300: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (c18n01.ruddle.hpc.yale.internal): /proc/driver/nvidia/version does not exist
2022-05-30 14:55:56.607509: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Zero counts encountered in data! Added a pseudocount of 0.5.
simres_celltype_Time = model_scelltype_Time.sample_hmc()
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
WARNING:tensorflow:@custom_gradient grad_fn has 'variables' in signature, but no ResourceVariables were used on the forward pass.
2022-05-30 14:56:32.299744: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
0%| | 0/20000 [00:00<?, ?it/s]
2022-05-30 14:56:33.121718: I tensorflow/compiler/xla/service/service.cc:171] XLA service 0x2b066c011710 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2022-05-30 14:56:33.121764: I tensorflow/compiler/xla/service/service.cc:179] StreamExecutor device (0): Host, Default Version
2022-05-30 14:56:33.223335: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:210] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2022-05-30 14:56:34.553831: I tensorflow/compiler/jit/xla_compilation_cache.cc:363] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20000/20000 [02:26<00:00, 136.38it/s]
MCMC sampling finished. (186.700 sec)
Acceptance rate: 80.6%
/opt/python/lib/python3.8/site-packages/sccoda/util/result_classes.py:252: RuntimeWarning: Mean of empty slice.
beta_nonzero_mean.append(beta_i_raw[beta_i_raw_nonzero].mean())
/opt/python/lib/python3.8/site-packages/numpy/core/_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)`

Another question was the importance of the Acceptance rate %? Is it an important output? how can it be interpreted and below which value is it problematic?
Thank you for your help!

Error + question on model usage

Hi,

after long time, I finally managed to give it a try. I would be very interested to use this method for an organoid dataset I am currently analysing.

In short, my dataset consists of ~200k samples in ~1.5k conditions and 15 phenotypic classes (analogous of cell types in scRNAseq data). Similarly to what you would do in a scRNAseq dataset, I am interested in assessing a compositional difference in classes between the various conditions.

The "frequency" table as showed in the tutorial looks like this:

running

data = dat.from_pandas(comp_table, covariate_columns=["Compound_name"])

# Extract condition from mouse name and add it as an extra column to the covariates
data.obs["Condition"] = data.obs["Compound_name"]

model = mod.CompositionalAnalysis(data, formula="Condition", baseline_index=1460)

sim_results = model.sample_hmc()
sim_results.summary()

Finishes the MCMC sampling with

MCMC sampling finished. (847.986 sec)
Acceptance rate: 23.8%

and then throw the error below.
I will reduce the size of the dataset and try to re run it as well.

Another question I have with respect to usage is: would it be possible to add other covariates in the model, like biological replicates and the likes?

Thanks a lot in advance!

Error:

``` alueError Traceback (most recent call last) in ----> 1 sim_results = model.sample_hmc() 2 sim_results.summary()

~/miniconda3/envs/scdc/lib/python3.7/site-packages/scdcdm/model/dirichlet_models.py in sample_hmc(self, num_results, n_burnin, num_leapfrog_steps, step_size)
192 observed_data=observed_data,
193 dims=dims,
--> 194 coords=coords).to_result_data(y_hat, baseline=False)
195
196 def sample_nuts(self, num_results=int(10e3), n_burnin=int(5e3), max_tree_depth=10, step_size=0.01):

~/miniconda3/envs/scdc/lib/python3.7/site-packages/scdcdm/util/result_classes.py in to_result_data(self, y_hat, baseline)
23 def to_result_data(self, y_hat, baseline):
24
---> 25 post = self.posterior_to_xarray()
26 ss = self.sample_stats_to_xarray()
27 postp = self.posterior_predictive_to_xarray()

~/miniconda3/envs/scdc/lib/python3.7/site-packages/arviz/data/base.py in wrapped(cls, *args, **kwargs)
35 if all([getattr(cls, prop_i) is None for prop_i in prop]):
36 return None
---> 37 return func(cls, *args, **kwargs)
38
39 return wrapped

~/miniconda3/envs/scdc/lib/python3.7/site-packages/arviz/data/io_dict.py in posterior_to_xarray(self)
53 )
54
---> 55 return dict_to_dataset(data, library=None, coords=self.coords, dims=self.dims)
56
57 @requires("sample_stats")

~/miniconda3/envs/scdc/lib/python3.7/site-packages/arviz/data/base.py in dict_to_dataset(data, attrs, library, coords, dims)
199 for key, values in data.items():
200 data_vars[key] = numpy_to_data_array(
--> 201 values, var_name=key, coords=coords, dims=dims.get(key)
202 )
203 return xr.Dataset(data_vars=data_vars, attrs=make_attrs(attrs=attrs, library=library))

~/miniconda3/envs/scdc/lib/python3.7/site-packages/arviz/data/base.py in numpy_to_data_array(ary, var_name, coords, dims)
164 # filter coords based on the dims
165 coords = {key: xr.IndexVariable((key,), data=coords[key]) for key in dims}
--> 166 return xr.DataArray(ary, coords=coords, dims=dims)
167
168

~/miniconda3/envs/scdc/lib/python3.7/site-packages/xarray/core/dataarray.py in init(self, data, coords, dims, name, attrs, indexes, fastpath)
342 data = _check_data_shape(data, coords, dims)
343 data = as_compatible_data(data)
--> 344 coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
345 variable = Variable(dims, data, attrs, fastpath=True)
346 indexes = dict(

~/miniconda3/envs/scdc/lib/python3.7/site-packages/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
153 "conflicting sizes for dimension %r: "
154 "length %s on the data but length %s on "
--> 155 "coordinate %r" % (d, sizes[d], s, k)
156 )
157

ValueError: conflicting sizes for dimension 'cell_type_nb': length 14 on the data but length 15 on coordinate 'cell_type_nb'

</details>.

Zero Fold-change in scCODA results

Hi,
I have ran scCODA on a few sub-clusterings, and in many of them I get zero fold-change in the summary of scCODA run, though there is clearly a non-zero fold-change between the contions.
Attached is one example of cell counts per condition per cluster and the frequency graph for this data.

What I ran is:

    counts_per_sample = pd.read_csv("counts.csv")
    counts_adata = dat.from_pandas(counts_per_sample, covariate_columns=['Condition'])
    model = mod.CompositionalAnalysis(counts_adata, formula=f"C('Condition', Treatment('ctl'))", reference_cell_type='automatic')
    sim_results = model.sample_hmc()
    sim_results.summary()

The full output is also attached.

Thanks for your help!
Rachelly.

counts.csv
output.txt

col_wrap in utils.visualization.boxplots

Hi scCODA team,

Line 265 in data_visualization gives an error when running boxplots with plot_faucets = True.
The issue is in the col_wrap parameter for the sns function. np.floor gives a float, not an int, which leads to an error in the FacetGrid function.

which condition scCODA can work with?

HI,
can only external condition should be consider as condition in scCODA model or should also use control-case status as a condition?
can also I use continuous variable such as treatment exposure time as a condition or scCODA only accept categorical variable?

Thank you

How to save `sim_results` object?

Hi,
I wanted to run the scdcdm model as a python script and I was wondering how I could save the simulation results such that I can inspect the results lateron. I haven't found a save/write function and using arviz.to_netcdf() allows only to save the InferenceData, but not the model results.
Help is much appreciated.

Cannot import sccoda.util.comp_ana

I installed scCODA (pip install sccoda) without any problem/warning and performed the basic analysis following the getting started tutorial, until the model setup and inference. Unfortunately I cannot import sccoda.util.comp_ana function into python. Following the tutorial:

from sccoda.util import cell_composition_data as dat
from sccoda.util import data_visualization as viz
from sccoda.util.comp_ana import comp_ana as mod

doesn't work for my python installation (conda, python 3.9), thus I used

import sccoda.util.cell_composition_data as dat
import sccoda.util.data_visualization as viz

successfully. When I try to import comp_ana using:

import sccoda.util.comp_ana as mod

The kernel crushes. If I import sccoda separately;

import sccoda

And look for the possible functions, but I only can see cell_composition_data and data_visualization, but not comp_ana

comp_ana.py is under the sccoda/util folder where it is supposed to be. I am not sure what exactly the problem is, but somehow comp_ana is 'invisible'.

I have a Macbook pro M1 with 8-core, my environment details are:

anndata     0.7.6
scanpy      1.8.1
sinfo       0.3.4
-----
PIL                 8.2.0
anyio               NA
appnope             0.1.2
argon2              20.1.0
attr                21.2.0
babel               2.9.1
backcall            0.2.0
brotli              NA
cairo               1.19.1
certifi             2021.05.30
cffi                1.14.5
chardet             4.0.0
cloudpickle         1.6.0
cycler              0.10.0
cython_runtime      NA
dask                2021.06.0
dateutil            2.8.1
decorator           4.4.2
fsspec              2021.06.0
google              NA
h5py                3.1.0
idna                2.10
igraph              0.9.6
ipykernel           5.5.5
ipython_genutils    0.2.0
jedi                0.18.0
jinja2              3.0.1
joblib              1.0.1
json5               NA
jsonschema          3.2.0
jupyter_server      1.8.0
jupyterlab_server   2.6.0
kiwisolver          1.3.1
leidenalg           0.8.4
llvmlite            0.36.0
markupsafe          2.0.1
matplotlib          3.4.2
mpl_toolkits        NA
natsort             7.1.1
nbclassic           NA
nbformat            5.1.3
numba               0.53.1
numexpr             2.7.3
numpy               1.19.5
packaging           20.9
pandas              1.2.4
parso               0.8.2
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
prometheus_client   NA
prompt_toolkit      3.0.18
psutil              5.8.0
ptyprocess          0.7.0
pvectorc            NA
pyexpat             NA
pygments            2.9.0
pyparsing           2.4.7
pyrsistent          NA
pytz                2021.1
requests            2.25.1
sccoda              0.1.4
scipy               1.6.2
seaborn             0.11.1
send2trash          NA
six                 1.15.0
sklearn             0.24.2
sniffio             1.2.0
socks               1.7.1
sparse              0.12.0
statsmodels         0.12.2
storemagic          NA
tables              3.6.1
tblib               1.7.0
terminado           0.10.1
texttable           1.6.3
tlz                 0.11.1
toolz               0.11.1
tornado             6.1
traitlets           5.0.5
typing_extensions   NA
urllib3             1.26.5
wcwidth             0.2.5
websocket           0.57.0
yaml                5.4.1
zipp                NA
zmq                 22.1.0
-----
IPython             7.24.1
jupyter_client      6.1.12
jupyter_core        4.7.1
jupyterlab          3.0.16
notebook            6.4.0
-----
Python 3.9.4 | packaged by conda-forge | (default, May 10 2021, 22:13:15) [Clang 11.1.0 ]
macOS-11.2.3-x86_64-i386-64bit
8 logical CPU cores, i386
-----
Session information updated at 2021-09-26 14:07

I have tried this on a linux server as well with identical results.
Thanks for the time

Error in samle_hmc()

Hey there,
I have successfully used this tool on my data a couple of months ago. unfortunately I now get the following error (both for the model with ans without baseline):

MCMC sampling finished. (189.715 sec)
Acceptance rate: 45.1%

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-65-b8c3ad8ee3bd> in <module>
      7 for i in range(5):
      8     ca = mod.CompositionalAnalysis(a, "condition", baseline_index=baseline)
----> 9     params_mcmc = ca.sample_hmc()
     10     thisgrp = [i[1] for i in params_mcmc.summary_prepare()[1][params_mcmc.summary_prepare()[1]['final_parameter'] != 0]['final_parameter'].index]
     11     print('Significant:', thisgrp)

/storage/groups/ml01/workspace/leander.dony/projects/organoids_chronic/git/SCDCdm_public/scdcdm/model/dirichlet_models.py in sample_hmc(self, num_results, n_burnin, num_leapfrog_steps, step_size)
    224                                      dims=dims,
    225                                      coords=coords).to_result_data(sampling_stats=sampling_stats,
--> 226                                                                    model_specs=model_specs)
    227 
    228     def sample_nuts(self, num_results=int(10e3), n_burnin=int(5e3), max_tree_depth=10, step_size=0.01):

/storage/groups/ml01/workspace/leander.dony/projects/organoids_chronic/git/SCDCdm_public/scdcdm/util/result_classes.py in to_result_data(self, sampling_stats, model_specs)
     33                 "sample_stats_prior": ssp,
     34                 "prior_predictive": prip,
---> 35                 "observed_data": obs,
     36             }
     37         )

/storage/groups/ml01/workspace/leander.dony/projects/organoids_chronic/git/SCDCdm_public/scdcdm/util/result_classes.py in __init__(self, sampling_stats, model_specs, **kwargs)
     78         self.model_specs = model_specs
     79 
---> 80         intercept_df, effect_df = self.summary_prepare()
     81 
     82         self.intercept_df = intercept_df

/storage/groups/ml01/workspace/leander.dony/projects/organoids_chronic/git/SCDCdm_public/scdcdm/util/result_classes.py in summary_prepare(self, *args, **kwargs)
    132         hpds_new = hpds.str.replace("hpd_", "HPD ")
    133 
--> 134         intercept_df = intercept_df.loc[:, ["final_parameter", hpds[0], hpds[1], "sd", "expected_sample"]]
    135         intercept_df = intercept_df.rename(columns=dict(zip(
    136             intercept_df.columns,

/opt/python/lib/python3.7/site-packages/pandas/core/indexes/base.py in __getitem__(self, key)
   3927         if is_scalar(key):
   3928             key = com.cast_scalar_indexer(key)
-> 3929             return getitem(key)
   3930 
   3931         if isinstance(key, slice):

IndexError: index 0 is out of bounds for axis 0 with size 0

"division by zero" error

Hello,

I am working with a single cell data set with 12 clusters and 3 treatment conditions (including the control). For a couple different cell types I've tried to set as the reference, I get a division by zero error after running model_obj_iggvsnat.sample_hmc() . I am not sure why this happens when I set certain cell types as the reference, but not with others.

There have also been some instances where I have run the code based on the getting started tutorial, and received results, but then have had the "division by zero" error when reruning the same code without changing the reference cell type or anything else.

Any insights into what might be going on here would be really helpful, thank you!

add_dots parameter not respecting hue_order arg in boxplot

Hi,

I'm using the boxplot wrapper from this package to plot differences in proportions.
I noticed that, if I want to specify the hue_order of the feature_name categories with:

viz.boxplots(
    data, feature_name='genotype', dpi=150, cmap=['#393e41', '#ca3c25'],
    figsize=(10,10), args_boxplot={'hue_order': ['Control', 'Case']},
    add_dots=True,
)

the dots added by add_dots=True do not respect the hue_order parameter but they default to the original ordering (images attached, without and with custom hue_order).

I'm using scCODA==0.1.2.post1.

Thanks

import from scanpy

Sorry, I am not sure where to post issues, so this is a duplicate.

Hi,

I am trying to use scdcdm with an existing scanpy object.

First, I tried using scdcdm.util.cell_composition_data.from_scanpy, but it turns out that this function does not return a CompositionalData object. Instead, it returns a np.array with cell counts and a list with covariates, which cannot be directly used as an input to scdcdm.util.comp_ana.CompositionalAnalysis. Then, I tried using scdcdm.util.cell_composition_data.from_scanpy_list, but it returned the following error:

Traceback (most recent call last):

  File "<ipython-input-26-2483c45c611d>", line 3, in <module>
    covariate_key='Condition')

  File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/util/cell_composition_data.py", line 77, in from_scanpy_list
    covariate_data = covariate_data.append(pd.Series(covs), ignore_index=True)

  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 7116, in append
    other.values.reshape((1, len(other))),

AttributeError: 'Categorical' object has no attribute 'reshape'

Any tips?

Fix docs

Hi all, I wanted to follow the examples suggested at the end of README.md page but rendering of html pages is gibberish (see here).

Segmentation Fault

Hi, I am excited to use scCODA for my data analysis, however I am running into an issue. I am following the getting started tutorial notebook, and I get a segmentation fault when I run this line:

sim_results = model_salm.sample_hmc()

I am suspicious that I didn't install scCODA properly? I followed the guidelines as posted on the github repo for installation (fresh conda environment, Python 3.8.5). I noticed that on the scCODA github, it specifically mentioned tensorflow==2.3.2 as the version but after pip installing scCODA, I noticed that the version was 2.4.1.

Obtain p-values for cell type abundance effect

Hi
Thank you for the package! I wonder, how should I obtain the p-values for cell type abundance comparisons?
Can I use Inclusion probability from summary_extended for it? As far as I understand the methods, this is used for selecting the credible effects.

Also, in equation 12 in the paper shouldn't there be max instead of min?

Thank you

cc @karolinasenkow

Which tensorflow version to use?

Hi sccoda team,

While trying to also run sccoda on my laptop (MacBook Air M1), and downloading all relevant packages for my project, I've come across a version discrepancy. (Note: when I installed sccoda/tensorflow on an iMac a few weeks back the issue did not arise)

When installing tensorflow:
sccoda 0.1.7 requires numpy>=1.21, but you have numpy 1.19.5 which is incompatible. sccoda 0.1.7 requires tensorflow>=2.8, but you have tensorflow 2.5.0 which is incompatible. sccoda 0.1.7 requires tensorflow-probability>=0.16.0, but you have tensorflow-probability 0.12.0 which is incompatible.

On the description for sccoda installation however, other package versions are described.

This package uses the tensorflow (>=2.4, <2.6) and tensorflow-probability (==0.12) packages.

Which package versions ones should I be using?

Also, I've been having issues importing sccoda into my jupyter notebook, as the kernel stops running. Could that issue be related to the package versions? (Note: I do not have this issue using sccoda in jupyter notebook on an iMac, only on my macbook air)

Thanks!

sample_hmc error

this is a duplicate because i am not sure where to post issues.

hi,

i am trying to run scdcdm with a tiny fraction of my real data set as a test, and i am basically following the tutorial.

however, i get the error message below, which i do not understand. i would appreciate any advice!

In [110]: import scdcdm

In [111]: from scdcdm.util import cell_composition_data as dat

In [112]: from scdcdm.util import comp_ana as mod

In [113]: dat_pandas
Out[113]: 
      0     1     2  condition
0  1065  1001  1232    control
1   439   435   197  treatment

In [114]: dat_scdcdm = dat.from_pandas(dat_pandas,
     ...:                              covariate_columns=["condition"])
Transforming to str index.

In [115]: dat_model = mod.CompositionalAnalysis(dat_scdcdm,
     ...:                                       formula="condition",
     ...:                                       baseline_index=None)

In [116]: sim_results = dat_model.sample_hmc()
WARNING:tensorflow:5 out of the last 5 calls to <function CompositionalModel.sampling.<locals>.sample_mcmc at 0x7f8fe4d5e8c0> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for  more details.
MCMC sampling finished. (42.321 sec)
Acceptance rate: 35.0%
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:376: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[key] = _infer_fill_value(value)
/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py:576: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[item_labels[indexer[info_axis]]] = value
Traceback (most recent call last):

  File "<ipython-input-116-4c85d20e3019>", line 1, in <module>
    sim_results = dat_model.sample_hmc()

  File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/model/dirichlet_models.py", line 226, in sample_hmc
    model_specs=model_specs)

  File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/util/result_classes.py", line 35, in to_result_data
    "observed_data": obs,

  File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/util/result_classes.py", line 80, in __init__
    intercept_df, effect_df = self.summary_prepare()

  File "/Applications/python_modules/SCDCdm/SCDCdm_public/scdcdm/util/result_classes.py", line 134, in summary_prepare
    intercept_df = intercept_df.loc[:, ["final_parameter", hpds[0], hpds[1], "sd", "expected_sample"]]

  File "/opt/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4280, in __getitem__
    return getitem(key)

IndexError: index 0 is out of bounds for axis 0 with size 0

different results in different runs

Hi,

My data set contains 155 clusters and 2 conditions (treatment vs control). I performed 15 sample_hmc runs with the same data and the same NoBaselineModel, and the results could be broadly classified as follows:
10/15 runs: ~15-20 cell types had non-zero final parameter values, all values were negative
5/15 runs: ~50-60 cell types had non-zero final parameter values, all values were positive

Expectedly, the cell types that had negative final parameter values in some runs were not overlapping with the cell types that had positive final parameter values in the other runs. However, I am puzzled by the fact that there was not a single run in which I would have captured both positive and negative cell types at the same time.

What can it mean and how should I go about it? Should I do some kind of bootstrapping?

Thanks!

MCMC reproducibility

Hi scCODA team,
Thanks for creating this great algorithm. I have a question on MCMC reproducibility. I am running the following model:
model_nb = mod.CompositionalAnalysis(data_nb, formula="C(MYCN, Treatment('non-amplified'))", reference_cell_type="Endothelial")

When I run the model sampling two separate times, I get slight disagreements in what effects are credible. Below, in one run the stroma cells show up as credible and in another run they show up as not credible.

Do you have recommendations on what to do in this case? I assume these effects are at the borderline of being credible. Is there a random seed I should set to get reproducible MCMC? And do you have any suggestions on setting the false discovery rate in a systematic manner? Thank you so much! Best, Orr

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.