calico / spatial_lda
Probabilistic topic model for identifying cellular micro-environments.
License: MIT License
The infer() function accepts an argument for the number of parallel processes to use, but then hard-codes the number of processes in the _update_xis step to 2. See below:
xis = _update_xis(sample_features,
difference_matrices,
difference_penalty,
gamma=gamma,
n_parallel_processes=2,
verbosity=0)
This makes inference unnecessarily slow.
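A minimal sketch of the fix, assuming infer()'s own argument is also named n_parallel_processes, is to forward the caller's value instead of the literal 2:
xis = _update_xis(sample_features,
                  difference_matrices,
                  difference_penalty,
                  gamma=gamma,
                  n_parallel_processes=n_parallel_processes,  # forward infer()'s argument
                  verbosity=0)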
I think there is an extra call to compute_r() in the line_search() function on line 137 of admm.py. It seems like build_linear_system() returns a value for r on line 128, and then compute_r() is called again on line 137 with the same, unchanged arguments.
This is the call on line 128:
M, r = build_linear_system(gamma, u, C, e, rho, s, t)
And this is the definition of build_linear_system():
def build_linear_system(gamma, u, C, e, rho, s, t):
"""Build the linear system for ADMM.primal_dual (see appendix section 5.2.7)."""
n, k = e.shape
H = hessian_f0(gamma, e, rho, s)
uC = scipy.sparse.diags(np.squeeze(u), 0).dot(C)
Cg = scipy.sparse.diags(np.squeeze(C.dot(gamma)))
M = scipy.sparse.vstack((scipy.sparse.hstack((H, C.T)),
scipy.sparse.hstack((-uC, -Cg)))).tocsr()
r = compute_r(gamma, u, C, e, rho, s, t)
return M, r
And then compute_r() is called again with the same variables on line 137:
r = compute_r(gamma, u, C, e, rho, s, t)
I'm not sure if this was intentional, and I don't know how computationally taxing it is, since I haven't had a chance to profile compute_r(). But if the call is redundant, it is potentially executed up to 1,500 times more than necessary per topic (15 iterations of update_xis inside admm(), each with up to 100 iterations of line_search() inside primal_dual()).
When increasing the number of topics to be discovered, I get an exception:
Exception: Stopping in admm.newton_regularized_dirichlet.
As you suggested, it could be fixed by increasing the number of line-search iterations (ls_iter in admm.newton_regularized_dirichlet) and the number of Dirichlet iterations (max_dirichlet_iter in admm.admm).
Could you add these options to model.train, please?
func_type = {"marker": neighborhood_to_marker,
             "cluster": neighborhood_to_cluster,
             "avg_marker": neighborhood_to_avg_marker,
             "count": neighborhood_to_count}
# Only the marker-based featurizers take a `markers` argument.
if TYPE in ("marker", "avg_marker"):
    neighborhood_feature_fn = functools.partial(func_type[TYPE], markers=markers)
else:
    neighborhood_feature_fn = func_type[TYPE]  # no extra arguments needed
The max_iter argument used in the admm() function currently has a default value of 15, with no option to decrease it, nor is there a criterion for breaking out of the loop early. Given that a substantial amount of computational time is spent in this part of the code during model training, the option to reduce the number of iterations manually might speed things up, especially if the later iterations yield little improvement.
Incorporating a break in the loop when the marginal gain from the last iteration falls below some tolerance could also be helpful. When I ran the example code on the TNBC data, the objective was changing by less than 1% as early as the fifth iteration (see the logging output for sample 6). I should also note that I set TRAIN_SIZE_FRACTION = 0.1 instead of 0.9 in order to run the demo locally on my machine.
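A tolerance-based break might look like this sketch; step_fn stands in for one ADMM update returning the current objective value, and is an assumption rather than the library's API:
import math

def run_admm(step_fn, max_iter=15, tol=0.01):
    """Run up to max_iter ADMM updates, stopping early once the relative
    change in the objective drops below tol (e.g. 0.01 for the <1% changes
    observed from the fifth iteration onward)."""
    prev = math.inf
    objective = prev
    for _ in range(max_iter):
        objective = step_fn()
        if math.isfinite(prev) and abs(prev - objective) <= tol * abs(prev):
            break
        prev = objective
    return objective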
N_PARALLEL_PROCESSES = 1 #@param
TRAIN_SIZE_FRACTION = 0.1 #@param
N_TOPICS = 3 #@param
DIFFERENCE_PENALTY = 250 #@param
RETRAIN_MODEL = True #@param
logger = logging.getLogger()
logger.setLevel(logging.INFO)
N_IMMUNE_CLUSTERS = 12
with open(PATH_TO_PATIENT_DFS_PKL, 'rb') as f:
patient_dfs = pickle.load(f)
for patient_id in patient_dfs.keys():
df = patient_dfs[patient_id]
df['combined_cluster_id'] = (df['immune_cluster'].fillna(0) +
(df.cluster_id + N_IMMUNE_CLUSTERS).fillna(0))
df.loc[df['combined_cluster_id'] == 0, 'combined_cluster_id'] = None
df.loc[:, 'is_tumor'] = ~df['isimmune']
patient_dfs[patient_id] = df
compartmentalized_tumors = [3, 4, 5, 6, 9, 10, 16, 28, 32, 35, 36, 37, 40, 41]
noncompartmentalized_tumors = [x for x in patient_dfs.keys()
if x not in compartmentalized_tumors]
immune_cluster_names={8:'Treg',6:'MF/Glia',1:'B',4:'CD11c-high',7:'NK',0:'CD4T',
2:'DC',3:'CD8T',10:'MF',11:'Neutrophil',9:'Other', 5:'MF'}
other_cluster_names={0:'Epithelial',2:'Tumor/Keratin',3:'Tumor/EGFR',
4:'Endothelial/Vim',1:'Mesenchymal/SMA'}
markers = patient_dfs[1].columns[2:44]
num_patients = 41
if os.path.exists(PATH_TO_TUMOR_MARKER_FEATURES_PKL):
with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'rb') as f:
tumor_marker_features = pickle.load(f)
else:
neighborhood_feature_fn = functools.partial(neighborhood_to_marker,
markers=markers)
tumor_marker_features = featurize_tumors(patient_dfs,
neighborhood_feature_fn,
radius=75,
n_processes=N_PARALLEL_PROCESSES)
with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'wb') as f:
pickle.dump(tumor_marker_features, f)
# Subsample tumor cells
all_tumor_idxs = tumor_marker_features.index.map(lambda x: x[0])
_sets = train_test_split(tumor_marker_features,
test_size=1. - TRAIN_SIZE_FRACTION,
stratify=all_tumor_idxs)
train_tumor_marker_features, test_tumor_marker_features = _sets
train_difference_matrices = make_merged_difference_matrices(
train_tumor_marker_features, patient_dfs, 'x', 'y')
tumor_idxs = train_tumor_marker_features.index.map(lambda x: x[0])
path_to_train_model = '_'.join((f'{PATH_TO_MODELS}/tnbc_training',
f'penalty={DIFFERENCE_PENALTY}',
f'topics={N_TOPICS}',
f'trainfrac={TRAIN_SIZE_FRACTION}')) + '.pkl'
if os.path.exists(path_to_train_model) and not RETRAIN_MODEL:
print('Loading existing model and inferred topics.')
with open(path_to_train_model, 'rb') as f:
spatial_lda_model = pickle.load(f)
else:
spatial_lda_model = spatial_lda.spatial_lda.model.train(train_tumor_marker_features,
train_difference_matrices,
n_topics=N_TOPICS,
difference_penalty=DIFFERENCE_PENALTY,
verbosity=1,
n_parallel_processes=N_PARALLEL_PROCESSES,
n_iters=1,
admm_rho=0.1,
primal_dual_mu=2)
with open(path_to_train_model, 'wb') as f:
pickle.dump(spatial_lda_model, f)
The online_lda.py module tries to import logsumexp from sklearn, but this import is outdated in recent scikit-learn versions and throws an ImportError. Importing it from scipy.special works instead.
from sklearn.utils.fixes import logsumexp
ImportError: cannot import name 'logsumexp'
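The replacement that worked for me; scipy.special has provided logsumexp for a long time, so this should be a drop-in fix:
from scipy.special import logsumexp  # replaces the sklearn.utils.fixes import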
I'd like to run spatial-LDA on all the cells in my dataset. I tried to set all cells as index cells by setting the is_anchor_col argument of the featurize_samples function to a column where all values are True, but that produces an error.
if os.path.exists(PATH_TO_TUMOR_MARKER_FEATURES_PKL):
with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'rb') as f:
tumor_marker_features = pickle.load(f)
else:
neighborhood_feature_fn = functools.partial(neighborhood_to_marker,
markers=markers)
tumor_marker_features = featurize_samples(patient_dfs, neighborhood_feature_fn, 100, 'is_index',
'x', 'y', n_processes=N_PARALLEL_PROCESSES)
with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'wb') as f:
pickle.dump(tumor_marker_features, f)
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/spatial_lda-0.0.3-py3.7.egg/spatial_lda/featurization.py", line 51, in _featurize_sample
is_anchor_col, x_col, y_col, z_col=z_col)
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/spatial_lda-0.0.3-py3.7.egg/spatial_lda/featurization.py", line 38, in _featurize_cells
neighborhood_kdTree = KDTree(neighborhood_cells[coord_cols].values)
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/scipy/spatial/kdtree.py", line 239, in init
self.maxes = np.amax(self.data,axis=0)
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2505, in amax
initial=initial)
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input> in <module>
7
8 tumor_marker_features = featurize_samples(patient_dfs, neighborhood_feature_fn, 100, 'is_index',
----> 9 'x', 'y', n_processes=N_PARALLEL_PROCESSES)
10 with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'wb') as f:
11 pickle.dump(tumor_marker_features, f)
~/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/spatial_lda-0.0.3-py3.7.egg/spatial_lda/featurization.py in featurize_samples(sample_dfs, neighborhood_feature_fn, radius, is_anchor_col, x_col, y_col, z_col, n_processes)
87 all_sample_features = list(tqdm(pool.imap(featurize_sample_fn,
88 sample_dfs.items()),
---> 89 total=total))
90 else:
91 for i, sample_df in sample_dfs.items():
~/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/tqdm/_tqdm_notebook.py in __iter__(self, *args, **kwargs)
221 def __iter__(self, *args, **kwargs):
222 try:
--> 223 for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
224 # return super(tqdm...) will not catch exception
225 yield obj
~/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/tqdm/_tqdm.py in __iter__(self)
1003 """), fp_write=getattr(self.fp, 'write', sys.stderr.write))
1004
-> 1005 for obj in iterable:
1006 yield obj
1007 # Update and possibly print the progressbar.
~/miniconda3/envs/spatial-lda/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
746 if success:
747 return value
--> 748 raise value
749
750 __next__ = next # XXX
ValueError: zero-size array to reduction operation maximum which has no identity
I think it's because neighborhood_cells becomes an empty array if I set all cells as index cells.
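For what it's worth, here is a minimal reproduction of that failure mode, assuming the neighborhood KDTree is built only from the non-anchor cells (so the coordinate array is empty when every cell is an anchor):
import numpy as np
from scipy.spatial import KDTree

neighborhood_coords = np.empty((0, 2))  # no non-anchor cells remain
KDTree(neighborhood_coords)  # ValueError: zero-size array to reduction operation maximum ...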
Can you kindly advise on how to run the pipeline on all the cells, please?
Thanks a lot!
Magda
In featurization, featurize_samples takes x_col, y_col, and z_col as inputs; make_merged_difference_matrices, however, does not.
Please change make_merged_difference_matrices to accept and use x_col and y_col arguments.
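A hypothetical sketch of the requested signature; the defaults are assumed to match the current hard-coded behavior:
def make_merged_difference_matrices(sample_features, sample_dfs,
                                    x_col='x', y_col='y', z_col=None):
    # Build the difference matrices from the given coordinate columns,
    # mirroring featurize_samples instead of hard-coding 'x' and 'y'.
    ...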
Both the "Mouse Spleen Analysis" and the "TNBC Analysis" require permission to be opened.
Probably should make those visible by link instead.
If this is the repo meant to be released, the URL in setup.py should point to this repository instead of the multi_imaging one.
Just wondering: will you share the code for your new neural spatial LDA paper?
Thank you.
Hi,
I wanted to fit the spatial LDA model on ~1.6 million cells (using all cells except ~50000 tumor cells as anchor cells), but I ran into some issues when training the model. Here is a snippet of the code:
spatial_lda_models = {}
difference_penalty = 0.25
N_TOPICS_LIST = [3, 4, 5, 6, 7, 8, 9, 10, 11]
for n in N_TOPICS_LIST:
    path_to_train_model = '_'.join((f'spatialLDA_model/',
                                    f'penalty = {difference_penalty}',
                                    f'topics={n}',
                                    f'trainfrac=0.9')) + '.pkl'
    print(f'Running n_topics = {n}, d = {difference_penalty}\n')
    spatial_lda_model = spatial_lda.model.train(sample_features=train_hodgkin_features,
                                                difference_matrices=train_difference_matrices,
                                                difference_penalty=difference_penalty,
                                                n_topics=n,
                                                n_parallel_processes=20,
                                                verbosity=1,
                                                admm_rho=0.1,
                                                primal_dual_mu=10)
    spatial_lda_models[n] = spatial_lda_model
    with open(path_to_train_model, 'wb') as f:
        pickle.dump(spatial_lda_model, f)
order_topics_consistently(spatial_lda_models.values())
When training the model, I got the exception: stopping in admm.newton_regularized_dirichlet. From the traceback, it seems this is caused by the hyperparameter primal_dual_mu. I tried training the model on a subset of the data with the same hyperparameters and the code ran successfully. Can you provide some insight into the issue I am facing? Could the large sample size combined with an improper hyperparameter choice be making the optimization algorithm diverge?
The online_lda module uses np.float at spatial_lda/spatial_lda/online_lda.py (line 31 in de6b00e), which raises an AttributeError with numpy 1.24 when importing.
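For context, np.float was deprecated in NumPy 1.20 and removed in 1.24; the builtin float (or np.float64 for an explicit width) is the drop-in replacement:
import numpy as np

arr = np.zeros(3, dtype=float)          # works on all NumPy versions
arr64 = np.zeros(3, dtype=np.float64)   # explicit 64-bit, equivalent to the old np.float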