calico / spatial_lda
Probabilistic topic model for identifying cellular micro-environments.
License: MIT License
The infer() function accepts an argument for the number of parallel processes to use, but then hard-codes the number of processes in the _update_xis step to 2. See below:
xis = _update_xis(sample_features,
difference_matrices,
difference_penalty,
gamma=gamma,
n_parallel_processes=2,
verbosity=0)
This makes inference unnecessarily slow.
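A minimal sketch of the fix, assuming infer()'s own argument is also named n_parallel_processes, is to forward the caller's value instead of the literal 2:
xis = _update_xis(sample_features,
                  difference_matrices,
                  difference_penalty,
                  gamma=gamma,
                  n_parallel_processes=n_parallel_processes,  # forward infer()'s argument
                  verbosity=0)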
I think there is an extra call to compute_r() in the line_search() function on line 137 of admm.py. It seems like build_linear_system() returns a value for r on line 128, and then compute_r() is called again on line 137 with the same, unchanged arguments.
This is the call on line 128:
M, r = build_linear_system(gamma, u, C, e, rho, s, t)
And this is the definition of build_linear_system():
def build_linear_system(gamma, u, C, e, rho, s, t):
"""Build the linear system for ADMM.primal_dual (see appendix section 5.2.7)."""
n, k = e.shape
H = hessian_f0(gamma, e, rho, s)
uC = scipy.sparse.diags(np.squeeze(u), 0).dot(C)
Cg = scipy.sparse.diags(np.squeeze(C.dot(gamma)))
M = scipy.sparse.vstack((scipy.sparse.hstack((H, C.T)),
scipy.sparse.hstack((-uC, -Cg)))).tocsr()
r = compute_r(gamma, u, C, e, rho, s, t)
return M, r
And then compute_r() is called again with the same variables on line 137:
r = compute_r(gamma, u, C, e, rho, s, t)
I'm not sure if this was intentional, and I don't know how computationally taxing it is, since I haven't had a chance to profile compute_r(). But if the call is redundant, it is potentially executed up to 1,500 times more than necessary per topic (15 iterations of update_xis inside admm(), each with up to 100 iterations of line_search() inside primal_dual()).
When increasing the number of topics to be discovered, I get an exception:
Exception: Stopping in admm.newton_regularized_dirichlet.
As you suggested, it could be fixed by increasing the number of line-search iterations (ls_iter in admm.newton_regularized_dirichlet) and the number of Dirichlet iterations (max_dirichlet_iter in admm.admm).
Could you add these options to model.train, please?
func_type = {"marker": neighborhood_to_marker,
             "cluster": neighborhood_to_cluster,
             "avg_marker": neighborhood_to_avg_marker,
             "count": neighborhood_to_count}
# Only the marker-based featurizers take a `markers` argument.
if TYPE in ("marker", "avg_marker"):
    neighborhood_feature_fn = functools.partial(func_type[TYPE], markers=markers)
else:
    neighborhood_feature_fn = func_type[TYPE]  # no extra arguments needed
The max_iter argument used in the admm() function currently has a default value of 15, with no option to decrease it, nor is there a criterion for breaking out of the loop early. Given that a substantial amount of computational time is spent in this part of the code during model training, the option to reduce the number of iterations manually might speed things up, especially if the later iterations yield little improvement.
Incorporating a break in the loop when the marginal gain from the last iteration falls below some tolerance could also be helpful. When I ran the example code on the TNBC data, the objective was changing by less than 1% as early as the fifth iteration (see the logging output for sample 6). I should also note that I set TRAIN_SIZE_FRACTION = 0.1 instead of 0.9 in order to run the demo locally on my machine.
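A tolerance-based break might look like this sketch; step_fn stands in for one ADMM update returning the current objective value, and is an assumption rather than the library's API:
import math

def run_admm(step_fn, max_iter=15, tol=0.01):
    """Run up to max_iter ADMM updates, stopping early once the relative
    change in the objective drops below tol (e.g. 0.01 for the <1% changes
    observed from the fifth iteration onward)."""
    prev = math.inf
    objective = prev
    for _ in range(max_iter):
        objective = step_fn()
        if math.isfinite(prev) and abs(prev - objective) <= tol * abs(prev):
            break
        prev = objective
    return objective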
N_PARALLEL_PROCESSES = 1 #@param
TRAIN_SIZE_FRACTION = 0.1 #@param
N_TOPICS = 3 #@param
DIFFERENCE_PENALTY = 250 #@param
RETRAIN_MODEL = True #@param
logger = logging.getLogger()
logger.setLevel(logging.INFO)
N_IMMUNE_CLUSTERS = 12
with open(PATH_TO_PATIENT_DFS_PKL, 'rb') as f:
patient_dfs = pickle.load(f)
for patient_id in patient_dfs.keys():
df = patient_dfs[patient_id]
df['combined_cluster_id'] = (df['immune_cluster'].fillna(0) +
(df.cluster_id + N_IMMUNE_CLUSTERS).fillna(0))
df.loc[df['combined_cluster_id'] == 0, 'combined_cluster_id'] = None
df.loc[:, 'is_tumor'] = ~df['isimmune']
patient_dfs[patient_id] = df
compartmentalized_tumors = [3, 4, 5, 6, 9, 10, 16, 28, 32, 35, 36, 37, 40, 41]
noncompartmentalized_tumors = [x for x in patient_dfs.keys()
if x not in compartmentalized_tumors]
immune_cluster_names={8:'Treg',6:'MF/Glia',1:'B',4:'CD11c-high',7:'NK',0:'CD4T',
2:'DC',3:'CD8T',10:'MF',11:'Neutrophil',9:'Other', 5:'MF'}
other_cluster_names={0:'Epithelial',2:'Tumor/Keratin',3:'Tumor/EGFR',
4:'Endothelial/Vim',1:'Mesenchymal/SMA'}
markers = patient_dfs[1].columns[2:44]
num_patients = 41
if os.path.exists(PATH_TO_TUMOR_MARKER_FEATURES_PKL):
with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'rb') as f:
tumor_marker_features = pickle.load(f)
else:
neighborhood_feature_fn = functools.partial(neighborhood_to_marker,
markers=markers)
tumor_marker_features = featurize_tumors(patient_dfs,
neighborhood_feature_fn,
radius=75,
n_processes=N_PARALLEL_PROCESSES)
with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'wb') as f:
pickle.dump(tumor_marker_features, f)
# Subsample tumor cells
all_tumor_idxs = tumor_marker_features.index.map(lambda x: x[0])
_sets = train_test_split(tumor_marker_features,
test_size=1. - TRAIN_SIZE_FRACTION,
stratify=all_tumor_idxs)
train_tumor_marker_features, test_tumor_marker_features = _sets
train_difference_matrices = make_merged_difference_matrices(
train_tumor_marker_features, patient_dfs, 'x', 'y')
tumor_idxs = train_tumor_marker_features.index.map(lambda x: x[0])
path_to_train_model = '_'.join((f'{PATH_TO_MODELS}/tnbc_training',
f'penalty={DIFFERENCE_PENALTY}',
f'topics={N_TOPICS}',
f'trainfrac={TRAIN_SIZE_FRACTION}')) + '.pkl'
if os.path.exists(path_to_train_model) and not RETRAIN_MODEL:
print('Loading existing model and inferred topics.')
with open(path_to_train_model, 'rb') as f:
spatial_lda_model = pickle.load(f)
else:
spatial_lda_model = spatial_lda.spatial_lda.model.train(train_tumor_marker_features,
train_difference_matrices,
n_topics=N_TOPICS,
difference_penalty=DIFFERENCE_PENALTY,
verbosity=1,
n_parallel_processes=N_PARALLEL_PROCESSES,
n_iters=1,
admm_rho=0.1,
primal_dual_mu=2)
with open(path_to_train_model, 'wb') as f:
pickle.dump(spatial_lda_model, f)
The online_lda.py module tries to import logsumexp from sklearn, but this import is outdated in recent scikit-learn versions and throws an ImportError. Importing it from scipy.special works instead.
from sklearn.utils.fixes import logsumexp
ImportError: cannot import name 'logsumexp'
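The replacement that worked for me; scipy.special has provided logsumexp for a long time, so this should be a drop-in fix:
from scipy.special import logsumexp  # replaces the sklearn.utils.fixes import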
I'd like to run spatial-LDA on all the cells in my dataset. I tried to set all cells as index cells by setting the is_anchor_col argument of the featurize_samples function to a column where all values are True, but that produces an error.
if os.path.exists(PATH_TO_TUMOR_MARKER_FEATURES_PKL):
with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'rb') as f:
tumor_marker_features = pickle.load(f)
else:
neighborhood_feature_fn = functools.partial(neighborhood_to_marker,
markers=markers)
tumor_marker_features = featurize_samples(patient_dfs, neighborhood_feature_fn, 100, 'is_index',
'x', 'y', n_processes=N_PARALLEL_PROCESSES)
with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'wb') as f:
pickle.dump(tumor_marker_features, f)
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/spatial_lda-0.0.3-py3.7.egg/spatial_lda/featurization.py", line 51, in _featurize_sample
is_anchor_col, x_col, y_col, z_col=z_col)
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/spatial_lda-0.0.3-py3.7.egg/spatial_lda/featurization.py", line 38, in _featurize_cells
neighborhood_kdTree = KDTree(neighborhood_cells[coord_cols].values)
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/scipy/spatial/kdtree.py", line 239, in init
self.maxes = np.amax(self.data,axis=0)
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2505, in amax
initial=initial)
File "/home/kidzik/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input> in <module>
7
8 tumor_marker_features = featurize_samples(patient_dfs, neighborhood_feature_fn, 100, 'is_index',
----> 9 'x', 'y', n_processes=N_PARALLEL_PROCESSES)
10 with open(PATH_TO_TUMOR_MARKER_FEATURES_PKL, 'wb') as f:
11 pickle.dump(tumor_marker_features, f)
~/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/spatial_lda-0.0.3-py3.7.egg/spatial_lda/featurization.py in featurize_samples(sample_dfs, neighborhood_feature_fn, radius, is_anchor_col, x_col, y_col, z_col, n_processes)
87 all_sample_features = list(tqdm(pool.imap(featurize_sample_fn,
88 sample_dfs.items()),
---> 89 total=total))
90 else:
91 for i, sample_df in sample_dfs.items():
~/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/tqdm/_tqdm_notebook.py in __iter__(self, *args, **kwargs)
221 def __iter__(self, *args, **kwargs):
222 try:
--> 223 for obj in super(tqdm_notebook, self).__iter__(*args, **kwargs):
224 # return super(tqdm...) will not catch exception
225 yield obj
~/miniconda3/envs/spatial-lda/lib/python3.7/site-packages/tqdm/_tqdm.py in __iter__(self)
1003 """), fp_write=getattr(self.fp, 'write', sys.stderr.write))
1004
-> 1005 for obj in iterable:
1006 yield obj
1007 # Update and possibly print the progressbar.
~/miniconda3/envs/spatial-lda/lib/python3.7/multiprocessing/pool.py in next(self, timeout)
746 if success:
747 return value
--> 748 raise value
749
750 __next__ = next # XXX
ValueError: zero-size array to reduction operation maximum which has no identity
I think it's because neighborhood_cells becomes an empty array if I set all cells as index cells.
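For what it's worth, here is a minimal reproduction of that failure mode, assuming the neighborhood KDTree is built only from the non-anchor cells (so the coordinate array is empty when every cell is an anchor):
import numpy as np
from scipy.spatial import KDTree

neighborhood_coords = np.empty((0, 2))  # no non-anchor cells remain
KDTree(neighborhood_coords)  # ValueError: zero-size array to reduction operation maximum ...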
Can you kindly advise on how to run the pipeline on all the cells, please?
Thanks a lot!
Magda
In featurization, featurize_samples takes x_col, y_col, and z_col as inputs; make_merged_difference_matrices, however, does not.
Please change make_merged_difference_matrices to accept and use x_col and y_col arguments.
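A hypothetical sketch of the requested signature; the defaults are assumed to match the current hard-coded behavior:
def make_merged_difference_matrices(sample_features, sample_dfs,
                                    x_col='x', y_col='y', z_col=None):
    # Build the difference matrices from the given coordinate columns,
    # mirroring featurize_samples instead of hard-coding 'x' and 'y'.
    ...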
Both the "Mouse Spleen Analysis" and the "TNBC Analysis" require permission to be opened.
Probably should make those visible by link instead.
If this is the repo meant to be released, the URL in setup.py should point to this repository instead of the multi_imaging one.
Just wondering: will you share the code for your new neural spatial LDA paper?
Thank you.
Hi,
I wanted to fit the spatial LDA model on ~1.6 million cells (using all cells except ~50000 tumor cells as anchor cells), but I ran into some issues when training the model. Here is a snippet of the code:
spatial_lda_models = {}
difference_penalty = 0.25
N_TOPICS_LIST = [3, 4, 5, 6, 7, 8, 9, 10, 11]
for n in N_TOPICS_LIST:
    path_to_train_model = '_'.join((f'spatialLDA_model/',
                                    f'penalty = {difference_penalty}',
                                    f'topics={n}',
                                    f'trainfrac=0.9')) + '.pkl'
    print(f'Running n_topics = {n}, d = {difference_penalty}\n')
    spatial_lda_model = spatial_lda.model.train(sample_features=train_hodgkin_features,
                                                difference_matrices=train_difference_matrices,
                                                difference_penalty=difference_penalty,
                                                n_topics=n,
                                                n_parallel_processes=20,
                                                verbosity=1,
                                                admm_rho=0.1,
                                                primal_dual_mu=10)
    spatial_lda_models[n] = spatial_lda_model
    with open(path_to_train_model, 'wb') as f:
        pickle.dump(spatial_lda_model, f)
order_topics_consistently(spatial_lda_models.values())
When training the model, I got the exception: stopping in admm.newton_regularized_dirichlet. From the traceback, it seems this is caused by the hyperparameter primal_dual_mu. I tried training the model on a subset of the data with the same hyperparameters and the code ran successfully. Can you provide some insight into the issue I am facing? Could the large sample size combined with an improper hyperparameter choice be making the optimization algorithm diverge?
The online_lda module uses np.float at spatial_lda/spatial_lda/online_lda.py (line 31 in de6b00e), which raises an AttributeError with numpy 1.24 when importing.
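For context, np.float was deprecated in NumPy 1.20 and removed in 1.24; the builtin float (or np.float64 for an explicit width) is the drop-in replacement:
import numpy as np

arr = np.zeros(3, dtype=float)          # works on all NumPy versions
arr64 = np.zeros(3, dtype=np.float64)   # explicit 64-bit, equivalent to the old np.float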