A general purpose Snakemake workflow to perform unsupervised analyses (dimensionality reduction & cluster analysis) and visualizations of high-dimensional data.
Look for clustering benchmark datasets (from various domains) to test the approach, and put the results into the documentation.
→ Clustering benchmark papers
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes 1988
Determine metrics at every iteration and plot the time course at the end: at least for the stopping criterion (max. edge weight), but possibly also for F1 score, accuracy, etc.
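As a minimal sketch of this idea (metric names, values, and file names here are hypothetical placeholders, not the pipeline's actual outputs), one could record the metrics at each clustification iteration and render the time course once at the end:

```python
import matplotlib
matplotlib.use("Agg")  # render headlessly, e.g., on a cluster node
import matplotlib.pyplot as plt

# one list of values per tracked metric (names are hypothetical)
history = {"max_edge_weight": [], "f1": [], "accuracy": []}

def record(iteration_metrics):
    """Append this iteration's metric values to the history."""
    for name, value in iteration_metrics.items():
        history[name].append(value)

def plot_time_course(history, path="clustification_metrics.png"):
    """Plot every recorded metric against the iteration number."""
    fig, ax = plt.subplots()
    for name, values in history.items():
        ax.plot(range(1, len(values) + 1), values, marker="o", label=name)
    ax.set_xlabel("iteration")
    ax.set_ylabel("metric value")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Recording inside the loop keeps the overhead per iteration negligible and defers all plotting to a single figure at the end.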
The features_to_plot option "all" plots all features, with a warning that this is only useful for relatively low-dimensional data.
As a safety measure, define a cap (e.g., <50 dimensions) and plot at most that many features.
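A hypothetical sketch of that safety cap (the "all" option and the limit of 50 come from the note above; the helper function itself is not part of the pipeline):

```python
import warnings

MAX_FEATURES = 50  # hypothetical safety cap on how many features to plot

def resolve_features_to_plot(features_to_plot, all_features):
    """Expand the features_to_plot option; "all" means every feature,
    capped at MAX_FEATURES with a warning."""
    if features_to_plot == "all":
        if len(all_features) > MAX_FEATURES:
            warnings.warn(
                f"'all' requested {len(all_features)} features; plotting only "
                f"the first {MAX_FEATURES} (only useful for low-dimensional data)"
            )
            return list(all_features)[:MAX_FEATURES]
        return list(all_features)
    return list(features_to_plot)
```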
What's the best way to directly visualize marker signature expression in the unsupervised pipeline, e.g., to get a feeling for cell type assignment? My approach would have been to precalculate the scores and add them to the metadata. Is that sensible/intended?
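A minimal sketch of the precalculation approach described in the question, assuming a cells-by-genes expression matrix and using mean marker expression as a simple (hypothetical) scoring scheme; the function and column names are made up for illustration:

```python
import pandas as pd

def add_signature_scores(expr: pd.DataFrame, meta: pd.DataFrame,
                         marker_sets: dict) -> pd.DataFrame:
    """Add one numeric metadata column per marker set: the mean
    normalized expression of its genes (markers missing from the
    matrix are silently skipped)."""
    meta = meta.copy()
    for name, genes in marker_sets.items():
        present = [g for g in genes if g in expr.columns]
        meta[name] = expr[present].mean(axis=1)
    return meta
```

The augmented metadata CSV can then be passed to the pipeline like any other metadata, so the scores show up in the standard metadata plots.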
Summary:
Clustering stability could be assessed by running multiple clusterings, each on a random 90% subsample of the data.
A consensus approach could then be used to extract a stable clustering.
Drawbacks:
Computationally expensive
Open questions:
How should this be combined with clustification? Should it only be used to evaluate the stability of one approach, or run automatically to generate a consensus clustering?
Background:
Daria implemented a resampling strategy because she noticed that adding two new samples completely changed her previous clustering. She ended up finding gene programs by identifying genes that are stably co-differential across clusters, and then derived hard clusters by using thresholds to assign cells to one or multiple cluster labels.
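The resampling idea above can be sketched as a co-association (consensus) matrix: cluster repeatedly on random 90% subsamples and record how often each pair of observations lands in the same cluster. KMeans is used here purely as a stand-in clusterer, not as what Daria or the pipeline actually uses:

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, n_runs=20, frac=0.9, k=3, seed=0):
    """Fraction of subsampled runs in which each pair of observations
    was clustered together (sketch only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))  # times a pair was co-clustered
    counted = np.zeros((n, n))   # times a pair was co-sampled
    for _ in range(n_runs):
        # cluster a random frac (here 90%) subsample of the observations
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=int(rng.integers(1_000_000))).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        counted[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    # consensus = co-clustered / co-sampled; 0 where a pair never co-occurred
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counted > 0, together / counted, 0.0)
```

Thresholding this matrix (e.g., keeping pairs with consensus above 0.9) then yields stable clusters, analogous to the threshold-based hard assignment described above. The drawback listed earlier applies directly: this costs n_runs full clusterings.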
I adapted the pipeline for use with a scRNAseq dataset with ~50000 cells. The input for PCA is the normalized expression matrix.
! Disclaimer: I edited the pipeline, but the PCA is at the very beginning, so this should be the same. However, I may have messed something up. !
The PCA job gets stuck on "Activating conda environment: ..." for hours and then fails. It worked fine for a dataset of 500 cells.
Here is an example of a log file:
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: disk_mb=18545, disk_mib=17686
Select jobs to execute...
[Wed Oct 18 08:59:38 2023]
rule pca:
input: /data/path/normalized_expr_data.csv, /data/path/obs_metadata.csv
output: /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_object.pickle, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_var.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_axes.csv
log: logs/rules/PCA_sample_name_default_2.log
jobid: 0
reason: Missing output files: /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_var.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_axes.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_object.pickle
wildcards: sample=sample_name, parameters=default_2
threads: 2
resources: mem_mb=400000, disk_mb=18545, disk_mib=17686, tmpdir=/tmp
python -c "from __future__ import print_function; import sys, json; print(json.dumps([sys.version_info.major, sys.version_info.minor]))"
Activating conda environment: /path/to/conda/env/snakemake_conda/40431c23e3640492480b1b2b0c8d33df_
python /path/to/pipeline/.snakemake/scripts/tmp83o38501.pca.py
Activating conda environment: /path/to/conda/env/snakemake_conda/40431c23e3640492480b1b2b0c8d33df_
slurmstepd: error: *** JOB 4197138 ON d015 CANCELLED AT 2023-10-18T09:49:43 ***
For large data (the threshold for "too large" still needs to be defined), do not draw heatmaps of the features and data themselves; instead, compute distance matrices with the configured metric (e.g., correlation) and show those.
Do not plot at all if either observations or dimensions exceed, e.g., 50k.
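A minimal sketch of the distance-matrix alternative, assuming observations as rows and the metric name coming from the config (function and file names are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist, squareform

def distance_heatmap(X, metric="correlation", path="distance_heatmap.png"):
    """Plot the observation-by-observation distance matrix instead of a
    full features-x-observations heatmap; the matrix is n x n regardless
    of how many features the data has."""
    D = squareform(pdist(X, metric=metric))
    fig, ax = plt.subplots()
    im = ax.imshow(D, cmap="viridis")
    fig.colorbar(im, ax=ax, label=f"{metric} distance")
    fig.savefig(path)
    plt.close(fig)
    return D
```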
Use Heatgraphy, a new visualization package for multi-dimensional data.
Visualize as a heatmap with and without metadata (scaled by scores):
- clusterings (and/or metadata) as rows, ordered by MCDM ranking
- indices as columns, hierarchically clustered
Config file config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: disk_mb=14965
Select jobs to execute...
Dask: Dask is a parallel computing library that integrates with pandas, NumPy, and scikit-learn. It can handle larger-than-memory datasets and can distribute the computation across multiple cores or even multiple machines.
import dask.array as da
import dask.dataframe as dd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load data with dask (dask's read_csv does not support index_col,
# so set the index after reading)
ddata = dd.read_csv(data_path)
ddata = ddata.set_index(ddata.columns[0])

# convert to dask array
data_array = ddata.to_dask_array(lengths=True)

# standardize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_array)

# PCA transformation (note: scikit-learn's PCA materializes the data in
# memory; for truly out-of-core PCA, dask_ml.decomposition.PCA can be used)
pca_obj = PCA(n_components=None, random_state=42)
data_pca = pca_obj.fit_transform(data_scaled)
Which parametrization should be used? Maybe describe a two-step procedure in the user manual: first find one representation/parameter set per dataset and method, then apply the meta-visualization.
Traceback (most recent call last):
File "path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpgpjrocx8.plot_umap_diagnostics.py", line 41, in
umap.plot.diagnostic(umap_obj, diagnostic_type='neighborhood', nhood_size=min(umap_obj.n_neighbors, 15), ax=ax_diag[1,1])
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py", line 1124, in diagnostic
accuracy = nhood_compare(
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01/lib/python3.9/site-packages/numba/core/dispatcher.py", line 468, in compile_for_args
error_rewrite(e, 'typing')
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01/lib/python3.9/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function intersect1d at 0x1555481dcca0>) found for signature:
Of which 2 did not match due to:
Overload in function 'jit_np_intersect1d': File: numba/np/arraymath.py: Line 3586.
With argument(s): '(array(int32, 1d, C), array(int32, 1d, C), assume_unique=bool)':
Rejected as the implementation raised a specific error:
TypingError: got an unexpected keyword argument 'assume_unique'
raised from /path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/numba/core/typing/templates.py:784
During: resolving callee type: Function(<function intersect1d at 0x1555481dcca0>)
During: typing of call at /path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py (209)
File "../../../../../../path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py", line 209:
def _nhood_compare(indices_left, indices_right):
Activating conda environment: ../../../../../../path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_
Error in if (all(metadata[[col]] == round(metadata[[col]]))) { :
missing value where TRUE/FALSE needed
Execution halted
Not cleaning up path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpa5ioybfx.plot_2d.R
[Thu Feb 29 10:43:21 2024]
Error in rule plot_dimred_metadata:
jobid: 0
output: path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/unsupervised_analysis/subset_id/UMAP/plots/UMAP_euclidean_15_0.1_2_metadata.png
log: logs/rules/plot_metadata_subset_id_UMAP_euclidean_15_0.1_2.log (check log file(s) for error message)
conda-env: /path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_
RuleException:
CalledProcessError in line 69 of path/to/projects/project/modules/unsupervised_analysis/workflow/rules/visualization.smk:
Command 'source /path/to/miniconda3/bin/activate '/path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_'; set -eo pipefail; Rscript --vanilla path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpa5ioybfx.plot_2d.R' returned non-zero exit status 1.
File "path/to/projects/project/modules/unsupervised_analysis/workflow/rules/visualization.smk", line 69, in __rule_plot_dimred_metadata
File "/path/to/miniconda3/envs/snakemake7_15_2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The metadata file is a csv with multiple categorical columns but also numerical columns like gene_module_scores. One of those is indicated as metadata_of_interest: ["sampleid__donor"] in the config.
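The R error above ("missing value where TRUE/FALSE needed" inside `all(metadata[[col]] == round(metadata[[col]]))`) is the classic symptom of an NA in a numeric metadata column: the comparison then yields NA, and an `if` on NA aborts with exactly this message. A quick, hypothetical pandas pre-check of the metadata file:

```python
import pandas as pd

def columns_with_missing(meta: pd.DataFrame) -> list:
    """Names of metadata columns containing NA values; in R,
    all(x == round(x)) evaluates to NA for such columns, so the
    integer-detection `if` in the plotting script fails."""
    return [c for c in meta.columns if meta[c].isna().any()]
```

If any numeric column (such as the gene module scores) shows up here, filling or dropping the NAs before running the pipeline should avoid the crash.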
The only other thing I changed compared to the example config (apart from paths, of course) is sample_proportion: 0.3, to increase iteration speed.
Thank you for your help and the amazing pipelines!