Git Product home page Git Product logo

epigen / unsupervised_analysis Goto Github PK

View Code? Open in Web Editor NEW
12.0 6.0 0.0 57.33 MB

A general purpose Snakemake workflow to perform unsupervised analyses (dimensionality reduction & cluster analysis) and visualizations of high-dimensional data.

License: MIT License

Python 70.31% R 29.69%
data-science high-dimensional-data snakemake workflow unsupervised-learning principal-component-analysis umap pca visualization clustering

unsupervised_analysis's Introduction

DOI

Unsupervised Analysis Workflow

A general purpose Snakemake workflow to perform unsupervised analyses (dimensionality reduction and cluster analysis) and visualizations of high-dimensional data.

This workflow adheres to the module specifications of MR.PARETO, an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository.

If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI 10.5281/zenodo.8405360.

Workflow Rulegraph

Table of contents

Authors

Software

This project wouldn't be possible without the following software and their dependencies

Software Reference (DOI)
clusterCrit https://CRAN.R-project.org/package=clusterCrit
clustree https://doi.org/10.1093/gigascience/giy083
ComplexHeatmap https://doi.org/10.1093/bioinformatics/btw313
densMAP https://doi.org/10.1038/s41587-020-00801-7
ggally https://CRAN.R-project.org/package=GGally
ggplot2 https://ggplot2.tidyverse.org/
ggrepel https://CRAN.R-project.org/package=ggrepel
igraph https://doi.org/10.5281/zenodo.3630268
leidenalg https://doi.org/10.5281/zenodo.1469356
pandas https://doi.org/10.5281/zenodo.3509134
patchwork https://CRAN.R-project.org/package=patchwork
PCA https://doi.org/10.1080/14786440109462720
plotly express https://plot.ly
pymcdm https://doi.org/10.1016/j.softx.2023.101368
scikit-learn http://jmlr.org/papers/v12/pedregosa11a.html
scipy https://doi.org/10.1038/s41592-019-0686-2
Snakemake https://doi.org/10.12688/f1000research.29032.2
umap-learn https://doi.org/10.21105/joss.00861

Methods

This is a template for the Methods section of a scientific publication and is intended to serve as a starting point. Only retain paragraphs relevant to your analysis. References [ref] to the respective publications are curated in the software table above. Versions (ver) have to be read out from the respective conda environment specifications (.yaml file) or post execution. Parameters that have to be adapted depending on the data or workflow configurations are denoted in squared brackets e.g. [X].

The outlined analyses were performed using the programming languages R (ver) [ref] and Python (ver) [ref] unless stated otherwise. We applied both linear and non-linear unsupervised analysis methods for dimensionality reduction on normalized data for downstream analyses (e.g., clustering) and to visualize emerging patterns in lower dimensional embeddings.

Dimensionality Reduction

Principal Component Analysis (PCA) We used Principal Component Analysis (PCA) [ref] from scikit-learn (ver) [ref] as the linear approach. We visualized [n_components] principal components and kept [X/all] components for downstream analyses. For diagnostic purposes we visualized the variance explained of all and the top 10% of principal components (PCs) using elbow- and cumulative-variance-plots, sequential pair-wise PCs for up to 10 PCs using scatter-, and density-plots (colored by [metadata_of_interest]), and finally loadings plots showing the magnitude and direction of the 10 most influential features for each PC combination. The R packages ggally (ver) [ref] and ggrepel (ver) [ref] were used to improve the diagnostic visualizations.

Uniform Manifold Approximation and Projection (UMAP) Uniform Manifold Approximation projection (UMAP) from umap-learn (ver) [ref] was used as the non-linear approach. The metric [metric] and number of neighbors [n_neighbors] were used for the generation of a shared k-nearest-neighbor graph. The graph was embedded in [n_components] dimensions with minimum distance parameter [min_dist]. (Optional) We used the density preserving regularization option, densMAP [ref], during the embedding step, with default parameters to account for varying local density of the data within its original high dimensional space.

Hierarchically Clustered Heatmap Hierarchically clustered heatmaps of scaled data (z-score) were generated using the R package ComplexHeatmap (ver) [ref]. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.

Visualization The R-packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata], feature(s) [features_to_plot], and/or clustering results. Interactive visualizations in self-contained HTML files of all 2D and 3D projections/embeddings were generated using plotly express (ver) [ref].

Cluster Analysis

Leiden Clustering We applied the Leiden algorithm (ver) [ref] to the UMAP KNN graphs specified by the respective parameters (metric, n_neighbors). The adjacency matrix of the KNN graph was converted to a weighted undirected graph using igraph (ver) [ref]. The Leiden algorithm was then applied to this graph, using the specified partition type [partition_types], resolution [resolutions], and number of iterations [n_iterations]. All clustering results were visualized as described above as 2D and interactive 2D and 3D plots for all available embedings/projections.

Clustification Approach We developed/employed an iterative clustering approach, termed Clustification, that merges clusters based on misclassification. The method was initialized with the clustering result that had the highest resolution (i.e., the most clusters). We then performed iterative classification using the cluster labels, to determine if the classifier can distinguish between clusters or if they should be merged. This involved a stratified 5-fold cross-validation and a Random Forest classifier with default parameters (e.g., 100 trees). The predicted labels were retained for each iteration. Clusters were merged based on a normalized confusion matrix built using the predicted labels. This matrix was made symmetric and upper triangular, resulting in a similarity graph, such that each edge weight ranges from 0 to 1, where 0 means that the classifier was able to distinguish all observations between the two respective clusters. The stopping criterion was set such that if the maximum edge weight was less than 2.5% (i.e., 0.025 – less than 5% of observations are misclassified between any two clusters), the process would stop and return the current cluster labels. Otherwise, the two clusters connected by the maximum edge weight were merged. This process was repeated until the stopping criterion was met.

Clustree Analysis & Visualization We performed cluster analysis and visualization using the Clustree package (ver) [ref] with the parameters [count_filter], [prop_filter], and [layout]. The default analysis produced a standard Clustree visualization, ordered by the number of clusters and annotated accordingly. For the custom analysis, we extended the default behaviour by adding [metadata_of_interest] as additional "clusterings". Metadata and features, specified in the configuration, were highlighted on top of the clusterings using aggregation functions. For numerical data, we used the [numerical_aggregation_option] function , and for categorical data, we used the [categorical_label_option] function.

Cluster Validation - External Indices We validated/analyzed the clustering results by comparing them with all categorical metadata using external cluster indices. The complementary indices used were Adjusted Mutual Information (AMI) [ref], Adjusted Rand Index (ARI) [ref], Fowlkes-Mallows Index (FMI) [ref], Homogeneity, Completeness, and V-Measure [ref] from scikit-learn (ver) [ref]. The indices were calculated for each clustering result and each categorical metadata, and visualized using hierarchically clustered heatmaps.

Cluster Validation - Internal Indices & MCDM using TOPSIS We performed internal cluster validation using six complementary indices: Silhouette, Calinski-Harabasz, C-index, Dunn index, Davis-Bouldin Score from the clusterCrit package (ver) [ref], and a weighted Bayesian Information Criterion (BIC) approach as described in Reichl 2018 - Chapter 4.2.2 - Internal Indices. Due to computational cost, PCA results representing 90% of variance explained were used as input, and only a random sample proportion of [sample_proportion] was used. These internal cluster indices are linear, using Euclidean distance metrics. To rank all clustering results and [metadata_of_interest] from best to worst, we applied the Multiple-criteria decision-making (MCDM) method TOPSIS from the the Python package pymcdm (ver) [ref] to the internal cluster indices, as described in Reichl 2018 - Chapter 4.3.1 - The Favorite Approach.

The analysis and visualizations described here were performed using a publicly available Snakemake [ver] (ref) workflow 10.5281/zenodo.8405360.

Features

The workflow perfroms the following analyses on each dataset provided in the annotation file. A result folder "unsupervised_analysis" is generated containing a folder for each dataset.

Dimensionality Reduction

  • Principal Component Anlaysis (PCA) keeping all components (.pickle and .CSV)
    • diagnostics (.PNG):
      • variance: scree-plot and cumulative explained variance-plot of all and top 10% principal components
      • pairs: sequential pair-wise PCs for up to 10 PCs using scatter- and density-plots colored by [metadata_of_interest]
      • loadings: showing the magnitude and direction of the 10 most influential features for each PC combination
  • Uniform Manifold Approximation & Projection (UMAP)
    • k-nearest-neighbor graph (.pickle): generated using the [n_neighbors] parameter together with the provided [metrics].
      • fix any pickle load issue by specifying Python version to 3.9 (in case you want to use the graph downstream)
    • low dimensional embedding (.pickle and .CSV): using the precomputed-knn graph from before, embeddings are parametrized using [min_dist] and [n_components]
    • densMAP (optional): local density preserving regularization as additional dimensionality reduction method (i.e., all UMAP parameter combinations and downstream visualizations apply)
    • diagnostics (.PNG): 2D embedding colored by PCA coordinates, vector quantization coordinates, approximated local dimension, neighborhood Jaccard index
    • connectivity (.PNG): graph/network-connectivity plot with edge-bundling (hammer algorithm variant)
  • Hierarchically Clustered Heatmap (.PNG)
    • hierarchically clustered heatmaps of scaled data (z-score) with configured distance [metrics] and clustering methods ([hclust_methods]). All combinations are computed, and annotated with [metadata_of_interest].
  • Visualization
    • 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, respectively.
    • interactive 2D and 3D visualizations as self contained HTML files of all projections/embeddings.
  • Results directories for each dataset have the following structure:
    • "method" (containing all the data as .pickle and/or .CSV files)
    • plots (for all visualizations as .PNG files)

Cluster Analysis

"The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage." from Algorithms for Clustering Data (1988) by Jain & Dubes

  • Clustering
    • Leiden algorithm
      • Applied to the UMAP KNN graphs specified by the respective parameters (metric, n_neighbors).
      • All algorithm specific parameters are supported: [partition_types], [resolutions], and [n_iterations].
    • Clustification: an ML-based clustering approach that iteratively merges clusters based on misclassification 0. User: Specify a clustering method [method].
      1. Chose the clustering with the most clusters as starting point (i.e., overclustered).
      2. Iterative classification using the cluster labels, to determine if the classifier can distinguish between clusters or if they should be merged.
        • Stratified 5-fold CV
        • RF with 100 trees (i.e., defaults)
        • Retain predicted labels
      3. Merging of clusters.
        • Build a normalized confusion matrix using the predicted labels.
        • Make it symmetric and upper triangle, resulting in a similarity graph.
        • Edge weight ranges from 0 to 1, where 0 means that the classifier was able to distinguish all observations between the two respective clusters.
        • Check stopping criterion: if maximum edge weight < 2.5% (i.e., 0.025 – less than 5% of observations are misclassified between any two clusters).
          • -> STOP and return current cluster labels
        • Merge the two clusters connected by the maximum edge weight.
      4. Back to 2. using the new labels.
  • Clustree analysis and visualization
    • The following clustree specific parameters are supported: [count_filter], [prop_filter], and [layout].
    • default: produces the standard clustree visualization, ordered by number of clusters and annotated.
    • custom: extends default by adding [metadata_of_interest] as additional "clusterings".
    • metadata and features, specified in the config, are highlighted on top of the clusterings using aggregation functions
      • numeric: available aggregation functions: mean, median, max, min
      • categorical: available aggregation functions: "pure" or "majority"
  • Cluster Validation
    • External cluster indices are determined comparing all clustering results with all categorical metadata
    • Internal cluster indices are determined for each clustering and [metadata_of_interest]
      • 6 complementary indices are used
      • Due to the comutational cost PCA results, representing 90% of variance explained, are used for as input and a [sample_proportion] can be configured.
      • Caveat: internal cluster indices are linear i.e., using Euclidean distance metrics.
    • Multiple-criteria decision-making (MCDM) using TOPSIS for ranking clustering results
      • The MCDM method TOPSIS is applied to the internal cluster indices to rank all clustering results (and [metadata_of_interest]) from best to worst.
      • This approach has been described in Reichl 2018 - Chapter 4.3.1 - The Favorite Approach
      • Caveat: Silhouette score sometimes generates NA due to a known bug. Clusterings with NA scores are removed before TOPSIS is applied.
  • Visualization
    • all clustering results as 2D and interactive 2D & 3D plots for all available embedings/projections.
    • external cluster indices as hierarchically clustered heatmaps, aggregated in one panel.
    • internal cluster indices as one heatmap with clusterings (and [metadata_of_interest]) sorted by TOPSIS ranking from top to bottom and split cluster indices split by type (cost/benefit functions to be minimized/maximized).

Usage

Here are some tips for the usage of this workflow:

  • Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computational expensive and slow).
  • Heatmaps require a lot of memory, hence the memory allocation is solved dynamically based on retries. If a out-of-memory exception occurs the flag --retries X can be used to trigger automatic resubmission X time upon failure with X times the memory.
  • Clustification performance scales with available cores, i.e., more cores faster internal parallelization of RF training & testing.
  • Cluster indices are extremely compute intense and scale linearly with every additional clustering result and specified metadata.

Configuration

Detailed specifications can be found here ./config/README.md

Examples

We provide a minimal example of the analysis of the UCI ML hand-written digits datasets imported from sklearn in the test folder:

  • config
    • configuration: config/config.yaml
    • sample annotation: digits_unsupervised_analysis_annotation.csv
  • data
    • dataset (1797 observations, 64 features): digits_data.csv
    • metadata (consisting of the ground truth label "target"): digits_labels.csv
  • results will be generated in a subfolder .test/results/
  • performance: on an HPC it took less than 5 minutes to complete a full run (with up to 32GB of memory per task)

single-cell RNA sequencing (scRNA-seq) data analysis

Unsupervised analyses, dimensionality reduction and cluster analysis, are corner stones of scRNA-seq data analyses. A full run on a published scRNA-seq cancer dataset with 21,657 cells and 18,245 genes took 2.5 hours to complete (without heatmaps, with 32GB memory, with 8 cores for clustification, ). Below are configurations of the two most commonly used frameworks, scanpy (Python) and Seurat (R), and the original package's defaults as comparison and to facilitate reproducibility:

UMAP for dimensionality reduction

  • umap-learn
    • initialization: spectral
    • metric: Euclidean
    • neighbors: 15
    • min. distance: 0.1
  • scanpy
    • initialization: spectral
    • metric: Euclidean
    • neighbors: 15
    • min. distance: 0.5
  • Seurat
    • initialization: PCA
    • method: "uwot" (not umap-learn package)
    • metric: Cosine (or Correlation)
    • neighbors: 30
    • min. distance: 0.3

Leiden algorithm for clustering

  • leidenalg
    • no defaults
  • scanpy
    • input: batch balanced UMAP KNN graph
    • partition type: RBConfigurationVertexPartition
    • resolution: 1
  • Seurat
    • input: SNN graph
    • partition type: RBConfigurationVertexPartition
    • resolution: 0.8

Links

Resources

Publications

The following publications successfully used this module for their analyses.

  • ...

unsupervised_analysis's People

Contributors

sreichl avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

unsupervised_analysis's Issues

Resampling to evaluate clustering stability

Summary:
Clustering stability could be assessed by doing multiple clusterings, always randomly sampling 90% of the data.
Consensus approach could be used to extract stable clustering.

Drawbacks:
Computationally expensive

Open questions:
How to combine this with clustification? Should it be only used to evaluate stability of one approach, or automatically to generate consensus clustering?

Background:
Daria implemented a resampling strategy, because she noted that adding two new samples completely changed her previous clustering. She ended up finding gene programs by seeing which genes are stably co-differential in clusters, and then came up with hard clusters by using thresholds to assign cells to one or multiple cluster labels.

utilize metadata of interest config field as list for multiple analyses

to increase performance, users can indicate the metadata_of_interest categorical(!) columns for the following analysis

  • PCA diagnostic annotation (only 1st entry of list used)
  • Heatmap annotation (only 1st entry of list used)
  • clustree custom (remove respective field from clustree configs)
  • internal cluster validation (less clustering means faster computation as it scales linearly with the number of clustering)

snakemake找不到srcdir函数

I am a novice, I am currently using your library, but I have some problems, I do not know how to solve
Here are the wrong questions

Traceback (most recent call last):

File "/home/wang/miniconda3/envs/snakemake/lib/python3.12/site-packages/snakemake/cli.py", line 1893, in args_to_api

dag_api = workflow_api.dag(

          ^^^^^^^^^^^^^^^^^

File "/home/wang/miniconda3/envs/snakemake/lib/python3.12/site-packages/snakemake/api.py", line 326, in dag

return DAGApi(

       ^^^^^^^

File "", line 6, in init

File "/home/wang/miniconda3/envs/snakemake/lib/python3.12/site-packages/snakemake/api.py", line 436, in post_init

self.workflow_api._workflow.dag_settings = self.dag_settings

^^^^^^^^^^^^^^^^^^^^^^^^^^^

File "/home/wang/miniconda3/envs/snakemake/lib/python3.12/site-packages/snakemake/api.py", line 383, in _workflow

workflow.include(

File "/home/wang/miniconda3/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1379, in include

exec(compile(code, snakefile.get_path_or_uri(), "exec"), self.globals)

File "/home/wang/rna/unsupervised_analysis/workflow/Snakefile", line 10, in

SDIR = os.path.realpath(os.path.dirname(srcdir("Snakefile")))

                                                    ^^^^^^

NameError: name 'srcdir' is not defined

Benchmark clf-based clustering approach

look for clustering benchmark datasets (from various domains) to test the approach and put the result into the documentation)
→ Clustering benchmark papers

feature plot option "all"

features_to_plot option "all” plots all features with a warning that this is only useful with relatively low dimensional data.
provide a number eg <50 dimensions, and only plot that at maximum (as safety)

test & document

test

  • test with digits toy data
    • run time 00:04:46 w/ 32GB memory on HPC
  • test with scRNAseq data (with "ground truth") e.g., Lee2020NatGenet
    • run time 02:36:56 w/ 32GB memory on HPC (w/o heatmaps)
  • document speed of each from start to finish

document

  • config.yaml
  • README
    • note in the docs: fix any pickle (knn graph) issue by fixing Python version to 3.9!
    • used software and their DOIs
    • make section for scRNAseq analysis/bioinformatiocs related information like package defaults and linking there\
    • Features
      • clustering
        • Leiden clustering supporting all partition_type, resolution_parameter and n_iterations
        • clustification an ML-based parameter-free clustering approach
      • cluster validation
        • clustree
        • external cluster indices
        • internal cluster indices + MCDM TOPSIS ranking
      • Visualization
        • clustering results as separate 2D scatter plot panel across all available embeddings
        • Interactive visualization as categorical variables in 2D & 3D
        • cluster validation (internal and external) as clustered/ranked heatmaps
    • Methods analogous to Features)
    • adapt Example (config is in config/ and only data and annot is provided)

Error plot_dimred_metadata

Hi Stephan,

another error:

logs/logs_slurm/plot_dimred_metadata_method=UMAP,n_components=2,parameters=euclidean_15_0.1,sample=subset_id.err

rule plot_dimred_metadata:
input: path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/unsupervised_analysis/subset_id/UMAP/UMAP_euclidean_15_0.1_2_data.csv, path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/unsupervised_analysis/subset_id/UMAP/UMAP_euclidean_15_0.1_2_axes.csv, path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/subset_id/labels.csv, path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/unsupervised_analysis/subset_id/metadata_features.csv, path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/unsupervised_analysis/subset_id/metadata_clusterings.csv
output: path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/unsupervised_analysis/subset_id/UMAP/plots/UMAP_euclidean_15_0.1_2_metadata.png
log: logs/rules/plot_metadata_subset_id_UMAP_euclidean_15_0.1_2.log
jobid: 0
reason: Forced execution
wildcards: sample=subset_id, method=UMAP, parameters=euclidean_15_0.1, n_components=2
threads: 2
resources: mem_mb=128000, disk_mb=1000, tmpdir=/tmp

Activating conda environment: ../../../../../../path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_
Error in if (all(metadata[[col]] == round(metadata[[col]]))) { :
missing value where TRUE/FALSE needed
Execution halted
Not cleaning up path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpa5ioybfx.plot_2d.R
[Thu Feb 29 10:43:21 2024]
Error in rule plot_dimred_metadata:
jobid: 0
output: path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/unsupervised_analysis/subset_id/UMAP/plots/UMAP_euclidean_15_0.1_2_metadata.png
log: logs/rules/plot_metadata_subset_id_UMAP_euclidean_15_0.1_2.log (check log file(s) for error message)
conda-env: /path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_

RuleException:
CalledProcessErrorin line 69 of path/to/projects/project/modules/unsupervised_analysis/workflow/rules/visualization.smk:
Command 'source /path/to/miniconda3/bin/activate '/path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_'; set -eo pipefail; Rscript --vanilla path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpa5ioybfx.plot_2d.R' returned non-zero exit status 1.
File "path/to/projects/project/modules/unsupervised_analysis/workflow/rules/visualization.smk", line 69, in __rule_plot_dimred_metadata
File "/path/to/miniconda3/envs/snakemake7_15_2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

My sample sheet looks like this

name,data,metadata,samples_by_features
subset_name,/path/to/results/demultiplexing/first_batch_of_samples/scvi/X_scVI__subset_name.csv,/path/to/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/subset_name/labels.csv,1

My data file are the scVI coordinates

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
sID367_AAACAGCCAAGTTATC-1,0.8766483,0.011868089,-0.14010176,0.0021616258,-1.7530527,0.8266239,0.05426112,-0.29800522,-0.5499921,0.40404356,0.55395395,-0.011951547,0.070310116,0.1317476,0.09836078,-1.216123,-0.9669612,0.44787252,1.2134984,-1.6698662,-0.88864696,-0.31392187,-0.18586576,-1.0976224,-0.89937776,0.7491747,-0.39786023,-0.3978194,0.009200797,-0.009404337
sID367_AAACATGCAACTAGCC-1,1.2735096,-1.2301883,0.4899947,0.004316354,-0.4477949,-1.0801263,-0.41472286,-0.10565293,-1.0443822,0.1124156,0.71335185,0.01858944,0.069815695,1.8595614,0.9100859,-0.7941134,0.13103442,-0.38214165,0.01599136,0.719963,-1.0942267,0.033875763,0.5481672,-0.029896438,-1.0036578,-0.7464532,-0.04965532,0.33992022,0.016930878,0.016150381
sID367_AAACATGCACATGCTA-1,0.25002998,-0.112220734,-0.36979952,0.027052928,-0.23896998,-0.51691395,1.0869765,1.0108525,-0.81537515,0.71203756,-0.94883174,-0.014021037,0.07627165,-1.5595407,-0.6811844,-0.051620245,1.4360468,0.37079245,0.6642489,1.3201993,0.53024554,1.7682714,1.0888612,-0.40217578,-0.3562716,-0.63303614,0.22093366,0.09114313,0.008224259,-0.0056118146

The metadata file is a csv with multiple categorical columns but also numerical columns like gene_module_scores. One of those is indicated as metadata_of_interest: ["sampleid__donor"] in the config.

Only other thing I changed compared to example config (paths of course as well): sample_proportion: 0.3 to increase iteration speed.

Thank you for your help and the amazing pipelines!

Gene program/marker expression score as metadata

What's the best way to directly visualize marker signature expression, e.g., to get a feeling for cell type assignment, in the unsupervised pipeline? My approach would have been to precalculate them and add them to the metadata. Is that sensible/intended?

address slow heatmaps

define too large: e.g., >10,000 samples/cells?

ideas

improve data loading speed with Dask or NumPy

test it for e.g., pca.py

Dask: Dask is a parallel computing library that integrates with pandas, NumPy, and scikit-learn. It can handle larger-than-memory datasets and can distribute the computation across multiple cores or even multiple machines.

import dask.dataframe as dd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import dask.array as da

# load data with dask
ddata = dd.read_csv(data_path, index_col=0)

# convert to dask array
data_array = ddata.to_dask_array(lengths=True)

# standardize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_array)

# PCA transformation
pca_obj = PCA(n_components=None, random_state=42)
data_pca = pca_obj.fit_transform(data_scaled)

implement cluster validation using indices

  • structure this into coherent notes & subtasks
  • implement external clustering indices
  • visualize external indices
  • implement internal clustering indices
  • #21
  • visualize internal indices
  • investigate if there is a bug in metadata conversion to factor in validation_internal.R
  • create issue(s) for not implemented ideas
  • test on Lee2020NatGenet
  • test robsutness of cluster indices in different sample proportions 0.1, 0.25, 0.5, 0.75, 0.9, 1 on digits

internal cluster indices

  • Question to answer: Which clustering separates/groups "objectively" the data the best?
  • input
    • original data/distance matrix (which metric?!)
    • all clustering results
    • categorical metadata
  • output:
    • cluster_validation/internal_indices_clustering.csv/.png
    • cluster_validation/internal_indices_metadata.csv/.png
    • ?internal_indices_ALL.png (indices & metadata combined)
    • consider one plot consisting of a panel of three (differently sized) heatmaps
  • pseudocode
  1. determine scores for clusterings and categorical metadata (requires distance matrix)
  2. determine rank using MCDM/MCDA e.g., TOPSIS
    MCDM / MCDA python package with TOPSIS: https://pymcdm.readthedocs.io/en/master/index.html
    • check if better alternatives exist
  3. save ranked score matrix as CSV
  4. visualize as heatmap w/ & w/o metadata (scale by scores)
    - clusterings (and/or metadata) as rows are ordered by MCDM ranking
    - indices as columns are hierarchically clustered

external cluster indices

  • Question to answer: Which metadata explains the clustering best?
    • Benchmark scenario: Which clustering most resembles the ground truth?
  • input
    • all clustering results
    • categorical metadata
  • output:
    • cluster_validation/external_indices_{index}.csv
    • cluster_validation/external_indices.png
  • pseudocode
    1. determine scores for clusterings vs categorical metadata
    2. save the score matrix as CSV
    3. visualize each index as heatmap -> panel of heatmaps
    • clusterings as rows are hierarchically clustered
    • categorical metadata as columns are hierarchically clustered
  • indices
    • ARI & NMI
      # metrics
      from sklearn.metrics.cluster import adjusted_rand_score
      from sklearn.metrics.cluster import normalized_mutual_info_score
      
      # external cluster indices with ground truth
      adjusted_rand_score(labels, partition.membership)
      normalized_mutual_info_score(labels, partition.membership)

image

UMAP diagnostics bug - numba intersect1d

Hi Stephan, here the bug:

file: logs/logs_slurm/plot_umap_diagnostics_method=densMAP,parameters=euclidean_15_0.1_2,sample=cancer__primary.err
error:

Traceback (most recent call last):
File "path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpgpjrocx8.plot_umap_diagnostics.py", line 41, in
umap.plot.diagnostic(umap_obj, diagnostic_type='neighborhood', nhood_size=min(umap_obj.n_neighbors, 15), ax=ax_diag[1,1])
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py", line 1124, in diagnostic
accuracy = nhood_compare(
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01
/lib/python3.9/site-packages/numba/core/dispatcher.py", line 468, in compile_for_args
error_rewrite(e, 'typing')
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01
/lib/python3.9/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function intersect1d at 0x1555481dcca0>) found for signature:

intersect1d(array(int32, 1d, C), array(int32, 1d, C), assume_unique=Literalbool)

There are 2 candidate implementations:

  • Of which 2 did not match due to:
    Overload in function 'jit_np_intersect1d': File: numba/np/arraymath.py: Line 3586.
    With argument(s): '(array(int32, 1d, C), array(int32, 1d, C), assume_unique=bool)':
    Rejected as the implementation raised a specific error:
    TypingError: got an unexpected keyword argument 'assume_unique'
    raised from /path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/numba/core/typing/templates.py:784

During: resolving callee type: Function(<function intersect1d at 0x1555481dcca0>)
During: typing of call at /path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py (209)

File "../../../../../../path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py", line 209:
def _nhood_compare(indices_left, indices_right):

for i in range(indices_left.shape[0]): intersection_size = np.intersect1d(indices_left[i], indices_right[i], ^

My sample sheet looks like this

name,data,metadata,samples_by_features
subset_name,/path/to/results/demultiplexing/first_batch_of_samples/scvi/X_scVI__subset_name.csv,/path/to/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/subset_name/labels.csv,1

My data file are the scVI coordinates

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29
sID367_AAACAGCCAAGTTATC-1,0.8766483,0.011868089,-0.14010176,0.0021616258,-1.7530527,0.8266239,0.05426112,-0.29800522,-0.5499921,0.40404356,0.55395395,-0.011951547,0.070310116,0.1317476,0.09836078,-1.216123,-0.9669612,0.44787252,1.2134984,-1.6698662,-0.88864696,-0.31392187,-0.18586576,-1.0976224,-0.89937776,0.7491747,-0.39786023,-0.3978194,0.009200797,-0.009404337
sID367_AAACATGCAACTAGCC-1,1.2735096,-1.2301883,0.4899947,0.004316354,-0.4477949,-1.0801263,-0.41472286,-0.10565293,-1.0443822,0.1124156,0.71335185,0.01858944,0.069815695,1.8595614,0.9100859,-0.7941134,0.13103442,-0.38214165,0.01599136,0.719963,-1.0942267,0.033875763,0.5481672,-0.029896438,-1.0036578,-0.7464532,-0.04965532,0.33992022,0.016930878,0.016150381
sID367_AAACATGCACATGCTA-1,0.25002998,-0.112220734,-0.36979952,0.027052928,-0.23896998,-0.51691395,1.0869765,1.0108525,-0.81537515,0.71203756,-0.94883174,-0.014021037,0.07627165,-1.5595407,-0.6811844,-0.051620245,1.4360468,0.37079245,0.6642489,1.3201993,0.53024554,1.7682714,1.0888612,-0.40217578,-0.3562716,-0.63303614,0.22093366,0.09114313,0.008224259,-0.0056118146

The metadata file is a csv with multiple categorical columns but also numerical columns like gene_module_scores

Thank you for your help and the amazing pipelines!

investigate PCA OOM error with Lee2020NatGenet data

Config file config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: disk_mb=14965
Select jobs to execute...

[Thu Sep 7 20:54:30 2023]
rule plot_dimred_clustering:
input: /nobackup/lab_bock/projects/macroIC/results/Lee2020NatGenet/unsupervised_analysis/merged_NORMALIZED/PCA/PCA_default_2_data.csv, /nobackup/lab_bock/projects/macroIC/results/Lee2020NatGenet/unsupervised_analysis/merged_NORMALIZED/PCA/PCA_default_2_axes.csv, /nobackup/lab_bock/projects/macroIC/results/Lee2020NatGenet/unsupervised_analysis/merged_NORMALIZED/metadata_clusterings.csv
output: /nobackup/lab_bock/projects/macroIC/results/Lee2020NatGenet/unsupervised_analysis/merged_NORMALIZED/PCA/plots/PCA_default_2_clustering.png
log: logs/rules/plot_clustering_merged_NORMALIZED_PCA_default_2.log
jobid: 0
reason: Forced execution
wildcards: sample=merged_NORMALIZED, method=PCA, parameters=default, n_components=2
threads: 2
resources: mem_mb=32000, disk_mb=14965, tmpdir=/home/sreichl/tmp

Rscript --vanilla /home/sreichl/projects/unsupervised_analysis/.snakemake/scripts/tmpk0yahhlg.plot_2d.R
Activating conda environment: ../../../../nobackup/lab_bock/users/sreichl/snakemake_conda/3be8ec44298849ccd0fe471cb3096506_
slurmstepd: error: *** JOB 3737892 ON s003 CANCELLED AT 2023-09-07T21:07:26 ***

Visualize clustering results

  • should show up everywhere as metadata
  • 2D & 3D, like metadata plots -> utilize same scripts
  • What if multiple 2D embeddings per data set exist? → per parameter combination as with metadata
  • generate "new” metadata file that includes all (aggregates) clustering results or separate plots -> metadata_clusterings.csv
  • implement for PCA, UMAP, densMAP including checks if parameters overlap and if densMAP is activated/enabled etc.
  • interactive plots 2D/3D plots
  • address that plot in 2D/3D with lots of clusters could be interpreted as continuous value -> force to be categorical
    • 2d and 3d plot forced categorical when substring is “cluster” in colname -> document
  • refactorize the plotting logic if possible

add quote about clustering to documentation

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true be-elvers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes 1988

PCA fails for large dataset

I adapted the pipeline for use with a scRNAseq dataset with ~50000 cells. The input for PCA is the normalized expression matrix.
! Disclaimer: I edited the pipeline, but the PCA is at the very beginning, so this should be the same. However, I may have messed something up. !

The PCA job gets stuck on "Activating conda environment: ..." for hours and then fails. It worked fine for a dataset of 500 cells.
Here is an example of a log file:

Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: disk_mb=18545, disk_mib=17686
Select jobs to execute...

[Wed Oct 18 08:59:38 2023]
rule pca:
    input: /data/path/normalized_expr_data.csv, /data/path/obs_metadata.csv
    output: /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_object.pickle, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_var.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_axes.csv
    log: logs/rules/PCA_sample_name_default_2.log
    jobid: 0
    reason: Missing output files: /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_var.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_axes.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_object.pickle
    wildcards: sample=sample_name, parameters=default_2
    threads: 2
    resources: mem_mb=400000, disk_mb=18545, disk_mib=17686, tmpdir=/tmp

python -c "from __future__ import print_function; import sys, json; print(json.dumps([sys.version_info.major, sys.version_info.minor]))"
Activating conda environment: /path/to/conda/env/snakemake_conda/40431c23e3640492480b1b2b0c8d33df_
python /path/to/pipeline/.snakemake/scripts/tmp83o38501.pca.py
Activating conda environment: /path/to/conda/env/snakemake_conda/40431c23e3640492480b1b2b0c8d33df_
slurmstepd: error: *** JOB 4197138 ON d015 CANCELLED AT 2023-10-18T09:49:43 ***

The scheduler info doesn't point to an OOM issue.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.