A general purpose Snakemake workflow to perform unsupervised analyses (dimensionality reduction & cluster analysis) and visualizations of high-dimensional data.
Look for clustering benchmark datasets (from various domains) to test the approach, and put the results into the documentation.
→ Clustering benchmark papers
“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.”
Algorithms for Clustering Data, Jain and Dubes 1988
Determine metrics at every iteration and plot the time course at the end: at least for the stopping criterion (max. edge weight), but possibly also for F1 score, accuracy, etc.
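As a minimal sketch of this idea (metric names, values, and file names here are hypothetical placeholders, not the pipeline's actual outputs), one could record the metrics at each clustification iteration and render the time course once at the end:

```python
import matplotlib
matplotlib.use("Agg")  # render headlessly, e.g., on a cluster node
import matplotlib.pyplot as plt

# one list of values per tracked metric (names are hypothetical)
history = {"max_edge_weight": [], "f1": [], "accuracy": []}

def record(iteration_metrics):
    """Append this iteration's metric values to the history."""
    for name, value in iteration_metrics.items():
        history[name].append(value)

def plot_time_course(history, path="clustification_metrics.png"):
    """Plot every recorded metric against the iteration number."""
    fig, ax = plt.subplots()
    for name, values in history.items():
        ax.plot(range(1, len(values) + 1), values, marker="o", label=name)
    ax.set_xlabel("iteration")
    ax.set_ylabel("metric value")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)
```

Recording inside the loop keeps the overhead per iteration negligible and defers all plotting to a single figure at the end.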
The features_to_plot option "all" plots all features, with a warning that this is only useful for relatively low-dimensional data.
As a safety measure, define a cap (e.g., <50 dimensions) and plot at most that many features.
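A hypothetical sketch of that safety cap (the "all" option and the limit of 50 come from the note above; the helper function itself is not part of the pipeline):

```python
import warnings

MAX_FEATURES = 50  # hypothetical safety cap on how many features to plot

def resolve_features_to_plot(features_to_plot, all_features):
    """Expand the features_to_plot option; "all" means every feature,
    capped at MAX_FEATURES with a warning."""
    if features_to_plot == "all":
        if len(all_features) > MAX_FEATURES:
            warnings.warn(
                f"'all' requested {len(all_features)} features; plotting only "
                f"the first {MAX_FEATURES} (only useful for low-dimensional data)"
            )
            return list(all_features)[:MAX_FEATURES]
        return list(all_features)
    return list(features_to_plot)
```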
What's the best way to directly visualize marker signature expression in the unsupervised pipeline, e.g., to get a feeling for cell type assignment? My approach would have been to precalculate the scores and add them to the metadata. Is that sensible/intended?
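A minimal sketch of the precalculation approach described in the question, assuming a cells-by-genes expression matrix and using mean marker expression as a simple (hypothetical) scoring scheme; the function and column names are made up for illustration:

```python
import pandas as pd

def add_signature_scores(expr: pd.DataFrame, meta: pd.DataFrame,
                         marker_sets: dict) -> pd.DataFrame:
    """Add one numeric metadata column per marker set: the mean
    normalized expression of its genes (markers missing from the
    matrix are silently skipped)."""
    meta = meta.copy()
    for name, genes in marker_sets.items():
        present = [g for g in genes if g in expr.columns]
        meta[name] = expr[present].mean(axis=1)
    return meta
```

The augmented metadata CSV can then be passed to the pipeline like any other metadata, so the scores show up in the standard metadata plots.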
Summary:
Clustering stability could be assessed by running multiple clusterings, each on a random 90% subsample of the data.
A consensus approach could then be used to extract a stable clustering.
Drawbacks:
Computationally expensive
Open questions:
How should this be combined with clustification? Should it only be used to evaluate the stability of one approach, or run automatically to generate a consensus clustering?
Background:
Daria implemented a resampling strategy because she noticed that adding two new samples completely changed her previous clustering. She ended up finding gene programs by identifying genes that are stably co-differential across clusters, and then derived hard clusters by using thresholds to assign cells to one or multiple cluster labels.
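The resampling idea above can be sketched as a co-association (consensus) matrix: cluster repeatedly on random 90% subsamples and record how often each pair of observations lands in the same cluster. KMeans is used here purely as a stand-in clusterer, not as what Daria or the pipeline actually uses:

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, n_runs=20, frac=0.9, k=3, seed=0):
    """Fraction of subsampled runs in which each pair of observations
    was clustered together (sketch only)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    together = np.zeros((n, n))  # times a pair was co-clustered
    counted = np.zeros((n, n))   # times a pair was co-sampled
    for _ in range(n_runs):
        # cluster a random frac (here 90%) subsample of the observations
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=int(rng.integers(1_000_000))).fit_predict(X[idx])
        same = (labels[:, None] == labels[None, :]).astype(float)
        counted[np.ix_(idx, idx)] += 1.0
        together[np.ix_(idx, idx)] += same
    # consensus = co-clustered / co-sampled; 0 where a pair never co-occurred
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(counted > 0, together / counted, 0.0)
```

Thresholding this matrix (e.g., keeping pairs with consensus above 0.9) then yields stable clusters, analogous to the threshold-based hard assignment described above. The drawback listed earlier applies directly: this costs n_runs full clusterings.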
I adapted the pipeline for use with a scRNAseq dataset with ~50000 cells. The input for PCA is the normalized expression matrix.
! Disclaimer: I edited the pipeline, but the PCA is at the very beginning, so this should be the same. However, I may have messed something up. !
The PCA job gets stuck on "Activating conda environment: ..." for hours and then fails. It worked fine for a dataset of 500 cells.
Here is an example of a log file:
Building DAG of jobs...
Your conda installation is not configured to use strict channel priorities. This is however crucial for having robust and correct environments (for details, see https://conda-forge.org/docs/user/tipsandtricks.html). Please consider to configure strict priorities by executing 'conda config --set channel_priority strict'.
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: disk_mb=18545, disk_mib=17686
Select jobs to execute...
[Wed Oct 18 08:59:38 2023]
rule pca:
input: /data/path/normalized_expr_data.csv, /data/path/obs_metadata.csv
output: /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_object.pickle, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_var.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_axes.csv
log: logs/rules/PCA_sample_name_default_2.log
jobid: 0
reason: Missing output files: /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_var.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_loadings.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_axes.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data_small.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_data.csv, /results/path/unsupervised_analysis/sample_name/PCA/PCA_default_2_object.pickle
wildcards: sample=sample_name, parameters=default_2
threads: 2
resources: mem_mb=400000, disk_mb=18545, disk_mib=17686, tmpdir=/tmp
python -c "from __future__ import print_function; import sys, json; print(json.dumps([sys.version_info.major, sys.version_info.minor]))"
Activating conda environment: /path/to/conda/env/snakemake_conda/40431c23e3640492480b1b2b0c8d33df_
python /path/to/pipeline/.snakemake/scripts/tmp83o38501.pca.py
Activating conda environment: /path/to/conda/env/snakemake_conda/40431c23e3640492480b1b2b0c8d33df_
slurmstepd: error: *** JOB 4197138 ON d015 CANCELLED AT 2023-10-18T09:49:43 ***
For large data (the threshold for "too large" still needs to be defined), do not draw heatmaps of the features and data themselves; instead, compute distance matrices with the configured metric (e.g., correlation) and show those.
Do not plot at all if either observations or dimensions exceed, e.g., 50k.
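A minimal sketch of the distance-matrix alternative, assuming observations as rows and the metric name coming from the config (function and file names are hypothetical):

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial.distance import pdist, squareform

def distance_heatmap(X, metric="correlation", path="distance_heatmap.png"):
    """Plot the observation-by-observation distance matrix instead of a
    full features-x-observations heatmap; the matrix is n x n regardless
    of how many features the data has."""
    D = squareform(pdist(X, metric=metric))
    fig, ax = plt.subplots()
    im = ax.imshow(D, cmap="viridis")
    fig.colorbar(im, ax=ax, label=f"{metric} distance")
    fig.savefig(path)
    plt.close(fig)
    return D
```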
Use Heatgraphy, a new visualization package for multi-dimensional data.
Visualize as a heatmap with and without metadata (scaled by scores):
- clusterings (and/or metadata) as rows, ordered by MCDM ranking
- indices as columns, hierarchically clustered
Config file config/config.yaml is extended by additional config specified via the command line.
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 2
Rules claiming more threads will be scaled down.
Provided resources: disk_mb=14965
Select jobs to execute...
Dask: Dask is a parallel computing library that integrates with pandas, NumPy, and scikit-learn. It can handle larger-than-memory datasets and can distribute the computation across multiple cores or even multiple machines.
import dask.array as da
import dask.dataframe as dd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# load data with dask (dask's read_csv does not support index_col,
# so set the index after reading)
ddata = dd.read_csv(data_path)
ddata = ddata.set_index(ddata.columns[0])

# convert to dask array
data_array = ddata.to_dask_array(lengths=True)

# standardize data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data_array)

# PCA transformation (note: scikit-learn's PCA materializes the data in
# memory; for truly out-of-core PCA, dask_ml.decomposition.PCA can be used)
pca_obj = PCA(n_components=None, random_state=42)
data_pca = pca_obj.fit_transform(data_scaled)
Which parametrization should be used? Maybe describe a two-step procedure in the user manual: first find one representation/parameter set per dataset and method, then apply the meta-visualization.
Traceback (most recent call last):
File "path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpgpjrocx8.plot_umap_diagnostics.py", line 41, in
umap.plot.diagnostic(umap_obj, diagnostic_type='neighborhood', nhood_size=min(umap_obj.n_neighbors, 15), ax=ax_diag[1,1])
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py", line 1124, in diagnostic
accuracy = nhood_compare(
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01/lib/python3.9/site-packages/numba/core/dispatcher.py", line 468, in compile_for_args
error_rewrite(e, 'typing')
File "/path/to/snakemake_conda/4ba29b5deef3de008651353701702e01/lib/python3.9/site-packages/numba/core/dispatcher.py", line 409, in error_rewrite
raise e.with_traceback(None)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<function intersect1d at 0x1555481dcca0>) found for signature:
Of which 2 did not match due to:
Overload in function 'jit_np_intersect1d': File: numba/np/arraymath.py: Line 3586.
With argument(s): '(array(int32, 1d, C), array(int32, 1d, C), assume_unique=bool)':
Rejected as the implementation raised a specific error:
TypingError: got an unexpected keyword argument 'assume_unique'
raised from /path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/numba/core/typing/templates.py:784
During: resolving callee type: Function(<function intersect1d at 0x1555481dcca0>)
During: typing of call at /path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py (209)
File "../../../../../../path/to/snakemake_conda/4ba29b5deef3de008651353701702e01_/lib/python3.9/site-packages/umap/plot.py", line 209:
def _nhood_compare(indices_left, indices_right):
Activating conda environment: ../../../../../../path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_
Error in if (all(metadata[[col]] == round(metadata[[col]]))) { :
missing value where TRUE/FALSE needed
Execution halted
Not cleaning up path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpa5ioybfx.plot_2d.R
[Thu Feb 29 10:43:21 2024]
Error in rule plot_dimred_metadata:
jobid: 0
output: path/to/projects/project/results/demultiplexing/first_batch_of_samples/unsupervised_analysis/unsupervised_analysis/subset_id/UMAP/plots/UMAP_euclidean_15_0.1_2_metadata.png
log: logs/rules/plot_metadata_subset_id_UMAP_euclidean_15_0.1_2.log (check log file(s) for error message)
conda-env: /path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_
RuleException:
CalledProcessError in line 69 of path/to/projects/project/modules/unsupervised_analysis/workflow/rules/visualization.smk:
Command 'source /path/to/miniconda3/bin/activate '/path/to/snakemake_conda/7e3a48a04ecb72cc15f09fd456de7cf6_'; set -eo pipefail; Rscript --vanilla path/to/projects/project/modules/unsupervised_analysis/.snakemake/scripts/tmpa5ioybfx.plot_2d.R' returned non-zero exit status 1.
File "path/to/projects/project/modules/unsupervised_analysis/workflow/rules/visualization.smk", line 69, in __rule_plot_dimred_metadata
File "/path/to/miniconda3/envs/snakemake7_15_2/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The metadata file is a csv with multiple categorical columns but also numerical columns like gene_module_scores. One of those is indicated as metadata_of_interest: ["sampleid__donor"] in the config.
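The R error above ("missing value where TRUE/FALSE needed" inside `all(metadata[[col]] == round(metadata[[col]]))`) is the classic symptom of an NA in a numeric metadata column: the comparison then yields NA, and an `if` on NA aborts with exactly this message. A quick, hypothetical pandas pre-check of the metadata file:

```python
import pandas as pd

def columns_with_missing(meta: pd.DataFrame) -> list:
    """Names of metadata columns containing NA values; in R,
    all(x == round(x)) evaluates to NA for such columns, so the
    integer-detection `if` in the plotting script fails."""
    return [c for c in meta.columns if meta[c].isna().any()]
```

If any numeric column (such as the gene module scores) shows up here, filling or dropping the NAs before running the pipeline should avoid the crash.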
The only other thing I changed compared to the example config (apart from paths, of course) is sample_proportion: 0.3, to increase iteration speed.
Thank you for your help and the amazing pipelines!