panpipes's Issues

LSI, Neighbors, panpipes_clustering

Hi,

a small detail I noticed in panpipes_clustering:

When calculating neighbors (i.e. use_existing=False), the pipeline recalculates the PCA if it is not present, but not the LSI (please see: https://github.com/DendrouLab/panpipes/blob/main/panpipes/python_scripts/rerun_find_neighbors_for_clustering.py#L53).
It is possible to run neighbors on LSI, but only if the LSI is already present in the object. Otherwise, an error is thrown: ValueError: Did not find X_lsi in .obsm.keys(). You need to compute it first. in: https://github.com/DendrouLab/panpipes/blob/main/panpipes/funcs/scmethods.py#L267
A small enhancement could be to also recalculate the LSI when dim_red: X_lsi is set and it is not present in the object, the same way it is already done for PCA.
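
A minimal sketch of what that could look like, mirroring the existing PCA branch (the function and variable names here are assumed for illustration, not taken from the panpipes source):

    import scanpy as sc
    from anndata import AnnData
    from muon import atac as ac

    def ensure_dimred(adata: AnnData, dimred: str, n_comps: int = 50) -> None:
        """Recompute the requested reduction if it is missing (sketch, not panpipes code)."""
        if dimred == "X_pca" and "X_pca" not in adata.obsm:
            sc.pp.pca(adata, n_comps=n_comps)
        elif dimred == "X_lsi" and "X_lsi" not in adata.obsm:
            ac.tl.lsi(adata, n_comps=n_comps)  # muon's LSI writes .obsm["X_lsi"]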

ADT PCA run parameters choice

Hi,
Currently, if both clr and dsb normalisation are run on the ADT modality, PCA is always run on the dsb layer. Perhaps this can be parametrised, so the user can decide whether they want PCA on clr or dsb. If this is inconvenient, then it should be made clearer in the pipeline.yml, so users are aware that when running dsb, PCA is always based on dsb, as this also affects downstream tasks.

Secondly, based on the recent single-cell best practices book, removing the isotypes when doing dimensionality reduction might be a sensible choice. Since not everyone might want to do this, perhaps this choice can also be parametrised and made an option in the pipeline.yml for the panpipes_preprocess workflow (see the sketch below).
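
For illustration, a hedged sketch of what such a parametrisation could look like (the layer names "clr"/"dsb" follow the issue; the pca_layer option and the "isotype" .var column are hypothetical):

    import scanpy as sc

    def run_prot_pca(prot, pca_layer="dsb", drop_isotypes=False):
        """Sketch: run ADT PCA on a user-chosen layer, optionally excluding isotypes."""
        if drop_isotypes and "isotype" in prot.var.columns:  # hypothetical boolean .var flag
            prot = prot[:, ~prot.var["isotype"]].copy()
        prot.X = prot.layers[pca_layer].copy()               # "clr" or "dsb"
        sc.pp.pca(prot, n_comps=min(50, prot.n_vars - 1))
        return prot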

best,
Devika

pipeline_ingest.concat_filtered_mudatas requires pytz

The pipeline_ingest.concat_filtered_mudatas step of the pipeline throws an error because of a missing pytz module.

ERROR main control -

Original exception:

Exception #1
'builtins.OSError(Job 29542491 has non-zero exitStatus 1: hasExited=True, wasAborted=False, hasSignal=False, terminatedSignal='')' raised in ...
Task = def pipeline_ingest.concat_filtered_mudatas(...):
Traceback (most recent call last):
  File "[path]/envs/panpipes/lib/python3.9/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    return_value = job_wrapper(params, user_defined_work_func,
  File "[path]/envs/panpipes/lib/python3.9/site-packages/ruffus/task.py", line 545, in job_wrapper_io_files
    ret_val = user_defined_work_func(*params)
  File "[path]/envs/panpipes/lib/python3.9/site-packages/panpipes/panpipes/pipeline_ingest.py", line 178, in concat_filtered_mudatas
    P.run(cmd, **job_kwargs)
  File "[path]/envs/panpipes/lib/python3.9/site-packages/cgatcore/pipeline/execution.py", line 1244, in run
    benchmark_data = r.run(statement_list)
  File "[path]/envs/panpipes/lib/python3.9/site-packages/cgatcore/pipeline/execution.py", line 820, in run
    stdout, stderr, resource_usage = self.queue_manager.collect_single_job_from_cluster(
  File "[path]/envs/panpipes/lib/python3.9/site-packages/cgatcore/pipeline/cluster.py", line 145, in collect_single_job_from_cluster
    raise OSError(error_msg)
OSError: Job 29542491 has non-zero exitStatus 1: hasExited=True, wasAborted=False, hasSignal=False, terminatedSignal=''
Traceback (most recent call last):
  File "[path]/envs/panpipes/lib/python3.9/site-packages/panpipes/python_scripts/concat_adata.py", line 1, in <module>
    import scanpy as sc
  File "[path]/envs/panpipes2/lib/python3.9/site-packages/scanpy/__init__.py", line 6, in <module>
    from ._utils import check_versions
  File "[path]/envs/panpipes2/lib/python3.9/site-packages/scanpy/_utils/__init__.py", line 21, in <module>
    from anndata import AnnData, __version__ as anndata_version
  File "[path]/envs/panpipes2/lib/python3.9/site-packages/anndata/__init__.py", line 7, in <module>
    from ._core.anndata import AnnData
  File "[path]/envs/panpipes2/lib/python3.9/site-packages/anndata/_core/anndata.py", line 21, in <module>
    import pandas as pd
  File "[path]/envs/panpipes2/lib/python3.9/site-packages/pandas/__init__.py", line 16, in <module>
    raise ImportError(
ImportError: Unable to import required dependencies:
pytz: No module named 'pytz'

I removed lines of the log containing individual .h5mu files to protect patient information and replaced my cluster paths with [path]. After conda install -c anaconda pytz the error persists.

Violin plots are not being plotted

Hi,
I have noticed that violin plots are no longer being plotted for the data. The files are generated and there is a border, but I don't actually see any violin plots drawn. This has only started happening recently.
Best,
Devika

mofa failing within panpipes

Hi
I have installed panpipes using a python venv for the Oxford BMRC cluster.

The modules I have loaded are:
Python/3.10.4-GCCcore-11.3.0 R-bundle-Bioconductor/3.15-foss-2022a-R-4.2.1
I am using muon version 0.1.5, mudata 0.2.3 and mofapy2 0.7.0.

The training model converges, but the pipeline fails with the error in the attached screenshot.


Thanks,
Devika

LSI, scATAC-Seq

Hi,

I noticed that the preprocessing/QC part of the pipeline doesn't provide a plot that could guide the decision on whether to exclude the first LSI component.
The Signac package provides a plot of the correlation between sequencing depth and the components: https://stuartlab.org/signac/reference/depthcor. Including such a plot in the pipeline may be a nice extension.
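
For illustration, a rough Python analogue of Signac's DepthCor (a sketch assuming total counts live in atac.obs["total_counts"] and the reduction in atac.obsm["X_lsi"]):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_depth_correlation(atac, reduction="X_lsi", depth_col="total_counts"):
        """Correlate per-cell sequencing depth with each component (cf. Signac's DepthCor)."""
        depth = atac.obs[depth_col].to_numpy()
        emb = atac.obsm[reduction]
        corrs = [np.corrcoef(depth, emb[:, i])[0, 1] for i in range(emb.shape[1])]
        plt.scatter(np.arange(1, len(corrs) + 1), corrs)
        plt.axhline(0, linestyle="--", color="grey")
        plt.xlabel(f"{reduction} component")
        plt.ylabel("Pearson r with sequencing depth")
        plt.show()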

[preprocess] how to run pre-filtered objects

These lines are unnecessary:

PARAMS['filt_file'] = PARAMS['sample_prefix'] + "_filt.h5mu"

PARAMS['scaled_file'] = PARAMS['sample_prefix'] + "_scaled.h5mu"

And the section of the pipeline_preprocess yml concerning files that have already been filtered is wrong.

The correct thing to do is name your file {sample_prefix}.h5mu

missing package openpyxl - checked on new PyPI installation

The message came up while running find cluster markers. @crichgriffin please check the installation requirements!

  File "/Users/fabiola.curion/Documents/devel/github/panpipes/panpipes/python_scripts/run_find_markers_multi.py", line 213, in <module>
    main(adata, 
  File "/Users/fabiola.curion/Documents/devel/github/panpipes/panpipes/python_scripts/run_find_markers_multi.py", line 183, in main
    with pd.ExcelWriter(excel_file_top) as writer:
  File "/Users/fabiola.curion/Documents/devel/miniconda3/envs/pipeline_bbknn/lib/python3.9/site-packages/pandas/io/excel/_openpyxl.py", line 56, in __init__
    from openpyxl.workbook import Workbook
ModuleNotFoundError: No module named 'openpyxl'

reduce wnn runtime by fetching the precomputed no_batch_*

In the current version of the integration workflow, if wnn is run on non-batch-corrected modalities, it runs neighbours on each modality on the fly in a "no_batch" way (i.e. on a precomputed dimred such as PCA or LSI, if specified), with the same params as specified for each of the no_batch unimodal analyses.
The behaviour is different when wnn is calculated on previously batch-corrected unimodal data, because in that case the pipeline expects each batch-corrected object to exist, and this is correctly reflected in the decorator flow.

We need to modify wnn to fetch the precomputed no_batch results instead of computing them on the fly, to reduce the runtime (currently no_batch is run twice per modality if wnn is called on no_batch). A sketch of the idea follows.
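
A sketch of the proposed behaviour (the file paths and modality list are hypothetical; the point is copying the precomputed graphs rather than recomputing them):

    import mudata as md
    import muon as mu

    # mdata is assumed to be the MuData object wnn runs on.
    for mod in ["rna", "prot"]:
        pre = md.read_h5mu(f"tmp/no_batch_{mod}.h5mu")[mod]  # hypothetical path
        mdata[mod].obsp["distances"] = pre.obsp["distances"]
        mdata[mod].obsp["connectivities"] = pre.obsp["connectivities"]
        mdata[mod].uns["neighbors"] = pre.uns["neighbors"]
    mu.pp.neighbors(mdata)  # weighted nearest neighbours across modalities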

Current installation instructions do not work

I am aware that pip install . does not work as intended; fixing it will require a restructure of the repo.

The current alternative method of installation is as follows:

pip install -r requirements_minimal.txt
Rscript r_install_libraries.R
python setup_orig.py develop

Hopefully this will get fixed up in the next couple of days!

ATAC preprocess binarizing data even when 'binarize' set to False

Hello! I am having some issues with preprocessing ATAC data (paired multiome). I am trying to perform preprocessing so that I can run harmony for batch correction.

The pipeline.yml settings are as follows: 
atac:
  binarize: False
  normalize: log1p

Arguments appear to be read in correctly when running the pipeline:

pid: 45740, system: Linux 3.10.0-1160.62.1.el7.x86_64 #1 SMP Tue Apr 5 16:57:59 UTC 2022 x86_64
2023-09-01 17:42:35,606 INFO main control - atac                                    : {'binarize': False, 'normalize': 'log1p', 'TFIDF_flavour': None, 'feature_selection_flavour': 'scanpy', 'min_mean': None, 'max_mean': None, 'min_disp': None, 'min_cutoff': None, 'dimred': 'PCA', 'dim_remove': None} \
                                           atac_TFIDF_flavour                      : None \
                                           atac_binarize                           : False \
                                           atac_dim_remove                         : None \
                                           atac_dimred                             : PCA \
                                           atac_feature_selection_flavour          : scanpy \
                                           atac_normalize                          : log1p \

But I still have the preprocess log outputs as:

2023-09-01 18:06:29,095: INFO - running with args:
2023-09-01 18:14:14,192: INFO - binarizing peak count matrix
2023-09-01 18:14:15,416: WARNING - Careful, you have decided to binarize data but also to normalize per cell and log1p. Not sure this is meaningful
2023-09-01 18:14:45,824: WARNING - You have 8984 Highly Variable Features
2023-09-01 18:38:51,939: INFO - Done

Is there any other variable causing the ATAC processing to default to binarizing? (One guess at a possible cause is sketched below.)
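
One thing worth checking (a hypothesis on my side, not confirmed from the source): whether the flag reaches run_preprocess_atac.py as the string "False", which Python treats as truthy. A minimal illustration with a defensive parser:

    # If argparse hands booleans over as plain strings, "False" is truthy:
    binarize = "False"
    if binarize:
        print("binarizing peak count matrix")  # runs despite binarize: False

    # Defensive coercion (sketch):
    def str2bool(value) -> bool:
        return str(value).strip().lower() in ("true", "1", "yes")

    assert str2bool("False") is False
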
Thank you!

issue with plotting covariates and faceting plots uniformly across methods in panpipes_integration.py

Hiya!
I noticed this while looking at plots for the different covariates across the multiple batch correction methods when running the integration workflow from panpipes. The same colours are not used to depict the same legend categories across all the methods (in my case I noticed it for VDJ receptor subtypes), and when facet plots are created, the order of the headings is different for each method. The latter isn't a problem as such, but it does make it difficult to compare across methods in a facet plot. I'm not sure whether the first plotting issue happens for all covariates or only certain types, but I thought I would flag it. One possible remedy is sketched below the screenshots.
[two screenshots of the facet plots attached]
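
One possible remedy (a sketch, assuming the covariate is a categorical .obs column): fix the category order and pin the colours via scanpy's uns["{key}_colors"] convention before each method is plotted, so every facet uses the same colour per category.

    import pandas as pd
    import scanpy as sc

    def pin_palette(adata, key):
        """Sketch: enforce a stable category order and colour assignment for `key`."""
        cats = sorted(adata.obs[key].astype(str).unique())
        adata.obs[key] = pd.Categorical(adata.obs[key].astype(str), categories=cats)
        # assumes <= 20 categories; pick a larger palette otherwise
        adata.uns[f"{key}_colors"] = sc.pl.palettes.default_20[: len(cats)]

    pin_palette(adata, "receptor_subtype")  # hypothetical covariate name
    sc.pl.umap(adata, color="receptor_subtype")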

best,
Devika

Documentation suggestions

Clustering: make it clearer that if you want to subcluster, you need to re-run preprocess & integration before clustering.

Repertoire: there is no panpipes documentation on what gets incorporated; there doesn't seem to be a column indicating whether a sequence is productive or not.

1st Experience: Preprocessing, Integration, Clustering

Hi,

again just a few little things I ran into while running the steps Preprocessing, Integration, and Clustering.

  • Preprocessing:

    • Specifying the parameter "output_logged_mudata" in the pipeline.yml did not work for me; it threw an error. But since the pipeline.yml marks this parameter as "TODO" and slated for removal, I guessed that this error isn't so important. When leaving the parameter empty, the preprocessing worked completely fine
  • Integration:

    • Running "panpipes integration make plot_pcas --local" resulted in an error, stating:
      "Target task 'plot_pcas' is not a pipelined task in Ruffus. Is it spelt correctly ?"
  • Clustering:

    • So that others don't run into the same error: it may be helpful to mention that one needs to specify more than one clustering resolution, otherwise clustree throws an error:
      "Error: Less than two column names matched the prefix: leiden_re
      Execution halted"

Best,
Sarah

No X_pca in obsm if filtering hvgs

When filtering to keep only the top hvgs, the output h5mu does not contain the variables associated with scaling (.var 'std' or 'mean') or PCA (.obsm 'X_pca'), even though other outputs (output_pca.txt.gz, filtered_genes.tsv) indicate these steps are being run:

AnnData object with n_obs × n_vars = 370316 × 61860
    obs: 'sample_id', 'doublet_scores', 'predicted_doublets', 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'total_counts_hb', 'log1p_total_counts_hb', 'pct_counts_hb', 'total_counts_mt', 'log1p_total_counts_mt', 'pct_counts_mt', 'total_counts_rp', 'log1p_total_counts_rp', 'pct_counts_rp', 'total_counts_ig', 'log1p_total_counts_ig', 'pct_counts_ig', 'MarkersNeutro_score', 'S_score', 'G2M_score', 'batch'
    var: 'gene_ids', 'feature_types', 'genome', 'interval', 'hb', 'mt', 'rp', 'ig', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm'
    uns: 'hvg', 'log1p'
    layers: 'raw_counts'

When not filtering hvgs, this is not an issue.

Muon's LSI, HVFs

Hi,

I noticed that Muon's implementation of the LSI for ATAC data doesn't take highly variable features into account.
As far as I can see, the function takes the adata.X slot without providing the possibility to first select specific features for the LSI to be run on.
Please see their source code: https://github.com/scverse/muon/blob/master/muon/_atac/tools.py#L28
Also in their documentation, there is no such parameter: https://muon.readthedocs.io/en/latest/api/generated/muon.atac.tl.lsi.html

This means that when panpipes runs LSI in run_preprocess_atac.py, it is always run on all features, even if atac.var["highly_variable"] is defined.

Please correct me if I'm wrong.
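
If this is correct, a possible workaround (a sketch, not panpipes code) would be to run LSI on the HVF subset and copy the embedding back:

    import muon as mu

    # Run muon's LSI on the highly variable subset only, then copy the
    # cell embedding back to the full object.
    atac_hvf = atac[:, atac.var["highly_variable"]].copy()
    mu.atac.tl.lsi(atac_hvf, n_comps=50)
    atac.obsm["X_lsi"] = atac_hvf.obsm["X_lsi"]
    # .varm / .uns entries would need similar handling if downstream steps
    # rely on them (loadings only exist for the HVF subset).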

Possible scATAC extensions

Hi,

the following may be possible extensions to panpipes in regards to scATAC-data:

Preprocessing:

Visualization:

Analysis:
Not sure how deep panpipes wants to go on the analysis part, but:

RNA+ATAC:

Experience using the pipeline for the 1st time

Hi,

there are some aspects I've noticed while using QC+Preprocessing for the first time with an RNA+ATAC multiome dataset (filtered_feature_bc_matrix.h5 file):

  • Sample submission file: it was unclear to me what is meant by the cellranger "outs" folder in regards to the keys "cellranger" and "cellranger_multi". What files are expected to be in the outs folder? (The barcodes.tsv, genes.tsv and matrix.mtx, for example?)

    • I was unsure whether the folder containing the .h5 file (or cellranger outputs) needs to be named "outs"
  • Regarding the QC_mm gene lists: I didn't know before running the pipeline that one has to provide a list and that it's not optional, since the documentation of the gene list formats states "...,the user can provide custom gene lists..."

  • Regarding the QC pipeline.yml file:

    • I wasn't sure how to specify the "score_genes" parameter or what "MarkersNeutro" is (MarkersNeutro is a group of genes in the provided gene list, right?)
    • ATAC QC: I did not know how to specify the "partner_rna" parameter for the multiome (RNA+ATAC) dataset, e.g. whether to set it to "True"/"False"; it was not clear to me that this parameter needs to be left empty in my case, and an error was thrown when I tried to set "partner_rna" to the .h5 file of the RNA+ATAC data;
  • Regarding the output of the QC:

    • The scatter plot of "n_genes_by_counts x doublet_scores" was too small; I couldn't see the distribution clearly (see attached)
    • Filtering in the "Preprocessing" step of the pipeline: when wanting to filter genes by the number of cells they are expressed in (i.e. n_cells_by_counts) and by the genes' total_counts, I wasn't able to decide on a cutoff because the QC produced no plots of these two metrics
    • Violin plots of "n_genes_by_counts" and of the number of molecules in each cell (total_counts) would be nice for the user to decide on cutoffs. I know a lot of people who use those violin plots (including me); Seurat's tutorial also uses them: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html
    • I ran the QC multiple times for the same dataset. Somehow I only got suggested thresholds for the RNA in tsv files the first time I ran the QC; the other times, I didn't get this output

[attached figure: scatter_sample_id_rna-genes_rna-doublet_scores_rna-numi]

AttributeError: 'NoneType' object has no attribute 'get_legend_handles_labels'

I am running the ingest workflow on multiome data. However, panpipes ingest aborts with an error when it reaches the 'run_scanpyQC_atac.py' script. The error I get is:

    sc.pl.violin(atac, qc_vars_plot,
  File "/miniconda3/envs/pipeline_env/lib/python3.9/site-packages/scanpy/plotting/_anndata.py", line 795, in violin
    g = sns.catplot(
  File ".../seaborn/categorical.py", line 2932, in catplot
    p.plot_violins(
  File ".../seaborn/categorical.py", line 1153, in plot_violins
    self._configure_legend(ax, legend_artist, common_kws)
  File ".../seaborn/categorical.py", line 420, in _configure_legend
    handles, _ = ax.get_legend_handles_labels()
AttributeError: 'NoneType' object has no attribute 'get_legend_handles_labels'

dsb normalisation: ValueError: could not convert string to float: 'Sample_587'

I am using panpipes to analyse CITE-seq data. In my submission file, the 'sample_id' column is 'Sample_587'. When I run the ingest workflow, I specify 'dsb' normalization. However, the pipeline aborts at the 'assess_background.py' script with the following error:

    sns.heatmap(plt_df.iloc[1:split_int,:], ax=ax[0])
  File "/envs/pipeline_env/lib/python3.9/site-packages/seaborn/matrix.py", line 446, in heatmap
    plotter = _HeatMapper(data, vmin, vmax, cmap, center, robust, annot, fmt,
  File ".../envs/pipeline_env/lib/python3.9/site-packages/seaborn/matrix.py", line 163, in __init__
    self._determine_cmap_params(plot_data, vmin, vmax,
  File "~/envs/pipeline_env/lib/python3.9/site-packages/seaborn/matrix.py", line 197, in _determine_cmap_params
    calc_data = plot_data.astype(float).filled(np.nan)
ValueError: could not convert string to float: 'Sample_587'

Error Visualization Continuous Variable

Hi,

I ran the visualization part of the pipeline with a muData containing both scRNA- and scATAC- data. (Before the visualization step I ran the QC, Preprocessing, Clustering successfully).
When specifying both categorical & continuous variables for the RNA (not the ATAC; I left the ATAC part empty), e.g.

rna:
  - rna:total_counts

it worked completely fine. But when only plotting categorical variables and leaving the continuous variables empty, errors were thrown and the pipeline stopped. The parameter "continuous_violin" was set to False in both cases.
The errors included:
Error in mutate():
ℹ In argument: mod = ifelse(X1 %in% c("rna", "prot", "atac", "rep"), X1, "multimodal").
Caused by error in X1 %in% c("rna", "prot", "atac", "rep"):
! object 'X1' not found
when running:
Rscript /home/sarah/anaconda3/envs/pipeline_env/lib/python3.8/site-packages/panpipes/R_scripts/plot_metrics.R --mtd_object sample1_cell_metadata.tsv --params_yaml pipeline.yml > logs/plot_metrics.log

Do I need to specify the parameters in the pipeline.yml in a specific way so that it works? What do I have to consider when only wanting to plot categorical variables?

Thanks.

refactoring scib

We have removed all scib metrics computation from the integration pipeline.

scib metrics were originally implemented to evaluate unimodal integration. Their use for evaluating multimodal integration and reference mapping has since been adopted by the community and can provide useful insights.
However, there is currently a lack of benchmarking metrics developed specifically for these tasks, which can result in misleading interpretation of integration results.
We and others in the single-cell field are currently working on generating ad-hoc benchmarking metrics for these tasks; they will be released in the near future.

Therefore, our aim for the next panpipes release is to:

  • write a separate workflow to allow extra flexibility when calling the scib computation
  • substitute scib with the faster scib-metrics package wherever possible
  • expand the set of metrics to include newly developed ad-hoc ones

We have left the calculation of scib metrics in the refmap workflow for now, as a legacy example of how these are currently computed, but we will be refactoring them in due time.

If you have ideas on implementing integration and/or refmap benchmarking metrics and want to contribute, feel free to reach out!

ingest config won't work unless igraph updated

Did a conda install... had igraph 0.10.2. When I tried to run ingest config, I got the following error: AttributeError: module 'igraph' has no attribute 'VertexClustering'

Solution: uninstall igraph, then pip re-install igraph (0.10.8). That worked.

dsb does not run when intersection is False

dsb does not run when half the samples are rna + adt and half the samples are rna only (and no intersection between rna and adt is taken).

The fix is to take the intersection of the background, mu.pp.intersect_obs(mdata_bg), prior to mu.prot.pp.dsb.
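
A sketch of that fix (assuming mdata is the filtered object and mdata_bg the background MuData, as in the issue):

    import muon as mu

    # Align rna/prot observations in the background object so cells from
    # rna-only samples don't break dsb, then run the normalisation.
    mu.pp.intersect_obs(mdata_bg)
    mu.prot.pp.dsb(mdata, mdata_bg)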

Preparing submission file for multiome data

I am running multiome data and preparing my submission file for the ingest workflow. Specifying the cellranger 'outs' folder as x_path and 'cellranger' as x_filetype results in an error. However, specifying the complete path, i.e. 'outs/filtered_feature_bc_matrix.h5', and filetype '10X_h5' solves the issue.

AttributeError: 'YTick' object has no attribute 'label'

While running 'panpipes ingest make full --local' locally on my computer I receive this error: AttributeError: 'YTick' object has no attribute 'label'.

I guess it has something to do with matplotlib.

Full error code:

Traceback (most recent call last):
  File "/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/ruffus/task.py", line 712, in run_pooled_job_without_exceptions
    return_value = job_wrapper(params, user_defined_work_func,
  File "/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/ruffus/task.py", line 608, in job_wrapper_output_files
    job_wrapper_io_files(params, user_defined_work_func, register_cleanup, touch_files_only,
  File "/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/ruffus/task.py", line 540, in job_wrapper_io_files
    ret_val = user_defined_work_func(*(params[1:]))
  File "/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/panpipes/panpipes/pipeline_ingest.py", line 469, in run_dsb_clr
    P.run(cmd, **job_kwargs)
  File "/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/cgatcore/pipeline/execution.py", line 1244, in run
    benchmark_data = r.run(statement_list)
  File "/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/cgatcore/pipeline/execution.py", line 1029, in run
    raise OSError(
OSError: ---------------------------------------
Child was terminated by signal -1:
The stderr was:
/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.
  self.seed = seed
/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.
  self.dl_pin_memory_gpu_training = (
/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/muon/_prot/preproc.py:219: UserWarning: adata.X is sparse but not in CSC format. Converting to CSC.
  warn("adata.X is sparse but not in CSC format. Converting to CSC.")
Traceback (most recent call last):
  File "/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/panpipes/python_scripts/run_preprocess_prot.py", line 144, in <module>
    pnp.plotting.ridgeplot(mdata["prot"], features=plot_features, layer="clr",  splitplot=6)
  File "/Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/panpipes/funcs/plotting.py", line 299, in ridgeplot
    tick.label.set_fontsize(10)
AttributeError: 'YTick' object has no attribute 'label'

python /Users/justina/opt/anaconda3/envs/multiome_panpipes/lib/python3.9/site-packages/panpipes/python_scripts/run_preprocess_prot.py         --filtered_mudata test_unfilt.h5mu         --figpath ./figures/prot          --channel_col sample_id --normalisation_methods clr --quantile_clipping True --clr_margin 0 > logs/run_dsb_clr.log

Matplotlib version: 3.8.0
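
Matplotlib 3.8 removed the Tick.label alias (deprecated since 3.1; label1 is the supported attribute). A hedged sketch of a fix for the loop in panpipes/funcs/plotting.py (ax being the Axes whose labels are resized):

    # Instead of tick.label.set_fontsize(10), which breaks on matplotlib >= 3.8:
    for tick in ax.yaxis.get_major_ticks():
        tick.label1.set_fontsize(10)

    # or, more simply:
    ax.tick_params(axis="y", labelsize=10)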

signac hvf selection expects gene_ids

https://github.com/DendrouLab/panpipes/blob/47f752e7a570dece48c420ae10d6298938282cbf/panpipes/funcs/scmethods.py#L42C9-L51C30

Running this selection on an atac object that doesn't have a "gene_ids" column doesn't work.
@SarahOuologuem can it be substituted with "features" instead? (The function should test whether "gene_ids" is present in .var, otherwise use "features", or else issue a warning and automatically set hvf selection to "scanpy"; see the sketch after the table below.)

                                              features  n_cells_by_counts  mean_counts  pct_dropout_by_counts  total_counts
chr1-9962-10510                        chr1-9962-10510                 12     0.005464              99.453552          12.0
chr1-180614-181999                  chr1-180614-181999                 65     0.031876              97.040073          70.0
chr1-191356-191736                  chr1-191356-191736                  3     0.001366              99.863388           3.0
chr1-267811-268201                  chr1-267811-268201                 13     0.005920              99.408015          13.0
chr1-586031-586368                  chr1-586031-586368                  3     0.001366              99.863388           3.0
...                                                ...                ...          ...                    ...           ...
KI270727.1-52104-52803          KI270727.1-52104-52803                 59     0.028689              97.313297          63.0
KI270728.1-232459-232988      KI270728.1-232459-232988                  6     0.002732              99.726776           6.0
KI270728.1-1791305-1792428  KI270728.1-1791305-1792428                  9     0.005009              99.590164          11.0
KI270734.1-117216-117331      KI270734.1-117216-117331                  5     0.002277              99.772313           5.0
KI270734.1-133749-134116      KI270734.1-133749-134116                  8     0.004098              99.635701           9.0
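
A sketch of the suggested fallback (column names taken from the issue; the flavour variable is assumed from context):

    import logging

    # Prefer "gene_ids", fall back to "features", else revert to scanpy hvf selection.
    if "gene_ids" in atac.var.columns:
        id_col = "gene_ids"
    elif "features" in atac.var.columns:
        id_col = "features"
    else:
        logging.warning("neither 'gene_ids' nor 'features' in .var; "
                        "falling back to scanpy hvf selection")
        flavour = "scanpy"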

thank you!

TypeError: _init_from_dict_() got an unexpected keyword argument 'matrix'

I am using panpipes to analyze CITE-seq data, and managed to run the ingest workflow and get the resulting h5mu file. However, when I try to load the resulting 'x_unfilt.h5mu' file in a Jupyter notebook using muon.read_h5mu(), I get an error that says: TypeError: _init_from_dict_() got an unexpected keyword argument 'matrix'.

Protein PCA fails when number of samples < number of features

In the case of a small number of samples, and when the number of features is 50, panpipes preprocess incorrectly establishes the number of principal components that should be calculated:

n_comps=min(50,all_mdata['prot'].var.shape[0]-1),

Changing this line to n_comps=min(50, all_mdata['prot'].var.shape[0]-1, all_mdata['prot'].obs.shape[0]-1) fixes the issue.

I also suggest considering changing the solver to auto below a certain threshold of cells, as it's more robust (but slower).
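
Putting both suggestions together, a hedged sketch (the cell-count threshold below is purely illustrative, not a tested value):

    import scanpy as sc

    prot = all_mdata["prot"]
    # n_comps must stay below both the number of cells and the number of features.
    n_comps = min(50, prot.n_obs - 1, prot.n_vars - 1)
    # "auto" falls back to a robust (if slower) solver choice for small data.
    solver = "auto" if prot.n_obs < 1000 else "arpack"  # illustrative threshold
    sc.pp.pca(prot, n_comps=n_comps, svd_solver=solver)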
