
scenicplus's Introduction


SCENIC+ single-cell eGRN inference

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.

Tip

We did a live webinar on using the SCENIC+ workflow. You can rewatch it on YouTube.

Documentation

Extensive documentation and tutorials are available at Read the Docs.

Installing

To install SCENIC+ (in a Linux environment):

We highly recommend installing SCENIC+ in a new conda environment.

$ conda create --name scenicplus python=3.11 -y
$ conda activate scenicplus
$ git clone https://github.com/aertslab/scenicplus
$ cd scenicplus
$ pip install .
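
Once installed, a quick sanity check that the package imports and reports its version:

$ python -c "import scenicplus; print(scenicplus.__version__)"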

Questions?

  • If you have technical questions or problems, such as bug reports or ideas for new features, please open an issue under the issues tab.
  • If you have questions about the interpretation of results or your analysis, please start a Discussion under the Discussions tab.

References

Bravo Gonzalez-Blas, C. & De Winter, S. et al. (2022). SCENIC+: single-cell multiomic inference of enhancers and gene regulatory networks

scenicplus's People

Contributors

cbravo93, dagousket, ghuls, s-aibar, seppedewinter


scenicplus's Issues

mm10-based tutorial

Very excited to try this new methodology as a past fan of SCENIC and cisTopic.

I'm sure you have been busy getting the release and manuscript ready. When you have time, might you please provide an mm10-based tutorial (e.g. 10x mouse brain multiome or similar)? Or would it be best to contact you by email to request the notebook from Fig. 5?

Fully support sparse matrices: mismatch matrix indices in calculate_TFs_to_genes_relationships()

Describe the bug
In calculate_TFs_to_genes_relationships(), after initialization, the following error is thrown:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/scenicplus/src/scenicplus/TF_to_gene.py", line 333, in calculate_TFs_to_genes_relationships
    ex_matrix = pd.DataFrame(
  File "/home/ubuntu/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/frame.py", line 737, in __init__
    mgr = ndarray_to_mgr(
  File "/home/ubuntu/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 351, in ndarray_to_mgr
    _check_values_indices_shape_match(values, index, columns)
  File "/home/ubuntu/miniconda3/envs/scenicplus/lib/python3.8/site-packages/pandas/core/internals/construction.py", line 422, in _check_values_indices_shape_match
    raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (25945, 1), indices imply (25945, 18892)

To Reproduce
Here's the code, starting at a successfully created cistopic object

import pickle
import itertools
import anndata
import pandas as pd

with open('/data/scenic_demo/output/cistopic_obj.pkl', 'rb') as f:
    cistopic_obj = pickle.load(f)

import scanpy as sc

adata = sc.read_mtx('/data/CAREHF/multiome_rna_counts.mtx').T
adata.obs = atac_metadata
cell_data_raw = pd.read_csv('/data/CAREHF/multiome_samples.txt')
adata.obs_names = cell_data_raw['x']

gene_names = pd.read_csv('/data/CAREHF/multiome_rna_features.txt')
adata.var_names = gene_names['x']

adata = adata.T

import dill

menr = dill.load(open('/data/scenic_demo/carehf/motifs/menr.pkl', 'rb'))

from scenicplus.scenicplus_class import create_SCENICPLUS_object
import numpy as np
scplus_obj = create_SCENICPLUS_object(
    GEX_anndata = adata,
    cisTopic_obj = cistopic_obj,
    menr = menr,
    key_to_group_by = 'predicted.celltype_fromRNA',
    multi_ome_mode = True,
    bc_transform_func = lambda x: x + '___cisTopic'
)

from scenicplus.preprocessing.filtering import *

filter_genes(scplus_obj, min_pct = 0.5)
filter_regions(scplus_obj, min_pct = 0.5)

from scenicplus.cistromes import *
merge_cistromes(scplus_obj)

from scenicplus.enhancer_to_gene import get_search_space, calculate_regions_to_genes_relationships, GBM_KWARGS

get_search_space(scplus_obj,
                 biomart_host = 'http://www.ensembl.org',
                 species = 'hsapiens',
                 assembly = 'hg38',
                 upstream = [1000, 150000],
                 downstream = [1000, 150000])

calculate_regions_to_genes_relationships(scplus_obj,
                                         ray_n_cpu = 20,
                                         _temp_dir = tmp_dir,
                                         importance_scoring_method = 'GBM',
                                         importance_scoring_kwargs = GBM_KWARGS)

with open('/data/scenic_demo/carehf/scplus_obj.pkl', 'wb') as f:
    pickle.dump(scplus_obj, f)

from scenicplus.TF_to_gene import *
tf_file = '/data/scenic_demo/allTFs_hg38.txt'

calculate_TFs_to_genes_relationships(scplus_obj,
                                     tf_file = tf_file,
                                     ray_n_cpu = 20,
                                     method = 'GBM',
                                     _temp_dir = tmp_dir,
                                     key = 'TF2G_adj')

Version (please complete the following information):

  • Python: 3.8.13
  • SCENIC+: 0.1.dev437+ga57717f

Additional context
Perhaps this is related to creating the SCENIC+ object from a .mtx file rather than an AnnData object? However, prior to TF-to-gene inference, the object looks fine:

scplus_obj
SCENIC+ object with n_cells x n_genes = 25945 x 18892 and n_cells x n_regions = 25945 x 154930

Anyway, thanks for the work, excited to see the results...
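
The shape mismatch, (25945, 1) versus (25945, 18892), is what pd.DataFrame reports when it is handed a scipy sparse matrix instead of a dense 2-D array. A hedged workaround (my own sketch, not an official fix): densify the expression matrix before building the SCENIC+ object.

import numpy as np
from scipy import sparse

# Workaround sketch: sc.read_mtx() loads a sparse matrix; densifying it lets
# pd.DataFrame(...) inside calculate_TFs_to_genes_relationships see a 2-D array.
if sparse.issparse(adata.X):
    adata.X = np.asarray(adata.X.todense())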

Error when there are no dr_cells

Error in run_scenicplus when there are no dr_cells:


File /opt/venv/lib/python3.8/site-packages/scenicplus/wrappers/run_scenicplus.py:272, in run_scenicplus(scplus_obj, variable, species, assembly, tf_file, save_path, biomart_host, upstream, downstream, region_ranking, gene_ranking, simplified_eGRN, calculate_TF_eGRN_correlation, calculate_DEGs_DARs, export_to_loom_file, export_to_UCSC_file, tree_structure, path_bedToBigBed, n_cpu, _temp_dir)
    265     log.info('Binarizing eGRNs AUC')
    266     binarize_AUC(scplus_obj, 
    267          auc_key='eRegulon_AUC',
    268          out_key='eRegulon_AUC_thresholds',
    269          signature_keys=['Gene_based', 'Region_based'],
    270          n_cpu=n_cpu)
--> 272 if 'eRegulons_UMAP' not in scplus_obj.dr_cell.keys():
    273     log.info('Making eGRNs AUC UMAP')
    274     run_eRegulons_umap(scplus_obj,
    275                scale=True, signature_keys=['Gene_based', 'Region_based'])

AttributeError: 'NoneType' object has no attribute 'keys'
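
Until the wrapper guards against this, initializing the slot before the call sidesteps the AttributeError (a minimal sketch, assuming dr_cell is an ordinary attribute holding a dict of embeddings):

# Make sure dr_cell is an (empty) dict rather than None, so the
# `'eRegulons_UMAP' not in scplus_obj.dr_cell.keys()` check can evaluate.
if scplus_obj.dr_cell is None:
    scplus_obj.dr_cell = {}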

[BUG] AttributeError: 'Index' object has no attribute 'index'

When creating the scenicplus_obj:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_1651/3951783255.py in <module>
----> 1 scplus_obj = create_SCENICPLUS_object(
      2         GEX_anndata = rna_anndata,
      3         cisTopic_obj = cistopic_obj,
      4         imputed_acc_obj = imputed_acc_obj,
      5         menr = {},

/lustre1/project/stg_00002/lcb/cbravo/Multiomics_pipeline/scenicplus/src/scenicplus/scenicplus_class.py in create_SCENICPLUS_object(GEX_anndata, cisTopic_obj, menr, imputed_acc_obj, imputed_acc_kwargs, normalize_imputed_acc, normalize_imputed_acc_kwargs, cell_metadata, region_metadata, gene_metadata, bc_transform_func, ACC_prefix, GEX_prefix)
    510     with warnings.catch_warnings():
    511         warnings.simplefilter("ignore")
--> 512         X_EXP_subset = GEX_anndata[[GEX_cell_names.index(bc) for bc in common_cells], :].X.copy()
    513 
    514     ACC_cell_metadata_subset = ACC_cell_metadata.loc[common_cells]

/lustre1/project/stg_00002/lcb/cbravo/Multiomics_pipeline/scenicplus/src/scenicplus/scenicplus_class.py in <listcomp>(.0)
    510     with warnings.catch_warnings():
    511         warnings.simplefilter("ignore")
--> 512         X_EXP_subset = GEX_anndata[[GEX_cell_names.index(bc) for bc in common_cells], :].X.copy()
    513 
    514     ACC_cell_metadata_subset = ACC_cell_metadata.loc[common_cells]

AttributeError: 'Index' object has no attribute 'index'

Indeed, if I check GEX_cell_names (GEX_anndata.obs_names) it looks like:

Index(['AAACCCACACCCTAAA-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACCCACAGACACCC-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACCCAGTTACCGTA-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACCCAGTTGTTGAC-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACCCATCGCATTAG-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACGAAAGCCGTCGT-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACGAAAGGGAGATA-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACGAAAGGGAGGCA-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACGAAGTACCTAAC-1-TEW__02ff09__scNuc_frozen_mouse2',
       'AAACGAAGTCGATGCC-1-TEW__02ff09__scNuc_frozen_mouse2',
       ...
       'TTTGTGAAGCCTGATG-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTGAAGCGAGCGA-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTGAAGTTACTTC-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTGGCATCACAGC-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTGTTCACATTGA-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTGTTCGCTAAGT-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTGTTCGGTTTGG-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTGTTCTTTGTAC-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTTGGTTACATCC-1-TEW__ebb273__Mouse_5_Multiome_NST',
       'TTTGTTGGTTATTGCC-1-TEW__ebb273__Mouse_5_Multiome_NST'],
      dtype='object', length=29857)

Could this be because of the scanpy version? I'm using 1.8.2
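
For context, .index(value) is a list method that pandas.Index does not implement, which is why the list comprehension fails; a small illustration:

import pandas as pd

names = pd.Index(['AAACCC-1', 'AAACGA-1', 'TTTGTT-1'])
# names.index('AAACGA-1')             # AttributeError: 'Index' object has no attribute 'index'
print(list(names).index('AAACGA-1'))  # 1 -- works once converted to a plain list
print(names.get_loc('AAACGA-1'))      # 1 -- the idiomatic pandas lookup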

[FEATURE REQUEST] Create SCENIC class

Create a SCENIC (or SCENIC+, if allowed) class with:

  • Input: Imputed accessibility and/or cisTopicObject (ATAC), motif enrichment and annData (RNA)
    - Imputed accessibility
    - Gene expression
    - Metadata (Cell, Region, Genes)
    - Embeddings
    - Motif enrichment (as dict with Set [Set-type___Variable___Other_info] -> Region-set name -> Method -> Results)

  • Check-up for consistency:
    - Same cell names between imputed accessibility, cisTopicObject and anndata
    - Motif enrichment Set has to be included as cell data column if not starting with Topic (e.g. 'DARs___Seurat_cell_type___other_info', Seurat_cell_type has to be a column)
    - The Region-set names have to be levels of this column [These conventions will be easier to enforce when we have a cisTopic/cisTarget CLI that only runs the steps needed for SCENIC+]

  • Functions (Additional slots to fill in the object):
    - Filtering lowly accessible regions/genes (OPTIONAL)
    - Cistrome pruning
    - Region2gene links
    - eGRN (GSEA)

  • Exploratory functions (after running pipeline)
    - Plot region2gene (with option to give custom bigwigs)
    - Dot plot
    - eGRN
    - Embeddings (plot metadata, genes, regions,…)
    - Combinations TF
    - Cytoscape
    - Once we have network, predict perturbation effect (boolean modelling?)

Feel free to comment/edit/add extra things :)!!

Assertion Error running wrapper: run_pycistarget, with custom annotation

Describe the bug
I'm getting a KeyError when running the wrapper function. The function works well up until the 'Creating contrast groups' step in the motif_enrichment_dem.py script. I ran into this issue when using @Goultard59's original PR. In some line-by-line troubleshooting of my own, the fg_pr object at the point where the script stops is created properly, so I'm not entirely sure what exactly is causing this issue.

To Reproduce

from scenicplus.wrappers.run_pycistarget import run_pycistarget
import pandas as pd 

custom_annot = pd.read_csv('/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/TSS_annot.txt', header=0, sep='\t')

rankings_db = '/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/for_cistarget/Crotalus_viridis_v1_consensuspeaks.regions_vs_motifs.rankings.feather'
scores_db =  '/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/for_cistarget/Crotalus_viridis_v1_consensuspeaks.regions_vs_motifs.scores.feather'
motif_annotation = '/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/for_cistarget/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0_just_jaspar2.tbl'

run_pycistarget(
    region_sets = region_sets,
    species = 'custom',
    save_path = os.path.join(work_dir, 'motifs'),
    custom_annot = custom_annot,
    ctx_db_path = rankings_db,
    dem_db_path = scores_db,
    path_to_motif_annotations = motif_annotation,
    run_without_promoters = False,
    n_cpu = 4,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
    annotation_version = 'v1',
    )

Error output

2022-11-02 13:34:47,544 pycisTarget_wrapper INFO     /home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data/motifs folder already exists.
2022-11-02 13:34:47,552 pycisTarget_wrapper INFO     Loading cisTarget database for topics_otsu
2022-11-02 13:34:47,552 cisTarget    INFO     Reading cisTarget database
2022-11-02 13:35:01,673 pycisTarget_wrapper INFO     Running cisTarget for topics_otsu
2022-11-02 13:35:05,448	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265 
(ctx_internal_ray pid=25007) 2022-11-02 13:35:06,881 cisTarget    INFO     Running cisTarget for Topic1 which has 2868 regions
(ctx_internal_ray pid=25006) 2022-11-02 13:35:06,953 cisTarget    INFO     Running cisTarget for Topic2 which has 23 regions
(ctx_internal_ray pid=25005) 2022-11-02 13:35:07,051 cisTarget    INFO     Running cisTarget for Topic3 which has 3338 regions
(ctx_internal_ray pid=25008) 2022-11-02 13:35:07,091 cisTarget    INFO     Running cisTarget for Topic4 which has 2092 regions
(ctx_internal_ray pid=25007) 2022-11-02 13:35:10,787 cisTarget    INFO     Annotating motifs for Topic1
(ctx_internal_ray pid=25006) 2022-11-02 13:35:10,848 cisTarget    INFO     Annotating motifs for Topic2
(ctx_internal_ray pid=25005) 2022-11-02 13:35:10,865 cisTarget    INFO     Annotating motifs for Topic3
(ctx_internal_ray pid=25007) 2022-11-02 13:35:11,005 cisTarget    INFO     Getting cistromes for Topic1
(ctx_internal_ray pid=25006) 2022-11-02 13:35:11,001 cisTarget    INFO     Getting cistromes for Topic2
(ctx_internal_ray pid=25008) 2022-11-02 13:35:10,945 cisTarget    INFO     Annotating motifs for Topic4
(ctx_internal_ray pid=25007) 2022-11-02 13:35:11,052 cisTarget    INFO     Running cisTarget for Topic5 which has 1916 regions
(ctx_internal_ray pid=25006) 2022-11-02 13:35:11,077 cisTarget    INFO     Running cisTarget for Topic6 which has 1932 regions
(ctx_internal_ray pid=25008) 2022-11-02 13:35:11,071 cisTarget    INFO     Getting cistromes for Topic4
(ctx_internal_ray pid=25008) 2022-11-02 13:35:11,112 cisTarget    INFO     Running cisTarget for Topic7 which has 3149 regions
(ctx_internal_ray pid=25005) 2022-11-02 13:35:11,083 cisTarget    INFO     Getting cistromes for Topic3
(ctx_internal_ray pid=25005) 2022-11-02 13:35:11,201 cisTarget    INFO     Running cisTarget for Topic8 which has 1857 regions
(ctx_internal_ray pid=25006) 2022-11-02 13:35:11,575 cisTarget    INFO     Annotating motifs for Topic6
(ctx_internal_ray pid=25007) 2022-11-02 13:35:11,661 cisTarget    INFO     Annotating motifs for Topic5
(ctx_internal_ray pid=25006) 2022-11-02 13:35:11,718 cisTarget    INFO     Getting cistromes for Topic6
(ctx_internal_ray pid=25006) 2022-11-02 13:35:11,760 cisTarget    INFO     Running cisTarget for Topic9 which has 1328 regions
(ctx_internal_ray pid=25008) 2022-11-02 13:35:11,784 cisTarget    INFO     Annotating motifs for Topic7
(ctx_internal_ray pid=25005) 2022-11-02 13:35:11,789 cisTarget    INFO     Annotating motifs for Topic8
(ctx_internal_ray pid=25007) 2022-11-02 13:35:11,851 cisTarget    INFO     Getting cistromes for Topic5
(ctx_internal_ray pid=25005) 2022-11-02 13:35:11,924 cisTarget    INFO     Getting cistromes for Topic8
(ctx_internal_ray pid=25007) 2022-11-02 13:35:11,933 cisTarget    INFO     Running cisTarget for Topic10 which has 14 regions
(ctx_internal_ray pid=25008) 2022-11-02 13:35:11,937 cisTarget    INFO     Getting cistromes for Topic7
(ctx_internal_ray pid=25008) 2022-11-02 13:35:11,985 cisTarget    INFO     Running cisTarget for Topic11 which has 1581 regions
(ctx_internal_ray pid=25005) 2022-11-02 13:35:11,990 cisTarget    INFO     Running cisTarget for Topic12 which has 872 regions
(ctx_internal_ray pid=25007) 2022-11-02 13:35:12,227 cisTarget    INFO     Annotating motifs for Topic10
(ctx_internal_ray pid=25006) 2022-11-02 13:35:12,168 cisTarget    INFO     Annotating motifs for Topic9
(ctx_internal_ray pid=25007) 2022-11-02 13:35:12,350 cisTarget    INFO     Getting cistromes for Topic10
(ctx_internal_ray pid=25007) 2022-11-02 13:35:12,373 cisTarget    INFO     Running cisTarget for Topic13 which has 22 regions
(ctx_internal_ray pid=25006) 2022-11-02 13:35:12,419 cisTarget    INFO     Getting cistromes for Topic9
(ctx_internal_ray pid=25005) 2022-11-02 13:35:12,390 cisTarget    INFO     Annotating motifs for Topic12
(ctx_internal_ray pid=25008) 2022-11-02 13:35:12,456 cisTarget    INFO     Annotating motifs for Topic11
(ctx_internal_ray pid=25006) 2022-11-02 13:35:12,560 cisTarget    INFO     Running cisTarget for Topic14 which has 945 regions
(ctx_internal_ray pid=25005) 2022-11-02 13:35:12,616 cisTarget    INFO     Getting cistromes for Topic12
(ctx_internal_ray pid=25007) 2022-11-02 13:35:12,707 cisTarget    INFO     Annotating motifs for Topic13
(ctx_internal_ray pid=25008) 2022-11-02 13:35:12,698 cisTarget    INFO     Getting cistromes for Topic11
(ctx_internal_ray pid=25007) 2022-11-02 13:35:12,850 cisTarget    INFO     Getting cistromes for Topic13
(ctx_internal_ray pid=25005) 2022-11-02 13:35:12,762 cisTarget    INFO     Running cisTarget for Topic15 which has 1362 regions
(ctx_internal_ray pid=25008) 2022-11-02 13:35:12,860 cisTarget    INFO     Running cisTarget for Topic16 which has 1492 regions
(ctx_internal_ray pid=25006) 2022-11-02 13:35:13,013 cisTarget    INFO     Annotating motifs for Topic14
(ctx_internal_ray pid=25006) 2022-11-02 13:35:13,199 cisTarget    INFO     Getting cistromes for Topic14
(ctx_internal_ray pid=25008) 2022-11-02 13:35:13,220 cisTarget    INFO     Annotating motifs for Topic16
(ctx_internal_ray pid=25005) 2022-11-02 13:35:13,211 cisTarget    INFO     Annotating motifs for Topic15
(ctx_internal_ray pid=25008) 2022-11-02 13:35:13,355 cisTarget    INFO     Getting cistromes for Topic16
(ctx_internal_ray pid=25005) 2022-11-02 13:35:13,415 cisTarget    INFO     Getting cistromes for Topic15
2022-11-02 13:35:16,123 cisTarget    INFO     Done!
2022-11-02 13:35:16,124 pycisTarget_wrapper INFO     /home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data/motifs/CTX_topics_otsu_All folder already exists.
2022-11-02 13:35:16,319 pycisTarget_wrapper INFO     Running DEM for topics_otsu
2022-11-02 13:35:16,320 DEM          INFO     Reading DEM database
2022-11-02 13:35:32,955 DEM          INFO     Creating contrast groups
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py:3800, in Index.get_loc(self, key, method, tolerance)
   3799 try:
-> 3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/_libs/index.pyx:138, in pandas._libs.index.IndexEngine.get_loc()

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/_libs/index.pyx:165, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5745, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5753, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'Chromosome'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [16], line 6
      2 import pandas as pd 
      4 custom_annot = pd.read_csv('/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/TSS_annot.txt', header=0, sep='\t')
----> 6 run_pycistarget(
      7     region_sets = region_sets,
      8     species = 'custom',
      9     save_path = os.path.join(work_dir, 'motifs'),
     10     custom_annot = custom_annot,
     11     ctx_db_path = rankings_db,
     12     dem_db_path = scores_db,
     13     path_to_motif_annotations = motif_annotation,
     14     run_without_promoters = False,
     15     n_cpu = 4,
     16     _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
     17     annotation_version = 'v1',
     18     )

File ~/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/scenicplus/src/scenicplus/wrappers/run_pycistarget.py:263, in run_pycistarget(region_sets, species, save_path, custom_annot, save_partial, ctx_db_path, dem_db_path, run_without_promoters, biomart_host, promoter_space, ctx_auc_threshold, ctx_nes_threshold, ctx_rank_threshold, dem_log2fc_thr, dem_motif_hit_thr, dem_max_bg_regions, annotation, motif_similarity_fdr, path_to_motif_annotations, annotation_version, n_cpu, _temp_dir, exclude_motifs, exclude_collection, **kwargs)
    261     for col in exclude_collection:
    262         dem_db.db_scores = dem_db.db_scores[~dem_db.db_scores.index.str.contains(col)]
--> 263 menr['DEM_'+key+'_All'] = DEM(dem_db = dem_db,
    264                    region_sets = regions,
    265                    log2fc_thr = dem_log2fc_thr,
    266                    motif_hit_thr = dem_motif_hit_thr,
    267                    max_bg_regions = dem_max_bg_regions,
    268                    specie = species,
    269                    genome_annotation = annot_dem,
    270                    promoter_space = promoter_space,
    271                    motif_annotation =   annotation,
    272                    motif_similarity_fdr = motif_similarity_fdr, 
    273                    path_to_motif_annotations = path_to_motif_annotations,
    274                    n_cpu = n_cpu,
    275                    annotation_version = annotation_version,
    276                    tmp_dir = save_path,
    277                    _temp_dir= _temp_dir,
    278                    **kwargs)
    279 out_folder = os.path.join(save_path,'DEM_'+key+'_All')
    280 check_folder = os.path.isdir(out_folder)

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:319, in DEM.__init__(self, dem_db, region_sets, specie, subset_motifs, contrasts, name, max_bg_regions, adjpval_thr, log2fc_thr, mean_fg_thr, motif_hit_thr, n_cpu, fraction_overlap, cluster_buster_path, path_to_genome_fasta, path_to_motifs, genome_annotation, promoter_space, path_to_motif_annotations, annotation_version, motif_annotation, motif_similarity_fdr, orthologous_identity_threshold, tmp_dir, **kwargs)
    317 self.cistromes = None
    318 if dem_db is not None:
--> 319     self.run(dem_db.db_scores, **kwargs)

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:365, in DEM.run(self, dem_db_scores, **kwargs)
    363 # Get region groups
    364 log.info('Creating contrast groups')
--> 365 region_groups = [create_groups(contrast = contrasts[x],
    366                            region_sets_names = region_sets_names,
    367                            max_bg_regions = self.max_bg_regions,
    368                            path_to_genome_fasta = self.path_to_genome_fasta,
    369                            path_to_regions_fasta = os.path.join(self.tmp_dir, contrasts_names[x] +'.fa'),  
    370                            cbust_path = self.cluster_buster_path,
    371                            path_to_motifs = self.path_to_motifs,
    372                            annotation = self.genome_annotation,
    373                            promoter_space = self.promoter_space,
    374                            motifs = dem_db_scores.index.tolist(),
    375                            n_cpu = self.n_cpu,
    376                            **kwargs) for x in range(len(contrasts))]
    378 # Compute p-val and log2FC
    379 if self.n_cpu > len(region_groups):

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:365, in <listcomp>(.0)
    363 # Get region groups
    364 log.info('Creating contrast groups')
--> 365 region_groups = [create_groups(contrast = contrasts[x],
    366                            region_sets_names = region_sets_names,
    367                            max_bg_regions = self.max_bg_regions,
    368                            path_to_genome_fasta = self.path_to_genome_fasta,
    369                            path_to_regions_fasta = os.path.join(self.tmp_dir, contrasts_names[x] +'.fa'),  
    370                            cbust_path = self.cluster_buster_path,
    371                            path_to_motifs = self.path_to_motifs,
    372                            annotation = self.genome_annotation,
    373                            promoter_space = self.promoter_space,
    374                            motifs = dem_db_scores.index.tolist(),
    375                            n_cpu = self.n_cpu,
    376                            **kwargs) for x in range(len(contrasts))]
    378 # Compute p-val and log2FC
    379 if self.n_cpu > len(region_groups):

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:522, in create_groups(contrast, region_sets_names, max_bg_regions, path_to_genome_fasta, path_to_regions_fasta, cbust_path, path_to_motifs, annotation, promoter_space, motifs, n_cpu, **kwargs)
    520 # Nr of promoters in the foreground
    521 fg_pr_overlap = pr.PyRanges(region_names_to_coordinates(foreground)).count_overlaps(annotation)
--> 522 fg_pr = coord_to_region_names(fg_pr_overlap[fg_pr_overlap.NumberOverlaps != 0])
    523 if len(fg_pr) == len(foreground):
    524     nr_pr = max_bg_regions

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/utils.py:18, in coord_to_region_names(coord)
     16 if isinstance(coord, pr.PyRanges):
     17     coord = coord.as_df()
---> 18     return list(coord['Chromosome'].astype(str) + ':' + coord['Start'].astype(str) + '-' + coord['End'].astype(str))

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
   3803 if self.columns.nlevels > 1:
   3804     return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
   3806 if is_integer(indexer):
   3807     indexer = [indexer]

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:
-> 3802     raise KeyError(key) from err
   3803 except TypeError:
   3804     # If we have a listlike key, _check_indexing_error will raise
   3805     #  InvalidIndexError. Otherwise we fall through and re-raise
   3806     #  the TypeError.
   3807     self._check_indexing_error(key)

KeyError: 'Chromosome'

Screenshots
Here's what my custom_annot pandas df looks like:
[screenshot: custom_annot DataFrame]

Here's the structure of my region_sets:
[screenshot: region_sets structure]

Version (please complete the following information):

  • Python: 3.8.13
  • SCENIC+: '0.1.dev452+g22735ca'

Additional context
Thank you for your help!
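
One thing worth ruling out (an assumption based on the traceback, not a confirmed diagnosis): create_groups converts the annotation into a PyRanges, and the downstream helper coord_to_region_names expects the canonical column names Chromosome/Start/End. If the TSS table uses different headers, renaming them before calling the wrapper is a cheap test:

import pandas as pd
import pyranges as pr

custom_annot = pd.read_csv('TSS_annot.txt', sep='\t')
# Hypothetical original headers -- adjust to whatever your file actually uses.
custom_annot = custom_annot.rename(columns={
    'chrom': 'Chromosome',
    'chromStart': 'Start',
    'chromEnd': 'End',
    'strand': 'Strand'})
pr.PyRanges(custom_annot)  # raises immediately if required columns are missing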

error when running run_scenicplus

Hi, Authors.Thank you for developing such a great tool.
When I run run_scenicplus for my multiome data, I get an error like the following:
[Screenshot taken 2022-09-29 15:55, showing the error]
This error seems to have appeared in previous versions.
Any suggestion to solve this problem?
Thanks!

Feature request: Dotplot

Create a function for exporting a dot plot with color and size corresponding, respectively, to the fold-change expression of the TF and the NES score of the motif.

build_grn hanging

Hello,

Thank you very much for developing this wonderful tool :)

I have noticed that the function build_grn from scenicplus.grn_builder.gsea_approach in some cases seems to hang at the "Running GSEA..." step for a very long time (see screenshot below). I am running it on a relatively small scenicplus object obtained by creating metacells from unpaired scRNA-seq and scATAC-seq data as per the tutorial.

SCENIC+ object with n_cells x n_genes = 228 x 21278 and n_cells x n_regions = 228 x 152615

The notebook has now been running for over 10 hours with 20 cpu and it still has not finished.

I did an identical analysis on another dataset of comparable size (roughly the same number of metacells) and build_grn finished running in less than an hour, so I think something weird is going on here. Do you have any insights as to why this might happen? Any suggestions would be greatly appreciated.

Version:

  • Python: Python 3.8
  • SCENIC+: 0.1.dev437+ga57717f

[Screenshot 2022-09-27 06:52: build_grn stuck at the "Running GSEA..." step]

[BUG REPORT] Progress bars printing a million times in jupyter notebooks

Describe the bug
When running build_egrn in jupyter, progress bars are printed constantly. This doesn't happen for R2G and TF2G. A possible solution would be to use tqdm.notebook instead, but this may cause issues when running in a console.

To Reproduce
Run build_egrn in jupyter.

Error output
[screenshot: repeatedly printed progress bars]

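
For what it's worth, tqdm.auto already does that dispatch: it renders a widget-based bar under Jupyter and falls back to the plain console bar otherwise, so swapping the import may cover both cases (a sketch, assuming the bars come from tqdm):

# Instead of `from tqdm import tqdm`:
from tqdm.auto import tqdm  # notebook widget in Jupyter, text bar in a console

for _ in tqdm(range(100), desc='demo'):
    pass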

Using CisTopic object generated through R for SCENIC+ analysis

Hello! Thanks for developing SCENIC+. I really want to use this package to redo some of my multiome analyses. However, I have already done a fair bit of work using Seurat/Signac and cisTopic in R prior to this package coming out. I was wondering if I can use the cisTopic object I generated through R as the input for SCENIC+. I have done the topic analysis and most of the initial preprocessing in R, and it would be great if I could just import that into Python. Please let me know if this is possible.

Thanks
Debbie

Are SCENIC+ results stochastic?

Hello, I just have a quick request for clarification. I know that SCENIC has some stochasticity, such that different runs will return slightly different regulons/target gene sets. Is the same true for SCENIC+?

I was expecting that this would be the case, but my initial inspection of a small subset of the results of two independent runs of SCENIC+ on the same data (using the run_scenicplus wrapper) suggests that the target genes of the regulons and the various metadata column values are the same between these two runs. So I wanted to check if this is expected, or if there's an issue with my analysis!

Thank you!

db_fpath and motif_annot_fpath files.

Hi,

Could you please give a link to download the files below?

rankings_db = os.path.join(db_fpath, 'cluster_SCREEN.regions_vs_motifs.rankings.v2.feather')
scores_db = os.path.join(db_fpath, 'cluster_SCREEN.regions_vs_motifs.scores.v2.feather')
motif_annotation = os.path.join(motif_annot_fpath, 'motifs-v10-nr.hgnc-m0.00001-o0.0.tbl')

I am mainly using the 10x multiome tutorial to troubleshoot and get SCENIC+ to run on our cluster. I also changed my paths as below, so I assume these files have to be downloaded or generated:

db_fpath = "/staging/leuven/stg_00002/lcb/icistarget/data/make_rankings/v10_clust/CTX_hg38"
motif_annot_fpath = "/staging/leuven/stg_00002/lcb/cbravo/cluster_motif_collection_V10_no_desso_no_factorbook/snapshots"

was changed to 

db_fpath = "db_fpath_files/"
motif_annot_fpath = "motif_annot_fpath_files/"

I think it would be a good idea to include the links for such files in the jupyter notebook - useful if someone is trying to get the tool to run and then work on generating intermediate files.

Thanks for the help!
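
For reference, the precomputed databases are hosted on resources.aertslab.org; the paths below are my best guess at the current layout for the hg38 SCREEN region-based databases and the v10 motif-to-TF annotation, so double-check them against the download index before use:

$ wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather
$ wget https://resources.aertslab.org/cistarget/databases/homo_sapiens/hg38/screen/mc_v10_clust/region_based/hg38_screen_v10_clust.regions_vs_motifs.scores.feather
$ wget https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl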

Dataset for 10x pbmc multiome

Hi there I was trying to test the 10x pbmc multiome tutorial but got trouble with the datasets below:
rankings_db = os.path.join(db_fpath, 'cluster_SCREEN.regions_vs_motifs.rankings.v2.feather')
scores_db = os.path.join(db_fpath, 'cluster_SCREEN.regions_vs_motifs.scores.v2.feather')
I found similar ones, but not those exact files: "hg38_screen_v10_clust.regions_vs_motifs.rankings.feather" and "hg38_screen_v10_clust.regions_vs_motifs.scores.feather". The files seem a bit too big, so I'm wondering if there are smaller files that I could try. Thank you,

Clarification request: what scRNA-seq data is needed to create the SCENIC+ object?

I'm working through the tutorial here but with my own data that was processed with Seurat.

To facilitate modularity / interoperability, could you give a bit more detail about what data needs to be in the object provided as the GEX_anndata argument to create_SCENICPLUS_object()? From this line GEX_anndata = adata.raw.to_adata(), it looks like you're basically just passing in the original counts, but I haven't worked through the full scanpy processing so I'm not sure. Is there any other data that needs to be included in that object? (normalised data, PCA, etc...)

In general the tutorial is really nice and clear so far. Just for the purpose of interoperability with other tools, it would be super helpful to have a list up top of all the input data needed, so it's all in one place and users can gather the necessary data in the right formats in advance.
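
To make the requirement concrete, a minimal sketch of assembling such an object from matrices exported out of Seurat (file names and export format here are illustrative assumptions, not part of the tutorial):

import anndata as ad
import pandas as pd
import scipy.io as sio

# Illustrative: raw counts written from Seurat as MatrixMarket plus barcode/feature lists.
counts = sio.mmread('raw_counts.mtx').tocsr().T          # cells x genes
adata = ad.AnnData(X=counts)
adata.obs_names = pd.read_csv('barcodes.txt', header=None)[0].values
adata.var_names = pd.read_csv('features.txt', header=None)[0].values
# Raw counts in .X (no normalisation, PCA, etc.) are what the tutorial passes
# to create_SCENICPLUS_object via GEX_anndata.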

[FEATURE REQUEST] Allow to draw GSEA enrichment curves

Is your feature request related to a problem? Please describe.
NA

Describe the solution you'd like
Allow drawing GSEA curves given a TF ranking and a cistrome. Reusing the GSEApy function should be possible. This could be used as a manual check-up of the leading edge and to compare with using the whole cistrome. Adding dichotomize options would be a plus.

Describe alternatives you've considered
https://gseapy.readthedocs.io/en/latest/gseapy_example.html

Additional context
NA

How to speed up the "calculate_TFs_to_genes_relationships" function wrapped in "run_scenicplus"?

Dear authors,

Thank you for providing such a great tool for regulatory analysis. I am trying to use it on my non-multiome dataset (separate scRNA and scATAC datasets).

Task
Here I am running the functions wrapped in run_scenicplus step by step. While running the function calculate_TFs_to_genes_relationships with the following command:

calculate_TFs_to_genes_relationships(scplus_obj, 
                tf_file = tf_file,
                ray_n_cpu = 20, 
                method = 'GBM',
                _temp_dir = _temp_dir,
                key= 'TF2G_adj',
                object_store_memory = 400E9)

Problems
I found the job is very slow and consumes a lot of memory.
I am running this job on a node with 750G of memory and 20 CPU cores. It is requesting hundreds of hours to finish. Here are the last few lines of the job output. I didn't let the job finish because it also keeps getting slower and slower.

initializing:   2%|▏         | 564/29773 [1:37:21<84:50:12, 10.46s/it]
initializing:   2%|▏         | 565/29773 [1:37:55<141:31:38, 17.44s/it]
initializing:   2%|▏         | 566/29773 [1:38:08<130:56:20, 16.14s/it]
initializing:   2%|▏         | 567/29773 [1:39:48<335:17:22, 41.33s/it]
initializing:   2%|▏         | 568/29773 [1:39:57<255:00:18, 31.43s/it]
initializing:   2%|▏         | 569/29773 [1:39:57<179:15:03, 22.10s/it]

Dataset description
The dataset I am using has 5590 metacells (generated from 5590*5 cells), 29773 genes, and 249968 regions. I think this is a reasonable dataset size compared with the pbmc dataset posted in the first tutorial.

scplus_obj.X_ACC.shape = (249968, 5590)
scplus_obj.X_EXP.shape = (5590, 29773)

The TF list I am using is utoronto_human_tfs_v_1.01.txt from the first tutorial.

The methods I tried
I tried changing the regression method to RF, but its performance is similar.
I also tried subsetting shorter TF lists and doing a manual parallelization, but the improvement is marginal.

Questions
The number of metacells is smaller than 6k, so why is it significantly slower than the pbmc dataset in the tutorial? Is there a way to improve the speed and lower the memory usage? Because it runs slower and slower, I imagine it would be difficult for this job to finish even within one week.

Version information
SCENIC+: '0.1.dev445+g615c88b'

Thanks,
Ethan
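
One knob that may help (my suggestion, not an official recommendation): the runtime of this step grows with the number of target genes, so filtering lowly expressed genes more aggressively before TF-to-gene inference shrinks the 29773-gene workload. The preprocessing helper used earlier in the tutorials takes a percentage cutoff:

from scenicplus.preprocessing.filtering import filter_genes

# Raising min_pct (percentage of cells in which a gene must be detected)
# trims the target gene set before calculate_TFs_to_genes_relationships.
filter_genes(scplus_obj, min_pct = 5)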

[BUG] SCENICPLUS.to_df not reading arguments

After pickling and loading the SCENICPLUS object, the .to_df method does not read keyword arguments properly.

In [3]: scplus_obj.to_df(layer = 'EXP')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-107-1cc95275858c> in <module>
----> 1 scplus_obj.to_df(layer = 'EXP')

TypeError: to_df() missing 1 required positional argument: 'layer'
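
A workaround while the method is broken on unpickled objects: rebuild the DataFrame from the underlying slots (a sketch assuming the object exposes X_EXP together with cell_names and gene_names, as in scenicplus_class.py):

import pandas as pd

# Equivalent of to_df('EXP'), built by hand from the object's slots.
exp_df = pd.DataFrame(scplus_obj.X_EXP,
                      index=scplus_obj.cell_names,
                      columns=scplus_obj.gene_names)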

No cistromes found

Describe the bug

I am trying to run SCENIC+ on a pig dataset. I have generated motif scores following the steps in create_cisTarget_databases. I had to make multiple changes in the codebase to add support for this custom workflow, but I am running into issues with the merge_cistromes step.

run_pycistarget goes through without issues and creates cistromes:

run_pycistarget(
    region_sets=region_sets,
    species=species,
    save_path=os.path.join(out_dir, "motifs"),
    ctx_db_path=rankings_db,
    dem_db_path=scores_db,
    path_to_motif_annotations=motif_annotation,
    run_without_promoters=True,
    n_cpu=4,
    _temp_dir=os.path.join(tmp_dir, "ray_spill"),
    annotation_version="2022",
)

[screenshot: run_pycistarget log output]

The DEM hits make sense:
[screenshot: DEM motif enrichment results]

However, at the merge_cistromes(scplus_obj) step, I get the error "No cistromes found! Make sure that the motif enrichment results look good!"

To Reproduce

I am not sure how to provide a reproducible example, but if you could suggest a workflow for running SCENIC+ on custom organisms, that would be wonderful! It is quite possible that I am missing something obvious (possibly because I had to change the code to handle a new motif database).

Error output

    84 
     85     if len(signatures_direct.keys()) == 0 and len(signatures_extend.keys()) == 0:
---> 86         raise ValueError("No cistromes found! Make sure that the motif enrichment results look good!")
     87 
     88     #merge regions by TF name

ValueError: No cistromes found! Make sure that the motif enrichment results look good!

Version (please complete the following information):

  • Python: 3.7.10

  • SCENIC+: '0.1.dev445+g615c88b.d20220902'

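
A diagnostic that may narrow this down (my suggestion, assuming cistromes are built from the motif-to-TF annotation): check whether the motif IDs returned by the enrichment step are present in the .tbl annotation at all; for a custom motif collection a naming mismatch there would leave every cistrome empty.

import pandas as pd

# Hypothetical check; adjust the menr key and result layout to your run.
annot = pd.read_csv(motif_annotation, sep='\t')
enriched = menr['DEM_topics_otsu_All'].motif_enrichment['Topic1'].index
overlap = annot['#motif_id'].isin(enriched).sum()
print(f'{overlap} of {len(enriched)} enriched motifs appear in the annotation table')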

Can't install with Python 3.10 - can you provide a container image?

It's not possible to install MACS2 with Python 3.10 at the moment (see here and here). It's not clear to me exactly how MACS2 is used within SCENIC+ - would it be possible to provide a path to an already-installed MACS2 binary instead of requiring it as a package dependency? (for unfortunate local set-up reasons I can't use Python 3.9, and probably not earlier versions either, as other dependencies fail... I realise this makes this kind of a niche issue)

Alternatively, would it be possible for you to provide a pre-built Docker or Singularity image? I see that you provide instructions for building an image, but we only have Singularity installed on our HPC, not Docker, so I'm not sure if I will be able to run these.
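
Pending an official image, one generic route is to build the Docker image on any machine where Docker is available and convert it for Singularity (a sketch, not project-specific instructions):

$ docker build -t scenicplus .                # on a machine with Docker, from the repo's Dockerfile
$ docker save scenicplus -o scenicplus.tar
$ singularity build scenicplus.sif docker-archive://scenicplus.tar   # on the HPC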

Question regarding the r2g/tf2g importance scores

If I understand the GENIE3/GRNBoost algorithms correctly, we first normalize the variance per target to 1, such that the importance scores per target sum to 1 (given that the prediction is almost perfect), which facilitates the comparison of importance scores for different target genes.

Why do the importance scores per gene sum to 1 in the region-gene data frame, but not in the TF-gene data frame?

This is, for example, what the sum of importance scores per gene looks like:

[screenshot: per-gene sums of importance scores]
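
For anyone who wants to reproduce the comparison, both adjacency tables can be summed per target directly (the uns keys and the TF/target/importance column layout below are assumptions based on the tutorial defaults):

# Sum of importance scores per target gene in both adjacency tables.
r2g_sums = scplus_obj.uns['region_to_gene'].groupby('target')['importance'].sum()
tf2g_sums = scplus_obj.uns['TF2G_adj'].groupby('target')['importance'].sum()
print(r2g_sums.head())
print(tf2g_sums.head())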

Tutorial 10x multiome - ATAC processing

Describe the bug
Hi, I am trying to run the tutorial for 10x multiome but I am getting an error when processing ATAC seq part.

To Reproduce

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
                 variable = 'celltype',                                                                     # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
                 sample_id_col = 'sample_id',
                 chromsizes = chromsizes,
                 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),  # specify where pseudobulk_bed_files should be stored
                 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
                 path_to_fragments = fragments_dict,                                                        # location of fragment fiels
                 n_cpu = 8,                                                                                 # specify the number of cores to use, we use ray for multi processing
                 normalize_bigwig = True,
                 remove_duplicates = True,
                 _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
                 split_pattern = '-')

Error output

PermissionError Traceback (most recent call last)
Cell In [25], line 2
1 from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
----> 2 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
3 variable = 'celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
4 sample_id_col = 'sample_id',
5 chromsizes = chromsizes,
6 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored
7 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
8 path_to_fragments = fragments_dict, # location of fragment fiels
9 n_cpu = 8, # specify the number of cores to use, we use ray for multi processing
10 normalize_bigwig = True,
11 remove_duplicates = True,
12 _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
13 split_pattern = '-')

File ~/.conda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:164, in export_pseudobulk(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, use_polars, **kwargs)
162 # Create pseudobulks
163 if n_cpu > 1:
--> 164 ray.init(num_cpus=n_cpu, **kwargs)
165 ray_handle = ray.wait(
166 [
167 export_pseudobulk_ray.remote(
(...)
181 num_returns=len(groups),
182 )
183 ray.shutdown()

File ~/.local/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/.local/lib/python3.8/site-packages/ray/_private/worker.py:1420, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, **kwargs)
1378 ray_params = ray._private.parameter.RayParams(
1379 node_ip_address=node_ip_address,
1380 raylet_ip_address=raylet_ip_address,
(...)
1414 node_name=_node_name,
1415 )
1416 # Start the Ray processes. We set shutdown_at_exit=False because we
1417 # shutdown the node in the ray.shutdown call that happens in the atexit
1418 # handler. We still spawn a reaper process in case the atexit handler
1419 # isn't called.
-> 1420 _global_node = ray._private.node.Node(
1421 head=True, shutdown_at_exit=False, spawn_reaper=True, ray_params=ray_params
1422 )
1423 else:
1424 # In this case, we are connecting to an existing cluster.
1425 if num_cpus is not None or num_gpus is not None:

File ~/.local/lib/python3.8/site-packages/ray/_private/node.py:198, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
195 else:
196 self._webui_url = ray._private.services.get_webui_url_from_internal_kv()
--> 198 self._init_temp()
200 # Validate and initialize the persistent storage API.
201 if head:

File ~/.local/lib/python3.8/site-packages/ray/_private/node.py:398, in Node._init_temp(self)
390 temp_dir = ray._private.utils.internal_kv_get_with_retry(
391 self.get_gcs_client(),
392 "temp_dir",
393 ray_constants.KV_NAMESPACE_SESSION,
394 num_retries=NUM_REDIS_GET_RETRIES,
395 )
396 self._temp_dir = ray._private.utils.decode(temp_dir)
--> 398 try_to_create_directory(self._temp_dir)
400 if self.head:
401 self._session_dir = os.path.join(self._temp_dir, self.session_name)

File ~/.local/lib/python3.8/site-packages/ray/_private/utils.py:916, in try_to_create_directory(directory_path)
910 """Attempt to create a directory that is globally readable/writable.
911
912 Args:
913 directory_path: The path of the directory to create.
914 """
915 directory_path = os.path.expanduser(directory_path)
--> 916 os.makedirs(directory_path, exist_ok=True)
917 # Change the log directory permissions so others can use it. This is
918 # important when multiple people are using the same machine.
919 try_make_directory_shared(directory_path)

File ~/.conda/envs/scenicplus/lib/python3.8/os.py:213, in makedirs(name, mode, exist_ok)
211 if head and tail and not path.exists(head):
212 try:
--> 213 makedirs(head, exist_ok=exist_ok)
214 except FileExistsError:
215 # Defeats race condition when another thread created the path
216 pass

File ~/.conda/envs/scenicplus/lib/python3.8/os.py:213, in makedirs(name, mode, exist_ok)
211 if head and tail and not path.exists(head):
212 try:
--> 213 makedirs(head, exist_ok=exist_ok)
214 except FileExistsError:
215 # Defeats race condition when another thread created the path
216 pass

File ~/.conda/envs/scenicplus/lib/python3.8/os.py:213, in makedirs(name, mode, exist_ok)
211 if head and tail and not path.exists(head):
212 try:
--> 213 makedirs(head, exist_ok=exist_ok)
214 except FileExistsError:
215 # Defeats race condition when another thread created the path
216 pass

File ~/.conda/envs/scenicplus/lib/python3.8/os.py:223, in makedirs(name, mode, exist_ok)
221 return
222 try:
--> 223 mkdir(name, mode)
224 except OSError:
225 # Cannot rely on checking for EEXIST, since the operating system
226 # could give priority to other errors like EACCES or EROFS
227 if not exist_ok or not path.isdir(name):

PermissionError: [Errno 13] Permission denied: '/scratch/leuven'


Version (please complete the following information):

  • Python: 3.8
  • SCENIC+: '0.1.dev452+g22735ca'

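
The bottom of the trace shows Ray trying to create its temp directory under /scratch/leuven, which is a cluster-specific path from the tutorial; pointing tmp_dir at a directory that exists and is writable on your own system should avoid the PermissionError:

import os

# Choose a temp location you can write to on your own machine/cluster
# (the path below is a placeholder, not a real path).
tmp_dir = '/path/you/can/write/to/tmp'
os.makedirs(os.path.join(tmp_dir, 'ray_spill'), exist_ok=True)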

cannot find blacklist file.

Hello,

I am going through the first tutorial (10x multiomics) and my jupyter notebook stopped as it could not locate the blacklist bed file for the example datasets (specifically, hg38-blacklist.v2.bed).

Code in the tutorial:

 from pycisTopic.iterative_peak_calling import *
 # Other param
 peak_half_width = 250
 path_to_blacklist= '../../pycisTopic/blacklist/hg38-blacklist.v2.bed'
 # Get consensus peaks
 consensus_peaks=get_consensus_peaks(narrow_peaks_dict, peak_half_width, chromsizes=chromsizes, path_to_blacklist=path_to_blacklist)

This is the error that is returned:

2022-09-19 12:47:05,944 cisTopic     INFO     Extending and merging peaks per class
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
Cell In [29], line 6
      4 path_to_blacklist= '../../pycisTopic/blacklist/hg38-blacklist.v2.bed'
      5 # Get consensus peaks
----> 6 consensus_peaks=get_consensus_peaks(narrow_peaks_dict, peak_half_width, chromsizes=chromsizes, path_to_blacklist=path_to_blacklist)

File ~/.conda/envs/py_3_8/lib/python3.8/site-packages/pycisTopic/iterative_peak_calling.py:65, in get_consensus_peaks(narrow_peaks_dict, peak_half_width, chromsizes, path_to_blacklist)
     62     chromsizes = pr.PyRanges(chromsizes)
     64 log.info("Extending and merging peaks per class")
---> 65 center_extended_peaks = [
     66     iterative_peak_filtering(
     67         calculate_peaks_and_extend(
     68             narrow_peaks_dict[x], peak_half_width, chromsizes, path_to_blacklist
     69         )
     70     ).df
     71     for x in list(narrow_peaks_dict.keys())
     72 ]
     73 log.info("Normalizing peak scores")
     74 center_extended_peaks_norm = [cpm(x, "Score") for x in center_extended_peaks]

File ~/.conda/envs/py_3_8/lib/python3.8/site-packages/pycisTopic/iterative_peak_calling.py:67, in <listcomp>(.0)
     62     chromsizes = pr.PyRanges(chromsizes)
     64 log.info("Extending and merging peaks per class")
     65 center_extended_peaks = [
     66     iterative_peak_filtering(
---> 67         calculate_peaks_and_extend(
     68             narrow_peaks_dict[x], peak_half_width, chromsizes, path_to_blacklist
     69         )
     70     ).df
     71     for x in list(narrow_peaks_dict.keys())
     72 ]
     73 log.info("Normalizing peak scores")
     74 center_extended_peaks_norm = [cpm(x, "Score") for x in center_extended_peaks]

File ~/.conda/envs/py_3_8/lib/python3.8/site-packages/pycisTopic/iterative_peak_calling.py:142, in calculate_peaks_and_extend(narrow_peaks, peak_half_width, chromsizes, path_to_blacklist)
    138     center_extended_peaks = center_extended_peaks.intersect(
    139         chromsizes, how="containment"
    140     )
    141 if isinstance(path_to_blacklist, str):
--> 142     blacklist = pr.read_bed(path_to_blacklist)
    143     center_extended_peaks = center_extended_peaks.overlap(blacklist, invert=True)
    144 return center_extended_peaks

File ~/.conda/envs/py_3_8/lib/python3.8/site-packages/pyranges/readers.py:79, in read_bed(f, as_df, nrows)
     77     first_start = gzip.open(f).readline().split()[1]
     78 else:
---> 79     first_start = open(f).readline().split()[1]
     81 header = None
     83 try:

FileNotFoundError: [Errno 2] No such file or directory: '../../pycisTopic/blacklist/hg38-blacklist.v2.bed'

I also looked in pycisTopic's repo and could not find this file - is it supposed to be automatically created in the context of the sample data used in the tutorial?
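
The relative path in the tutorial points into a local clone of the pycisTopic repository, so it only resolves if the repo sits two directories up from the notebook. The same file is also distributed in the ENCODE blacklist repository, e.g.:

$ wget https://github.com/Boyle-Lab/Blacklist/raw/master/lists/hg38-blacklist.v2.bed.gz
$ gunzip hg38-blacklist.v2.bed.gz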

[BUG] Check if regions in search space are in imputed matrix and vice versa

  • Information

I get an error when I have regions in my search space that are not in the imputed accessibility matrix. It would be useful to have a check-up for this, since we may filter lowly accessible regions from the imputed accessibility matrix in this step while wanting to use a previously computed search space (e.g. from pycistopic's gene activity) with all regions.

  • Error
2021-09-10 11:56:58,761 R2G          INFO     Loading imputed accessbility object.
2021-09-10 11:57:06,629 R2G          INFO     Normalizing region cell probabilities ...
2021-09-10 11:57:06,629 cisTopic     INFO     Normalizing imputed data
2021-09-10 11:58:24,364 cisTopic     INFO     Done!
2021-09-10 11:58:24,364 R2G          INFO     Loading expression matrix.
2021-09-10 11:58:25,003 R2G          INFO     Normalizing expression data ...
2021-09-10 11:58:26,193 R2G          INFO     setting expression and accessbility data to common index.
2021-09-10 11:58:35,978 R2G          INFO     Loading search space
2021-09-10 11:58:38,346 R2G          INFO     Calculating region to gene importances
2021-09-10 11:58:41,218	INFO services.py:1253 -- View the Ray dashboard at http://127.0.0.1:8265
"['chr18:64913030-64913530', 'chr18:64968944-64969444', 'chr18:65012037-65012537', 'chr18:65165801-65166301', 'chr18:65008011-65008511', 'chr18:65112785-65113285', 'chr18:64998293-64998793', 'chr18:65012567-65013067', 'chr18:65111959-65112459', 'chr18:64908810-64909310', 'chr18:65177787-65178287', 'chr18:64966196-64966696', 'chr18:65079405-65079905', 'chr18:65113568-65114068', 'chr18:65153469-65153969', 'chr18:65161119-65161619', 'chr18:65217439-65217939', 'chr18:64627572-64628072', 'chr18:64786105-64786605', 'chr18:64507120-64507620', 'chr18:64626910-64627410', 'chr18:64707608-64708108', 'chr18:64634352-64634852', 'chr18:64619821-64620321', 'chr18:64840199-64840699', 'chr18:64656533-64657033', 'chr18:64597944-64598444', 'chr18:64512076-64512576', 'chr18:64513978-64514478', 'chr18:64829805-64830305', 'chr18:64782885-64783385', 'chr18:64812477-64812977', 'chr18:64651445-64651945', 'chr18:64585397-64585897', 'chr18:64716974-64717474', 'chr18:64629764-64630264', 'chr18:64616068-64616568', 'chr18:64655770-64656270', 'chr18:64677657-64678157', 'chr18:64683157-64683657', 'chr18:64753967-64754467', 'chr18:65262701-65263201'] not in index"
Traceback (most recent call last):
  File "/data/leuven/software/biomed/skylake_centos7/2018a/software/Python/3.7.4-GCCcore-6.4.0/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/leuven/software/biomed/skylake_centos7/2018a/software/Python/3.7.4-GCCcore-6.4.0/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/cli/benchmarking.py", line 170, in <module>
    main()
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/cli/benchmarking.py", line 166, in main
    args.func(args)
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/cli/benchmarking.py", line 80, in region_to_gene_command
    _temp_dir = temp_dir
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/enhancer_to_gene.py", line 479, in calculate_regions_to_genes_relationships
    **kwargs)
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/enhancer_to_gene.py", line 419, in score_regions_to_genes
    regions_to_genes = {gene: regions_to_gene for gene, regions_to_gene in zip(genes_to_use, regions_to_genes)}
UnboundLocalError: local variable 'regions_to_genes' referenced before assignment
  • Solution

Filter out the regions that have been removed from the search space:

ss_filt = ss[ss['Name'].isin(imp_acc.feature_names)]

error for `export_pseudobulk`

Hi, thanks for the great tools.
I get a RuntimeError when I run export_pseudobulk. I don't know why it runs fine for some of the cell types but fails when running all of them. What's more, it also errors when run with two or more fragments files.

Error output
`2022-10-23 19:48:05,770 cisTopic INFO Reading fragments from /00.data/07.fragments/Inj_5DPI_A211213013_5col_sorted.bed.gz

2022-10-23 20:03:05,129 cisTopic INFO Creating pseudobulk for CMPN_1
2022-10-23 20:08:35,478 cisTopic INFO CMPN_1 done!
2022-10-23 20:08:35,480 cisTopic INFO Creating pseudobulk for CMPN_2
2022-10-23 20:12:19,711 cisTopic INFO CMPN_2 done!
2022-10-23 20:12:19,712 cisTopic INFO Creating pseudobulk for CMPN_3
Traceback (most recent call last):
File "", line 1, in
File "/pycisTopic/pycisTopic/pseudobulk_peak_calling.py", line 185, in export_pseudobulk
[
File "/pycisTopic/pycisTopic/pseudobulk_peak_calling.py", line 186, in
export_pseudobulk_one_sample(
File "/pycisTopic/pycisTopic/pseudobulk_peak_calling.py", line 284, in export_pseudobulk_one_sample
group_pr.to_bigwig(
File "/scenicplus/lib/python3.8/site-packages/pyranges/pyranges.py", line 5339, in to_bigwig
result = _to_bigwig(self, path, chromosome_sizes, rpm, divide, value_col, dryrun)
File "/scenicplus/lib/python3.8/site-packages/pyranges/out.py", line 203, in _to_bigwig
bw.addEntries(chromosomes, starts, ends=ends, values=values)
RuntimeError: The entries you tried to add are out of order, precede already added entries, or otherwise use illegal values.
Please correct this and try again.

my script:
fragments_dict = {'Inj_5DPI_A211213013': 'Inj_5DPI_A211213013_5col_sorted.bed.gz'}
chromsizes = pd.read_csv('size.txt', sep='\t')
chromsizes = chromsizes[['name', 'length']]
chromsizes.columns = ['Chromosome', 'End']
chromsizes['Start'] = [0] * chromsizes.shape[0]
chromsizes = chromsizes.loc[:, ['Chromosome', 'Start', 'End']]
chromsizes = pr.PyRanges(chromsizes)
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(
    input_data = meta,
    variable = 'Celltype_Final',
    sample_id_col = 'dataset',
    chromsizes = chromsizes,
    bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),
    bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),
    path_to_fragments = fragments_dict,
    n_cpu = 1,
    normalize_bigwig = True,
    remove_duplicates = True,
    _temp_dir = os.path.join(tmp_dir, 'r'),
    split_pattern = '-')
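A plausible culprit for pyBigWig's "out of order / illegal values" error is a mismatch between the chromosome names (or lengths) in size.txt and the names actually present in the fragments file; intervals on a chromosome that is missing from chromsizes cannot be written correctly. A minimal sanity check, assuming the fragments file is a BED-like table with the chromosome in its first column:

import pandas as pd

# compare chromosome names between the fragments file and the chromsizes table;
# a name present in the fragments but absent from chromsizes (or a naming
# mismatch such as 'chr1' vs '1') can make pyBigWig reject the entries
frag_chroms = set(
    pd.read_csv('Inj_5DPI_A211213013_5col_sorted.bed.gz', sep='\t',
                header=None, usecols=[0], comment='#')[0].unique())
size_chroms = set(chromsizes.df['Chromosome'])
print('in fragments but not in chromsizes:', frag_chroms - size_chroms)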

Version (please complete the following information):

  • Python: 3.8.13
  • SCENIC+: '0.1.dev447+gd4fd733.d20221019'
  • pycisTopic:'1.0.2.dev9+gaf3977c.d20221019'

issue with RAY package while running a jupyter notebook

Hi Authors,

Thank you for your work in developing this package.

I have an issue while trying to run the 10x multiome pbmc tutorial.

Working with -

python 3.8 - version 3.8.13
scenicplus.__version__
'0.1.dev447+gd4fd733'

I downloaded the ipynb file for the tutorial and am executing it independently on our server. We have an LSF-based scheduler (bsub commands) and I am using nbconvert to run the script.

This is the command I am running -

 bsub -W 5:00 -M 128000 -n 1 -o jup_try.out -e jup_try.err jupyter nbconvert \
--to notebook \
--allow-errors=True \
--ExecutePreprocessor.timeout=-1 \
--execute pbmc_multiome_tutorial.ipynb

Essentially, the script gets stuck at the run_cgs_models() step (code block 39 in the notebook below).

I tried both multi-core and single-core execution, changing the "n_cpu" parameter to 10, 5, and 1 and matching the "-n" flag in the above bsub command accordingly (i.e. if n_cpu = 5, then bsub -n 5, and so forth). In each case the Ray package throws a timeout error.

Here is the output jupyter notebook with the results of each step running in "single processor/cpu mode"

Any help troubleshooting this would be appreciated!
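Not an official fix, but a sketch of what has helped in similar LSF setups: keep n_cpu equal to the cores actually requested with -n, and give Ray a node-local temporary directory via _temp_dir, assuming run_cgs_models forwards keyword arguments to ray.init as other pycisTopic functions do (the n_topics values below are the tutorial's):

from pycisTopic.lda_models import run_cgs_models

# n_cpu should match the -n value requested from LSF; _temp_dir should point
# at node-local scratch so Ray's startup handshake does not stall
models = run_cgs_models(
    cistopic_obj,
    n_topics=[2, 4, 10, 16, 32, 48],
    n_cpu=5,
    _temp_dir='/tmp/ray_spill')  # hypothetical node-local path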

run_scenicplus hanging

Hi,
My scenicplus run seems to hang for a while, any suggestion what I should do? Below are my command and last few lines of message.
from scenicplus.wrappers.run_scenicplus import run_scenicplus
try:
    run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['GEX_celltype'],
        species = 'mmusculus',
        assembly = 'mm10',
        tf_file = 'allTFs_mm.txt',
        save_path = 'scenicplus',
        biomart_host = biomart_host,
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = True,
        export_to_UCSC_file = True,
        path_bedToBigBed = '/mctp/share/users/yupingz/tools/',
        n_cpu = 12,
        _temp_dir = os.path.join(tmp_dir, 'ray_spill'))
except Exception as e:
    # in case of failure, still save the object
    dill.dump(scplus_obj, open('scenicplus/scplus_obj.pkl', 'wb'), protocol=-1)
    raise(e)

##########################################
2022-09-19 04:19:28,716 WARNING services.py:1882 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67100672 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2022-09-19 04:19:32,119 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [23:47<00:00, 101.97s/it]
2022-09-19 04:43:23,234 GSEA INFO Subsetting TF2G adjacencies for TF with motif.
2022-09-19 04:43:35,257 WARNING services.py:1882 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67100672 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2022-09-19 04:43:39,455 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265
2022-09-19 04:43:40,380 GSEA INFO Running GSEA...
initializing: 0%| | 15/22043 [00:06<2:40:25, 2.29it/s](_ray_run_gsea_for_e_module pid=247665) /opt/venv/lib/python3.8/site-packages/gseapy/algorithm.py:71: RuntimeWarning: divide by zero encountered in true_divide
(_ray_run_gsea_for_e_module pid=247665) norm_tag = 1.0/sum_correl_tag
(_ray_run_gsea_for_e_module pid=247665) /opt/venv/lib/python3.8/site-packages/gseapy/algorithm.py:74: RuntimeWarning: invalid value encountered in multiply
(_ray_run_gsea_for_e_module pid=247665) RES = np.cumsum(tag_indicator * correl_vector * norm_tag - no_tag_indicator * norm_no_tag, axis=axis)
initializing: 0%| | 21/22043 [00:09<2:59:16, 2.05it/s](_ray_run_gsea_for_e_module pid=247658) /opt/venv/lib/python3.8/site-packages/gseapy/algorithm.py:71: RuntimeWarning: divide by zero encountered in true_divide
(_ray_run_gsea_for_e_module pid=247658) norm_tag = 1.0/sum_correl_tag
(_ray_run_gsea_for_e_module pid=247658) /opt/venv/lib/python3.8/site-packages/gseapy/algorithm.py:74: RuntimeWarning: invalid value encountered in multiply
(_ray_run_gsea_for_e_module pid=247658) RES = np.cumsum(tag_indicator * correl_vector * norm_tag - no_tag_indicator * norm_no_tag, axis=axis)
initializing: 0%|▏ | 23/22043 [00:10<2:52:50, 2.12it/s](_ray_run_gsea_for_e_module pid=247661) /opt/venv/lib/python3.8/site-packages/gseapy/algorithm.py:71: RuntimeWarning: divide by zero encountered in true_divide
(_ray_run_gsea_for_e_module pid=247661) norm_tag = 1.0/sum_correl_tag
(_ray_run_gsea_for_e_module pid=247661) /opt/venv/lib/python3.8/site-packages/gseapy/algorithm.py:74: RuntimeWarning: invalid value encountered in multiply
(_ray_run_gsea_for_e_module pid=247661) RES = np.cumsum(tag_indicator * correl_vector * norm_tag - no_tag_indicator * norm_no_tag, axis=axis)
initializing: 85%|████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 18689/22043 18:51<01:09, 48.10it/s Spilled 3791 MiB, 9979 objects, write throughput 966 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
initializing: 85%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 18796/22043 18:52<00:33, 97.48it/s Spilled 7001 MiB, 18433 objects, write throughput 1480 MiB/s.
initializing: 89%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊ | 19594/22043 19:54<03:35, 11.36it/s Spilled 8213 MiB, 21535 objects, write throughput 1311 MiB/s.
initializing: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 22043/22043 [23:22<00:00, 15.72it/s]
Running using 12 cores: 32%|████████████████████████████████████▏ | 7202/22299 [3:44:32<713:05:46, 170.04s/it]
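One practical note while debugging hangs like this: because the except clause above already dumps the object with dill, a stalled run can be resumed from that pickle instead of recomputing the finished steps.

import dill

# reload the partially processed object saved by the except clause above
scplus_obj = dill.load(open('scenicplus/scplus_obj.pkl', 'rb'))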

Can I use time series data with multiple replicates per time point

I have time series data with multiple replicates per time point

I understand that multiple samples can be used, but can I compare across samples from different time points?

Also, would it be okay to combine multiome data with separate scRNA-seq + scATAC-seq data (integrated via label transfer) in the same analysis?

Thanks for the wonderful tool!

HJ

How to run the run_pycistarget function locally?

Hello, I found that running run_pycistarget requires network access. How can I run it locally? Our HPC cannot connect to external websites.

run_pycistarget(
    region_sets = region_sets,
    species = 'homo_sapiens',
    save_path = os.path.join(work_dir, 'motifs'),
    ctx_db_path = rankings_db,
    dem_db_path = scores_db,
    path_to_motif_annotations = motif_annotation,
    run_without_promoters = True,
    biomart_host = 'http://www.ensembl.org',  # needs network access
    n_cpu = 8,
    _temp_dir = tmp_dir,
    annotation_version = 'v10nr_clust',
    )

Running it raises the following error:
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url
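As a workaround sketch (not an official offline mode): the network call here is the biomart query used for the promoter annotation, so one option is to fetch that annotation once on a node with internet access (e.g. the login node) and carry the file over; whether your installed run_pycistarget version can consume a pre-fetched annotation directly is something to verify. The attribute names below are the ones used in the SCENIC+ tutorials, and a mirror such as http://useast.ensembl.org sometimes responds when the main host is blocked:

from pybiomart import Dataset

# fetch the TSS annotation once on a machine with internet access
dataset = Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site',
                                  'strand', 'external_gene_name',
                                  'transcript_biotype'])
annot.to_csv('ensembl_tss_annotation.tsv', sep='\t', index=False)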

TimeoutError when running export_pseudobulk

Hello,

Thank you for developing this awesome package! When going through the pbmc_multiome_tutorial.ipynb file, I encounter the following error when running export_pseudobulk(). This occurs whether I run the tutorial on my Mac or on an HPC node. I found this relevant issue, but there doesn't seem to be a solution yet. Any help would be greatly appreciated!

TimeoutError                              Traceback (most recent call last)
~/opt/anaconda3/envs/xgb_env/lib/python3.7/site-packages/ray/node.py in __init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    306                     self._plasma_store_socket_name,
--> 307                     self.redis_password,
    308                 )

~/opt/anaconda3/envs/xgb_env/lib/python3.7/site-packages/ray/_private/services.py in wait_for_node(redis_address, gcs_address, node_plasma_store_socket_name, redis_password, timeout)
    396             time.sleep(0.1)
--> 397     raise TimeoutError("Timed out while waiting for node to startup.")
    398 

TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Exception                                 Traceback (most recent call last)
/var/folders/df/fyg_m9x17mg1h3hzz8rxmm540000gn/T/ipykernel_82417/1956818805.py in <module>
     11                  remove_duplicates = True,
     12                  _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
---> 13                  split_pattern = '-')

~/opt/anaconda3/envs/xgb_env/lib/python3.7/site-packages/pycisTopic/pseudobulk_peak_calling.py in export_pseudobulk(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, use_polars, **kwargs)
    162     # Create pseudobulks
    163     if n_cpu > 1:
--> 164         ray.init(num_cpus=n_cpu, **kwargs)
    165         ray_handle = ray.wait(
    166             [

~/opt/anaconda3/envs/xgb_env/lib/python3.7/site-packages/ray/_private/client_mode_hook.py in wrapper(*args, **kwargs)
    103             if func.__name__ != "init" or is_client_mode_enabled_by_default:
    104                 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105         return func(*args, **kwargs)
    106 
    107     return wrapper

~/opt/anaconda3/envs/xgb_env/lib/python3.7/site-packages/ray/worker.py in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _metrics_export_port, _system_config, _tracing_startup_hook, _node_name, **kwargs)
   1038         # isn't called.
   1039         _global_node = ray.node.Node(
-> 1040             head=True, shutdown_at_exit=False, spawn_reaper=True, ray_params=ray_params
   1041         )
   1042     else:

~/opt/anaconda3/envs/xgb_env/lib/python3.7/site-packages/ray/node.py in __init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
    309             except TimeoutError:
    310                 raise Exception(
--> 311                     "The current node has not been updated within 30 "
    312                     "seconds, this could happen because of some of "
    313                     "the Ray processes failed to startup."

Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.
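Not a real fix, but the traceback shows that Ray is only started when n_cpu > 1, so as a workaround sketch you can run single-core and sidestep the node-startup timeout entirely (slower, but it completes):

import os
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk

# n_cpu=1 avoids ray.init() altogether (see the 'if n_cpu > 1' branch above)
bw_paths, bed_paths = export_pseudobulk(
    input_data=cell_data,
    variable='celltype',
    sample_id_col='sample_id',
    chromsizes=chromsizes,
    bed_path=os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),
    bigwig_path=os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),
    path_to_fragments=fragments_dict,
    n_cpu=1,
    normalize_bigwig=True,
    remove_duplicates=True,
    split_pattern='-')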

Cannot export pseudobulk files

Hello, I don't receive an error when executing the following, but when comparing the output to that of the SCENIC+ tutorial I can see the pseudobulk files haven't been created.
Do you know why this is? Thanks!

from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(
    input_data = cell_data,
    variable = 'celltype',
    sample_id_col = 'sample_id',
    chromsizes = chromsizes,
    bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'),
    bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),
    path_to_fragments = fragments_dict,
    n_cpu = 8,
    normalize_bigwig = True,
    remove_duplicates = True,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
    split_pattern = '-')

2022-11-03 16:07:53,397 cisTopic INFO Reading fragments from pbmc_tutorial/data/atac_fragments.tsv.gz
2022-11-03 16:09:26,040 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at http://127.0.0.1:8265/
(export_pseudobulk_ray pid=703) 2022-11-03 16:09:35,151 cisTopic INFO Creating pseudobulk for endocrine
(export_pseudobulk_ray pid=702) 2022-11-03 16:09:35,151 cisTopic INFO Creating pseudobulk for ductal

Specifically, I never see the "export_pseudobulk_ray ... cisTopic INFO ductal done!" messages, etc.

DAR calculation results in empty dictionary values for some keys

Hi, this isn't an issue with the program itself but I'm just trying to understand better ways to move forward with my data.

I'm running SCENIC+ on my own multiome data and it looks fine until I calculate DARs. The imputed sparsity is very low, which may be one issue (I ran the default command):
Imputed accessibility sparsity: 0.00243018816125784

And some of my cell clusters (clusters 0 and 1) have empty dictionary values:

>>>print(markers_dict)

{'0': Empty DataFrame
 Columns: [Log2FC, Adjusted_pval, Contrast]
 Index: [],
 '1': Empty DataFrame
 Columns: [Log2FC, Adjusted_pval, Contrast]
 Index: [],
 '2':                            Log2FC Adjusted_pval Contrast
 mi7:4818657-4819157      0.769073      0.000001        2
 ma6:66568746-66569246    0.768514      0.000001        2
 ma6:29137929-29138429    0.766684      0.000001        2
 ma1:278987328-278987828  0.766288      0.000001        2
 ma5:3335808-3336308      0.765712      0.000001        2
 ...                           ...           ...      ...
 mi4:7649213-7649713      0.586276      0.000001        2
 mi1:17598053-17598553    0.586209      0.000001        2
 mi1:16983098-16983598    0.586181      0.000001        2
 ma3:152108281-152108781  0.586084      0.000001        2
 ma1:147379789-147380289  0.585201      0.000001        2
 
 [2419 rows x 3 columns],
 '3':                            Log2FC Adjusted_pval Contrast
 mi7:4818657-4819157      1.839034           0.0        3
 ma6:29137929-29138429    1.834527           0.0        3
 ma6:66568746-66569246    1.831466           0.0        3
 Z:99232385-99232885      1.829914           0.0        3
 ma5:3335808-3336308      1.824418           0.0        3
 ...                           ...           ...      ...
 mi1:18492676-18493176    0.586122           0.0        3
 ma1:139951997-139952497   0.58608           0.0        3
 ma2:83561973-83562473    0.585666           0.0        3
 ma4:33619315-33619815    0.585487           0.0        3
 ma1:227069338-227069838  0.585128           0.0        3
 
 [10916 rows x 3 columns],
 '4':                            Log2FC Adjusted_pval Contrast
 ma6:66568746-66569246     0.74331           0.0        4
 mi7:4818657-4819157      0.742966           0.0        4
 ma6:29137929-29138429    0.741914           0.0        4
 ma5:3335808-3336308      0.740059           0.0        4
 Z:99232385-99232885      0.738471           0.0        4
 ...                           ...           ...      ...
 ma2:127339096-127339596  0.585864           0.0        4
 ma6:52456453-52456953    0.585728           0.0        4
 mi3:16386390-16386890    0.585616           0.0        4
 ma7:35807057-35807557    0.585456           0.0        4
 ma3:107703765-107704265  0.585026           0.0        4
 
 [2622 rows x 3 columns],
 '5':                          Log2FC Adjusted_pval Contrast
 ma1:86627541-86628041  1.970056           0.0        5
 ma2:94271354-94271854  1.954057           0.0        5
 mi1:9052675-9053175    1.941548           0.0        5
 mi4:15972036-15972536  1.941548           0.0        5
 mi8:2047445-2047945    1.941118           0.0        5
 ...                         ...           ...      ...
 ma6:49717359-49717859  0.584988           0.0        5
 ma6:63083984-63084484  0.584988           0.0        5
 ma6:64046189-64046689  0.584988           0.0        5
 mi1:15376840-15377340  0.584988           0.0        5
 mi5:2253119-2253619    0.584988           0.0        5
 
 [46195 rows x 3 columns],
 '6':                            Log2FC Adjusted_pval Contrast
 ma5:63752539-63753039    1.111123           0.0        6
 ma2:143088687-143089187  1.104998           0.0        6
 ma2:182001025-182001525  1.103168           0.0        6
 mi4:16031232-16031732    1.101949           0.0        6
 ma2:119763231-119763731  1.097899           0.0        6
 ...                           ...           ...      ...
 Z:86427065-86427565      0.585042           0.0        6
 ma2:101515606-101516106  0.585042           0.0        6
 ma1:142062650-142063150  0.585022           0.0        6
 ma2:128449346-128449846   0.58502           0.0        6
 ma6:64347065-64347565    0.585002           0.0        6
 
 [39226 rows x 3 columns],
 '7':                          Log2FC Adjusted_pval Contrast
 ma1:86627541-86628041  1.572187           0.0        7
 ma2:94271354-94271854  1.565921           0.0        7
 mi1:9052675-9053175    1.558166           0.0        7
 mi4:15972036-15972536  1.558166           0.0        7
 mi8:2047445-2047945    1.553827           0.0        7
 ...                         ...           ...      ...
 ma1:39354640-39355140  0.585174      0.000008        7
 ma4:1732112-1732612    0.585075       0.00002        7
 ma2:67501656-67502156  0.585058      0.000015        7
 ma2:72275760-72276260  0.585058      0.000015        7
 ma3:74508849-74509349     0.585           0.0        7
 
 [44975 rows x 3 columns]}

Having empty values will cause downstream issues. I was wondering if anyone had any input on why this might be happening or if there are reasonable ways to fix this without affecting the data interpretation too much? I'm not too familiar with cistopic so any help with this would be much appreciated! Thanks
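Until the underlying cause is clear, a minimal sketch for keeping downstream steps alive is to drop the contrasts that yielded no DARs before passing the dictionary on (whether that is acceptable for your interpretation is a separate question):

# keep only contrasts that actually produced DARs, so downstream code that
# iterates over the dictionary does not choke on empty dataframes
markers_dict = {ct: df for ct, df in markers_dict.items() if df.shape[0] > 0}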

Test data in scenicplus and pycisTopic jupyter notebook seem lost.

Hi,

Here I report that the test datasets (10x_multiome_brain_Seurat.loom and other input files) used in both the scenicplus (https://github.com/aertslab/scenicplus/blob/main/docs/Scenicplus_step_by_step-RTD.ipynb) and pycisTopic (https://pycistopic.readthedocs.io/en/latest/Single_sample_workflow-RTD.html) Jupyter notebooks seem to be missing. Perhaps I am missing some information within the notebooks? I understand one can carry out the Seurat analysis from the downloaded dataset, but it would be great if the authors could provide a ready-to-use one ^_^. Thanks a lot!

Best,
Meijiao

merge_cistromes bug if not using extended motif annotation

Before you refactored the code in merge_cistromes (commit 4a0e09f) everything was running fine, but now I got the following bug.

First lines in cistromes.py, merge_cistromes():

    menr = scplus_obj.menr
    # Get signatures from Homer/Cistarget outputs
    signatures = _get_signatures_as_dict(_signatures_to_iter(menr))

    #split direct and indirect signatures
    signatures_direct = {x: signatures[x] for x in signatures.keys() if not 'extended' in x}
    signatures_extend = {x: signatures[x] for x in signatures.keys() if     'extended' in x}
    
    #merge regions by TF name
    merged_signatures_direct = _merge_dict_of_signatures(signatures_direct, suffix = '')
    merged_signatures_extend = _merge_dict_of_signatures(signatures_extend, suffix = '_extended')

The last line here will cause a bug if signatures_extend is empty (len(signatures_extend) == 0), which it will be if we only include directly annotated motifs, if I understand everything correctly.

Maybe signatures_extend is never supposed to be empty, in that case the bug might be related to my previous steps (i.e. the cistarget database I am using, and the setting I used to run pycistarget).

Here is the traceback:

ValueError                                Traceback (most recent call last)
Input In [48], in <cell line: 1>()
----> 1 merged_signatures_extend = _merge_dict_of_signatures(signatures_extend, suffix = '_extended')

File ~/miniconda3/envs/grn_tools_new/lib/python3.8/site-packages/scenicplus/cistromes.py:40, in _merge_dict_of_signatures(d, suffix)
     38 def _merge_dict_of_signatures(d, suffix = ''):
     39     arr_keys_signatures = np.array(list(d.keys()))
---> 40     grouper = Groupby([x.split('_')[0] for x in arr_keys_signatures])
     41     merged_signatures = {}
     42     for TF, idx in zip(grouper.keys, grouper.indices):

File ~/miniconda3/envs/grn_tools_new/lib/python3.8/site-packages/scenicplus/utils.py:279, in Groupby.__init__(self, keys)
    277 def __init__(self, keys):
    278     self.keys, self.keys_as_int = np.unique(keys, return_inverse=True)
--> 279     self.n_keys = max(self.keys_as_int) + 1
    280     self.set_indices()

ValueError: max() arg is an empty sequence
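A minimal guard sketch around the failing call, assuming an empty extended set is legitimate when only directly annotated motifs are used:

# only merge the extended signatures when there are any; otherwise an empty
# dict keeps the downstream logic intact
merged_signatures_extend = (
    _merge_dict_of_signatures(signatures_extend, suffix='_extended')
    if len(signatures_extend) > 0
    else {}
)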

Using SCENIC+ in jupyter-notebook on a HPC

Hi,

Congratulations on the development of SCENIC+, it looks like a really useful tool and I can't wait to try it out.

Describe the bug
I am trying to run the tutorial notebooks on an HPC cluster, but the Ray parallelization is causing problems for me. My notebooks run through an interactive SLURM job. Do you have any guidance on how to get Ray to work in this context? Many thanks in advance.

To Reproduce
SLURM submission script:

#!/bin/bash
#SBATCH -p cclake
#SBATCH -A gottgens-sl2-cpu
#SBATCH -N 1
#SBATCH -n 12

# get/set tunneling info
XDG_RUNTIME_DIR=""
ipnport=$(shuf -i8000-9999 -n1)
ipnip=$(hostname -i)

# start Jupyter
jupyter-lab --no-browser --port=$ipnport --ip=$ipnip

Then in the Jupyter notebook (this, or any other command that has the n_cpu option):

run_scenicplus(
    scplus_obj = scplus_obj,
    variable = ['GEX_celltype'],
    species = 'hsapiens',
    assembly = 'hg38',
    tf_file = 'pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt',
    save_path = os.path.join(work_dir, 'scenicplus'),
    biomart_host = biomart_host,
    upstream = [1000, 150000],
    downstream = [1000, 150000],
    calculate_TF_eGRN_correlation = True,
    calculate_DEGs_DARs = True,
    export_to_loom_file = True,
    export_to_UCSC_file = True,
    path_bedToBigBed = 'pbmc_tutorial',
    n_cpu = 5,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'))

It seems to run with n_cpu=None, but this takes a really long time.

Error output
2022-08-22 09:17:00,608 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8267/

TimeoutError Traceback (most recent call last)
File ~/miniconda3/envs/scenicplus/lib/python3.8/site-packages/ray/node.py:303, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
302 try:
--> 303 ray._private.services.wait_for_node(
304 self.redis_address,
305 self.gcs_address,
306 self._plasma_store_socket_name,
307 self.redis_password,
308 )
309 except TimeoutError:

File ~/miniconda3/envs/scenicplus/lib/python3.8/site-packages/ray/_private/services.py:397, in wait_for_node(redis_address, gcs_address, node_plasma_store_socket_name, redis_password, timeout)
396 time.sleep(0.1)
--> 397 raise TimeoutError("Timed out while waiting for node to startup.")

TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Exception Traceback (most recent call last)
Input In [18], in <cell line: 2>()
20 except Exception as e:
21 #in case of failure, still save the object
22 dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
---> 23 raise(e)

Input In [18], in <cell line: 2>()
1 from scenicplus.wrappers.run_scenicplus import run_scenicplus
2 try:
----> 3 run_scenicplus(
4 scplus_obj = scplus_obj,
5 variable = ['GEX_celltype'],
6 species = 'hsapiens',
7 assembly = 'hg38',
8 tf_file = 'pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt',
9 save_path = os.path.join(work_dir, 'scenicplus'),
10 biomart_host = biomart_host,
11 upstream = [1000, 150000],
12 downstream = [1000, 150000],
13 calculate_TF_eGRN_correlation = True,
14 calculate_DEGs_DARs = True,
15 export_to_loom_file = True,
16 export_to_UCSC_file = True,
17 path_bedToBigBed = 'pbmc_tutorial',
18 n_cpu = 5,
19 _temp_dir = os.path.join(tmp_dir, 'ray_spill'))
20 except Exception as e:
21 #in case of failure, still save the object
22 dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)

File /rds/project/rds-SDzz0CATGms/users/bt392/software/github/scenicplus/src/scenicplus/wrappers/run_scenicplus.py:141, in run_scenicplus(scplus_obj, variable, species, assembly, tf_file, save_path, biomart_host, upstream, downstream, region_ranking, gene_ranking, simplified_eGRN, calculate_TF_eGRN_correlation, calculate_DEGs_DARs, export_to_loom_file, export_to_UCSC_file, tree_structure, path_bedToBigBed, n_cpu, _temp_dir)
139 if 'region_to_gene' not in scplus_obj.uns.keys():
140 log.info('Inferring region to gene relationships')
--> 141 calculate_regions_to_genes_relationships(scplus_obj,
142 ray_n_cpu = n_cpu,
143 _temp_dir = _temp_dir,
144 importance_scoring_method = 'GBM',
145 importance_scoring_kwargs = GBM_KWARGS)
147 if 'TF2G_adj' not in scplus_obj.uns.keys():
148 log.info('Inferring TF to gene relationships')

File /rds/project/rds-SDzz0CATGms/users/bt392/software/github/scenicplus/src/scenicplus/enhancer_to_gene.py:661, in calculate_regions_to_genes_relationships(SCENICPLUS_obj, search_space_key, mask_expr_dropout, genes, importance_scoring_method, importance_scoring_kwargs, correlation_scoring_method, ray_n_cpu, add_distance, key_added, inplace, **kwargs)
658 log.info(
659 f'Calculating region to gene importances, using {importance_scoring_method} method')
660 start_time = time.time()
--> 661 region_to_gene_importances = _score_regions_to_genes(SCENICPLUS_obj,
662 search_space=search_space,
663 mask_expr_dropout=mask_expr_dropout,
664 genes=genes,
665 regressor_type=importance_scoring_method,
666 regressor_kwargs=importance_scoring_kwargs,
667 ray_n_cpu=ray_n_cpu,
668 **kwargs)
669 log.info('Took {} seconds'.format(time.time() - start_time))
671 # calculate region to gene correlation

File /rds/project/rds-SDzz0CATGms/users/bt392/software/github/scenicplus/src/scenicplus/enhancer_to_gene.py:534, in _score_regions_to_genes(SCENICPLUS_obj, search_space, mask_expr_dropout, genes, regressor_type, ray_n_cpu, regressor_kwargs, **kwargs)
532 ACC_df = SCENICPLUS_obj.to_df(layer='ACC')
533 if ray_n_cpu != None:
--> 534 ray.init(num_cpus=ray_n_cpu, **kwargs)
535 try:
536 jobs = []

File ~/miniconda3/envs/scenicplus/lib/python3.8/site-packages/ray/_private/client_mode_hook.py:105, in client_mode_hook.<locals>.wrapper(*args, **kwargs)
103 if func.__name__ != "init" or is_client_mode_enabled_by_default:
104 return getattr(ray, func.__name__)(*args, **kwargs)
--> 105 return func(*args, **kwargs)

File ~/miniconda3/envs/scenicplus/lib/python3.8/site-packages/ray/worker.py:1039, in init(address, num_cpus, num_gpus, resources, object_store_memory, local_mode, ignore_reinit_error, include_dashboard, dashboard_host, dashboard_port, job_config, configure_logging, logging_level, logging_format, log_to_driver, namespace, runtime_env, storage, _enable_object_reconstruction, _redis_max_memory, _plasma_directory, _node_ip_address, _driver_object_store_memory, _memory, _redis_password, _temp_dir, _metrics_export_port, _system_config, _tracing_startup_hook, _node_name, **kwargs)
997 ray_params = ray._private.parameter.RayParams(
998 node_ip_address=node_ip_address,
999 raylet_ip_address=raylet_ip_address,
(...)
1033 node_name=_node_name,
1034 )
1035 # Start the Ray processes. We set shutdown_at_exit=False because we
1036 # shutdown the node in the ray.shutdown call that happens in the atexit
1037 # handler. We still spawn a reaper process in case the atexit handler
1038 # isn't called.
-> 1039 _global_node = ray.node.Node(
1040 head=True, shutdown_at_exit=False, spawn_reaper=True, ray_params=ray_params
1041 )
1042 else:
1043 # In this case, we are connecting to an existing cluster.
1044 if num_cpus is not None or num_gpus is not None:

File ~/miniconda3/envs/scenicplus/lib/python3.8/site-packages/ray/node.py:310, in Node.__init__(self, ray_params, head, shutdown_at_exit, spawn_reaper, connect_only)
303 ray._private.services.wait_for_node(
304 self.redis_address,
305 self.gcs_address,
306 self._plasma_store_socket_name,
307 self.redis_password,
308 )
309 except TimeoutError:
--> 310 raise Exception(
311 "The current node has not been updated within 30 "
312 "seconds, this could happen because of some of "
313 "the Ray processes failed to startup."
314 )
315 node_info = ray._private.services.get_node_to_connect_for_driver(
316 self.redis_address,
317 self.gcs_address,
318 self._raylet_ip_address,
319 redis_password=self.redis_password,
320 )
321 self._ray_params.node_manager_port = node_info.node_manager_port

Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.

Version (please complete the following information):

  • Python: 3.8.13
  • SCENIC+: 0.1.dev435+g5987737.d20220821
  • Ray 3.8.13
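For anyone hitting the same wall, one thing worth trying (a sketch, not a verified fix): give Ray a node-local session directory. Its 30-second startup handshake is known to stall when the session files live on a shared filesystem, and on SLURM $TMPDIR usually points at node-local scratch:

import os

# node-local scratch for Ray's session files; shared filesystems (Lustre/GPFS)
# can make Ray's startup handshake exceed the 30 s timeout
ray_tmp = os.path.join(os.environ.get('TMPDIR', '/tmp'), 'ray_spill')
os.makedirs(ray_tmp, exist_ok=True)
# then pass _temp_dir=ray_tmp (and n_cpu matching #SBATCH -n) to run_scenicplus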

[BUG] Handling genes with only 1 region in search space

  • Information

On the link inference, the program suddenly stops on certain genes with this error:

Traceback (most recent call last):
  File "/data/leuven/software/biomed/skylake_centos7/2018a/software/Python/3.7.4-GCCcore-6.4.0/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/leuven/software/biomed/skylake_centos7/2018a/software/Python/3.7.4-GCCcore-6.4.0/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/cli/benchmarking.py", line 170, in <module>
    main()
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/cli/benchmarking.py", line 166, in main
    args.func(args)
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/cli/benchmarking.py", line 80, in region_to_gene_command
    _temp_dir = temp_dir
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/enhancer_to_gene.py", line 479, in calculate_regions_to_genes_relationships
    **kwargs)
  File "/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus/src/scenicplus/enhancer_to_gene.py", line 419, in score_regions_to_genes
    regions_to_genes = {gene: regions_to_gene for gene, regions_to_gene in zip(genes_to_use, regions_to_genes)}
UnboundLocalError: local variable 'regions_to_genes' referenced before assignment

Doing some tests, I realized this happens for genes with only 1 region in the search space (1 feature); the real error is not printed because it occurs inside a try block [https://github.com/aertslab/scenicplus/blob/main/src/scenicplus/enhancer_to_gene.py, lines 399-419], but here is a reproducible test:

import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, ExtraTreesRegressor
RF_KWARGS = {
    'n_jobs': 1,
    'n_estimators': 1000,
    'max_features': 'sqrt'
}
SKLEARN_REGRESSOR_FACTORY = {
    'RF': RandomForestRegressor,
    'ET': ExtraTreesRegressor,
    'GBM': GradientBoostingRegressor
}
SCIPY_CORRELATION_FACTORY = {
    'PR': pearsonr,
    'SR': spearmanr
}
#RUNS
test=score_regions_to_single_gene(imp_acc.mtx[0:5,0:5].todense(), np.array(dgem.iloc[0,0:5]), 'RF', RF_KWARGS)
#ERROR
test=score_regions_to_single_gene(imp_acc.mtx[0,0:5].todense(), np.array(dgem.iloc[0,0:5]), 'RF', RF_KWARGS)

Which outputs:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-40-64b45ad4144d> in <module>
     16     'SR': spearmanr
     17 }
---> 18 test=score_regions_to_single_gene(imp_acc.mtx[0,0:5].todense(), np.array(dgem.iloc[0,0:5]), 'RF', RF_KWARGS)
     19 test

<ipython-input-8-6fcb3afe2133> in score_regions_to_single_gene(X, y, regressor_type, regressor_kwargs)
     16                                                     regressor_kwargs = regressor_kwargs,
     17                                                     tf_matrix = X,
---> 18                                                     target_gene_expression = y)
     19             #get importance scores for each feature
     20             feature_importance = arboreto_core.to_feature_importances(  regressor_type = regressor_type, 

~/.local/lib/python3.7/site-packages/arboreto/core.py in fit_model(regressor_type, regressor_kwargs, tf_matrix, target_gene_expression, early_stop_window_length, seed)
    125         target_gene_expression = target_gene_expression.A.flatten()
    126 
--> 127     assert tf_matrix.shape[0] == target_gene_expression.shape[0]
    128 
    129 

AssertionError: 
  • Solution
    Adding a transposition when there is only one region solves it (it returns an importance of 1):
#ERROR
test=score_regions_to_single_gene(imp_acc.mtx[0,0:5].todense(), np.array(dgem.iloc[0,0:5]), 'RF', RF_KWARGS)
#WORKS
test=score_regions_to_single_gene(imp_acc.mtx[0,0:5].T.todense(), np.array(dgem.iloc[0,0:5]), 'RF', RF_KWARGS)

I will test this further and push if all goes well, but this situation may need to be handled in other steps too (e.g. binarization), so I will keep the issue open until all is tested.

Cheers!

C
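For reference, a generalized version of the same fix as a sketch (not the committed patch): transpose whenever the cell axis of the accessibility slice does not line up with the expression vector, which also covers the dense 1-D case.

import numpy as np

def align_regions_to_cells(X, y):
    # a single-region slice densifies to shape (1, n_cells), while arboreto
    # expects (n_cells, n_regions); reshape/transpose so the cell axes match
    X = np.asarray(X)
    if X.ndim == 1:
        X = X.reshape(-1, 1)
    if X.shape[0] != y.shape[0] and X.shape[1] == y.shape[0]:
        X = X.T
    return X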

Question about GSEA in the eGRN compilation part

I have some questions about the eGRN compilation part.
Is GSEA calculated for each TF-region-gene module? Which genes are in the ranking list L?
My understanding is that the ranking list contains the importance scores of all downstream genes for a given transcription factor, and the gene set is the downstream genes in the module corresponding to that TF. I don't know if my understanding is correct.
I am not very clear on the detailed process of this part; I hope you can help me out.
Thanks
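For anyone with the same question, my unofficial reading matches the one above: the ranking L contains all target genes ordered by their TF-to-gene importance for that TF, and the tested set is the genes reachable through the TF's regions in the module. The real computation is done by gseapy with a weighted statistic; the toy sketch below shows the unweighted running sum for intuition only:

import numpy as np

def enrichment_score(ranked_genes, gene_set):
    # classic unweighted GSEA running sum: step up on a hit (gene in the set),
    # step down on a miss; the extreme of the walk is the enrichment score
    hits = np.isin(ranked_genes, list(gene_set))
    n_hits = hits.sum()
    if n_hits == 0 or n_hits == len(ranked_genes):
        return 0.0
    step_hit = 1.0 / n_hits
    step_miss = 1.0 / (len(ranked_genes) - n_hits)
    running = np.cumsum(np.where(hits, step_hit, -step_miss))
    return float(running[np.abs(running).argmax()])

# toy example: genes ranked by TF-to-gene importance, set = module targets
ranking = np.array(['g1', 'g2', 'g3', 'g4', 'g5', 'g6'])
print(enrichment_score(ranking, {'g1', 'g2', 'g4'}))  # ~0.67: hits sit near the top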

[FEATURE REQUEST] Custom colors for loom

Is your feature request related to a problem? Please describe.
Add custom colors for categorical variables in loom files (applicable to SCENIC+/pycisTopic)

Describe the solution you'd like
See Sara's notebook

An error during calculate_TFs_to_genes_relationships()

Describe the bug
In the SCENIC+ object, my expression matrix was stored in the form of a sparse matrix by default.

During the calculate_TFs_to_genes_relationships() step, the expression matrix is required in the form of a pandas dataframe (TF_to_gene.py module, line 333). The function tries to build it with:

ex_matrix = pd.DataFrame(
        scplus_obj.X_EXP, index=scplus_obj.cell_names, columns=scplus_obj.gene_names)

However, this approach is not suitable for sparse matrices and results in the error: ValueError: Shape of passed values is (9307, 1), indices imply (9307, 27477)

I bypassed this error by using the default method .to_df(layer="EXP"):
ex_matrix = scplus_obj.to_df(layer="EXP")

It might be a good idea to fix this dangerous spot.
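For completeness, a sparse-safe construction for that spot, equivalent to the .to_df() workaround above (a sketch, not the committed fix):

import pandas as pd
from scipy import sparse

X = scplus_obj.X_EXP
ex_matrix = pd.DataFrame(
    # densify only when needed; dense inputs pass straight through
    X.toarray() if sparse.issparse(X) else X,
    index=scplus_obj.cell_names,
    columns=scplus_obj.gene_names,
)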

Memory efficiency in make_rankings

To store the rankings in make_rankings (eregulon_enrichment.py) you initialize a matrix of zeros with datatype np.float64 (double), where np.int32 would work as well and save a lot of memory. I am referring to lines 82-83 (np.zeros((len(scplus_obj.region_names), len(scplus_obj.cell_names)))) and line 89 (np.zeros((len(scplus_obj.gene_names), len(scplus_obj.cell_names)))).
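For illustration, the suggested change as a short sketch (rankings are integer positions, so int32 halves the footprint relative to the float64 default):

import numpy as np

rnk = np.zeros((len(scplus_obj.region_names), len(scplus_obj.cell_names)),
               dtype=np.int32)  # instead of the implicit dtype=np.float64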

load own filtered gene matrix and cell type annotations?

Hi Seppe,

Is there a direct way to use our own filtered scRNA-seq data, in the form of a filtered gene matrix plus a CSV (or similar) file containing a cell barcode to cell type mapping, along with two columns for the UMAP1 and UMAP2 coordinates?

I created my own adata object and loaded the filtered gene matrix using sc.read_10x_h5. However, can you suggest a way to annotate the cells using my mapping table with umap coordinates?

--
Sid
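In case it is useful while waiting for an answer, a minimal sketch of one way to do this with plain scanpy/anndata calls (the file names and column names below are hypothetical placeholders for your own):

import pandas as pd
import scanpy as sc

adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')    # your matrix
adata.var_names_make_unique()
meta = pd.read_csv('cell_annotations.csv', index_col=0)    # barcode-indexed
# keep only barcodes present in both objects, in matching order
common = adata.obs_names.intersection(meta.index)
adata = adata[common].copy()
meta = meta.loc[common]
adata.obs['celltype'] = meta['celltype']                   # your column name
adata.obsm['X_umap'] = meta[['UMAP1', 'UMAP2']].to_numpy() # your columns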

Provide files used for tutorial SCENIC+ step-by-step in the human cerebellum

I am following the tutorial SCENIC+ step-by-step in the human cerebellum, and I don't see where the linked file
/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/Imputed_accessibility.pkl is available. The same goes for /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl, /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/seurat/10x_multiome_brain_Seurat.loom, and the various files within /staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/.

Alternatively, the SCENIC+ object could be made available so one can start from there, maybe?

Is it possible to provide these files somewhere so the vignette can be followed directly, or have I overlooked something?
Thanks!

expected string or bytes-like object when running export_pseudobulk

Hello,
I got the following error when running export_pseudobulk() to generate pseudobulk ATAC-seq profiles. My data is seperate scRNA-seq and scATAC-seq from different cells but the same sample. All cell types have been annotated. Data types of "sample_id" and "cluster" columns are string.
Do you have any idea to solve that issue?
Many thanks in advance.


TypeError Traceback (most recent call last)
Input In [39], in <cell line: 5>()
3 ray.shutdown()
4 sys.stderr = open(os.devnull, "w") # silence stderr
----> 5 bw_paths, bed_paths = export_pseudobulk(input_data = cell_data_ATAC,
6 variable = 'cluster', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
7 sample_id_col = 'sample_id',
8 chromsizes = chromsizes,
9 bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored
10 bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
11 path_to_fragments = fragments_dict, # location of fragment files
12 n_cpu = 8, # specify the number of cores to use, we use ray for multi processing
13 normalize_bigwig = True,
14 remove_duplicates = True,
15 _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
16 split_pattern = '-')
17 sys.stderr = sys.stderr

File /vsc-hard-mounts/leuven-data/328/vsc32848/miniconda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/pseudobulk_peak_calling.py:128, in export_pseudobulk(input_data, variable, chromsizes, bed_path, bigwig_path, path_to_fragments, sample_id_col, n_cpu, normalize_bigwig, remove_duplicates, split_pattern, use_polars, **kwargs)
122 fragments_df = fragments_df.loc[
123 fragments_df["Name"].isin(cell_data["barcode"].tolist())
124 ]
125 else:
126 fragments_df = fragments_df.loc[
127 fragments_df["Name"].isin(
--> 128 prepare_tag_cells(cell_data.index.tolist(), split_pattern)
129 )
130 ]
131 fragments_df_dict[sample_id] = fragments_df
133 # Set groups

File /vsc-hard-mounts/leuven-data/328/vsc32848/miniconda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/utils.py:183, in prepare_tag_cells(cell_names, split_pattern)
181 def prepare_tag_cells(cell_names, split_pattern="___"):
182 if split_pattern == "-":
--> 183 new_cell_names = [
184 re.findall(r"^[ACGT]*-[0-9]+-", x)[0].rstrip("-")
185 if len(re.findall(r"^[ACGT]*-[0-9]+-", x)) != 0
186 else x
187 for x in cell_names
188 ]
189 new_cell_names = [
190 re.findall(r"^\w*-[0-9]*", new_cell_names[i])[0].rstrip("-")
191 if (len(re.findall(r"^\w*-[0-9]*", new_cell_names[i])) != 0)
(...)
194 for i in range(len(new_cell_names))
195 ]
196 else:

File /vsc-hard-mounts/leuven-data/328/vsc32848/miniconda/envs/scenicplus/lib/python3.8/site-packages/pycisTopic/utils.py:185, in <listcomp>(.0)
181 def prepare_tag_cells(cell_names, split_pattern="___"):
182 if split_pattern == "-":
183 new_cell_names = [
184 re.findall(r"^[ACGT]*-[0-9]+-", x)[0].rstrip("-")
--> 185 if len(re.findall(r"^[ACGT]*-[0-9]+-", x)) != 0
186 else x
187 for x in cell_names
188 ]
189 new_cell_names = [
190 re.findall(r"^\w*-[0-9]*", new_cell_names[i])[0].rstrip("-")
191 if (len(re.findall(r"^\w*-[0-9]*", new_cell_names[i])) != 0)
(...)
194 for i in range(len(new_cell_names))
195 ]
196 else:

File /vsc-hard-mounts/leuven-data/328/vsc32848/miniconda/envs/scenicplus/lib/python3.8/re.py:241, in findall(pattern, string, flags)
233 def findall(pattern, string, flags=0):
234 """Return a list of all non-overlapping matches in the string.
235
236 If one or more capturing groups are present in the pattern, return
(...)
239
240 Empty matches are included in the result."""
--> 241 return _compile(pattern, flags).findall(string)

TypeError: expected string or bytes-like object
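A hedged guess at the cause: prepare_tag_cells applies re.findall to every entry of the barcode index, so a single non-string value (e.g. a numeric barcode index) triggers exactly this TypeError. Coercing the index to strings before the call is a cheap thing to try:

# re.findall requires strings; make sure the barcode index is str-typed
cell_data_ATAC.index = cell_data_ATAC.index.astype(str)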

Issue with scenicplus.wrappers

Hi,

I've been working my way through the pbmc tutorial but have encountered a problem in the section "Motif enrichment analysis using pycistarget". When I try to import the run_pycistarget() command using
from scenicplus.wrappers.run_pycistarget import run_pycistarget
I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'scenicplus.wrappers'

From what I can see, the scenicplus package is imported without giving any errors. This is the session info:

>>> session_info.show(html=False) 

PIL                 9.2.0
adjustText          NA
gensim              4.2.0
harmonypy           NA
igraph              0.10.1
lda                 2.0.0
leidenalg           0.9.0
matplotlib          3.6.0
numpy               1.22.0
pandas              1.5.0
pybiomart           0.2.0
pycisTopic          1.0.2.dev9+gaf3977c
pycistarget         1.0.2.dev9+g0ed8289
pyranges            0.0.117
pyscenic            0.12.0+7.gc120979
ray                 2.0.0
requests            2.28.1
scanpy              1.9.1
scenicplus          NA
scipy               1.9.1
seaborn             0.12.0
session_info        1.0.0
sklearn             1.0.2
tmtoolkit           0.11.2
umap                0.5.3

IPython             8.0.1
jupyter_client      7.1.2
jupyter_core        4.9.1
notebook            6.4.8

Python 3.9.1 (default, Dec 11 2020, 06:28:49) [Clang 10.0.0 ]
macOS-10.16-x86_64-i386-64bit

Session information updated at 2022-10-14 08:00
>>> 

Thank you in advance!
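One quick diagnostic (the session info above reporting scenicplus NA hints at a stale or partial install): check which copy of the package the interpreter is actually importing. If the resolved path is an old site-packages install from before the wrappers subpackage existed, reinstalling from the cloned repository should fix it.

import os
import scenicplus

# verify the import resolves to a current install, not a stale copy
print(scenicplus.__file__)
print(os.listdir(os.path.dirname(scenicplus.__file__)))  # 'wrappers' should appear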

Exception: The embedding should have two columns. in _export_minimal_loom

When exporting a loom file from scenicplus, the exception Exception: The embedding should have two columns is thrown.
This is caused by the PCA dimensionality reduction having more than two columns; in this case we should take the first two.

(will commit a fix soon, writing this issue so I don't forget).
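For anyone patching locally before the fix lands, the described change is a one-liner on whichever dataframe holds the dimensionality reduction (the variable name here is assumed):

# keep only the first two components before handing the embedding to the
# loom exporter
embedding = embedding.iloc[:, :2]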
