Git Product home page Git Product logo

cnmf's Introduction

Consensus NMF (cNMF)

cNMF is a pipeline for inferring gene expression programs from scRNA-Seq. It takes a count matrix (N cells X G genes) as input and produces a (K x G) matrix of gene expression programs (GEPs) and a (N x K) matrix specifying the usage of each program for each cell in the data. Read more about the method in the publication and check out examples on simulated data and PBMCs.

We have also created a tutorial for running cNMF from R. See the Rmd notebook or the compiled html for this.

Installation

cNMF has been tested with Python 3.7 and 3.10 and requires scikit-learn>=1.0, scanpy>=1.8, and AnnData>=0.9

You can install with pip:

pip install cnmf

If you want to use the batch correction preprocessing, you also need to install the Python implementation of Harmony and scikit-misc

pip install harmonypy
pip install scikit-misc

Running cNMF

cNMF can be run from the command line without any parallelization using the example commands below:

cnmf prepare --output-dir ./example_data --name example_cNMF -c ./example_data/counts_prefiltered.txt -k 5 6 7 8 9 10 11 12 13 --n-iter 100 --seed 14

cnmf factorize --output-dir ./example_data --name example_cNMF --worker-index 0 --total-workers 1

cnmf combine --output-dir ./example_data --name example_cNMF

cnmf k_selection_plot --output-dir ./example_data --name example_cNMF

cnmf consensus --output-dir ./example_data --name example_cNMF --components 10 --local-density-threshold 0.01 --show-clustering

Or alternatively, the same steps can be run from within a Python environment using the commands below:

from cnmf import cNMF
cnmf_obj = cNMF(output_dir="./example_data", name="example_cNMF")
cnmf_obj.prepare(counts_fn="./example_data/counts_prefiltered.txt", components=np.arange(5,14), n_iter=100, seed=14)
cnmf_obj.factorize(worker_i=0, total_workers=1)
cnmf_obj.combine()
cnmf_obj.k_selection_plot()
cnmf_obj.consensus(k=10, density_threshold=0.01)
usage, spectra_scores, spectra_tpm, top_genes = cnmf_obj.load_results(K=10, density_threshold=0.01)

For the Python environment approach, usage will contain the usage matrix with each cell normalized to sum to 1. spectra_scores contains the gene_spectra_scores output (aka Z-score unit gene expression matrix), spectra_tpm contains the GEP spectra in units of TPM and top_genes contains an ordered list of the top 100 associated genes for each program.

Output data files will all be available in the ./example_data/example_cNMF directory including:

  • Z-score unit gene expression program matrix - example_data/example_cNMF/example_cNMF.gene_spectra_score.k_10.dt_0_01.txt
  • TPM unit gene expression program matrix - example_data/example_cNMF/example_cNMF.gene_spectra_tpm.k_10.dt_0_01.txt
  • Usage matrix - example_data/example_cNMF/example_cNMF.usages.k_10.dt_0_01.consensus.txt
  • K selection plot - example_data/example_cNMF/example_cNMF.k_selection.png
  • Clustergram diagnostic plot - example_data/example_cNMF/example_cNMF.clustering.k_10.dt_0_01.pdf

Some usage notes:

  • Parallelization: The factorize step can be parallelized with the --total-workers flag and then submitting multiple jobs, one per worker, indexed starting by 0. For example:
cnmf factorize --output-dir ./example_data --name example_cNMF --worker-index 0 --total-workers 3 &
cnmf factorize --output-dir ./example_data --name example_cNMF --worker-index 1 --total-workers 3 &
cnmf factorize --output-dir ./example_data --name example_cNMF --worker-index 2 --total-workers 3 &

would break the factorization jobs up into 3 batches and submit them independently. This can be used with compute clusters to run the factorizations in parallel (see tutorials for example).

  • Input data: Input data can be provided in 3 ways:
      1. as a scanpy file ending in .h5ad containg counts as the data feature. See the PBMC dataset tutorial for an example of how to generate the Scanpy object from the data provided by 10X. Because Scanpy uses sparse matrices by default, the .h5ad data structure can take up much less memory than the raw counts matrix and can be much faster to load.
      1. as a raw tab-delimited text file containing row labels with cell IDs (barcodes) and column labels as gene IDs
      1. as a 10x-Genomics-formatted mtx directory. You provide the path to the counts.mtx file or counts.mtx.gz file to counts_fn. It expects there to be barcodes.tsv and genes.tsv in the directory as well

See the tutorials or Stepwise_Guide.md for more details

Integration of technical variables and batches

We have implemented a pipeline to integrate batch variables prior to running cNMF and to handle ADTs in CITE-Seq. It uses an adaptation of Harmony that corrects the underlying count matrix rather than principal components. We describe it in our recent preprint. See the batch correction tutorial as well for an example.

We use a separate Preprocess class to run batch correction. You pass in an AnnData object, as well as harmony_vars, a list of the names of variables to correct correspond to columns in the AnnData obs attribute. You also specify an output file base name to save the results to like below:

from cnmf import cNMF, Preprocess
#Initialize the Preprocess object
p = Preprocess(random_seed=14)

#Batch correct the data and save the corrected high-variance gene data to adata_c, and the TPM normalized data to adata_tpm 
(adata_c, adata_tpm, hvgs) = p.preprocess_for_cnmf(adata, harmony_vars=['Sex', 'Sample'], n_top_rna_genes = 2000, librarysize_targetsum= 1e6,
                                                    save_output_base='./example_islets/batchcorrect_example_sex')

#Then run cNMF passing in the corrected counts file, tpm_fn, and HVGs as inputs
cnmf_obj_corrected = cNMF(output_dir='./example_islets', name='BatchCorrected')
cnmf_obj_corrected.prepare(counts_fn='./example_islets/batchcorrect_example.Corrected.HVG.Varnorm.h5ad',
                           tpm_fn='./example_islets/batchcorrect_example.TP10K.h5ad',
                           genes_file='./example_islets/batchcorrect_example.Corrected.HVGs.txt',
                           components=[15], n_iter=20, seed=14, num_highvar_genes=2000)

#Then proceed with the rest of cNMF as normal

Change log

New in version 1.5

  • Fixed bug in detecting and printing cells with 0 counts of overdispersed genes
  • Added option in load_results() to return normalized or unnormalized usage.
  • Added a Preprocess class to batch correct data prior to cNMF. See the added Tutorial analyze_batcheffectcorrect_BaronEtAl.ipynb to illustrate its basic usage.
  • Now accepts 10x formatted .mtx directories (containing counts.mtx, barcodes.tsv, and genes.tsv files)

New in version 1.4

  • Usage is re-fit a final time from gene_spectra_tpm which increases accuracy in simulations
  • Use cnmf_obj.load_results(K=, density_threshold=) to obtain usage, spectra_scores, spectra_tpm, and top_genes matrices
  • cnmf_obj.combine() now has a skip_missing_files=True/False option to skip incomplete factorize iterations
  • GEPs are now ordered by maximum total usage
  • Now detects and errors when 0 counts of HVGs with interpretable error message

New in version 1.3

  • Installation via pip
  • Object oriented interface for Python users and command line script option via cnmf

New in version 1.2

  • Increased the threshold for ignoring genes with low mean expression for determining high-variance genes from a TPM of 0.01 to 0.5. Some users were identifying uninterpretable programs with very low usage except in a tiny number of cells. We suspect that this was due to including genes as high-variance that are detected in a small number of cells. This change in the default parameter will help offset that problem in most cases.
  • Updated import of NMF for compatibility with scikit-learn versions >22
  • Colorbar for heatmaps included with consensus matrix plot

New in version 1.1

  • Now operates by default on sparse matrices. Use --densify option in prepare step if data is not sparse
  • Now takes Scanpy AnnData object files (.h5ad) as input
  • Now has option to use KL divergence beta_loss instead of Frobenius. Frobenius is the default because it is much faster.
  • Includes a Docker file for creating a Docker container to run cNMF in parallel with cloud compute
  • Includes a tutorial on a simple PBMC dataset
  • Other minor fixes

cnmf's People

Contributors

dylkot avatar michelle-curtis avatar rlannes avatar scottgigante-immunai avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

cnmf's Issues

bulk RNAseq

Hi!
I was wandering whether this nice tool could also be applied to bulk RNAseq data, with a number of samples ranging from 100 to 1000.
Thank you very much!

Fcntl.c: loadable library and perl binaries are mismatched (got handshake key 0xdb80080, needed 0xde00080)

When I run the code as nohup parallel python /sibcb1/mxqianlab1/zfli/apps/cNMF/cnmf.py factorize --output-dir /sibcb1/mxqianlab1/zfli/MM_analysis/cnmf/filter_tumor/ZHQG_tumor --name ZHQG --worker-index {} ::: 0 1 2 3 4 5 6 7 8 9 10 11, the nohup.out always report Fcntl.c: loadable library and perl binaries are mismatched (got handshake key 0xdb80080, needed 0xde00080).
Can you help me with this issue?

multi-proccessing from script?

Hi,
first, thank you for providing this tools to the community.
I was wondering if there is any way to use multiprocessing from inside a script?
Looking at the code when using factorize, it parse the parameters and return a list of execution for the i-worker.
But I can't pass a list of i-worker, basically meaning that for in script I have to do analyses sequentially.

Is there any way to achieve multiprocessing from inside a python script? (more convenient)
I ended up hacking a bit using the following:

from multiprocessing import Pool 

def factorize_mp_signature(args):
    args[2].factorize(worker_i=args[0],  total_workers=args[1])
    

def wrapper_factorize_mp(total_workers, cnmf):
    list_args = [(x, total_workers, cnmf) for x in range(total_workers)]

    with Pool(total_workers) as p:
        p.map(factorize_mp_signature, list_args)
        p.join()

wrapper_factorize_mp(10, cnmf_obj)

Do you think that is appropriate? If you accept pull request, I could make one.
By having a dedicated function, I believe this is easier to integrate and won't compromise the rest of your code base.
And adding it as a method of the cNMF class I could pass self directly in wrapper_factorize_mp simplifying the API.

stability/error diagnostic plot interpretation

Hello! Thank you so much for making such a wonderful package.
I was hoping to receive some support for the stability/error diagnostic plot interpretation to select K.
For example, given the following plot:
image

Why was K=24 selected instead of K=22?
I know the paper affirms that programs are pretty stable across a range of K's but I was interested in the team's explanation.
Thank you!!

applying cNMF to multiple datasets?

How would you recommend to proceed with multiple samples? In the original manuscript, you conclude that it is better to perform any kind of batch correction before applying cNMF, correct?

Documentation issue with multiple workers and factorization?

Hi,

In the documentation and examples, it says
" However, if you specified more than 1 total worker, you would need to run the commands for those workers as well with separate commands, E.g.:

python ./cnmf.py factorize --output-dir ./example_data --name example_cNMF --worker-index 1
python ./cnmf.py factorize --output-dir ./example_data --name example_cNMF --worker-index 2
...
"

When I look at the code, the default number of workers is 1.
parser.add_argument('--total-workers', type=int, help='[all] Total number of workers to distribute jobs to', default=1)

This gets passed into the function:
cnmf_obj.run_nmf(worker_i=args.worker_index, total_workers=args.total_workers)

This same strategy appears in the simulated data notebook:

qsub -cwd -b y -l h_vmem=2g,h_rt=3:00:00 -o ./log -e ./log -N cnmf -t 1-5 'python ../cnmf.py factorize --output-dir ./simulated_example_data --name example_cNMF --worker-index $SGE_TASK_ID'

This results in each worker running all the factorization jobs. It might make sense to update the examples to pass in the --total-workers argument for the UGER factorization examples. With that added argument, the code runs fine.

I was trying to figure out why a) The code is taking a while to run and b) factorization files were being overwritten with data from newer timestamps when from the code you write the output a single time per K/iteration.

where is cnmf.py located

Dear author,

Does conda have to be upgraded before install the virtual environment? I did not.

Here is my commands
$ conda create -n cnmf_env2 --yes --channel bioconda --channel conda-forge --channel defaults python==3.7 fastcluster==1.1.26 matplotlib==3.3.2 numpy==1.19.2 palettable==3.3.0 pandas==1.1.3 scipy==1.5.2 scikit-learn==0.23.2 pyyaml==5.3.1 scanpy==1.6.0 parallel
Collecting package metadata (repodata.json): done
Solving environment: done
........
==> WARNING: A newer version of conda exists. <==
current version: 4.9.2
latest version: 4.10.1

Please update conda by running

$ conda update -n base conda

Package Plan

environment location: /rsrch3/home/itops/ryao/.conda/envs/cnmf_env2

added / updated specs:
- fastcluster==1.1.26
- matplotlib==3.3.2
- numpy==1.19.2
- palettable==3.3.0
- pandas==1.1.3
- parallel
- python==3.7
- pyyaml==5.3.1
- scanpy==1.6.0
- scikit-learn==0.23.2
- scipy==1.5.2

Installation goes without an error message, the packages installed in my local .conda environment.

after activate the conda environment,
(cnmf_env2) [ryao@ldragon2 cnmf_env2]$ ls -lrt /rsrch3/home/itops/ryao/.conda/envs/cnmf_env2
total 205
drwx------ 4 ryao rists 4096 May 28 15:09 man
drwx------ 3 ryao rists 4096 May 28 15:09 ssl
drwx------ 3 ryao rists 4096 May 28 15:09 var
drwx------ 2 ryao rists 4096 May 28 15:09 compiler_compat
drwx------ 2 ryao rists 4096 May 28 15:09 lib64
drwx------ 6 ryao rists 4096 May 28 15:09 etc
drwx------ 3 ryao rists 4096 May 28 15:09 doc
drwx------ 36 ryao rists 16384 May 28 15:09 include
drwx------ 15 ryao rists 4096 May 28 15:09 plugins
drwx------ 28 ryao rists 32768 May 28 15:10 lib
drwx------ 4 ryao rists 4096 May 28 15:10 libexec
drwx------ 97 ryao rists 4096 May 28 15:10 mkspecs
drwx------ 2 ryao rists 4096 May 28 15:10 phrasebooks
drwx------ 14 ryao rists 4096 May 28 15:10 qml
drwx------ 2 ryao rists 16384 May 28 15:10 translations
drwx------ 27 ryao rists 4096 May 28 15:10 share
drwx------ 2 ryao rists 16384 May 28 15:10 bin
drwx------ 2 ryao rists 16384 May 28 15:10 conda-meta

(cnmf_env2) [ryao@ldragon2 cnmf_env2]$ find . -name cnmf.py
but I can't find the file, thus the first step fails:

(cnmf_env2) [ryao@ldragon2 ~]$ python cnmf.py --help
(null): can't open file 'cnmf.py': [Errno 2] No such file or directory

Please advise, Thank you!
Rong Yao

NaNs or similar during final steps

Hi cNMF authors,

I am having some trouble with the last two steps. The error is:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

Sometimes it affects stability plotting, sometimes the consensus step, and sometimes both or neither. Here is a minimal example where both are affected.

cnmf_nan.zip

Thanks for making this great tool public!

Is there a way to run the cNMF package within R?

I am very interested in applying this cNMF package to single cell datasets within the R programming language,

  1. does a version of cNMF exist that is written in R?
  2. can the current cNMF package be effectively accessed and run within the R programing environment?

jitdebug

Hello, thanks for this fantastic tool. I'm running cNMF on my single cell data but met with the error at cnmf prepare say that "TypeError: create_target_machine() got an unexpected keyword argument 'jitdebug''. What was the problem here?

Thanks,
Bofei

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Hi,
Thanks for developing this user-friendly tool. I met an question when i runing the code :
python3 ./cNMF/cnmf.py prepare --output-dir ./output_cnmf/p121_2 --name p121_2 -c ./p121_2_counts.txt -k 4 5 6 7 8 9 10 11 12 13 --n-iter 300 --total-workers 8 --seed 14 --numgenes 2000

The error is :
/home/abc/miniconda3/envs/cnmf_env/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py:806: UserWarning: Revieved a view of an AnnData. Making a copy.
view_to_actual(adata)
Traceback (most recent call last):
File "./cNMF/cnmf.py", line 868, in
high_variance_genes_filter=highvargenes)
File "./cNMF/cnmf.py", line 324, in get_norm_counts
examples = norm_counts.obs.index[zerocells]
File "/home/abc/miniconda3/envs/cnmf_env/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4111, in getitem
result = getitem(key)
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

I sincerely hope to get your kind help.

use prenormalized matrix for cNMF

hii i have a pre normalized matrix to compute NMF, if i want to use this matrix do i have to input both the "-c" parameter and "--tpm" in the cnmf prepare function?

How to isolate RPL/RPS/MT- genes from an "identity" program?

I'm reading your paper about consensus non-negative matrix factorization (cNMF)to distill biologically interpretable gene programs. Thank you very much for your work and the tool you developed.

I'm very interested in interpreting my single cell data with your tool. But questions arised in pratice which I also noticed in your tutorial provided here . (Tutorial for running cNMF on the 10X PBMC dataset used in the SEURAT and SCANPY tutorials.)

First, I found you defined GEP1 in Out[60] as a T-cell identity program. Although I did see few genes like CD3E, CD3D in GEP1 and I believed the usages of the program across cells in In [58] is highly concordant with the presentation of T-cell population, It's hard not to notice the over-representation of ribosomal protein genes (PRL/RPS...) in GEP1.
To my knowledge and in the language of your paper, RPL/RPS genes form an "activity program" in charge of protein synthesis and should be widely used by different type of cells. It's a little bit weird to see T-cell marker genes and RPL genes mixed together in a GEP.

Second, I noticed the same issue in my data. In short, I have two tumor samples (A & B). I assumed:
Sample A has "identity program" 1 and "identity program" 2 (in abbr A1, A2), and
Sample B has "identity program" 2 (in abbr B2) .
In pratice with cNMF, I found:
A1 has "1" marker genes with RPL/RPS genes ;
A2 has "2" marker genes.
B2 has "2" marker genes and RPL/RPS genes.
I further performed coorelation and clustering on those 3 programs. Although I want to see programs with similar identity (A1 and A2) highly correlated with each other and cluster together but instead the result showed B2 has higher Pearson correlation with A1 than with A2, and A1 B2 clustered together.
I guess it is because RPL/RPS dominate A1 and B2, which makes them look like protein synthesis "activity program" rather than 2 distinct "identity program".

It wouldn't be a big deal if I just want to determine cell identity in one sample, but it caused some trouble when I analysed multiple samples and put all the GEPs togther to find something in common. Like A1 and B2 cluster together for their similarity in activity, but what I really want to see is A2 and B2 cluster together for their similarity in identity.

I wonder if RPL/RPS genes could be seperated from A1 and form an independent activity program, leaving a pure A1 identity program with "1" marker genes. I think it will make the program easier to interprete and more comparable between samples. But currently I don't know how to achieve that goal with cNMF.
Is removing ^RPS/^RPL/^MPRS/^MRPL genes (or ^MT- genes) from the count matrix a sensible preprocessing step in cNMF?

I'm not an expert in machine learning and statistics science and I'm not sure if my issue sounds naive to you. Any comments and suggestions are highly appreciated! Thank you!

Best.
Wang.

scikit-learn version and dtype issues

Hi,

Thank you for the useful tool!
This is just a bug report, but cnmf.py factorize failed with scikit-learn v0.24.2. I found the keyword arguments alpha_H and alpha_W to sklearn.decomposition.non_negative_factorization() in several parts of the source code that may be supported only by scikit-learn v1. I updated scikit-learn and it worked. Though v0.23.2 is specified in the manual, it'd be better to fix that part if I'm correct.

Also, the dtype=float64 issue showed up again. Thanks to the previous discussions around #9, I put

norm_counts.X = norm_counts.X.astype(np.float64)

to line 631 and now it's working.

Nan issue when linearly regressing out batch and zeroing negative values

Hi, I wanted to try linearly regressing out batch from a TPM counts matrix and then zeroing negative values as the input for cNMF. I am getting the following error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). It's a bit strange because my TPM matrix does not have NA values and I flagged it as a TPM matrix in the prepare command. I can post my command below:

cnmf prepare --output-dir ./ --name $PARAM2 -c $PARAM1 -k 35 40 45 50 55 60 --n-iter 500 --seed 14 --numgenes 2000 --tpm $PARAM1 --total-workers 50

This issue is linked to this issue I posted earlier: #47
Also linked to this issue: #1

when running consensus step, i got a ValueError

command:
python ./cNMF/cnmf.py consensus --output-dir subEpithelial --name subEpithelial_cNMF --components 10 --local-density-threshold 0.01 --show-clustering

Traceback (most recent call last):
File "/softs/cNMF/cnmf.py", line 756, in
cnmf_obj.consensus(k, args.local_density_threshold, args.local_neighborhood_size, args.show_clustering, args.stats_
File "/softs/cNMF/cnmf.py", line 523, in consensus
_, spectra_tpm = self._nmf(tpm.T, nmf_kwargs=fit_tpm_nmf_kwargs, topic_labels=np.arange(1,k+1))
File "/softs/cNMF/cnmf.py", line 285, in _nmf
(W, H, niter) = non_negative_factorization(X.values, **nmf_kwargs)
File "/softs/conda/anaconda/envs/cnmf/lib/python3.6/site-packages/sklearn/decomposition/nmf.py", line 10orization
_check_init(H, (n_components, n_features), "NMF (input H)")
File "/softs/conda/anaconda/envs/cnmf/lib/python3.6/site-packages/sklearn/decomposition/nmf.py", line 60
A = check_array(A)
File "/softs/conda/anaconda/envs/cnmf/lib/python3.6/site-packages/sklearn/utils/validation.py", line 568
allow_nan=force_all_finite == 'allow-nan')
File "/softs/conda/anaconda/envs/cnmf/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56,
raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

this error i slove by edit cnmf.py:
line494:
norm_usages = rf_usages.div(rf_usages.sum(axis=1), axis=0)

changed:
norm_usages = rf_usages.div(rf_usages.sum(axis=1), axis=0)
norm_usages = norm_usages.fillna(0)

Frobenius error

Hi,
I've runned cNMF on my dataset with 200 iteration, in selecting K i decided to start from k = 10 (the analysis was runned between 2<k<20).
K= 10 have in my case the greater stability 0.875 (i know that could not be the best but i take that as a good start point). For what concerne the error it has high value 8.15 with respect to your work (see the plot).
cNMF values
I'm not understanding very well the nature of the error computation and how to read the error values, they decrease but looking at my values they are too high? I read the referenced paper but i still have some difficulties. could you help me with the interpretation?
Thanks

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Hi,
I am getting this error, I tried reinstalling but couldn't resolve this issue. I have removed all the low counts and even tried giving TPMs, but failed in running this successfully.
Please let me know how best I can resolve this, I am not a programmer, I am a cancer biologist and trust me I have gone through each line of the code but failed to understand where I am going wrong. I request you to have a look at this, it's a beautiful tool and will help us in resolving so many things.
Thank you.
Dr Ateeq


sc.pp.filter_cells(adata, min_genes=200) # filter cells with fewer than 200 genes
sc.pp.filter_cells(adata, min_counts=200)  # This is a weaker threshold than above. It is just to population the n_counts column in adata
sc.pp.filter_genes(adata, min_cells=3) # filter genes detected in fewer than 3 cells

sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e6) # TPM normalisation

python /Users/akhaliq/Desktop/nmf/script/cnmf.py prepare --output-dir ./ --name cNMF_epi -c tpm_counts.tsv -k 5 6 7 8 9 10 11 --n-iter 500 --seed 14 --numgenes 2000 --tpm tpm_counts.tsv
/Users/akhaliq/Desktop/nmf/script/cnmf.py:806: FutureWarning: X.dtype being converted to np.float32 from int64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  var=pd.DataFrame(index=input_counts.columns))
/Users/akhaliq/Desktop/nmf/script/cnmf.py:831: FutureWarning: X.dtype being converted to np.float32 from int64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass `AnnData(X, dtype=X.dtype, ...)` to get the future behavour.
  var=pd.DataFrame(index=tpm.columns))
/Users/akhaliq/miniconda3/envs/cnmf_env/lib/python3.7/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)
Traceback (most recent call last):
  File "/Users/akhaliq/Desktop/nmf/script/cnmf.py", line 853, in <module>
    high_variance_genes_filter=highvargenes)
  File "/Users/akhaliq/Desktop/nmf/script/cnmf.py", line 328, in get_norm_counts
    examples = norm_counts.obs.index[zerocells]
  File "/Users/akhaliq/miniconda3/envs/cnmf_env/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 4307, in __getitem__
    result = getitem(key)
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Best,
Dr AMK

get_highvar_genes_sparse function

Hi,
This function uses two variables that are not defined: "gene_counts_fano" and "gene_counts_mean" . Shouldn't these be "gene_mean" and "gene_fano" instead?
Thanks.

Error at combining stage: invalid literal/lengths set

Hi @dylkot,

Thank you for putting together a great tutorial for your cNMF package. I am encountering the following error at the cnmf_obj.combine() step:

`
Combining factorizations for k=5.
Combining factorizations for k=6.
Combining factorizations for k=7.
Combining factorizations for k=8.

error Traceback (most recent call last)
/var/folders/sd/b0wjprrj19bdz5sw2smwp62m0000gq/T/ipykernel_73413/3939318147.py in
----> 1 cnmf_obj.combine()

~/opt/anaconda3/lib/python3.9/site-packages/cnmf/cnmf.py in combine(self, components)
373
374 for k in ks:
--> 375 self.combine_nmf(k)
376
377

~/opt/anaconda3/lib/python3.9/site-packages/cnmf/cnmf.py in combine_nmf(self, k, remove_individual_iterations)
596 for i,p in run_params_subset.iterrows():
597
--> 598 spectra = load_df_from_npz(self.paths['iter_spectra'] % (p['n_components'], p['iter']))
599 if combined_spectra is None:
600 combined_spectra = np.zeros((n_iter, k, spectra.shape[1]))

~/opt/anaconda3/lib/python3.9/site-packages/cnmf/cnmf.py in load_df_from_npz(filename)
33 def load_df_from_npz(filename):
34 with np.load(filename, allow_pickle=True) as f:
---> 35 obj = pd.DataFrame(**f)
36 return obj
37

~/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/npyio.py in getitem(self, key)
252 if magic == format.MAGIC_PREFIX:
253 bytes = self.zip.open(key)
--> 254 return format.read_array(bytes,
255 allow_pickle=self.allow_pickle,
256 pickle_kwargs=self.pickle_kwargs)

~/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
773 read_count = min(max_read_count, count - i)
774 read_size = int(read_count * dtype.itemsize)
--> 775 data = _read_bytes(fp, read_size, "array data")
776 array[i:i+read_count] = numpy.frombuffer(data, dtype=dtype,
777 count=read_count)

~/opt/anaconda3/lib/python3.9/site-packages/numpy/lib/format.py in _read_bytes(fp, size, error_template)
902 # done about that. note that regular files can't be non-blocking
903 try:
--> 904 r = fp.read(size - len(data))
905 data += r
906 if len(r) == 0 or len(data) == size:

~/opt/anaconda3/lib/python3.9/zipfile.py in read(self, n)
920 self._offset = 0
921 while n > 0 and not self._eof:
--> 922 data = self._read1(n)
923 if n < len(data):
924 self._readbuffer = data

~/opt/anaconda3/lib/python3.9/zipfile.py in _read1(self, n)
996 elif self._compress_type == ZIP_DEFLATED:
997 n = max(n, self.MIN_READ_SIZE)
--> 998 data = self._decompressor.decompress(data, n)
999 self._eof = (self._decompressor.eof or
1000 self._compress_left <= 0 and

error: Error -3 while decompressing data: invalid literal/lengths set`

The factorize step ran successfully with no errors or warnings. When I run the combine step, it looks like it successfully combines factorizations for k=5 through k=7, but the error occurs when trying to combine factorizations for either k=8 or k=9. How should I best proceed?

ValueError: negative dimensions are not allowed

I am attempting to deploy a private instance. I am getting the following error

i can run my other data success, so i believe there is something wrong with my matrix.
code
`replicate_spectra=df
l2_norms = (replicate_spectra**2).sum(axis=1).apply(np.sqrt)
l2_spectra = replicate_spectra.div(l2_norms, axis=0)
topics_dist = squareform(fast_euclidean(l2_spectra.values))
kmeans_model = KMeans(n_clusters=51, n_init=10, random_state=1)
kmeans_model.fit(l2_spectra)
kmeans_cluster_labels = pd.Series(kmeans_model.labels_+1, index=l2_spectra.index)
spectra_order = get_clustergram_ordering(topics_dist, kmeans_cluster_labels)
topics_dist = topics_dist[spectra_order, :][:, spectra_order]
kmeans_cluster_labels = kmeans_cluster_labels.iloc[spectra_order]

Permute the cluster order so that the colors aren't sequential

permmap = get_shuffle(51, 5)
kmeans_cluster_labels_perm = kmeans_cluster_labels.apply(lambda x: permmap[x])
(heatmapax, topax, leftax, heatmapfig, heatmapcbar, fig) = labeled_heatmap(topics_dist, kmeans_cluster_labels_perm.values,
1, figsize=(3.,3.), dpi=300)
heatmapax.set_xlabel('NMF Components', fontsize=12)
leftax.set_ylabel('NMF Components', fontsize=12)
topax.set_title('Pre-filtered Clustering', fontsize=13)`

ValueError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26520/2907498122.py in
6 kmeans_model.fit(l2_spectra)
7 kmeans_cluster_labels = pd.Series(kmeans_model.labels_+1, index=l2_spectra.index)
----> 8 spectra_order = get_clustergram_ordering(topics_dist, kmeans_cluster_labels)
9 topics_dist = topics_dist[spectra_order, :][:, spectra_order]
10 kmeans_cluster_labels = kmeans_cluster_labels.iloc[spectra_order]

~\AppData\Local\Temp/ipykernel_26520/2720665459.py in get_clustergram_ordering(topic_dist, kmeans_cluster_labels)
8 cl_dist = squareform(topics_dist[cl_filter, :][:, cl_filter])
9 cl_dist[cl_dist < 0] = 0 #Rarely get floating point arithmetic issues
---> 10 cl_link = linkage(cl_dist, 'average')
11 cl_leaves_order = leaves_list(cl_link)
12

E:\conda\miniconda\envs\cnmf_env\lib\site-packages\fastcluster.py in linkage(X, method, metric, preserve_input)
243 X = pdist(X, metric=metric)
244 X = array(X, dtype=double, copy=False, order='C', subok=True)
--> 245 Z = empty((N-1,4))
246 if N > 1:
247 linkage_wrap(N, X, Z, mthidx[method])

ValueError: negative dimensions are not allowed

whats wrong with my data? thank you!

cNMF produces warnings when writing npz files.

Hi,

I'm trying to run cNMF using the conda env as explained in the readme, and I'm encountering a warning. I believe there's a very easy fix for this that I'm happy to submit a pull request for if you're interested.

  1. The warning (factorize):
    UserWarning: metadata on a dtype may be saved or ignored, but will raise if saved when read. Use another form of storage.
    d['descr'] = dtype_to_descr(array.dtype)

Reproduce the error (at Broad on UGER interactive node):

Create conda env [don't include jupyter notebook]

conda create -n cnmf_env_baseline --yes --channel bioconda --channel conda-forge --channel defaults python==3.7 fastcluster==1.1.26 matplotlib==3.3.2 numpy==1.19.2 palettable==3.3.0 pandas==1.1.3 scipy==1.5.2 scikit-learn==0.23.2 pyyaml==5.3.1 scanpy==1.6.0 parallel && conda clean --yes --all

conda activate cnmf_env_baseline

Run cNMF steps:

python ${cNMF_dir}/cnmf.py prepare --output-dir ${outDir} --name ${outName} -c /path/to/small.transposed_filtered_dge.txt -k 5 6 7
--n-iter 10 --total-workers 1 --seed 14 --numgenes 2000

python ${cNMF_dir}/cnmf.py factorize --output-dir ${outDir} --name ${outName} --worker-index 0

  1. The warning:
    When running factorize, it looks like there's a problem with save_df_to_npz, which shows up as a warning:

UserWarning: metadata on a dtype may be saved or ignored, but will raise if saved when read. Use another form of storage.
d['descr'] = dtype_to_descr(array.dtype)

In testing and slightly modifying your code, there's a quite fix to the save_df_to_npz function:
np.savez_compressed(filename, data=list(obj.values), index=list(obj.index.values), columns=list(obj.columns))

This converts the nparray to a simple list, to avoid the problem with encoding the dtype metadata.

This warning occurs in other functions (eg: consensus). I'm happy to share the input data with you directly if that's helpful.

k_selection_plot and consensus - TypeError: H should have the same dtype as X.

Hello,

I'm encountering the following error while running k_selection_plot.

I'm working with RNA-Seq data from multiple ~homogeneous tissue samples, rather than scRNA-Seq.

Here are my commands leading up to calling k_selection_plot:

python $CNMF prepare --output-dir $OUTPUT_DIR \
--name $OUTPUT_PREFIX \
-c $COUNTS \
--tpm $TPM \
-k $K_LIST \
--n-iter $N_ITER \
--total-workers $TOTAL_WORKERS \
--seed $SEED \
--densify

# This ran as a SLURM array job
python $CNMF factorize --output-dir $OUTPUT_DIR \
--name $OUTPUT_PREFIX \
--worker-index $WORKER_ID

python $CNMF combine --output-dir $OUTPUT_DIR \
--name $OUTPUT_PREFIX

And here is the traceback for the error after the call:

python $CNMF k_selection_plot --output-dir $OUTPUT_DIR \
--name $OUTPUT_PREFIX
Traceback (most recent call last):
  File "/cellar/users/dlaub/.programs/cnmf/cnmf.py", line 883, in <module>
    cnmf_obj.k_selection_plot()
  File "/cellar/users/dlaub/.programs/cnmf/cnmf.py", line 696, in k_selection_plot
    stats.append(self.consensus(k, skip_density_and_return_after_stats=True).stats)
  File "/cellar/users/dlaub/.programs/cnmf/cnmf.py", line 556, in consensus
    nmf_kwargs=refit_nmf_kwargs)
  File "/cellar/users/dlaub/.programs/cnmf/cnmf.py", line 410, in _nmf
    (usages, spectra, niter) = non_negative_factorization(X, **nmf_kwargs)
  File "/cellar/users/dlaub/anaconda3/envs/py3/lib/python3.7/site-packages/sklearn/utils/validation.py", line 73, in inner_f
    return f(**kwargs)
  File "/cellar/users/dlaub/anaconda3/envs/py3/lib/python3.7/site-packages/sklearn/decomposition/_nmf.py", line 1044, in non_negative_factorization
    "{}.".format(H.dtype))
TypeError: H should have the same dtype as X. Got H.dtype = float64.

Could I get some help with this? Thanks!

error in the tutorial: "high is out of bounds for int32"

Hello,
When running the tutorial notebook analyze_pbmc_example_data.ipynb
I get the following error:

ValueError Traceback (most recent call last)
~\AppD```

ata\Local\Temp\ipykernel_25820\2126166754.py in
1 ## Prepare the data, I.e. subset to 2000 high-variance genes, and variance normalize
----> 2 cnmf_obj.prepare(counts_fn=countfn, components=np.arange(5,11), n_iter=20, seed=14, num_highvar_genes=2000)

~\PycharmProjects\cNMF\src\cnmf\cnmf.py in prepare(self, counts_fn, components, n_iter, densify, tpm_fn, seed, beta_loss, num_highvar_genes, genes_file)
358
359 self.save_norm_counts(norm_counts)
--> 360 (replicate_params, run_params) = self.get_nmf_iter_params(ks=components, n_iter=n_iter, random_state_seed=seed, beta_loss=beta_loss)
361 self.save_nmf_iter_params(replicate_params, run_params)
362

~\PycharmProjects\cNMF\src\cnmf\cnmf.py in get_nmf_iter_params(self, ks, n_iter, random_state_seed, beta_loss)
484
485 np.random.seed(seed=random_state_seed)
--> 486 nmf_seeds = np.random.randint(low=1, high=(2**32)-1, size=n_runs)
487
488 replicate_params = []

mtrand.pyx in numpy.random.mtrand.RandomState.randint()

_bounded_integers.pyx in numpy.random._bounded_integers._rand_int32()

ValueError: high is out of bounds for int32

The issue can be solved by adding "dtype=np.int64" at line 486 in cNMF\src\cnmf\cnmf.py

  nmf_seeds = np.random.randint(low=1, high=(2**32)-1, size=n_runs, dtype=np.int64)

Ron

Running cNMF for multiple batches?

Dear Dylan,

Do you have any recommendations for running cNMF for a dataset with multiple batches?
Does it make sense to run cNMF separately for each batch and then perform quantile normalization on the GEP matrix?

Regards,
Mikhael

which matrix should NMF use in single cell RNA seq data to find diferential gene program?

Hi there,

I'm new to scRNA-seq(use the seurat pipeline to analysis) and nmf.

Recently, I'm going to do nmf in the scRNA-seq to find the diferent programs(like markers for some cells).

But I don't know which matrix should me use to do nmf, normalized counts or scaled counts?

And how to choose the factorization rank in nmf?

Does anyone have experiences? Thanks for your help!

cnmf 1.4 is not on pip

The latest version on pip seems to be 1.3.4 so the function cnmf_obj.load_results part of the tutorial cannot be used.

How to assess the similarity between programs across samples

Hi cNMF team,
Thanks for developing a nice tools! I extracted different programs in serveral samples and want to assess the program similarity. I noticed there are two gene expression program matrix (Zscore and TPM). Which expression matrix is suitable for calculating the correlation between programs in different samples.

How to comprehend the usage matrix for cells?

Hi authors!
Thanks for creating such a useful tool!
I have a question concerning the calculation of usage matrix.

  1. Did you consider all of genes in the input matrix (that is highly variable genes) when imputing usage matrix? Will I get different results when using input matrixs including 2000 or 5000 highly variable genes ?
  2. In your paper, I've read words as follows.
    We hypothesized that we could infer activity GEPs directly from variation in single-cell expression profiles using matrix factorization. In this context, matrix factorization would model the gene expression data matrix as the product of two lower rank matrices, one encoding the relative contribution of each gene to each program, and a second specifying the proportions in which the programs are combined for each cell. We refer to the second matrix as a ‘usage’ matrix as it specifies how much each GEP is ‘used’ by each cell in the dataset
    What does "the proportions in which the programs are combined for each cell" mean? It seems that the sum of all programs in a cell does not equal to 100%.

Looking forward to your reply.

How to avoid the batch effect?

Hi cNMF authors,
I noticed that you strongly recommend using the original matrix as the input object.
Will the batch effect affect the result?
Best
Shizhe Yu

Log Transformation

Hello,

Thank you for the amazing tool!

Will I see substantially worse results if I input log transformed tpm data?

k_selection_plot Error: Last 2 dimensions of the array must be square

Hey there--thanks for developing this! Really excited to explore it, but I keep hitting an error during k_selection_plot

I exported a count matrix (1775 cells in rows, genes in columns) from Seurat (only cells/genes with sum != 0) as a tab-delimited text file. The first few steps seem to work fine

python ~/Software/cNMF/cnmf.py prepare --output-dir ~/Downloads/cnmf_test --name test_cNMF -c ~/Downloads/id8_exp.txt -k 3 4 5 6 7  --n-iter 100 --total-workers 1 --seed 14 --numgenes 2000
python ~/Software/cNMF/cnmf.py factorize --output-dir ~/Downloads/cnmf_test --name test_cNMF --worker-index 0
python ~/Software/cNMF/cnmf.py combine --output-dir ~/Downloads/cnmf_test --name test_cNMF

But then I hit an error during the k_selection_plot run:

python ~/Software/cNMF/cnmf.py k_selection_plot --output-dir ~/Downloads/cnmf_test --name test_cNMF
Traceback (most recent call last):
  File "/Users/dpcook/Software/cNMF/cnmf.py", line 883, in <module>
    cnmf_obj.k_selection_plot()
  File "/Users/dpcook/Software/cNMF/cnmf.py", line 696, in k_selection_plot
    stats.append(self.consensus(k, skip_density_and_return_after_stats=True).stats)
  File "/Users/dpcook/Software/cNMF/cnmf.py", line 562, in consensus
    prediction_error = ((norm_counts.X.todense() - rf_pred_norm_counts)**2).sum().sum()
  File "/Users/dpcook/miniconda3/envs/cnmf_env/lib/python3.6/site-packages/numpy/matrixlib/defmatrix.py", line 233, in __pow__
    return matrix_power(self, other)
  File "<__array_function__ internals>", line 6, in matrix_power
  File "/Users/dpcook/miniconda3/envs/cnmf_env/lib/python3.6/site-packages/numpy/linalg/linalg.py", line 626, in matrix_power
    _assertNdSquareness(a)
  File "/Users/dpcook/miniconda3/envs/cnmf_env/lib/python3.6/site-packages/numpy/linalg/linalg.py", line 213, in _assertNdSquareness
    raise LinAlgError('Last 2 dimensions of the array must be square')
numpy.linalg.LinAlgError: Last 2 dimensions of the array must be square

Any thoughts about what could be causing this? I can't find anything strange with the expression matrix, but maybe I'm missing something.

In case it helps to see the matrix, I've uploaded it to a Google Drive here

I appreciate the help!

David

The consense step showing below error

Hi developer,

python3.7

Traceback (most recent call last):
File "/share/nas1/Data/PipeLine/Other/NMF/NMF-py/v2/bin/cNMF-master/cnmf.py", line 908, in
cnmf_obj.consensus(k, args.local_density_threshold, args.local_neighborhood_size, args.show_clustering)
File "/share/nas1/Data/PipeLine/Other/NMF/NMF-py/v2/bin/cNMF-master/cnmf.py", line 629, in consensus
cl_dist = squareform(topics_dist[cl_filter, :][:, cl_filter])
File "/share/nas1/Data/software/mini_conda/Miniconda3/envs/cNMF-python/lib/python3.7/site-packages/scipy/spatial/distance.py", line 2184, in squareform
is_valid_dm(X, throw=True, name='X')
File "/share/nas1/Data/software/mini_conda/Miniconda3/envs/cNMF-python/lib/python3.7/site-packages/scipy/spatial/distance.py", line 2260, in is_valid_dm
'symmetric.') % name)
ValueError: Distance matrix 'X' must be symmetric.
Traceback (most recent call last):
File "/share/nas1/Data/PipeLine/Other/NMF/NMF-py/v2/bin/cNMF-master/cnmf.py", line 908, in
cnmf_obj.consensus(k, args.local_density_threshold, args.local_neighborhood_size, args.show_clustering)
File "/share/nas1/Data/PipeLine/Other/NMF/NMF-py/v2/bin/cNMF-master/cnmf.py", line 629, in consensus
cl_dist = squareform(topics_dist[cl_filter, :][:, cl_filter])
File "/share/nas1/Data/software/mini_conda/Miniconda3/envs/cNMF-python/lib/python3.7/site-packages/scipy/spatial/distance.py", line 2184, in squareform
is_valid_dm(X, throw=True, name='X')
File "/share/nas1/Data/software/mini_conda/Miniconda3/envs/cNMF-python/lib/python3.7/site-packages/scipy/spatial/distance.py", line 2260, in is_valid_dm
'symmetric.') % name)
ValueError: Distance matrix 'X' must be symmetric

How can i fix this issue?
Best
hanhuihong

cNMF factorization resume

Hello @dylkot,

Thank you for making the amazing tool!

I am running cNMF on the HPC cluster. The job got cancelled and it was in the middle of a factorization. Is it possible to resume it or do I need to factorize all the components again from the start?

For example, in the cnmf_tmp folder, I see the factorizations for components 10-29. The original run was set to factorize my data for 10-35 components. Would it be fine if I started a cnmf run for 30-35 components to finish the process? Would it affect the downstream steps (K Selection Plot, combine, etc)?

when running consensus step, i got a Aborted (core dumped) for qt platform

error log:

This application failed to start because it could not find or load the Qt platform plugin "xcb"

Reinstalling the application may fix this problem.
Aborted (core dumped)

slove by:

conda install qt
conda install pyqt

and then edit the cnmf.py, in the header add follow line

os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = '/softs/conda/anaconda/plugins/platforms'
os.environ['FONTCONFIG_FILE'] = "/etc/fonts/fonts.conf"
os.environ['FONTCONFIG_PATH'] = "/etc/fonts/"

Step 5 Consensus command - TypeError: H should have the same dtype as X

Hi authors!

First off, thanks for creating such a useful program. I am eager to use it!

I ran into an error that seems related to the one mentioned in #9 (comment) and #8 (comment). In those issues, the error TypeError: H should have the same dtype as X arose when trying to create the k selection plot using the command python ./cnmf.py k_selection_plot. I added norm_counts.X = norm_counts.X.astype(np.float64) to cnmf.py line 503, as suggested by Pentayouth here, and this resolved the issue.

I have now run into the same problem with the command in step 5 of the README.

Command: python3 ./cnmf.py consensus --output-dir ./example_data --name example_cNMF --components 10 --local-density-threshold 0.01 --show-clustering

Traceback:

(base) vpn1722517526:example_PBMC jasmineliannemah$ python3 ./cnmf.py consensus --output-dir ./example_data --name example_cNMF --components 10 --local-density-threshold 0.01 --show-clustering
/Library/Python/3.7/site-packages/sklearn/utils/deprecation.py:143: FutureWarning: The sklearn.decomposition.nmf module is deprecated in version 0.22 and will be removed in version 0.24. The corresponding classes / functions should instead be imported from sklearn.decomposition. Anything that cannot be imported from sklearn.decomposition is now part of the private API.
warnings.warn(message, FutureWarning)
Traceback (most recent call last):
File "./cnmf.py", line 882, in
cnmf_obj.consensus(k, args.local_density_threshold, args.local_neighborhood_size, args.show_clustering)
File "./cnmf.py", line 558, in consensus
nmf_kwargs=refit_nmf_kwargs)
File "./cnmf.py", line 410, in _nmf
(usages, spectra, niter) = non_negative_factorization(X, **nmf_kwargs)
File "/Library/Python/3.7/site-packages/sklearn/utils/validation.py", line 72, in inner_f
return f(**kwargs)
File "/Library/Python/3.7/site-packages/sklearn/decomposition/_nmf.py", line 1044, in non_negative_factorization
"{}.".format(H.dtype))
TypeError: H should have the same dtype as X. Got H.dtype = float64.

Any idea how to resolve this issue? Thanks!

Parallelisation over multiple nodes

Hi,

I was wondering whether it is supported to use GNU parallel over multiple nodes on a UGER rather than assuming that each node only has a single core?

ValueError: shapes (7,78103) and (56262,999) not aligned: 78103 (dim 1) != 56262 (dim 0)

Hello,

I was following one of the tutorials for running cNMF on my own data set. My dataset has 78103 observations and on NAs or inf counts. To see if I could get the pipeline going I set the parameters to pretty small values to start.

I successfully managed to prepare the cnmf object with

cnmf_obj.prepare(counts_fn = 'Data/scvi_counts_adata.h5ad', components = np.arange(7,9), n_iter = 5, seed = 14, num_highvar_genes = 1000
and factorize with
cnmf_obj.factorize(worker_i = 0, total_workers = 1)

However when running
cnmf_obj.consensus(k=7, density_threshold=2.00, show_clustering=True, close_clustergram_fig=False)

I got the following error

ValueError                                Traceback (most recent call last)
/var/folders/06/kt3wlz1x285bzg98r_byq_480000gn/T/ipykernel_4464/2526977280.py in <module>
----> 1 cnmf_obj.consensus(k=selected_K, density_threshold=density_threshold, show_clustering=True, close_clustergram_fig=False)

~/opt/anaconda3/envs/cnmf_env/lib/python3.7/site-packages/cnmf/cnmf.py in consensus(self, k, density_threshold, local_neighborhood_size, show_clustering, skip_density_and_return_after_stats, close_clustergram_fig)
    705             norm_tpm = norm_tpm.astype(np.float64)
    706 
--> 707         usage_coef = fast_ols_all_cols(rf_usages.values, norm_tpm)
    708         usage_coef = pd.DataFrame(usage_coef, index=rf_usages.columns, columns=tpm.var.index)
    709 

~/opt/anaconda3/envs/cnmf_env/lib/python3.7/site-packages/cnmf/cnmf.py in fast_ols_all_cols(X, Y)
     51 def fast_ols_all_cols(X, Y):
     52     pinv = np.linalg.pinv(X)
---> 53     beta = np.dot(pinv, Y)
     54     return(beta)
     55 

<__array_function__ internals> in dot(*args, **kwargs)

ValueError: shapes (7,78103) and (56262,999) not aligned: 78103 (dim 1) != 56262 (dim 0)

I'm not sure where this 56262 number is coming from... Any ideas here? I tried to trace back to this norm_tpm object but I got lost and wanted to see if anyone had any thoughts here.

Thanks!

Batch Effect cnmf

Hello,

Do you recommend I apply batch effect correction on the data before inputting it into cnmf?

invalid value encountered in true_divide error

Hi,
While running cnmf_obj.consensus on my data, I encountered this error:
cnmf.py:701: RuntimeWarning: invalid value encountered in true_divide
norm_tpm = (np.array(tpm.X.todense()) - tpm_stats['__mean'].values) / tpm_stats['__std'].values
What causes this problem?
thanks!

how to perform cNMF on non-typical gene expression data

HI
I want to do NMF on my single cell data. However, my data is not traditional count data, for each transcript isoform in each cell, the value is generated by: count_of_isoform1 / count_of_all_isoform_of_the_same_gene.
So it's indeed a ration between 0-1. I want to know how to use such data in cNMF. I've tried feed the "ratio matrix" to --tpm, but it throws out en error:
Traceback (most recent call last): File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 783, in <module> if args.counts.endswith('.h5ad'): AttributeError: 'NoneType' object has no attribute 'endswith' Traceback (most recent call last): File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 853, in <module> cnmf_obj.run_nmf(worker_i=args.worker_index, total_workers=args.total_workers) File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 449, in run_nmf run_params = load_df_from_npz(self.paths['nmf_replicate_parameters']) File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 33, in load_df_from_npz with np.load(filename, allow_pickle=True) as f: File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/miniconda3/envs/cnmf_env/lib/python3.6/site-packages/numpy/lib/npyio.py", line 428, in load fid = open(os_fspath(file), "rb") FileNotFoundError: [Errno 2] No such file or directory: '/home/ruibinxi_pkuhpc/lustre1/shiyang/APA/ICC_cnmf_out_pas_new/I02T/cnmf_tmp/I02T.nmf_params.df.npz' Traceback (most recent call last): File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 856, in <module> run_params = load_df_from_npz(cnmf_obj.paths['nmf_replicate_parameters']) File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 33, in load_df_from_npz with np.load(filename, allow_pickle=True) as f: File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/miniconda3/envs/cnmf_env/lib/python3.6/site-packages/numpy/lib/npyio.py", line 428, in load fid = open(os_fspath(file), "rb") FileNotFoundError: [Errno 2] No such file or directory: '/home/ruibinxi_pkuhpc/lustre1/shiyang/APA/ICC_cnmf_out_pas_new/I02T/cnmf_tmp/I02T.nmf_params.df.npz' Traceback (most recent call last): File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 883, in <module> cnmf_obj.k_selection_plot() File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 692, in k_selection_plot run_params = load_df_from_npz(self.paths['nmf_replicate_parameters']) File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 33, in load_df_from_npz with np.load(filename, allow_pickle=True) as f: File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/miniconda3/envs/cnmf_env/lib/python3.6/site-packages/numpy/lib/npyio.py", line 428, in load fid = open(os_fspath(file), "rb") FileNotFoundError: [Errno 2] No such file or directory: '/home/ruibinxi_pkuhpc/lustre1/shiyang/APA/ICC_cnmf_out_pas_new/I02T/cnmf_tmp/I02T.nmf_params.df.npz' Traceback (most recent call last): File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 869, in <module> run_params = load_df_from_npz(cnmf_obj.paths['nmf_replicate_parameters']) File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/project_2019_11/NMF2/cnmf.py", line 33, in load_df_from_npz with np.load(filename, allow_pickle=True) as f: File "/home/ruibinxi_pkuhpc/lustre1/huangsiyuan/miniconda3/envs/cnmf_env/lib/python3.6/site-packages/numpy/lib/npyio.py", line 428, in load fid = open(os_fspath(file), "rb") FileNotFoundError: [Errno 2] No such file or directory: '/home/ruibinxi_pkuhpc/lustre1/shiyang/APA/ICC_cnmf_out_pas_new/I02T/cnmf_tmp/I02T.nmf_params.df.npz' DeprecationWarning: 'source deactivate' is deprecated. Use 'conda deactivate'.
I wonder if it's because I don't give the -c (count) file to cNMF.
Could you please give some advice? Thanks

Should we do scale before run cnmf?

hi, here some questions.
Compared with the code in your paper ,the PBMC tutorial did not do any normalize or scale before run cNMF. All your examples in your paper did TPM-scaled.
Should we do scale before run cnmf? which method is better?

Extension suggestions: regularization for NMF and consensus clustering

Hi Dylan,

Thanks for sharing this useful software!

I had a few quick questions:

  1. Have you ever experimented with the regularization="components" option in the call to non_negative_factorization to enforce more sparsity in the gene loadings? If so, do you have any suggestions for alpha or l1_ratio?

  2. Have you ever considered trying multiple values of k in the consensus clustering? For example, even if you have a 20-factor model, if you only get some factors on some fraction of restarts, you could imagine that the clustering of the l2_spectra might actually be better with some value of k that is slightly smaller or higher than k (e.g. you could only have 19 stable factors that you would want to keep or alternatively maybe there are actually 21 factors that reliably show up). If I look at something like the silhouette_score I sometimes get maximum values with values that are slightly higher than k but I haven't tried to use those alternative clusterings of the l2_spectra for any downstream applications.

  3. Is there a reason why the loadings are rescaled by their norms in

    cNMF/cnmf.py

    Line 509 in 5418be4

    l2_spectra = (merged_spectra.T/np.sqrt((merged_spectra**2).sum(axis=1))).T
    but then are scaled by their sums in median_spectra?

    cNMF/cnmf.py

    Line 542 in 5418be4

    median_spectra = (median_spectra.T/median_spectra.sum(1)).T

TPM Flag

Hi, how do you run the TPM flag? I tried running it the following ways and got an error (would you mind specifying this in the tutorial?):


cnmf prepare --output-dir ./ --name cNMF_initial -c exp.tsv -k 10 15 20 25 30 35 --n-iter 500 --seed 14 --numgenes 2000 --tpm TPM --total-workers 40

cnmf prepare --output-dir ./ --name cNMF_initial -c exp.tsv -k 10 15 20 25 30 35 --n-iter 500 --seed 14 --numgenes 2000 --tpm TPM --total-workers 40

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.