
scvae's Introduction

scVAE: Single-cell variational auto-encoders

scVAE is a command-line tool for modelling single-cell transcript counts using variational auto-encoders.

Install scVAE using pip for Python 3.6 and 3.7:

$ python3 -m pip install scvae

scVAE can then be used to train a variational auto-encoder on a data set of single-cell transcript counts:

$ scvae train transcript_counts.tsv

And the resulting model can be evaluated on the same data set:

$ scvae evaluate transcript_counts.tsv

For more details, see the documentation, which includes a user guide and a short tutorial.
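For instance, a transcript-count TSV like the one passed to `scvae train` above can be produced from any cells-by-genes count matrix. A minimal NumPy sketch (the file name, gene names, and orientation here are illustrative; see the user guide for the exact layout scVAE expects):

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 200))  # 50 cells x 200 genes, synthetic

# Write a tab-separated count matrix with a header row of gene names.
with open("transcript_counts.tsv", "w") as f:
    f.write("\t".join(f"gene_{j}" for j in range(counts.shape[1])) + "\n")
    for row in counts:
        f.write("\t".join(str(v) for v in row) + "\n")
```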

scvae's People

Contributors

chgroenbech, maximillian91


scvae's Issues

urllib.error.HTTPError: HTTP Error 403: Forbidden

Hi! When I run the example
scvae train 10x-PBMC-PP --split-data-set -m GMVAE -r negative_binomial -l 100 -H 100 100 -w 200 -e 500
I get the error urllib.error.HTTPError: HTTP Error 403: Forbidden.
Could anyone give some suggestions? Thanks!

Error while performing evaluation

I used the following command for training the model:

$ scvae train "/content/drive/My Drive/Datasets/SRA779509_SRS3805247.json" --split-data-set --splitting-method random --splitting-fraction 0.8 -l 10 -H 200 100 -w 5 -e 5 -M "/content/drive/My Drive/Datasets/model" -r zero_inflated_negative_binomial -q unit_variance_gaussian

This created a folder in the specified directory that contained all the log output and checkpoint files and tfevents files for the training and validation sets.

However, upon running the following command to evaluate the model,

$ scvae evaluate "/content/drive/My Drive/Datasets/SRA779509_SRS3805247.json" --analyses-directory "/content/drive/My Drive/Datasets/analyses" --prediction-method kmeans --decomposition-methods tSNE --split-data-set -l 10 -H 200 100 -w 5 --splitting-fraction 0.8
I got this exception (after the data set was split):

...
Model
═════

tcmalloc: large alloc 1177214976 bytes == 0xb302a000 @  0x7f184e3f11e7 0x7f184bed75e1 0x7f184bf3bc78 0x7f184bf3edb8 0x7f184bf3f395 0x7f184bfd665d 0x50a635 0x50bfb4 0x507d64 0x509a90 0x50a48d 0x50bfb4 0x507d64 0x509a90 0x50a48d 0x50cd96 0x507d64 0x509a90 0x50a48d 0x50cd96 0x507d64 0x509a90 0x50a48d 0x50cd96 0x509758 0x50a48d 0x50bfb4 0x507d64 0x509a90 0x50a48d 0x50cd96
/usr/local/lib/python3.6/dist-packages/statsmodels/tools/_testing.py:19: FutureWarning: pandas.util.testing is deprecated. Use the functions in the public API at pandas.testing instead.
  import pandas.util.testing as tm
Traceback (most recent call last):
  File "/usr/local/bin/scvae", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/scvae/cli.py", line 1225, in main
    status = arguments.func(**vars(arguments))
  File "/usr/local/lib/python3.6/dist-packages/scvae/cli.py", line 416, in evaluate
    raise Exception("Cannot analyse model when it has not been trained.")
Exception: Cannot analyse model when it has not been trained.

I would like to know how we ensure that the model trained by the first command is the one used for evaluation. Do we have to specify the directory where the model was saved when evaluating?

Thanks!

OSError: [Errno 22] during scvae evaluate

Hello,

When I used scvae with loom data, everything was fine during the training step:
'scvae train "xxx.loom" --split-data-set'

However, the following error appeared when I executed 'scvae evaluate "xxx.loom" --split-data-set':

#########################
Reconstructions
╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌
Plotting profile comparisons.
Traceback (most recent call last):
  File "c:\programdata\miniconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\programdata\miniconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Miniconda3\Scripts\scvae.exe\__main__.py", line 7, in <module>
  File "c:\programdata\miniconda3\lib\site-packages\scvae\cli.py", line 1222, in main
    status = arguments.func(**vars(arguments))
  File "c:\programdata\miniconda3\lib\site-packages\scvae\cli.py", line 555, in evaluate
    analyses_directory=analyses_directory
  File "c:\programdata\miniconda3\lib\site-packages\scvae\analyses\analyses.py", line 1155, in analyse_results
    directory=profile_comparisons_directory
  File "c:\programdata\miniconda3\lib\site-packages\scvae\analyses\figures\saving.py", line 92, in save_figure
    figure.savefig(figure_path)
  File "c:\programdata\miniconda3\lib\site-packages\matplotlib\figure.py", line 2180, in savefig
    self.canvas.print_figure(fname, **kwargs)
  File "c:\programdata\miniconda3\lib\site-packages\matplotlib\backend_bases.py", line 2089, in print_figure
    **kwargs)
  File "c:\programdata\miniconda3\lib\site-packages\matplotlib\backends\backend_agg.py", line 530, in print_png
    cbook.open_file_cm(filename_or_obj, "wb") as fh:
  File "c:\programdata\miniconda3\lib\contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "c:\programdata\miniconda3\lib\site-packages\matplotlib\cbook\__init__.py", line 447, in open_file_cm
    fh, opened = to_filehandle(path_or_file, mode, True, encoding)
  File "c:\programdata\miniconda3\lib\site-packages\matplotlib\cbook\__init__.py", line 432, in to_filehandle
    fh = open(fname, flag, encoding=encoding)
OSError: [Errno 22] Invalid argument: 'analyses\gas8000\split-random_0.9\no_preprocessing\VAE\gaussian\poisson-l_2-h_100-mc_1-iw_1-kl-bn\e_200-mc_1-iw_1\profile_comparisons\profile_comparison-"cell_113865"-linear-sorted.png'
#########################

Is there a known cause for this problem?

Thanks
Po
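A likely cause: Windows does not allow double quotes in file names, and the failing path ends in profile_comparison-"cell_113865"-linear-sorted.png. A minimal sketch of stripping such characters before saving (sanitise_filename is a hypothetical helper, not part of scVAE):

```python
import re

# Windows forbids < > : " / | ? * in file-name components (this pattern
# deliberately leaves backslashes alone, since they separate path parts).
_INVALID_WINDOWS_CHARS = r'[<>:"/|?*]'

def sanitise_filename(name):
    """Replace characters that are invalid in Windows file names."""
    return re.sub(_INVALID_WINDOWS_CHARS, "_", name)

print(sanitise_filename('profile_comparison-"cell_113865"-linear-sorted.png'))
# profile_comparison-_cell_113865_-linear-sorted.png
```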

ZeroInflated distribution

In the implementation of the ZeroInflated distribution (distributions/zero_inflated.py):

 if not isinstance(dist, distribution.Distribution):
     raise TypeError(
         "dist must be a Distribution instance"
         " but saw: %s" % dist)

Since all distribution classes have been switched to the tensorflow_probability Distribution, this TypeError is raised inappropriately.

Secondly, an exception is raised when I try sampling from the ZeroInflated distribution:

Traceback (most recent call last):
  File "./main.py", line 1026, in <module>
    main(**vars(arguments))
  File "./main.py", line 269, in main
    results_directory = results_directory
  File "/data1/libs/scVAE/models/variational_autoencoder.py", line 203, in __init__
    self.model_graph()
  File "/data1/libs/scVAE/models/variational_autoencoder.py", line 680, in model_graph
    self.p_x_given_z.sample()
  File "/data1/libs/miniconda3/envs/ai/lib/python3.5/site-packages/tensorflow/python/ops/distributions/distribution.py", line 766, in sample
    return self._call_sample_n(sample_shape, seed, name)
  File "/data1/libs/miniconda3/envs/ai/lib/python3.5/site-packages/tensorflow/python/ops/distributions/distribution.py", line 745, in _call_sample_n
    samples = self._sample_n(n, seed, **kwargs)
  File "/data1/libs/scVAE/distributions/zero_inflated.py", line 206, in _sample_n
    pi_samples = self.pi.sample(n, seed=seed)
AttributeError: 'Tensor' object has no attribute 'sample'

This is the description of my ZeroInflated distribution parameters:

p_x_given_z: tfp.distributions.ZeroInflated("X_TILDE/ZeroInflated_1/", batch_shape=(?, 32738), event_shape=(), dtype=float32)
   allow_nan_stats : True
   dist : tfp.distributions.NegativeBinomial("X_TILDE/NegativeBinomial/", batch_shape=(?, 32738), event_shape=(), dtype=float32)
   name : ZeroInflated
   pi : Tensor("X_TILDE/PI/clip_by_value:0", shape=(?, 32738), dtype=float32)
   validate_args : False

The command I used to run:

./main.py -i 10x-PBMC-PP -m VAE -r zero_inflated_negative_binomial -l 100 -H 101 102 -e 500 --decomposition-methods pca tsne

Could you suggest a fix for this issue?
Might there be some inconsistency in my environment?

Best Regards,
TrungNT
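For reference, sampling from a zero-inflated distribution mixes a Bernoulli dropout mask (probability pi of an excess zero) with samples from the base count distribution; the error above arises because pi is a plain tensor rather than something with a sample() method. A minimal NumPy sketch of the sampling idea (not scVAE's TensorFlow implementation; parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_zero_inflated_nb(pi, n_successes, p, size):
    """Sample a zero-inflated negative binomial: with probability pi
    emit an excess zero, otherwise a draw from the base NB."""
    nb_samples = rng.negative_binomial(n_successes, p, size=size)
    excess_zero = rng.random(size) < pi  # Bernoulli(pi) mask, no .sample() needed
    return np.where(excess_zero, 0, nb_samples)

samples = sample_zero_inflated_nb(pi=0.7, n_successes=5, p=0.5, size=10_000)
print((samples == 0).mean())  # sparsity well above the plain NB's zero rate
```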

Seurat objects

Is it possible to cluster the cells in a Seurat data object (count data) with scVAE?
Is it possible to apply it to scaled data, for example, if cell-type effects have already been regressed out of the gene expressions?

getting actual points in tSNE reconstructions

Hi,
I am using scVAE, but I really need the actual points behind the reconstruction plots, as well as the clustering result. I am not sure whether I missed them or they are not provided. Thanks in advance!

10x Issue

scVAE claims to be able to work on 10x Genomics data. Looking at your code, it seems to expect either HDF5 or tar.gz files. It is unclear which format to use to train models on 10x data from the command line.

I tried scvae train [sample].tar.gz --format 10x, which gave error

ValueError: Data format already specified in metadata and cannot be changed (is tar.gz; wanted 10x).

Could you give an example of how to apply scVAE to 10x cell ranger output?
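For reference, an extracted Cell Ranger matrix directory holds a Matrix Market sparse matrix plus gene and barcode lists; a hedged sketch of loading that layout manually (the three file names are the standard Cell Ranger v2 convention, assumed here, and this is not scVAE's own loader):

```python
import os
import scipy.io

def load_cellranger_matrix(directory):
    """Load an extracted Cell Ranger matrix directory:
    matrix.mtx (genes x cells), genes.tsv, barcodes.tsv."""
    matrix = scipy.io.mmread(os.path.join(directory, "matrix.mtx")).tocsr()
    with open(os.path.join(directory, "genes.tsv")) as f:
        genes = [line.split("\t")[0] for line in f]
    with open(os.path.join(directory, "barcodes.tsv")) as f:
        barcodes = [line.strip() for line in f]
    return matrix, genes, barcodes
```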

Getting data from latent z-space

Hello,

I have run "scvae train" and "scvae evaluate" on a dataset. Evaluation returned many .png images, including the plot of the latent z-space, but I am interested in getting the actual points in the latent z-space (i.e. in a tsv/csv file). How can I get this data?

Thank you
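As a workaround sketch: once the latent values are available as an array (how you obtain them from scVAE is not shown here; `z` and `cell_names` below are hypothetical stand-ins), writing them to a TSV is straightforward:

```python
import numpy as np

# Hypothetical inputs: z is an (n_cells, latent_size) array of latent
# values and cell_names their identifiers; neither name comes from scVAE.
z = np.random.default_rng(0).normal(size=(4, 2))
cell_names = ["cell_1", "cell_2", "cell_3", "cell_4"]

with open("latent_values.tsv", "w") as f:
    f.write("cell\t" + "\t".join(f"z_{i}" for i in range(z.shape[1])) + "\n")
    for name, row in zip(cell_names, z):
        f.write(name + "\t" + "\t".join(f"{v:.6f}" for v in row) + "\n")
```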

ValueError: Data format already specified in metadata and cannot be changed (is `gz`; wanted `matrix_fbe`).

Hello, and thanks for scVAE!

I have encountered a problem.

I want to train on TCGA/Xena transcript data, downloaded from the Xena page:

alias proxychains='/data/wenyuhao/software/proxychains/bin/proxychains4 -f /data/wenyuhao/software/proxychains/proxychains.conf'
proxychains wget https://toil-xena-hub.s3.us-east-1.amazonaws.com/download/tcga_expected_count.gz

tcga_expected_count.gz is a gzip-compressed TSV file with transcripts as rows and samples as columns, so I use --format matrix_fbe:

$ zcat tcga_expected_count.gz | head -10 | csvcut -t -c 1-10 | csvformat -T
sample  TCGA-S9-A7J2-01 TCGA-G3-A3CH-11 TCGA-EK-A2RE-01 TCGA-D5-5538-01 TCGA-IQ-A61O-01 TCGA-AB-2863-03 TCGA-G9-6499-11 TCGA-C8-A1HL-01 TCGA-EW-A2FS-01
ENST00000548312.5       1.8074  1.4906  1.6781  1.8914  3.2388  3.2126  2.7929  0.0000  0.0000
ENST00000483781.5       1.2449  0.0000  1.5850  0.0000  0.0000  2.3074  0.0000  0.0000  0.0000
ENST00000535093.1       0.0000  0.0000  0.0000  0.0000  0.0000  0.0     0.0000  0.0000  0.0000
ENST00000338863.11      8.7317  8.0738  11.9031 10.7103 10.3966 0.0     11.8987 9.0007  9.6170
ENST00000570899.1       0.0000  0.0000  1.7698  0.0000  2.5236  0.0     1.7268  2.0179  0.0000
ENST00000556831.5       4.2979  1.7485  2.8953  1.1763  1.1827  4.0009  0.0000  4.8006  3.8490
ENST00000625998.2       9.4804  0.0000  7.6738  8.2678  7.7274  7.4629  11.2444 0.6041  10.0544
ENST00000583693.5       8.9606  5.0670  5.7057  5.6453  5.3402  6.7979  5.7063  4.7777  7.3460
ENST00000580812.1       3.7398  0.0000  4.9026  3.9345  2.2510  2.1667  3.4060  0.0000  2.8032


But it raises an exception:

$ scvae train tcga_expected_count.gz -H 1000 500 -l 256 -e 500 -w 200 --learning-rate 0.001 --split-data-set --splitting-fraction 0.9 --format matrix_fbe
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Data
════

Traceback (most recent call last):
  File "/data/wenyuhao/anaconda3/envs/scvae/bin/scvae", line 8, in <module>
    sys.exit(main())
  File "/data/wenyuhao/anaconda3/envs/scvae/lib/python3.7/site-packages/scvae/cli.py", line 1230, in main
    status = arguments.func(**vars(arguments))
  File "/data/wenyuhao/anaconda3/envs/scvae/lib/python3.7/site-packages/scvae/cli.py", line 161, in train
    noisy_preprocessing_methods=noisy_preprocessing_methods
  File "/data/wenyuhao/anaconda3/envs/scvae/lib/python3.7/site-packages/scvae/data/data_set.py", line 187, in __init__
    data_format
ValueError: Data format already specified in metadata and cannot be changed (is `gz`; wanted `matrix_fbe`).

When I run it without --format, it also raises an exception:

$ scvae train tcga_expected_count.gz -H 1000 500 -l 256 -e 500 -w 200 --learning-rate 0.001 --split-data-set --splitting-fraction 0.9
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
Data
════

Data set:
    title: tcga_expected_count
    feature selection: none
    example filter: none
    processing methods: none

Splitting:
    method: random
    fraction: 90.0 %

Copying values for full set.
Data set copied (2.3 s).

Loading original data set.
Traceback (most recent call last):
  File "/data/wenyuhao/anaconda3/envs/scvae/bin/scvae", line 8, in <module>
    sys.exit(main())
  File "/data/wenyuhao/anaconda3/envs/scvae/lib/python3.7/site-packages/scvae/cli.py", line 1230, in main
    status = arguments.func(**vars(arguments))
  File "/data/wenyuhao/anaconda3/envs/scvae/lib/python3.7/site-packages/scvae/cli.py", line 166, in train
    method=splitting_method, fraction=splitting_fraction)
  File "/data/wenyuhao/anaconda3/envs/scvae/lib/python3.7/site-packages/scvae/data/data_set.py", line 1101, in split
    self.load()
  File "/data/wenyuhao/anaconda3/envs/scvae/lib/python3.7/site-packages/scvae/data/data_set.py", line 770, in load
    data_format=self.data_format
  File "/data/wenyuhao/anaconda3/envs/scvae/lib/python3.7/site-packages/scvae/data/loading.py", line 111, in load_original_data_set
    data_format))
ValueError: Data format `gz` not recognised.

Python/scvae versions:

$ scvae --version
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.
WARNING:root:Limited tf.compat.v2.summary API due to missing TensorBoard installation.

$ python --version
Python 3.7.12

Looking forward to your reply. Thank you!
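Besides the format question, note that Xena's tcga_expected_count values are log2(expected_count + 1)-transformed floats (visible in the sample rows above), while scVAE models raw counts. A hedged sketch of undoing that transform before writing a plain count matrix (the rounding step is this sketch's assumption, since expected counts are not integers):

```python
import numpy as np

def log2p1_to_counts(x):
    """Invert Xena's log2(expected_count + 1) transform and
    round to the nearest integer count."""
    return np.rint(np.exp2(x) - 1.0).astype(np.int64)

log_values = np.log2(np.array([0.0, 1.0, 7.0, 15631.0]) + 1.0)
print(log2p1_to_counts(log_values))  # recovers 0, 1, 7, 15631
```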

Training and validation loss becomes NaN

While training the model, the ELBO of the training data, the validation data, or both sometimes becomes NaN as training progresses, as shown below:

Epoch 99 (10.4 s):
    Evaluating model.
    Training set (3.42 s): ELBO: -1709.1, ENRE: -1692.1, KL: 16.986.
    Validation set (840 ms): ELBO: nan, ENRE: -inf, KL: 17.461.
    Saving model parameters.
    Model parameters saved (434 ms).

Step 4658 (213 ms): -1675.1.
Step 4662 (222 ms): -1843.4.
Step 4667 (216 ms): -1464.6.
Step 4672 (215 ms): -1736.6.
Step 4677 (219 ms): -1577.7.
Step 4681 (220 ms): -1729.2.
Step 4686 (221 ms): -1591.9.
Step 4691 (221 ms): -1784.
Step 4695 (217 ms): -1624.3.
Step 4700 (118 ms): -1770.2.

Epoch 100 (10.5 s):
    Evaluating model.
    Training set (3.4 s): ELBO: nan, ENRE: -inf, KL: 17.066.
    Validation set (858 ms): ELBO: nan, ENRE: -inf, KL: 17.572.
    Saving model parameters.

Due to this, the evaluation step also usually fails, as the ELBO of the test set becomes NaN.

Best-model evaluation
---------------------

Evaluating trained model for run 13 on original values.
    Test set (1.36 s):  ELBO: nan, ENRE: -inf, KL: 20.367.
/usr/local/lib/python3.6/dist-packages/scvae/data/data_set.py:537: RuntimeWarning: invalid value encountered in true_divide
  self.normalised_count_sum = self.count_sum / self.count_sum.max()
Data set        mean   std. dev.   dispersion   minimum    maximum   sparsity
original       0.6013     25.922     1117.5     0            15631        0.87985
reconstructed  inf        nan        nan        3.3546e-10   inf          0.92008

and the following error shows up:
ValueError: supplied range of [3.3546437849807376e-10, inf] is not finite

I am currently trying to increase/decrease the number of epochs to bypass this issue. Otherwise, is there any other way?

I would also like to know whether scVAE, when run on Google Colab, automatically makes use of the GPU, or whether we have to enable it manually. I get a Colab warning that the GPU is not utilised during training, and it recommends connecting to a local runtime.
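On the NaN issue itself: NaN ELBOs of this kind typically trace back to a log(0) or a division by zero somewhere in the likelihood, as in the normalised_count_sum warning above. Common mitigations are lowering the learning rate and clipping probabilities away from 0 and 1 before taking logarithms. A generic numerical-stability sketch of the clipping idea (not scVAE's internal code):

```python
import numpy as np

def safe_log(p, epsilon=1e-6):
    """Clip probabilities away from 0 and 1 before taking the log,
    so a collapsed parameter cannot produce -inf or NaN losses."""
    return np.log(np.clip(p, epsilon, 1.0 - epsilon))

p = np.array([0.0, 0.5, 1.0])
print(np.log(p))    # contains -inf, which poisons the ELBO
print(safe_log(p))  # all values finite
```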

Loading HDF5 datasets crash because data_dictionary["example names"] is set to None

Hi, first of all, great work! Looking forward to experimenting with the model.

I had an issue loading a data set in HDF5 format as described in the docs: it kept crashing when splitting the data because data_dictionary["example names"] was set to None. I think the cause is that a return list_of_names statement is missing in the following function:

def _find_list_of_names(list_name_guesses, kind):

Multiple h5 for processing and tissue integration

Hi,

Thank you for developing this tool. It looks really promising.
I am trying to analyse multiple datasets of the same tissue and I would like to process them at once. So I have a couple of questions:

  1. How do I feed .h5 files to scVAE?
  2. Would it be possible to perform dataset integration a la Seurat (RunMultiCCA) with scVAE?

Looking forward to hearing from you.

AttributeError: 'NoneType' object has no attribute 'points_to_pixels'

Hi,
I got the following error while running the first example:
main.py -i 10x-PBMC-PP -m GMVAE -r negative_binomial -l 100 -H 100 100 -e 500 --decomposition-methods pca tsne

'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'. Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
Traceback (most recent call last):
  File "main.py", line 1260, in <module>
    main(**vars(arguments))
  File "main.py", line 352, in main
    temporary_log_directory = temporary_log_directory
  File "/Users/bogumil/projects/leukemia/scVAE_analysis/scVAE/models/gaussian_mixture_variational_autoencoder.py", line 2113, in train
    results_directory = self.base_results_directory
  File "/Users/bogumil/projects/leukemia/scVAE_analysis/scVAE/analysis.py", line 702, in analyseIntermediateResults
    results_directory = results_directory)
  File "/Users/bogumil/projects/leukemia/scVAE_analysis/scVAE/analysis.py", line 5101, in saveFigure
    adjustFigureForLegend(figure)
  File "/Users/bogumil/projects/leukemia/scVAE_analysis/scVAE/analysis.py", line 5145, in adjustFigureForLegend
    legend_size = legend.get_window_extent()
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/matplotlib/legend.py", line 988, in get_window_extent
    return self._legend_box.get_window_extent(renderer=renderer)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/matplotlib/offsetbox.py", line 250, in get_window_extent
    w, h, xd, yd, offsets = self.get_extent_offsets(renderer)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/matplotlib/offsetbox.py", line 360, in get_extent_offsets
    dpicor = renderer.points_to_pixels(1.)
AttributeError: 'NoneType' object has no attribute 'points_to_pixels'
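This error usually means get_window_extent() was called before the figure had ever been drawn, so no renderer was attached to the legend. A hedged sketch of the usual workaround, drawing the canvas (here on the non-interactive Agg backend) before measuring the legend; this is generic matplotlib usage, not a patch to scVAE's analysis.py:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend with a real renderer
import matplotlib.pyplot as plt

figure, axis = plt.subplots()
axis.plot([0, 1], [0, 1], label="example")
legend = axis.legend()

figure.canvas.draw()  # attaches a renderer; without a draw, some
                      # matplotlib versions raise the NoneType error
extent = legend.get_window_extent()
print(extent.width > 0 and extent.height > 0)
```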

TypeError: __init__() got an unexpected keyword argument 'required'

Hello, when I run your command "scvae train 10x-PBMC-PP --split-data-set -m GMVAE -r negative_binomial -l 100 -H 100 100 -w 200 -e 500", the following error occurs:
Traceback (most recent call last):
  File "/data/chenliang/anaconda3/envs/tensorflow/bin/scvae", line 10, in <module>
    sys.exit(main())
  File "/data/chenliang/anaconda3/envs/tensorflow/lib/python3.6/site-packages/scvae/cli.py", line 697, in main
    help="commands", dest="command", required=True)
  File "/data/chenliang/anaconda3/envs/tensorflow/lib/python3.6/argparse.py", line 1707, in add_subparsers
    action = parsers_class(option_strings=[], **kwargs)
TypeError: __init__() got an unexpected keyword argument 'required'

It seems there is a bug in your cli.py file. Can you tell me what I should do next? Thanks!
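This TypeError is a Python-version issue: argparse's add_subparsers() only accepts required=True from Python 3.7 on, and the traceback shows Python 3.6. The 3.6-compatible pattern, sketched below, is to set the attribute after creating the subparsers:

```python
import argparse

parser = argparse.ArgumentParser(prog="scvae")
# Passing required=True to add_subparsers() raises TypeError on
# Python < 3.7, so set the attribute afterwards instead.
subparsers = parser.add_subparsers(help="commands", dest="command")
subparsers.required = True
subparsers.add_parser("train")

arguments = parser.parse_args(["train"])
print(arguments.command)  # train
```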

What does the loom format dataset look like specifically?

Hello. I recently read your paper, and it is good work. I want to use your framework to obtain candidate clusters for my data set, but that does not seem easy, since I do not know the specific form the input data should take. Usually, we just need to input the count data (cells by genes) and labels (if available), but your code seems to do some additional data preparation. Can you give me some specific advice if I want to use your framework? Or should I use your GitHub code directly instead of the scvae package? I am confused about the input data format. Thanks!
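For reference, a loom file is an HDF5 file with a fixed layout: the count matrix as a genes-by-cells dataset at /matrix, gene metadata under /row_attrs, and cell metadata (including any labels) under /col_attrs. A minimal sketch of building one directly with h5py (attribute names like Gene, CellID, and ClusterName follow the loom convention; whether scVAE reads ClusterName as labels is an assumption to check against its docs):

```python
import h5py
import numpy as np

counts = np.random.default_rng(0).poisson(1.0, size=(100, 20))  # genes x cells

with h5py.File("example.loom", "w") as f:
    f.create_dataset("matrix", data=counts)
    f.create_dataset("row_attrs/Gene",
                     data=np.array([f"gene_{i}" for i in range(100)], dtype="S"))
    f.create_dataset("col_attrs/CellID",
                     data=np.array([f"cell_{j}" for j in range(20)], dtype="S"))
    f.create_dataset("col_attrs/ClusterName",  # optional labels, if you have them
                     data=np.array(["type_a"] * 10 + ["type_b"] * 10, dtype="S"))
```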
