scverse / scvi-tools Goto Github PK

Deep probabilistic analysis of single-cell and spatial omics data

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

scrna-seq variational-bayes variational-autoencoder cite-seq single-cell-genomics single-cell-rna-seq deep-generative-model human-cell-atlas scverse deep-learning

scvi-tools's Introduction

scvi-tools (single-cell variational inference tools) is a package for probabilistic modeling and analysis of single-cell omics data, built on top of PyTorch and AnnData.

Analysis of single-cell omics data

scvi-tools is composed of models that perform many analysis tasks across single-cell, multi, and spatial omics data:

Dimensionality reduction
Data integration
Automated annotation
Factor analysis
Doublet detection
Spatial deconvolution
and more!

In the user guide, we provide an overview of each model. All model implementations have a high-level API that interacts with Scanpy and includes standard save/load functions, GPU acceleration, etc.

Rapid development of novel probabilistic models

scvi-tools contains the building blocks to develop and deploy novel probablistic models. These building blocks are powered by popular probabilistic and machine learning frameworks such as PyTorch Lightning and Pyro. For an overview of how the scvi-tools package is structured, you may refer to the codebase overview page.

We recommend checking out the skeleton repository as a starting point for developing and deploying new models with scvi-tools.

Basic installation

For conda,

conda install scvi-tools -c conda-forge

and for pip,

pip install scvi-tools

Please be sure to install a version of PyTorch that is compatible with your GPU (if applicable).

Resources

Tutorials, API reference, and installation guides are available in the documentation.
For discussion of usage, check out our forum.
Please use the issues to submit bug reports.
If you'd like to contribute, check out our contributing guide.
If you find a model useful for your research, please consider citing the corresponding publication.

Reference

If you use scvi-tools in your work, please cite

A Python library for probabilistic analysis of single-cell omics data

Adam Gayoso, Romain Lopez, Galen Xing, Pierre Boyeau, Valeh Valiollah Pour Amiri, Justin Hong, Katherine Wu, Michael Jayasuriya, Edouard Mehlman, Maxime Langevin, Yining Liu, Jules Samaran, Gabriel Misrachi, Achille Nazaret, Oscar Clivio, Chenling Xu, Tal Ashuach, Mariano Gabitto, Mohammad Lotfollahi, Valentine Svensson, Eduardo da Veiga Beltrame, Vitalii Kleshchevnikov, Carlos Talavera-López, Lior Pachter, Fabian J. Theis, Aaron Streets, Michael I. Jordan, Jeffrey Regier & Nir Yosef

Nature Biotechnology 2022 Feb 07. doi: 10.1038/s41587-021-01206-w.

along with the publicaton describing the model used.

You can cite the scverse publication as follows:

The scverse project provides a computational ecosystem for single-cell omics data analysis

Isaac Virshup, Danila Bredikhin, Lukas Heumos, Giovanni Palla, Gregor Sturm, Adam Gayoso, Ilia Kats, Mikaela Koutrouli, Scverse Community, Bonnie Berger, Dana Pe’er, Aviv Regev, Sarah A. Teichmann, Francesca Finotello, F. Alexander Wolf, Nir Yosef, Oliver Stegle & Fabian J. Theis

Nature Biotechnology 2023 Apr 10. doi: 10.1038/s41587-023-01733-8.

scvi-tools is part of the scverse project (website, governance) and is fiscally sponsored by NumFOCUS. Please consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a variety of other needs.

scvi-tools's People

Contributors

Stargazers

Watchers

Forkers

rintukutum benjamesbabala doerlbh mgood2 pedrofale liviust irenexzwen edouard360 jamestwebber sagoyal2 maxime-langevin mincheoly indianwolverine adamgayoso linnarsson-lab djacobowitz adrianbzg gwaybio maaskola dengwx11 zorrodong gyd1990 hejian41 japrin vals cartal johnreid jun-lizst rlebron88 chitrita hefv57 cgreene peiwenliu18 triyangle xiaoxiaoh16 jacobkimmel ssehztirom yufengchenk arpis xuanheiiis mengchengyao deto tianxiaonyu michelraulet davek44 drsamu nbahti ryuu90 mortonjt akmazad abigail-wood tuqiang2014 jvikkula jimmayxu zzygyx9119 pierreboyeau esadr alarivarmann raevskymichail maggishaggy allencellmodeling oscarclivio kyleyxw aa-m-sa maichmueller liuwujijay yunchen-yang eleozzr ignasigm dnbaker m0hammadl zihao12 kedark2 ewail mattjones315 fuchanghe chanjed eugot xiuyuma mtreppner hernet xihuyan iyhaoo x-bioinformatics aitical natnaelt copper-yu jarckry zouter cloudfora sysu-sph-antonio ethankinchan hy395 parkcc anazaret opnumten galenxing rlebron-bioinfo gvrocha shulp2211

scvi-tools's Issues

Loom for RETINA dataset

Create a new class called LoomDataset that uses the loompy library to load an arbitrary dataset in the loom format.

https://github.com/linnarsson-lab/loompy

As a test case --- and so we can easily use it --- convert the RETINA dataset to loom format and upload it to the YosefLab/scVI-data repository.

run_benchmarks uses the same VAE architecture for all datasets

It looks like run_benchmarks uses the same VAE architecture (e.g. default n_layers) for every datasets:
https://github.com/YosefLab/scVI/blob/cbd6f41bc6e40bbb0c1082ec9bf1d5b32bcd3f7d/scvi/benchmark.py#L38-L39

Instead, let's instantiate the VAE class with the number of layers etc from Table 2:
https://www.biorxiv.org/content/biorxiv/early/2018/03/30/292037.full.pdf

[CLOSED] clustering

Issue by jeff-regier
Wednesday Apr 04, 2018 at 03:57 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/14

To start, maybe figure out when VaDE does/doesn't work.

https://arxiv.org/pdf/1611.05148.pdf

[CLOSED] normalizing flows

Issue by jeff-regier
Monday Apr 02, 2018 at 21:11 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/11

Normalizing flows should let us better approximate the posterior distribution. Sylvester normalizing flows seems like the first thing to try.

https://arxiv.org/pdf/1803.05649.pdf

move load_dataset from init.py to run_benchmarks.py

Let's move the load_dataset function from datasets/__init__.py to run_benchmarks.py. Then let's only call that function in run_benchmarks.py. Everywhere else load_dataset is called (e.g. in unit tests), just instantiate the right dataset class directly. For example, rather than

gene_dataset = load_datasets('cortex')

gene_dataset = CortexDataset()

[CLOSED] fixed makefile test

Issue by romain-lopez
Monday Apr 02, 2018 at 22:39 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/12

romain-lopez included the following code: https://github.com/YosefLab/scVI-dev/pull/12/commits

Beta-Poisson generative model

Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/7

We can still use the reparameterization trick to differentiate through a beta random variable:
https://arxiv.org/pdf/1805.08498v1.pdf

[CLOSED] Working Benchmarks

Issue by Edouard360
Thursday Apr 05, 2018 at 05:39 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/16

Solved the model’s last issues.
Moved all the benchmark logic in a single file: run_benchmarks.py. Now the tests only use a toy dataset, and run benchmarks after training on 1 epoch. Also can be run from command line.
Added the contrib folder for python scripts for loading/preprocessing.

PS: Sorry I used about 5 Travis builds to correct minor errors... (In particular make lint didn't warn for python files in new folder contrib)

Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/16/commits

[CLOSED] new model of the dispersion parameter

Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/5

run groups of benchmarks with `run_benchmarks.py`

Currently run_benchmarks.py just runs one benchmark per call. It'd be nice if it could, optionally, run groups of benchmarks.

For example,

./run_benchmarks.py --annotation

might run all the annotation (semi-supervised) benchmarks, and then print out a nice table afterwards with all the annotation results for all the dataset.

And

./run_benchmarks.py --harmonization

might run all the unsupervised harmonization benchmarks.

And

./run_benchmarks.py --basic

might run all of the original seven benchmarks.

And

./run_benchmarks.py --all --epoch 1

would be useful for testing that everything works.

One way to implement this without a mess of if statements (c.f., load_dataset) might be to create a new Benchmark class. It would have as fields all the arguments to the train(), including one dataset instance (e.g. a CbmcDataset) and one model instance (e.g. a VAE instance). These Benchmark objects could then be organized into groups.

[CLOSED] added travis status to readme

Issue by jeff-regier
Thursday Apr 05, 2018 at 17:10 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/17

jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/17/commits

tox-based testing

https://tox.readthedocs.io/en/latest/examples.html

[CLOSED] enable flake8 syntax checking

Issue by jeff-regier
Friday Mar 30, 2018 at 20:55 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/9

jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/9/commits

save_path instead of unit_test

Anywhere we're currently using a unit_test flag, let's make save_path an argument instead. By default save_path = "data/" but for unit tests call, for example, run_benchmarks("cortex", save_path = "tests/data/").

And in retina.py, for example, rather than

    def __init__(self, unit_test=False):

use

    def __init__(self, save_path="data/"):

use GeneDataset.download for multiple urls

The method GeneDataset.download will download multiple urls now. So can we delete the code below and instead use GeneDataset.download for all the downloading?

https://github.com/YosefLab/scVI/blob/6f2a934d0fc62ef174e8a45eeddde3f5a3625e93/scvi/dataset/hemato.py#L32-L53

proper log likelihood calculation (and assess alternate variational distribution for library size)

v0.1.4

Let's push v0.1.14 to pip and conda after finishing #42 and #62 .

Unfortunately I don't think anyone has written a conda recipe yet for anndata, so we'll have to create one and push it along with ours.

[CLOSED] marginal log likelihood benchmark

Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/3

invertible likelihood

Issue by jeff-regier
Sunday Apr 01, 2018 at 21:50 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/10

Modeling p(x | z) with an invertible conditional distribution may give us a latent space that is more interpretable.

https://arxiv.org/pdf/1611.05209.pdf

call imputation in scVI-dev.ipynb notebook

Let's change the value returned by imputation to distance_list here, rather than it's median:
https://github.com/YosefLab/scVI/blob/1c01132164b3cbebe0de06057e1ed6652602ee89/scvi/metrics/imputation.py#L26

Then, after that minor change, we can revise the "Checking imputation accuracy" section of scvi-dev.ipynb to make it easier to follow, by removing all the code in that section and just calling imputation(vae, rate=0.3).

refactoring: VAEC and SVAEC as subclasses of VAE

There's a lot of duplicated code in the VAE, VAEC, and SVAEC classes. To get rid of it, lets make VAEC and SVAEC subclasses of VAE. In addition to avoiding duplicated code (which is hard to maintain), it's nice conceptually to express our semi-supervised models as specializations or our unsupervised scVI model.

Also, I wonder about whether SVAEC could inherit from VAEC, or whether even we need both SVAEC and VAEC.

training progress bar for notebooks

The main suggestion we've gotten is to display progress bars rather than a lot of text output.

https://github.com/tqdm/tqdm#ipython-jupyter-integration

refactoring: remove use_cuda from models

It's better if the models (i.e. VAE, VAEC, SVEAC) don't "know" whether they are using cuda or not. In the __init__ method for these models, we can just load everything on the CPU. Then, call the .cuda() method (e.g. vae.cuda() right after instantiating the model. It's kind of a detail, but the distinction between the model and what chip it runs on is a helpful one to maintain.

See

https://github.com/YosefLab/scVI/blob/704ce19d80e5728542464e488968c136a241eb6f/scvi/models/vae.py#L46-L48

https://github.com/YosefLab/scVI/blob/704ce19d80e5728542464e488968c136a241eb6f/scvi/models/vaec.py#L40-L43

https://github.com/YosefLab/scVI/blob/704ce19d80e5728542464e488968c136a241eb6f/scvi/models/svaec.py#L47-L50

remote loom datasets

Let's change run_benchmarks.py, as well as the run_benchmarks function, so they can take as an argument the URL of an arbitrary loom file. e.g.,

./run_benchmarks.py --url http://loom.linnarssonlab.org/dataset/cellmetadata/Previously%20Published/Cortex.loom

In this case, the output should be the same as running

./run_benchmarks.py --dataset cortex

We also want to test that it works with

http://loom.linnarssonlab.org/dataset/cellmetadata/osmFISH/osmFISH_SScortex_mouse_all_cells.loom

add ADT counts (proteins markers) to PbmcDataset and CbmcDataset

As described in #40

[CLOSED] conda

Issue by jeff-regier
Wednesday Apr 04, 2018 at 22:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/15

jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/15/commits

notebook of data loading examples

Create a notebook named examples/data_loading.ipynb that shows users all the ways they can load data into scVI. They can load

a loom file
a csv file
a 10x file
any of our "built in" datasets (list them, and give a little information about each of them)

[CLOSED] added .cpu() so run_benchmark won't crash when cuda is available

Issue by jeff-regier
Friday Apr 06, 2018 at 18:21 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/19

I was getting this error from run_benchmarks.py. Easily fixed.

Traceback (most recent call last):
  File "run_benchmarks.py", line 19, in <module>
    run_benchmarks(gene_dataset, n_epochs=args.epochs)
  File "/home/jeff/git/scVI-dev/scvi/benchmark.py", line 32, in run_benchmarks
    imputation_score = imputation(vae, gene_dataset)
  File "/home/jeff/git/scVI-dev/scvi/imputation.py", line 45, in imputation
    mae = imputation_error(px_rate.data.numpy(), X, i, j, ix)
RuntimeError: can't convert CUDA tensor to numpy (it doesn't support GPU arrays). Use .cpu() to move the tensor to host memory first.

jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/19/commits

additional datasets

@imyiningliu -- I think Maxime, Eddie, and Chenling are all working with datasets now that aren't yet "wrapped" by scVI. It'd be great if you could talk with all three of them to find out what datasets they're using, and add one class (that inherits from GeneExpressionDataset) for each dataset they plan to keep using. They may already have some code, which they haven't committed yet, that you can start with. We'd also want some documentation for each dataset, and a unit test if the new dataset requires a non-trivial amount of code.

These new datasets may have some characteristics that are different from the dataset we've seen so far.
-- Maxime's smFISH datasets have position information. I think he had some ideas for how to modify GeneExpressionDataset to include that information.
-- Eddie's pbmc donor data may already be accessible through the Dataset10x class.
-- Chenling's data from the simulator isn't available on at any public url yet, and maybe it's too early to make it public. But if not, we could share it through our scVI-dev repo.

Also, I'm interested in getting access through scVI to the dataset mentioned in this paper: https://www.nature.com/articles/nmeth.4636

use Dataset10x for brain_small

See #37

BrainSmallDataset should inherit from Dataset10X (like the way RetinaDataset inherits from LoomDataset)

refactoring: combine training method

There's a lot of code duplicated across our four methods for training models: train, train_semi_supervised_jointly, train_semi_supervised_alternately, and train_classifier. How about combining all these methods into one method, train, and passing it additional arguments (e.g., boolean flags) to control how it behaves?

h5 format from 10x_genomics

For large datasets, e.g. 1.3 million cells in mousr brain dataset

is there a support for h5 format?

Thanks

cortex imputation error is too high

In the paper we report that imputation error for cortex is around 2.2 but when I run the current version of scVI imputation error is much higher.

jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset cortex --epochs 200
File data/expression.bin already downloaded
Preprocessing Cortex data
training: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:29<00:00,  6.69it/s]
Total runtime for 201 epochs is: 29.987644910812378 seconds for a mean per epoch runtime of 0.14919226323787252 seconds.
Best ll was : 1288.2259785091362
Log-likelihood Train: 1261.4138140255177
Log-likelihood Test: 1289.2478976328903
Imputation score on test (MAE) is: 4.208409309387207

[CLOSED] Importance Weighted Autoencoder

Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/8

https://arxiv.org/pdf/1509.00519.pdf

[CLOSED] imputation benchmark

Issue by jeff-regier
Friday Mar 30, 2018 at 16:59 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/4

IPython notebook defines unused variable latent_dimension

A very minor issue in the example IPython notebook:

latent_dimension = 10

is defined but the following model definition does not include n_latent=latent_dimension, thus changing the value defined in the code has no effect. I think the same may be true of batch_size.

Thank you for this algorithm!

conda

Make scVI available through conda.

Should we use the bioconda channel rather than the default channel?

https://bioconda.github.io/

add AnnData support

Let's support reading from AnnData files.

scverse/anndata#30 (comment)

Negative Binomial parameterization

Hello,

Could you share the form you are using for the negative binomial distribution? I find it weird that there is not factorial in the likelihood function.

documentation

Documentation, generated with Sphinx, and posted at https://scvi.readthedocs.io/

We primarily want to document the functions that Chenling used in the demo notebook.

Examples:
http://scanpy.readthedocs.io/en/latest/

hemato -- File b'data/bBM.spring_and_pba.csv' does not exist

Looks like one of the files for hemato isn't getting downloaded automatically:

jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset hemato
Downloading file at data/bBM.raw_umifm_counts.csv.gz
Downloading file at data/data.zip
Preprocessing Hemato data
Traceback (most recent call last):
  File "./run_benchmarks.py", line 54, in <module>
    dataset = load_datasets(args.dataset, url=args.url)
  File "./run_benchmarks.py", line 26, in load_datasets
    gene_dataset = HematoDataset(save_path=save_path)
  File "/home/jeff/git/scVI/scvi/dataset/hemato.py", line 20, in __init__
    expression_data, gene_names = self.download_and_preprocess()
  File "/home/jeff/git/scVI/scvi/dataset/dataset.py", line 47, in download_and_preprocess
    return self.preprocess()
  File "/home/jeff/git/scVI/scvi/dataset/hemato.py", line 35, in preprocess
    spring_and_pba = pd.read_csv(self.save_path + self.spring_and_pba_filename)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'data/bBM.spring_and_pba.csv' does not exist

[CLOSED] ladder-VAE style training

Issue by jeff-regier
Friday Mar 30, 2018 at 17:00 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/issues/6

https://arxiv.org/pdf/1602.02282.pdf

More information

It would be great if you guys could share more insights in what's going on in the code and how to use it, maybe through a more complete README file?

imputation based on multiple samples

Our imputation results are based on just one sample from the variational distribution currently:

https://github.com/YosefLab/scVI/blob/704ce19d80e5728542464e488968c136a241eb6f/scvi/metrics/imputation.py#L24

That makes imputation not much of a metric for assessing changes to our model --- I'm surprised it works as well as it does. We should be sure to fix it before relying on imputation scores to guide any modeling decisions.

KeyError: 'cmbc'

Possibly stopped working after #54 ?

jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset cbmc
Traceback (most recent call last):
  File "./run_benchmarks.py", line 54, in <module>
    dataset = load_datasets(args.dataset, url=args.url)
  File "./run_benchmarks.py", line 22, in load_datasets
    gene_dataset = CiteSeqDataset('cmbc', save_path=save_path)
  File "/home/jeff/git/scVI/scvi/dataset/cite_seq.py", line 17, in __init__
    s = available_datasets[name]
KeyError: 'cmbc'

[CLOSED] Lgamma / Cuda / Dataset / Dependencies

Issue by Edouard360
Friday Apr 06, 2018 at 07:10 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/18

Remove approx for log_zinb_positive since we don’t need it anymore (we have fully working lgamma)+ add compiling and dependency.
Added cuda support (distinction) - thanks Maxime.
Remove sklearn dependency. Removed the train_test_split. Using SubsetRamdomSampler could be another option but is yet complicated.
Create dataset module and subclasses for each dataset, inheriting the same base class; with downloading/preprocessing
Move benchmark logic in scvi/benchmark.py

Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/18/commits