scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data

Home Page: http://scvi-tools.org/

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
scrna-seq variational-bayes variational-autoencoder cite-seq single-cell-genomics single-cell-rna-seq deep-generative-model human-cell-atlas scverse deep-learning

scvi-tools's Introduction

scvi-tools


scvi-tools (single-cell variational inference tools) is a package for probabilistic modeling and analysis of single-cell omics data, built on top of PyTorch and AnnData.

Analysis of single-cell omics data

scvi-tools is composed of models that perform many analysis tasks across single-cell, multimodal, and spatial omics data:

  • Dimensionality reduction
  • Data integration
  • Automated annotation
  • Factor analysis
  • Doublet detection
  • Spatial deconvolution
  • and more!

In the user guide, we provide an overview of each model. All model implementations have a high-level API that interacts with Scanpy and includes standard save/load functions, GPU acceleration, etc.

Rapid development of novel probabilistic models

scvi-tools contains the building blocks to develop and deploy novel probabilistic models. These building blocks are powered by popular probabilistic and machine learning frameworks such as PyTorch Lightning and Pyro. For an overview of how the scvi-tools package is structured, you may refer to the codebase overview page.

We recommend checking out the skeleton repository as a starting point for developing and deploying new models with scvi-tools.

Basic installation

For conda,

conda install scvi-tools -c conda-forge

and for pip,

pip install scvi-tools

Please be sure to install a version of PyTorch that is compatible with your GPU (if applicable).

Resources

  • Tutorials, API reference, and installation guides are available in the documentation.
  • For discussion of usage, check out our forum.
  • Please use the issues to submit bug reports.
  • If you'd like to contribute, check out our contributing guide.
  • If you find a model useful for your research, please consider citing the corresponding publication.

Reference

If you use scvi-tools in your work, please cite

A Python library for probabilistic analysis of single-cell omics data

Adam Gayoso, Romain Lopez, Galen Xing, Pierre Boyeau, Valeh Valiollah Pour Amiri, Justin Hong, Katherine Wu, Michael Jayasuriya, Edouard Mehlman, Maxime Langevin, Yining Liu, Jules Samaran, Gabriel Misrachi, Achille Nazaret, Oscar Clivio, Chenling Xu, Tal Ashuach, Mariano Gabitto, Mohammad Lotfollahi, Valentine Svensson, Eduardo da Veiga Beltrame, Vitalii Kleshchevnikov, Carlos Talavera-López, Lior Pachter, Fabian J. Theis, Aaron Streets, Michael I. Jordan, Jeffrey Regier & Nir Yosef

Nature Biotechnology 2022 Feb 07. doi: 10.1038/s41587-021-01206-w.

along with the publication describing the specific model used.

You can cite the scverse publication as follows:

The scverse project provides a computational ecosystem for single-cell omics data analysis

Isaac Virshup, Danila Bredikhin, Lukas Heumos, Giovanni Palla, Gregor Sturm, Adam Gayoso, Ilia Kats, Mikaela Koutrouli, Scverse Community, Bonnie Berger, Dana Pe’er, Aviv Regev, Sarah A. Teichmann, Francesca Finotello, F. Alexander Wolf, Nir Yosef, Oliver Stegle & Fabian J. Theis

Nature Biotechnology 2023 Apr 10. doi: 10.1038/s41587-023-01733-8.

scvi-tools is part of the scverse project (website, governance) and is fiscally sponsored by NumFOCUS. Please consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a variety of other needs.

scvi-tools's People

Contributors

adamgayoso, anazaret, canergen, cgreene, chenlingantelope, davek44, edouard360, gabmis, galenxing, imyiningliu, jeff-regier, jules-samaran, justjhong, marianogabitto, martinkim0, maxime-langevin, mjayasur, munfred, njbernstein, oscarclivio, pierreboyeau, pre-commit-ci[bot], rk900, romain-lopez, talashuach, triyangle, vals, vitkl, watiss, wukathy


scvi-tools's Issues

Loom for RETINA dataset

Create a new class called LoomDataset that uses the loompy library to load an arbitrary dataset in the loom format.

https://github.com/linnarsson-lab/loompy

As a test case --- and so we can easily use it --- convert the RETINA dataset to loom format and upload it to the YosefLab/scVI-data repository.

move load_datasets from __init__.py to run_benchmarks.py

Let's move the load_datasets function from datasets/__init__.py to run_benchmarks.py. Then let's only call that function in run_benchmarks.py. Everywhere else load_datasets is called (e.g. in unit tests), just instantiate the right dataset class directly. For example, rather than

gene_dataset = load_datasets('cortex')

do

gene_dataset = CortexDataset()

[CLOSED] Working Benchmarks

Issue by Edouard360
Thursday Apr 05, 2018 at 05:39 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/16


  • Solved the model’s last issues.

  • Moved all the benchmark logic into a single file: run_benchmarks.py. Now the tests only use a toy dataset and run benchmarks after training for one epoch. It can also be run from the command line.

  • Added the contrib folder for Python loading/preprocessing scripts.

PS: Sorry I used about 5 Travis builds to correct minor errors... (in particular, make lint didn't warn about Python files in the new contrib folder)


Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/16/commits

run groups of benchmarks with `run_benchmarks.py`

Currently run_benchmarks.py just runs one benchmark per call. It'd be nice if it could, optionally, run groups of benchmarks.

For example,

./run_benchmarks.py --annotation

might run all the annotation (semi-supervised) benchmarks, and then print out a nice table afterwards with all the annotation results for all the datasets.

And

./run_benchmarks.py --harmonization

might run all the unsupervised harmonization benchmarks.

And

./run_benchmarks.py --basic

might run all of the original seven benchmarks.

And

./run_benchmarks.py --all --epoch 1

would be useful for testing that everything works.

One way to implement this without a mess of if statements (cf. load_datasets) might be to create a new Benchmark class. It would have as fields all the arguments to train(), including one dataset instance (e.g. a CbmcDataset) and one model instance (e.g. a VAE instance). These Benchmark objects could then be organized into groups.
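A minimal sketch of such a Benchmark class; the field names, the run() method, and the GROUPS mapping are my invention, not code from the repository:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class Benchmark:
    """One benchmark = one dataset, one model, and the arguments to train()."""
    name: str
    dataset_factory: Callable[[], Any]    # e.g. lambda: CbmcDataset()
    model_factory: Callable[[Any], Any]   # e.g. lambda ds: VAE(ds.nb_genes)
    train_kwargs: Dict[str, Any] = field(default_factory=dict)

    def run(self):
        dataset = self.dataset_factory()
        model = self.model_factory(dataset)
        # train(model, dataset, **self.train_kwargs) would go here
        return model

# Groups keyed by the proposed command-line flags.
GROUPS: Dict[str, List[Benchmark]] = {
    "annotation": [],
    "harmonization": [],
    "basic": [],
}
```

run_benchmarks.py could then look up the requested group, call run() on each entry, and tabulate the results, instead of branching on dataset names.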

save_path instead of unit_test

Anywhere we're currently using a unit_test flag, let's make save_path an argument instead. By default save_path = "data/", but for unit tests call, for example, run_benchmarks("cortex", save_path="tests/data/").

And in retina.py, for example, rather than

    def __init__(self, unit_test=False):

use

    def __init__(self, save_path="data/"):

v0.1.4

Let's push v0.1.4 to pip and conda after finishing #42 and #62 .

Unfortunately I don't think anyone has written a conda recipe yet for anndata, so we'll have to create one and push it along with ours.

refactoring: VAEC and SVAEC as subclasses of VAE

There's a lot of duplicated code in the VAE, VAEC, and SVAEC classes. To get rid of it, let's make VAEC and SVAEC subclasses of VAE. In addition to avoiding duplicated code (which is hard to maintain), it's nice conceptually to express our semi-supervised models as specializations of our unsupervised scVI model.

Also, I wonder whether SVAEC could inherit from VAEC, or whether we even need both SVAEC and VAEC.
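Schematically, the subclassing could look like this; this is a shape-only sketch with placeholder losses and invented method names, not the real scVI classes:

```python
class VAE:
    """Unsupervised model: encoder/decoder machinery shared by all variants."""
    def __init__(self, n_genes, n_latent=10):
        self.n_genes = n_genes
        self.n_latent = n_latent

    def loss(self, x, y=None):
        # Reconstruction + KL; y is ignored in the unsupervised base class.
        return self.reconstruction(x) + self.kl(x)

    def reconstruction(self, x):
        return 0.0  # placeholder for the real term

    def kl(self, x):
        return 0.0  # placeholder for the real term

class VAEC(VAE):
    """Semi-supervised variant: reuses VAE's loss, adds label conditioning."""
    def __init__(self, n_genes, n_labels, n_latent=10):
        super().__init__(n_genes, n_latent)
        self.n_labels = n_labels

    def loss(self, x, y=None):
        # Only the label-specific term is new; everything else is inherited.
        return super().loss(x) + self.classification_loss(x, y)

    def classification_loss(self, x, y):
        return 0.0  # placeholder for the real term
```

SVAEC would then override only what differs from VAEC (or from VAE directly, if the VAEC/SVAEC distinction turns out to be unnecessary).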

refactoring: remove use_cuda from models

It's better if the models (i.e. VAE, VAEC, SVAEC) don't "know" whether they are using cuda or not. In the __init__ method for these models, we can just load everything on the CPU. Then, call the .cuda() method (e.g., vae.cuda()) right after instantiating the model. It's kind of a detail, but the distinction between the model and the chip it runs on is a helpful one to maintain.

See

https://github.com/YosefLab/scVI/blob/704ce19d80e5728542464e488968c136a241eb6f/scvi/models/vae.py#L46-L48

https://github.com/YosefLab/scVI/blob/704ce19d80e5728542464e488968c136a241eb6f/scvi/models/vaec.py#L40-L43

https://github.com/YosefLab/scVI/blob/704ce19d80e5728542464e488968c136a241eb6f/scvi/models/svaec.py#L47-L50

remote loom datasets

Let's change run_benchmarks.py, as well as the run_benchmarks function, so they can take as an argument the URL of an arbitrary loom file. e.g.,

./run_benchmarks.py --url http://loom.linnarssonlab.org/dataset/cellmetadata/Previously%20Published/Cortex.loom

In this case, the output should be the same as running

./run_benchmarks.py --dataset cortex

We also want to test that it works with

http://loom.linnarssonlab.org/dataset/cellmetadata/osmFISH/osmFISH_SScortex_mouse_all_cells.loom
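The flag handling could be sketched with argparse as below; load_dataset_from_args and its tuple return value are my placeholders for illustration, not the actual run_benchmarks.py code:

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Run scVI benchmarks")
    # A dataset is named either by built-in name or by loom URL, not both.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--dataset", help="name of a built-in dataset, e.g. cortex")
    group.add_argument("--url", help="URL of an arbitrary .loom file")
    parser.add_argument("--epochs", type=int, default=200)
    return parser

def load_dataset_from_args(args):
    if args.url is not None:
        # Would construct a loom-backed dataset that downloads from the URL.
        return ("loom", args.url)
    return ("builtin", args.dataset)
```

With this in place, `./run_benchmarks.py --url <loom-url>` and `./run_benchmarks.py --dataset cortex` go through the same code path after the dataset is constructed.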

notebook of data loading examples

Create a notebook named examples/data_loading.ipynb that shows users all the ways they can load data into scVI. They can load

  • a loom file
  • a csv file
  • a 10x file
  • any of our "built in" datasets (list them, and give a little information about each of them)

[CLOSED] added .cpu() so run_benchmark won't crash when cuda is available

Issue by jeff-regier
Friday Apr 06, 2018 at 18:21 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/19


I was getting this error from run_benchmarks.py. Easily fixed.

Traceback (most recent call last):
  File "run_benchmarks.py", line 19, in <module>
    run_benchmarks(gene_dataset, n_epochs=args.epochs)
  File "/home/jeff/git/scVI-dev/scvi/benchmark.py", line 32, in run_benchmarks
    imputation_score = imputation(vae, gene_dataset)
  File "/home/jeff/git/scVI-dev/scvi/imputation.py", line 45, in imputation
    mae = imputation_error(px_rate.data.numpy(), X, i, j, ix)
RuntimeError: can't convert CUDA tensor to numpy (it doesn't support GPU arrays). Use .cpu() to move the tensor to host memory first.

jeff-regier included the following code: https://github.com/YosefLab/scVI-dev/pull/19/commits

additional datasets

@imyiningliu -- I think Maxime, Eddie, and Chenling are all working with datasets now that aren't yet "wrapped" by scVI. It'd be great if you could talk with all three of them to find out what datasets they're using, and add one class (that inherits from GeneExpressionDataset) for each dataset they plan to keep using. They may already have some code, which they haven't committed yet, that you can start with. We'd also want some documentation for each dataset, and a unit test if the new dataset requires a non-trivial amount of code.

These new datasets may have some characteristics that are different from the datasets we've seen so far.

  • Maxime's smFISH datasets have position information. I think he had some ideas for how to modify GeneExpressionDataset to include that information.
  • Eddie's pbmc donor data may already be accessible through the Dataset10x class.
  • Chenling's data from the simulator isn't available at any public URL yet, and maybe it's too early to make it public. But if not, we could share it through our scVI-dev repo.

Also, I'm interested in getting access through scVI to the dataset mentioned in this paper: https://www.nature.com/articles/nmeth.4636
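A new dataset wrapper along the lines requested above could look like this; the GeneExpressionDataset stand-in is drastically simplified, and SmFishDataset, its URL, and the positions field are hypothetical examples, not code from the repo:

```python
import os
import urllib.request

class GeneExpressionDataset:
    """Minimal stand-in for the scVI base class (the real one lives in scvi/dataset)."""
    def __init__(self, url, filename, save_path="data/"):
        self.url = url
        self.filename = filename
        self.save_path = save_path

    def download(self):
        # Download the raw file once, into save_path.
        os.makedirs(self.save_path, exist_ok=True)
        path = os.path.join(self.save_path, self.filename)
        if not os.path.exists(path):
            urllib.request.urlretrieve(self.url, path)
        return path

class SmFishDataset(GeneExpressionDataset):
    """Hypothetical wrapper for an smFISH dataset with spatial positions."""
    def __init__(self, save_path="data/"):
        super().__init__(
            url="http://example.org/smfish.csv",  # placeholder URL
            filename="smfish.csv",
            save_path=save_path,
        )
        # The extra per-cell (x, y) coordinates Maxime proposed adding.
        self.positions = None
```

Each new dataset then only supplies its URL, filename, and any extra fields, while download/caching behavior stays in the base class.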

refactoring: combine training method

There's a lot of code duplicated across our four methods for training models: train, train_semi_supervised_jointly, train_semi_supervised_alternately, and train_classifier. How about combining them into a single train method and passing it additional arguments (e.g., boolean flags) to control how it behaves?
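One possible shape for the combined method; the flag names and the mode strings are invented for illustration, and the epoch loop body is elided:

```python
def train(model, dataset, n_epochs=20, *, semi_supervised=False,
          alternate=False, classifier_only=False):
    """Single entry point replacing the four near-duplicate training loops.

    Flags select the behavior previously split across train(),
    train_semi_supervised_jointly, train_semi_supervised_alternately,
    and train_classifier.
    """
    if classifier_only:
        mode = "classifier"
    elif semi_supervised:
        mode = "alternate" if alternate else "joint"
    else:
        mode = "unsupervised"

    for epoch in range(n_epochs):
        pass  # shared epoch loop; mode-specific steps would dispatch on `mode`

    return mode
```

The shared loop (batching, optimizer step, logging) is written once, and only the per-mode loss computation needs to branch.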

cortex imputation error is too high

In the paper we report that the imputation error for cortex is around 2.2, but when I run the current version of scVI the imputation error is much higher.

jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset cortex --epochs 200
File data/expression.bin already downloaded
Preprocessing Cortex data
training: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 200/200 [00:29<00:00,  6.69it/s]
Total runtime for 201 epochs is: 29.987644910812378 seconds for a mean per epoch runtime of 0.14919226323787252 seconds.
Best ll was : 1288.2259785091362
Log-likelihood Train: 1261.4138140255177
Log-likelihood Test: 1289.2478976328903
Imputation score on test (MAE) is: 4.208409309387207

IPython notebook defines unused variable latent_dimension

A very minor issue in the example IPython notebook:

latent_dimension = 10

is defined but the following model definition does not include n_latent=latent_dimension, so changing the value defined in the notebook has no effect. I think the same may be true of batch_size.

Thank you for this algorithm!

Negative Binomial parameterization

Hello,

Could you share the form you are using for the negative binomial distribution? I find it strange that there is no factorial term in the likelihood function.
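For reference, the standard mean/inverse-dispersion form of the negative binomial log-likelihood is sketched below; the "missing factorial" appears as the lgamma(x + 1) term (log x!). This is the textbook parameterization consistent with the lgamma-based implementation mentioned in the PRs above, written with my own variable names, not a quote of scVI's exact code:

```python
from math import lgamma, log, exp

def log_nb_positive(x, mu, theta):
    """log NB(x; mu, theta) with mean mu and inverse dispersion theta.

    The log-factorial log(x!) shows up as lgamma(x + 1), and the two
    remaining lgamma terms come from the binomial coefficient
    Gamma(x + theta) / (Gamma(theta) * Gamma(x + 1)).
    """
    return (lgamma(x + theta) - lgamma(theta) - lgamma(x + 1)
            + theta * log(theta / (theta + mu))
            + x * log(mu / (theta + mu)))
```

Summing exp(log_nb_positive(k, mu, theta)) over k recovers a proper distribution with mean mu, which is one quick way to sanity-check the parameterization.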

hemato -- File b'data/bBM.spring_and_pba.csv' does not exist

Looks like one of the files for hemato isn't getting downloaded automatically:

jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset hemato
Downloading file at data/bBM.raw_umifm_counts.csv.gz
Downloading file at data/data.zip
Preprocessing Hemato data
Traceback (most recent call last):
  File "./run_benchmarks.py", line 54, in <module>
    dataset = load_datasets(args.dataset, url=args.url)
  File "./run_benchmarks.py", line 26, in load_datasets
    gene_dataset = HematoDataset(save_path=save_path)
  File "/home/jeff/git/scVI/scvi/dataset/hemato.py", line 20, in __init__
    expression_data, gene_names = self.download_and_preprocess()
  File "/home/jeff/git/scVI/scvi/dataset/dataset.py", line 47, in download_and_preprocess
    return self.preprocess()
  File "/home/jeff/git/scVI/scvi/dataset/hemato.py", line 35, in preprocess
    spring_and_pba = pd.read_csv(self.save_path + self.spring_and_pba_filename)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/home/jeff/miniconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'data/bBM.spring_and_pba.csv' does not exist

More information

It would be great if you could share more insight into what's going on in the code and how to use it, maybe through a more complete README file?

KeyError: 'cmbc'

Possibly stopped working after #54 ?

jeff@dean ~/git/scVI $ ./run_benchmarks.py --dataset cbmc
Traceback (most recent call last):
  File "./run_benchmarks.py", line 54, in <module>
    dataset = load_datasets(args.dataset, url=args.url)
  File "./run_benchmarks.py", line 22, in load_datasets
    gene_dataset = CiteSeqDataset('cmbc', save_path=save_path)
  File "/home/jeff/git/scVI/scvi/dataset/cite_seq.py", line 17, in __init__
    s = available_datasets[name]
KeyError: 'cmbc'

[CLOSED] Lgamma / Cuda / Dataset / Dependencies

Issue by Edouard360
Friday Apr 06, 2018 at 07:10 GMT
Originally opened as https://github.com/YosefLab/scVI-dev/pull/18


  • Removed the approximation for log_zinb_positive since we don't need it anymore (we have a fully working lgamma); added the corresponding build and dependency changes.

  • Added CUDA support (CPU/GPU distinction) - thanks Maxime.

  • Removed the sklearn dependency and the train_test_split call. Using SubsetRandomSampler could be another option but is more complicated.

  • Created a dataset module with a subclass for each dataset, all inheriting the same base class, with downloading/preprocessing logic.

  • Moved the benchmark logic into scvi/benchmark.py.


Edouard360 included the following code: https://github.com/YosefLab/scVI-dev/pull/18/commits
