scjoint's Introduction

scJoint

scJoint is a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semi-supervised framework and uses a neural network to train on labeled and unlabeled data simultaneously, enabling label transfer and joint visualization in an integrative framework. For more information, please see the scJoint manuscript: https://doi.org/10.1101/2020.12.31.424916.

scJoint is developed using PyTorch 1.0.0 and has been tested under both PyTorch 1.0.0 and 1.4.0. scJoint requires 1 GPU to run.

Tutorials

  • A step-by-step tutorial using CITE-seq and ASAP-seq PBMC data from the control condition generated by Mimitou et al. 2020 (GSE156478): link
  • Tutorial for 10x Genomics data:
    • download the example data: link
    • process data from SingleCellExperiment to scJoint's input: link
    • scJoint integration analysis: link
  • Tutorial for mouse primary motor cortex that integrates transcriptomics, chromatin accessibility and methylation: link; the data can be downloaded from link

Installation

scJoint can be obtained by cloning the GitHub repository:

git clone https://github.com/SydneyBioX/scJoint.git

scJoint requires the following Python packages to be installed: h5py, torch, scipy, and numpy. The other modules it imports (itertools, os, random, sys, time, and datetime) are part of the Python standard library and need no separate installation.

Preparing input for scJoint

scJoint's main function takes expression data in .npz format and cell type labels in .txt format. To prepare the input for scJoint, modify the dataset paths in process_db.py, which:

  1. takes the .h5 files of the expression matrices (stored in matrix/data) as input and generates an .npz file for each expression matrix;
  2. converts the cell type labels in the .csv files to numeric labels stored in .txt files, and writes a label_to_idx.txt file that records the correspondence between the numeric labels and the cell type labels.
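
The label conversion in step 2 can be illustrated with a minimal sketch. This is not the code in process_db.py: the function name, the column name cell_type, and the "label index" line layout of the mapping file are assumptions made for illustration only.

```python
import csv


def encode_labels(csv_path, txt_path, mapping_path, column="cell_type"):
    """Write one numeric label per cell and a label-to-index table.

    `column` is a hypothetical header name; adjust it to the actual CSV.
    """
    with open(csv_path, newline="") as fh:
        labels = [row[column] for row in csv.DictReader(fh)]

    # Sort the distinct labels so the numbering is reproducible across runs.
    label_to_idx = {label: idx for idx, label in enumerate(sorted(set(labels)))}

    # One numeric label per line, in the same order as the cells.
    with open(txt_path, "w") as fh:
        for label in labels:
            fh.write(f"{label_to_idx[label]}\n")

    # Assumed layout of the mapping file: "<cell type> <index>" per line.
    with open(mapping_path, "w") as fh:
        for label, idx in label_to_idx.items():
            fh.write(f"{label} {idx}\n")

    return label_to_idx
```

Calling encode_labels("labels.csv", "labels.txt", "label_to_idx.txt") on a CSV with a cell_type column would write both files and return the mapping as a dictionary.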

Note:

  1. The expression matrix is the gene expression matrix (either normalised or raw) for scRNA-seq data, and the gene activity matrix for scATAC-seq data.
  2. Cell type labels are required for scRNA-seq data; labels for scATAC-seq data are optional and are used only to calculate accuracy.

Running scJoint

Edit config.py according to the data input (see the Arguments section for more details).

In a terminal, run

python main.py

The output will be saved in the ./output folder.

Arguments

The script config.py defines the arguments for scJoint and needs to be modified according to the data.

Dataset information

  • DB: name of the study
  • number_of_class: Number of cell types in the training data (scRNA-seq data)
  • input_size: Number of genes in both training and test data
  • rna_paths: A list of file paths of the .npz files of scRNA-seq gene expression datasets
  • rna_labels: A list of file paths of the .txt files of scRNA-seq cell type information
  • atac_paths: A list of file paths of the .npz files of scATAC-seq gene activity datasets
  • atac_labels: A list of file paths of the .txt files of scATAC-seq cell type information (optional; if atac_labels are provided, the accuracy after KNN is reported)
  • rna_protein_paths: A list of paths of the .npz files of protein expression data for CITE-seq data (optional)
  • atac_protein_paths: A list of paths of the .npz files of protein expression data for ASAP-seq data (optional)

Training config

  • use_cuda: Whether GPU is used

  • threads: Number of threads used (set as 1 by default)

  • batch_size: Batch size (set as 256 by default)

  • lr_stage1: Learning rate for stage 1

  • lr_stage3: Learning rate for stage 3

  • lr_decay_epoch: Number of epochs between learning rate decay steps

  • epochs_stage1: Number of epochs for stage 1

  • epochs_stage3: Number of epochs for stage 3

  • p: The fraction of data pairs expected to have high cosine similarity scores (set as 0.8 by default)

  • embedding_size: Number of nodes in the embedding (hidden) layer (set as 64 by default)

  • momentum: Momentum for SGD (set as 0.9 by default)

  • center_weight: The weight for center loss (set as 1 by default)

  • with_crossentorpy: True runs scJoint in the well-differentiated cell type mode; False runs the trajectory mode.

  • seed: Random seed to be used (None by default)
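
The argument list does not spell out the decay schedule behind lr_decay_epoch. One plausible reading, assumed here and not confirmed against the scJoint source, is a step decay in which the learning rate is multiplied by a fixed factor every lr_decay_epoch epochs. The function name and the factor of 0.1 are hypothetical.

```python
def decayed_lr(init_lr, epoch, lr_decay_epoch, factor=0.1):
    """Step decay: scale the rate by `factor` once every `lr_decay_epoch` epochs.

    The factor 0.1 is an assumed value chosen for illustration.
    """
    return init_lr * factor ** (epoch // lr_decay_epoch)


# With lr_decay_epoch=20, epochs 0..19 keep the initial rate and
# epoch 20 is the first epoch trained with the reduced rate.
schedule = [decayed_lr(0.01, epoch, lr_decay_epoch=20) for epoch in (0, 19, 20)]
```

Under this reading, a run with fewer epochs than lr_decay_epoch never changes its learning rate.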

The configuration we used in our paper can be found in link.
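
To show how the arguments fit together, here is a minimal sketch of a configuration object. It is not the shipped config.py: the class name, the validate method, and every path, size, learning rate, and epoch count are placeholders. Only the defaults stated in the list above (threads, batch_size, p, embedding_size, momentum, center_weight, seed) are taken from this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleConfig:
    # Dataset information (all values below are placeholders)
    DB: str = "my_study"
    number_of_class: int = 10
    input_size: int = 2000
    rna_paths: List[str] = field(default_factory=lambda: ["data/exprs_rna.npz"])
    rna_labels: List[str] = field(default_factory=lambda: ["data/labels_rna.txt"])
    atac_paths: List[str] = field(default_factory=lambda: ["data/exprs_atac.npz"])
    atac_labels: List[str] = field(default_factory=list)  # optional
    rna_protein_paths: List[str] = field(default_factory=list)  # optional
    atac_protein_paths: List[str] = field(default_factory=list)  # optional

    # Training config (defaults as documented; the rest are placeholders)
    use_cuda: bool = True
    threads: int = 1
    batch_size: int = 256
    lr_stage1: float = 0.01  # placeholder
    lr_stage3: float = 0.01  # placeholder
    lr_decay_epoch: int = 20  # placeholder
    epochs_stage1: int = 20  # placeholder
    epochs_stage3: int = 20  # placeholder
    p: float = 0.8
    embedding_size: int = 64
    momentum: float = 0.9
    center_weight: float = 1.0
    with_crossentorpy: bool = True
    seed: Optional[int] = None

    def validate(self):
        """Check the constraints stated in the argument list."""
        if len(self.rna_labels) != len(self.rna_paths):
            raise ValueError("every scRNA-seq dataset needs a label file")
        if self.atac_labels and len(self.atac_labels) != len(self.atac_paths):
            raise ValueError("atac_labels must be empty or match atac_paths")
        if not 0.0 < self.p <= 1.0:
            raise ValueError("p is a fraction and must lie in (0, 1]")
```

Because rna_paths and atac_paths are lists, several datasets of each modality can be listed side by side, as long as each scRNA-seq dataset has a matching label file.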

Output

scJoint will output four types of .txt files:

  • _embeddings.txt: Output of the embedding layer for each dataset
  • _knn_predictions.txt: KNN predictions for each scATAC-seq dataset (the final predictions), where the numeric labels correspond to the entries in the label_to_idx.txt file
  • _knn_probs.txt: Probability of the KNN predictions for each scATAC-seq dataset
  • _predictions.txt: Output of the prediction layer for each dataset
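
The numeric KNN predictions can be translated back into cell type names with label_to_idx.txt. The sketch below assumes two file layouts that this document does not specify: one prediction per line in _knn_predictions.txt, and "<cell type> <index>" per line in label_to_idx.txt. The function names are hypothetical.

```python
def read_label_map(path):
    """Return {index: cell type}, assuming '<cell type> <index>' per line."""
    idx_to_label = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            # Split on the last whitespace so labels such as "B cell" stay whole.
            label, idx = line.rsplit(maxsplit=1)
            idx_to_label[int(idx)] = label
    return idx_to_label


def read_predictions(path):
    """Return the numeric predictions, assuming one value per line."""
    with open(path) as fh:
        # int(float(...)) also accepts values written as "3.0".
        return [int(float(line)) for line in fh if line.strip()]


def named_predictions(pred_path, map_path):
    """Translate numeric predictions into cell type names."""
    idx_to_label = read_label_map(map_path)
    return [idx_to_label[idx] for idx in read_predictions(pred_path)]
```

If the actual files use a different separator or ordering, only read_label_map needs to change.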

Visualisation

To generate tSNE and UMAP plots for the output data using R, run the following command in a terminal:

Rscript embedding_visualisation_R.R --output_dir output/ --input_dir data/ --TSNE TRUE --UMAP TRUE --proportion 1

where

  • output_dir: Directory of the output folder
  • input_dir: Directory of the input folder (where the label_to_idx.txt file is saved)
  • TSNE: TRUE/FALSE indicates whether to run TSNE
  • UMAP: TRUE/FALSE indicates whether to run UMAP
  • proportion: Proportion of cells used in the visualisation

Note:

  • The script assumes the output folder only has results from one study.
  • Before running the embedding_visualisation_R.R script, please install the required packages by running the following code in R:
install.packages(c("ggplot2", "ggthemes", "scattermore", "ggpubr", "Rtsne", "uwot", "pals", "grDevices", "optparse"))

Output of embedding_visualisation_R.R:

  • The TSNE and/or UMAP embeddings are written to the output_dir folder: tsne_embedding.txt, umap_embedding.txt
  • Visualisation of TSNE and UMAP: TSNE_plot.pdf, UMAP_plot.pdf

Online app

scJoint is also available via superbio: https://app.superbio.ai/apps/114/.

Reference

Lin, Y., Wu, T.Y., Wan, S., Yang, J.Y., Wong, W.H. and Wang, Y.X., 2022. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nature Biotechnology, 40(5), pp.703-710.

scjoint's People

Contributors

yingxinlin

scjoint's Issues

ValueError: not enough values to unpack (expected 2, got 0)

Hi ,

I hope you're doing well. I wanted to inform you that I encountered an error while running the file main.py. The error occurs in the progress_bar function of the utils.py file. Specifically, the function fails to retrieve the terminal size, resulting in a "ValueError: not enough values to unpack" error.

Thank you very much for your work on this project and your attention to this matter.

Best regards,
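
A possible workaround, sketched under the assumption that progress_bar obtains the width by unpacking the output of a shell call such as stty size, which returns nothing when no terminal is attached (for example in a batch job or a notebook). This has not been checked against utils.py, and the helper name is hypothetical. The standard library's shutil.get_terminal_size accepts a fallback and so never yields an empty result.

```python
import shutil


def terminal_width(default=80):
    """Return the terminal width, or `default` when no terminal is attached."""
    # get_terminal_size falls back to the given (columns, lines) pair
    # instead of failing when the size cannot be queried.
    return shutil.get_terminal_size(fallback=(default, 24)).columns


term_width = terminal_width()
```

Replacing the failing unpacking line with a call like this keeps the progress bar working in non-interactive runs.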


How are cells paired?

Hi SydneyBioX researchers,

I am not sure I understand your code very well, and I am confused about one thing: how are cells from RNA-seq and ATAC-seq paired? Is it achieved through NNDR? This is where I get lost. Thank you so much.

object of type 'S4' is not subsettable in write_h5_scJoint

Hi,

I used the test rds data downloaded from the data_10x folder and tried to combine scATAC and scRNA data using the following scripts:

sce_10xPBMC_atac <- readRDS("/scratch/hy17471/software/scJoint/data_10x/sce_10xPBMC_atac.rds")
sce_10xPBMC_rna <- readRDS("/scratch/hy17471/software/scJoint/data_10x/sce_10xPBMC_rna.rds")

common_genes <- intersect(rownames(sce_10xPBMC_atac),
                          rownames(sce_10xPBMC_rna))
length(common_genes)

exprs_atac <- logcounts(sce_10xPBMC_atac[common_genes, ])
exprs_rna <- logcounts(sce_10xPBMC_rna[common_genes, ])

write_h5_scJoint(exprs_list = list(rna = exprs_rna,
                                   atac = exprs_atac),
                 h5file_list = c(paste0(output_dir, '/exprs_10xPBMC_test_rna.h5'),
                                 paste0(output_dir, '/exprs_10xPBMC_test_atac.h5')))

but it returns:
Error in .Primitive("[")(new("HDF5RealizationSink", dim = c(9841L, 15463L :
object of type 'S4' is not subsettable

Could you help me figure it out? Thanks.

Best,
Haidong

Is it possible to use the tool on 4 samples?

Hello SydneyBioX lab!

I have a question about your tool! I have four datasets: scRNA WT, scATAC WT, scRNA MUT and scATAC MUT.
Would it be possible to use your tool on these four samples together? Or would it be better to integrate the scRNA WT and MUT data, integrate the scATAC WT and MUT data, and then run your tool?

Thanks

an error in `model_stage1.train`

Hi, authors,
I got an error when I ran python3 main.py (screenshot attached). I am confused by this problem and would appreciate it if you could reply soon. Thanks!

ValueError at the "write embeddings [stage 3]" step for RNA

Hello,

I am running scJoint on unpaired scRNA-seq and scATAC-seq data for label transfer from RNA to ATAC modalities. I get all the output correctly except for the RNA_embeddings, because of the error below when running main.py:

Traceback (most recent call last):
  File "main.py", line 56, in <module>
    main()
  File "main.py", line 47, in main
    model_stage3.write_embeddings()
  File "/Analysis/ATAC/scJoint/util/trainingprocess_stage3.py", line 210, in write_embeddings
    rna_embedding = rna_embedding / norm(rna_embedding, axis=1, keepdims=True)
  File "/anaconda3/lib/python3.8/site-packages/scipy/linalg/misc.py", line 140, in norm
    a = np.asarray_chkfinite(a)
  File "/anaconda3/lib/python3.8/site-packages/numpy/lib/function_base.py", line 488, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

Can you please help with that error?

Thank you,

Kind regards,
Sébastien
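
The traceback shows that the finiteness check inside norm rejected the array, which means the RNA embedding already contained NaN or infinite values before the normalisation step. A small diagnostic sketch for locating the affected cells is given below. It assumes an embedding stored as whitespace-separated numbers with one cell per line, which is a guess about the file layout, and both function names are hypothetical.

```python
import math


def load_embedding(path):
    """Read an embedding as a list of rows of floats (one cell per line)."""
    with open(path) as fh:
        # float() also parses the strings "nan" and "inf".
        return [[float(v) for v in line.split()] for line in fh if line.strip()]


def nonfinite_rows(rows):
    """Return the indices of rows that contain NaN or infinite values."""
    return [
        i for i, row in enumerate(rows)
        if not all(math.isfinite(v) for v in row)
    ]


example = [[0.1, 0.2], [float("nan"), 0.3], [0.4, float("inf")]]
bad = nonfinite_rows(example)  # rows 1 and 2 contain non-finite values
```

Knowing whether a few cells or every cell is affected helps to tell an input problem apart from training that has diverged.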

settings for scJoint

Hi,

Could somebody help me choose the settings in config.py?
I have integrated scRNA-Seq data with several differing factors (different technologies, stim and ctrl, different ages). I also have scATAC data, but from a completely different batch of samples.
After running scJoint according to the 10X tutorial notebook, the two datasets are not nicely overlaid in the tSNE plot.
Is it possible to use pre-calculated embedding values for RNA, so that the clustering/plotting is the same as in Seurat and the ATAC cells are placed in this embedding?
Or can I change the resolution of the tSNE somehow?

As I am a newbie to Python and this kind of analysis, I'd appreciate any help.

Thanks!

CUDA error in stage 3

Hi scJoint team,

In our project, scJoint ran smoothly in stages 1 and 2 but threw an error in stage 3.
It appears to me that all of the loss values are getting incredibly large.

Many thanks for your insights!

The following is the output for stage 3:


Training start [Stage3]
num_workers: 0
load npz matrix: /home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/data_hp_wt/wt_hp_rna.npz
load npz matrix: /home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/data_hp_wt/wt_hp_atac.npz
Epoch: 0
LR is set to 0.01
LR is set to 0.01
 [============ 29/29 =========>.]    Step: 157ms | Tot: 4s393ms | encoding_loss: 12.817, rna_loss: 59.222, center_loss: 5.881
Epoch: 1
 [============ 29/29 =========>.]    Step: 156ms | Tot: 4s366ms | encoding_loss: 919.316, rna_loss: 27529.147, center_loss: 120.546
Epoch: 2
 [================ 29/29 =====>.]    Step: 158ms | Tot: 4s339ms | encoding_loss: 11958.841, rna_loss: 204954.188, center_loss: 413.1Epoch: 3
 [=================== 29/29 ==>.]    Step: 156ms | Tot: 4s349ms | encoding_loss: 260700.759, rna_loss: 7364164.912, center_loss: 254Epoch: 4
 [========================== 29/29   Step: 158ms | Tot: 4s335ms | encoding_loss: 137907357.234, rna_loss: 5798996849.379, center_losEpoch: 5.433
 [============================>.] 29/29 p: 156ms | Tot: 4s368ms | encoding_loss: 25037452131.310, rna_loss: 874840569008.552, centerEpoch: 658529.145
 [============================>.]    Step:  29/29  Tot: 4s324ms | encoding_loss: 104636323585765.516, rna_loss: 3276948089484535.000Epoch: 7_loss: 34642031.474
 [============================>.]    Step: 157 29/29 t: 4s347ms | encoding_loss: 2187132071060303.500, rna_loss: 73972744484661184.0Epoch: 8er_loss: 203889508.828
 [============================>.]    Step: 158ms | T 29/29 02ms | encoding_loss: 972514208877650816.000, rna_loss: 46906742034446524Epoch: 9 center_loss: 4715636888.276
 [============================>.]    Step: 157ms | Tot: 4s3 29/29 encoding_loss: 361596709780774846464.000, rna_loss: 11707894052183Epoch: 10.000, center_loss: 68285345968.552
 [============================>.]    Step: 157ms | Tot: 4s388ms  29/29 ing_loss: 32664000407402539122688.000, rna_loss: 143521479561Epoch: 112480.000, center_loss: 123240449412.414
 [============================>.]    Step: 159ms | Tot: 4s444ms | en 29/29 loss: 8485904532940365594886144.000, rna_loss: 3706595062Epoch: 1209562368.000, center_loss: 125000001182.897
 [============================>.]    Step: 160ms | Tot: 4s436ms | encodin 29/29  1186645034441254863890284544.000, rna_loss: 4430101Epoch: 131753024004096.000, center_loss: 125000002877.793
 [============================>.]    Step: 160ms | Tot: 4s479ms | encoding_los 29/29 50627509908939357502308352.000, rna_loss: 43719Epoch: 14617924641928249344.000, center_loss: 125000002595.310
 [============================>.]    Step: 159ms | Tot: 4s428ms | encoding_loss: 745 29/29 26601367427467389698048.000, rna_loss: 23Epoch: 15674412783426934507831296.000, center_loss: 125000002312.828
 [============ 29/29 =========>.]    Step: 159ms | Tot: 4s458ms | encoding_loss: inf, rna_loss: inf, center_loss: nan               Epoch: 16er_loss: 125000001536.000 000, center_loss: 125000000853.333
Traceback (most recent call last):   Step: 161ms | Tot: 2s89ms | encoding_loss: inf, rna_loss: nan, center_loss: nan
  File "main.py", line 56, in <module>
    main()
  File "main.py", line 44, in main
    model_stage3.train(epoch)
  File "/home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/util/trainingprocess_stage3.py", line 152, in train
    encoding_loss.backward(retain_graph=True)
  File "/home/shao/miniconda3/envs/env_scjoint/lib/python3.8/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/shao/miniconda3/envs/env_scjoint/lib/python3.8/site-packages/torch/autograd/__init__.py", line 97, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered

Stage 1 and KNN worked fine:

Start time:  09:20:49
Training start [Stage1]
num_workers: 0
load npz matrix: /home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/data_hp_wt/wt_hp_rna.npz
load npz matrix: /home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/data_hp_wt/wt_hp_atac.npz
Epoch: 0
LR is set to 0.01
LR is set to 0.01
 [============ 29/29 =========>.]    Step: 152ms | Tot: 4s236ms | encoding_loss: 5.734, rna_loss: 1.521
Epoch: 1
 [============ 29/29 =========>.]    Step: 154ms | Tot: 4s270ms | encoding_loss: 3.825, rna_loss: 0.552
Epoch: 2
 [============ 29/29 =========>.]    Step: 152ms | Tot: 4s250ms | encoding_loss: 3.643, rna_loss: 0.281
Epoch: 3
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s256ms | encoding_loss: 3.586, rna_loss: 0.172
Epoch: 4
 [============ 29/29 =========>.]    Step: 157ms | Tot: 4s297ms | encoding_loss: 3.541, rna_loss: 0.128
Epoch: 5
 [============ 29/29 =========>.]    Step: 154ms | Tot: 4s283ms | encoding_loss: 3.507, rna_loss: 0.091
Epoch: 6
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s290ms | encoding_loss: 3.483, rna_loss: 0.075
Epoch: 7
 [============ 29/29 =========>.]    Step: 153ms | Tot: 4s256ms | encoding_loss: 3.462, rna_loss: 0.057
Epoch: 8
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s303ms | encoding_loss: 3.454, rna_loss: 0.044
Epoch: 9
 [============ 29/29 =========>.]    Step: 156ms | Tot: 4s327ms | encoding_loss: 3.435, rna_loss: 0.038
Epoch: 10
 [============ 29/29 =========>.]    Step: 156ms | Tot: 4s351ms | encoding_loss: 3.426, rna_loss: 0.032
Epoch: 11
 [============ 29/29 =========>.]    Step: 162ms | Tot: 4s337ms | encoding_loss: 3.410, rna_loss: 0.027
Epoch: 12
 [============ 29/29 =========>.]    Step: 160ms | Tot: 4s334ms | encoding_loss: 3.399, rna_loss: 0.026
Epoch: 13
 [============ 29/29 =========>.]    Step: 159ms | Tot: 4s356ms | encoding_loss: 3.382, rna_loss: 0.022
Epoch: 14
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s311ms | encoding_loss: 3.370, rna_loss: 0.019
Epoch: 15
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s313ms | encoding_loss: 3.363, rna_loss: 0.017
Epoch: 16
 [============ 29/29 =========>.]    Step: 153ms | Tot: 4s284ms | encoding_loss: 3.347, rna_loss: 0.016
Epoch: 17
 [============ 29/29 =========>.]    Step: 152ms | Tot: 4s295ms | encoding_loss: 3.338, rna_loss: 0.015
Epoch: 18
 [============ 29/29 =========>.]    Step: 153ms | Tot: 4s305ms | encoding_loss: 3.339, rna_loss: 0.014
Epoch: 19
 [============ 29/29 =========>.]    Step: 154ms | Tot: 4s319ms | encoding_loss: 3.331, rna_loss: 0.014
Write embeddings
 [============ 26/26 =========>.]    Step: 40ms | Tot: 2s639ms | write embeddings and predictions for db:wt_hp_rna
 [============ 30/30 ==========>]    Step: 48ms | Tot: 3s94ms | write embeddings and predictions for db:wt_hp_atac
Stage 1 finished:  09:22:28
KNN
[KNN] Read RNA data
[KNN] Read ATAC data
[KNN] Build Space
[KNN] knn
KNN finished:  09:22:30
