sydneybiox / scjoint

scjoint's Introduction

scJoint

scJoint is a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semi-supervised framework and trains a neural network on labeled and unlabeled data simultaneously, enabling label transfer and joint visualization in an integrative framework. For more information, please see the scJoint manuscript: https://doi.org/10.1101/2020.12.31.424916.

scJoint is developed using PyTorch 1.0.0 and has been tested under both PyTorch 1.0.0 and 1.4.0. scJoint requires 1 GPU to run.

Tutorials

A step-by-step tutorial using CITE-seq and ASAP-seq PBMC data from the control condition generated by Mimitou et al. 2020 (GSE156478) is available here: link
Tutorial for 10x Genomics data:
- link to download example data: link
- process data from SingleCellExperiment to scJoint's input link
- scJoint integration analysis link
Tutorial for mouse primary motor cortex data that integrates transcriptomics, chromatin accessibility and methylation: link. The data can be downloaded from link

Installation

scJoint can be obtained by cloning the GitHub repository:

git clone https://github.com/SydneyBioX/scJoint.git

The following Python packages must be installed before running scJoint: h5py, torch, scipy, and numpy. scJoint also imports itertools, os, random, sys, time, and datetime, which are part of the Python standard library and need no separate installation.

Preparing input for scJoint

scJoint's main function takes expression data in .npz format and cell type labels in .txt format. To prepare the input for scJoint, modify the dataset paths in process_db.py, which:

takes .h5 files of expression matrices stored in matrix/data as input and generates a .npz file for each expression matrix;
transforms .csv files of cell type labels into numeric labels stored in .txt files, and outputs a label_to_idx.txt file that records the correspondence between the numeric labels and the cell type labels.
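
The label conversion step described above can be sketched roughly as follows. This is an illustrative sketch, not the code in process_db.py: the function name encode_labels, the CSV column name cell_type, and the "<label> <index>" line format of label_to_idx.txt are all assumptions made for the example.

```python
import csv
from pathlib import Path


def encode_labels(csv_path, txt_path, mapping_path, column="cell_type"):
    """Convert a CSV column of cell type names into integer labels.

    Assumptions (not confirmed by the scJoint source): the CSV has a header
    row containing `column`, and the mapping file stores one
    "<label> <index>" pair per line.
    """
    with open(csv_path, newline="") as handle:
        labels = [row[column] for row in csv.DictReader(handle)]

    # Sort the unique labels so the numbering is reproducible across runs.
    label_to_idx = {name: i for i, name in enumerate(sorted(set(labels)))}

    # One numeric label per cell, in the same order as the CSV rows.
    Path(txt_path).write_text(
        "".join(f"{label_to_idx[name]}\n" for name in labels)
    )
    # The mapping file lets numeric predictions be traced back to names.
    Path(mapping_path).write_text(
        "".join(f"{name} {idx}\n" for name, idx in label_to_idx.items())
    )
    return label_to_idx
```

Sorting the label names before numbering them keeps the mapping stable when the same set of cell types is processed again.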

Note:

The expression matrix for scRNA-seq data is the gene expression matrix (either normalised or raw), and for scATAC-seq data it is the gene activity matrix.
Cell type labels are required for scRNA-seq data, while labels for scATAC-seq data are optional and are only used to calculate accuracy.

Running scJoint

Edit config.py according to the data input (See Arguments section for more details).

In a terminal, run

python main.py

The output will be saved in the ./output folder.

Arguments

The script config.py specifies the arguments for scJoint and needs to be modified according to the data.

Dataset information

DB: name of the study
number_of_class: Number of cell types in the training data (scRNA-seq data)
input_size: Number of genes in both training and test data
rna_paths: A list of file paths of the .npz files of scRNA-seq gene expression datasets
rna_labels: A list of file paths of the .txt files of scRNA-seq cell type information
atac_paths: A list of file paths of the .npz files of scATAC-seq gene activity datasets
atac_labels: A list of file paths of the .txt files of scATAC-seq cell type information (optional; if atac_labels are provided, the accuracy after KNN is reported)
rna_protein_paths: A list of paths of the .npz files of protein expression data for CITE-seq data (optional)
atac_protein_paths: A list of paths of the .npz files of protein expression data for ASAP-seq data (optional)

Training config

use_cuda: Whether GPU is used
threads: Number of threads used (set as 1 by default)
batch_size: Batch size (set as 256 by default)
lr_stage1: Learning rate for stage 1
lr_stage3: Learning rate for stage 3
lr_decay_epoch: Number of epochs between learning rate decays
epochs_stage1: Number of epochs for stage 1
epochs_stage3: Number of epochs for stage 3
p: The fraction of data pairs expected to have high cosine similarity scores (set as 0.8 by default)
embedding_size: Number of nodes in the embedding (hidden) layer (set as 64 by default)
momentum: Momentum for SGD (set as 0.9 by default)
center_weight: The weight for center loss (set as 1 by default)
with_crossentorpy: True runs the well-differentiated cell type mode; False runs the trajectory mode of scJoint.
seed: Random seed to be used (set as None by default)

The configuration we used in our paper can be found in link.
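
As a rough illustration of how the arguments above fit together, the sketch below collects them in a dataclass. It is not a copy of config.py: the class layout, the check_config helper, every file path, and the values of number_of_class, input_size, the learning rates, the epoch counts and lr_decay_epoch are hypothetical placeholders. Only the defaults that the list above states (threads, batch_size, p, embedding_size, momentum, center_weight, seed) are taken from this document. The one-label-file-per-dataset check is likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Config:
    # Dataset information (names, sizes and paths are hypothetical placeholders).
    DB: str = "example_study"
    number_of_class: int = 7
    input_size: int = 15000
    rna_paths: List[str] = field(default_factory=lambda: ["data/rna.npz"])
    rna_labels: List[str] = field(default_factory=lambda: ["data/rna_labels.txt"])
    atac_paths: List[str] = field(default_factory=lambda: ["data/atac.npz"])
    atac_labels: List[str] = field(default_factory=list)         # optional
    rna_protein_paths: List[str] = field(default_factory=list)   # optional
    atac_protein_paths: List[str] = field(default_factory=list)  # optional

    # Training config; documented defaults are used where the README gives one.
    use_cuda: bool = True
    threads: int = 1
    batch_size: int = 256
    lr_stage1: float = 0.01    # placeholder value
    lr_stage3: float = 0.01    # placeholder value
    lr_decay_epoch: int = 20   # placeholder value
    epochs_stage1: int = 20    # placeholder value
    epochs_stage3: int = 20    # placeholder value
    p: float = 0.8
    embedding_size: int = 64
    momentum: float = 0.9
    center_weight: float = 1.0
    with_crossentorpy: bool = True  # identifier spelled as in the README
    seed: Optional[int] = None


def check_config(cfg):
    """Return a list of detected problems; an empty list means none were found."""
    problems = []
    # Assumed rule: one label file per scRNA-seq dataset.
    if len(cfg.rna_paths) != len(cfg.rna_labels):
        problems.append("rna_paths and rna_labels must have the same length")
    # scATAC-seq labels are optional, so only check them when supplied.
    if cfg.atac_labels and len(cfg.atac_labels) != len(cfg.atac_paths):
        problems.append("atac_labels, when given, must match atac_paths in length")
    if not 0.0 < cfg.p <= 1.0:
        problems.append("p must be a fraction in (0, 1]")
    return problems
```

Running check_config(Config()) before launching main.py is one way to catch mismatched path lists early.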

Output

scJoint will output 4 types of .txt files:

_embeddings.txt: Output of embeddings layer for each dataset
_knn_predictions.txt: Predicted results of KNN for each scATAC-seq dataset (final predictions), where the numeric values correspond to the labels in the label_to_idx.txt file.
_knn_probs.txt: Probability of KNN predictions for each scATAC-seq data
_predictions.txt: Output of prediction layer for each dataset
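
To turn the numeric KNN predictions back into cell type names, something along these lines may help. It is a sketch under assumptions: the helper names are invented for illustration, label_to_idx.txt is assumed to hold one "<label> <index>" pair per line, and the prediction file is assumed to hold one numeric value per line.

```python
def load_label_map(mapping_path):
    """Read label_to_idx.txt into an {index: label} dictionary.

    Assumes each line is "<label> <index>"; the label may contain spaces,
    so the line is split once from the right.
    """
    idx_to_label = {}
    with open(mapping_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            label, idx = line.rsplit(maxsplit=1)
            idx_to_label[int(idx)] = label
    return idx_to_label


def decode_predictions(pred_path, idx_to_label):
    """Translate one numeric prediction per line into a cell type name."""
    with open(pred_path) as handle:
        # int(float(...)) accepts both "3" and "3.0" style values.
        return [idx_to_label[int(float(line))] for line in handle if line.strip()]
```

If the real files use a different delimiter or column order, only the parsing line in load_label_map needs to change.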

Visualisation

To generate tSNE and UMAP plots for the output data using R, run the following command in a terminal

Rscript embedding_visualisation_R.R --output_dir output/ --input_dir data/ --TSNE TRUE --UMAP TRUE --proportion 1

where

output_dir: Directory of the output folder
input_dir: Directory of the input folder (where the label_to_idx.txt file is saved)
TSNE: TRUE/FALSE indicates whether to run TSNE
UMAP: TRUE/FALSE indicates whether to run UMAP
proportion: Proportion of cells used in visualisation

Note:

The script assumes the output folder only contains results from one study.
Before running the embedding_visualisation_R.R script, install the following packages by running this code in R:

install.packages(c("ggplot2", "ggthemes", "scattermore", "ggpubr", "Rtsne", "uwot", "pals", "grDevices", "optparse"))

Output of embedding_visualisation_R.R:

TSNE and/or UMAP embeddings will be generated in the output_dir folder: tsne_embedding.txt, umap_embedding.txt
Visualisation of TSNE and UMAP: TSNE_plot.pdf, UMAP_plot.pdf

Online app

scJoint is also available via superbio: https://app.superbio.ai/apps/114/.

Reference

Lin, Y., Wu, T.Y., Wan, S., Yang, J.Y., Wong, W.H. and Wang, Y.X., 2022. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nature Biotechnology, 40(5), pp.703-710.

scjoint's Issues

ValueError: not enough values to unpack (expected 2, got 0)

Hi,

I hope you're doing well. I wanted to inform you that I encountered an error while running the file main.py. The error occurs in the progress_bar function of the utils.py file. Specifically, the function fails to retrieve the terminal size, resulting in a "ValueError: not enough values to unpack" error.

Thank you very much for your work on this project and your attention to this matter.

Best regards,
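
A possible workaround, offered as a sketch rather than a confirmed fix: if progress_bar obtains the terminal size by unpacking the output of a shell command (an assumption about utils.py, which is not shown here), that output is empty when no terminal is attached, for example under nohup, a batch scheduler, or a notebook. The standard-library call below returns a fallback size instead of failing. The function name terminal_width is hypothetical.

```python
import shutil


def terminal_width(default_columns=80):
    """Return the terminal width, or a default when no terminal is attached."""
    # shutil.get_terminal_size returns the fallback size instead of raising
    # when the size cannot be queried from the environment or the terminal.
    return shutil.get_terminal_size(fallback=(default_columns, 24)).columns
```

The returned value could replace the unpacked width wherever the progress bar computes its length.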

How are cells paired?

Hi SydneyBioX researchers,

I am not sure I understand your code well, and I am confused about one thing: how are cells from RNA-seq and ATAC-seq paired? Is it achieved through NNDR? This is where I get lost. Thank you so much.

object of type 'S4' is not subsettable in write_h5_scJoint

Hi,

I used the test rds data downloaded from the data_10x folder and tried to combine scATAC and scRNA data using the following scripts:

sce_10xPBMC_atac <- readRDS("/scratch/hy17471/software/scJoint/data_10x/sce_10xPBMC_atac.rds")
sce_10xPBMC_rna <- readRDS("/scratch/hy17471/software/scJoint/data_10x/sce_10xPBMC_rna.rds")

common_genes <- intersect(rownames(sce_10xPBMC_atac),
                          rownames(sce_10xPBMC_rna))
length(common_genes)

exprs_atac <- logcounts(sce_10xPBMC_atac[common_genes, ])
exprs_rna <- logcounts(sce_10xPBMC_rna[common_genes, ])

write_h5_scJoint(exprs_list = list(rna = exprs_rna,
                                   atac = exprs_atac),
                 h5file_list = c(paste0(output_dir, '/exprs_10xPBMC_test_rna.h5'),
                                 paste0(output_dir, '/exprs_10xPBMC_test_atac.h5')))

but it returns:
Error in .Primitive("[")(new("HDF5RealizationSink", dim = c(9841L, 15463L :
object of type 'S4' is not subsettable

Could you help me figure this out? Thanks.

Best,
Haidong

Questions related to scanpy data

Hi, is it possible to use this model with scanpy data? Thanks.

An Indentation error in knn.py

In the #test part of util/knn.py, the calculation of knn accuracy should be indented, maybe.

Is it possible to use the tool on 4 samples?

Hello SydneyBioX lab!

I have a question about your tool. I have four datasets: scRNA WT, scATAC WT, scRNA MUT and scATAC MUT.
Would it be possible to use your tool on all four samples together? Or would it be better to integrate the scRNA WT and MUT data, integrate the scATAC WT and MUT data, and then run your tool?

Thanks

an error in `model_stage1.train`

Hi, authors,
I got the following error when I ran python3 main.py. I am confused by this problem and would appreciate it if you could reply soon. Thanks!

an error in model_stage1.train(epoch) by the test 10X_data

Hello, authors:
I ran the 10x test data provided by the authors and got the same error as with my own data. The error is as follows:

I am confused as to why the test data produces the same problem. I would appreciate it if you could reply soon. Thanks!

ValueError at the "write embeddings [stage 3]" step for RNA

Hello,

I am running scJoint on unpaired scRNA-seq and scATAC-seq data for label transfer from the RNA to the ATAC modality. All the output is produced correctly except for the RNA embeddings, because of the error below when running main.py:

Traceback (most recent call last):
  File "main.py", line 56, in <module>
    main()
  File "main.py", line 47, in main
    model_stage3.write_embeddings()
  File "/Analysis/ATAC/scJoint/util/trainingprocess_stage3.py", line 210, in write_embeddings
    rna_embedding = rna_embedding / norm(rna_embedding, axis=1, keepdims=True)
  File "/anaconda3/lib/python3.8/site-packages/scipy/linalg/misc.py", line 140, in norm
    a = np.asarray_chkfinite(a)
  File "/anaconda3/lib/python3.8/site-packages/numpy/lib/function_base.py", line 488, in asarray_chkfinite
    raise ValueError(
ValueError: array must not contain infs or NaNs

Can you please help with this error?

Thank you,

Kind regards,
Sébastien
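
The traceback shows that the embedding matrix already contains inf or NaN values before it is normalised, which usually points back to the training step rather than to the write step. As a diagnostic sketch (plain Python, with hypothetical helper names that are not part of scJoint), the rows at fault can be located and a row-wise L2 normalisation can be made safe for all-zero rows like this:

```python
import math


def nonfinite_rows(matrix):
    """Return the indices of rows that contain NaN or infinite values."""
    return [
        i for i, row in enumerate(matrix)
        if not all(math.isfinite(value) for value in row)
    ]


def l2_normalise_rows(matrix):
    """Divide each row by its L2 norm; all-zero rows are returned unchanged."""
    normalised = []
    for row in matrix:
        norm = math.sqrt(sum(value * value for value in row))
        # Skipping the division for zero-norm rows avoids producing NaN.
        normalised.append([value / norm for value in row] if norm > 0 else list(row))
    return normalised
```

If nonfinite_rows reports most or all rows, the losses printed during stage 3 are worth inspecting for divergence before looking at the embedding code itself.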

settings for scJoint

Hi,

Can somebody help me choose the settings in config.py?
I have integrated scRNA-seq data with several differing factors (different technologies, stim and ctrl, different ages). I also have scATAC data, but from a completely different batch of samples.
After running scJoint according to the 10X tutorial notebook, the two datasets are not nicely overlaid in the tSNE plot.
Is it possible to use pre-calculated embedding values for RNA, so that the clustering and plotting are the same as in Seurat and the ATAC cells are placed into this embedding?
Or can I change the resolution of the tSNE somehow?

As I am a newbie to Python and this kind of analysis, I'd appreciate any help.

Thanks!

CUDA error in stage 3

Hi scJoint team,

In our project scJoint ran smoothly in stages 1 and 2, but threw an error in stage 3.
It appears that all of the loss values become extremely large.

Many thanks for your insights!

The output for stage 3 follows:


Training start [Stage3]
num_workers: 0
load npz matrix: /home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/data_hp_wt/wt_hp_rna.npz
load npz matrix: /home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/data_hp_wt/wt_hp_atac.npz
Epoch: 0
LR is set to 0.01
LR is set to 0.01
 [============ 29/29 =========>.]    Step: 157ms | Tot: 4s393ms | encoding_loss: 12.817, rna_loss: 59.222, center_loss: 5.881
Epoch: 1
 [============ 29/29 =========>.]    Step: 156ms | Tot: 4s366ms | encoding_loss: 919.316, rna_loss: 27529.147, center_loss: 120.546
Epoch: 2
 [================ 29/29 =====>.]    Step: 158ms | Tot: 4s339ms | encoding_loss: 11958.841, rna_loss: 204954.188, center_loss: 413.1…
Epoch: 3
 [=================== 29/29 ==>.]    Step: 156ms | Tot: 4s349ms | encoding_loss: 260700.759, rna_loss: 7364164.912, center_loss: 254…
Epoch: 4
 [========================== 29/29   Step: 158ms | Tot: 4s335ms | encoding_loss: 137907357.234, rna_loss: 5798996849.379, center_loss: ….433
Epoch: 5
 [============================>.] 29/29    Step: 156ms | Tot: 4s368ms | encoding_loss: 25037452131.310, rna_loss: 874840569008.552, center_loss: …58529.145
Epoch: 6
 [============================>.] 29/29    Step: … | Tot: 4s324ms | encoding_loss: 104636323585765.516, rna_loss: 3276948089484535.000, center_loss: 34642031.474
Epoch: 7
 [============================>.] 29/29    Step: 157ms | Tot: 4s347ms | encoding_loss: 2187132071060303.500, rna_loss: 73972744484661184.0…, center_loss: 203889508.828
Epoch: 8
 [============================>.] 29/29    Step: 158ms | Tot: 4s…02ms | encoding_loss: 972514208877650816.000, rna_loss: 46906742034446524…, center_loss: 4715636888.276
Epoch: 9
 [============================>.] 29/29    Step: 157ms | Tot: 4s3… | encoding_loss: 361596709780774846464.000, rna_loss: 11707894052183….000, center_loss: 68285345968.552
Epoch: 10
 [============================>.] 29/29    Step: 157ms | Tot: 4s388ms | encoding_loss: 32664000407402539122688.000, rna_loss: 143521479561…2480.000, center_loss: 123240449412.414
Epoch: 11
 [============================>.] 29/29    Step: 159ms | Tot: 4s444ms | encoding_loss: 8485904532940365594886144.000, rna_loss: 3706595062…09562368.000, center_loss: 125000001182.897
Epoch: 12
 [============================>.] 29/29    Step: 160ms | Tot: 4s436ms | encoding_loss: 1186645034441254863890284544.000, rna_loss: 4430101…1753024004096.000, center_loss: 125000002877.793
Epoch: 13
 [============================>.] 29/29    Step: 160ms | Tot: 4s479ms | encoding_loss: …50627509908939357502308352.000, rna_loss: 43719…617924641928249344.000, center_loss: 125000002595.310
Epoch: 14
 [============================>.] 29/29    Step: 159ms | Tot: 4s428ms | encoding_loss: 745…26601367427467389698048.000, rna_loss: 23…674412783426934507831296.000, center_loss: 125000002312.828
Epoch: 15
 [============ 29/29 =========>.]    Step: 159ms | Tot: 4s458ms | encoding_loss: inf, rna_loss: inf, center_loss: nan
Epoch: 16
 […]    Step: 161ms | Tot: 2s89ms | encoding_loss: inf, rna_loss: nan, center_loss: nan
Traceback (most recent call last):
  File "main.py", line 56, in <module>
    main()
  File "main.py", line 44, in main
    model_stage3.train(epoch)
  File "/home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/util/trainingprocess_stage3.py", line 152, in train
    encoding_loss.backward(retain_graph=True)
  File "/home/shao/miniconda3/envs/env_scjoint/lib/python3.8/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/shao/miniconda3/envs/env_scjoint/lib/python3.8/site-packages/torch/autograd/__init__.py", line 97, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: device-side assert triggered

Stage 1 and knn worked fine

Start time:  09:20:49
Training start [Stage1]
num_workers: 0
load npz matrix: /home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/data_hp_wt/wt_hp_rna.npz
load npz matrix: /home/shao/Desktop/Projects/X/10x_ATACseq/scJoint/data_hp_wt/wt_hp_atac.npz
Epoch: 0
LR is set to 0.01
LR is set to 0.01
 [============ 29/29 =========>.]    Step: 152ms | Tot: 4s236ms | encoding_loss: 5.734, rna_loss: 1.521
Epoch: 1
 [============ 29/29 =========>.]    Step: 154ms | Tot: 4s270ms | encoding_loss: 3.825, rna_loss: 0.552
Epoch: 2
 [============ 29/29 =========>.]    Step: 152ms | Tot: 4s250ms | encoding_loss: 3.643, rna_loss: 0.281
Epoch: 3
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s256ms | encoding_loss: 3.586, rna_loss: 0.172
Epoch: 4
 [============ 29/29 =========>.]    Step: 157ms | Tot: 4s297ms | encoding_loss: 3.541, rna_loss: 0.128
Epoch: 5
 [============ 29/29 =========>.]    Step: 154ms | Tot: 4s283ms | encoding_loss: 3.507, rna_loss: 0.091
Epoch: 6
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s290ms | encoding_loss: 3.483, rna_loss: 0.075
Epoch: 7
 [============ 29/29 =========>.]    Step: 153ms | Tot: 4s256ms | encoding_loss: 3.462, rna_loss: 0.057
Epoch: 8
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s303ms | encoding_loss: 3.454, rna_loss: 0.044
Epoch: 9
 [============ 29/29 =========>.]    Step: 156ms | Tot: 4s327ms | encoding_loss: 3.435, rna_loss: 0.038
Epoch: 10
 [============ 29/29 =========>.]    Step: 156ms | Tot: 4s351ms | encoding_loss: 3.426, rna_loss: 0.032
Epoch: 11
 [============ 29/29 =========>.]    Step: 162ms | Tot: 4s337ms | encoding_loss: 3.410, rna_loss: 0.027
Epoch: 12
 [============ 29/29 =========>.]    Step: 160ms | Tot: 4s334ms | encoding_loss: 3.399, rna_loss: 0.026
Epoch: 13
 [============ 29/29 =========>.]    Step: 159ms | Tot: 4s356ms | encoding_loss: 3.382, rna_loss: 0.022
Epoch: 14
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s311ms | encoding_loss: 3.370, rna_loss: 0.019
Epoch: 15
 [============ 29/29 =========>.]    Step: 155ms | Tot: 4s313ms | encoding_loss: 3.363, rna_loss: 0.017
Epoch: 16
 [============ 29/29 =========>.]    Step: 153ms | Tot: 4s284ms | encoding_loss: 3.347, rna_loss: 0.016
Epoch: 17
 [============ 29/29 =========>.]    Step: 152ms | Tot: 4s295ms | encoding_loss: 3.338, rna_loss: 0.015
Epoch: 18
 [============ 29/29 =========>.]    Step: 153ms | Tot: 4s305ms | encoding_loss: 3.339, rna_loss: 0.014
Epoch: 19
 [============ 29/29 =========>.]    Step: 154ms | Tot: 4s319ms | encoding_loss: 3.331, rna_loss: 0.014
Write embeddings
 [============ 26/26 =========>.]    Step: 40ms | Tot: 2s639ms | write embeddings and predictions for db:wt_hp_rna
 [============ 30/30 ==========>]    Step: 48ms | Tot: 3s94ms | write embeddings and predictions for db:wt_hp_atac
Stage 1 finished:  09:22:28
KNN
[KNN] Read RNA data
[KNN] Read ATAC data
[KNN] Build Space
[KNN] knn
KNN finished:  09:22:30