luyitian / sc_mixology

This contains the dataset for comparing scRNA-seq analysis methods

License: MIT License

sc_mixology's Introduction

single cell mixology: single cell RNA-seq benchmarking

Note: the repository has been renamed to sc_mixology. The old link redirects to the current repository.

sc_mixology uses three human lung adenocarcinoma cell lines, HCC827, H1975 and H2228, which were cultured separately and then processed in three different ways. Firstly, single cells from each cell line were mixed in equal proportions, with libraries generated using three different protocols: CEL-seq2, Drop-seq (with Dolomite equipment) and 10X Chromium. Secondly, single cells were sorted from the three cell lines into 384-well plates, with an equal number of cells per well in different combinations (generally 9 cells, but with some 90-cell population controls). Thirdly, RNA was extracted in bulk from each cell line, mixed in 7 different proportions, diluted to single-cell-equivalent amounts ranging from 3.75 pg to 30 pg, and processed using CEL-seq2 and SORT-seq. ERCC spike-in controls were present in samples processed using the two plate-based technologies (CEL-seq2 and SORT-seq).

Raw data from this series of experiments is available under GEO accession number GSE118767. The processed count data obtained from scPipe is stored in R objects that use the SingleCellExperiment class. Below are instructions for getting the count data and metadata (including annotations) for each dataset. All data is post sample quality control, without gene filtering.

Summary of all datasets

Load files into R

You can find the R object files in the data folder:

load("data/sincell_with_class.RData")

This will create three variables: sce10x_qc, sce4_qc, and scedrop_qc_qc. sce10x_qc contains the read counts after quality control processing from the 10x platform. sce4_qc contains the read counts after quality control processing from the CEL-seq2 platform. scedrop_qc_qc contains the read counts after quality control processing from the Drop-seq platform.
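Because load() places objects directly into the workspace, it can be unclear which variables a file created. The sketch below shows one way to check, by loading into a separate environment and inspecting the returned names. It is self-contained: toy_file is a temporary stand-in for data/sincell_with_class.RData, and the three objects are plain matrices rather than the SingleCellExperiment objects shipped in the repository.

```r
# Toy stand-in for data/sincell_with_class.RData (assumption: plain matrices
# are used here instead of the real SingleCellExperiment objects).
sce10x_qc <- matrix(0L, nrow = 2, ncol = 2)
sce4_qc <- matrix(0L, nrow = 2, ncol = 2)
scedrop_qc_qc <- matrix(0L, nrow = 2, ncol = 2)
toy_file <- tempfile(fileext = ".RData")
save(sce10x_qc, sce4_qc, scedrop_qc_qc, file = toy_file)
rm(sce10x_qc, sce4_qc, scedrop_qc_qc)

# load() invisibly returns the names of the objects it restores. Loading into
# a dedicated environment keeps them out of the global workspace.
data_env <- new.env()
loaded_names <- load(toy_file, envir = data_env)
print(loaded_names)

# Pull one object out by name, e.g. the Drop-seq data.
sce_drop <- get("scedrop_qc_qc", envir = data_env)
```

With the real file, the same pattern would be load("data/sincell_with_class.RData", envir = data_env).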

ground truth

The true label is stored in colData(). For single cells, the label is in the column cell_line_demuxlet. For single cell mixtures, the ground truth is the combination of the three cell lines, stored in the columns H1975, H2228 and HCC827, so one can merge them and use the combination as the label, such as paste(sce_SC1_qc$H1975,sce_SC1_qc$H2228,sce_SC1_qc$HCC827,sep="_"). Similarly, the ground truth for the RNA mixture is the proportion of RNA from each cell line, stored in the columns H2228_prop, H1975_prop and HCC827_prop, which can be merged into one column and used as the label, such as paste(sce2_qc$H2228_prop,sce2_qc$H1975_prop,sce2_qc$HCC827_prop,sep="_").
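The label construction described above can be sketched on toy data. The data frames cell_mix and rna_mix are hypothetical stand-ins for as.data.frame(colData(sce_SC1_qc)) and as.data.frame(colData(sce2_qc)); the values are made up for illustration and do not come from the real datasets.

```r
# Hypothetical stand-in for the cell mixture metadata: number of cells
# from each cell line in a well (values are made up).
cell_mix <- data.frame(
  H1975  = c(3, 0, 9),
  H2228  = c(3, 9, 0),
  HCC827 = c(3, 0, 0)
)
# Merge the three columns into one ground-truth label per sample.
cell_mix$group <- paste(cell_mix$H1975, cell_mix$H2228, cell_mix$HCC827, sep = "_")

# Hypothetical stand-in for the RNA mixture metadata: RNA proportions.
rna_mix <- data.frame(
  H2228_prop  = c(0.68, 0.16),
  H1975_prop  = c(0.16, 0.68),
  HCC827_prop = c(0.16, 0.16)
)
rna_mix$group <- paste(rna_mix$H2228_prop, rna_mix$H1975_prop,
                       rna_mix$HCC827_prop, sep = "_")

# A factor of the merged label can serve as the known grouping variable.
cell_group <- factor(cell_mix$group)
print(table(cell_group))
```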

Counts

To access count data from a SingleCellExperiment object, use the counts(sce) function:

counts(sce10x_qc)[1:5, 1:5]

Metadata

To access sample information from a SingleCellExperiment object, use the colData(sce) function:

head(colData(sce10x_qc))

Examples of using these datasets

You can find an R notebook named data_explore_mixture.Rmd in the script/data_QC_visualization folder, which includes code for analysing the cell mixture and RNA mixture datasets.

Scripts for reproducing a broader methods comparison

The script folder contains scripts that can reproduce the analysis and figures from our paper: Benchmarking single cell RNA-sequencing analysis pipelines using mixture control experiments.

Note: The ggtern package, which was used to generate the ternary plots, has known issues with recent versions of ggplot2, and the relevant code may break if you have updated the ggplot2 package.

sc_mixology's Issues

How to perform log transformation?

I can't reproduce the log transformation of the counts in the data. Am I missing a step?

max(log2(c(sce_sc_CELseq2_qc@assays$data@listData$counts)+1))
[1] 11.53819
max(log2(c(sce_sc_CELseq2_qc@assays$data@listData$counts)))
[1] 11.5377
max(c(sce_sc_CELseq2_qc@assays$data@listData$logcounts))
[1] 11.84594
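The thread does not record how logcounts was produced. One possible explanation, offered only as an assumption, is that the counts were scaled by per-cell size factors before the log transform, in which case log2(counts + 1) on the raw counts would not match. The toy sketch below illustrates the effect with simple library-size factors; the matrix and the normalisation choice are hypothetical and are not taken from the repository's scripts.

```r
# Toy count matrix (genes x cells); values are made up.
toy_counts <- matrix(c(90, 10, 150, 150), nrow = 2,
                     dimnames = list(c("geneA", "geneB"), c("cell1", "cell2")))

# Plain log transform of the raw counts.
log_raw <- log2(toy_counts + 1)

# Assumed normalisation: library-size factors centred at 1.
lib_size <- colSums(toy_counts)
size_factor <- lib_size / mean(lib_size)

# Divide each column by its size factor, then log-transform.
log_norm <- log2(sweep(toy_counts, 2, size_factor, "/") + 1)

print(max(log_raw))
print(max(log_norm))
```

In this toy case the maximum of the normalised values exceeds the maximum of the raw log counts, the same direction of difference as reported above.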

Plans to make into ExperimentHub package?

https://www.bioconductor.org/packages/release/bioc/vignettes/ExperimentHub/inst/doc/CreateAnExperimentHubPackage.html

This seems like a great way of both
a. making your data easily accessible to other Bioconductor users
b. easily integrating this data into any workflows you or the team plan to write in the future

Fix encoding(?) of the *_call.R scripts in script/clustering/Clustering_Algorithms

Thank you for organizing the analyses from your paper into such a useful repository.

It seems like something strange happened to the *_call.R files in the script/clustering/Clustering_Algorithms directory?

They can't be viewed via the code browser online, and pulling down this repo locally, it seems like these files are somehow binary encoded or some such ... or?

can't reproduce the results in the paper

I want to test my imputation method following the pipeline in your Nature Methods paper, so I ran the code on the RNAmix data. I first ran the script norm_impu_RNAmix.Rmd to get the data normalized and imputed. Then I ran the code blocks for the RNAmix data in norm_eval.R to reproduce the plots. However, I got results totally different from Figure 2 and Supplementary Figure 4. I changed nothing in your data or code. Could you please help me figure out what happened? Thanks a lot.

Are all the methods tested in the paper limited to the R language?

I noticed that most scripts here are R scripts. Does this mean all the methods benchmarked in the paper are limited to the R language? As we know, there are also many Python-only methods for scRNA-seq data. Thanks!

Loading the data from Python

This is a great benchmarking resource! Are there any plans to make the data inter-operable with the Python eco-system? Ideally, instead of loading R objects, it would be nice to also be able to load loom files or csv files for each dataset.

I have tried using Seurat's conversion tool (https://satijalab.org/seurat/conversion_vignette.html) for SingleCellExperiment -> loom but have not been able to get that to work.
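One possible workaround, sketched here under assumptions rather than as a supported export path, is to write the count matrix and the sample metadata to CSV files from R, which Python tools can then read. The objects toy_counts and toy_meta are hypothetical stand-ins. With the real data they would be as.matrix(counts(sce10x_qc)) and as.data.frame(colData(sce10x_qc)).

```r
# Hypothetical stand-ins for the count matrix and the sample metadata.
toy_counts <- matrix(c(5L, 0L, 2L, 1L, 3L, 0L), nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     c("cell1", "cell2")))
toy_meta <- data.frame(cell_line_demuxlet = c("H1975", "HCC827"),
                       row.names = c("cell1", "cell2"))

# Write genes x cells counts and per-cell metadata as CSV files.
out_dir <- tempdir()
counts_csv <- file.path(out_dir, "toy_counts.csv")
meta_csv <- file.path(out_dir, "toy_metadata.csv")
write.csv(toy_counts, counts_csv)
write.csv(toy_meta, meta_csv)

# Round trip in R to check that nothing was lost.
counts_back <- as.matrix(read.csv(counts_csv, row.names = 1, check.names = FALSE))
meta_back <- read.csv(meta_csv, row.names = 1)
```

CSV is convenient but large for dense matrices. For the bigger datasets a sparse format may be preferable.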

Rdata file contains variables with unclear names

Based on the README and variable names, I'm gathering that sce10x_qc has the 10X counts after QC. There are two other variables as well, sce4_qc and scedrop_qc_qc. I'm guessing the dropseq is scedrop_qc_qc (any reason for the double _qc?). Does that mean that the CEL-seq2 is the sce4_qc variable?

Question regarding 10x UMIs

I've been trying to access the UMI matrix for the 'sce_sc_10x_qc' dataset. However, the documentation seems to only show how to retrieve a count matrix, one that is clearly not de-duplicated given the extremely high number of reads for each cell.

Is the UMI matrix for the 10x data supplied anywhere within the package? And if not, where are the fastqs and/or BAM files we'd need in order to generate a de-duplicated UMI matrix?

Some questions about clustering

Hello, I have recently been using your data for clustering experiments. I found that when I use ARI or NMI as the clustering evaluation index, both can exceed 0.9 for many existing methods, such as CIDR, SIMLR and RaceID. Have you seen this before? Is it because the dataset is of such high quality that the cell groups can be clearly distinguished? Even the traditional k-means clustering method seems to give perfect performance. I would like to hear your opinion. Thanks!
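For reference, the adjusted Rand index (ARI) mentioned above can be computed from the contingency table of true labels against cluster assignments. The function below is a minimal base R sketch of the standard formula. The name adjusted_rand_index and the toy label vectors are illustrative and do not come from the repository.

```r
# Adjusted Rand index from two label vectors of equal length.
# Note: the result is undefined (NaN) when both partitions are trivial,
# e.g. every cell placed in a single group.
adjusted_rand_index <- function(truth, cluster) {
  tab <- table(truth, cluster)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))           # pairs together in both partitions
  sum_rows <- sum(choose(rowSums(tab), 2))   # pairs together in truth
  sum_cols <- sum(choose(colSums(tab), 2))   # pairs together in cluster
  expected <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Toy labels: the cluster ids differ from the truth only by renaming.
truth <- c("H1975", "H1975", "H2228", "H2228", "HCC827", "HCC827")
perfect <- c(2, 2, 1, 1, 3, 3)
imperfect <- c(1, 1, 1, 2, 3, 3)

ari_perfect <- adjusted_rand_index(truth, perfect)
ari_imperfect <- adjusted_rand_index(truth, imperfect)
print(c(ari_perfect, ari_imperfect))
```

An ARI of 1 means the clustering matches the labels exactly up to renaming, and values near 0 indicate agreement no better than chance.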

Number of cells discrepancy

I have a question about the data available on NCBI and the cell cluster labels. I downloaded the GEO data, but noticed that there are fewer cells in the ground truth label files on GitHub than in the GEO single cell dataset. Is there a cell cluster file available that contains more cells?

For example, there are 4001 cells for sc_10x GSM3022245, but only 902 cells in the ground truth label dataset; there are 384 cells for sc_CEL-seq2 GSM3336845, but 274 cells in the ground truth label dataset; and there are 4001 cells for sc_Drop-seq GSM3336849, but 225 cells in the ground truth label dataset.

A query about theoretical total input of spike-ins

Hi,

For the code:

cms_095046 <- read.delim("~/cms_095046.txt", stringsAsFactors = FALSE)
SpikeInfo = data.frame(ERCCID=cms_095046$ERCC.ID, count=cms_095046$concentration.in.Mix.1..attomoles.ul.)
SpikeInfo = SpikeInfo[SpikeInfo$ERCCID %in% rownames(ERCC_dat),]
rownames(SpikeInfo) = SpikeInfo$ERCCID
SpikeInfo[,2] = SpikeInfo[,2]/1000

May I ask whether I can calculate the theoretical total input of spike-ins as sum(SpikeInfo$count)? If so, is it the same for both the CEL-seq2 and SORT-seq datasets?

I would like to divide the total number of observed spike-ins by that number to obtain capture efficiencies which should range between 0 and 1.

Thank you very much!

Best wishes,
Wenhao
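As a unit check on the calculation above: the concentration column is in attomoles per microlitre, so its sum is not a molecule count until it is converted with Avogadro's number and scaled by the dilution and the volume added per sample. The sketch below shows that conversion on toy numbers. The dilution factor, the spike-in volume, the ERCC names and all counts are hypothetical placeholders, not values from the protocols used for these datasets.

```r
# One attomole (1e-18 mol) corresponds to about 602,214 molecules.
avogadro <- 6.02214076e23
molecules_per_attomole <- avogadro * 1e-18

# Hypothetical stock concentrations in attomoles per microlitre.
stock_attomol_per_ul <- c(ERCC_x = 100, ERCC_y = 10, ERCC_z = 1)

# Hypothetical protocol parameters (placeholders, not from the source).
dilution_factor <- 1e5   # fold dilution of the stock before use
volume_ul <- 0.1         # diluted spike-in volume added per sample

# Expected number of spike-in molecules per sample.
expected_molecules <- stock_attomol_per_ul * molecules_per_attomole /
  dilution_factor * volume_ul

# Hypothetical observed UMI counts for the same spike-ins in one sample.
observed_umis <- c(ERCC_x = 11, ERCC_y = 1, ERCC_z = 0)

capture_efficiency <- sum(observed_umis) / sum(expected_molecules)
print(capture_efficiency)
```

Whether the same expected input applies to both plate-based protocols depends on the spike-in dilution and volume each one used, which this thread does not state.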

mitochondrial RNA

Dears,

first of all thank you for the nice datasets!

With regard to the object 'sce_sc_10x_5cl_qc' loaded by load("data/sincell_with_class.RData"), we see that on average about 19% of the transcripts come from mitochondrial genes.

Is this normal (expected mt content) and can we trust the quality of the data?

The reason why I ask: mainstream scRNA-seq workflows typically remove cells with e.g. 5% or 10% mt content. Here such filtering would lead to the removal of most of the cells.
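The per-cell mitochondrial percentage discussed above can be computed as in the following sketch. It assumes gene symbols with an "MT-" prefix as row names, which may not hold for the real objects. If their row names are Ensembl IDs, the pattern would need to be replaced by a list of mitochondrial gene IDs. The toy matrix and the 10% cutoff are for illustration only.

```r
# Toy count matrix (genes x cells) with made-up values.
toy_counts <- matrix(c(10, 5, 20, 30,
                        0, 15, 40, 45),
                     nrow = 4,
                     dimnames = list(c("MT-CO1", "MT-ND1", "ACTB", "GAPDH"),
                                     c("cell1", "cell2")))

# Assumption: mitochondrial genes are identifiable by an "MT-" symbol prefix.
is_mito <- grepl("^MT-", rownames(toy_counts))

# Percentage of each cell's counts that come from mitochondrial genes.
pct_mito <- 100 * colSums(toy_counts[is_mito, , drop = FALSE]) / colSums(toy_counts)

# Example filter at a 10% threshold, one of the cutoffs mentioned above.
keep_cell <- pct_mito <= 10
print(pct_mito)
print(keep_cell)
```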

Is the 10X data 3' or 5'? Which chemistry was used?

... where one part was then processed by the Chromium 10× single cell platform using the manufacturer’s (10x Genomics) protocol.

But for the 10X dataset, I was wondering whether it is 3' or 5' scRNA-seq, and which chemistry was used: V2 or V3?

Thank you.

five cancer cell line 10x data barcode+UMI length

I downloaded the fastq files for SRR8606521 and found that the reads in SRR8606521_1.fastq.gz all start with N, followed by a long run of Ts from the 26th bp onward. I guess bases 1-25 may hold the barcode + UMI, but the barcode + UMI length for V2 chemistry is 26 bp. Could you tell me the correct location of the barcode + UMI in the reads?
Here is an example:
@SRR8606521.1 D00626:354:CCHA5ANXX:8:2209:1336:1975/1
NATGAAAGTACATGTCCACCCTACAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTATTTTTTT
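For illustration, the sketch below splits the start of the example read under the standard 10x V2 layout of a 16 bp cell barcode followed by a 10 bp UMI. That this dataset uses the V2 layout is an assumption, since the thread does not confirm the chemistry. The string read_prefix is the first 40 bases of the read shown above.

```r
# First 40 bases of the example read above.
read_prefix <- "NATGAAAGTACATGTCCACCCTACAATTTTTTTTTTTTTT"

# Assumed 10x V2 layout: bases 1-16 cell barcode, bases 17-26 UMI.
cell_barcode <- substr(read_prefix, 1, 16)
umi <- substr(read_prefix, 17, 26)

# Everything after base 26 should then be the poly-T stretch.
tail_seq <- substr(read_prefix, 27, nchar(read_prefix))

print(cell_barcode)
print(umi)
print(grepl("^T+$", tail_seq))
```

Under this layout the run of Ts in the example begins at base 27, which is consistent with a 26 bp barcode + UMI.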

known cell grouping variable

Dear Luyi Tian,

First, I'm impressed by the great work, and good luck for your publication procedure.

I want to use the three single cell RNA-seq datasets (sc_CEL-seq2, sc_10x, and sc_Drop-seq) to evaluate differential expression (DE) analysis tools, such as MAST, scDD, edgeR, etc. Such tools for DE analysis require the cell grouping variable (aka conditions, treatment, group...) a priori. The purpose of DE analysis is to identify the set of genes that show statistically significant differential expression across the known conditions of interest.

However, after successfully downloading the datasets and their respective cell annotation file, I couldn't find any cell grouping variable (for example the cell line group, sorry if I'm wrong) to be able to apply DE analysis. Perhaps, I have to do this after cell clustering analysis. Right? Therefore, my questions are, is there any known cell grouping variable? or is it possible to know which cell comes from which cell population in the datasets?

Best regards,
Alemu
(Department of Data Analysis and Mathematical Modeling, Ghent University, Belgium)