
scatac-benchmarking's Introduction

scATAC-benchmarking

Recent innovations in single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) enable profiling of the epigenetic landscape of thousands of individual cells. scATAC-seq data analysis presents unique methodological challenges: scATAC-seq experiments sample DNA, which, due to low copy numbers (diploid in humans), leads to inherent data sparsity (1-10% of peaks detected per cell) compared to transcriptomic (scRNA-seq) data (10-45% of expressed genes detected per cell). Such challenges in data generation emphasize the need for informative features to assess cell heterogeneity at the chromatin level.

We present a benchmarking framework that was applied to 10 computational methods for scATAC-seq on 13 synthetic and real datasets from different assays, profiling cell types from diverse tissues and organisms. Methods for processing and featurizing scATAC-seq data were evaluated by their ability to discriminate cell types when combined with common unsupervised clustering approaches. We rank evaluated methods and discuss computational challenges associated with scATAC-seq analysis including inherently sparse data, determination of features, peak calling, the effects of sequencing coverage and noise, and clustering performance. Running times and memory requirements are also discussed.

Single Cell ATAC-seq Benchmarking Framework

Our benchmarking results highlight SnapATAC, cisTopic, and Cusanovich2018 as the top-performing scATAC-seq analysis methods for clustering across all datasets and metrics. Methods that preserve information at the peak level (cisTopic, Cusanovich2018, Scasat) or bin level (SnapATAC) generally outperform those that summarize accessible chromatin regions at the motif/k-mer level (chromVAR, BROCKMAN, SCRAT) or over the gene body (Cicero, Gene Scoring). In addition, methods that implement a dimensionality-reduction step (BROCKMAN, cisTopic, Cusanovich2018, Scasat, SnapATAC) generally show advantages over methods without this step. SnapATAC is the most scalable method; it was the only method capable of processing more than 80,000 cells. Cusanovich2018 is the method that best balances analysis performance and running time.
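
For illustration, agreement between predicted clusters and known cell-type labels can be scored with a metric such as the Adjusted Rand Index; a minimal sketch in R (the toy labels below are hypothetical, not taken from the benchmark):

library(mclust)  # provides adjustedRandIndex()

# Toy example: compare known cell-type labels with predicted clusters.
# ARI is 1 for perfect agreement and near 0 for random assignments.
truth    <- c("B", "B", "T", "T", "Mono", "Mono")
clusters <- c(1, 1, 2, 2, 3, 3)
adjustedRandIndex(truth, clusters)  # returns 1 here, since the partitions match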

All the analyses performed are illustrated in Jupyter Notebooks.

Within each dataset folder, the 'output' folder stores all output files in five sub-folders: 'feature_matrices', 'umap_rds', 'clusters', 'metrics', and 'figures'.
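
For orientation, the expected layout of a dataset folder, sketched from the description above (actual dataset names vary):

<dataset>/
└── output/
    ├── feature_matrices/
    ├── umap_rds/
    ├── clusters/
    ├── metrics/
    └── figures/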

Real Data

Synthetic Data

Extra


Citation: Please cite our paper if you find this benchmarking work helpful to your research. Huidong Chen, Caleb Lareau, Tommaso Andreani, Michael E. Vinyard, Sara P. Garcia, Kendell Clement, Miguel A. Andrade-Navarro, Jason D. Buenrostro & Luca Pinello. Assessment of computational methods for the analysis of single-cell ATAC-seq data. Genome Biology 20, 241 (2019).

Credits: H Chen, C Lareau, T Andreani, ME Vinyard, SP Garcia, K Clement, MA Andrade-Navarro, JD Buenrostro, L Pinello


scatac-benchmarking's Issues

Missing erythropoiesis_data for simulation?

Hello,
I'm trying to run simulate_erythropoesis.ipynb; however, it needs:

library(data.table)  # for fread()
# read the bulk count matrix and the erythroid-only peak set
bulk <- data.matrix(data.frame(fread("./erythropoiesis_data/combined_12.counts.tsv")))
peaks <- fread("./erythropoiesis_data/ery_only.bed")

Where can I find erythropoiesis_data?

Thank you,
Gonzalo

Center the TF-IDF before irlba?

Hi,

Maybe I missed this, but I am confused: why don't you center the TF-IDF matrix before the irlba() function? That function doesn't do centering by default, and as far as I know, you need to center the data before extracting principal components.
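
For context, irlba() supports implicit centering through its 'center' argument, which avoids densifying a sparse matrix; a minimal sketch, assuming a hypothetical cells-by-features TF-IDF matrix named tfidf:

library(irlba)

# 'tfidf' is a hypothetical cells-by-features TF-IDF matrix.
# irlba() does not center by default; passing the column means to
# 'center' subtracts them implicitly, so the factorization matches
# PCA on centered data while the input stays sparse.
svd_res <- irlba(tfidf, nv = 50, center = colMeans(tfidf))
pcs <- svd_res$u %*% diag(svd_res$d)  # cell embeddings (cells x 50)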

Buenrostro2018 raw data paired-end?

Hello!

Maybe this isn't the right place, since you're just hosting the data, but I tried downloading the Buenrostro 2018 raw data from Dropbox, and each of the BED files (after converting the BAMs using bedtools bamtobed) seems to have a strand. Am I right in assuming that the strand information is not right, and each read is paired-end? Or is each read single-end and needs to be matched and extended?

Thanks!

A problem with cisTopic

Hello,
When I use this pipeline, I ran into a problem: when I click cisTopic, it opens the pipeline code for Cicero.

BoneMarrow_noisy_p2 data missing

Hi, thanks for the great benchmarking resources. I tried to download the simulated data, and the BoneMarrow_noisy_p2 data appears to be missing.


Cannot run function 'runJaccard()' in SnapATAC

When I tried to reproduce the results using the code for SnapATAC, it returned errors. Specifically, the error occurred when the function runJaccard() was called in this Jupyter Notebook file. According to the help page of the R package SnapATAC, runJaccard() accepts only 6 arguments, while the notebook passes 9.

Furthermore, when I removed the extra arguments from runJaccard(), I got another error:

Error in (function (cl, name, valueClass) :
‘method’ is not a slot in class “jaccard”
Calls: runJaccard -> runJaccard.default ->
Execution halted

Thank you for your help in fixing this bug!
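
For reference, the pairwise Jaccard similarity that runJaccard() is meant to compute can be reproduced directly; a minimal sketch independent of the SnapATAC API, assuming a hypothetical binary cell-by-bin matrix X with cells as rows:

# Pairwise Jaccard similarity between cells, for illustration only.
jaccard_matrix <- function(X) {
  X <- (as.matrix(X) > 0) * 1          # binarize to 0/1
  inter <- X %*% t(X)                  # pairwise intersection sizes
  n <- rowSums(X)                      # accessible bins per cell
  union <- outer(n, n, "+") - inter    # |A| + |B| - |A intersect B|
  inter / union
}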

output folders

Hi,

Thank you for creating this evaluation framework and accompanying repository.
The README mentions output folders for all real datasets, but these are not present.
Would it be possible to add these or, as with the input folder, provide a link to a Dropbox folder?

Thanks,
Koen

Additional information to run the metrics for 10xpbmc5k

Hi,

I am trying to reproduce your benchmarking to compare epiScanpy's performance against the other methods you have already tested.
So far it has been working very well; however, I am missing one file needed to compute the gene-scoring metric on the 10xpbmc5k dataset. In your notebook it is referenced as ./run_methods/GeneScoring/FM_GeneScoring_10xpbmc5k.tsv.
Maybe it is somewhere obvious, but I haven't found it. Would it be possible to make this file accessible?

Best,
Anna

Docker?

Hi,
I am interested in reproducing your benchmarking, but there are a lot of methods to implement. Do you have a Docker environment with the different methods installed, by any chance?

Best,
Anna
