ataqc's Introduction

ATAqC

This pipeline is designed for collecting advanced quality control metrics for ATAC-seq datasets.

===================================================

Annotation sets:

  • hg19 - Gencode v19
  • hg38 - Gencode v24
  • mm9 - vM1 (though I believe the ENCODE portal no longer supports mm9)
  • mm10 - vM7

The TSS bed files are generated directly from the Gencode full GTF files, with the following command:

zcat $GTF | 
grep -P '\tgene\t' | 
grep 'protein_coding' | 
grep -v 'level 3' | 
awk -F '[\t|\"]' '{ print $1"\t"$4"\t"$5"\t"$10"\t0\t"$7 }' | 
awk -F '\t' 'BEGIN{ OFS="\t" } { if ($6=="+") { $3=$2-1; $2=$2-2 } else { $2=$3; $3=$3+1 } print }' |
sort -k1,1 -k2,2n > $TSS

*Note that the TSS file is a point file, and is not the same as the promoter file (described below).
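For readers who prefer to follow the logic outside of a shell pipeline, the command above can be sketched in Python. This is only an illustrative re-expression, not code from the pipeline: the function name `tss_records` is hypothetical, the coordinate arithmetic is copied from the awk step exactly as written, and the gene name is taken as the first quoted attribute value (which is the gene_id in Gencode GTF files).

```python
import re


def tss_records(gtf_lines):
    """Return sorted BED6 TSS point records from Gencode GTF lines."""
    records = []
    for line in gtf_lines:
        # Same three filters as the grep steps: gene features only,
        # protein-coding only, and annotation level 3 excluded.
        if "\tgene\t" not in line:
            continue
        if "protein_coding" not in line or "level 3" in line:
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, strand = fields[0], fields[6]
        start, end = int(fields[3]), int(fields[4])
        # First quoted attribute value; "." is a hypothetical fallback.
        match = re.search(r'"([^"]*)"', fields[8])
        name = match.group(1) if match else "."
        # Coordinate arithmetic mirrors the second awk step as written.
        if strand == "+":
            bed_start, bed_end = start - 2, start - 1
        else:
            bed_start, bed_end = end, end + 1
        records.append((chrom, bed_start, bed_end, name, 0, strand))
    # Equivalent of `sort -k1,1 -k2,2n`: chromosome, then numeric start.
    return sorted(records, key=lambda r: (r[0], r[1]))
```

Each returned tuple can be written out as one tab-separated line to obtain a file with the same column layout as $TSS.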

===================================================

The promoter/enhancer annotations are a little trickier, and likely should be updated, since they are based on the data that was in the ENCODE portal as of 03/27/2016.

These annotations should be viewed as preliminary and approximate, not as part of the ENCODE encyclopedia. They are good enough for QC, but for deeper analysis please consider carefully how these annotations were derived:

  • hg19 - we made use of the high stringency (-log_pval > 10) Reg2Map promoter and enhancer sets: https://personal.broadinstitute.org/meuleman/reg2map/HoneyBadger2_release/

  • hg38 - the promoter set is the union of ENCODE RAMPAGE peaks; the enhancer set is the remainder of the union of ENCODE open chromatin (i.e. DNase) peak sets after removing blacklist and promoter regions. In other words, an enhancer is any site that was accessible in some ENCODE DNase experiment and was not labeled as a promoter by RAMPAGE data. There is no Reg2Map resource for hg38.

  • mm9 - after getting the union of all mouse mm9 DNase peaks available in the ENCODE portal, the promoter set is those peaks that overlap the TSS file, and the enhancer set is the rest.

  • mm10 - the promoters are the predicted promoters from the ENCODE portal (https://www.encodeproject.org/data/annotations/v3/). After taking the union of all mouse mm10 DNase peaks available in the ENCODE portal, the enhancer set is the remainder after subtracting the promoter regions and the blacklist (a blacklist exists for mm10).

Whenever a blacklist is mentioned, it's the recorded file from the ENCODE portal.
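The "union of peaks, minus promoters and blacklist" construction described for hg38 and mm10 can be sketched with plain interval arithmetic. This is a minimal sketch and not the procedure that was actually run: it assumes half-open (chrom, start, end) intervals and base-pair-level subtraction, whereas the original annotations may have dropped whole peaks instead. The function names `merge` and `subtract` are hypothetical.

```python
def merge(intervals):
    """Union of half-open (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start <= last[2]:
            # Overlapping or touching: extend the previous interval.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def subtract(intervals, excluded):
    """Remove every base covered by `excluded` from the union of `intervals`."""
    excluded = merge(excluded)
    result = []
    for chrom, start, end in merge(intervals):
        pos = start  # left edge of the part not yet emitted
        for ex_chrom, ex_start, ex_end in excluded:
            if ex_chrom != chrom or ex_end <= pos or ex_start >= end:
                continue  # no overlap with the remaining part
            if ex_start > pos:
                result.append((chrom, pos, ex_start))
            pos = max(pos, ex_end)
        if pos < end:
            result.append((chrom, pos, end))
    return result
```

With DNase peaks, promoter regions, and blacklist regions loaded as lists of tuples, an enhancer set in the sense above would be `subtract(dnase_peaks, promoters + blacklist)`.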

===================================================

Known issues

The pipeline is not currently compatible with samtools/1.3 - we are working on this incompatibility

Contributors

  • Daniel Kim - MD/PhD Student, Biomedical Informatics Program, Stanford University
  • Chuan Sheng Foo - PhD Student, Computer Science Dept., Stanford University
  • Anshul Kundaje - Assistant Professor, Dept. of Genetics, Stanford University

ataqc's Issues

ataqc on mm10

Does the QC already work with mm10? In the run_ataqc.sh script I saw a comment saying that so far only hg19 works. When I try to run the pipeline with the QC turned on, I always get the error "Genome file %s is not existing".
Thank you for the support.

Is it necessary to set shift_width when calling bam.array?

Hi,
I noticed that at line 430, when getting the read counts from the BAM file, you set shift_width to center the read on the cut site:

bam_array = bam.array(tss_ext, bins=bins, shift_width = -read_len/2, processes=processes, stranded=True)

But when I checked this parameter in the metaseq documentation, it says:

shift_width : int
Each item from the genomic signal (e.g., reads from a BAM file) will be shifted shift_width bp in the 3’ direction. This can be useful for reconstructing a ChIP-seq profile, using the shift width determined from the peak-caller (e.g., modeled d in MACS). Not available for bigWig.

In my opinion, this use of shift_width assumes single-end sequencing with short reads. But now almost all sequencing is PE 150. So is it necessary to set the parameter?

By the way, I found that setting the parameter has a large influence on the result. With the parameter set, the TSS enrichment is 6; without it, the enrichment is 4, which matches the TSS enrichment calculated by my own R script.
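To make the question concrete, here is a toy model of what a negative shift of half the read length does to a read interval. It is a sketch based only on the documentation text quoted above ("shifted shift_width bp in the 3' direction"), not on metaseq's implementation; the function `shift_3prime` and the example coordinates are hypothetical.

```python
def shift_3prime(start, end, strand, shift_width):
    """Shift a half-open interval shift_width bp toward its 3' end."""
    # On the minus strand the 3' direction is toward smaller coordinates.
    offset = shift_width if strand == "+" else -shift_width
    return start + offset, end + offset


read_len = 50
# Floor division keeps the value an integer under Python 3,
# where -read_len/2 would be a float.
shift_width = -(read_len // 2)

plus_read = shift_3prime(100, 150, "+", shift_width)   # 5' end at 100
minus_read = shift_3prime(100, 150, "-", shift_width)  # 5' end at 150
```

In this model the plus-strand read moves to (75, 125) and the minus-strand read to (125, 175), so each read ends up centered on its own 5' end. With 150 bp reads the same rule moves every read by 75 bp, which is one way the choice of read_len could change the signal counted around the TSS.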

Incompatibility with samtools version 1.3

If I'm not mistaken, the script run_ataqc.py is broken due to a change in the samtools sort command-line interface, notably at lines 522 and 524.

See the samtools 1.3 release notes:

The obsolete samtools sort in.bam out.prefix usage has been removed. If you are still using ‑f, ‑o, or out.prefix, convert to use -T PREFIX and/or -o FILE instead.

It would be nice to document this version constraint. Alternatively, would you accept a patch to update the script?
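As a rough idea of what such a patch might look like, the helper below builds a `samtools sort` argument list in the `-T PREFIX` / `-o FILE` form named in the release note. It is a sketch only: the function name and the way the temporary prefix is derived are assumptions, and it is not taken from lines 522 or 524 of run_ataqc.py. It relies on the fact that the legacy `samtools sort in.bam out.prefix` form wrote its result to `out.prefix.bam`.

```python
def samtools_sort_args(in_bam, out_prefix):
    """Build a `samtools sort` argument list using -T and -o."""
    out_bam = out_prefix + ".bam"      # file the legacy form would have written
    tmp_prefix = out_prefix + ".tmp"   # hypothetical prefix for temporary files
    return ["samtools", "sort", "-T", tmp_prefix, "-o", out_bam, in_bam]
```

The returned list can be passed to `subprocess.check_call` without going through a shell.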
