ChIP-seq-analysis

Resources for ChIP-seq data

ENCODE: Encyclopedia of DNA Elements
ENCODE Factorbook
ChromNet ChIP-seq interactions
paper: Learning the human chromatin network using all ENCODE ChIP-seq datasets
The International Human Epigenome Consortium (IHEC) epigenome data portal
GEO. Sequences are in .sra format; you need to use sratools (the SRA Toolkit) to dump them into fastq.
European Nucleotide Archive. Sequences are available in fastq format.
Databases and software from Shirley Liu's lab at Harvard
Blueprint epigenome
A collection of tools and papers for nucleosome positioning and TF ChIP-seq

Papers on ChIP-seq

Peak calling

Be careful with the peaks you get:
Active promoters give rise to false positive ‘Phantom Peaks’ in ChIP-seq experiments

It is good to have controls for your ChIP-seq experiments. A DNA input control (no antibody is applied) is preferred. An IgG control is also fine, but because it pulls down so little DNA, you might get many duplicated reads due to PCR artifacts.

For cancer cells, an input control can be used to correct for copy-number bias.

A quote from Tao Liu, who developed MACS1/2:

I remember in a PloS One paper last year by Elizabeth G. Wilbanks et al., authors pointed out the best way to sort results in MACS is by -10*log10(pvalue) then fold enrichment. I agree with them. You don't have to worry about FDR too much if your input data are far more than ChIP data. MACS1.4 calculates FDR by swapping samples, so if your input signal has some strong bias somewhere in the genome, your FDR result would be bad. Bad FDR may mean something but it's just secondary.
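The sorting order recommended in the quote above can be sketched in a few lines of Python. This is only an illustration: the `Peak` record and the numbers in it are made up, and the field names are my own, not the column headers of any particular MACS output file.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    # Hypothetical record; field names are assumptions for illustration.
    name: str
    neg10log10_pvalue: float  # the -10*log10(pvalue) score
    fold_enrichment: float


def rank_peaks(peaks):
    """Sort by -10*log10(pvalue) first, then by fold enrichment, both descending."""
    return sorted(
        peaks,
        key=lambda p: (p.neg10log10_pvalue, p.fold_enrichment),
        reverse=True,
    )


# Made-up values, purely for illustration.
peaks = [
    Peak("peak_1", 120.5, 8.2),
    Peak("peak_2", 310.0, 15.1),
    Peak("peak_3", 120.5, 11.7),
]
ranked = rank_peaks(peaks)
print([p.name for p in ranked])  # ['peak_2', 'peak_3', 'peak_1']
```

Fold enrichment only breaks ties here, so two peaks with the same p-value score are ordered by how strongly they are enriched.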

The most popular peak caller by Tao Liu: MACS2. The --broad flag now supports broad peak calling as well.
TF ChIP-seq peak calling using the Irreproducibility Discovery Rate (IDR) framework and many Software Tools Used to Create the ENCODE Resource
SICER for broad histone modification ChIP-seq
HOMER can also be used to call transcription factor ChIP-seq peaks and histone modification ChIP-seq peaks. Different parameters in the same program can produce drastically different sets of peaks, especially for histone modifications with variable enrichment length and gaps between peaks. One needs to justify the parameters used.

An example of different parameters for homer findPeaks:

Binding does not imply functionality

A significant proportion of transcription-factor binding sites may be nonfunctional. A post from Judge Starling.
Several papers have shown that changes in adjacent TF binding correlate poorly with gene expression changes: Extensive Divergence of Transcription Factor Binding in Drosophila Embryos with Highly Conserved Gene Expression
Transcription Factors Bind Thousands of Active and Inactive Regions in the Drosophila Blastoderm

The Functional Consequences of Variation in Transcription Factor Binding

" On average, 14.7% of genes bound by a factor were differentially expressed following the knockdown of that factor, suggesting that most interactions between TF and chromatin do not result in measurable changes in gene expression levels of putative target genes. "

paper A large portion of the ChIP-seq signal does not correspond to true binding
BIDCHIPS: Bias-Decomposition of ChIP-seq Signals
mappability, GC-content and chromatin accessibility affect ChIP-seq read counts.

Gene set enrichment analysis for ChIP-seq peaks

Broad-Enrich
ChIP-Enrich
GREAT predicts functions of cis-regulatory regions.
ENCODE ChIP-seq significance tool. Given a list of genes, co-regulating TFs will be identified.
cscan, similar to the ENCODE significance tool.
CompGO: an R package for comparing and visualizing Gene Ontology enrichment differences between DNA binding experiments
interactive and collaborative HTML5 gene list enrichment analysis tool

Chromatin state Segmentation

ChromHMM from Manolis Kellis at MIT.

In ChromHMM the raw reads are assigned to non-overlapping bins of 200 bps and a sample-specific threshold is used to transform the count data to binary values

Segway from the Hoffman lab. Base pair resolution. Takes longer to run.
epicseg, published in 2015 in Genome Biology. Similar in speed to ChromHMM.
Spectacle: fast chromatin state annotation using spectral learning. Also published in 2015 in Genome Biology.
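To make the ChromHMM binning step quoted above concrete, here is a minimal sketch of assigning reads to non-overlapping 200 bp bins and turning the counts into 0/1 values. This is not ChromHMM code. In particular, the threshold here is a plain number passed in by the caller, which is a simplifying assumption for illustration; the quote only says the threshold is sample-specific.

```python
from collections import Counter

BIN_SIZE = 200  # bin width in bp, as in the description above


def bin_counts(read_positions, bin_size=BIN_SIZE):
    """Count reads per non-overlapping bin; each read is assigned by one position."""
    return Counter(pos // bin_size for pos in read_positions)


def binarize(counts, n_bins, threshold):
    """Return a 0/1 list: 1 where the bin count reaches the threshold.

    The threshold is an assumed, caller-supplied value, not ChromHMM's own rule.
    """
    return [1 if counts.get(i, 0) >= threshold else 0 for i in range(n_bins)]


# Made-up read start positions on a single 1 kb stretch.
reads = [10, 150, 199, 200, 450, 455, 460, 900]
counts = bin_counts(reads)
calls = binarize(counts, n_bins=5, threshold=3)
print(calls)  # [1, 0, 1, 0, 0]
```

Once the data are binarized, each bin carries only presence or absence per mark, which is why the choice of threshold matters so much for the final segmentation.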

Peak annotation

HOMER annotatePeaks
Bioconductor package ChIPseeker by Guangchuan Yu
See an important post by him on 0 or 1 based coordinates.

Most of the software for ChIP annotation doesn't considered this issue when annotating peak (0-based) to transcript (1-based). To my knowledge, only HOMER consider this issue. After I figure this out, I have updated ChIPseeker (version >= 1.4.3) to fix the issue.

Bioconductor package ChIPpeakAnno. There is a bug in this package; I am not sure whether it has been fixed. Also a post from Guangchuan Yu: Bug of R package ChIPpeakAnno.

I used R package ChIPpeakAnno for annotating peaks, and found that it handle the DNA strand in the wrong way. Maybe the developers were from the computer science but not biology background.

There are many other tools; I have listed just three.
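The 0-based versus 1-based issue raised in the post quoted above is easy to get wrong, so here is a small sketch of the conversion. It assumes the usual conventions: peaks in BED style (0-based start, end not included) and transcript positions in GFF/GTF style (1-based, end included). The function names are mine and do not come from ChIPseeker or HOMER.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open interval (BED style) to 1-based, closed."""
    return start + 1, end


def one_based_to_bed(start, end):
    """Convert a 1-based, closed interval (GFF/GTF style) to 0-based, half-open."""
    return start - 1, end


def distance_to_tss(peak_start0, peak_end0, tss_1based):
    """Distance in bp from a BED peak to a 1-based TSS; 0 if the peak covers it."""
    start1, end1 = bed_to_one_based(peak_start0, peak_end0)
    if tss_1based < start1:
        return start1 - tss_1based
    if tss_1based > end1:
        return tss_1based - end1
    return 0


# A BED peak (1000, 1010) covers bases 1001..1010 in 1-based coordinates.
print(bed_to_one_based(1000, 1010))       # (1001, 1010)
print(distance_to_tss(1000, 1010, 1000))  # 1, the TSS is just outside the peak
print(distance_to_tss(1000, 1010, 1001))  # 0, the TSS is the first base of the peak
```

Mixing the two conventions without converting shifts every distance by one base, which is enough to move a peak in or out of a feature at its boundary.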

Differential peak detection

See a post, and another here, describing different tools.

MultiGPS
PePr. It can also call peaks.
histoneHMM
diffReps for histone modifications. Developed by Li Shen's lab at Mount Sinai, which also developed ngs.plot.
DiffBind Bioconductor package. Internally uses RNA-seq tools: edgeR or DESeq. Most likely, I will use this tool.
ChIPComp. Very little tutorial material. It is now on Bioconductor.
csaw Bioconductor package. Tutorial here.
ChromDiff. Also from Manolis Kellis at MIT. As with ChromHMM, the documentation is not that detailed. I will give this a try.
MACS2 can detect differential peaks as well

Motif enrichment

HOMER. It has really detailed documentation. It can also be used to call peaks.
suggestions for finding motifs from histone modification ChIP-seq data from HOMER page:

Since you are looking at a region, you do not necessarily want to center the peak on the specific position with the highest tag density, which may be at the edge of the region. Besides, in the case of histone modifications at enhancers, the highest signal will usually be found on nucleosomes surrounding the center of the enhancer, which is where the functional sequences and transcription factor binding sites reside. Consider H3K4me marks surrounding distal PU.1 transcription factor peaks. Typically, adding the -center option moves peaks further away from the functional sequence in these scenarios.

MEME suite. It is probably the most widely used motif-finding tool in published papers.
JASPAR database
pScan-ChIP
MotifMap
RSAT: Regulatory Sequence Analysis Tools.
ENCODE TF motif database
oPOSSUM is a web-based system for the detection of over-represented conserved transcription factor binding sites and binding site combinations in sets of genes or sequences.
My post on how to get a genome-wide motif bed file
Many other tools here
A review of ensemble methods for de novo motif discovery in ChIP-Seq data
melina2. If you only have one sequence and want to know what TFs might bind there, this is a very useful tool.
STEME. A python library for motif analysis. STEME started life as an approximation to the Expectation-Maximisation algorithm for the type of model used in motif finders such as MEME. STEME’s EM approximation runs an order of magnitude more quickly than the MEME implementation for typical parameter settings. STEME has now developed into a fully-fledged motif finder in its own right.
CENTIPEDE: Transcription factor footprinting and binding site prediction. Tutorial
msCentipede: Modeling Heterogeneity across Genomic Sites and Replicates Improves Accuracy in the Inference of Transcription Factor Binding

Super-enhancer identification

The fancy "super-enhancer" term was first introduced by Richard Young at the Whitehead Institute. Basically, super-enhancers are clusters of enhancers that span large genomic regions (enhancers within ~12.5 kb of each other are stitched together). The concept of the super-enhancer is not new. One of the most famous examples is the Locus Control Region (LCR) that controls globin gene expression, and this has been known for decades.

A review in Nature Genetics What are super-enhancers?

From the HOMER page How finding super enhancers works:

Super enhancer discovery in HOMER emulates the original strategy used by the Young lab. First, peaks are found just like any other ChIP-Seq data set. Then, peaks found within a given distance are 'stitched' together into larger regions (by default this is set at 12.5 kb). The super enhancer signal of each of these regions is then determined by the total normalized number reads minus the number of normalized reads in the input. These regions are then sorted by their score, normalized to the highest score and the number of putative enhancer regions, and then super enhancers are identified as regions past the point where the slope is greater than 1.

Example of a super enhancer plot:

In the plot above, all of the peaks past 0.95 or so would be considered "super enhancers", while the one's below would be "typical" enhancers. If the slope threshold of 1 seems arbitrary to you, well... it is! This part is probably the 'weakest link' in the super enhancer definition. However, the concept is still very useful. Please keep in mind that most enhancers probably fall on a continuum between typical and super enhancer status, so don't bother fighting over the precise number of super enhancers in a given sample and instead look for useful trends in the data.
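The strategy described in the HOMER quote above can be sketched as follows. This is my own simplified reading of that description, not HOMER or ROSE code: peaks are tuples with made-up, already normalized ChIP and input signals, stitching merges neighbours on the same chromosome within 12.5 kb, and the cutoff is the first step on the ranked curve whose slope exceeds 1.

```python
STITCH_DISTANCE = 12_500  # default stitching distance mentioned above, in bp


def stitch(peaks, max_gap=STITCH_DISTANCE):
    """Merge peaks on the same chromosome whose gap is at most max_gap.

    peaks: iterable of (chrom, start, end, chip_signal, input_signal).
    """
    stitched = []
    for chrom, start, end, chip, inp in sorted(peaks):
        last = stitched[-1] if stitched else None
        if last and last[0] == chrom and start - last[2] <= max_gap:
            stitched[-1] = (chrom, last[1], max(last[2], end),
                            last[3] + chip, last[4] + inp)
        else:
            stitched.append((chrom, start, end, chip, inp))
    return stitched


def call_super_enhancers(regions):
    """Rank regions by (ChIP - input) and return those past the slope > 1 point."""
    scored = sorted((chip - inp, (chrom, start, end))
                    for chrom, start, end, chip, inp in regions)
    n = len(scored)
    if n < 2 or scored[-1][0] <= 0:
        return []
    top = scored[-1][0]
    ys = [score / top for score, _ in scored]  # scores normalized to the highest
    for i in range(1, n):
        # x is the rank normalized to the number of regions, so each step is 1/n.
        slope = (ys[i] - ys[i - 1]) * n
        if slope > 1:
            return [region for _, region in scored[i:]]
    return []


# Made-up peaks: the first two chr1 peaks are 8 kb apart and get stitched.
peaks = [
    ("chr1", 1000, 2000, 10.0, 1.0),
    ("chr1", 10000, 11000, 5.0, 1.0),
    ("chr1", 50000, 51000, 3.0, 0.0),
    ("chr2", 1000, 2000, 4.0, 1.0),
]
stitched = stitch(peaks)
supers = call_super_enhancers(stitched)
print(supers)  # [('chr1', 1000, 11000)]
```

With real data there are thousands of regions, so the curve is much smoother than in this toy example, but the cutoff stays as arbitrary as the quote admits.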

Using ROSE from Young lab
ROSE: RANK ORDERING OF SUPER-ENHANCERS

Bedgraph, bigwig manipulation tools

WiggleTools
bigwig tool
samtools
bedtools, my all-time favorite tool, from Aaron Quinlan's lab. Great documentation!
Hosting bigWig for UCSC visualization
My first play with GRO-seq data, from sam to bedgraph for visualization
convert bam file to bigwig file and visualize in UCSC genome browser in a Box (GBiB)

Peaks overlapping significance test

The genomic association tester (GAT)
poverlap from Brent Pedersen. He is now working with Aaron Quinlan at the University of Utah.
Genometric Correlation (GenometriCorr): an R package for spatial correlation of genome-wide interval datasets
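The general idea behind an overlap significance test can be illustrated with a naive permutation test. This is a toy sketch under strong assumptions (a single chromosome, uniform random placement, no mappability or GC correction) and is not how GAT, poverlap or GenometriCorr are implemented; all names below are my own.

```python
import random


def count_overlaps(a, b):
    """Number of intervals in a that overlap at least one interval in b.

    Intervals are (start, end) tuples, 0-based and half-open.
    """
    return sum(any(s < e2 and s2 < e for s2, e2 in b) for s, e in a)


def permutation_pvalue(a, b, genome_length, n_perm=1000, seed=0):
    """Empirical p-value for seeing at least the observed overlap by chance."""
    rng = random.Random(seed)
    observed = count_overlaps(a, b)
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for s, e in a:
            length = e - s
            # Place each interval uniformly at random, keeping its length.
            new_start = rng.randrange(0, genome_length - length + 1)
            shuffled.append((new_start, new_start + length))
        if count_overlaps(shuffled, b) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Made-up peak sets on a 1 Mb "genome".
set_a = [(100, 200), (1000, 1100), (5000, 5100)]
set_b = [(150, 250), (1050, 1150), (5050, 5150)]
observed, pvalue = permutation_pvalue(set_a, set_b, genome_length=1_000_000)
print(observed, pvalue)
```

Uniform shuffling is usually too permissive for real genomes, which is the main reason to use the dedicated tools listed above: they let you restrict the background to a sensible workspace.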

RNA-seq data integration

BETA from Shirley Liu's lab at Harvard (Tao Liu's previous lab).

Heatmap, meta-plot

Many papers draw meta-plots and heatmaps over certain genomic regions (2 kb around the TSS, gene body, etc.) using ChIP-seq data.

See an example from the ngs.plot:

Tools

deepTools. It can do many other things and has good documentation. It can also generate heatmaps, but I personally use ngs.plot (developed at Mount Sinai), which is easy to use.
You can also draw heatmaps using R. Just count the ChIP-seq reads in each bin (using either HOMER or bedtools) and draw them with the heatmap.2 function. See here and here. Those are pretty old blog posts of mine; I now have a much better idea of how to make those graphs from scratch.
You can also use the Bioconductor package genomation. It is very versatile.
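The counting step behind such a heatmap (reads per bin around each TSS) can be sketched in plain Python. This is a minimal illustration, not what ngs.plot, deepTools or HOMER actually do: it assumes a single chromosome, ignores strand, and uses window and bin sizes I picked for the example.

```python
from bisect import bisect_left


def signal_matrix(read_positions, tss_positions, flank=2000, bin_size=100):
    """Count reads in fixed-size bins spanning [tss - flank, tss + flank).

    Returns one row per TSS and one column per bin.
    Assumes all positions are on the same chromosome and ignores strand.
    """
    reads = sorted(read_positions)
    n_bins = 2 * flank // bin_size
    matrix = []
    for tss in tss_positions:
        row = []
        for b in range(n_bins):
            lo = tss - flank + b * bin_size
            hi = lo + bin_size
            # Reads with lo <= position < hi, found by binary search.
            row.append(bisect_left(reads, hi) - bisect_left(reads, lo))
        matrix.append(row)
    return matrix


# Made-up reads around a single TSS at position 5000.
matrix = signal_matrix([4950, 5000, 5010, 5150], [5000], flank=200, bin_size=100)
print(matrix)  # [[0, 1, 2, 1]]
```

Each row of the matrix becomes one row of the heatmap, and the column means give the meta-plot.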

One caveat is that the meta-plot (on the left) is an average view of ChIP-seq tag enrichment and may not reflect the real biological meaning for individual cases.

See a post from Lior Pachter: How to average genome-wide data

I replied to the post:

for ChIP-seq, in addition to the average plot, a heatmap that with each region in each row should make it more clear to compare (although not quantitatively). a box-plot (or a histogram) is better in this case . I am really uncomfortable averaging the signal, as a single value (mean) is not a good description of the distribution.

By Meromit Singer:

thanks for the paper ref! Indeed, an additional important issue with averaging is that one could be looking at the aggregation of several (possibly very distinct) clusters. Another thing we should all keep in mind if we choose to make such plots..
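The point made in both replies above, that an average can hide distinct clusters, shows up even in a tiny made-up example. The two "clusters" below are invented signal rows (one with signal only in the first bin, one with signal only in the last bin); nothing here comes from real data.

```python
from statistics import mean


def meta_profile(matrix):
    """Column-wise mean of a regions-by-bins signal matrix (the meta-plot)."""
    return [mean(column) for column in zip(*matrix)]


# Two invented clusters of regions with opposite signal shapes.
cluster_a = [[8, 0, 0, 0] for _ in range(3)]
cluster_b = [[0, 0, 0, 8] for _ in range(3)]
matrix = cluster_a + cluster_b

profile = meta_profile(matrix)
print(profile)  # [4, 0, 0, 4]

# No individual region looks like the average profile.
print(any(row == profile for row in matrix))  # False
```

The average suggests a symmetric two-sided signal at every region, while a heatmap with one row per region would immediately show two groups.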

A paper from Genome Research Ubiquitous heterogeneity and asymmetry of the chromatin environment at regulatory elements

Enhancer databases

FANTOM project CAGE for promoters and enhancers.
DENdb: database of integrated human enhancers
VISTA enhancer browser

Enhancer target prediction

Allele-specific analysis

SNP effects on TF binding

RegulomeDB. Use RegulomeDB to identify DNA features and regulatory elements in non-coding regions of the human genome by entering dbSNP IDs, chromosome regions or single nucleotides.
motifbreakR: a package for predicting the disruptiveness of single nucleotide polymorphisms on transcription factor binding sites.
Whole Genome Regulatory Variant Evaluation for Transcription Factor Binding

Integration of different data sets

methylPipe and compEpiTools: a suite of R packages for the integrative analysis of epigenomics data
