
ChIPSeQ


Here we describe the different steps involved in the analysis of ChIP-seq data in the laboratory.

The goal of this tutorial is to explain how the raw data were processed in the team (tools and workflow) and to help reproduce the analyses.


  1. Set up Environment
  2. Set up directories structure and config file
  3. Launch Main pipeline
  4. MultiQC (marking duplicates and plotting the number of mapped reads)
  5. ChIPQC
  6. Differential binding and peak annotation
  7. ChIP-seq & splicing

Set up Environment


First, check that the following tools are installed on your server/computer.

deepTools : tools to handle deep-sequencing files here

MACS2 : peak caller here

wigToBigWig : included in the UCSC Kent tools suite here

It is advised to install Conda: it makes installing tools (Snakemake, for example) much easier.

Conda : here

Snakemake : workflow-management framework for pipelines, in Python here

WiggleTools : optional; useful when you want to merge bigWig files and apply basic numerical operations here

SPP : needs to be installed, but you will not use it directly here

MaSC : needs to be installed, but you will not use it directly here
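With Conda installed, the command-line tools above can be captured in a single environment file. This is only a sketch: the package names below are the usual bioconda/conda-forge names, but verify them (and pin versions) before use.

```yaml
# environment.yml — hypothetical sketch; verify package names on bioconda
name: chipseq
channels:
  - conda-forge
  - bioconda
dependencies:
  - deeptools
  - macs2
  - snakemake
  - wiggletools
  - ucsc-wigtobigwig   # provides the wigToBigWig binary
```

Create the environment with `conda env create -f environment.yml` and activate it with `conda activate chipseq`.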

R 3.3.1 : a lot of packages are used ... It is advised to read their docs.

ChIPQC : quality control for peaks. The object it creates is then used by DiffBind for differential binding here

DiffBind : differential binding analysis. Which peaks change between my samples? here

ChIPpeakAnno : annotation of the peaks in a genomic context here

ChIPseeker : used for a derived UpSet plot counting how many peaks are annotated in each genomic feature. here

rtracklayer : very useful to import/export BED files to/from GRanges objects. here

ComplexHeatmap : plots heatmaps. here

Other packages are mandatory: biomaRt, GenomicFeatures, stringr, reshape, ggplot2, optparse, knitr ... Look inside the R scripts to know what to install.

Set up directories structure and config file


The analysis is based on a Snakemake pipeline provided by Raoul Raffel at IGH, Montpellier.

The directory tree must be strictly the same.

There is one directory per condition; the several histone marks are inside these directories.

Inside this directory, you will set up directories for the input raw data and for each replicate of the histone marks, as follows : ChipSeqDir

As you can see in the picture, you also have in this directory the script and its config file.

In the replicate directories only (not input), place your raw data in the raw_data directory. Raw data files must be named as follows: file.fq.gz.

NB1 : If you received several files for one replicate, merge them into one with a command line such as:

	cat file1.gz file2.gz > Luco22.fq.gz
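Gzip files can be concatenated directly and the result still decompresses as one stream, so you can check the merge with zcat. A small self-contained illustration (the file names are made up; a FASTQ record is 4 lines, so reads = lines / 4):

```shell
cd "$(mktemp -d)"                                 # scratch directory for the demo
printf '@r1\nACGT\n+\nFFFF\n' | gzip > file1.gz   # tiny fake FASTQ, 1 read
printf '@r2\nTTTT\n+\nFFFF\n' | gzip > file2.gz   # second part, 1 read
cat file1.gz file2.gz > Luco22.fq.gz              # same merge as above
echo "$(( $(zcat Luco22.fq.gz | wc -l) / 4 )) reads"   # prints: 2 reads
```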

Then, configure your YAML configuration file. See the provided example.

You need to change file paths and file names according to your data & environment.

NB2 : The script contains a hard-coded path: you need to manually change "--tempdir /home/jean-philippe.villemin/tmp_deeptools/" in the code. Create a directory wherever you want and update the path in the code accordingly. It is used by deepTools to write temporary files.
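For example, the path swap can be done with sed. The Snakefile content and directory below are illustrative stand-ins, not the real pipeline file:

```shell
cd "$(mktemp -d)"
# stand-in for the real Snakefile containing the hard-coded --tempdir
echo 'shell: "--tempdir /home/jean-philippe.villemin/tmp_deeptools/"' > pipeline.py
mkdir -p "$PWD/my_tmp_deeptools"       # your own temporary directory
# point --tempdir at it (GNU sed; '#' as delimiter because the paths contain '/')
sed -i "s#/home/jean-philippe.villemin/tmp_deeptools/#$PWD/my_tmp_deeptools/#" pipeline.py
grep -c my_tmp_deeptools pipeline.py   # prints: 1
```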

NB3 : You need to download the R scripts into the dependencies dir from GitHub, and set up their paths in the config file. The same has to be done for the BED file with exons used for heatmaps and other QC; these exons were randomly selected, and you can provide your own list.

Launch Main Pipeline

The main pipeline performs all alignments of the FASTQ files. It also produces FastQC reports and wig files for visualisation, and finally the peak calling files. Other outputs are generated but are less important for the rest of the tutorial.

In the directory of your condition of interest, run the following. (Use --dryrun, -n first to check that there are no errors in the outputs.)

Dry Run :

	snakemake -s pipeline_chip-seq.Snakefile.py -j 16 -k -n --verbose -p

Real run (nohup makes the script run in the background; note the ampersand at the end):

	nohup snakemake -s pipeline_chip-seq.Snakefile.py -j 16 -k --verbose -p &

Monitor with htop -u username and look into the nohup.out file to see whether the job has finished.
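A simple way to check completion from the log: Snakemake's final progress line reads "N of N steps (100%) done". Below, nohup.out is simulated so the snippet is self-contained:

```shell
cd "$(mktemp -d)"
echo "16 of 16 steps (100%) done" > nohup.out    # simulated final Snakemake log line
grep -qE '\(100%\) done' nohup.out && echo "pipeline finished"   # prints: pipeline finished
```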

NB4 : You can relaunch the script if something went wrong; it will only re-execute the parts that failed. FastQC will be run again (even though the FastQC analysis already completed successfully) because of a bug I did not manage to fix.

At the end of this part, you should have BAM files, bigWig files and the peak calling files.


MultiQC

Using all the FastQC outputs, you can produce a MultiQC report, which summarises all the FastQC reports. Before doing that, you can do a few things to get information about duplicates.
You can mark duplicates in all the generated BAMs using Picard tools.
You can then retrieve the values needed to plot bar graphs of the number of mapped reads, the number of reads without duplicates, etc. See the graph below.

# Mark duplicates
find /pathTo/condition -name "*.sorted.bam" | xargs -I{} bash -c 'java -jar /home/jean-philippe.villemin/bin/picard.2-6/picard.2-6.jar MarkDuplicates I=$0 O=${0/.sorted.bam/.sorted.marked.bam} M=${0/.bam/.marked_dup_metrics.txt} VALIDATION_STRINGENCY=LENIENT' {}

# Flagstat for all marked BAMs
find /pathTo/condition -name "*.sorted.marked.bam" | xargs -I{} bash -c 'samtools flagstat $0 > ${0/.sorted.marked.bam/.sorted.marked.bash2.txt}' {}

# Concatenate all flagstat outputs into one file
find /pathTo/condition -name "*.sorted.marked.bash2.txt" -print -exec cat {} \; > /pathTo/condition/all.flagstat.bash2.txt

# Remove duplicates
find /pathTo/condition -name "*.sorted.bam" | xargs -I{} bash -c 'java -jar /home/jean-philippe.villemin/bin/picard.2-6/picard.2-6.jar MarkDuplicates I=$0 O=${0/.sorted.bam/.sorted.removed.marked.bam} M=${0/.bam/.sorted.removed.marked_dup_metrics.txt} REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT' {}

# Flagstat for the deduplicated BAMs
find /pathTo/condition -name "*.sorted.removed.marked.bam" | xargs -I{} bash -c 'samtools flagstat $0 > ${0/.bam/.sorted.removed.marked.flagstat.bash2.txt}' {}

# Concatenate all flagstat outputs into one file, printing each file name before its block
find /pathTo/condition -name "*.sorted.removed.marked.flagstat.bash2.txt" -print -exec cat {} \; > /pathTo/condition/all.sorted.removed.marked.flagstat.bash2.txt

# Move all fastqc .zip and .html files for use with MultiQC
find /pathTo -name "*.html" | xargs -I{} bash -c 'mv $0 /pathTo/FASTQC/$(basename $0)' {}
find /pathTo -name "*.zip" | xargs -I{} bash -c 'mv $0 /pathTo/FASTQC/$(basename $0)' {}
cd /pathTo/FASTQC/
multiqc .
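A note on the ${0/.../...} constructs above: bash -c receives the file name found by find as $0, and the ${var/pattern/replacement} expansion rewrites it to build the matching output name. In isolation:

```shell
f=/pathTo/condition/UNT_2.sorted.bam        # example input file name
echo "${f/.sorted.bam/.sorted.marked.bam}"  # prints: /pathTo/condition/UNT_2.sorted.marked.bam
```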

(figure: Quality)

To plot this graph, extract the interesting values from the flagstat outputs, build the following matrix, and save it as a CSV file:

Sample TOTAL DUP MAP TOTAL_WITHOUT_DUP MAP_WITHOUT_DUP
INPUT_UNT 77850539 12625556 75224143 65224983 62598587
UNT_2 67199665 49586720 64413504 17612945 14826784
T1_1 85399142 35307778 83406602 50091364 48098824
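As a quick sanity check of such a matrix, you can compute the duplication rate (DUP/TOTAL) per sample with awk before plotting. The file name below is hypothetical:

```shell
cd "$(mktemp -d)"
cat > flagstat_matrix.csv <<'EOF'
Sample TOTAL DUP MAP TOTAL_WITHOUT_DUP MAP_WITHOUT_DUP
INPUT_UNT 77850539 12625556 75224143 65224983 62598587
UNT_2 67199665 49586720 64413504 17612945 14826784
EOF
# column 3 / column 2 = fraction of duplicated reads
awk 'NR > 1 { printf "%s %.1f%%\n", $1, 100 * $3 / $2 }' flagstat_matrix.csv
# prints:
# INPUT_UNT 16.2%
# UNT_2 73.8%
```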

Then run :

Rscript multibarplot.R --file=PathToTheMatrixFile

You can also use this script to plot the number of raw ChIP-seq peaks, or the number of differentially bound histone marks (modify the input file accordingly).

ChIPQC

First, create a sample sheet containing all the paths to your samples (see the ChIPQC and DiffBind docs).

(figure: Quality)

You will use this file to create an R object, saved on disk, that will be used by the next step. This step creates a first report with quality controls.

You can use the consensus methodology, where ChIPQC is run on intervals of a consensus length around each peak summit. With this approach, the input signal is also plotted for the interval you define around the peak summits with option -s.
Option -c says you are using the consensus methodology.
Option -n sets the name of the saved R object.

Rscript chipQC.R --file /pathTO/samplesChipMarks.csv -n ALL_MarkedDup -s 500 -c

(figure: Quality)

Differential Binding and Peak Annotation

Here you test for variation of binding between conditions, using the R object (named ALL_MarkedDup in this example) output by the previous step.
Option -n gives the number of conditions.
Option -m says which marks from your samples you want to analyze. Option -p sets the minimum fraction of samples a peak must be found in to be retrieved: minOverlap (a peakset parameter in DiffBind) only includes peaks present in at least this many peaksets in the main binding matrix; if minOverlap is between zero and one, a peak is included if it occurs in at least this proportion of peaksets.
Do this for each mark you want to analyse in your sample sheet.
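To make the -p option concrete: with -p 0.99 and, say, 4 peaksets for a mark, the threshold is 0.99 × 4 = 3.96 peaksets, so in practice a peak must be present in all 4 of them. (The exact rounding behaviour is DiffBind's; check its documentation. The numbers here are illustrative.)

```shell
# threshold = proportion * number of peaksets
awk 'BEGIN { n = 4; p = 0.99; printf "threshold = %.2f of %d peaksets\n", n * p, n }'
# prints: threshold = 3.96 of 4 peaksets
```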

Rscript chipDiffBind.R --file=/pathTO/ALL_MarkedDup -n 3 -m K27AC -p 0.99

NB : Don't use the .Rdata extension

Description of output : TODO

ChIP-seq & Splicing

Use merge.splicing.fromRMATS.sh on each event type to create a cleaned & merged list for each type of splicing event.

merge.splicing.fromRMATS.sh SE
merge.splicing.fromRMATS.sh RI
merge.splicing.fromRMATS.sh MXE
merge.splicing.fromRMATS.sh A5SS 
merge.splicing.fromRMATS.sh A3SS 

Then apply the following script to these cleaned BED lists; it counts how many different events you have per condition.

Rscript splicing.R --file1=RMATS/SE/EARLY.bed --file2=RMATS/MXE/EARLY.bed --file3=RMATS/RI/EARLY.bed --file4=RMATS/A5SS/EARLY.bed --file5=RMATS/A3SS/EARLY.bed

Rscript splicing.R --file1=RMATS/SE/LATE.bed --file2=RMATS/MXE/LATE.bed --file3=RMATS/RI/LATE.bed --file4=RMATS/A5SS/LATE.bed --file5=RMATS/A3SS/LATE.bed

Finally, you intersect the ChIP-seq peaks with the splicing events: histone marks differentially bound in the EARLY condition with splicing events detected in EARLY (and, respectively, marks differentially bound in LATE with events detected in LATE):

bedtools intersect -wo -a CHIPSEQ_2017_1_ALL/ChIPQC/${MARK}/dba_EVERYTHING_MarkedDupConsensus500.${MARK}.${TIMEPOINT}.bed -b workspace/beds/exon/met/${EVENT}/*${STATUS}.bed -filenames > data/workspace/beds/results/clean/${MARK}.${TIMEPOINT}.hg38.vs.ExonEMT.${EVENT}.${STATUS}.bed

where the parameters can take the following values:

${STATUS}=LATE,EARLY
${MARK}=Name of your mark (K27ME3,K4ME3...)
${TIMEPOINT}=T1_vs_UnT7,T7_vs_UnT7
${EVENT}=SE,RI,A3SS,A5SS
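Rather than editing the command by hand for every combination, you can loop over the parameter values. The echo only prints the commands (remove it to actually run them); the full bedtools command is abbreviated here for readability:

```shell
for MARK in K27ME3 K4ME3; do
  for TIMEPOINT in T1_vs_UnT7 T7_vs_UnT7; do
    for EVENT in SE RI A3SS A5SS; do
      for STATUS in LATE EARLY; do
        echo "bedtools intersect ... ${MARK} ${TIMEPOINT} ${EVENT} ${STATUS}"
      done
    done
  done
done | wc -l   # 2 * 2 * 4 * 2 = 32 commands
```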

Once you have done this for all the events, conditions and marks, you can merge them into one file, keeping on each line the name of the file it was extracted from. That way you can easily retrieve the condition, the event and the name of your mark.

find . -type f | xargs -i echo {}|sed -r 's#(.\/)(.*)#cat &\|sed  "s:^: \2\t:g"#ge' > ../genes.tsv
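The one-liner above is hard to read (it uses GNU sed's "e" flag to execute a generated cat|sed command per file). A more conventional equivalent, shown here on two hypothetical overlap files, prepends each file's name to its own lines before concatenation:

```shell
cd "$(mktemp -d)"
printf 'geneA\n' > K27AC.LATE.SE.bed    # hypothetical overlap files
printf 'geneB\n' > K4ME1.EARLY.RI.bed
for f in *.bed; do
  sed "s:^:${f}\t:" "$f"                # prefix every line with its file name (GNU sed)
done > genes.tsv
sort genes.tsv
# prints (tab-separated):
# K27AC.LATE.SE.bed	geneA
# K4ME1.EARLY.RI.bed	geneB
```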

You can count how many elements you have per condition and type of splicing event.

Rscript countOverlap --file=BigOverlapFile.csv

You can reformat the output and then plot a histogram using multibarplot.R (you need to modify hard-coded values, because I used this script several times with a bunch of different inputs).
The input file should look like this:

Sample A3SS A5SS MXE RI SE
K27AC 77 1256 75224143 683 627
K4ME1 675 4958 644 1765 1484
K79ME2 8532 38 834 5004 4804

You can also split the file again into several files, one per histone mark.

find . -type f -name "*K36ME3*" | xargs -i echo {}|sed -r 's#(.\/)(.*)#cat &\|sed  "s:^: \2\t:g"#ge' > ../genes_K36ME3.tsv
find . -type f -name "*K4ME3*" | xargs -i echo {}|sed -r 's#(.\/)(.*)#cat &\|sed  "s:^: \2\t:g"#ge' > ../genes_K4ME3.tsv
find . -type f -name "*K4ME1*" | xargs -i echo {}|sed -r 's#(.\/)(.*)#cat &\|sed  "s:^: \2\t:g"#ge' > ../genes_K4ME1.tsv
find . -type f -name "*K20ME1*" | xargs -i echo {}|sed -r 's#(.\/)(.*)#cat &\|sed  "s:^: \2\t:g"#ge' > ../genes_K20ME1.tsv
find . -type f -name "*K79ME2*" | xargs -i echo {}|sed -r 's#(.\/)(.*)#cat &\|sed  "s:^: \2\t:g"#ge' > ../genes_K79ME2.tsv
find . -type f -name "*K27AC*" | xargs -i echo {}|sed -r 's#(.\/)(.*)#cat &\|sed  "s:^: \2\t:g"#ge' > ../genes_K27AC.tsv
find . -type f -name "*K27ME3*" | xargs -i echo {}|sed -r 's#(.\/)(.*)#cat &\|sed  "s:^: \2\t:g"#ge' > ../genes_K27ME3.tsv

This is useful if you want to plot, per histone mark, a scatterplot of dPSI versus peak fold change, as follows:

Rscript scatter.R --file=input.csv

(figure: Quality)

