
chip-seq_preprocess's Introduction

Pipeline for ChIP-seq preprocessing

Overview

Here is the current pipeline used for ChIP-seq preprocessing, which includes the following steps:

  • Align the FASTQ data to the reference genome with Bowtie2.
  • Run FastQC to check sequencing quality.
  • Remove duplicate reads from the aligned data.
  • Generate TDF files for browsing in IGV.
  • Run PhantomPeak to check the quality of the ChIP.
  • Run diffRepeats on the multi-mapped and unmapped reads.
  • Run ngs.plot to investigate the enrichment of the ChIP-seq data at TSS, TES, and gene body (only implemented in the local version, not the LSF cluster version).

The pipeline workflow is:

(workflow diagram)

Requirement

The software used in this pipeline includes Bowtie2, FastQC, samtools, PhantomPeak, diffRepeats, and ngs.plot; Python with the ruffus package is needed to run pipeline.py.

Install the software above and make sure the executables are in $PATH.

Installation

Put the scripts in ./bin somewhere in $PATH, or add ./bin to $PATH.
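For example, assuming the repository was cloned to a hypothetical location such as ~/tools/chip-seq_preprocess, adding its bin directory to $PATH could look like this:

export PATH="$HOME/tools/chip-seq_preprocess/bin:$PATH"   # hypothetical clone location; adjust to yours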

Usage

Update the config.yaml file to set the configuration required for your project.

Then execute:

python pipeline.py config.yaml

Or on an LSF cluster:

nohup python pipeline.py config.yaml &

After the pipeline finishes, summarize the results with:

python results_parser.py config.yaml

In the configuration YAML file, project_dir: ~/projects/test_ChIP-seq and data_dir: "data" mean that the data folder is ~/projects/test_ChIP-seq/data, and the results will be put in the same folder. FASTQ files should be under the ~/projects/test_ChIP-seq/data/fastq folder. Currently *.fastq, *.fq, and *.gz (compressed FASTQ) files are accepted. aligner currently supports only bowtie2. A minimal config sketch is shown below.
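The following is only a sketch of what config.yaml might look like, using the keys mentioned in this README (project_dir, data_dir, aligner, input_files, pair_end); check the config.yaml shipped with the repository for the full set of required fields and their exact defaults.

# minimal sketch; values other than project_dir and data_dir are assumptions
project_dir: "~/projects/test_ChIP-seq"
data_dir: "data"
aligner: "bowtie2"
input_files: "*.fastq"   # assumed single-end pattern; see the paired-end notes below
pair_end: "no"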

The location of pipeline.py, results_parser.py, and config.yaml does not matter, but I prefer to put them under the project/script/preprocess folder.

Important:

  • The alignment step includes parsing of the results into uniquely mapped, multi-mapped, and unmapped BAM files. The uniquely mapped reads are sent to rmdup, while the multi-mapped and unmapped reads are used to run diffRepeats. The settings that determine uniquely and multi-mapped reads are in config.yaml.
  • To make the ngs.plot step work, please name the fastq files as follows: say there are conditions A and B, each with two replicates and one DNA input per condition; name the files A_rep1.fastq, A_rep2.fastq, A_input.fastq, B_rep1.fastq, B_rep2.fastq, and B_input.fastq. The key point is that samples from the same condition share common leading letters and that input samples contain the string "input" or "Input".
  • If you want to run only up to a specific step, modify the function name passed to pipeline_run in pipeline.py.
  • If the data are paired-end, follow these steps (see the config sketch after this list):
    • Modify the config.yaml file and change "pair_end" to "yes".
    • Modify the config.yaml file and change "input_files" to "*R1*.fastq.gz" or "*R1*.fastq".
    • Make sure the fastq files are named with the "*R1*" and "*R2*" pattern.
  • If you want to use a cluster:
    • Edit '~/.bash_profile' to make sure all required paths are in $PATH.
    • Modify config.yaml to fit your needs.
    • multithread in pipeline.py determines the number of concurrent jobs that ruffus submits to cluster nodes; the default value is 10.
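As referenced in the paired-end item above, the corresponding changes in config.yaml would look roughly like this, assuming the same key names as in the sketch in the Usage section:

pair_end: "yes"
input_files: "*R1*.fastq.gz"   # or "*R1*.fastq"; mate files must follow the "*R1*" / "*R2*" naming pattern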

Warning:

Bowtie2 reports reads with multiple hits, which breaks an assumption of phantomPeak:

It is EXTREMELY important to filter out multi-mapping reads from the BAM/tagAlign files. A large number of multimapping reads can severely affect the phantom peak coefficient and peak calling results.

So be careful when interpreting NSC and RSC for Bowtie2 alignment results.
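The pipeline already separates uniquely mapped, multi-mapped, and unmapped reads (see the Important section above). If you need to filter a BAM file yourself before running phantomPeak, one common approach, sketched here with an assumed MAPQ cutoff of 30 and hypothetical file names, is to keep only high-MAPQ alignments:

samtools view -b -q 30 sample.sorted.bam > sample.uniq.bam   # keep only alignments with MAPQ >= 30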

Notes

Bowtie2 is run with default parameters.
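For reference, a default-parameter single-end Bowtie2 run looks roughly like the command below; the index prefix and file names are placeholders, and the actual command is built by the pipeline's fast2bam_by_bowtie2.sh script:

bowtie2 -x /path/to/genome_index -U sample.fastq -S sample.sam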

chip-seq_preprocess's People

Contributors

eddielohyh, lishen, ny-shao


chip-seq_preprocess's Issues

The meaning of multithread?

What does this option control? Does it determine the number of concurrent jobs submitted to cluster nodes? If so, why not simply set it to a large number, such as 100? A well-configured HPC system should be able to handle a large number of job submissions easily.

"sort" in rmdup?

It is unnecessary to sort BAM files again after rmdup. The rmdup program relies on the input BAM file already being sorted; it removes duplicates in a stream and outputs alignments in their original order. Since sorting large BAM files costs a significant amount of resources, this step should be eliminated.
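A sketch of the point being made, using the legacy samtools rmdup command on hypothetical file names: the output keeps the coordinate order of the sorted input, so it can be indexed directly without another sort:

samtools rmdup -s sample.sorted.bam sample.rmdup.bam   # -s treats reads as single-end
samtools index sample.rmdup.bam                        # no extra sort needed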

chip-seq_preprocess not compatible with new samtools

Hi - I've been using your awesome chip-seq_preprocess pipeline for several years. I just tried to run it on a computer with the newest version of samtools (version 1.9) and am getting errors that I've traced to changes in the new samtools. I've fixed one easier problem in fast2bam_by_bowtie2.sh: line 76 needs -o added before the output file name, because the sort command in the new samtools requires it. With that change I can now align all my fastq files and run the subsequent steps (such as fastqc) until the rmdup step.
This issue has me a little stuck. rmdup.bam.sh uses rmdup in line 7. However, this command no longer exists in samtools and has been replaced by markdup. markdup requires a few initial steps, and I'm not sure how to name the output files so as not to mess up the subsequent steps of the pipeline. I'm also not totally clear on how the input file is sorted at this stage in the pipeline, which matters for markdup. I think it would only require a few additional lines of code in rmdup.bam.sh to replace the rmdup command, but I'm having trouble figuring out how to do this. Is this something you could help with? Here is the link to the new samtools manual, and below is the relevant info on markdup.
I'd love to be able to use the newer version of samtools if possible so any suggestions would be very much appreciated! Thank you!

http://www.htslib.org/doc/samtools.html

markdup
samtools markdup [-l length] [-r] [-s] [-T] [-S] in.algsort.bam out.bam

Mark duplicate alignments from a coordinate sorted file that has been run through fixmate with the -m option. This program relies on the MC and ms tags that fixmate provides.

-l INT
Expected maximum read length of INT bases. [300]

-r
Remove duplicate reads.

-s
Print some basic stats.

-T PREFIX
Write temporary files to PREFIX.samtools.nnnn.mmmm.tmp

-S
Mark supplementary reads of duplicates as duplicates.

EXAMPLE

# The first sort can be omitted if the file is already name ordered
samtools sort -n -o namesort.bam example.bam

# Add ms and MC tags for markdup to use later
samtools fixmate -m namesort.bam fixmate.bam

# Markdup needs position order
samtools sort -o positionsort.bam fixmate.bam

# Finally mark duplicates
samtools markdup positionsort.bam markdup.bam
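For what it's worth, the four steps above can also be piped together, which may make a drop-in replacement inside rmdup.bam.sh easier. This is only a sketch with hypothetical file names; note that -r removes duplicates instead of just marking them:

samtools sort -n example.bam | samtools fixmate -m - - | samtools sort - | samtools markdup -r - rmdup.bam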

Report number of multi-reads

It's better to report the number of reads that are aligned to multiple locations (reads with alignments suppressed due to -m) as well.
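Assuming the pipeline writes the multi-mapped alignments to their own BAM file (as described in the Important section), a quick count could be obtained like this, with a hypothetical file name:

samtools view -c sample.multi.bam   # count alignment records in the multi-mapped BAM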
