
chipseq_pipeline's Introduction

This pipeline has been deprecated as of June 2018. Please migrate your pipelines to the official WDL-based ENCODE DCC pipeline at https://github.com/ENCODE-DCC/chip-seq-pipeline2 (June 2018).

June 2018: Note that the updated official ENCODE DCC pipeline is an exact replica of the pipeline in this repository, except that it uses WDL instead of BigDataScript for workflow management. We recommend transitioning to the WDL version since it is easier to install; all future updates and bug fixes will be made to the WDL-based pipeline only. If you have already processed datasets using the pipeline in this repository, you do NOT need to rerun anything. For future runs, we recommend switching to the WDL-based pipeline.

AQUAS Transcription Factor and Histone ChIP-Seq processing pipeline

Introduction

The AQUAS pipeline implements the ENCODE (phase-3) transcription factor and histone ChIP-seq pipeline specifications (by Anshul Kundaje) described in this Google doc.

NOTE: We recommend using the WDL-based implementation of this pipeline here as it uses a more stable and maintained workflow management system.

Installation

Install the software and genome database in the correct order for your system. For example, on Kundaje lab clusters you only need to install one piece of software, the pipeline itself.

Java

Install Java 8 (jdk >= 1.8 or jre >= 1.8) on your system. If you don't have super-user privileges, install it locally and add it to your $PATH.

  • For Debian/Ubuntu (>14.10) based Linux:

    $ sudo apt-get install git openjdk-8-jre
    
  • For Fedora/Red-Hat based Linux:

    $ sudo yum install git java-1.8.0-openjdk
    
  • For Ubuntu 14.04 (trusty):

    $ sudo add-apt-repository ppa:webupd8team/java -y
    $ sudo apt-get update
    $ sudo apt-get install oracle-java8-installer
    

Conda

REMOVE ANY ANACONDA OR OTHER VERSIONS OF CONDA FROM YOUR BASH STARTUP SCRIPT. WE CANNOT GUARANTEE THAT THE PIPELINE WORKS WITH OTHER VERSIONS OF CONDA. ALSO REMOVE R AND OTHER CONFLICTING MODULES FROM IT. Remove any other Anaconda from your $PATH. Check your loaded modules with $ module list and unload any Anaconda modules in your bash startup scripts ($HOME/.bashrc or $HOME/.bash_profile). Add unset PYTHONPATH to your bash startup scripts.

Install the latest Miniconda3 on your system. IMPORTANT: Make sure that the absolute path of the destination directory is short. A long path will cause an error in the dependencies installation step (see issue #8).

$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh

Answer yes to the final question. If you choose no, you will need to manually add Miniconda3 to your $HOME/.bashrc.

Do you wish the installer to prepend the Miniconda3 install location
to PATH in your /your/home/.bashrc ? [yes|no]
[no] >>> yes

Open a new terminal after installation.

BigDataScript

Install BigDataScript v0.99999e (forked) on your system. The original BDS v0.99999e does not work correctly with the pipeline (see PR #142 and issue #131).

$ wget https://github.com/leepc12/BigDataScript/blob/master/distro/bds_Linux.tgz?raw=true -O bds_Linux.tgz
$ mv bds_Linux.tgz $HOME && cd $HOME && tar zxvf bds_Linux.tgz

Add export PATH=$PATH:$HOME/.bds to your $HOME/.bashrc. If a Java memory error occurs, also add export _JAVA_OPTIONS="-Xms256M -Xmx728M -XX:ParallelGCThreads=1".
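
Put together, the relevant lines in $HOME/.bashrc look like this (the _JAVA_OPTIONS line is needed only if you hit Java memory errors):

# in $HOME/.bashrc
export PATH=$PATH:$HOME/.bds
export _JAVA_OPTIONS="-Xms256M -Xmx728M -XX:ParallelGCThreads=1"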

AQUAS Pipeline

Get the latest version of the pipeline.

$ git clone https://github.com/kundajelab/TF_chipseq_pipeline

Dependencies

Install software dependencies automatically with the script below. It creates two conda environments (aquas_chipseq and aquas_chipseq_py3) under your conda installation.

$ bash install_dependencies.sh

If you don't use install_dependencies.sh, manually replace BDS's default bds.config with the correct one (this also copies the bds_scr helper script):

$ cp bds.config ./utils/bds_scr $HOME/.bds

If install_dependencies.sh fails, run ./uninstall_dependencies.sh, fix the problems, and then try bash install_dependencies.sh again.

Genome data

Install genome data for a specific genome [GENOME]. Currently hg19, mm9, hg38 and mm10 are available. Specify a directory [DATA_DIR] to download the genome data to. A species file generated in [DATA_DIR] will be automatically added to your ./default.env so that the pipeline knows you have installed genome data using install_genome_data.sh. If you want to install multiple genomes, make sure to use the same directory [DATA_DIR] for all of them; each genome's data is installed in [DATA_DIR]/[GENOME]. If you use other BDS pipelines, it is recommended to use the same directory [DATA_DIR] to save disk space.

IMPORTANT: install_genome_data.sh can take longer than an hour to download data and build indices. DO NOT run the script on a login node; use qlogin for SGE and srun --pty bash for SLURM.

# DO NOT run this on a login node
$ bash install_genome_data.sh [GENOME] [DATA_DIR]

If you have super-user privileges on your system, it is recommended to install the genome data in /your/data/bds_pipeline_genome_data and share it with others.

$ sudo su
$ bash install_genome_data.sh [GENOME] /your/data/bds_pipeline_genome_data

You can find a species file [SPECIES_FILE] in /your/data/bds_pipeline_genome_data for each pipeline type. Others can then use the genome data by adding -species_file [SPECIES_FILE_PATH] to the pipeline command line, or by adding species_file = [SPECIES_FILE_PATH] to the [default] section of their ./default.env.
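
For example, a shared installation's ./default.env might contain the following (the path is illustrative):

[default]
species_file = /your/data/bds_pipeline_genome_data/aquas_chipseq_species.conf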

Installation for internet-free computers

The AQUAS pipeline itself does not need an internet connection, but the installers (install_dependencies.sh and install_genome_data.sh) do. The workaround is to first install dependencies and genome data on a computer that is connected to the internet, and then move the Conda and genome database directories to your internet-free one. Both computers should have THE SAME LINUX VERSION.

  • On your computer that has an internet access,

    • Follow the installation instructions for general computers.
    • Move your Miniconda3 directory to $HOME/miniconda3 on your internet-free computer.
    • Move your genome database directory, which contains aquas_chipseq_species.conf and one directory per species, to $HOME/genome_data on your internet-free computer.
    • Move your BDS directory $HOME/.bds to $HOME/.bds on your internet-free computer.
    • Move your pipeline directory chipseq_pipelines/ to $HOME/chipseq_pipelines/ on your internet-free computer.
  • On your internet-free computer,

    • Add your miniconda3/bin and BDS binary to $PATH in your bash initialization script ($HOME/.bashrc or $HOME/.bash_profile).

      export PATH="$PATH:$HOME/miniconda3/bin"
      export PATH="$PATH:$HOME/.bds"
      
    • Modify [default] section in $HOME/chipseq_pipelines/default.env.

      [default]
      conda_bin_dir=$HOME/miniconda3/bin
      species_file=$HOME/genome_data/aquas_chipseq_species.conf
      
  • Modify all paths in $HOME/genome_data/aquas_chipseq_species.conf so that they correctly point to the right files.

  • Check BDS version.

    $ bds -version
    Bds 0.99999e (build 2016-08-26 06:34), by Pablo Cingolani
    
  • Make sure that your Java runtime version is >= 1.8.

    $ java -version
    java version "1.8.0_111"
    Java(TM) SE Runtime Environment (build 1.8.0_111-b14)
    Java HotSpot(TM) 64-Bit Server VM (build 25.111-b14, mixed mode)
    

Usage

We recommend using BASH to run pipelines.

Command line arguments / configuration JSON file

IMPORTANT! The latest version of the pipeline includes a Python wrapper, chipseq.py, to parse command line arguments and the JSON configuration file. python chipseq.py takes the same command line arguments as the original bds chipseq.bds, but it takes a JSON configuration file instead of the original INI one.

There are two ways to define parameters for ChIP-Seq pipelines. Default values are already given for most parameters. Take a look at the example commands and configuration files in examples. The AQUAS pipeline is multi-threaded; set the maximum number of processors with -nth.

  1. Parameters from command line arguments: Any of the three positional arguments can be skipped; default values will be used for the skipped ones. Choose [CHIPSEQ_TYPE] between TF (default) and histone; you can also specify it with -type [CHIPSEQ_TYPE]. You can stop the pipeline at the end of any stage. Choose [FINAL_STAGE] among bam, filt_bam, tag, xcor, peak and idr (default); you can also specify it with -final_stage [FINAL_STAGE]. See Pipeline steps for details about pipeline stages.

    $ python chipseq.py [CHIPSEQ_TYPE] [FINAL_STAGE] [CONF_JSON_FILE] -species [SPECIES] -nth [NUM_THREADS] -fastq1 ... -fastq2 ... -ctl_fastq1 ...
    
  2. Parameters from a configuration JSON file: Note that command line arguments and the configuration JSON share the same key names. In a configuration JSON, only the deepest keys and values are taken; any JSON structure/hierarchy used to group those keys is allowed. See the full example JSON and the reduced example JSON for how to group keys.

    $ python chipseq.py [CONF_JSON_FILE]
    
    # CONF_JSON_FILE (works without grouping hierarchy)
    {
      "type" : "[CHIPSEQ_TYPE]",
      "final_stage" : "[FINAL_STAGE]",
      "fastq1" : "...",
      "fastq2" : "...",
      "ctl_fastq1" : "...",
      ...
      "species" : "hg19",
      "nth" : [NUM_THREADS]
    }
    
  3. You can mix methods 1 and 2. Parameters specified as command line arguments override those in the JSON file.

To list all parameters: $ python chipseq.py -h

Stopping / Resuming pipeline

Press Ctrl + C in the terminal or send any kind of kill signal to the pipeline. Note that this deletes all intermediate files and incomplete outputs for the running tasks. The AQUAS pipeline automatically determines whether each task has finished (by comparing the timestamps of each task's input/output files). To restart from the point of failure, fix the error first and then run the pipeline with the same command that you originally started it with; there is no additional parameter for restarting.
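
For example (flags are illustrative):

$ python chipseq.py -species hg19 -nth 8 -fastq1 /DATA/REP1.fastq.gz ...
# fix the cause of the failure, then resume with the identical command;
# finished tasks are detected by timestamps and skipped
$ python chipseq.py -species hg19 -nth 8 -fastq1 /DATA/REP1.fastq.gz ...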

Running pipelines with a cluster engine

On servers with a cluster engine (such as Sun Grid Engine and SLURM), DO NOT QSUB/SBATCH THE BDS COMMAND LINE. Run the BDS command directly on a login node. BDS is a task manager and will automatically submit (qsub/sbatch) and manage its subtasks. Choose [CLUSTER_ENGINE] among sge (default on Kundaje), slurm (default on Sherlock and SCG) and local (default elsewhere). You can also let BDS submit its subtasks to a specific queue/partition [QUEUE_NAME] on Sun Grid Engine or SLURM.

$ python chipseq.py -system sge -q [SGE_QUEUE_NAME] ...
$ python chipseq.py -system slurm -q [SLURM_PARTITION_NAME] ... # Sherlock example
$ python chipseq.py -system slurm -q_for_slurm_account -q [SLURM_ACCOUNT_NAME] ... # SCG example

IMPORTANT! Please read this section carefully if you run pipelines on Stanford SCG and Sherlock cluster.

Most clusters limit the number of threads and the memory per user on a login node. One BDS process, as a Java-based task manager, takes up to 1 GB of memory and 50 threads even though it only submits and monitors subtasks. So if you run more than about 50 pipelines in parallel, your cluster will kill the BDS processes due to the resource limit on the login node (check your per-user resource limits with ulimit -a); 50 pipelines means roughly 50 GB of memory and 2500 threads taken by 50 BDS processes. The workaround is to start an interactive node and keep all BDS processes alive there. That interactive node must have a walltime long enough for all the pipelines in it to finish. The recommended resource setting is 1.0 GB of memory per pipeline.

SGE example to make an interactive node for 100 pipelines: 1 cpu, 100GB memory, 3 days walltime.

$ qlogin -l h_rt=72:00:00 -l h_vmem=100G

SLURM example to make an interactive node for 100 pipelines: 1 cpu, 100GB memory, 3 days walltime.

$ srun -n 1 --mem 100G -t 3-0 -p [YOUR_PARTITION] --qos normal --pty /bin/bash -i -l
$ srun -n 1 --mem 100G -t 3-0 --account [YOUR_ACCOUNT] --qos normal --pty /bin/bash -i -l

Once you get an interactive node, repeat the following commands per sample to run a pipeline.

$ cd [WORK_DIR]

$ python chipseq.py -screen [SCREEN_NAME] -system [CLUSTER_ENGINE: slurm or sge] -q [SGE_QUEUE_OR_SLURM_PARTITION] -nth [MAX_NUM_THREAD_PER_PIPELINE] ...

$ python chipseq.py -screen [SCREEN_NAME] -system slurm -q_for_slurm_account -q [SLURM_ACCOUNT] -nth [MAX_NUM_THREAD_PER_PIPELINE] ...

$ sleep 2 # wait for 2 seconds for safety

Then you can monitor your pipelines with screen -ls and tail -f [WORK_DIR]/[SCREEN_NAME].log. If you want to run more than 200 pipelines, consider making multiple interactive nodes and distributing your samples across them.

Input data type

There are five data types: fastq, bam, filt_bam, tag and peak.

  • For exp. replicates: define the data path with -[DATA_TYPE][REPLICATE_ID].
  • For controls: define the data path with -ctl_[DATA_TYPE][CONTROL_ID].

You can omit [REPLICATE_ID] or [CONTROL_ID] if it is 1 (e.g. -fastq, -ctl_bam, -tag, ...). Except for fastq, add -pe if your data set is PAIRED-END. You can also specify endedness individually for each replicate: -pe[REPLICATE_ID] for exp. replicates, -ctl_pe[CONTROL_ID] for controls.

  • Starting from fastqs: see the example in the previous section

  • Starting from paired end bams:

    $ python chipseq.py -species hg19 -pe -bam1 /DATA/REP1.bam -bam2 /DATA/REP2.bam -ctl_bam /DATA/CTL.bam ...
    
  • Starting from single-ended deduped / filtered bams:

    $ python chipseq.py -species hg19 -se -filt_bam1 /DATA/REP1.filt.bam -filt_bam2 /DATA/REP2.filt.bam -ctl_filt_bam /DATA/CTL.filt.bam ...
    
  • Starting from paired end tagaligns:

    $ python chipseq.py -species mm9 -pe -tag1 /DATA/REP1.tagAlign.gz -tag2 /DATA/REP2.tagAlign.gz -ctl_tag /DATA/CTL.tagAlign.gz
    
  • Starting from narrow/relaxed(region) peak files:

    $ python chipseq.py -species hg19 -peak1 /DATA/Example1.regionPeak.gz -peak2 /DATA/Example2.regionPeak.gz -peak_pooled /DATA/Example.pooled.regionPeak.gz ...
    

    If you want to perform full IDR including pseudo replicates and pooled pseudo replicates, add the following to the command line.

    • For IDR on pseudo replicates of replicate 1: -peak1_pr1 [PEAK1_PR1] -peak1_pr2 [PEAK1_PR2]
    • For IDR on pseudo replicates of replicate 2: -peak2_pr1 [PEAK2_PR1] -peak2_pr2 [PEAK2_PR2]
    • For IDR on pseudo replicates of replicate N: -peakN_pr1 [PEAKN_PR1] -peakN_pr2 [PEAKN_PR2]
    • For IDR on pooled pseudo replicates: -peak_ppr1 [PEAK_PPR1] -peak_ppr2 [PEAK_PPR2]
  • Mixing up input types:

    $ python chipseq.py -species mm9 -se -fastq1 /DATA/REP1.fastq.gz -bam2 /DATA/ENCSR000EGM/REP2.bam -ctl_tag /DATA/CTL.tagAlign.gz
    

Endedness (SE/PE)

All data are treated as SINGLE-ENDED if endedness is not explicitly specified. Endedness is automatically detected for fastqs.

Add -pe to the command line if all data sets are paired-end. You can also specify endedness individually for each replicate.

  • For exp. replicates: -pe[REPLICATE_ID]
  • For controls: -ctl_pe[CONTROL_ID]

For fastqs you do not need to add -pe; the pipeline automatically determines endedness from the indices of the command line parameters.

  • For exp. replicates:

    • Define data path as -fastq[REPLICATE_ID], then it's SE (single ended).
    • Define data path as -fastq[REPLICATE_ID]_[PAIRING_ID], then it's PE.
  • For controls:

    • Define data path as -ctl_fastq[REPLICATE_ID], it's SE.
    • Define data path as -ctl_fastq[REPLICATE_ID]_[PAIRING_ID], it's PE.
  • Example: 2 replicates and 1 control replicate (all SE)

    $ python chipseq.py -species hg19 -fastq1 /DATA/REP1.fastq.gz -fastq2 /DATA/REP2.fastq.gz -ctl_fastq1 /DATA/CTL.fastq.gz
    
  • Example: 2 replicates and 2 control replicates (all PE)

    $ python chipseq.py -species hg19 -fastq1_1 /DATA/REP1_1.fastq.gz -fastq1_2 /DATA/REP1_2.fastq.gz -fastq2_1 /DATA/REP2_1.fastq.gz -fastq2_2 /DATA/REP2_2.fastq.gz -ctl_fastq1_1 /DATA/Ctl/CTL_1_1.fastq.gz -ctl_fastq1_2 /DATA/Ctl/CTL_1_2.fastq.gz -ctl_fastq2_1 /DATA/Ctl/CTL_2_1.fastq.gz -ctl_fastq2_2 /DATA/Ctl/CTL_2_2.fastq.gz
    

You can mix up not only data types but also endedness.

  • Example: 1 SE fastq, 1 PE bam and 1 PE control tagalign

    $ python chipseq.py -species hg19 -fastq1 /DATA/REP1.fastq.gz -pe2 -bam2 /DATA/REP2.bam -ctl_pe -ctl_tag /DATA/CTL.tagAlign.gz
    

Pipeline steps

The AQUAS pipeline goes through the following stages:

  • bam : mapping (fastq -> bam)
  • filt_bam : filtering and deduping bam (bam -> filt_bam)
  • tag : creating tagalign (filt_bam -> tagalign)
  • xcor : cross-correlation analysis (tagalign -> xcor plot.pdf/score.txt )
  • peak : peak calling (tagalign -> peak)
  • idr : IDR (peaks -> IDR score and peaks)

The AQUAS pipeline stops right after -final_stage [STAGE] (idr by default). This is useful if you are not interested in peak calling and just want to map/align lots of genome data (fastq, bam or filt_bam) IN PARALLEL.
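
For example, to align a dataset and stop before peak calling (flags other than -final_stage are illustrative):

$ python chipseq.py -final_stage filt_bam -species hg19 -nth 8 -fastq1 /DATA/REP1.fastq.gz ...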

Parallelization

For completely serialized jobs, add -no_par to the command line. Individual tasks can still go multi-threaded.

IMPORTANT! You can set a limit on the total number of threads with -nth [MAX_TOTAL_NO_THREADS]. The total number of threads used by the pipeline will not exceed this limit.

The default -nth for each cluster is defined in ./default.env (e.g. 16 on SCG and 8 on the Kundaje lab cluster).

The pipeline automatically distributes [MAX_TOTAL_NO_THREADS] threads across jobs according to the sizes of the corresponding input files. For example, for two fastqs (1 GB and 2 GB) with -nth 6, 2 and 4 threads are allocated for aligning the 1 GB and 2 GB fastqs, respectively. The same policy applies to other multi-threaded tasks such as deduping and peak calling.
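
The example above can be reproduced with a minimal sketch of the allocation policy (an illustration of the behavior described here, not the pipeline's actual code):

# illustrative sketch of proportional thread allocation (not pipeline code)
def allocate_threads(file_sizes_gb, max_total_threads):
    total = sum(file_sizes_gb)
    # proportional share per input, at least 1 thread each
    alloc = [max(1, round(max_total_threads * s / total)) for s in file_sizes_gb]
    # trim any rounding overshoot from the largest allocation
    while sum(alloc) > max_total_threads:
        alloc[alloc.index(max(alloc))] -= 1
    return alloc

print(allocate_threads([1, 2], 6))  # -> [2, 4], matching the example above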

However, all multi-threaded tasks (like bwa, bowtie2, spp and macs2) still have their own max. memory (-mem_APPNAME [MEM_APP]) and walltime (-wt_APPNAME [WALLTIME_APP]) settings. Max. memory is NOT PER CPU. For example, on the Kundaje lab cluster (with the SGE flag activated: bds -s sge chipseq.bds ...), the shell command submitted by BDS for each task looks like the following:

qsub -V -pe shm [NTH_ALLOCATED_FOR_APP] -l h_vmem=[MEM_APP]/[NTH_ALLOCATED_FOR_APP] -l h_rt=[WALLTIME_APP] -l s_rt=[WALLTIME_APP] ...

This ensures that the total memory reserved for a cluster job equals [MEM_APP]. The same policy applies to SLURM.
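
For instance (numbers are illustrative, not literal pipeline output), a task given 4 threads and a 12G memory setting would be submitted with per-slot memory 12G/4 = 3G:

qsub -V -pe shm 4 -l h_vmem=3G -l h_rt=[WALLTIME_APP] -l s_rt=[WALLTIME_APP] ...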

Changing peak caller

The pipeline supports two peak callers: spp and macs2. By default, spp is used for TF ChIP-seq and macs2 for histone ChIP-seq, but you can specify a peak caller regardless of [CHIPSEQ_TYPE]. Simply add -peak_caller [PEAK_CALLER_FOR_IDR] to the command line.
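
For example, to call peaks with spp on a histone ChIP-seq run (other flags omitted):

$ python chipseq.py -type histone -peak_caller spp ...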

Changing dup marker

The pipeline supports two dup markers: picard and sambamba. picard is the default. picard is Java-based, so it can hit Java-related issues (e.g. Java heap errors). To change the dup marker to sambamba, simply add -dup_marker sambamba to the command line.
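
For example (other flags omitted):

$ python chipseq.py -dup_marker sambamba ...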

Managing multiple pipelines

Simply add -screen [SCREEN_NAME] to create a detached screen for a pipeline; stdout/stderr will then be redirected to a log file [SCREEN_NAME].log. If the log file already exists, stdout/stderr will be appended to it. Monitor the pipeline with tail -f [SCREEN_NAME].log. The screen is closed automatically once the pipeline run is done. To kill a pipeline manually while it is running, use ./utils/kill_scr or screen -X quit:

$ screen -X -S [SCREEN_NAME] quit
$ kill_scr [SCREEN_NAME]

Java issues (memory and temporary directory)

Picard tools, which the pipeline uses for marking duplicate reads, is Java-based. If you see Java heap space errors, increase the memory limit for Java with -mem_dedup [MEM] (default: 12G).

If your /tmp fills up quickly and you want to change the temporary directory for all Java apps in the pipeline, add the following line to your bash startup script ($HOME/.bashrc). The pipeline reads $TMPDIR (not $TMP) for all Java apps.

export TMPDIR=/your/temp/dir/

Another quick workaround for Java issues is to avoid Picard tools entirely: add -use_sambamba_markdup to your command line to use sambamba markdup instead of picard markdup.

How to customize genome data installer?

Please refer to the section Installer for genome data on BDS pipeline programming.

Usage cheat sheet

For general use, use the following command line. You can skip the first three positional arguments to use their default values.

$ python chipseq.py [CHIPSEQ_TYPE; TF(default), histone] [FINAL_STAGE: bam, filt_bam, tag, xcor, peak, idr(default)] [CONF_JSON_FILE] -species [SPECIES; hg19, mm9, ... ] -nth [NUM_THREADS] -fastq1 [READ_REP1] -fastq2 [READ_REP2] ...

IMPORTANT! If your data set is PAIRED-END, add the following to the command line; otherwise the pipeline assumes SE by default.

-pe

You can also individually specify endedness for each replicate.

-se[REPLICATE_ID]   # for exp. replicates, 
-se1 -pe2 -se3 ...

If you have just one replicate (PE), define fastqs with -fastq[REP_ID]_[PAIR_ID].

-fastq1_1 [READ_PAIR1] -fastq1_2 [READ_PAIR2]

For multiple replicates (PE), define fastqs with -fastq[REP_ID]_[PAIR_ID]. Add a -fastq[]_[] entry for each replicate and pair to the command line:

-fastq1_1 [READ_REP1_PAIR1] -fastq1_2 [READ_REP1_PAIR2] -fastq2_1 [READ_REP2_PAIR1] -fastq2_2 [READ_REP2_PAIR2] ...

For multiple replicates (SE), define fastqs with -fastq[REP_ID]:

-se -fastq1 [READ_REP1] -fastq2 [READ_REP2] ...

You can start from bam files. There are two kinds of bam files (raw or deduped), and you need to choose explicitly between raw bam (bam) and deduped bam (filt_bam). Don't forget to add -pe if they are paired-end (PE). For raw bams:

-bam1 [RAW_BAM_REP1] -bam2 [RAW_BAM_REP2] ...

For deduped (filtered) bams:

-filt_bam1 [FILT_BAM_REP1] -filt_bam2 [FILT_BAM_REP2] ...

For tagaligns (non-tn5-shifted):

-tag1 [TAGALIGN_REP1] -tag2 [TAGALIGN_REP2] ...

You can also mix up any data types.

-bam1 [RAW_BAM_REP1] -tag2 [TAGALIGN_REP2] ...

For controls, simply add the prefix ctl_ to the parameters. This rule applies to the endedness of controls too. You can also have multiple sets of controls. For example, with 2 PE controls:

-ctl_pe -ctl_fastq1_1 [FASTQ_CTL1_PAIR1] -ctl_fastq1_2 [FASTQ_CTL1_PAIR2] -ctl_fastq2_1 [FASTQ_CTL2_PAIR1] -ctl_fastq2_2 [FASTQ_CTL2_PAIR2] ...

If your control is a raw bam (paired end):

-ctl_pe -ctl_bam1 [BAM_CTL1] ...

To subsample beds (tagaligns), add the following to the command line. This is different from the subsampling for cross-corr. analysis: peaks will be called on the subsampled tagaligns.

-subsample_chip [NO_READS_TO_SUBSAMPLE]

To subsample control beds (tagaligns) add the following to the command line.

-subsample_ctl [NO_READS_TO_SUBSAMPLE]

To change the number of reads to subsample for cross-corr. analysis, use the following. This does not affect downstream tasks (peak calling and IDR).

-subsample_xcor [NO_READS_TO_SUBSAMPLE]

To disable pseudo replicate generation, add the following. By default, peak calling and IDR are performed on true replicates and pseudo replicates; with -true_rep on the command line, you get peak calling and IDR on true replicates only.

-true_rep

-true_rep also disables peak calling on pooled replicates and self pseudo replicates. If you want peaks on true and pooled replicates but not on pseudo replicates:

-no_pseudo_rep

You can change the IDR threshold.

-idr_thresh [IDR_THRESHOLD]

You can specify a peak caller for IDR regardless of the type of ChIP-seq.

-peak_caller [PEAK_CALLER; spp, macs2]

To disable cross-corr. analysis.

-no_xcor

If you turn off cross-corr. analysis (-no_xcor), make sure to set both -extsize_macs2 [EXTSIZE] and -speak_spp [SPEAK]; these two values are normally taken from the cross-corr. analysis.
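
For example (other flags omitted; [EXTSIZE] and [SPEAK] are fragment-length estimates that you supply):

$ python chipseq.py -no_xcor -extsize_macs2 [EXTSIZE] -speak_spp [SPEAK] ...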

Useful HTML reports

There are two kinds of HTML reports provided by the pipeline.

  • BigDataScript HTML report for debugging: Located in the working folder with the name chipseq_[TIMESTAMP]_report.html. This report is automatically generated by BigDataScript and is useful for debugging, since it shows a summary, timeline, stdout and stderr for each job.

  • ChIP-Seq pipeline report for QC and results: The pipeline automatically generates an HTML report (Report.html) in its output directory (specified with -out_dir; default ./out). It summarizes the files and directory structure, includes QC reports, and shows a workflow diagram and genome browser tracks for peaks and signals (bigwigs for p-value and fold change). Move your output directory to a web directory (for example, /var/www/somewhere) or make a softlink to it from a web directory. For genome browser tracks, place the output under your web directory root while keeping its structure, and make sure that you have bgzip and tabix installed on your system. Add the following to the command line:

    -url_base http://your/url/to/output -title [PREFIX_FOR_YOUR_REPORT]
    

Temporary files

If you stop a BDS pipeline with Ctrl+C while it is calling peaks with spp, temporary files generated by Rscript are not removed; they remain in $TMPDIR (or /tmp if $TMPDIR is not explicitly exported). You need to remove them manually.

Output directory structure and file naming

For more details, refer to the file table section in an HTML report generated by the pipeline.

out                               # root dir. of outputs
│
├ *report.html                    #  HTML report
├ *tracks.json                    #  Tracks datahub (JSON) for WashU browser
├ ENCODE_summary.json             #  Metadata of all datafiles and QC results
│
├ align                           #  mapped alignments
│ ├ rep1                          #   for true replicate 1
│ │ ├ *.bam                       #    raw bam
│ │ ├ *.nodup.bam                 #    filtered and deduped bam
│ │ ├ *.tagAlign.gz               #    tagAlign (bed6) generated from filtered bam
│ │ ├ *.*M.tagAlign.gz            #    subsampled tagAlign for cross-corr. analysis
│ ├ rep2                          #   for true replicate 2
│ ...
│ ├ ctl1                          #   for control 1
│ ...
│ ├ pooled_rep                    #   for pooled replicate
│ ├ pseudo_reps                   #   for self pseudo replicates
│ │ ├ rep1                        #    for replicate 1
│ │ │ ├ pr1                       #     for self pseudo replicate 1 of replicate 1
│ │ │ └ pr2                       #     for self pseudo replicate 2 of replicate 1
│ │ ├ rep2                        #    for replicate 2
│ │ ...
│ └ pooled_pseudo_reps            #   for pooled pseudo replicates
│   ├ ppr1                        #    for pooled pseudo replicate 1 (rep1-pr1 + rep2-pr1 + ...)
│   └ ppr2                        #    for pooled pseudo replicate 2 (rep1-pr2 + rep2-pr2 + ...)
│
├ peak                            #  peaks called
│ ├ macs2                         #   peaks generated by MACS2
│ │ ├ rep1                        #    for replicate 1
│ │ │ ├ *.narrowPeak.gz           #     narrowPeak
│ │ │ ├ *.gappedPeak.gz           #     gappedPeak
│ │ │ ├ *.filt.narrowPeak.gz      #     blacklist filtered narrowPeak
│ │ │ ├ *.filt.gappedPeak.gz      #     blacklist filtered gappedPeak
│ │ ├ rep2                        #    for replicate 2
│ │ ...
│ │ ├ pseudo_reps                 #    for self pseudo replicates
│ │ └ pooled_pseudo_reps          #    for pooled pseudo replicates
│ │
│ ├ spp                           #   peaks generated by SPP
│ │ ├ rep1                        #    for replicate 1
│ │ │ ├ *.regionPeak.gz           #     regionPeak (narrowPeak format) generated from SPP
│ │ │ ├ *.filt.regionPeak.gz      #     blacklist filtered narrowPeak
│ │ ├ rep2                        #    for replicate 2
│ │ ...
│ │ ├ pseudo_reps                 #    for self pseudo replicates
│ │ └ pooled_pseudo_reps          #    for pooled pseudo replicates
│ │
│ └ idr                           #   IDR thresholded peaks (using peaks from SPP)
│   ├ true_reps                   #    for true replicates
│   │ ├ *.narrowPeak.gz           #     IDR thresholded narrowPeak
│   │ ├ *.filt.narrowPeak.gz      #     IDR thresholded narrowPeak (blacklist filtered)
│   │ └ *.12-col.bed.gz           #     IDR thresholded narrowPeak track for WashU browser
│   ├ pseudo_reps                 #    for self pseudo replicates
│   │ ├ rep1                      #     for replicate 1
│   │ ...
│   ├ optimal_set                 #    optimal IDR thresholded peaks
│   │ └ *.filt.narrowPeak.gz      #     IDR thresholded narrowPeak (blacklist filtered)
│   ├ conservative_set            #    conservative IDR thresholded peaks
│   │ └ *.filt.narrowPeak.gz      #     IDR thresholded narrowPeak (blacklist filtered)
│   └ pooled_pseudo_reps          #    for pooled pseudo replicates
│
├ qc                              #  QC logs
│ ├ *IDR_final.qc                 #   Final IDR QC
│ ├ rep1                          #   for true replicate 1
│ │ ├ *.flagstat.qc               #    Flagstat QC for raw bam
│ │ ├ *.dup.qc                    #    Picard (or sambamba) MarkDuplicate QC log
│ │ ├ *.pbc.qc                    #    PBC QC
│ │ ├ *.nodup.flagstat.qc         #    Flagstat QC for filtered bam
│ │ ├ *M.cc.qc                    #    Cross-correlation analysis score for tagAlign
│ │ └ *M.cc.plot.pdf/png          #    Cross-correlation analysis plot for tagAlign
│ ...
│
├ signal                          #  signal tracks
│ ├ macs2                         #   signal tracks generated by MACS2
│ │ ├ rep1                        #    for true replicate 1
│ │ │ ├ *.pval.signal.bigwig (E)  #     signal track for p-val
│ │ │ └ *.fc.signal.bigwig   (E)  #     signal track for fold change
│ ...
│ └ pooled_rep                    #   for pooled replicate
│
└ report                          # files for HTML report

QC metrics spreadsheet (TSV) generation

For each pipeline run, an ENCODE_summary.json file is generated under the output directory (-out_dir). This JSON file includes all metadata and QC metrics.

./utils/parse_summary_qc_recursively.py recursively finds ENCODE_summary.json files and parses them to generate one big TSV spreadsheet of QC metrics.

$ python parse_summary_qc_recursively.py -h
usage: ENCODE_summary.json parser for QC [-h] [--out-file OUT_FILE]
                                         [--search-dir SEARCH_DIR]
                                         [--json-file JSON_FILE]

Recursively find ENCODE_summary.json, parse it and make a TSV spreadsheet of
QC metrics.

optional arguments:
  -h, --help            show this help message and exit
  --out-file OUT_FILE   Output TSV filename)
  --search-dir SEARCH_DIR
                        Root directory to search for ENCODE_summary.json
  --json-file JSON_FILE
                        Specify json file name to be parsed
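
For example, to collect QC metrics from all pipeline outputs under ./out into one spreadsheet (paths are illustrative; the flags are those listed in the help message above):

$ python ./utils/parse_summary_qc_recursively.py --search-dir ./out --out-file qc_summary.tsv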

Programming with BDS

Requirements

Troubleshooting

See more troubleshooting tips.

"[gzclose] buffer error" during bwa aln

Example:

[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_read_seq] 0.2% bases are trimmed.
[bwa_aln_core] calculate SA coordinate... 21.39 sec
[bwa_aln_core] write to the disk... 0.10 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_read_seq] 0.2% bases are trimmed.
[bwa_aln_core] calculate SA coordinate... 21.27 sec
[bwa_aln_core] write to the disk... 0.09 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_read_seq] 0.2% bases are trimmed.
[bwa_aln_core] calculate SA coordinate... 21.51 sec
[bwa_aln_core] write to the disk... 0.08 sec
[bwa_aln_core] 786432 sequences have been processed.
[bwa_read_seq] 0.2% bases are trimmed.
[bwa_aln_core] calculate SA coordinate... 21.34 sec
[bwa_aln_core] write to the disk... 0.08 sec
[bwa_aln_core] 1048576 sequences have been processed.
[bwa_read_seq] 0.2% bases are trimmed.
[bwa_aln_core] calculate SA coordinate... 21.83 sec
[bwa_aln_core] write to the disk... 0.07 sec
[bwa_aln_core] 1309550 sequences have been processed.
[gzclose] buffer error

Solution 1 (best): Use bwa-0.7.3 or bwa-0.6.2.

$ git clone https://github.com/lh3/bwa bwa-0.7.3
$ cd bwa-0.7.3
$ git checkout tags/0.7.3
$ make
$ ./bwa

Then, append -addpath /path/to/your/bwa to your command line.

Solution 2: Upgrade zlib to 1.2.8.

$ BWA_VER=0.7.3
$ git clone https://github.com/lh3/bwa
$ cd bwa
$ git checkout tags/${BWA_VER}
$ sed -e's#INCLUDES=#INCLUDES=-Izlib-1.2.8/ #' -e's#-lz#zlib-1.2.8/libz.a#' Makefile > Makefile.zlib
$ make clean
$ wget http://zlib.net/zlib-1.2.8.tar.gz
$ tar xvzf zlib-1.2.8.tar.gz
$ cd zlib-1.2.8
$ ./configure
$ make
$ cd ..
$ make -f Makefile.zlib
$ ./bwa

Tested bwa versions (with zlib 1.2.8)

  • successful: 0.6.2 0.7.1 0.7.2 0.7.3
  • failed: 0.7.4 0.7.5 0.7.7 0.7.8 0.7.11 0.7.12

Cannot allocate memory (bwa fails due to lack of memory)

An example of a job that failed due to lack of memory (on a desktop with 4 cores and 12 GB of memory):

[bam_header_read] EOF marker is absent. The input is probably truncated.
[bwa_read_seq] 0.0% bases are trimmed.
[bwa_aln_core] convert to sequence coordinate... [bwt_restore_sa] Failed to allocate 1547846992 bytes at bwt.c line 404: Cannot allocate memory
[samopen] SAM header is present: 25 sequences.
[sam_read1] reference 'ID:bwa   PN:bwa  VN:0.7.3-r789   CL:bwa samse /home/leepc12/run/index/encodeHg19Male_bwa-0.7.3.fa /home/leepc12/run/ENCSR000EGM3/out/TEST_Rep2.sai /home/leepc12/run/ENCODE/ENCFF000YLY.fastq.gz
' is recognized as '*'.
[main_samview] truncated file.

Solution: balance memory usage between parallel jobs or disable parallel jobs (add '-no_par')

$ python chipseq.py -no_par ...

[samopen] no @SQ lines in the header. ( bwa sam failure )

On computers with limited memory, bwa samse/sampe can fail without returning a non-zero exit value. This leads to pipeline failure or corrupted outputs.

Solution: balance memory usage between parallel jobs or disable parallel jobs (add '-no_par')

$ python chipseq.py -no_par ...

Error: could not find environment: aquas_chipseq

Unload any Anaconda Python modules and remove locally installed Anaconda Python from your $PATH.

SPP error: In min(npld$y[npld$fdr <= fdr])

Warning message:
In min(npld$y[npld$fdr <= fdr]) :
  no non-missing arguments to min; returning Inf
excluding systematic background anomalies ... done
calculating statistical thresholds
FDR 0.99 threshold= Inf
Detected 0 peaks

Check whether the number of reads in your control tagalign is too high, and if so reduce it with -subsample_ctl [NO_READ_TO_SUBSAMPLE_CONTROL]. Try half of the original number of reads in the control.
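
For example (the read counts are illustrative): if your control tagAlign has about 80M reads, try subsampling to 40M.

$ python chipseq.py -subsample_ctl 40000000 ...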

Contributors

  • Jin Wook Lee - PhD Student, Mechanical Engineering Dept., Stanford University
  • Nathan Boley - Postdoc, Dept. of Genetics, Stanford University
  • Anshul Kundaje - Assistant Professor, Dept. of Genetics, Stanford University


chipseq_pipeline's Issues

"-q [SLURM_ACCOUNT_NAME]" interpreted as file

Hey Jin,

I can't figure out how to correctly specify the --account for SLURM job submission on SCG.

Your example: "python chipseq.py -system slurm -q_for_slurm_account -q [SLURM_ACCOUNT_NAME]"

I used:
python /srv/gsfs0/projects/snyder/chappell/TF_chipseq_pipeline/chipseq.py -type histone --screen GFP.H3K4me1 -system slurm -q_for_slurm_account -q mpsnyder -pe -species hg19 -nth 12 -fastq1_1 GFP.H3K4me1.repA.trim.R1.fq.gz -fastq1_2 GFP.H3K4me1.repA.trim.R2.fq.gz -fastq2_1 GFP.H3K4me1.repB.trim.R1.fq.gz -fastq2_2 GFP.H3K4me1.repB.trim.R2.fq.gz -ctl_fastq1_1 GFP.Input.repAB.trim.R1.fq.gz -ctl_fastq1_2 GFP.Input.repAB.trim.R2.fq.gz -out_dir GFP.H3K4me1 -mem_dedup 15G

Get the following error in .log output:
"sbatch: error: Unable to open file mpsnyder"

Building custom genome data

Are there any instructions on how to build custom genome data for a species that is not available in the pipeline? What are the usual resources for finding the relevant data, like the blacklist and ATAC-specific annotation data?

JSD error: "RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa"

Got an error at the JSD step. I used nth=1 (I tried with more, but they somehow did not work either, or said "not enough resources"), on a server with 1 core and 80 GB of memory.
I think the pipeline's install_dependencies.sh installed numpy 1.18.5.

Error pasted below:

03:18:53.464	ExecutionerLocal 'Local[28]': Task selected 'chipseq.bds.20200801_100653_737/task.postalign_bam.jsd.line_812.id_32' on host 'localhost'
RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa
Traceback (most recent call last):
  File "/home/unix/levgenio/software/miniconda3/envs/aquas_chipseq/bin/plotFingerprint", line 4, in <module>
    from deeptools.plotFingerprint import main
  File "/home/unix/levgenio/software/miniconda3/envs/aquas_chipseq/lib/python2.7/site-packages/deeptools/plotFingerprint.py", line 15, in <module>
    import deeptools.countReadsPerBin as countR
  File "/home/unix/levgenio/software/miniconda3/envs/aquas_chipseq/lib/python2.7/site-packages/deeptools/countReadsPerBin.py", line 13, in <module>
    import pyBigWig
ImportError: numpy.core.multiarray failed to import
Task failed:
	Program & line     : '/home/unix/levgenio/software/TF_chipseq_pipeline/modules/postalign_bam.bds', line 812
	Task Name          : 'jsd'
	Task ID            : 'chipseq.bds.20200801_100653_737/task.postalign_bam.jsd.line_812.id_32'
	Task PID           : '22031'
	Task hint          : 'plotFingerprint -b /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/rep1/10222853_Microglia_H3K27ac1_161020Tsa_D16-11125_hg19_bestmap.nodup.bam /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/rep2/10222853_Microglia_H3K27ac2_161020Tsa_D16-11126_hg19_bestmap'
	Task resources     : 'cpus: 1	mem: -1.0 B	wall-timeout: 8640000'
	State              : 'ERROR'
	Dependency state   : 'ERROR'
	Retries available  : '1'
	Input files        : '[/home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/rep1/10222853_Microglia_H3K27ac1_161020Tsa_D16-11125_hg19_bestmap.nodup.bam, /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/rep2/10222853_Microglia_H3K27ac2_161020Tsa_D16-11126_hg19_bestmap.nodup.bam, /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/ctl1/10222853_Microglia_Input1_161020Tsa_D16-11119_hg19_bestmap.nodup.bam]'
	Output files       : '[/home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/qc/10222853_Microglia_out_jsd.png, /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/qc/10222853_Microglia_out_jsd.qc]'
	Script file        : '/broad/compbio/levgenio/code/AQUAS_pipeline/chipseq.bds.20200801_100653_737/task.postalign_bam.jsd.line_812.id_32.sh'
	Exit status        : '1'
	Program            :

		# SYS command. line 813

		 if [[ -f $(which conda) && $(conda env list | grep aquas_chipseq | wc -l) != "0" ]]; then source activate aquas_chipseq; sleep 5; fi;  export PATH=/home/unix/levgenio/software/TF_chipseq_pipeline/.:/home/unix/levgenio/software/TF_chipseq_pipeline/modules:/home/unix/levgenio/software/TF_chipseq_pipeline/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

		# SYS command. line 814

		 plotFingerprint -b /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/rep1/10222853_Microglia_H3K27ac1_161020Tsa_D16-11125_hg19_bestmap.nodup.bam /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/rep2/10222853_Microglia_H3K27ac2_161020Tsa_D16-11126_hg19_bestmap.nodup.bam /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/ctl1/10222853_Microglia_Input1_161020Tsa_D16-11119_hg19_bestmap.nodup.bam --JSDsample /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/align/ctl1/10222853_Microglia_Input1_161020Tsa_D16-11119_hg19_bestmap.nodup.bam \
					--labels rep1 rep2 ctl1 \
					--outQualityMetrics /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/qc/10222853_Microglia_out_jsd.qc \
					--minMappingQuality 30 \
					-T "Fingerprints of different samples" \
					--blackListFileName /home/unix/levgenio/data/hg19/wgEncodeDacMapabilityConsensusExcludable.bed.gz \
					--numberOfProcessors 1 \
					--plotFile /home/unix/levgenio/data/AQUAS/AQUAS_out/10222853_Microglia_out/qc/10222853_Microglia_out_jsd.png

		# SYS command. line 823

		 TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0
	StdErr (100000000 lines)  :
		RuntimeError: module compiled against API version 0xc but this version of numpy is 0xa
		Traceback (most recent call last):
		  File "/home/unix/levgenio/software/miniconda3/envs/aquas_chipseq/bin/plotFingerprint", line 4, in <module>
		    from deeptools.plotFingerprint import main
		  File "/home/unix/levgenio/software/miniconda3/envs/aquas_chipseq/lib/python2.7/site-packages/deeptools/plotFingerprint.py", line 15, in <module>
		    import deeptools.countReadsPerBin as countR
		  File "/home/unix/levgenio/software/miniconda3/envs/aquas_chipseq/lib/python2.7/site-packages/deeptools/countReadsPerBin.py", line 13, in <module>
		    import pyBigWig
		ImportError: numpy.core.multiarray failed to import

03:19:03.543	Wait: Task 'chipseq.bds.20200801_100653_737/task.postalign_bam.jsd.line_812.id_32' finished.
03:19:03.543	Wait: Waiting for task to finish: chipseq.bds.20200801_100653_737/task.postalign_bam.dedup_bam_2_rep1.line_205.id_12, state: FINISHED
03:19:03.543	Wait: Task 'chipseq.bds.20200801_100653_737/task.postalign_bam.dedup_bam_2_rep1.line_205.id_12' finished.
03:19:03.543	Wait: Waiting for task to finish: chipseq.bds.20200801_100653_737/task.postalign_xcor.xcor_rep1.line_102.id_16, state: FINISHED
03:19:03.543	Wait: Task 'chipseq.bds.20200801_100653_737/task.postalign_xcor.xcor_rep1.line_102.id_16' finished.
03:19:03.543	Wait: Waiting for task to finish: chipseq.bds.20200801_100653_737/task.postalign_bam.markdup_bam_picard_ctl1.line_409.id_25, state: FINISHED
03:19:03.543	Wait: Task 'chipseq.bds.20200801_100653_737/task.postalign_bam.markdup_bam_picard_ctl1.line_409.id_25' finished.
03:19:03.543	Wait: Waiting for task to finish: chipseq.bds.20200801_100653_737/task.postalign_bam.dedup_bam_2_ctl2.line_205.id_30, state: FINISHED
03:19:03.543	Wait: Task 'chipseq.bds.20200801_100653_737/task.postalign_bam.dedup_bam_2_ctl2.line_205.id_30' finished.
03:19:03.543	Wait: Waiting for task to finish: chipseq.bds.20200801_100653_737/task.postalign_bam.dedup_bam_1_ctl2.line_155.id_28, state: FINISHED
03:19:03.543	Wait: Task 'chipseq.bds.20200801_100653_737/task.postalign_bam.dedup_bam_1_ctl2.line_155.id_28' finished.
03:19:03.543	Wait: Waiting for task to finish: chipseq.bds.20200801_100653_737/task.postalign_bam.bam_to_tag_rep1.line_595.id_14, state: FINISHED
03:19:03.543	Wait: Task 'chipseq.bds.20200801_100653_737/task.postalign_bam.bam_to_tag_rep1.line_595.id_14' finished.
03:19:03.543	Wait: Waiting for task to finish: chipseq.bds.20200801_100653_737/task.postalign_bam.bam_to_tag_rep1.line_595.id_13, state: FINISHED
03:19:03.543	Wait: Task 'chipseq.bds.20200801_100653_737/task.postalign_bam.bam_to_tag_rep1.line_595.id_13' finished.
Fatal error: /home/unix/levgenio/software/TF_chipseq_pipeline/chipseq.bds, line 414, pos 3. Task/s failed.

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.
03:19:03.705	Writing report file 'chipseq.bds.20200801_100653_737.report.html'
03:19:03.739	Program 'chipseq.bds.20200801_100653_737' finished, exit value: 1, tasks executed: 23, tasks failed: 1, tasks failed names: jsd.
03:19:03.739	Finished. Exit code: 1
03:19:03.739	ExecutionerLocal 'Local[28]': Killed

Installing Genomes: TwoBitToFa not found

Hi!

I installed the pipeline, then tried to install a genome but came across this:

install_genome_data.sh: line 171: twoBitToFa: command not found

I tried to install twobittofa using conda but ran into some issues. Is there any workaround?

Can't download hg19 genome data

Hello,

There is an issue downloading the hg19 genome data: it gets stuck at "HTTP request sent, awaiting response...".
hg38 can be downloaded fine.

Can you please look into this? Thanks!

chipseq pipeline set up

  1. If an error occurs in the jsd step, as shown below:

Fatal error: /home/liangma1/TF_chipseq_pipeline/chipseq.bds, line 410, pos 3. Task/s failed.
chipseq.bds, line 90 : main()
chipseq.bds, line 93 : void main() { // chipseq pipeline starts here
chipseq.bds, line 100 : jsd() // plot fingerprint and compute synthetic JS distance
chipseq.bds, line 384 : void jsd() { // plot fingerprint
chipseq.bds, line 385 : if ( filt_bam.hasKey("ctl1") && filt_bam.hasKey("rep1") ) {

There are two options to solve this:

    • try with -species hg38
    • try disabling the JSD plot (-no_jsd)

  2. If you use a -bam file and a -fastq file at the same time, don't use the same replicate number, e.g.:

bds_scr h3k27ac_noncellcycle_dmso \
~/TF_chipseq_pipeline/chipseq.bds \
-se \
-species hg19 \
-dup_marker sambamba \
-nth 4 \
-type histone \
-mem_macs2 30G \
-fastq1 R1.2.fastq.gz \
-fastq2 R1.3.fastq.gz \
-fastq3 R1.fastq.gz \
-bam4 ac_0.bam \
-ctl_fastq1 o2.gz \
-ctl_fastq2 o3.gz \
-ctl_bam3 o1.bam

  3. -trim_bp is for paired-end datasets only, and only for cross-correlation analysis. There is no way to trim fastqs in the pipeline; you need to trim them yourself.
     -fastq1 ... -bam1 ... means that you are specifying two files for replicate 1, which is not allowed in the pipeline.
     The ChIP-Seq pipeline does not trim adapters.

  4. When your job is done successfully, you will see '== Done report()' at the end of the log file.

aquas_chipseq pipeline idr error

Hello

could you kindly help me fix an error I keep getting when running the aquas_chipseq pipeline? The error I am getting is as below:

00:16:53.803 Wait: Waiting for task to finish: chipseq.bds.20180325_124007_960/task.callpeak_idr.idr2_rep1_pr.line_80.id_36, state: ERROR
Task failed:
Program & line : '/home/adroubsa/TF_chipseq_pipeline/modules/callpeak_idr.bds', line 80
Task Name : 'idr2 rep1-pr'
Task ID : 'chipseq.bds.20180325_124007_960/task.callpeak_idr.idr2_rep1_pr.line_80.id_36'
Task PID : '41355'
Task hint : 'idr --samples /scratch/dragon/intel/adroubsa/HDAC2_Chip_Seq_DMD/HDAC_T6_DMD/DMD_T6_HDAC2_sam_files/HDAC2_DMD_T6_Chip_Seq_analysis/peak/macs2/pseudo_reps/rep1/pr1/D9813_T6_HDAC2_merged_Rep1.nodup.pr1.tagAlign_x_D9813_T6_HDAC2_Input1.nodup.tagAlign.pval0.01.500K.narrowPeak.gz /scratch/dragon/intel/adr'
Task resources : 'cpus: -1 mem: -1.0 B wall-timeout: 8640000'
State : 'ERROR'
Dependency state : 'ERROR'
Retries available : '1'

your kind help is much appreciated

job submission command

Hi Jin,
I have 2 test files that ran successfully; the commands I used are below. The N2 results look weird in the WashU browser. I guess my command is not right. What do you think? Also, I'm not sure whether the controls are needed; we don't have controls.

N1: single-end with replicates and controls

bds_scr h3k27ac_noncellcycle_dmso20180519 ~/TF_chipseq_pipeline/chipseq.bds -q chettys -se -species hg19 -dup_marker sambamba -nth 4 -type histone -mem_bwa 30G -mem_dedup 30G -mem_macs2 30G -out_dir out20180519 -fastq1 Sample_Hues6_P35_DMSO_H3K27ac_R1_2_TGTCGGAT.R1.2.fastq.gz -fastq2 Sample_Hues6_P35_DMSO_H3K27ac_R1_2_TGTCGGAT.R1.3.fastq.gz -fastq3 Sample_Hues6_P35_DMSO_H3K27ac_R1_2_TGTCGGAT.R1.fastq.gz -fastq4 Sample_Hues6_P37_DMSO_H3K27ac_GACCGTTG.R1.fastq.gz -ctl_fastq1 /labs/chettys/liangma1/test/chipseq/wce/wce_for_dmso2.gz -ctl_fastq2 /labs/chettys/liangma1/test/chipseq/wce/wce_for_dmso3.gz

N2: paired-end without replicates and controls

bds_scr SRR3554631 ~/TF_chipseq_pipeline/chipseq.bds -q chettys -pe -species hg19 -dup_marker sambamba -nth 4 -type histone -mem_bwa 30G -mem_dedup 30G -mem_macs2 30G -out_dir out -fastq1_1 SRR3554631_1.fastq.gz -fastq1_2 SRR3554631_2.fastq.gz

bigwig files

N1: out20180519/signal/macs2/pooled_rep/*
N2: out/signal/macs2/rep1/*

JSON key error in example config

In the example JSON config ("example_conf_full.json"), two of the parameters don't match chipseq.py: npeaks_spp (should be cap_num_peak_spp) and use_system (should be system). This throws an error (line 583 of chipseq.py).
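
A corrected fragment would presumably use the key names that chipseq.py expects (values here are placeholders):

{
  "cap_num_peak_spp" : [NUM_PEAKS],
  "system" : "[CLUSTER_ENGINE]"
}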

software installation

bioconda might help make the software installation much faster and more reproducible. For example, if you write a requirements.txt that looks something like:

bwa ==0.7.3
samtools ==0.1.19
bedtools ==2.19
ucsc-wigtobigwig
ucsc-bedgraphtobigwig
ucsc-bigwiginfo
ucsc-bedclip
matplotlib
numpy
scipy
# ...and everything else you need

Then everything can be installed with:

conda install \
--prefix $SOFTWARE_HOME/software_bds \
--file requirements.txt \
--channel bioconda

A couple details:

  • it looks like in install_dependencies.sh you manually separate Python environments, which can also be handled by conda. You would need a second requirements file and would need to choose a different directory
  • there are a couple of tools not yet on bioconda, but that's an easy fix. Exception to that might be MCR due to licensing
  • --prefix $HOME/software_bds in the conda command creates the environment there, minimizing changes to the pipeline

Just a suggestion. I used to have installation scripts like this, but then I found conda, and then bioconda came along and helped even more. It made things much easier for other people to get running, and much less brittle.

Missing `-c` for `macs2 peakcall` ?

It seems that -c is missing between $tag and $ctl:
https://github.com/kundajelab/chipseq_pipeline/blob/master/modules/callpeak_macs2_chipseq.bds#L188

sys macs2 callpeak -t $tag $ctl -f BED -n $prefix.tmp \
	-g $gensz -p $pval_thresh_macs2 --broad --nomodel --shift $shift_macs2 $extsize_param \
	--keep-dup $keep_dup_macs2 $extra_param_macs2

Encode pipeline description has the -c: https://docs.google.com/document/d/1lG_Rd7fnYgRpSIqrIfuVlAz2dW1VaSQThzk836Db99c/edit#

macs2 callpeak -t ${REP1_TA_FILE}.tagAlign.gz -c ${CONTROL_TA_PREFIX}.tagAlign.gz -f BED -n ${PEAK_OUTPUT_DIR}/${CHIP_TA_PREFIX} -g ${GENOMESIZE} -p 1e-2 --nomodel --shift 0 --extsize ${FRAGLEN} --keep-dup all -B --SPMR
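
If the report is correct (i.e. $ctl holds a bare path rather than a string that already includes -c), the fixed BDS line would presumably read:

sys macs2 callpeak -t $tag -c $ctl -f BED -n $prefix.tmp \
	-g $gensz -p $pval_thresh_macs2 --broad --nomodel --shift $shift_macs2 $extsize_param \
	--keep-dup $keep_dup_macs2 $extra_param_macs2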

Citing the AQUAS Pipeline

Hello,

I used the AQUAS TF ChIP-seq peak calling pipeline for work that I will be submitting very soon, and I can't find which DOI to officially cite in the paper. Which DOI is best to associate with the pipeline?

-Michael

aquas_chipseq pipeline SPP error

Hello

could you kindly help me fix an error I keep getting when running the aquas_chipseq pipeline? The error I am getting is as below:
04:55:10.240 Wait: Task 'chipseq.bds.20180405_152211_048/task.callpeak_spp.spp_rep1_pr1.line_70.id_33' finished.
04:55:10.240 Wait: Waiting for task to finish: chipseq.bds.20180405_152211_048/task.callpeak_spp.spp_rep1.line_70.id_31, state: FINISHED
04:55:10.240 Wait: Task 'chipseq.bds.20180405_152211_048/task.callpeak_spp.spp_rep1.line_70.id_31' finished.
04:55:10.240 Waiting for all 'parrallel' to finish.
04:55:10.240 Waiting for parallel 'chipseq.bds.20180405_152211_048_parallel_22' to finish. RunState: FINISHED
04:55:10.240 Waiting for parallel 'chipseq.bds.20180405_152211_048_parallel_23' to finish. RunState: FINISHED
Fatal error: /data/Workspace/Haichao/tool/chipseq_pipeline-master/chipseq.bds, line 1354, pos 2. Task/s failed.
chipseq.bds, line 91 : main()
chipseq.bds, line 94 : void main() { // chipseq pipeline starts here
chipseq.bds, line 103 : if ( !pe_xcor_only ) {
chipseq.bds, line 105 : call_peaks() // call peaks in parallel (MACS2,SPP)
chipseq.bds, line 929 : void call_peaks() {
chipseq.bds, line 1354 : wait

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.
04:55:10.298 Writing report file 'chipseq.bds.20180405_152211_048.report.html'
04:55:10.316 Program 'chipseq.bds.20180405_152211_048' finished, exit value: 1, tasks executed: 31, tasks failed: 1, tasks failed names: spp rep1-pr1.
04:55:10.317 Finished. Exit code: 1
04:55:10.321 ExecutionerLocal 'Local[24]': Killed
bds -c /data/Workspace/Haichao/tool/chipseq_pipeline-master/bds.config -v /data/Workspace/Haichao/tool/chipseq_pipeline-master/chipseq.bds -out_dir ./SCI_final_notrim5_out -pe_no_trim_fastq -pe -type TF -species mm10 -fastq1_1:1 /data/NGS.Labdata/HiSeq_RUN/2018_03_13/trim/mouse-primary-SCI-ChIPseq-STAT3-Day7-rep1.paired.read1.fastq -fastq1_2:1 /data/NGS.Labdata/HiSeq_RUN/2018_03_13/trim/mouse-primary-SCI-ChIPseq-STAT3-Day7-rep1.paired.read2.fastq -ctl_fastq1_1:1 /data/NGS.Labdata/HiSeq_RUN/2018_03_13/trim/stat3-input.paired.read1.fastq -ctl_fastq1_2:1 /data/NGS.Labdata/HiSeq_RUN/2018_03_13/trim/stat3-input.paired.read2.fastq

Thank you!

permission denied / file not found

Hey,

I installed everything according to your instructions without noticing any errors, but I cannot get chipseq.py to work. If I use fastq or bam as input, I get an error of the following form:

fork/exec /media/nico/data/home2/projects/macrophages/chip_ana2/8-ENCODE-pipeline/chipseq.bds.20171207_115700_844/task.postalign_bam.dedup_bam_1_rep1.line_155.id_10.sh: permission denied

If I use peaks as input, the error has the following form:
Error (/home/nico/TF_chipseq_pipeline/modules/input_peak.bds, line 31, pos 33): File not found!
The files pass the initial check, though.

I don't know what to do next.

Here are the stderr outputs from running with a fastq input and with a peak input.

log-17_12_07-11:56:57.txt

log-17_12_07-11:54:38.txt

Suggestion on how to analyze the ChIP-seq data

Hi, I have ChIP-seq data for one patient and one normal person, as well as input for both of them (they were run in different batches and each batch has its own input). I am wondering what the best way to do the analysis is. I have some thoughts, but I think I could get better ideas here.

  1. Compare the patient and the normal person, each against their batch input, to see if there is any difference between them?

  2. Compare the patient against combined controls (input and normal person together) to see if there are any differences?

Thanks.

Trimmed data-sets

Hello,

Is it correct to run the pipeline after trimming adapters in paired-end data-sets? I get different read lengths.

Raquel

Crashing on bam_to_bed_atac step

@leepc12:

(This issue is for bds_atac, but I'm filing it here because the crash is caused by a module in TF_chipseq_pipeline)

The bds_atac pipeline reproducibly crashes on nandi and mitra at the bam_to_bed step, possibly due to bad path input syntax for samtools view. This happens for multiple different samples, both when run as replicates and when run individually.

The source code line where the pipeline crashes with samtools complaining failed to open file for a file that exists and is readable is here: https://github.com/kundajelab/TF_chipseq_pipeline/blob/master/modules/postalign_bam.bds#L444

An example log file (from running bds_atac on mitra) is here: https://gist.github.com/chrisprobert/08a63955967426df1b4aace8d40c6eed, and the associated bds.conf file is here: https://gist.github.com/chrisprobert/f7658b4c92f153e37199ac7f86c718a7

From the log file (relevant line here), it looks like the input to samtools view is:
nonMitoChromosomes=$(samtools view -H "{1=/srv/persistent/cprobert/projects/jcharalel_atac_022616_ATAC_SKBR3_SKRT-28854828/analysis/SKRT-3-ATAC-022616_S1/SKRT-3-ATAC-022616_S1_R1.trim.PE2SE.nodup.bam}" | grep chr | cut -f2 | sed 's/SN://g' | grep -v chrM)

To me, the input path there for samtools view looks suspicious: browsing other parts of the BDS log, the path syntax is a regular path like samtools view -H /srv/.../....bam, not wrapped in brackets like the "{1=/srv/.../....bam}" in the samtools view command above.

For example, here's a samtools view command from elsewhere in the log that executes ok without samtools crashing (note that the input file path is just provided as a path, not with the "{1=...}" syntax):
samtools view -F 1804 -f 2 -q 30 -u /srv/persistent/cprobert/projects/jcharalel_atac_022616_ATAC_SKBR3_SKRT-28854828/analysis/SKRT-3-ATAC-022616_S1/SKRT-3-ATAC-022616_S1_R1.trim.PE2SE.bam | samtools sort -n - /srv/persistent/cprobert/projects/jcharalel_atac_022616_ATAC_SKBR3_SKRT-28854828/analysis/SKRT-3-ATAC-022616_S1/SKRT-3-ATAC-022616_S1_R1.trim.PE2SE.dupmark

Any thoughts on what could be causing this? Perhaps filt_bam is not being set correctly by whatever calls _bam_to_bed_atac?
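For what it's worth, the bracketed "{1=...}" is what Java's map toString produces, so a hedged guess (not verified against the pipeline source) is that a whole BDS map is being interpolated into the sys command instead of a single element. A minimal BDS sketch of the difference, with hypothetical variable and key names:

string{} filt_bam
filt_bam{"1"} = "/srv/data/sample.nodup.bam"  // hypothetical path

whole := "$filt_bam"    // interpolates the whole map, e.g. {1=/srv/data/sample.nodup.bam}
one := filt_bam{"1"}    // picks a single element: /srv/data/sample.nodup.bam
println whole
println one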

Fatal error: TF_chipseq_pipeline/modules/align_bwa.bds, line 87, pos 3. Task/s failed.

Fatal error: TF_chipseq_pipeline/modules/align_bwa.bds, line 87, pos 3. Task/s failed.
chipseq.bds, line 76 : main()
chipseq.bds, line 79 : void main() { // chipseq pipeline starts here
chipseq.bds, line 87 : align() // align and postalign
chipseq.bds, line 318 : void align() {
chipseq.bds, line 355 : for ( int ctl=0; ctl <= 1; ctl++) { // iterate through inputs (ctl==0 : exp. replicate, ctl==1 : control)
chipseq.bds, line 358 : for ( int rep=1; rep <= get_num_rep( ctl ); rep++) {
chipseq.bds, line 364 : if ( no_par ) align( ctl, rep, nth_rep{group} ) // parallel jobs for align() for each replicate and each control
chipseq.bds, line 374 : void align( int ctl, int rep, int nth_rep ) {
chipseq.bds, line 376 : if ( is_se( ctl, rep ) ) align_SE( ctl, rep, nth_rep )
chipseq.bds, line 377 : else align_PE( ctl, rep, nth_rep )
chipseq.bds, line 550 : void align_PE( int ctl, int rep, int nth_rep ) {
chipseq.bds, line 563 : if ( is_input_fastq( ctl, rep ) ) {
chipseq.bds, line 582 : if ( aligner == "bwa" ) {
chipseq.bds, line 583 : ( bam_, flagstat_qc_ ) = bwa_PE( pooled_fastq_pair1, pooled_fastq_pair2, aln_o_dir, qc_o_dir, group, nth_rep )
align_bwa.bds, line 66 : string[] bwa_PE( string fastq1, string fastq2, string o_dir, string log_o_dir, string group, int nth_bwa ) {
align_bwa.bds, line 76 : if ( out <- in ) { // compare file timestamps of in and out (to check if job is already done or not)
align_bwa.bds, line 87 : wait

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.
00:00:51.033 Writing report file 'chipseq.bds.20170529_194354_069.report.html'
00:00:51.155 Program 'chipseq.bds.20170529_194354_069' finished, exit value: 1, tasks executed: 1, tasks failed: 1, tasks failed names: bwa_sam_PE rep1.
00:00:51.156 Finished. Exit code: 1
00:00:51.156 ExecutionerLocal 'Local[41]': Killed

Fatal error: chipseq.bds, line 1354, pos 2. Task/s failed.

Hi @leepc12 ,

Hope you're doing well.

I'm running into the below error in call_peaks():

Fatal error: /path/to/chipseq_pipeline/chipseq.bds, line 1354, pos 2. Task/s failed.
chipseq.bds, line 91 :	main()
chipseq.bds, line 94 :	void main() { // chipseq pipeline starts here
chipseq.bds, line 103 :		if ( !pe_xcor_only ) {
chipseq.bds, line 105 :			call_peaks() // call peaks in parallel (MACS2,SPP)
chipseq.bds, line 929 :	void call_peaks() {
chipseq.bds, line 1354 :		wait

Here are the commands I used:

srun -p part1 --mem=1G --pty /bin/bash 
python /path/to/chipseq.py -type histone -final_stage idr -q part1 -system slurm -title job_1 -screen job_1 -species hg19 -nth 8 -no_jsd -mem_macs2 20G -fastq1 /path/to/sample_1.fastq -fastq2 /path/to/sample_2.fastq -ctl_fastq1 /path/to/ctl_1.fastq -ctl_fastq2 /path/to/ctl_2.fastq -out_dir /path/to/peaks

Full job log attached here:
job_1.txt

I would really appreciate any help with debugging this issue.

-Easwaran

Fatal error: /TF_chipseq_pipeline/chipseq.bds, line 1529, pos 2. Task/s failed.

I am running chipseq_pipeline on our data and the following two commands have been used. The first one has no problems but the second one cannot go through with the error message I attached below too. Please help/advise. Thanks a lot.

COMMANDS:

bds /home/TF_chipseq_pipeline/chipseq.bds -species hg38 -nth 4 -out_dir FOXA2GAII -se -fastq1 SRR074923.fastq.gz -fastq2 SRR074926.fastq.gz -ctl_fastq1 SRR074928.fastq.gz (GOOD)

bds /home/TF_chipseq_pipeline/chipseq.bds -species hg38 -nth 4 -out_dir FOXA2GAI -se -fastq1:1 SRR074921.fastq.gz -fastq1:2 SRR074922.fastq.gz -fastq2:1 SRR074924.fastq.gz -fastq2:2 SRR074925.fastq.gz -ctl_fastq1 SRR074927.fastq.gz -ctl_fastq2 SRR074929.fastq.gz (ERROR)

ERROR:

== git info
Latest git commit : eb3e236 (Thu Aug 3 20:12:41 2017)
Reading parameters from section (default) in file(/home/ysun/TF_chipseq_pipeline/default.env)...

== configuration file info
Hostname : ****************
Configuration file :
Environment file : /home/ysun/TF_chipseq_pipeline/default.env

== parallelization info
No parallel jobs : false
Maximum # threads : 4

== cluster/system info
Walltime (general) : 5h50m
Max. memory (general) : 7G
Force to use a system : local
Process priority (niceness) : 0
Retrial for failed tasks : 0
Submit tasks to a cluster queue :
Unlimited cluster mem./walltime : false

== shell environment info
Conda env. : aquas_chipseq
Conda env. for python3 : aquas_chipseq_py3
Conda bin. directory :

Shell cmd. for init. : if [[ -f $(which conda) && $(conda env list | grep aquas_chipseq | wc -l) != "0" ]]; then source activate aquas_chipseq; sleep 5; fi; export PATH=/home/ysun/TF_chipseq_pipeline/.:/home/ysun/TF_chipseq_pipeline/modules:/home/ysun/TF_chipseq_pipeline/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

Shell cmd. for init.(py3) : if [[ -f $(which conda) && $(conda env list | grep aquas_chipseq_py3 | wc -l) != "0" ]]; then source activate aquas_chipseq_py3; sleep 5; fi; export PATH=/home/ysun/TF_chipseq_pipeline/.:/home/ysun/TF_chipseq_pipeline/modules:/home/ysun/TF_chipseq_pipeline/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

Shell cmd. for fin. : TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0

Cluster task min. len. : 60

Cluster task delay : 0

== output directory/title info
Output dir. : /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI
Title (prefix) : FOXA2GAI
Reading parameters from section (default) in file(/home/ysun/TF_chipseq_pipeline/default.env)...
Reading parameters from section (hg38) in file(/home/ysun/bds_pipeline_genome_data/aquas_chipseq_species.conf)...

== species settings
Species : hg38
Species file : /home/ysun/bds_pipeline_genome_data/aquas_chipseq_species.conf

Species name (WashU browser) : hg38
Ref. genome seq. fasta : /home/ysun/bds_pipeline_genome_data/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
Chr. sizes file : /home/ysun/bds_pipeline_genome_data/hg38/hg38.chrom.sizes
Black list bed : /home/ysun/bds_pipeline_genome_data/hg38/hg38.blacklist.bed.gz
Ref. genome seq. dir. :

== ENCODE accession settings
ENCODE experiment accession :
ENCODE award RFA :
ENCODE assay category :
ENCODE assay title :
ENCODE award :
ENCODE lab :
ENCODE assembly genome :
ENCODE alias prefix : KLAB_PIPELINE

== report settings
URL root for output directory :
Genome coord. for browser tracks :

== align bwa settings
Param. for bwa : -q 5 -l 32 -k 2
BWA index : /home/ysun/bds_pipeline_genome_data/hg38/bwa_index/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
Walltime (bwa) : 47h
Max. memory (bwa) : 12G

== align multimapping settings

alignments reported for multimapping : 0

== postalign bam settings
MAPQ reads rm thresh. : 30
Rm. tag reads with str. :
No dupe removal in filtering raw bam : false
Walltime (bam filter) : 23h
Max. memory (bam filter) : 12G
Dup marker : picard
Use sambamba markdup (instead of picard) : false

== postalign bed/tagalign settings
Max. memory for UNIX shuf : 12G

== postalign cross-corr. analysis settings
Max. memory for UNIX shuf : 12G
User-defined cross-corr. peak strandshift : -1
Extra parameters for cross-corr. analysis :

== callpeak spp settings
Threshold for # peak : 300000
Walltime (spp) : 47h
Max. memory (spp) : 12G
Stack size for run_spp.R :
User-defined cross-corr. peak strandshift; if -1, use frag. len. : -1
Extra parameters for run_spp.R :

== callpeak gem settings
Threshold for # peak in GEM : 300000
Min. length of k-mers in GEM : 6
Max. length of k-mers in GEM : 13
Q-value threshold for GEM : 0.0
Read distribution txt for GEM : /home/ysun/TF_chipseq_pipeline/etc/Read_Distribution_default.txt
Extra parameters for GEM :
Walltime (GEM) : 47h
Max. memory (GEM) : 15G

== callpeak PeakSeq settings
Target FDR for PeakSeq : 0.05
Number of simulations for PeakSeq : 10
Enrichment mapped frag. len. for PeakSeq : -1
Minimum interpeak distance for PeakSeq : -1
Mappability map file for PeakSeq :
Maximum Q-value for PeakSeq : 0.1
Background model for PeakSeq : Simulated
Extra parameters for PeakSeq :
Walltime (PeakSeq) : 47h
Max. memory (PeakSeq) : 12G

== callpeak macs2 settings
Genome size (hs,mm) : hs
Walltime (macs2) : 23h
Max. memory (macs2) : 15G
P-value cutoff (macs2 callpeak) : 0.01
--keep-dup (macs2 callpeak) : all
--extsize (macs2 callpeak); if -1 then use frag. len. : -1
--shift (macs2 callpeak) : 0
Extra parameters for macs2 callpeak :

== callpeak naive overlap settings
Bedtools intersect -nonamecheck : false

== IDR settings
Append IDR threshold to IDR out_dir : false

== chipseq pipeline settings

replicates : 2

Type of ChIP-Seq pipeline : TF
Final stage for ChIP-Seq :
Signal tracks for pooled rep. only : false
Aligner to map raw reads : bwa

reads to subsample for cross-corr. analysis : 15000000

reads to subsample exp. replicates (0: no subsampling): 0

reads to subsample controls (0: no subsampling) : 0

Generate anonymized filt. bam : false
Peak caller for IDR analysis : spp
Control rep. depth ratio : 1.2
Scoring column for IDR : signal.value
IDR threshold : 0.05
Force to use pooled ctl : false
Peak calling for true reps only : false
No peak calling for self pseudo reps : false
Disable cross-correlation analysis : false
Disable g. peak filt. thru. n. peak : false
Disable genome browser tracks : false

== checking chipseq parameters ...

== checking input files ...

Rep1 fastq (SE) :
/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/SRR074921.fastq.gz
/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/SRR074922.fastq.gz
Rep2 fastq (SE) :
/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/SRR074924.fastq.gz
/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/SRR074925.fastq.gz
Control Rep1 fastq (SE) :
/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/SRR074927.fastq.gz
Control Rep2 fastq (SE) :
/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/SRR074929.fastq.gz
Distributing 4 to ...
{rep2=1, rep1=1, ctl1=1, ctl2=1}

== Done align()

== Done pool_tags()

Fewer reads in control 1 than experiment replicate 1. Using pooled controls for replicate 1.

Fewer reads in control 2 than experiment replicate 2. Using pooled controls for replicate 2.
Distributing 4 to ...
[1, 1, 1, 1]

== Done call_peaks()

== Done naive_overlap()
Task failed:
Program & line : '/home/ysun/TF_chipseq_pipeline/modules/callpeak_idr.bds', line 74
Task Name : 'idr2 rep2-pr'
Task ID : 'chipseq.bds.20170920_084412_704/task.callpeak_idr.idr2_rep2_pr.line_74.id_38'
Task PID : '30964'
Task hint : 'idr --samples /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/pseudo_reps/rep2/pr1/SRR074924_SRR074925.nodup.pr1.tagAlign_x_SRR074927.nodup_SRR074929.nodup.tagAlign.regionPeak.gz /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/pseudo_reps/rep2/pr2/SRR074924'
Task resources : 'cpus: -1 mem: -1.0 B wall-timeout: 8640000'
State : 'ERROR'
Dependency state : 'ERROR'
Retries available : '1'
Input files : '[/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/pseudo_reps/rep2/pr1/SRR074924_SRR074925.nodup.pr1.tagAlign_x_SRR074927.nodup_SRR074929.nodup.tagAlign.regionPeak.gz, /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/pseudo_reps/rep2/pr2/SRR074924_SRR074925.nodup.pr2.tagAlign_x_SRR074927.nodup_SRR074929.nodup.tagAlign.regionPeak.gz, /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/rep2/SRR074924_SRR074925.nodup.tagAlign_x_SRR074927.nodup_SRR074929.nodup.tagAlign.regionPeak.gz]'
Output files : '[/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.filt.narrowPeak.gz, /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.unthresholded-peaks.txt.png, /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.log.txt, /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.unthresholded-peaks.txt.gz, /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.filt.12-col.bed.gz]'
Script file : '/home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/chipseq.bds.20170920_084412_704/task.callpeak_idr.idr2_rep2_pr.line_74.id_38.sh'
Exit status : '1'
Program :

	# SYS command. line 76

	 if [[ -f $(which conda) && $(conda env list | grep aquas_chipseq_py3 | wc -l) != "0" ]]; then source activate aquas_chipseq_py3; sleep 5; fi;  export PATH=/home/ysun/TF_chipseq_pipeline/.:/home/ysun/TF_chipseq_pipeline/modules:/home/ysun/TF_chipseq_pipeline/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

	# SYS command. line 78

	 idr --samples /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/pseudo_reps/rep2/pr1/SRR074924_SRR074925.nodup.pr1.tagAlign_x_SRR074927.nodup_SRR074929.nodup.tagAlign.regionPeak.gz /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/pseudo_reps/rep2/pr2/SRR074924_SRR074925.nodup.pr2.tagAlign_x_SRR074927.nodup_SRR074929.nodup.tagAlign.regionPeak.gz --peak-list /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/rep2/SRR074924_SRR074925.nodup.tagAlign_x_SRR074927.nodup_SRR074929.nodup.tagAlign.regionPeak.gz --input-file-type narrowPeak --output-file /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.unthresholded-peaks.txt --rank signal.value --soft-idr-threshold 0.05 --plot --use-best-multisummit-IDR --log-output-file /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.log.txt

	# SYS command. line 82

	 idr_thresh_transformed=$(awk -v p=0.05 'BEGIN{print -log(p)/log(10)}')

	# SYS command. line 85

	 awk 'BEGIN{OFS="\t"} $12>='"${idr_thresh_transformed}"' {if ($2<0) $2=0; print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,"0"}' /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.unthresholded-peaks.txt | sort | uniq | sort -k7n,7n | gzip -nc > /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.13-col.bed.gz

	# SYS command. line 88

	 zcat /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.13-col.bed.gz | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' | gzip -nc > /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.narrowPeak.gz

	# SYS command. line 89

	 zcat /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.13-col.bed.gz | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' | gzip -nc > /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.12-col.bed.gz

	# SYS command. line 91

	 bedtools intersect -v -a <(zcat -f /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.13-col.bed.gz) -b <(zcat -f /home/ysun/bds_pipeline_genome_data/hg38/hg38.blacklist.bed.gz) | grep -P 'chr[\dXY]+[ \t]' | awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}' | gzip -nc > /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.filt.13-col.bed.gz

	# SYS command. line 92

	 zcat /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.filt.13-col.bed.gz | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' | gzip -nc > /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.filt.narrowPeak.gz

	# SYS command. line 93

	 zcat /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.filt.13-col.bed.gz | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' | gzip -nc > /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.filt.12-col.bed.gz

	# SYS command. line 95

	 gzip -f /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.unthresholded-peaks.txt

	# SYS command. line 96

	 rm -f /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.13-col.bed.gz /home/ysun/published/GEO/GSE25836_Soccio_MolEndocr/FOXA2GAI/peak/spp/idr/pseudo_reps/rep2/FOXA2GAI_rep2-pr.IDR0.05.filt.13-col.bed.gz

	# SYS command. line 98

	 TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0

Fatal error: /home/ysun/TF_chipseq_pipeline/chipseq.bds, line 1529, pos 2. Task/s failed.
chipseq.bds, line 79 : main()
chipseq.bds, line 82 : void main() { // chipseq pipeline starts here
chipseq.bds, line 102 : do_idr() // IDR
chipseq.bds, line 1465 : void do_idr() {
chipseq.bds, line 1529 : wait

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.

Documentation for macs2?

This pipeline allows for the use of macs2 in peak calling, but I can't find what parameters are used in the google doc. What parameters are given to macs2 for peak calling? Thanks!
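For reference, the run log reproduced earlier on this page (== callpeak macs2 settings) shows the defaults: p-value cutoff 0.01, --keep-dup all, --shift 0, and --extsize taken from the estimated fragment length. A hedged sketch of the resulting call (file names are placeholders, and the exact flag set may differ by pipeline version):

$ macs2 callpeak -t rep1.tagAlign.gz -c ctl1.tagAlign.gz -f BED -g hs \
    -p 0.01 --nomodel --shift 0 --extsize ${FRAGLEN} --keep-dup all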

Fatal error: /data/fmao/soft/chipseq_pipeline/modules/graphviz.bds, line 126, pos 18. Trying to access element number 1 from list 'etc' (list size: 1).

Final parameter values: [0.86 1.24 0.53 0.01]
Number of reported peaks - 24706/24706 (100.0%)

Number of peaks passing IDR cutoff of 0.05 - 7/24706 (0.0%)

Task has finished (119 seconds).
62944 (process ID) old priority 0, new priority 0
Task has finished (0 seconds).

== Done do_idr()
63010 (process ID) old priority 0, new priority 0
63265 (process ID) old priority 0, new priority 0
Fatal error: /data/fmao/soft/chipseq_pipeline/modules/graphviz.bds, line 126, pos 18. Trying to access element number 1 from list 'etc' (list size: 1).
chipseq.bds, line 67 : main()
chipseq.bds, line 70 : void main() { // chipseq pipeline starts here
chipseq.bds, line 92 : report()
chipseq.bds, line 1204 : void report() {
chipseq.bds, line 1211 : html += html_graph() // graphviz workflow diagram
graphviz.bds, line 83 : string html_graph() { // graph diagram
graphviz.bds, line 90 : dot := _make_dot("$rpt_aux_dir/$prefix"+"workflow.dot")
graphviz.bds, line 108 : string _make_dot( string file ) {
graphviz.bds, line 120 : for ( int k=0; k<_tmp_in.size(); k++ ) {
graphviz.bds, line 126 : box_name := etc[1]

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.

How to modify bds if my SGE cluster uses vf=memory,p=n_cpu?

Hi

I succeeded in submitting my jobs using this pipeline. However, I found that the resources claimed in bds.config do not take effect, so the pipeline keeps running even when my cluster is full of jobs. My SGE cluster uses vf=memory,p=n_cpu instead of h_vmem and slots. We usually submit jobs like:

qsub -cwd -l vf=10g,p=8 -V run.sh

In my bds.config, I used:

sge.pe = make
sge.mem = vf
sge.timeout = h_rt
sge.timeout2 = s_rt
clusterRunAdditionalArgs = -V

The question is: how can I modify my config file to add something like vf=10g,p=8 to the qsub command? Thank you.
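One untested idea, assuming clusterRunAdditionalArgs is appended verbatim to the qsub command line (as the -V above suggests): put the site-specific resource list there, e.g.

sge.pe = make
sge.mem = vf
sge.timeout = h_rt
sge.timeout2 = s_rt
clusterRunAdditionalArgs = -V -l vf=10g,p=8

The drawback is that a fixed vf=10g,p=8 would then apply to every submitted task, regardless of the memory each task actually declares.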

SPP error in one replicate

Hi,
when I run your pipeline with 2 treatment and 2 control conditions, I get the following error after supposedly several attempts to run SPP on rep2 (see below).

Do you have an idea why? Thanks for helping out.

| 9116 | running (RUNNING) | spp rep2 | | if [ $(which run_spp_nodups.R 2> /dev/null | wc -l || echo) == "1" ]; then RUN_SPP=$(which run_spp_nodups.R);; else RUN_SPP=$(which run_spp.R);; fi; Rscript ${RUN_SPP} -c=data/align/rep2/n-flag-n2.nodup.tagAlign.gz -p=1 -i=/ |

03:48:51.697 Writing report file 'chipseq.bds.20171126_120808_625.report.html'
There were 23 warnings (use warnings() to see them)
data/krishnakumar2016/chipseq.bds.20171126_120808_625/task.callpeak_spp.spp_rep2.line_68.id_18.sh: line 38: error_in_spp_output_peak_does_not_exist: command not found
Task failed:
Program & line : 'TF_chipseq_pipeline/modules/callpeak_spp.bds', line 68
Task Name : 'spp rep2'
Task ID : 'chipseq.bds.20171126_120808_625/task.callpeak_spp.spp_rep2.line_68.id_18'
Task PID : '9116'
Task hint : 'if [ $(which run_spp_nodups.R 2> /dev/null | wc -l || echo) == "1" ]; then RUN_SPP=$(which run_spp_nodups.R);; else RUN_SPP=$(which run_spp.R);; fi; Rscript ${RUN_SPP} -c=data/align/rep2/n-flag-n2.nodup.tagAlign.gz -p=1 -i=/'
Task resources : 'cpus: -1 mem: -1.0 B wall-timeout: 8640000'
State : 'ERROR'
Dependency state : 'ERROR'
Retries available : '1'
Input files : '[data/align/rep2/n-flag-n2.nodup.tagAlign.gz, data/align/ctl2/n-ctr-n2.nodup.tagAlign.gz]'
Output files : '[data/peak/spp/rep2/n-flag-n2.nodup.tagAlign_x_n-ctr-n2.nodup.tagAlign.regionPeak.gz, data/peak/spp/rep2/n-flag-n2.nodup.tagAlign_x_n-ctr-n2.nodup.tagAlign.ccscore, data/peak/spp/rep2/n-flag-n2.nodup.tagAlign_x_n-ctr-n2.nodup.tagAlign.pdf]'
Script file : 'data/chipseq.bds.20171126_120808_625/task.callpeak_spp.spp_rep2.line_68.id_18.sh'
Exit status : '1'
Program :

[fread] Unexpected end of file (bwa_sai2sam_pe_core)

I am getting the following error for paired-end ChIP-seq data.
.
.
.
[bwa_read_seq] 2.6% bases are trimmed.
[bwa_read_seq] 1.8% bases are trimmed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[fread] Unexpected end of file

00:18:26.669 Wait: Task 'chipseq.bds.20180516_143724_219_parallel_30/task.align_bwa.bwa_sam_PE_ctl1.line_181.id_10' finished.
Fatal error: /path/TF_chipseq_pipeline/modules/align_bwa.bds, line 87, pos 3. Task/s failed.

I checked the fastq files (i.e. for both the replicate and the control replicate, I checked the pairing of reads between read1 and read2), and they look ok.

Interestingly, this error comes at the end of aligning a control replicate. I have used the same control replicate as the control for a different replicate, and there the whole pipeline ran to completion. How do I fix this? Thanks in advance...
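One sanity check worth trying first (standard tools, nothing pipeline-specific; file names below are placeholders): verify the gzip streams are not truncated and that both mates have the same read count:

$ gzip -t control.read1.fastq.gz && echo "read1 gzip OK"
$ gzip -t control.read2.fastq.gz && echo "read2 gzip OK"
$ echo $(( $(zcat control.read1.fastq.gz | wc -l) / 4 ))    # read count, mate 1
$ echo $(( $(zcat control.read2.fastq.gz | wc -l) / 4 ))    # read count, mate 2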

more than 2 replicates cannot be used with chipseq_pipeline

Hi, I am trying to run chipseq_pipeline with more than 2 replicates using the command below, but it seems that the program cannot take more than 2 replicates. I also attached part of the log file to show which files the program took. Is there any parameter I should set to use more replicates? Thanks.

$ bds /home/ysun/TF_chipseq_pipeline/chipseq.bds -trimmed_fastq TRUE -species hg38 -nth 12 -out_dir CTCF -se -fastq1 CTCF_islet6-1.fastq.gz -fastq2 CTCF_islet6-2.fastq.gz -ctl_fastq1 ctrl_islet2.fastq.gz -ctl_fastq2 ctrl_islet3.fastq.gz -ctl_fastq3 ctrl_islet4.fastq.gz -ctl_fastq4 ctrl_islet5.fastq.gz -ctl_fastq5 ctrl_islet6.fastq.gz

========================================================================
== git info
......

== checking chipseq parameters ...

== checking input files ...

Rep1 fastq (SE) :
/home/ysun/published/GEO/GSE23784_Stitzel_CellMetab/CTCF_islet6-1.fastq.gz
Rep2 fastq (SE) :
/home/ysun/published/GEO/GSE23784_Stitzel_CellMetab/CTCF_islet6-2.fastq.gz
Control Rep1 fastq (SE) :
/home/ysun/published/GEO/GSE23784_Stitzel_CellMetab/ctrl_islet2.fastq.gz
Control Rep2 fastq (SE) :
/home/ysun/published/GEO/GSE23784_Stitzel_CellMetab/ctrl_islet3.fastq.gz
Distributing 12 to ...
{rep2=1, rep1=4, ctl1=6, ctl2=1}

MACS2 KeyError

Hello,
sorry to bother you, but is it possible in some way to update MACS2?
I get a KeyError from MACS2 if there is a contig which doesn't contain any peaks.
It is possible to get around this problem with the following fix in the PeakIO.pyx file:

def get_data_from_chrom (self, str chrom):
    if not self.peaks.has_key(chrom):
        self.peaks[chrom] = []
    return self.peaks[chrom]

This issue is fixed in MACS 2.1.1, but if I change the requirements.txt file, MACS2 doesn't work anymore.

Read length

We have been using the Aquas pipeline successfully with 75bp SE reads. However, our sequencing core is adding NovaSeq machines that will at minimum give 100bp reads. Is there an advantage/disadvantage to the extra read length? Will we need to change anything in the pipeline to adjust for this difference?

Thank you in advance!

ERROR:root:--extsize must >= 1! and chipseq.bds cannot get all replicates to go through

I am running chipseq with 9 replicates and 11 controls as follows:


bds chipseq.bds -final_stage peak -no_pseudo_rep -peak_caller macs2 -species hg19 -nth 12 -out_dir test_hg19 -title B10 -fastq1 SRR8701918.fastq.gz -fastq2 SRR8701919.fastq.gz -fastq3 SRR8701920.fastq.gz -fastq4 SRR8701921.fastq.gz -fastq5 SRR8701922.fastq.gz -fastq6 SRR8701923.fastq.gz -fastq7 SRR8701925.fastq.gz -fastq8 SRR8701926.fastq.gz -fastq9 SRR8701924.fastq.gz -ctl_fastq1 SRR8702298.fastq.gz -ctl_fastq2 SRR8702299.fastq.gz -ctl_fastq3 SRR8702300.fastq.gz -ctl_fastq4 SRR8702301.fastq.gz -ctl_fastq5 SRR8702302.fastq.gz -ctl_fastq6 SRR8702303.fastq.gz -ctl_fastq7 SRR8702304.fastq.gz -ctl_fastq8 SRR8702305.fastq.gz -ctl_fastq9 SRR8702306.fastq.gz -ctl_fastq10 SRR8702307.fastq.gz -ctl_fastq11 SRR8702308.fastq.gz


  1. The alignment seemed fine. But in the "align" folder, I can only see 2 controls: "ctl1 ctl2 pooled_ctl pooled_pseudo_reps pooled_rep pseudo_reps rep1 rep2 rep3 rep4 rep5 rep6 rep7 rep8 rep9". Were the other controls pooled together? Or are 2 controls the most chipseq.bds can take?

  2. The macs2 peak call seemed to fail, and I got a lot of error messages saying "ERROR:root:--extsize must >= 1!"

Any suggestion will be appreciated. Thanks.

Some reps get ERROR:root:--extsize must >= 1!

We have run several ChIP-seq datasets through the Aquas pipeline, and occasionally we will get ERROR:root:--extsize must >= 1!
It will cause any rep with this error to fail.
What causes this error and what can be done to correct it?

I have found that if the pipeline shows this error, we can change the macs2 --extsize parameter to 200 and restart the job (sketched below). The pipeline will pick up from macs2 and apparently complete the job without error. The resulting output seems to be correct when looking at the number of peaks called, peaks passing IDR, and bigwig visualization in IGV. However, the cross-correlation plot is clearly affected and the estimated fragment length is negative. Is this the correct or best way to address this issue?

I have attached the output log from such a run, along with the summary html and an example image of bigwig visualization between reps. Rep1 ran normally, Rep2 had the error.
Thanks.
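For the record, the restart described above can be sketched as below. The flag name -extsize_macs2 is an assumption on my part (the settings dumps on this page only show an "--extsize (macs2 callpeak)" parameter), so confirm it first:

$ python chipseq.py -h | grep -i extsize    # confirm the exact flag name
$ python chipseq.py <original arguments> -extsize_macs2 200    # assumed flag; the pipeline resumes from the macs2 step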

.fq extension not removed from prefix in bwa alignment step

bwa_aln(), bwa_sam() and bwa_sam_PE() do not remove the "fq" extension of the fastq file when generating the prefix for the output files.
This causes the output bam file to be saved with the "fq" extension, which in turn causes a failure in the postalign_bam.bds step with the following error:

[E::hts_open] fail to open file '/path/to/out/samplename.PE2SE.bam' samtools: failed to open "/path/to/out/samplename.PE2SE.bam" for reading: No such file or directory sambamba-sort: not enough data in stream
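For anyone patching locally, a shell sketch of the intended prefix derivation (illustrative only; the pipeline's actual BDS helper may differ): strip a trailing .gz first, then either fastq extension:

fastq=/path/to/samplename.fq.gz    # hypothetical input
prefix=$(basename "$fastq")
prefix=${prefix%.gz}       # drop the compression suffix
prefix=${prefix%.fastq}    # drop .fastq if present
prefix=${prefix%.fq}       # drop .fq if present
echo "$prefix"             # -> samplename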

Issue with bds

Hi, I installed the pipeline as suggested in the README here without any error, but I get the error below when I just type bds or try to run the pipeline. Do you know how I can fix this? I have installed bds as recommended in the GitHub readme. I am not sure if it's a bds issue or something with our cluster. Thanks in advance for your help.

Exception in thread "main" java.lang.UnsupportedClassVersionError: org/bds/Bds : Unsupported major.minor version 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:482)
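For context, an inference from the trace rather than from this thread: class file version 52.0 corresponds to Java 8, so this error means bds is being launched with a Java 7 (or older) runtime. A quick check:

$ java -version    # should report version 1.8 or later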

pipeline deprecated?

Hello,

Thanks for making this software available.

I have been using the current version of chipseq_pipeline for the past few months and would like to continue using it. However, last week I noticed that the readme.md has the message 'This pipeline has been deprecated'.
I took a quick look at the new version of the pipeline; it seems that the steps are largely the same, though the installation is different. Does this sound correct?

Thanks!

shuf: end of file

Hi, I am running the pipeline for the first time on paired-end histone ChIP-seq data with two replicates and two control replicates. I am getting this error:
shuf: /data/path/prefix.trim_50bp.tagAlign.gz: end of file
Program & line : '/software/path/TF_chipseq_pipeline/modules/postalign_bed.bds', line 42

When I checked the latest version of that bds file on GitHub, I found an inserted line:

"no_random_source := false help Disable --random-source for UNIX shuf. Hot fix for end of file error."

Do I need to get the latest changes and rerun the pipeline from the beginning?
On a related note, how do I keep the pipeline up to date? Using 'git pull'?
I did 'git pull origin master', which was probably a mistake, as I am getting this message now:

"error: Your local changes to the following files would be overwritten by merge:
default.env
Please, commit your changes or stash them before you can merge.
Aborting"
Sorry about the naive questions, and thanks in advance.
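A common way out of that git error (standard git usage, assuming you want to keep your edited default.env):

$ git stash                  # shelve the local default.env changes
$ git pull origin master     # update to the latest pipeline
$ git stash pop              # re-apply your edits (resolve conflicts if prompted)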

software installation

I have encountered the following problem.

Error: ERROR: placeholder '/root/miniconda3/envs/_build_placehold_placehold_placehold_placehold_placehold_p' too short in: glib-2.43.0-2

And then I followed your instruction: "If you see the following error, then update your Anaconda with conda update conda and downgrade it to 4.0.5 with conda install conda==4.0.5." But it didn't work.
I found that the problem came from

conda create -n aquas_chipseq --file requirements.txt -y -c defaults -c bioconda -c r

in your install_dependencies.sh.
Then I read conda/conda-build#877 and https://groups.google.com/a/continuum.io/forum/#!topic/anaconda/fgDBJ2YwETI.
My anaconda path is

/share_bio/nas5/amsszhangsh_group/wangcan/software/python/anaconda3

It seemed that when I run

conda create -n aquas_chipseq --file requirements.txt -y -c defaults -c bioconda -c r

the prefix was

/share_bio/nas5/amsszhangsh_group/wangcan/software/python/anaconda3/envs/aquas_chipseq

which consists of 86 characters. Then I tried to limit it to 80 characters, so I trimmed the trailing 6 characters of aquas_chipseq and ran

conda create -n aquas_c --file requirements.txt -y -c defaults -c bioconda -c r

And I succeeded.
I found that conda 4.0.5 and 4.1.11 behave the same on this problem, so the version of conda is not the point.
To solve this problem without modifying your script, I re-installed Anaconda in a directory with a shorter absolute path:

/share_bio/nas5/amsszhangsh_group/wangcan/anaconda3

Then there was no problem when I ran install_dependencies.sh.
If one does not want to re-install Anaconda, I think using conda create -p your_short_path can also work. For example, use

conda create -p /share_bio/nas5/amsszhangsh_group/wangcan/aquas_chipseq

to replace

conda create -n aquas_chipseq

in

conda create -n aquas_chipseq --file requirements.txt -y -c defaults -c bioconda -c r
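To check in advance whether a planned environment prefix is too long, just count its characters; here it is 86, over the roughly 80-character threshold observed above:

$ echo -n "/share_bio/nas5/amsszhangsh_group/wangcan/software/python/anaconda3/envs/aquas_chipseq" | wc -c
86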

Undefined symbol: PyUnicodeUCS2_AsASCIIString

Hi there !

I followed all the very clear instructions you provide to install the pipeline and everything went fine.
I am giving it my very first try on some toy datasets, and while the mapping / filtering / cross-correlation steps run without any issue, I get the following error at the step right after, which forces the pipeline to exit:

== Done align()
00:00:37.633	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_33' to finish. RunState: FINISHED
00:00:37.634	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_32' to finish. RunState: FINISHED
00:00:41.753	Waiting for all 'parrallel' to finish.
00:00:41.754	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_31' to finish. RunState: FINISHED
00:00:41.754	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_30' to finish. RunState: FINISHED
00:00:41.755	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_33' to finish. RunState: FINISHED
00:00:41.755	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_32' to finish. RunState: FINISHED
00:00:42.504	Executioner factory: Creating new executioner type 'LOCAL'
00:00:42.668	ExecutionerLocal 'Local[34]': Queuing task: chipseq.bds.20180117_161156_433/task.postalign_bam.jsd.line_812.id_10
00:00:42.669	Waiting for all tasks to finish.
00:00:42.669	Wait: Waiting for task to finish: chipseq.bds.20180117_161156_433/task.postalign_bam.jsd.line_812.id_10, state: SCHEDULED
00:00:42.669	ExecutionerLocal 'Local[34]': Started running
00:00:42.670	ExecutionerLocal 'Local[34]': Task selected 'chipseq.bds.20180117_161156_433/task.postalign_bam.jsd.line_812.id_10' on host 'localhost'
Traceback (most recent call last):
  File "/pasteur/homes/piroux/miniconda3/envs/aquas_chipseq/bin/plotFingerprint", line 4, in <module>
    from deeptools.plotFingerprint import main
  File "/pasteur/homes/piroux/miniconda3/envs/aquas_chipseq/lib/python2.7/site-packages/deeptools/plotFingerprint.py", line 4, in <module>
    import numpy as np
  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/__init__.py", line 180, in <module>
    from . import add_newdocs
  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/add_newdocs.py", line 13, in <module>
    from numpy.lib import add_newdoc
  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/lib/__init__.py", line 8, in <module>
    from .type_check import *
  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/lib/type_check.py", line 11, in <module>
    import numpy.core.numeric as _nx
  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/core/__init__.py", line 14, in <module>
    from . import multiarray
ImportError: /mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/core/multiarray.so: undefined symbol: PyUnicodeUCS2_AsASCIIString
Task failed:
	Program & line     : '/pasteur/homes/piroux/TF_chipseq_pipeline/modules/postalign_bam.bds', line 812
	Task Name          : 'jsd'
	Task ID            : 'chipseq.bds.20180117_161156_433/task.postalign_bam.jsd.line_812.id_10'
	Task PID           : '30942'
	Task hint          : 'plotFingerprint -b /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/rep1/D0K4Me1_Rep1.nodup.bam /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/rep2/D0K4Me1_Rep2.nodup.bam /pasteur/projet'
	Task resources     : 'cpus: 4	mem: -1.0 B	wall-timeout: 8640000'
	State              : 'ERROR'
	Dependency state   : 'ERROR'
	Retries available  : '1'
	Input files        : '[/pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/rep1/D0K4Me1_Rep1.nodup.bam, /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/rep2/D0K4Me1_Rep2.nodup.bam, /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/ctl1/D0Input_Rep1.nodup.bam]'
	Output files       : '[/pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./qc/._jsd.png, /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./qc/._jsd.qc]'
	Script file        : '/pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/chipseq.bds.20180117_161156_433/task.postalign_bam.jsd.line_812.id_10.sh'
	Exit status        : '1'
	Program            :

		# SYS command. line 813

		 if [[ -f $(which conda) && $(conda env list | grep aquas_chipseq | wc -l) != "0" ]]; then source activate aquas_chipseq; sleep 5; fi;  export PATH=/pasteur/homes/piroux/TF_chipseq_pipeline/.:/pasteur/homes/piroux/TF_chipseq_pipeline/modules:/pasteur/homes/piroux/TF_chipseq_pipeline/utils:${PATH}:/bin:/usr/bin:/usr/local/bin:${HOME}/.bds; set -o pipefail; STARTTIME=$(date +%s)

		# SYS command. line 814

		 plotFingerprint -b /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/rep1/D0K4Me1_Rep1.nodup.bam /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/rep2/D0K4Me1_Rep2.nodup.bam /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/ctl1/D0Input_Rep1.nodup.bam --JSDsample /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./align/ctl1/D0Input_Rep1.nodup.bam \
					--labels rep1 rep2 ctl1 \
					--outQualityMetrics /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./qc/._jsd.qc \
					--minMappingQuality 30 \
					-T "Fingerprints of different samples" \
					--blackListFileName /pasteur/homes/piroux/TF_chipseq_pipeline/genome_data/hg19/wgEncodeDacMapabilityConsensusExcludable.bed.gz \
					--numberOfProcessors 4 \
					--plotFile /pasteur/projets/policy01/Bischof-NGS/work/PROJET_SENESCENCE/ChIP-seq_HISTONE/RAS_OIS/TST_AQUA_2/./qc/._jsd.png

		# SYS command. line 823

		 TASKTIME=$[$(date +%s)-${STARTTIME}]; echo "Task has finished (${TASKTIME} seconds)."; sleep 0
	StdErr (100000000 lines)  :
		Traceback (most recent call last):
		  File "/pasteur/homes/piroux/miniconda3/envs/aquas_chipseq/bin/plotFingerprint", line 4, in <module>
		    from deeptools.plotFingerprint import main
		  File "/pasteur/homes/piroux/miniconda3/envs/aquas_chipseq/lib/python2.7/site-packages/deeptools/plotFingerprint.py", line 4, in <module>
		    import numpy as np
		  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/__init__.py", line 180, in <module>
		    from . import add_newdocs
		  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/add_newdocs.py", line 13, in <module>
		    from numpy.lib import add_newdoc
		  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/lib/__init__.py", line 8, in <module>
		    from .type_check import *
		  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/lib/type_check.py", line 11, in <module>
		    import numpy.core.numeric as _nx
		  File "/mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/core/__init__.py", line 14, in <module>
		    from . import multiarray
		ImportError: /mount/gensoft2/adm/lib/python2.7/site-packages/numpy-1.10.1-py2.7-linux-x86_64.egg/numpy/core/multiarray.so: undefined symbol: PyUnicodeUCS2_AsASCIIString

00:01:00.967	Wait: Task 'chipseq.bds.20180117_161156_433/task.postalign_bam.jsd.line_812.id_10' finished.
00:01:00.968	Waiting for all 'parrallel' to finish.
00:01:00.968	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_31' to finish. RunState: FINISHED
00:01:00.968	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_30' to finish. RunState: FINISHED
00:01:00.968	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_33' to finish. RunState: FINISHED
00:01:00.968	Waiting for parallel 'chipseq.bds.20180117_161156_433_parallel_32' to finish. RunState: FINISHED
Fatal error: /pasteur/homes/piroux/TF_chipseq_pipeline/chipseq.bds, line 410, pos 3. Task/s failed.
chipseq.bds, line 90 :	main()
chipseq.bds, line 93 :	void main() { // chipseq pipeline starts here
chipseq.bds, line 100 :		jsd() // plot fingerprint and compute synthetic JS distance
chipseq.bds, line 384 :	void jsd() { // plot fingerprint
chipseq.bds, line 385 :		if ( filt_bam.hasKey("ctl1") && filt_bam.hasKey("rep1") ) {
chipseq.bds, line 410 :			wait

Creating checkpoint file: Config or command line option disabled checkpoint file creation, nothing done.
00:01:01.076	Writing report file 'chipseq.bds.20180117_161156_433.report.html'
00:01:01.176	Program 'chipseq.bds.20180117_161156_433' finished, exit value: 1, tasks executed: 1, tasks failed: 1, tasks failed names: jsd.
00:01:01.176	Finished. Exit code: 1
00:01:01.177	ExecutionerLocal 'Local[34]': Killed
00:01:01.192	ExecutionerLocal 'Local[34]': Finished running

I assume it is somehow related to the numpy library, but I wasn't able to find any useful thread on the internet. Plus, I must confess my understanding of handling Python libraries is limited...
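A hedged reading of the traceback (an inference, not a confirmed fix): numpy is imported from /mount/gensoft2/adm/lib/python2.7/site-packages rather than from the aquas_chipseq conda environment, which is the classic symptom of a leaked PYTHONPATH overriding the environment's packages. Two quick checks:

$ source activate aquas_chipseq
$ python -c 'import numpy; print(numpy.__file__)'    # should resolve inside the conda env
$ unset PYTHONPATH    # then rerun the pipeline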

Thx a lot for this awesome tool and for your previous help.

Pef
