hpcbio / 16s-rdna-dada2-pipeline Goto Github PK

**DEFUNCT** dada2-based Nextflow pipeline for multiple amplicon data. See website for up-to-date version

Home Page: https://github.com/h3abionet/16S-rDNA-dada2-pipeline

License: MIT License

Nextflow 73.61% Dockerfile 0.39% HTML 3.62% Groovy 5.87% R 14.23% Python 0.80% Shell 1.48%

16s-rdna-dada2-pipeline's Introduction

16S rRNA gene amplicon sequencing pipeline using DADA2, implemented in Nextflow

A dada2-based workflow using the Nextflow workflow manager. The basic pipeline is currently implemented, including some basic read-tracking. This pipeline is adapted from https://github.com/HPCBio/dada2-Nextflow for implementation on the UCT high-performance compute cluster

Basic usage:

This pipeline can be run specifying parameters in a config file or with command line flags.
The typical example for running the pipeline with command line flags is as follows:
nextflow run uct-cbio/16S-rDNA-dada2-pipeline --reads '*_R{1,2}.fastq.gz' --trimFor 24 --trimRev 25 --reference 'gg_13_8_train_set_97.fa.gz' -profile uct_hex
The typical command for running the pipeline with command line flags is as follows:
nextflow run -c <dada2.conf>  <dada2.nf> -profile uct_hext

where:
dada2.conf is the configuration file
dada2.nf   is the pipeline script

To override existing values from the command line, please type these parameters:

Mandatory arguments:
  --reads                       Path to input data (must be surrounded with quotes)
  -profile                      Hardware config to use. Currently profile available for UCT's HPC 'uct_hex' - create your own if necessary
                                NB -profile should always be specified on the command line, not in the config file
  --trimFor                     integer. headcrop of read1 (set 0 if no trimming is needed)
  --trimRev                     integer. headcrop of read2 (set 0 if no trimming is needed)
  --reference                   Path to taxonomic database to be used for annotation (e.g. gg_13_8_train_set_97.fa.gz)

All available read preparation parameters:
  --trimFor                     integer. headcrop of read1
  --trimRev                     integer. headcrop of read2
  --truncFor                    integer. tailcrop of read1. enforced before trimming
  --truncRev                    integer. tailcrop of read2. enforced before trimming
  --maxEEFor                    integer. After truncation, R1 reads with higher than maxEE "expected errors" will be discarded. EE = sum(10^(-Q/10)), default=2
  --maxEERev                    integer. After truncation, R1 reads with higher than maxEE "expected errors" will be discarded. EE = sum(10^(-Q/10)), default=2
  --truncQ                      integer. Truncate reads at the first instance of a quality score less than or equal to truncQ; default=2
  --maxN                        integer. Discard reads with more than maxN number of Ns in read; default=0
  --maxLen                      integer. maximum length of sequence; maxLen is enforced before trimming and truncation; default=Inf (no maximum)
  --minLen                      integer. minLen is enforced after trimming and truncation; default=50
  --rmPhiX                      {"T","F"}. remove PhiX from read
  --minOverlap                  integer. minimum length of the overlap required for merging R1 and R2; default=20 (dada2 package default=12)
  --maxMismatch                 integer. The maximum mismatches allowed in the overlap region; default=0
  --trimOverhang                {"T","F"}. If "T" (true), "overhangs" in the alignment between R1 and R2 are trimmed off.
                                "Overhangs" are when R2 extends past the start of R1, and vice-versa, as can happen when reads are longer than the amplicon and read into the other-direction                                               primer region. Default="F" (false)

Other arguments:
  --pool                        Should sample pooling be used to aid identification of low-abundance ASVs? Options are
                                pseudo pooling: "pseudo", true: "T", false: "F"
  --outdir                      The output directory where the results will be saved
  --email                       Set this parameter to your e-mail address to get a summary e-mail with details of the run
                                sent to you when the workflow exits
  -name                         Name for the pipeline run. If not specified, Nextflow will automatically generate a random mnemonic.

Help:
  --help                        Will print out summary above when executing nextflow run uct-cbio/16S-rDNA-dada2-pipeline

Merging arguments (optional):
  --minOverlap                  The minimum length of the overlap required for merging R1 and R2; default=20 (dada2 package default=12)
  --maxMismatch                 The maximum mismatches allowed in the overlap region; default=0.
  --trimOverhang                If "T" (true), "overhangs" in the alignment between R1 and R2 are trimmed off. "Overhangs" are when R2 extends past the start of R1, and vice-versa, as can happen
                                when reads are longer than the amplicon and read into the other-direction primer region. Default="F" (false)

Taxonomic arguments (optional):
  --species                     Specify path to fasta file. See dada2 addSpecies() for more detail.

 Example run:
 To run on UCT hex
 1) Start a 'screen' session from the headnode
 2) Start an interactive job using: qsub -I -q UCTlong -l nodes=1:series600:ppn=1 -d `pwd`
 3) A typical command would look something like:
    nextflow run uct-cbio/16S-rDNA-dada2-pipeline --trimFor 24 --trimRev 25 --reference             /specify/relevant/directory/gg_13_8_train_set_97.fa.gz --email [email protected] -profile uct_hex --reads  '/specify/relevant/directory/*{R1,R2}.fastq' -with-singularity /scratch/DB/bio/singularity-containers/1a32017e5935-2018-05-31-  db3a9cebe9fc.img --pool 'pseudo'

Prerequisites

Nextflow, dada2 (>= 1.8), R (>= 3.2.0), Rcpp (>= 0.11.2), methods (>= 3.2.0), DECIPHER, phangorn, biomformat Note: if you are working on UCT hex you can simply use the singularity image specified in the uct_hex profile (no need to install these R packages)

Documentation

The uct-cbio/16S-rDNA-dada2-pipeline pipeline comes with documentation about the pipeline, found in the docs/ directory:

Built With

Credits

The initial implementation of the DADA2 pipeline as a Nextflow workflow (https://github.com/HPCBio/dada2-Nextflow) was done by Chris Fields from the high performance computational biology unit at the University of Illinois (http://www.hpcbio.illinois.edu). Please remember to cite the authors of DADA2 when using this pipeline. Further development to the Nextflow workflow and containerisation in Docker and Singularity for implementation on UCT's HPC was done by Dr Katie Lennard and Gerrit Botha, with inspiration and code snippets from Phil Ewels http://nf-co.re/

License

This project is licensed under the MIT License - see the LICENSE.md file for details

16s-rdna-dada2-pipeline's People

Contributors

Stargazers

Watchers

Forkers

uct-cbio brookelab jsan4christ mengqingren replikation eddug carolm5 xiangrong131 ningshuang-yao

16s-rdna-dada2-pipeline's Issues

Config files not working

Other nf.core pipelines, like methylseq, also work by passing in a configuration file (with all of the parameters, like so):

nextflow run methylseq/ -c green.conf -profile uiuc_biocluster -qs 8 -resume

This runs perfectly fine, and is actually a great way to encapsulate all the params used in a run. However, when trying the same thing with this workflow and this config:

name = "Testing"
// Pipeline Options
reads = "/home/groups/hpcbio/projects/benchmarking/dada2/testruns/round2/test-seqs2/*R{1,2}.fastq"
// timestamp = new SimpleDateFormat("yyyy-MM-dd").format(new java.util.Date())
// outdir = "./" + timestamp + "-dada2"

// Trimming
trimFor = 19
trimRev = 20
truncFor = 0
truncRev = 0
maxEEFor = 2
maxEERev = 2
truncQ = 2 //default
maxN = 0 //default
maxLen = "Inf" //default
minLen = 20 //default
rmPhiX = "FALSE"

// Merging
minOverlap = 20 //default=12
maxMismatch = 0 //default
trimOverhang = "T"

reference = "/home/groups/hpcbio/projects/benchmarking/dada2/data/silva_nr_v132_train_set.fa.gz"
species = "/home/groups/hpcbio/projects/benchmarking/dada2/data/silva_species_assignment_v132.fa.gz"

// NYI, for dada sample inference pooling (requires all samples)
pool = "F"

email = "[email protected]"
plaintext_email = "[email protected]"

I get this:

[cjfields@biologin-hpcbio round2]$ nextflow run dada2-Nextflow -c dada2.config -profile uiuc_biocluster -qs 8 -resume
N E X T F L O W  ~  version 0.31.1
Launching `dada2-Nextflow/main.nf` [angry_legentil] - revision: f19de2ceef
ERROR ~ Must set length of R1 (--trimFor) that needs to be trimmed (set 0 if no trimming is needed)

 -- Check '.nextflow.log' file for details

Clean up and fix outputs

Running into a few issues with outputs, particularly ones where sequences are concatenated. We should do the following for consistency:

Convert the sequence-based IDs to something a bit simpler and generic. The sequence-based ones work if we use phangorn, but they break with tools like Fasttree (it chops the names)
Dump sequence outputs (both unaligned and aligned seqs)
Dump tab-delim ASV and tax tables

ERROR ~ Error executing process > 'runFastQC

Hi,

Am trying to run the pipeline on Centos 7 and I get the following error,

ERROR ~ Error executing process > 'runFastQC (rFQC.120003-1020_S1_L001_R1_001)'

Caused by:
No signature of method: nextflow.processor.TaskPath.get() is applicable for argument types: (Integer) values: [0]
Possible solutions: getAt(int), grep(), getAt(java.lang.String), grep(java.lang.Object), head(int), wait()

Source block:
"""
mkdir ${pairId}_fastqc
fastqc --outdir ${pairId}_fastqc
${in_fastq.get(0)}
${in_fastq.get(1)}
"""

Tip: when you have fixed the problem you can continue the execution appending to the nextflow command line the option -resume

-- Check '.nextflow.log' file for details

Frome the log, I have the following information:

Apr-25 13:11:33.591 [Actor Thread 3] ERROR nextflow.processor.TaskProcessor - Error executing process > 'runFastQC (rFQC.120003-1020_S1_L001_R1_001)'

Apr-25 13:11:33.610 [Actor Thread 3] DEBUG nextflow.Session - Unexpected error dumping DGA status
java.util.ConcurrentModificationException: null
at java.util.LinkedHashMap$LinkedHashIterator.nextNode(LinkedHashMap.java:719)

Please advise,

Alternative taxonomic classifiers?

dada2 uses the RDP classifier and Silva; should we have an option for others?

Have option for QIIME2-like hash IDs for ASV names

At the moment we rename and use a very generic naming scheme. We could have a simple option to alternatively use library(digest) to generate MD5 for the IDs based on the sequences.

Testing and integration

Add a small toy data set for testing.

Add OMEGA_A and OMEGA_C settings?

It might be worth adding an option to set OMEGA_A and OMEGA_C:

benjjneb/dada2#354 (comment)

We have a few data sets that might benefit from this.

Sanity check primers

One issue we should check is whether the primers used in the amplification match the 5' ends of the sequence. PI mentioned that some custom amplicons may not match up

Rename repo?

We are using ITS now, and I may add custom amplicon pretty soon. So the '16S-rDNA-' part is becoming irrelevant.

Pull in upstream commits?

CBIO is working on a variation of this workflow which we should probably standardize on.

Add flag for stopping after trimming/MultiQC

See subject

Merge + concatenation

A few data sets (ITS in particular) have enough variation in amplicon length that the resulting sequences are likely a mixture of:

Reads that can be merged (ends overlap with confidence)
And reads that cannot, where the amplified fragment length is long enough sequences can't effectively overlap

How do we deal with these data in the pipeline? There is an example of one method for tackling this:

benjjneb/dada2#537

Add QC for assigned ASVs

We need a simple QC step added for ASVs which summarize ranks assigned/unassigned by whichever classifier used. Could be added in right after assignment.

Alternative MSA tools?

Should we have alternatives to DECIPHER, such as Infernal (+ covariance model ala IM-TORNADO), MAFFT, etc?

Nextflow and TRUE/FALSE flags in R

Boolean parameter flags for Groovy, such as true or false, are not being interpolated correctly within the Rscript section of processes (R expects upper-case boolean). So particular parameters relying on this switch are broken, such as trimOverhang and rmPhiX and are currently hard-coded.

> TRUE
[1] TRUE
> true
Error: object 'true' not found
>

Add additional options for generating phylogenetic trees

phangorn is notoriously slow in generating phylogenetic trees from data and may take days with a large data set. We should add an option to use something like Fasttree or RaXML.

Add low complexity filtering

We've run into low-complexity sequences sneaking through several times, particularly in runs with low conc samples or in very high coverage samples.

Portability

Testing portability of UCT modifications (e.g. using modules instead of Singularity, etc).

Better sample names?

Currently the sample IDs are derived from the FASTQ files. We could re-engineer the workflow to use a sample sheet similar to QIIME (which could optionally include experimental metadata).

Set up Nextflow process selectors for config file

Note the output from nextflow 0.31.1 regarding process selectors:

WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$runFastQC = <value>` with a process selector
[warm up] executor > slurm
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$runMultiQC = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$filterAndTrim = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$runFastQC_postfilterandtrim = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$runMultiQC_postfilterandtrim = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$mergeTrimmedTable = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$LearnErrorsFor = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$LearnErrorsRev = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$SampleInferDerepAndMerge = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$mergeDadaRDS = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$SequenceTable = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$ChimeraTaxonomySpecies = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$BiomFile = <value>` with a process selector
WARN: Process configuration syntax $processName has been deprecated -- Replace `process.$ReadTracking = <value>` with a process selector

dada2 on NovaSeq

We're starting to see more and more binned FASTQ data coming in, and this may be a problem with NovaSeq:

benjjneb/dada2#791

We could add a workaround based on what was proposed there.

Update read tracking for ITS

ITS has a few pre-processing steps that are run, including removal of reads with N's and using cutadapt prior to a final filterAndTrim step. Due to this, initial read counts and some of the intermediate processing steps are missing in the final read tracking report.