
disco's Introduction

DISCO

DISCO, Distributed Co-assembly of Overlap graphs, is a multi-threaded, multi-process, distributed-memory overlap-layout-consensus (OLC) metagenome assembler. The detailed user manual, including how to use the assembler to achieve the best results, is provided here: http://disco.omicsbio.org/user-manual. This is a quick start guide intended mainly for developers and testers; users with limited genome assembly experience are advised to follow the user manual.

Current Version

  • v1.0

Setup and Installation

Basic Dependencies

  1. GNU GCC with C++11 support, i.e. GCC 4.9 or above
  2. MPI library with MPI-3 support, i.e. OpenMPI 1.8 and above or cray-mpich/7.4.0 and above. By default the mpic++ wrapper is used. If you are on a Cray cluster where the wrapper is "CC", edit the compiler.mk file: uncomment the line "CC := CC" and comment out "CC := mpic++".
  3. zlib/1.2.8, optional for reading gzipped fasta/fastq files.
  4. Java Runtime Environment (build 1.7), optional for trimming, filtering and error correction of reads using BBTools.
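
A quick way to sanity-check these dependencies before building is sketched below (a minimal sketch assuming the GNU toolchain and the mpic++ wrapper; substitute "CC" on Cray systems):

#!/bin/bash

# Check that the compiler and MPI wrapper are recent enough for DISCO
g++ --version | head -1      # expect GCC 4.9 or above (C++11 support)
mpic++ --version | head -1   # expect an MPI-3 capable library, e.g. OpenMPI 1.8+
java -version                # optional; only needed for the BBTools pre-processing scripts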

Installation Steps

  1. Download the tarball with compiled executables for Linux with GCC 4.9 and above from https://github.com/abiswas-odu/Disco/releases. The code has been tested only on Linux, compiled with GCC 4.9 and OpenMPI 1.8.4.
  2. If you decide to download the source code, build it with one of the following make options:
       • OpenMP version: "make openmp" (this is also the default make option)
       • MPI distributed computing version: "make mpi-dist-comp"
       • MPI distributed memory version: "make mpi-dist-mem"
       • All versions: "make all"
  3. Any of the versions can be built with the make option "READGZ=1" to read gzipped files. If compilation succeeds, the required executables are built and the various runDisco... scripts can be used to run the assembler.
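
For example, a typical build from a cloned source tree might look like the following (a sketch; pick the make target that matches how you plan to run the assembler):

#!/bin/bash

# Default single-machine OpenMP build, with gzipped-input support enabled
make openmp READGZ=1

# MPI distributed computing build
make mpi-dist-comp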

Quickly Running An Assembly

There are two basic versions of the assembler: one for running on a single machine and another for running with MPI on a cluster. Both versions require pre-processing of the raw Illumina reads, and we provide two scripts to perform it. The details are given in the Preprocessing of the Illumina data section below. If your data is already pre-processed, continue to the Quickly Running DISCO section.

  • Data Pre-processing and Assembly: The following commands show the usage of the script for running data pre-processing and assembly on a single machine with multiple cores.
#!/bin/bash

# Pre-processing and assembly of separated paired end reads
runAssembly.sh -d ${output_dir} -in1 readA_1.fastq -in2 readA_2.fastq -n ${num_threads} -o ${OP_PREFIX} 

# Pre-processing and assembly of interleaved paired end reads
runAssembly.sh -d ${output_dir} -inP readA.fastq.gz,readB.fastq.gz -n ${num_threads} -o ${OP_PREFIX} 

  • Data Pre-processing: The following commands show the usage of the script for running data pre-processing on a single machine with multiple cores. The pre-processed data thus produced can be assembled with the distributed version of DISCO.
#!/bin/bash

# Pre-processing of separated paired end reads
runECC.sh -d ${output_dir} -in1 readA_1.fastq -in2 readA_2.fastq -n ${num_threads} -o ${OP_PREFIX} 

# Pre-processing of interleaved paired end reads
runECC.sh -d ${output_dir} -inP readA.fastq.gz,readB.fastq.gz -n ${num_threads} -o ${OP_PREFIX} 

Quickly Running DISCO

There are two versions of the assembler: one for running on a single machine and one for running with MPI on a cluster.

  • Single Machine Version: Use this version if you are going to run the assembler on a single machine with one or more cores. The assembler is invoked through the run script ./runDisco.sh. Make sure the RAM on the machine is larger than the disk space taken by the uncompressed reads. The quick start commands shown below can be used in a batch job submission script or typed directly on the command line.
#!/bin/bash

# Separated paired end reads
runDisco.sh -d ${output_dir} -in1 readA_1.fastq -in2 readA_2.fastq -n ${num_threads} -o ${OP_PREFIX} 

# Interleaved paired end reads
runDisco.sh -d ${output_dir} -inP readA.fastq.gz,readB.fastq.gz -n ${num_threads} -o ${OP_PREFIX} 

Use ./runDisco.sh -h for help information.
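
For a long assembly on a shared machine, the same command can be run in the background with its output logged (a minimal sketch with hypothetical file names):

#!/bin/bash

# Run the assembly in the background and capture all output in a log file
nohup ./runDisco.sh -d disco_out -in1 sample_1.fastq -in2 sample_2.fastq -n 16 -o sample > disco_run.log 2>&1 &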

  • MPI Version: This version of the assembler should be used if you are going to run the assembler with MPI support on a cluster. The run script to invoke the assembler depends on the cluster management and job scheduling system.

    1. If you have ORTE, i.e. mpirun is available, invoke the assembler using the run script runDisco-MPI.sh.
    2. If you have SLURM, i.e. srun is available, invoke the assembler using the run script runDisco-MPI-SLURM.sh.
    3. If you have ALPS, i.e. aprun is available, invoke the assembler using the run script runDisco-MPI-ALPS.sh.

For the basic MPI version, make sure the RAM on each node is larger than the disk space taken by the reads. If you have a large dataset, use the Remote Memory Access (RMA) version, which distributes about 70% of the memory usage evenly across all the MPI nodes. The quick start commands are:

#!/bin/bash

### MPI Version
### Separated paired end reads
runDisco-MPI.sh -d ${output_dir} -in1 ${read_1.fastq} -in2 ${read_2.fastq} -o ${OP_PREFIX} 

### MPI Remote Memory Access (RMA) Version
### Separated paired end reads
runDisco-MPI.sh -d ${output_dir} -in1 ${read_1.fastq} -in2 ${read_2.fastq} -o ${OP_PREFIX} -rma 

Use runDisco-MPI.sh -h for help information.
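
On a SLURM cluster, the run script would typically be launched from inside a batch allocation, for example (a hypothetical job script sketch; adjust the node and CPU counts to your cluster and data size):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32

# Launch the SLURM-aware run script from inside the allocation
runDisco-MPI-SLURM.sh -d ${output_dir} -in1 ${read_1.fastq} -in2 ${read_2.fastq} -o ${OP_PREFIX}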

Guide to Assembly of Raw Metagenomic Illumina data

The raw Illumina sequences need to be preprocessed before assembly with Disco. Disco provides wrapper scripts to perform the preprocessing with BBTools; please see the user manual for more details: http://disco.omicsbio.org/user-manual. We package BBTools inside our release for ease of use, and the BBTools scripts shown below are available in the bbmap directory.

Preprocessing of the Illumina data

Since Disco works best with error-free reads, preprocessing plays an important role in determining the quality of the assembly results. The three basic pre-processing steps are trimming, filtering and error correction.

Trimming, filtering, (merging), and error correction

We have tested Brian Bushnell's suite of tools, BBTools, extensively on Illumina data and have obtained good results. Supposing the Illumina read data set is called $reads, we recommend the following steps:

#!sh

# Use bbduk.sh to quality- and length-trim the Illumina reads and remove adapter sequences
# 1. ftm=5, right-trim read length to a multiple of 5
# 2. k=23, kmer length used for finding contaminants
# 3. ktrim=r, trim reads to remove bases matching reference kmers to the right
# 4. mink=7, look for shorter kmers at read tips down to 7 bps
# 5. hdist=1, hamming distance for kmer matching
# 6. tbo, trim adapters based on where paired reads overlap
# 7. tpe, when kmer right-trimming, trim both reads to the minimum length of either
# 8. qtrim=r, trim read right ends to remove bases with low quality
# 9. trimq=15, regions with average quality below 15 will be trimmed
# 10. minlength=70, reads shorter than 70 bps after trimming will be discarded (optional, not used below)
# 11. ref=$adapters, adapter sequences shipped with BBTools
# 12. -Xmx8g, use 8 GB memory (optional, not used below)
# 13. 1>trim.o 2>&1, redirect stderr to stdout and save both to the file trim.o (optional, not used below)
adapters=bbmap/resources/adapters.fa
artifacts=bbmap/resources/sequencing_artifacts.fa.gz
phiX_adapters=bbmap/resources/phix174_ill.ref.fa.gz
bbduk.sh in=$reads out=trim.fq.gz ktrim=r k=23 mink=7 hdist=1 tpe tbo ref=${adapters} ftm=5 qtrim=r trimq=15
bbduk.sh in=trim.fq.gz out=filter.fq.gz k=23 hdist=1 ref=${artifacts},${phiX_adapters}

Error correction with BBMerge and Tadpole

Tadpole is a memory-efficient error correction tool from the BBTools package that runs within reasonable time. We also use the BBMerge tool from the same package to error-correct the overlapping paired end reads. We suggest the following commands for error correction.

#!bash
# 1. ecco mode of bbmerge for correction of overlapping paired end reads without merging
# 2. mode=correct, use tadpole for correction
bbmerge.sh in=filter.fq.gz out=ecc.fq.gz ecco mix adapters=default
tadpole.sh in=ecc.fq.gz out=tecc.fq.gz ecc ordered prefilter=1
# If the above runs out of memory, try
tadpole.sh in=ecc.fq.gz out=tecc.fq.gz ecc ordered prefilter=2

Assembly of Error Corrected Data

Assembly on a Single Node

The Disco assembler is invoked through the run script ./runDisco.sh. The basic quick start commands with default parameters are shown below. The default parameters are based on empirical tests on real metagenomic datasets.

#!/bin/bash

# Separated paired end reads
runDisco.sh -d ${output_dir} -in1 ${read_1.fastq} -in2 ${read_2.fastq} -n ${num_threads} -m ${max_mem_usage} -o ${OP_PREFIX} 

# Interleaved paired end reads
runDisco.sh -d ${output_dir} -inP ${read_P.fastq} -n ${num_threads} -m ${max_mem_usage} -o ${OP_PREFIX}

# Single end reads
runDisco.sh -d ${output_dir} -inS ${read.fastq} -n ${num_threads} -m ${max_mem_usage} -o ${OP_PREFIX} 

For all the options of Disco, use ./runDisco.sh -h.

In case the program crashes due to exceeding wall clock time, the assembler can be restarted with the same command.

Assembly on Distributed Nodes

The assembler can be run on distributed nodes using the three MPI run scripts described in the MPI Version section above.

Assembly Run Script Options

Usage:

runDisco.sh [OPTION]...

-inS single read filenames (comma separated fasta/fastq/fastq.gz files).

-in1 forward paired read filename (single fasta/fastq/fastq.gz file).

-in2 reverse paired read filename (single fasta/fastq/fastq.gz file).

-inP interleaved paired read filenames (comma separated fasta/fastq/fastq.gz files).

-d output directory path (DEFAULT: current directory).

-o output filename prefix (DEFAULT: disco).

-h help.

-m maximum memory to be used (DEFAULT: 125 GB).

-n number of threads (DEFAULT: 32).

-obg only build overlap graph (DEFAULT: False).

-osg only simplify existing overlap graph (DEFAULT: False).

-p assembly parameter file for 1st assembly iteration.

-p2 assembly parameter file for 2nd assembly iteration.

-p3 assembly parameter file for 3rd assembly iteration.

The assembly script has basic options to specify required parameters.
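
For example, the -obg and -osg options can split an assembly into two phases, so the expensive overlap graph construction runs once and the simplification step can be repeated with different parameters (a sketch; it assumes the same output directory and prefix are reused between phases):

#!/bin/bash

# Phase 1: only build the overlap graph
runDisco.sh -d ${output_dir} -inP ${read_P.fastq} -o ${OP_PREFIX} -obg

# Phase 2: only simplify the existing overlap graph, optionally with a custom parameter file
runDisco.sh -d ${output_dir} -inP ${read_P.fastq} -o ${OP_PREFIX} -osg -p ${param_file}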

Controlling memory usage

The memory usage of Disco can be controlled using the -m option to the run script as shown above. By default, Disco uses all available system memory. In case that has to be avoided, or the program crashes or is too slow due to memory page swapping, the user can set an upper bound on the memory. The minimum memory needed to assemble a dataset is:

Min Required Memory (GB) = (Disk Space of Reads) + (1GB * num_threads)

The program will run faster if more memory is made available. For example, 100 GB of uncompressed reads assembled with 32 threads require at least 100 + (1 * 32) = 132 GB of RAM.
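
The minimum can be estimated directly from the input files, for example (a sketch assuming GNU coreutils and uncompressed reads):

#!/bin/bash

# Estimate the minimum RAM needed using the formula above
reads_bytes=$(du -cb readA_1.fastq readA_2.fastq | tail -1 | cut -f1)
reads_gb=$(( (reads_bytes + 1073741823) / 1073741824 ))   # round up to whole GB
num_threads=32
echo "Minimum required memory: $(( reads_gb + num_threads )) GB"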

Restarting Disco for repeat assembly and handling assembly crashes

The Disco assembler can be restarted with changed assembly and scaffolding parameters using the -osg option. Setting this option when invoking runDisco.sh reuses the overlap graph constructed earlier and only performs the graph simplification step. This significantly reduces the execution time of repeat assemblies of the same dataset with different parameters.

The Disco assembler can also be restarted after a crash caused by exceeding the wall clock time or by an out-of-memory error. The job must be restarted with the same command as before, and Disco will attempt to continue the assembly. Do not set the -osg option in this case.

Setting assembly parameters

The assembly parameters can be modified to attempt better assembly. This can be done through a parameter file passed using the -p parameter to the run script.

The configurable parameters are described in the user manual http://disco.omicsbio.org/user-manual.

The default configuration parameters are in disco.cfg, disco_2.cfg, and disco_3.cfg.
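
For example, to change parameters for the first assembly iteration only, a copy of the default file can be edited and passed with -p (a sketch; my_params.cfg is a hypothetical copy of the shipped disco.cfg):

#!/bin/bash

# Copy the default parameter file, edit it, and pass one file per iteration
cp disco.cfg my_params.cfg
runDisco.sh -d ${output_dir} -inP ${read_P.fastq} -o ${OP_PREFIX} -p my_params.cfg -p2 disco_2.cfg -p3 disco_3.cfg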

Disco Assembler Output

Please see the OUTPUT.md file for description of the output files.

Questions?


disco's Issues

Disco finishes with "Error 2"

Dear,

I've been trying to run Disco with:
runDisco.sh -inS INPUT.fq.gz -n 10 -m 100 -d OUT/INPUT -o INPUT

but it ends with an error:

[...]
>>> Function start: removeSimilarEdges()
<<< Function stop: removeSimilarEdges(), Elapsed time: 7.14299e-05 seconds, Memory usage: 3635 - 3635 = 0 MB.
----
numberOfEdges = 0

>>> Function start: removeDeadEndNodes()
number of dead end nodes found: 0
number of edges deleted: 0
<<< Function stop: removeDeadEndNodes(), Elapsed time: 3.18663 seconds, Memory usage: 3665 - 3635 = 30 MB.
----
numberOfEdges = 0
<<< Function stop: simplifyGraph(), Elapsed time: 12.6248 seconds, Memory usage: 3665 - 3635 = 30 MB.
----

>>> Function start: printAllEdges()
<<< Function stop: printAllEdges(), Elapsed time: 0.0738916 seconds, Memory usage: 3665 - 3665 = 0 MB.
----

>>> Function start: calculateFlowStream()
Real graph size:0
Recorded graph size:0
Set source sink edges for each node.
Finished initializing flow to edges.
Number of edges with flow 1 set is 0
Number of reads contained in these edges is 0
Calling CS2 for flow analysis

warning: infinite capacity replaced by BIGGEST_FLOW

Error 2

Should I worry about the warning?
And what is the error?
Thanks,

Problems with README and runDisco.sh?

I am trying to get Disco running. Apart from this issue, in following the README I have two questions on the section under quickly running the assembler.

  • First, what is ${OP_PREFIX}? Can it be arbitrary? I set it to xxx.

  • When I run runDisco it seems like things must be failing well before the error message :)

./runDisco.sh -d podar -inP SRR606245.pe.qc.fq.gz -n 8 -o xxx
Graph construction module ./buildG exists.
Partial graph simplification module ./parsimplify exists.
Graph simplification module ./fullsimplify exists.
Cresting output directory: podar
Starting Time is Mon Jun 5 14:28:14 EDT 2017
cat: podar/assembly/xxx_contigsFinal_*.fasta: No such file or directory
cat: podar/assembly/xxx_scaffoldsFinal_*.fasta: No such file or directory
Ending Time is Mon Jun 5 14:28:15 EDT 2017

the log file indicates that the file didn't exist (which was my fault - a faulty ln -s) but it seems like runDisco should exit on this kind of failure!

>>> Function start: loadReadLenghtsFromReadFile()
load reads from read file: SRR606245.pe.qc.fq.gz
         0 read lengths loaded from this read file
<<< Function stop: loadReadLenghtsFromReadFile(), Elapsed time: 9.93079e-05 seconds, Memory usage: 1 - 1 = 0 MB.
----
<<< Function stop: DataSet(), Elapsed time: 0.000196602 seconds, Memory usage: 1 - 1 = 0 MB.

also, 'seperately' is misspelled.

I think 'mate pairs' is an incorrect specification in Illumina - generally we use 'mate pair' for long-insert paired reads, and 'paired end' for short-insert/standard Illumina.

runDisco-MPI.sh cat error

Error at the end of the assembly with runDisco-MPI.sh

cat: Sample_D1024-Disco/assembly/Sample_D1024-Disco_contigsFinal_*.fasta: No such file or directory
cat: Sample_D1024-Disco/assembly/Sample_D1024-Disco_scaffoldsFinal_*.fasta: No such file or directory

The files are: Sample_D1024-Disco/assembly/Sample_D1024-Disco_contigsFinalCombined.fasta
and Sample_D1024-Disco/assembly/Sample_D1024-Disco_scaffoldsFinalCombined.fasta

Support for pacbio/nanopore guided assemblies?

I've got a metagenome sequenced with both pacbio and illumina reads and I am interested in using DISCO for assembly. I saw this tweet a while back and I was curious if this feature is coming soon.

Is it possible for any currently available DISCO builds to use this data?

Thanks.

buildG-MPIRMA not found during compilation

Working on a workstation running Ubuntu 14.04, with all compilers and mpic++.openmpi.

Compilation of the distributed version fails [from the cloned git repository at https://github.com/abiswas-odu/Disco.git] with this error:

cp src/BuildGraph/Release/buildG .
cp src/SimplifyGraph/Release/fullsimplify .
cp src/SimplifyGraph/Release/parsimplify .
cp src/BuildGraphMPI/Release/buildG-MPI .
cp src/BuildGraphMPIRMA/Release/buildG-MPIRMA .
cp: cannot stat ‘src/BuildGraphMPIRMA/Release/buildG-MPIRMA’: No such file or directory
make: *** [all] Error 1

Question: Is it possible to merge contigs with Disco?

Hi, I was wondering whether Disco, as an OLC assembler, could be used to merge contigs assembled by other assemblers (IDBA, SPAdes)? For example, MeGAMerge uses Newbler and minimus2 to do that. I also know that CAP3 can do that, but it is very slow.
I did a couple of tests myself, trying to merge contigs with Disco, but the results were inconclusive: it worked for one of my test data sets but not for three others. Are there any limitations regarding the length of sequences that Disco can assemble? Thank you.

Provide statically-linked binaries

Hello, we tried using Disco on our CentOS 7 cluster, but the version of the system libstdc++ is older than the one you compiled against:

/opt/Disco-1.1-beta/buildG: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /opt/Disco-1.1-beta/buildG)
/opt/Disco-1.1-beta/buildG: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /opt/Disco-1.1-beta/buildG)
/opt/Disco-1.1-beta/fullsimplify: /lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by /opt/Disco-1.1-beta/fullsimplify)
/opt/Disco-1.1-beta/fullsimplify: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found (required by /opt/Disco-1.1-beta/fullsimplify)

Since you're releasing pre-compiled binaries, would it be possible to at least have them statically linked?

callvariants.sh interpreting very high frequency alleles as low frequency alleles making them absent from VCFs

Hi

From looking at my alignment in IGV I can see that there is a 3 bp insertion in almost every read at a specific locus (see attached image: IGV_screenshot_05-06-22).

Out of 775 bases at this position, 658 are inserts.

When I run the following command, both it and many other very high frequency variants are absent from the resulting VCF:

callvariants.sh in=Sample7_ST-J1_test.sam ref=../Genomes/ST-J1.fasta minallelefraction=0.50001 rarity=0.50001 minreadmapq=20 minscore=5 out=Sample7_min_frac0.5_Mapq20.vcf overwrite=T

I only observe them when I use the clearfilters option:

callvariants.sh in=Sample7_ST-J1_test.sam ref=../Genomes/ST-J1.fasta minallelefraction=0.50001 rarity=0.50001 clearfilters minreadmapq=20 minscore=5 out=Sample7_min_frac0.5_Mapq20_Readq5_CF.vcf overwrite=T

When I look at the allele frequencies for these variants in the second output, I can see that callvariants.sh has estimated the frequencies of many variants to be much lower than they actually are, and I suspect that this is why they are missing from the first output.

Here is the output for this insertion variant from the second command with cleared filters: AF=0.0194;RAF=0.0212, which is much lower than expected.

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample7_ST-J1_test

NC_019495.1 289042 . A ACAC 5.21 PASS SN=0;STA=289042;STO=289042;TYP=INS;R1P=0;R1M=7;R2P=0;R2M=5;AD=12;DP=618;MCOV=-1;PPC=12;AF=0.0194;RAF=0.0212;LS=1999;MQS=720;MQM=60;BQS=435;BQM=38;EDS=525;EDM=98;IDS=11749;IDM=986;NVC=0;FLG=0;CED=601;HMP=0;SB=0.0703 GT:DP:AD:AF:RAF:NVC:FLG:SB:SC:PF 0:618:12:0.0194:0.0212:0:0:0.0703:5.21:PASS

I do not fully understand why I am seeing such low frequency for almost all of my high frequency variants.

I tried changing minreadmapq=0

callvariants.sh in=Sample7_ST-J1_test.sam ref=../Genomes/ST-J1.fasta minallelefraction=0.50001 rarity=0.50001 minreadmapq=0 minscore=5 out=Sample7_min_frac0.5_Mapq0.vcf overwrite=T

But without "clearfilters" this variant still does not appear in the VCF. The only way I can get it to appear in the VCF without using "clearfilters" is by setting rarity sufficiently low to capture rare variants, but it seems odd that I have to do this to detect what are actually very common variants

It's a viral genome, so it's small (290 kb), and it's very easy to see that most of the variants supported by a high fraction of the reads are for some reason absent from the VCF, evidently due to them being interpreted as only being supported by a low fraction of reads. Any advice as to why this is happening would be greatly appreciated. I'm using BBMap v38.90.

runDisco-MPI-ALPS.sh error written to log...

When trying to run the script above, I get the following error on line 217 (of the script):
line 217: /var//opt/cray/alps/spool/6942450/fullsimplify: No such file or directory

This error repeats >10 times before the job aborts (PBS Pro submission to aprun).

Just curious to see if anyone else has experienced this error when using the runDisco-MPI-ALPS.sh script for DISCO version 1.2.

Thanks!!!
