Git Product home page Git Product logo

thewickedaxe / chloroplast_assembly_protocol Goto Github PK

View Code? Open in Web Editor NEW

This project forked from eead-csic-compbio/chloroplast_assembly_protocol

0.0 2.0 0.0 9.64 MB

A set of scripts for the assembly of chloroplast genomes out of whole-genome sequencing reads

License: GNU General Public License v2.0

Perl 11.39% HTML 0.90% Batchfile 0.01% Python 1.06% Makefile 0.58% C 72.14% Roff 1.79% C++ 11.54% CSS 0.12% TeX 0.13% Objective-C 0.20% Shell 0.15%

chloroplast_assembly_protocol's Introduction

Chloroplast assembly protocol

A set of scripts for the assembly of chloroplast genomes out of whole-genome sequencing reads

Authors Carlos P Cantalapiedra (1), Ruben Sancho (1,2), Bruno Contreras Moreira (1,3)

  1. Estación Experimental de Aula Dei-CSIC, Zaragoza, Spain
  2. Escuela Politécnica Superior de Huesca, U.Zaragoza, Spain
  3. Fundación ARAID, Zaragoza, Spain

Software dependencies

This protocol has been tested on Linux x86_64 systems, although it should also work on Mac-OSX settings. It requires perl, which should be installed on all Linux environments, plus some third-party programs listed below, which are provided pre-compiled. In order to check whether they work on your machine, or to re-compile them otherwise, for instance on Mac-OSX, please run on the terminal:

perl ./install.pl

The script will tell which programs are set up (OK) and which require installing a compiler (gcc or g++) or java in order to run.

This is the full list of software, located in bin/, required for the protocol:

Name shipped version Source
DUK March 03, 2011 http://duk.sourceforge.net
Trimmomatic 0.32 http://www.usadellab.org/cms/?page=trimmomatic
FastQC 0.10.1 http://www.bioinformatics.babraham.ac.uk/projects/fastqc
musket 1.0.6 http://musket.sourceforge.net
BWA 0.7.6a http://bio-bwa.sourceforge.net
velvet 1.2.08 https://github.com/dzerbino/velvet/tree/master
SSPACE 2.0 http://www.baseclear.com/bioinformatics-tools
GapFiller v1-11 http://www.baseclear.com/bioinformatics-tools
seqtk https://github.com/lh3/seqtk
split_pairs 0.5 https://github.com/eead-csic-compbio/split_pairs

Optionally samtools is used in HOWTOcheck_assembly.txt.

Input reads

This pipeline has been tested with paired-end (PE) and mate-pairs (MP) read libraries, with the following FASTQ quality encodings: Illumina 1.5 (Phred+64) and Sanger/Illumina (Phred+33). The code assumes that PE libraries have FR orientation, meaning that read pairs are "--><--". MP libraries are expected to be RF but can be set to FR as well.

Reference sequences

  • poaceae.fna is a FASTA file containing chloroplast (cp) genomes of several representative Poaceae species which provide DNA k-mers that help fishing out cp reads out of whole-genome read sets. It is used by 0_get_cp_reads.pl. This file should be adapted to the particular phylogenetic group under study.

  • reference.fna is required by this protocol, and it should contain the cp genome of a related species.

  • reference4columbus.fna is required for reference-guided assembly. Contains the cp genome of a related species split in two non-overlapping contigs so that the first conting contains the fragment LSC-IRa and the second the fragment SSC-IRb. This allows assembly of both inverted repeat (IRa and IRb) regions, as discovered by McPherson et al. Check the FASTA headers in both reference files to see how they must look like.

Examples

Input reads are within "test" directory.

Reference-guided assembly
  • Fish cp reads from whole genome library, using provided test reads (see flowchart):
./0_get_cp_reads.pl test test_cp poaceae.fna
  • Clean and trim reads to remove poor quality segments; output includes mean insert sizes and orientations (see flowchart):
./1_cleanreads.pl -folder test_cp -ref reference.fna 

######- using a single PE library

  • Create a config file "test_cp/assembly_pe" cp test_cp/cleanreads.txt test_cp/assembly_pe and edit it leaving only one row:

    1 testPE cp-testPE.wind15_28.3crop70.mlen60.corr.12.fq.gz FR 221 1.5

  • Assemble cp genome from a single PE library (see flowchart-1 and flowchart-2):

./2_assemble_reads.pl test_cp assembly_pe -ref reference.fna -refcolumbus reference4columbus.fna

######- using both PE and MP libraries

  • Then, create a different config file to use both read libraries cp test_cp/cleanreads.txt test_cp/assembly_mp and edit it reordering rows so that testPE is number 1:

    1 testPE cp-testPE.wind15_28.3crop70.mlen60.corr.12.fq.gz FR 221 Sanger

    2 testMP cp-testMP.wind15_28.3crop70.mlen60.corr.12.fq.gz RF 4295 Sanger

  • Assemble cp genome combining PE + MP libraries (see flowchart-1 and flowchart-2):

./2_assemble_reads.pl test_cp assembly_mp -ref reference.fna -refcolumbus reference4columbus.fna
Example of de-novo assembly
  • Fish cp reads from WGS library

./0_get_cp_reads.pl test test_cp_noref poaceae.fna

  • clean and trim reads to remove poor quality segments

./1_cleanreads.pl -folder test_cp_noref

  • Create config file test_cp_noref/assembly_pe cp test_cp_noref/cleanreads.txt test_cp_noref/assembly_pe and edit test_cp_noref/assembly_pe file leaving one (PE reads) or two rows (PE + MP reads; see previous examples).

    In this case we are leaving a single PE library, and note that we have to provide orientation and insert size (FR and 221 in this example):

    1 testPE cp-testPE.wind15_28.3crop70.mlen60.corr.12.fq.gz FR 221 Sanger

  • Finally, assemble cp genome

./2_assemble_reads.pl test_cp_noref assembly_pe

For more info and parameters of the scripts see README.txt

In addition, examples yielding assembled data could be run following the steps at:

Post-assembly inspection

The tutorial HOWTOcheck_assembly.txt contains recipes to analyze and validate your assemblies.

Propagation of annotated features to assemblies

If a good quality master annotation in GenBank format is at hand, it can be used to propagate its features to several assembled chloroplast genomes with accompanying script _annot_fasta_from_gbk.pl. The way to invoke it is as follows, assuming strain XYZ was assembled:

perl _annot_fasta_from_gbk.pl reference.gbk assembly.fa XYZ_chloroplast.gbk XYZ

NOTE: This script requires setting a valid path to BLAST+ binaries installed on your system, plus Bioperl installed on your system.

Generating tracks from aligned assemblies

If several chloroplast assemblies are aligned they can then be further used to produce tracks for software CIRCOS. Script _check_matrix.pl does just that; it requires setting a reference strain and a window size, please check the code. An example call would be:

perl _check_matrix.pl chloroplast_alignment.fna

chloroplast_assembly_protocol's People

Contributors

cantalapiedra avatar eead-csic-compbio avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.