haplovars: Targeted Haplotype-Assisted Reference-Guided Genome Assembly

Contributed by Chun-Yuan Huang, 3/10/2016

Aims:

Reference-guided (rfguided) assembly of a target sequence with tools such as TRegGA [1] often gives different results for the target gene when different reference genomes are used. Selecting the "correct" reference becomes critical for optimal assembly when the sample is phylogenetically distant from the available references. This workflow, Targeted Haplotype-Assisted Reference-Guided Genome Assembly (haplovars), aims to build a reliable targeted reference sequence in order to improve the accuracy of targeted rfguided genome assembly.

Taking advantage of the Rice 3,000 Genome Project (3kRGP) [2] and the SNP-Seek database [3], haplovars takes a specific rice cultivar and the genomic coordinates of the target region of interest on the rice reference IRGSP-1.0 as inputs. It applies the haplotype search program [4] to identify cultivars in the SNP-Seek database whose haplotype is highly similar to the sample's over the target region (we call such cultivars Haplovars), retrieves the haplovar reads from 3kRGP [5] for denovo assembly, and returns a superscaffold sequence with annotation in EMBL format.

Such a superscaffold can replace the original IRGSP reference sequence as a more reliable alternative reference for accurate rfguided assembly of the target gene. Combined with TRegGA, this workflow outputs more reliable and accurate targeted sequence assemblies, which should be valuable in applications such as disease gene identification and characterization.

Workflow setup: haplovars as a module of TRegGA.

git clone https://github.com/huangc/TRegGA.git
cd TRegGA
git clone https://github.com/huangc/haplovars.git

Workflow description:

Identify haplovars of the sample with regard to the region of interest on the reference genome.
Retrieve haplovar reads and assemble them denovo into haplovar contigs and scaffolds.
Find Deletion Fingerprints (DFPs) of haplovar contigs for secondary validation (in addition to the SNP fingerprinting in the first step).
Mix and match SNP/InDel fingerprint-validated haplovar reads and scaffolds for denovo assembly of haplovar superscaffolds.
(Optional) rfguided assembly of haplovar superscaffolds into a haplovar pseudomolecule with [multiple] reference genomes.
rfguided assembly of sample reads/scaffolds using the haplovar superscaffolds/pseudomolecule as reference.
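
To make the idea behind the haplovar search more concrete, here is a minimal sketch of ranking cultivars by haplotype similarity to the sample over a target region. It is not the haplotype search program [4] used by the workflow. The function name rank_haplovars and the input layout (a tab-separated table with a header line "pos, sample, cultivar names..." and one row per SNP position in the region) are assumptions made only for this illustration.

```shell
#!/bin/sh
# Rank cultivars by the fraction of SNP positions whose allele matches the sample.
# Hypothetical input: tab-separated file, header "pos<TAB>sample<TAB>cv1<TAB>cv2...",
# followed by one row per SNP position inside the target region.
rank_haplovars() {
  awk -F'\t' '
    # Header line: remember the cultivar name of each column.
    NR == 1 { for (i = 3; i <= NF; i++) name[i] = $i; ncol = NF; next }
    {
      total++
      # Column 2 holds the sample allele; count matches per cultivar column.
      for (i = 3; i <= ncol; i++) if ($i == $2) match_count[i]++
    }
    END {
      if (total == 0) exit
      for (i = 3; i <= ncol; i++)
        printf "%s\t%.3f\n", name[i], match_count[i] / total
    }
  ' "$1" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

Calling rank_haplovars on such a table prints one line per cultivar with its match fraction, best match first. A real search would also have to handle missing genotype calls and heterozygous sites, which this sketch ignores.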

Workflow execution:

Edit and set up the parameters as described in 0SOURCE, then source 0SOURCE
Edit and prepare the prerequisite files and software as described in PREREQ.sh, then sh PREREQ.sh
(Optional) If the sample vcf is not available, run whole genome variant calling: sh x1-WGvarSNP
Run haplotype search program to identify haplovars: qsub x2-HaplovarFinder
Run denovo assembly of haplovar contigs and scaffolds: qsub x3-TRegGA-denovo
Run whole genome blat alignment on haplovar contigs: sh x4-WGblat
Run Deletion Fingerprinting (DFP) of haplovar contigs: qsub x5-WGindelT
Run DFP clustering to identify close-relatives of haplovar contigs: qsub x6-DFPtree
Run superscaffold assembly of haplovar scaffolds, then use that as reference for rfguided assembly: qsub x7-TRegGA-rfguided
Find main outputs in data/.
Clean up files with sh xcleanup
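
The x1 to x7 steps above can be summarized as a small helper that prints the commands in order. This is only a sketch: the function name haplovars_plan and its "skip-x1" argument are hypothetical, and it assumes 0SOURCE has been sourced and PREREQ.sh has been run beforehand. Because qsub returns as soon as a job is queued, the commands are printed for review rather than run back to back; each queued job should finish before the next step is started.

```shell
#!/bin/sh
# Print the haplovars steps in execution order, one command per line.
# Script names and sh/qsub choices are taken from the step list above.
# Pass "skip-x1" when a sample vcf is already available.
haplovars_plan() {
  if [ "${1:-}" != "skip-x1" ]; then
    echo "sh x1-WGvarSNP"
  fi
  for step in \
    "qsub x2-HaplovarFinder" \
    "qsub x3-TRegGA-denovo" \
    "sh x4-WGblat" \
    "qsub x5-WGindelT" \
    "qsub x6-DFPtree" \
    "qsub x7-TRegGA-rfguided"
  do
    echo "$step"
  done
}
```

Running haplovars_plan lists all seven commands, while haplovars_plan skip-x1 omits the optional variant calling step.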

Sub-directories for workflow implementation:

prereq/: prerequisite inputs, such as retrieval and storage of TRegGA assembled contigs, retrieval and storage of reference genomes, and preparation of the BLAST+ database for the reference genome.
doc/: reference and tutorial documents.
bin/: ancillary codes and scripts.
src/: prerequisite software.
run/: main scripts and execution results.
data/: final outputs and reports.
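
A quick way to confirm that a checkout matches the layout above is sketched below. The function name check_layout is hypothetical, and the sketch assumes all six sub-directories are expected to exist; some of them may only appear after the workflow has been run.

```shell
#!/bin/sh
# Report which of the expected haplovars sub-directories are missing.
# Usage: check_layout [haplovars_dir]   (defaults to the current directory)
# Returns 0 when all directories exist, 1 otherwise.
check_layout() {
  root=${1:-.}
  missing=0
  for d in prereq doc bin src run data; do
    if [ ! -d "$root/$d" ]; then
      echo "missing: $d/"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, check_layout TRegGA/haplovars prints one "missing:" line per absent directory and prints nothing when the layout is complete.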

References:

[1] TRegGA: https://github.com/BrendelGroup/TRegGA
[2] 3,000 rice genomes project. The 3,000 rice genomes project. GigaScience. 2014 May 28;3:7.
[3] Alexandrov N, et al. SNP-Seek database of SNPs derived from 3000 rice genomes. Nucleic Acids Res. 2015 Jan;43(Database issue):D1023-7.
[4] Murat Öztürk, https://github.com/muzcuk/find-by-SNP.git
[5] The Rice 3000 Genomes Project Data. http://gigadb.org/dataset/200001.
