Contributed by Chun-Yuan Huang, 3/10/2016
Reference-guided (rfguided) assembly of a target sequence with tools such as TRegGA [1] often gives different results for the target gene depending on which reference genome is used. Selecting the "correct" reference becomes critical for optimal assembly when the sample is phylogenetically distant from the available references. This workflow, Targeted Haplotype-Assisted Reference-Guided Genome Assembly (haplovars), aims to build a reliable targeted reference sequence in order to improve the accuracy of targeted rfguided genome assembly. Taking advantage of the Rice 3,000 Genome Project (3kRGP) [2] and the SNP-Seek database [3], haplovars takes as inputs a specific rice cultivar and the genomic coordinates of the target region of interest on the rice reference IRGSP-1.0. It applies the haplotype search program [4] to identify cultivars in the SNP-Seek database whose haplotype is highly similar to that of the sample in the target region (we call such cultivars haplovars), retrieves the haplovar reads from 3kRGP [5] for denovo assembly, and returns a superscaffold sequence with annotation in EMBL format. The superscaffold can replace the original IRGSP reference sequence as a more reliable reference for accurate rfguided assembly of the target gene. Combined with TRegGA, this workflow produces more reliable and accurate targeted sequence assemblies, which should be valuable in applications such as disease gene identification and characterization.
git clone https://github.com/huangc/TRegGA.git
cd TRegGA
git clone https://github.com/huangc/haplovars.git
- Identify haplovars for the sample with regard to the region of interest on the reference genome.
- Retrieve haplovar reads and assemble them denovo into haplovar contigs and scaffolds.
- Find Deletion Fingerprints (DFPs) of haplovar contigs for secondary validation (in addition to the SNP fingerprinting in step 1).
- Mix and match SNP/InDel fingerprint-validated haplovar reads and scaffolds for denovo assembly of haplovar superscaffolds.
- (Optional) Run rfguided assembly of haplovar superscaffolds into a haplovar pseudomolecule with [multiple] reference genomes.
- Run rfguided assembly of sample reads/scaffolds using the haplovar superscaffolds/pseudomolecule as reference.
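The haplovar identification in the first step can be pictured as ranking cultivars by how closely their SNP calls match those of the sample inside the target region. The sketch below only illustrates that idea and is not the haplotype search program [4] itself. It assumes a hypothetical tab-separated genotype table with one row per cultivar, the cultivar name in column 1, and one allele call per SNP position of the target region in the remaining columns. The function name `rank_haplovars` and the table layout are assumptions made for this example.

```shell
# rank_haplovars SAMPLE TABLE
# Hypothetical helper: prints "cultivar<TAB>identity" for every cultivar other
# than SAMPLE, sorted from most to least similar. Identity is the fraction of
# SNP columns whose allele equals the sample's allele in the same column.
rank_haplovars() {
  sample=$1
  table=$2
  awk -F'\t' -v sample="$sample" '
    # First pass over the table: remember the allele calls of the sample row.
    NR == FNR {
      if ($1 == sample) { for (i = 2; i <= NF; i++) ref[i] = $i; n = NF }
      next
    }
    # Second pass: score every other cultivar against the sample.
    $1 != sample && n > 1 {
      match_count = 0
      for (i = 2; i <= n; i++) if ($i == ref[i]) match_count++
      printf "%s\t%.2f\n", $1, match_count / (n - 1)
    }
  ' "$table" "$table" | sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

With a table file named, say, `genotypes.tsv` (a made-up name), `rank_haplovars MY_SAMPLE genotypes.tsv | head -n 5` would list the five closest candidates. A real search would also have to handle missing calls and heterozygous sites, which this sketch ignores.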
- Edit and set up the parameters as described in 0SOURCE, then
source 0SOURCE
- Edit and prepare the prerequisite files and software as described in PREREQ.sh, then
sh PREREQ.sh
- (Optional) If a VCF file for the sample is not available, run whole genome variant calling:
sh x1-WGvarSNP
- Run the haplotype search program to identify haplovars:
qsub x2-HaplovarFinder
- Run denovo assembly of haplovar contigs and scaffolds:
qsub x3-TRegGA-denovo
- Run whole genome blat alignment on haplovar contigs:
sh x4-WGblat
- Run Deletion Fingerprinting (DFP) of haplovar contigs:
qsub x5-WGindelT
- Run DFP clustering to identify close relatives of haplovar contigs:
qsub x6-DFPtree
- Run superscaffold assembly of haplovar scaffolds, then use that as reference for rfguided assembly:
qsub x7-TRegGA-rfguided
- Find main outputs in data/.
- Clean up files with
sh xcleanup
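Before launching the steps above, it can help to confirm that every step script is present and to see the commands in order. The following is a minimal sketch, not part of the workflow: the function name `haplovars_preflight` is hypothetical, and it assumes you pass the directory that holds the x1 to x7 scripts. It only prints commands and never runs or submits anything.

```shell
# haplovars_preflight [DIR]
# Hypothetical helper: for each step script named in this README, print the
# command used to run it, or report the script as missing. Nothing is executed.
# Returns non-zero if any script is missing from DIR (default: current directory).
haplovars_preflight() {
  dir=${1:-.}
  missing=0
  for step in sh:x1-WGvarSNP qsub:x2-HaplovarFinder qsub:x3-TRegGA-denovo \
              sh:x4-WGblat qsub:x5-WGindelT qsub:x6-DFPtree \
              qsub:x7-TRegGA-rfguided; do
    runner=${step%%:*}   # text before the colon: sh or qsub
    script=${step#*:}    # text after the colon: the step script
    if [ -f "$dir/$script" ]; then
      echo "$runner $script"
    else
      echo "MISSING $script" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Because `qsub` normally returns as soon as a job is queued, the printed commands are meant to be run one at a time, presumably waiting for each submitted job to finish before starting the next step.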
- prereq/: prerequisite inputs, such as retrieved TRegGA-assembled contigs, retrieved reference genomes, and the BLAST+ database prepared for the reference genome.
- doc/: reference and tutorial documents.
- bin/: ancillary code and scripts.
- src/: prerequisite software.
- run/: main scripts and execution results.
- data/: final outputs and reports.
- [1] TRegGA: https://github.com/BrendelGroup/TRegGA
- [2] 3,000 rice genomes project. The 3,000 rice genomes project. Gigascience. 2014 May 28;3:7.
- [3] Alexandrov N, et al. SNP-Seek database of SNPs derived from 3000 rice genomes. Nucleic Acids Res. 2015 Jan;43(Database issue):D1023-7.
- [4] Murat Öztürk. find-by-SNP. https://github.com/muzcuk/find-by-SNP.git
- [5] The Rice 3000 Genomes Project Data. http://gigadb.org/dataset/200001.