GALA is a Gap-free Long-read Assembler. GALA builds a multi-layer graph from different preliminary assemblies, long reads, and potentially other sources of information, such as Hi-C assemblies. During this process, it identifies mis-assembled contigs and trims them. The corrected data are then partitioned into multiple scaffolding groups, each representing a single chromosome. Each scaffolding group is assembled independently with existing assembly tools, and a simplified overlap-graph-based merging algorithm is used to merge multiple contigs if necessary.
GALA has three modules, each of which can be used separately.
GALA assembled a human genome from HiFi reads, using a Canu draft for CHM13 and the current human reference genome GRCh38.p13 as input. In this way, GALA essentially created a reference-guided de novo assembly. The GALA assembly comprises 37 continuous contigs, including 8 telomere-to-telomere gap-free pseudomolecular sequences, 4 near-complete chromosomes each with a small telomeric fragment unanchored, 3 with only gapped centromeric regions, and the long arms of acrocentric chromosomes.
GALA can be run directly from the gala folder
git clone https://github.com/ganlab/gala.git
cd gala
Or you can run install to add it to your PATH.
Assembling a genome with the GALA pipeline involves preliminary steps and three main steps.
Use different software to construct preliminary assemblies from long reads, e.g. Canu, Flye, MECAT, Miniasm, and Wtdbg2.
- Raw reads and corrected reads if available.
- The user needs to prepare a draft_names_paths.txt file for the preliminary assemblies. Here is an example:
draft_01=path/to/draft_fasta_file
draft_02=path/to/draft_fasta_file
draft_03=path/to/draft_fasta_file
draft_n=path/to/draft_fasta_file
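The name=path format above is easy to generate and check from a script. The following is a minimal sketch, not part of GALA: the function names and the example draft names and paths are hypothetical, and the checks (no blank names, no duplicates) are assumptions about what a sensible draft list looks like rather than rules documented by GALA.

```python
def format_draft_list(drafts):
    """Return the text of a draft list: one name=path line per draft."""
    return "".join(f"{name}={path}\n" for name, path in drafts.items())


def parse_draft_list(text):
    """Parse name=path lines into a dict, rejecting malformed lines."""
    drafts = {}
    for number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines (an assumption of this sketch)
        name, sep, path = line.partition("=")
        if not sep or not name or not path:
            raise ValueError(f"line {number}: expected name=path, got {line!r}")
        if name in drafts:
            raise ValueError(f"line {number}: duplicate draft name {name!r}")
        drafts[name] = path
    return drafts


# Hypothetical draft names and paths, for illustration only.
example = {"canu": "drafts/canu.fasta", "flye": "drafts/flye.fasta"}
text = format_draft_list(example)
```

Writing `text` to draft_names_paths.txt gives a file in the format shown above; `parse_draft_list` can be used to sanity-check a hand-written file before starting a long run.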
To run GALA with a single command, use:
gala draft_names_paths.txt fa/fq reads_file platform
In single-command mode, GALA uses Canu for chromosome-by-chromosome assembly by default.
To use another assembler or multiple assemblers, pass one or more of the three supported choices (canu, flye, and miniasm) to the -a argument, separated by single spaces.
For sequencing_platform, the user needs to provide one of the following:
-pacbio-raw
-pacbio-corrected
-nanopore-raw
-nanopore-corrected
usage: gala -h [options] <draft_names & paths> <fa/fq> <reads> <platform>
GALA Gap-free Long-reads Assembler
positional arguments:
draft_names Draft names and paths [required]
input_file input type (fq/fa) [required]
reads raw/corrected reads [required]
sequencing_platform pacbio-raw pacbio-corrected nanopore-raw nanopore-corrected [required]
optional arguments:
-h, --help show this help message and exit
-a [ASSEMBLER [ASSEMBLER ...]]
Chr-by_Chr assembler (canu flye miniasm) [default canu]
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-l lowest number of misassemblies indecator [default 1]
-c Shortest contig length [default 5000]
-k Mis-assembly block [default 175]
It is better to extend the misassembly block in case of
unpolished assemblies or expected mis-assemblies
in highly repetative regions (5000-10000)
-q Mapping quality [default 20]
-f Output files name [default gathering]
-t cut on a threshold passed by -u [default False]
-u threshold cut value [default 3]
--cut1 The length of the smallest discordance on contigs of length >= 1000000 [default 50000]
Be very careful with this parameter
--cut2 The length of the smallest discordance on contigs of length >= 100000 [default 25000]
Be very careful with this parameter
--cut3 The length of the smallest discordance on contigs of length >= 5000 [default 15000]
Be very careful with this parameter
-o output files path [default current directory]
-v, --version show program's version number and exit
JohnUrbanFork-specific options
--threads THREADS Number of threads to use with Minimap2 and BWA.
--fastmode Use Minimap2 for read-mapping steps instead of BWA.
--resume GALA will look for last successfully completed step, and continue (resume). It will redo any steps along pipeline that do not have Step_xxx.done touch files...
--continue_from CONTINUE_FROM
GALA will look try to start from step specified: 1 = Beginning of the pipeline (generating draft_compare.sh script to run). 2 = Running draft_compare.sh. 3 = Identify mis-
assembled contigs. 4 = Produce misassembly-free drafts. 5 = Generate script to compare the misassembly-free drafts. 6 = Run draft_comparison file to produce new drafts
comparison paf files. 7 = Run the ccm module to produce contigs scaffolding groups. 8 = Map all drafts against raw long reads and self-corrected reads if available. 9 =
Separate the read names mapped to each contig. 10 = Concatenate read name files belongs to the same scaffolding group. 11 = Use the readsep Module to separate each scaffold
correlated-reads. 12 = Run assemblies.
--hifi Use this flag if long read data is >99 pct accuracy on average. Default : assumes false. Affects some paramter choices. Typically for PacBioHiFi, but perhaps can work with
Nanopore Q20/Q30 chemistry (avg accuracy >99 pct).
--sac Use this flag if nanopore data is >90-95 pct accuracy on average (e.g. super accurate basecalling mode, SAC). Default : assumes false. Affects some paramter choices. See:
https://github.com/marbl/canu/issues/2121
--forcetrim Optional use with --sac option to force end trimming of reads in Canu pipeline. See: https://github.com/marbl/canu/issues/2121
--tellcanu TELLCANU Put additional parameters to feed Canu in quotes.
--tellflye TELLFLYE Put additional parameters to feed Flye in quotes.
--tellminiasm TELLMINIASM
Put additional parameters to feed Miniasm in quotes.
--debug
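When GALA is launched from a workflow script, it can help to assemble and validate the command line before running it. The sketch below only builds an argument list from the positional arguments and the -a option listed in the help text above; `build_gala_command` is a hypothetical helper, not part of GALA. Placing -a last is a precaution that assumes argparse-style parsing, where an option taking several values could otherwise absorb the positional arguments.

```python
import shlex

# Values taken from the help text above.
PLATFORMS = ("pacbio-raw", "pacbio-corrected", "nanopore-raw", "nanopore-corrected")
ASSEMBLERS = ("canu", "flye", "miniasm")


def build_gala_command(drafts, input_type, reads, platform, assemblers=("canu",)):
    """Return an argv list for a single-command GALA run (hypothetical helper)."""
    if input_type not in ("fa", "fq"):
        raise ValueError(f"input type must be fa or fq, got {input_type!r}")
    if platform not in PLATFORMS:
        raise ValueError(f"unknown sequencing platform {platform!r}")
    unknown = [name for name in assemblers if name not in ASSEMBLERS]
    if unknown:
        raise ValueError(f"unsupported assembler(s): {unknown}")
    # Positional arguments first, then -a with its space-separated values.
    return ["gala", drafts, input_type, reads, platform, "-a", *assemblers]


# File names here are placeholders for illustration.
argv = build_gala_command(
    "draft_names_paths.txt", "fq", "reads.fq", "pacbio-raw", ("flye", "miniasm")
)
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)` once the paths point at real files.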
- Use the comp module to generate a draft_comparison file:
  comp draft_names_paths.txt
- Run the draft_comparison file to produce the draft comparison paf files:
  sh draft_compare.sh
- Use the mdm module to identify mis-assembled contigs:
  mdm comparison_folder <number of assembly drafts>
- Use the newgenome module to produce misassembly-free drafts:
  newgenome draft_names_paths.txt cut_folder
- Use the comp module to generate a draft_comparison file for the misassembly-free drafts:
  comp new_draft_names_paths.txt
- Run the draft_comparison file to produce the new draft comparison paf files:
  sh draft_compare.sh
- Run the ccm module to produce the contig scaffolding groups:
  ccm comparison_folder <number of assembly drafts>
- Note: You can also use the reformat module to generate reformatted paf files and use them to confirm the scaffolding groups.
- Map all drafts against raw long reads and self-corrected reads if available:
  bwa index <misassembly-free draft>
  bwa mem -x pacbio/ont2d <misassembly-free draft> <long-reads>
- Use the following commands to separate the read names mapped to each contig:
  samtools view -H bam_file | grep "SQ" | cut -f 2 | cut -d : -f 2 > contig_names
  seprator contig_names mapping.bam
  sh bam_seprator.sh
  for i in bams/*; do samtools view $i | cut -f 1 > $i.read_names; done
- Use the cat command to concatenate the read name files that belong to the same scaffolding group. For example:
  cat contig_1.bam.read_names contig_3.bam.read_names contig_7.bam.read_names > scaffold_1.read_names
- Use the readsep module to separate the reads correlated with each scaffold:
  for i in scaffold_*.read_names; do readsep <raw/corrected-reads> $i -f <input reads file type fa/fq>; done
- Apply the chromosome-by-chromosome assembly approach to retrieve the gap-free chromosome-scale assembly: assemble each read set from scaffold_*.read.fq with different assembly software, e.g. Canu, Flye, MECAT, Miniasm, and Wtdbg.
  We recommend trying different assembly tools, especially Flye, MECAT/NECAT, and Miniasm.
- Finally, map the SGAM outcomes against one of the preliminary draft assemblies to confirm that all the contigs in the scaffolding group are assembled to the right chromosome/scaffold.
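The cat step in the list above becomes tedious when there are many scaffolding groups. The following sketch shows one way to do the same merge in Python. It is not part of GALA: the function name and the contig-to-group mapping are hypothetical, and removing duplicate read names is an addition that plain cat does not perform.

```python
def merge_read_names(groups, read_names):
    """Combine per-contig read names into one ordered list per scaffolding group.

    groups:      dict mapping a scaffold name to the contigs it contains
    read_names:  dict mapping a contig name to the read names mapped to it
    """
    merged = {}
    for scaffold, contigs in groups.items():
        seen = set()
        names = []
        for contig in contigs:
            for name in read_names[contig]:
                # A read mapped to several contigs of a group is kept once.
                if name not in seen:
                    seen.add(name)
                    names.append(name)
        merged[scaffold] = names
    return merged


# Hypothetical grouping and read names, for illustration only.
groups = {"scaffold_1": ["contig_1", "contig_3", "contig_7"]}
read_names = {
    "contig_1": ["read_a", "read_b"],
    "contig_3": ["read_b", "read_c"],
    "contig_7": ["read_d"],
}
merged = merge_read_names(groups, read_names)
```

Each list in `merged` can then be written, one name per line, to the matching scaffold_*.read_names file.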
The comp module generates a genome comparison file, for users who want to compare multiple genomes against each other.
usage: comp -h [options] <draft_names & paths>
Generate genome comparison files, part of GALA Gap-free Long-read Assembler
positional arguments:
drafts Draft names and paths [required]
optional arguments:
-h, --help show this help message and exit
-o output files path [default current directory]
-v, --version show program's version number and exit
The Mis-assembly Detector Module (mdm) detects misassembled contigs. The algorithm relies on contradictory information among the alignments.
The module should be applicable for error correction regardless of the specific algorithm used for assembly, and it can differentiate between misassembly and structural variation.
usage: mdm -h [options] path/to/mapping_files number of drafts
MDM Mis-assembly Detector Module, part of GALA Gap-free Long-read Assembler
positional arguments:
mapping_files mapping paf file [required]
drafts Number of drafts [required]
optional arguments:
-h, --help show this help message and exit
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-c Shortest contig length [default 5000]
-q Mapping quality [default 20]
-f Output files name [default gathering]
-o output files path [default current directory]
-v, --version show program's version number and exit
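The -b, -p, -c, and -q options above describe thresholds applied to the paf alignments. As an illustration of what such filtering involves, here is a minimal sketch that parses the standard PAF columns and applies the default values from the help text. It is not GALA's implementation: computing identity as matching bases divided by alignment block length, treating the thresholds as inclusive, and applying the contig length cutoff to both query and target are all assumptions of this sketch.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    query: str
    query_len: int
    target: str
    target_len: int
    matches: int     # PAF column 10: number of matching bases
    block_len: int   # PAF column 11: alignment block length
    mapq: int        # PAF column 12: mapping quality


def parse_paf_line(line):
    """Extract the fields used for filtering from one PAF line."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), f[5], int(f[6]), int(f[9]), int(f[10]), int(f[11]))


def passes_filters(rec, min_block=5000, min_identity=70.0, min_contig=5000, min_mapq=20):
    """Apply block length, identity, contig length, and mapping quality cutoffs."""
    # Identity formula is an assumption: matches / block length, in percent.
    identity = 100.0 * rec.matches / rec.block_len if rec.block_len else 0.0
    return (
        rec.block_len >= min_block
        and identity >= min_identity
        and rec.query_len >= min_contig
        and rec.target_len >= min_contig
        and rec.mapq >= min_mapq
    )
```

For example, a 20 kb alignment block with 18,000 matching bases has 90% identity under this definition and passes the defaults, while a 3 kb block is rejected by the block length cutoff.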
The newgenome module trims the misassembled contigs and produces a misassembly-free genome. This module is used only with multiple samples.
usage: newgenome -h [options] <draft_names & paths> <path to cut files>
Produce mis-assembly free genomes, part of GALA Gap-free Long-read Assembler
positional arguments:
draft Draft names and paths [required]
cut_files path_to_cut_files" [required]
optional arguments:
-h, --help show this help message and exit
-f Output files name [default new_genome]
-o output files path [default current directory]
-v, --version show program's version number and exit
The Contig Clustering Module (ccm) identifies the scaffolding groups and the contig overlap information in multiple preliminary assemblies.
ccm could have extended applications in generating a consensus assembly from multiple sequences. It is also useful in reference-guided scaffolding to determine chromosome scaffolding groups.
usage: ccm -h [options] <path/to/mapping_files> <number of drafts>
CCM Contig Clustering Module, part of GALA Gap-free Long-read Assembler
positional arguments:
mapping_files mapping paf file [required]
drafts Number of drafts [required]
optional arguments:
-h, --help show this help message and exit
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-c Shortest contig length [default 5000]
-q Mapping quality [default 20]
-f Output files name [default scaffolds]
-o output files path [default current directory]
-v, --version show program's version number and exit
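To give an intuition for what grouping contigs across drafts means, the sketch below clusters contigs that are linked by alignments, using a union-find structure to compute connected components. This is a generic illustration and should not be read as the algorithm ccm implements; the function name and the contig names are hypothetical, and the input is assumed to be a list of contig pairs that already passed the alignment filters.

```python
def cluster_contigs(pairs):
    """Group contigs connected by alignments into clusters (connected components)."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, compressing the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = {}
    for contig in parent:
        groups.setdefault(find(contig), set()).add(contig)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical aligned contig pairs from three drafts, for illustration only.
pairs = [("canu_1", "flye_3"), ("flye_3", "miniasm_7"), ("canu_2", "flye_1")]
clusters = cluster_contigs(pairs)
```

Here canu_1, flye_3, and miniasm_7 end up in one group because they are linked transitively, while canu_2 and flye_1 form a second group.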
The reformat module filters the alignment data in paf mapping files and merges overlapping and contiguous alignment intervals into a single mapping interval, so that each contig in the query draft has one alignment interval with the subject draft.
usage: reformat -h [options] <path/to/mapping_files> <number of drafts>
Re-formatting mapping files module, part of GALA Gap-free Long-read Assembler
positional arguments:
mapping_files mapping paf file [required]
drafts Number of drafts [required]
optional arguments:
-h, --help show this help message and exit
-b Alignment block length [default 5000]
-p Alignment identity percentage [default 70%]
-c Shortest contig length [default 5000]
-q Mapping quality [default 20]
-f Output files name [default reformated]
-o output files path [default current directory]
-v, --version show program's version number and exit
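The interval merging that the reformat module description refers to can be illustrated with a short, generic sketch. This is not the module's code: `merge_intervals` and its `max_gap` parameter are hypothetical, and the coordinates are made up for the example.

```python
def merge_intervals(intervals, max_gap=0):
    """Merge overlapping or adjacent (start, end) intervals.

    Intervals separated by at most max_gap bases are also joined;
    max_gap is a hypothetical knob of this sketch.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + max_gap:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Made-up alignment intervals of one query contig on one subject contig.
intervals = [(0, 5000), (4000, 9000), (9000, 12000), (20000, 25000)]
merged = merge_intervals(intervals)
```

With the default `max_gap=0`, the first three intervals collapse into one and the distant fourth stays separate; a larger `max_gap` would join them all into a single mapping interval.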
The seprator module separates the alignments of each contig into an individual bam file and writes the read names mapped to each contig to an individual file.
usage: seprator -h [options] <contig_names> <bam_file>
Separate each contig correlated read names, part of GALA Gap-free Long-read Assembler
positional arguments:
contig_names contig_names [required]
bam_file mapping bam file [required]
optional arguments:
-h, --help show this help message and exit
-o output files path [default current directory]
-f Output files name [default bam_seprator]
-b output folder name [default bams]
-v, --version show program's version number and exit
Use the following command to produce the contig_names file:
samtools view -H <bam_file> |grep 'SQ'|cut -f 2|cut -d : -f 2 > contig_names
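For reference, the same extraction can be done in Python on the header text printed by samtools view -H. This is a sketch, not part of GALA, and `contig_names_from_header` is a hypothetical name. Unlike the shell pipeline, it only looks at lines whose first field is @SQ and keeps contig names that contain a colon intact.

```python
def contig_names_from_header(header_text):
    """Return the SN values of all @SQ lines in a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue  # skip @HD, @PG, @RG and other header lines
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])  # drop the "SN:" tag prefix
                break
    return names


# A small made-up header, for illustration only.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:contig_1\tLN:1000\n"
    "@SQ\tSN:contig_2\tLN:2000\n"
    "@PG\tID:bwa\tPN:bwa\n"
)
names = contig_names_from_header(header)
```

Writing `names` one per line produces a contig_names file equivalent to the one created by the command above.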
The readsep module extracts a set of reads from a sequencing dataset according to the read names in the definition lines.
usage: readsep -h [options] <reads> <read_titles>
Extract reads from fasta or fastq, part of GALA Gap-free Long-read Assembler
positional arguments:
reads raw/corrected reads [required]
read_titles read names [required]
optional arguments:
-h, --help show this help message and exit
-f input file format (fa/fq)
-v, --version show program's version number and exit
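The idea behind this kind of extraction can be sketched in a few lines of Python. The example below is an illustration, not the readsep implementation: `extract_fastq_reads` is a hypothetical function, it assumes four-line fastq records (no wrapped sequences), and it takes the read name to be the text between the leading @ and the first whitespace on the definition line.

```python
def extract_fastq_reads(fastq_lines, wanted):
    """Yield 4-line fastq records whose read name is in wanted."""
    wanted = set(wanted)
    # Group the line iterator into consecutive chunks of four lines.
    records = zip(*[iter(fastq_lines)] * 4)
    for header, seq, plus, qual in records:
        name = header[1:].split()[0]  # strip "@" and any trailing comment
        if name in wanted:
            yield header, seq, plus, qual


# Two made-up reads, for illustration only.
fastq = [
    "@read1 extra\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "GGCC\n", "+\n", "JJJJ\n",
]
selected = list(extract_fastq_reads(fastq, ["read2"]))
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the list, so large read files do not need to be loaded into memory.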
GALA is distributed under the MIT license. See the LICENSE file for details.