chhylp123 / hifiasm
Hifiasm: a haplotype-resolved assembler for accurate HiFi reads
License: MIT License
hifiasm.log.txt
Hi,
Thanks for the amazing tool. I'm a bit new to genome assembly and have just tried hifiasm on public SRA data. With default settings I got seemingly very nice assemblies: the total size of the p_ctg is very reasonable (693M) and the N50 is over 30M. The species is apple, which as we know is highly heterozygous; my alternate assembly (a_ctg) is around 589M, which I think is also quite reasonable, and the k-mer distribution shows two obvious peaks. (Please take a quick look at the attached working log to check that everything is okay, thanks!)
My first question: does the p_ctg represent one full haplotype? As I understand it, p_ctg + a_ctg + collapsed homozygous regions = 2c? How can I generate the other haplotype, consisting of the a_ctg plus the collapsed homozygous regions?
My second question: how much could the assembly be improved if short Illumina reads for the parents were also available? (I think it's already quite good; maybe there is other added value in trio binning?) Although sometimes it's hard to know the exact parents of the species being sequenced, especially the wild ones.
Thank you so much.
We would like to assemble and haplotype a tree genome for which we only have the maternal genome available; the father could be any surrounding tree.
How could we generate the yak paternal file using the genome and the maternal k-mer content?
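A conceptual sketch only, not a supported yak workflow: without a paternal sample, "paternal" markers can at best be approximated as child k-mers that are absent from the mother. The yak count lines below are standard usage; the set difference is illustrated with comm on toy k-mer lists, since I am not aware of a yak subtract command.

```shell
# Standard trio-index building (commented out; needs yak installed):
#   yak count -b37 -t16 -o mat.yak maternal.illumina.fq.gz
#   yak count -b37 -t16 -o child.yak child.hifi.fq.gz
#
# Toy illustration of "child k-mers not seen in the mother" with comm:
printf 'ACGT\nCCCC\nGGGG\n' | sort > child.kmers   # k-mers in the child
printf 'ACGT\nGGGG\n' | sort > mat.kmers           # k-mers in the mother
comm -23 child.kmers mat.kmers > not_maternal.kmers  # child-only k-mers
cat not_maternal.kmers
```

Whether hifiasm's trio mode behaves well with such a surrogate "paternal" index is a question for the developers.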
Hi Haoyu/Heng Li,
Thanks for the fantastic software for assembling HiFi reads! I've been testing it extensively on the human genome, and one of the things I was trying to understand is the tradeoff between insert size and accuracy. I assembled the GIAB HG002 data using both the 15kb and the 20kb libraries (version 0.5, default parameters) at approximately the same coverage and obtained the following stats:
15kb 2 cells, ~ 17 fold coverage (Average Q = 33.5, Mean insert size = 12.9 kb), N = 793 contigs, NG50 = 37 Mbp, NGA50 = 16.9 Mbp
20kb 2 cells, ~ 16 fold coverage (Average Q = 31.1, Mean insert size = 18.5 kb), N = 770 contigs, NG50 = 21 Mbp, NGA50 = 12.9 Mbp
Here's the p_utg gfa graph for 15kb:
It seems that at the unitig stage the 20kb dataset has quite a lot fewer (and shorter) nodes and edges than the 15kb dataset, yet once collapsed to contigs the 15kb assembly is more contiguous. The 20kb graph also appears to have more "spikes" on the contigs (likely from the lower accuracy?). Do you think there is any parameter (k-mer size?) we can tune when using lower-accuracy reads with a larger insert size?
Hi,
Thanks for the great software - it's fantastic to have an assembly tool which we can run in a modest amount of time.
I have a question (rather than an issue).
We are working on a diploid plant species (genome size 950Mbp) which is relatively heterozygous (1.5% at the base level).
We want to see just how far we can push hifi data to generate an assembly without any additional technology.
I currently have around 89 Gbp (~88x coverage) of HiFi data.
This gives us a pretty good assembly N50 of 37,230,366.
To estimate if adding data would improve our assembly further, I downsampled the data which gave us the following N50 values
90% - 32,307,688
80% - 27,353,633
70% - 21,585,766
This would appear to indicate that adding more data would improve the N50.
I do also see an increase in the number of total contigs (which is not unexpected) as the coverage increases.
Unfortunately I do not have a good reference for this species so I am relying on comparing it to a closely related species.
I can see in some cases that a single contig spans a complete chromosome (on the closely related species).
You mentioned in a previous post that hifiasm works best at ~40x coverage and that if coverage is too high, 'high coverage may introduce weird things'. Do you think hifiasm could actually continue to improve our assembly if we added more data, or might it start to do some of these 'weird things'?
Or would it be a better strategy to filter the input data more stringently so as to maintain the same coverage? We are currently using the default Q20 cutoff.
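For reference, the downsampling step itself can be done deterministically without extra tools; a minimal sketch (the toy FASTQ and the drop-1-in-10 selection are assumptions; seqtk sample is the more usual choice):

```shell
# Drop every 10th FASTQ record (4 lines each) for a deterministic 90% subset;
# adjust the modulus/remainder for 80%, 70%, etc.
printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 4 5 6 7 8 9 10 > toy.fastq   # 10 toy reads
awk 'int((NR-1)/4) % 10 != 9' toy.fastq > toy.90pct.fastq
wc -l < toy.90pct.fastq
```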
Hi Haoyu/Heng Li,
Thanks for the fantastic software for assembling HiFi reads! Recently I have been running hifiasm (0.8-r279) for our plant genome assembly (~350M), and the p_ctg stats are as follows:
Total_length 460M, Max_length 51M, N50_length 33M, Total_number 2646.
The contiguity is higher than with any other assembler I have used; however, I also notice that 2542 contigs are shorter than 50 kb, about 96% of the total contig count.
I wonder why there are so many fragmented contigs, and can I remove these contigs from the p_ctg?
Thank you.
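If removing the short contigs is appropriate, the usual GFA-to-FASTA awk one-liner can be extended with a length filter; a minimal sketch on a toy GFA (the 8 bp cutoff stands in for a real threshold such as 50000):

```shell
# Toy p_ctg GFA: one "long" and one "short" contig.
printf 'S\tptg1\tACGTACGTACGT\nS\tptg2\tACGT\n' > toy.p_ctg.gfa
# Keep only S-line records whose sequence is at least `min` bases long.
awk -v min=8 '/^S/ && length($3) >= min {print ">"$2"\n"$3}' toy.p_ctg.gfa > filtered.fa
cat filtered.fa
```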
I am wondering about the GFA output: is there any documentation for the lines beginning with A? I assume they give the locations of the reads on the unitigs/contigs. I would really like to be able to extract the supporting reads for any location on a contig (for example, to find how many reads support a variant between assemblies). It would also be nice to have the CIGAR, to know exactly how each read lines up on the contig; that would save me a step and reduce the ambiguity of having to align the reads back to the assembly after the fact. Would it be easy to add the CIGAR?
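For what it's worth, the A lines can already answer the "which reads support this position" part without a CIGAR; a sketch under the assumption that the A-line columns are contig, offset, strand, read name, read start, read end (verify against your hifiasm version's output):

```shell
# Toy A lines in the assumed layout:
#   A  <contig>  <offset>  <strand>  <read>  <read-start>  <read-end>  ...
printf 'A\tptg1\t0\t+\tread1\t0\t15000\nA\tptg1\t9000\t+\tread2\t0\t14000\n' > toy.gfa
# Reads whose placement overlaps position 10000 on ptg1; the span on the
# contig is approximated as offset + aligned read length (no CIGAR available).
awk -v c=ptg1 -v p=10000 \
    '$1=="A" && $2==c && $3<=p && $3+($7-$6)>=p {print $5}' toy.gfa > reads_at_pos.txt
cat reads_at_pos.txt
```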
Hi,
I just wanted to show a comparison using simulated PacBio CLR reads corrected with Illumina reads versus simulated HiFi reads for a Drosophila melanogaster genome assembly (haploid assembly "converted" to diploid with ~2% heterozygosity).
# Use Drosophila melanogaster PacBio assembly
cd /genetics/elbers/test/fly2
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/003/401/745/GCA_003401745.1_ASM340174v1/GCA_003401745.1_ASM340174v1_genomic.fna.gz
# Use BCFtools 1.10.2 and SAMtools 1.10
conda activate bcftools1.10.2
# Use Seqtk to convert soft-masked bases to upper-case bases, also compress with bgzip
# https://github.com/lh3/seqtk
/genetics/elbers/bin/seqtk/seqtk seq -U \
GCA_003401745.1_ASM340174v1_genomic.fna.gz | \
bgzip -@75 > GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz
# Convert to diploid with approximately 2% heterozygosity rate, max indels=20bp
# mutate.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/mutate.sh \
in=GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz \
ow=t \
vcf=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.vcf.gz \
out=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
ziplevel=6 \
ploidy=2 \
subrate=0.0192 \
indelrate=0.001 \
maxindel=20 \
nohomopolymers=t \
hetrate=1 2> GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.log.txt
# genome size
bgzip -@75 -cd GCA_003401745.1_ASM340174v1_genomic.fna_upper.fasta.gz | \
grep -v ">"|wc -m
# 140687135
# 2% heterozygosity is how many bases
calc 0.02\*140687135
# 0.02*140687135 = 2813742.700000
# number of mutations added
bgzip -@75 -cd GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.vcf.gz| \
grep -v "^#"|wc -l
# 2830139
# ~2% het rate
# get only 1 haplotype from the "diploid" reference
bgzip -@75 -dc GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz|\
/genetics/elbers/bin/seqtk/seqtk seq -L0|paste - - |grep -P "haplo_0\t"| \
tr '\t' '\n' |\
/genetics/elbers/bin/seqtk/seqtk seq -L60 |\
bgzip -@75 \
> GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz
# make reference for randomreads.sh
# randomreads.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.13 pbmax=0.17 \
reads=100 paired=f \
gaussianlength=t \
minlength=1000 midlength=20000 maxlength=100000 \
out=/dev/null
# make 60x haploid coverage for Illumina reads
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
coverage=30 paired=t maxinsert=550 mininsert=450 \
out1=illumina1.fastq.gz out2=illumina2.fastq.gz > random_reads_illumina.log 2>&1
# interleave the paired-end reads
# reformat.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.86/reformat.sh \
in=illumina1.fastq.gz in2=illumina2.fastq.gz out=illumina.int.fastq 2>/dev/null
# use KmerGenie 1.7051 to get an idea of the k-mer size that produces the longest N50
# http://kmergenie.bx.psu.edu/
mkdir -p /genetics/elbers/test/fly2/kmergenie-illumina-raw-reads
cd /genetics/elbers/test/fly2/kmergenie-illumina-raw-reads
/genetics/elbers/kmergenie-1.7051/kmergenie ../illumina.int.fastq \
> kmergenie-illumina-raw-reads.log 2>&1
rm ../illumina.int.fastq
k=`grep "^best k:" \
kmergenie-illumina-raw-reads.log | grep -Po "\d+"`
echo "best k=${k}"
# make 30x haploid coverage for PacBio CLR reads
# error rate from 13 - 15 % minimum 1000bp midlength 20000bp maximum 30000bp
cd /genetics/elbers/test/fly2
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ow=t seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.13 pbmax=0.15 \
coverage=15 paired=f \
gaussianlength=t \
minlength=1000 midlength=20000 maxlength=30000 \
out=pacbio.fastq.gz > random_reads_pacbio.log 2>&1
# make 30x haploid coverage for PacBio reads for Hifi reads
# error rate from 0.1 - 1 %, minimum 9000bp, midlength 10000bp, max 12000bp
/genetics/elbers/bbmap-38.86/randomreads.sh build=1 \
ow=t seed=1 \
ref=GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.fasta.gz \
illuminanames=t addslash=t \
pacbio=t pbmin=0.001 pbmax=0.01 \
coverage=15 paired=f \
gaussianlength=t \
minlength=9000 midlength=10000 maxlength=12000 \
out=hifi.fastq.gz > random_reads_pacbio_hifi.log 2>&1
# de novo assemble Illumina reads
# with ABySS 2.2.4 using k-mer value from kmer genie
# partition command allows parallel (mpi) assembly with ABySS
# https://github.com/bcgsc/abyss/
cd /genetics/elbers/test/fly2
mkdir -p abyss
# break up Illumina read pairs into 35 paired files (70 files in total)
# to speed up assembly with ABySS mpi mode
# partition.sh part of BBTools/BBMap https://sourceforge.net/projects/bbmap/
/genetics/elbers/bbmap-38.79/partition.sh threads=75 -Xmx350g -Xms350g \
in1=illumina1.fastq.gz \
in2=illumina2.fastq.gz \
out=abyss/illumina-raw-%-R1.fasta out2=abyss/illumina-raw-%-R2.fasta \
ways=35 2>/dev/null
cd /genetics/elbers/test/fly2/abyss
# note I don't know of a simpler command for ABySS to do the following with
# 75 parallel jobs
/genetics/elbers/bin/abyss-2.2.4/build/bin/abyss-pe k=${k} \
name=illumina-raw graph=gfa1 \
lib="pe1 pe2 pe3 pe4 pe5 pe6 pe7 pe8 pe9 pe10 pe11 pe12 pe13 pe14 pe15 \
pe16 pe17 pe18 pe19 pe20 pe21 pe22 pe23 pe24 pe25 pe26 pe27 pe28 pe29 \
pe30 pe31 pe32 pe33 pe34 pe35" \
np=75 j=75 \
pe1="illumina-raw-0-R1.fasta illumina-raw-0-R2.fasta" \
pe2="illumina-raw-1-R1.fasta illumina-raw-1-R2.fasta" \
pe3="illumina-raw-2-R1.fasta illumina-raw-2-R2.fasta" \
pe4="illumina-raw-3-R1.fasta illumina-raw-3-R2.fasta" \
pe5="illumina-raw-4-R1.fasta illumina-raw-4-R2.fasta" \
pe6="illumina-raw-5-R1.fasta illumina-raw-5-R2.fasta" \
pe7="illumina-raw-6-R1.fasta illumina-raw-6-R2.fasta" \
pe8="illumina-raw-7-R1.fasta illumina-raw-7-R2.fasta" \
pe9="illumina-raw-8-R1.fasta illumina-raw-8-R2.fasta" \
pe10="illumina-raw-9-R1.fasta illumina-raw-9-R2.fasta" \
pe11="illumina-raw-10-R1.fasta illumina-raw-10-R2.fasta" \
pe12="illumina-raw-11-R1.fasta illumina-raw-11-R2.fasta" \
pe13="illumina-raw-12-R1.fasta illumina-raw-12-R2.fasta" \
pe14="illumina-raw-13-R1.fasta illumina-raw-13-R2.fasta" \
pe15="illumina-raw-14-R1.fasta illumina-raw-14-R2.fasta" \
pe16="illumina-raw-15-R1.fasta illumina-raw-15-R2.fasta" \
pe17="illumina-raw-16-R1.fasta illumina-raw-16-R2.fasta" \
pe18="illumina-raw-17-R1.fasta illumina-raw-17-R2.fasta" \
pe19="illumina-raw-18-R1.fasta illumina-raw-18-R2.fasta" \
pe20="illumina-raw-19-R1.fasta illumina-raw-19-R2.fasta" \
pe21="illumina-raw-20-R1.fasta illumina-raw-20-R2.fasta" \
pe22="illumina-raw-21-R1.fasta illumina-raw-21-R2.fasta" \
pe23="illumina-raw-22-R1.fasta illumina-raw-22-R2.fasta" \
pe24="illumina-raw-23-R1.fasta illumina-raw-23-R2.fasta" \
pe25="illumina-raw-24-R1.fasta illumina-raw-24-R2.fasta" \
pe26="illumina-raw-25-R1.fasta illumina-raw-25-R2.fasta" \
pe27="illumina-raw-26-R1.fasta illumina-raw-26-R2.fasta" \
pe28="illumina-raw-27-R1.fasta illumina-raw-27-R2.fasta" \
pe29="illumina-raw-28-R1.fasta illumina-raw-28-R2.fasta" \
pe30="illumina-raw-29-R1.fasta illumina-raw-29-R2.fasta" \
pe31="illumina-raw-30-R1.fasta illumina-raw-30-R2.fasta" \
pe32="illumina-raw-31-R1.fasta illumina-raw-31-R2.fasta" \
pe33="illumina-raw-32-R1.fasta illumina-raw-32-R2.fasta" \
pe34="illumina-raw-33-R1.fasta illumina-raw-33-R2.fasta" \
pe35="illumina-raw-34-R1.fasta illumina-raw-34-R2.fasta" \
v=-v unitigs > illumina-raw-abyss.log 2>&1
rm -f illumina-raw-?-R?.fasta illumina-raw-??-R?.fasta
# Fill in the "S" lines of the gfa file using gfakluge 1.1.2
# https://github.com/edawson/gfakluge
/genetics/elbers/gfakluge-1.1.2/gfak fillseq -f \
illumina-raw-1.fa illumina-raw-1.gfa1 > illumina-raw-1.gfa
# Error-correct the raw PacBio reads with GraphAligner 1.0.12
# https://github.com/maickrau/GraphAligner
# and the graph you made in the previous steps
conda deactivate
conda activate graphaligner
GraphAligner -t 75 -g illumina-raw-1.gfa \
-f ../pacbio.fastq.gz \
--corrected-out pacbio.raw.abyss.notclipped.fasta \
--preset dbg > pacbio.raw.abyss.notclipped.fasta.log 2>&1
conda deactivate
# Convert lower-case bases to upper-case bases
conda activate bcftools1.10.2
/genetics/elbers/bin/seqtk/seqtk seq -U \
pacbio.raw.abyss.notclipped.fasta|bgzip -@75 \
> pacbio.raw.abyss.notclipped.fasta.gz
conda deactivate
# de novo assemble with hifiasm 0.12-r304
# https://github.com/chhylp123/hifiasm
cd /genetics/elbers/test/fly2/hifiasm-hifi
# hifi reads
/genetics/elbers/bin/hifiasm/hifiasm -o hifiasm-hifi2 -t75 \
../hifi.fastq.gz --high-het > hifiasm2.log 2>&1
awk '/^S/{print ">"$2"\n"$3}' \
hifiasm-hifi2.p_ctg.gfa | fold > hifiasm-hifi2.p_ctg.fasta
# corrected reads
/genetics/elbers/bin/hifiasm/hifiasm -o hifiasm-cor -t75 \
../abyss/pacbio.raw.abyss.notclipped.fasta.gz --high-het > hifiasm3.log 2>&1
awk '/^S/{print ">"$2"\n"$3}' hifiasm-cor.p_ctg.gfa | fold \
> hifiasm-cor.p_ctg.fasta
# Run Quast 5.1.0rc1 on haplotype0 of reference and hifiasm results
# https://github.com/ablab/quast
# Remember that I converted GCA_003401745.1 to diploid with ~2% het rate
cd /genetics/elbers/test/fly2
mkdir -p quast4
cd /genetics/elbers/test/fly2/quast4
conda activate quast5.0.2
/genetics/elbers/quast-5.1.0rc1/quast.py --fragmented --fast --threads 75 \
-r ../GCA_003401745.1_ASM340174v1_genomic.fna_upper.diploid.haplotype0.fasta.gz \
--eukaryote -o ./ ../hifiasm-hifi/hifiasm-hifi2.p_ctg.fasta ../hifiasm-hifi/hifiasm-cor.p_ctg.fasta \
> quast.log 2>&1
conda deactivate
column -ts $'\t' report.tsv
Assembly hifiasm_hifi2.p_ctg hifiasm_cor.p_ctg
# contigs (>= 0 bp) 54 297
# contigs (>= 1000 bp) 54 297
# contigs (>= 5000 bp) 54 297
# contigs (>= 10000 bp) 54 297
# contigs (>= 25000 bp) 51 297
# contigs (>= 50000 bp) 38 267
Total length (>= 0 bp) 140894018 126334800
Total length (>= 1000 bp) 140894018 126334800
Total length (>= 5000 bp) 140894018 126334800
Total length (>= 10000 bp) 140894018 126334800
Total length (>= 25000 bp) 140857808 126334800
Total length (>= 50000 bp) 140365172 125113430
# contigs 54 297
Largest contig 30137170 3306885
Total length 140894018 126334800
Reference length 140694204 140694204
Reference GC (%) 42.05 42.05
N50 23562678 821324
NG50 23562678 741584
N90 2232689 180295
NG90 2232689 -
L50 3 43
LG50 3 52
L90 6 156
LG90 6 -
# misassemblies 35 33
# misassembled contigs 16 31
Misassembled contigs length 82394091 24116595
# local misassemblies 1 172
# scaffold gap ext. mis. 0 0
# scaffold gap loc. mis. 0 0
# unaligned mis. contigs 0 0
# unaligned contigs 0 + 0 part 0 + 15 part
Unaligned length 0 22898
Genome fraction (%) 97.102 89.348
Duplication ratio 1.029 1.004
# N's per 100 kbp 0.00 0.00
# mismatches per 100 kbp 1378.10 1018.66
# indels per 100 kbp 69.40 94.81
Largest alignment 27942815 3306870
Total aligned length 140637770 126237320
NA50 22378608 761401
NGA50 22378608 684993
NA90 1358953 171380
NGA90 1358953 -
LA50 3 48
LGA50 3 58
LA90 9 174
LGA90 9 -
# Analyze quality scores of reads and assemblies using
# YAK version r55 - yet another k-mer analyzer
# https://github.com/lh3/yak
cd /genetics/elbers/test/fly2
/genetics/elbers/yak/yak count -b37 -t75 -o raw.yak \
<(bgzip -@75 -cd illumina1.fastq.gz) \
<(bgzip -@75 -cd illumina2.fastq.gz) > sr.raw.yak.log 2>&1
cd /genetics/elbers/test/fly2
for ref in `ls -1 abyss/pacbio.raw.abyss.notclipped.fasta.gz hifi.fastq.gz hifiasm-hifi/hifiasm-hifi2.p_ctg.fasta hifiasm-hifi/hifiasm-cor.p_ctg.fasta`;do
/genetics/elbers/yak/yak qv -t75 /genetics/elbers/test/fly2/raw.yak ${ref} 2>/dev/null |tail -n +1034 >> test2
done
for ref in `ls -1 abyss/pacbio.raw.abyss.notclipped.fasta.gz hifi.fastq.gz hifiasm-hifi/hifiasm-hifi2.p_ctg.fasta hifiasm-hifi/hifiasm-cor.p_ctg.fasta`;do
echo $ref >> test3
done
paste test3 <(cut -f 2-3 test2) |sort -k2,2 | \
sed '1i Sequences_or_assembly\traw_quality_value\tadjusted_quality_value' |column -tx
Sequences_or_assembly raw_quality_value adjusted_quality_value
hifi.fastq.gz 22.816 22.820
abyss/pacbio.raw.abyss.notclipped.fasta.gz 33.499 33.522
hifiasm-hifi/hifiasm-cor.p_ctg.fasta 36.636 36.687
hifiasm-hifi/hifiasm-hifi2.p_ctg.fasta 49.417 50.800
Update added program links 10:30 CET 1 Sep 2020
Hi,
I am using hifiasm v0.11 in trio mode and supply lists of hap1/hap2 read names via the -3 and -4 command line options. Do I understand correctly that these lists should only contain the names of haplotype-specific reads? In other words, reads that are in the input FASTQ but not in either of the two read-name files are treated by hifiasm as not haplotype-specific and will be used in the assembly of both haplotypes. Thanks for confirming or clarifying how to properly use the -3/-4 options.
Best,
Peter
Hi, @chhylp123
Since hifiasm performs three rounds of correction on the HiFi reads, the quality of the default assembly is already high. Do I still need to map the HiFi fastq back to the assembly and polish with racon?
Thanks.
Zhigui Bao
Hi,
I am trying to assemble metagenomic HiFi reads. Is hifiasm suited for metagenome assembly, and if so, do you have any recommended settings for that purpose?
Best,
Jean
Hi,
Canu (https://github.com/marbl/canu) assembles a genome in three steps: correct, trim, and assemble. The first step generates a file called correctedReads.fasta and the second a file called trimmedReads.fasta. I am wondering: is it OK or acceptable to use the Canu-generated trimmed reads as input for hifiasm?
I am asking because I want to assemble a genome using both raw PacBio CLR reads and PacBio HiFi reads. Canu/HiCanu did not work very well in this situation (a lower N50 than either of the two independent assemblies), so I would like to switch to hifiasm (whose resulting assembly has a better N50 than the two independent assemblies).
Leiting
Hi,
Recently I used hifiasm (0.8-r279) to assemble redwood (data from https://sra-download.ncbi.nlm.nih.gov/traces/sra34/SRR/010948/) with option -t 80. The result is 35G, with an N50 of 5.4M.
However, when I map the CCS reads back to the assembled genome with minimap2 (minimap2-2.17 (r941), --secondary=no -x asm20), I find that almost every CCS read aligns to several contigs, and the dv values are in the range 0.1~0.2, similar to an ONT alignment.
I think this is abnormal; I have also used minimap2 to map other CCS reads to their corresponding contigs (also assembled with hifiasm), and there the dv values are all about 0.005. So I want to know: has something really gone wrong, and can you give me any advice to avoid this? Thank you!
Just checking that this makes sense: hifiasm gives a bigger N50 with --high-het (or with the default purge_dups settings; the output is identical either way) than with -l0, for the p_ctg produced by either option. It seems like it should be the opposite? All other settings are defaults: hifiasm -p prefix --high-het (or -l0) hifi.reads.fasta.gz with hifiasm 0.13-r307.
Hello,
When the hifiasm software assembles the genome, we get the *.a_ctg.gfa and *.p_utg.noseq.gfa files, but I want to know the positions of the *.a_ctg.gfa sequences within the *.p_utg.noseq.gfa sequences. How can I get this?
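One way to get those positions is to convert both GFAs to FASTA and align one set against the other; a sketch (the minimap2 preset is an assumption, and the call is left as a comment since it needs minimap2 installed):

```shell
# Toy GFAs stand in for the real hifiasm outputs.
printf 'S\tutg1\tACGTACGTACGTACGT\n' > p_utg.gfa
printf 'S\tatg1\tACGTACGT\n' > a_ctg.gfa
# Convert S lines to FASTA (name in field 2, sequence in field 3):
awk '/^S/{print ">"$2"\n"$3}' p_utg.gfa > p_utg.fa
awk '/^S/{print ">"$2"\n"$3}' a_ctg.gfa > a_ctg.fa
# Needs minimap2; target name and coordinates end up in PAF columns 6-9:
#   minimap2 -x asm5 p_utg.fa a_ctg.fa > a_on_putg.paf
```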
Best wishes
Hi,
I want to know the difference between "prefix.hap1.p_ctg.gfa" and "prefix.hap1.a_ctg.gfa": is there any difference between them? If I do not need to distinguish haplotypes, can I consider "prefix.hap1.p_ctg.gfa" the final assembly result? Thanks a lot!
Shouldn't
awk '/^S/{print ">"$1;print $2}' test.p_ctg.gfa > test.p_ctg.fa # get primary contigs in FASTA
be
awk '/^S/{print ">"$2;print $3}' test.p_ctg.gfa > test.p_ctg.fa # get primary contigs in FASTA
?
In the README.md for commit 55f95c0?
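A quick check of the field indexing on a toy S line:

```shell
# An S line is: S <name> <sequence> [tags]; $1 is the literal "S",
# $2 the name, $3 the sequence -- so the $2/$3 version is the correct one.
printf 'S\tptg000001l\tACGT\tLN:i:4\n' > toy.gfa
awk '/^S/{print ">"$2;print $3}' toy.gfa > toy.fa
cat toy.fa
```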
Is there an easy way to extract the coverage for each unitig/contig? I have a 5 Gb plant genome that ballooned to a 21 Gb assembly from soil contamination. I am hoping I don't have to align the reads to the assembly to get the coverage and find the ~30x contigs. I am assuming the GFA files only list the reads in the tiling path; is that right? And I am assuming that -n only applies to tiling-path reads?
Thanks,
KF
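A rough per-contig coverage can be estimated from the tiling-path A lines alone (so it will undercount true read depth); a sketch assuming A-line columns contig/offset/strand/read/start/end and contig length taken from the S line's LN:i: tag in the noseq GFA:

```shell
# Toy noseq GFA: one contig of 20000 bp with two 15000 bp read placements.
printf 'S\tptg1\t*\tLN:i:20000\nA\tptg1\t0\t+\tr1\t0\t15000\nA\tptg1\t5000\t+\tr2\t0\t15000\n' > toy.noseq.gfa
# Sum aligned read bases per contig and divide by the contig length.
awk '$1=="S"{for(i=4;i<=NF;i++) if($i~/^LN:i:/){sub("LN:i:","",$i); len[$2]=$i}}
     $1=="A"{bases[$2]+=$7-$6}
     END{for(c in len) printf "%s\t%.1f\n", c, bases[c]/len[c]}' toy.noseq.gfa > cov.txt
cat cov.txt
```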
Hi,
Thanks for developing this cool tool.
Any recommended parameter set to adjust for a large genome assembly with low-coverage HiFi data (15~20x)? Like -k?
Thanks!
It appears that much of the error in hifiasm assemblies is associated with unsupported reads integrated into the tiling path. They can be found by aligning the original HiFi reads or the corrected reads to the assembly and looking for clusters of SNPs/indels. How could this be addressed within hifiasm? How do such reads survive correction?
Thanks,
KF
I am using hifiasm for our plant genome assembly and noticed that, after updating from hifiasm v0.3 to v0.5, the assembly N50 increased, but two apparent misassemblies in my genome caught my attention.
I highlight one in the images below. We have a tetraploid plant of ~400MB haploid size, and we have >100x HiFi coverage of the haploid genome, i.e. ~25x per haplotype.
I have already assembled this genome with HiCanu, and noticed there too that changing the bogart (assembly) parameters too much easily introduces misassemblies. As far as I know, I can't alter the equivalent settings for hifiasm.
In this image you see the alignment of the contigs to a related public reference, with the assembly error:
If I look at the same positions (zoomed in) but for the unitigs, I see that the unitigs do not contain this error:
Are there any ways to make the contigging more stringent? (Version 0.3 did not have this error yet.)
I have to say I am not 100% sure this is not a true biological case, but I am confident it is an assembly error, also from what I've seen with HiCanu. If you need more info, let me know.
Hi,
I used the hifiasm trio mode to phase my genome into maternal and paternal assemblies. The first step is using yak to generate the trio indexes. When I used yak count -b37 -t16 -k21 versus yak count -b37 -t16 -k19, the resulting genome sizes were different: -k21 gives a larger genome size than -k19 with the same hifiasm parameters otherwise. So is there any formula, like Canu's trio mode has, that can suggest the best k-mer size?
Thanks!
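For intuition on why the k-mer size shifts the estimate: the usual back-of-envelope genome size is total non-error k-mers divided by the homozygous peak depth, and both quantities move with k. A toy sketch (histogram values and the depth>=5 error cutoff are made up):

```shell
# Toy k-mer histogram: "<depth> <count>" pairs; depth 1 simulates error k-mers.
printf '1 100\n30 50\n31 60\n32 55\n' > toy.hist
# Genome size ~ sum(depth*count over non-error depths) / peak depth.
awk '$1>=5 {tot += $1*$2; if ($2>max) {max=$2; peak=$1}}
     END {printf "%d\n", tot/peak}' toy.hist > est.txt
cat est.txt
```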
Hi,
I am working with a highly heterozygous genome. My genome size is ~400MB with 9 chromosomes.
Hifiasm gives me a single contig of 731.928 MB.
This is the stats:
Main genome scaffold total: 1
Main genome contig total: 1
Main genome scaffold sequence total: 731.928 MB
Main genome contig sequence total: 731.927 MB 0.000% gap
Main genome scaffold N/L50: 1/731.928 MB
Main genome contig N/L50: 1/731.927 MB
Main genome scaffold N/L90: 1/731.928 MB
Main genome contig N/L90: 1/731.927 MB
Max scaffold length: 731.928 MB
Max contig length: 731.927 MB
Number of scaffolds > 50 KB: 1
% main genome in scaffolds > 50 KB: 100.00%
It does not look right.
Any help, would be great.
Hi, @chhylp123
hifiasm works great on all my plant genome assemblies; it produces more contiguous and complete genomes with less resource usage.
But I recently assembled a fish genome with hifiasm, and it has a low contig N50.
Here is all the information I have. Do you have any suggestions for tuning?
GenomeScope estimate from the survey data: 789M, 0.517% het, 49.3% repeats
HiFi data: 33.6G from two SMRT cells, CCS N50 20,226, Q30 93.96
Assembly stat:
The k-mer histogram from the HiFi data seems different from the GenomeScope one.
Stat | hifiasm(0.11) | wtdbg2(2.4) | flye(2.8) |
---|---|---|---|
genomesize | 919821446 | 870750720 | 1.15E+09 |
contigNumber | 2313 | 4435 | 16179 |
contigN50 | 1152427 | 823296 | 324539 |
contigL50 | 214 | 280 | 831 |
hifiasm -o fish -t 28 fish.hifi.fastq.gz
fish.p_ctg.fa: C:93.3%[S:88.6%,D:4.7%],F:2.0%,M:4.7%,n:4584
I have tried BLASTing 50,000 of the HiFi reads; they mostly mapped to fish sequences in the NR database.
Typically, fish genomes have more simple repeats (~8%) than plants (<2%).
Here is the log of hifiasm
fish.hifiasm.log
Here is the raw unitig graph (fish.p_utg.noseq.gfa)
fish.p_utg.noseq.gfa.zip
Thanks,
Zhigui Bao
Hi,
Thanks for the great software.
I am running hifiasm (0.7/0.8) to assemble a huge genome (more than 40Gb) with 1.4Tb of CCS data on a 4Tb node; I assume a 4Tb node is enough, given that the 35.6 Gb redwood genome only needed 699Gb of memory.
Unfortunately, I failed to complete the job: after the index step, the error is 'Segmentation fault (core dumped)'. Any ideas to help me out?
Thanks.
In earlier versions hifiasm stored the corrected reads in a FASTA file; now it seems to store them in a binary file. Is there any way to retrieve them back into FASTA format? For downstream analysis, it is a very nice feature to have these corrected reads.
Thank you for the nice and very efficient assembler.
I assembled a highly heterozygous genome, and hifiasm did very well at separating primary contigs and haplotigs (the p_ctg size is only a bit larger than the expected genome size, and the a_ctg length is 90% of the p_ctg size).
I ran Merqury for QC using short-read k-mers from the same sample. I consistently get a higher consensus QV for the a_ctg (QV ~49) than for the p_ctg (QV ~37), independently of the parameters I use in hifiasm.
Is it expected?
Are there some hifiasm options that could improve QV of p_ctg?
Thanks
Hi,
I have used hifiasm to try to assemble a few mammals and fish and have run into problems with mis-joins in the assembly. These issues are not detected until we map Hi-C data and attempt to scaffold the contigs, and they often force us to make manual cuts in the contigs. Below is an example Hi-C map of 3 contigs scaffolded into one chromosome. I have highlighted the beginning and end of the middle contig with blue arrows; there appears to be an issue at its start.
It is also quite easy to see which scaffold this contig belongs to by looking at the Hi-C interactions with another scaffold:
I have highlighted the contig of interest here in the p_ctg gfa, which is around 16Mb in length and this foreign region makes up the last ~600kb.
For completeness, here is the p_utg GFA; however, I am not sure how to trace back to our contig of interest, as the graph is very difficult to interpret.
I mapped our ccs reads back to the assembly and found our break point, which shows a drop in coverage and a potential separation of the reads into the two haplotypes, although this may also be an issue with the underlying sequence in this region and the mapping.
I ran hifiasm v0.13 without any additional arguments and performed mapping of the CCS reads using pbmm2 with the CCS preset. We had around 36x coverage of this genome.
It would be great to hear whether you have seen any similar issues and whether we can tune the parameters to avoid them.
Many thanks,
Tom
I have a couple of datasets like the one below, where there is an initial, possibly erroneous peak at k-mer frequency 3, which then leads to hifiasm dropping most of the k-mers for the assembly. I've tried -D10, but there is no real change in the result.
[M::ha_analyze_count] lowest: count[2] = 6069143
[M::ha_analyze_count] highest: count[3] = 7235226
[M::ha_hist_line] 2: ************************************************************************************ 6069143
[M::ha_hist_line] 3: **************************************************************************************************** 7235226
[M::ha_hist_line] 4: ********************************************************************************************** 6793566
[M::ha_hist_line] 5: ******************************************************************************** 5819678
[M::ha_hist_line] 6: **************************************************************** 4658271
[M::ha_hist_line] 7: ************************************************ 3500309
[M::ha_hist_line] 8: ************************************ 2612406
[M::ha_hist_line] 9: **************************** 2006019
[M::ha_hist_line] 10: ************************ 1740065
[M::ha_hist_line] 11: ************************ 1751566
[M::ha_hist_line] 12: ************************** 1881720
[M::ha_hist_line] 13: **************************** 2059802
[M::ha_hist_line] 14: ********************************* 2414617
[M::ha_hist_line] 15: *************************************** 2817786
[M::ha_hist_line] 16: ********************************************* 3247131
[M::ha_hist_line] 17: *************************************************** 3675817
[M::ha_hist_line] 18: ******************************************************* 4007673
[M::ha_hist_line] 19: *********************************************************** 4263849
[M::ha_hist_line] 20: ************************************************************* 4401881
[M::ha_hist_line] 21: ************************************************************* 4444681
[M::ha_hist_line] 22: ************************************************************* 4392425
[M::ha_hist_line] 23: ********************************************************** 4207993
[M::ha_hist_line] 24: ******************************************************* 3960917
[M::ha_hist_line] 25: ************************************************** 3600819
[M::ha_hist_line] 26: ********************************************* 3219978
[M::ha_hist_line] 27: *************************************** 2835019
[M::ha_hist_line] 28: *********************************** 2558213
[M::ha_hist_line] 29: ******************************* 2256288
[M::ha_hist_line] 30: **************************** 2060429
[M::ha_hist_line] 31: ************************** 1901848
[M::ha_hist_line] 32: ************************** 1851258
[M::ha_hist_line] 33: ************************** 1850593
[M::ha_hist_line] 34: ************************** 1892168
[M::ha_hist_line] 35: **************************** 1993563
[M::ha_hist_line] 36: ***************************** 2109533
[M::ha_hist_line] 37: ****************************** 2201966
[M::ha_hist_line] 38: ******************************** 2342791
[M::ha_hist_line] 39: ********************************** 2440881
[M::ha_hist_line] 40: *********************************** 2547906
[M::ha_hist_line] 41: ************************************ 2628198
[M::ha_hist_line] 42: ************************************* 2673573
[M::ha_hist_line] 43: ************************************** 2717951
[M::ha_hist_line] 44: ************************************* 2669472
[M::ha_hist_line] 45: ************************************ 2597238
[M::ha_hist_line] 46: *********************************** 2496652
[M::ha_hist_line] 47: ********************************* 2375755
[M::ha_hist_line] 48: ****************************** 2193939
[M::ha_hist_line] 49: **************************** 2037768
[M::ha_hist_line] 50: ************************** 1886916
[M::ha_hist_line] 51: ************************ 1701060
[M::ha_hist_line] 52: ********************* 1501367
[M::ha_hist_line] 53: ****************** 1323015
[M::ha_hist_line] 54: **************** 1158475
[M::ha_hist_line] 55: ************** 985960
[M::ha_hist_line] 56: ************ 848042
[M::ha_hist_line] 57: ********** 720656
[M::ha_hist_line] 58: ******** 613567
[M::ha_hist_line] 59: ******* 522236
[M::ha_hist_line] 60: ****** 421345
[M::ha_hist_line] 61: ***** 345199
[M::ha_hist_line] 62: **** 282627
[M::ha_hist_line] 63: *** 234878
[M::ha_hist_line] 64: *** 196966
[M::ha_hist_line] 65: ** 167464
[M::ha_hist_line] 66: ** 140883
[M::ha_hist_line] 67: ** 124071
[M::ha_hist_line] 68: ** 109375
[M::ha_hist_line] 69: * 93330
[M::ha_hist_line] 70: * 83896
[M::ha_hist_line] 71: * 78482
[M::ha_hist_line] 72: * 70598
[M::ha_hist_line] 73: * 65065
[M::ha_hist_line] 74: * 60794
[M::ha_hist_line] 75: * 58992
[M::ha_hist_line] 76: * 57068
[M::ha_hist_line] 77: * 56897
[M::ha_hist_line] 78: * 54293
[M::ha_hist_line] 79: * 54742
[M::ha_hist_line] 80: * 53652
[M::ha_hist_line] 81: * 52653
[M::ha_hist_line] 82: * 53446
[M::ha_hist_line] 83: * 52509
[M::ha_hist_line] 84: * 53028
[M::ha_hist_line] 85: * 52292
[M::ha_hist_line] 86: * 53718
[M::ha_hist_line] 87: * 53726
[M::ha_hist_line] 88: * 53380
[M::ha_hist_line] 89: * 54470
[M::ha_hist_line] 90: * 55626
[M::ha_hist_line] 91: * 54319
[M::ha_hist_line] 92: * 54240
[M::ha_hist_line] 93: * 56616
[M::ha_hist_line] 94: * 56854
[M::ha_hist_line] 95: * 55967
[M::ha_hist_line] 96: * 57542
[M::ha_hist_line] 97: * 58223
[M::ha_hist_line] 98: * 57932
[M::ha_hist_line] 99: * 59190
[M::ha_hist_line] 100: * 60010
[M::ha_hist_line] 101: * 58622
[M::ha_hist_line] 102: * 58082
[M::ha_hist_line] 103: * 58887
[M::ha_hist_line] 104: * 57855
[M::ha_hist_line] 105: * 53057
[M::ha_hist_line] 106: * 52710
[M::ha_hist_line] 107: * 53964
[M::ha_hist_line] 108: * 54917
[M::ha_hist_line] 109: * 54081
[M::ha_hist_line] 110: * 53832
[M::ha_hist_line] 111: * 51629
[M::ha_hist_line] 112: * 47683
[M::ha_hist_line] 113: * 45744
[M::ha_hist_line] 114: * 40777
[M::ha_hist_line] 115: * 39379
[M::ha_hist_line] 116: * 37060
[M::ha_hist_line] 117: * 37442
[M::ha_hist_line] rest: **************************** 2031545
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_ft_gen] peak_hom: 3; peak_het: -1
[M::ha_ft_gen::527.803*[email protected]] ==> filtered out 117538870 k-mers occurring 15 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::649.205*6.83] ==> counted 14181699 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[9] = 131849
[M::ha_analyze_count] highest: count[14] = 260899
[M::ha_hist_line] 1: ****************************************************************************************************> 10770336
[M::ha_hist_line] 2: ****************************************************************************************************> 711398
[M::ha_hist_line] 3: ****************************************************************************************************> 477310
[M::ha_hist_line] 4: ****************************************************************************************************> 357784
[M::ha_hist_line] 5: ****************************************************************************************************> 279986
[M::ha_hist_line] 6: *************************************************************************************** 226170
[M::ha_hist_line] 7: ******************************************************************** 178593
[M::ha_hist_line] 8: ********************************************************* 148284
[M::ha_hist_line] 9: *************************************************** 131849
[M::ha_hist_line] 10: *************************************************** 132070
[M::ha_hist_line] 11: ******************************************************** 146009
[M::ha_hist_line] 12: *************************************************************** 165522
[M::ha_hist_line] 13: *************************************************************************** 195489
[M::ha_hist_line] 14: **************************************************************************************************** 260899
[M::ha_hist_line] rest: 0
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 14; peak_het: -1
[M::ha_pt_gen::752.412*8.78] ==> indexed 21772882 positions
[M::ha_assemble::949.210*[email protected]] ==> corrected reads for round 1
[M::ha_assemble] # bases: 8221544631; # corrected bases: 6132722; # recorrected bases: 18957
[M::ha_assemble] size of buffer: 0.512GB
[M::ha_pt_gen::1052.394*13.53] ==> counted 11170165 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[10] = 121171
[M::ha_analyze_count] highest: count[14] = 221771
[M::ha_hist_line] 1: ****************************************************************************************************> 7972140
[M::ha_hist_line] 2: ****************************************************************************************************> 562090
[M::ha_hist_line] 3: ****************************************************************************************************> 418851
[M::ha_hist_line] 4: ****************************************************************************************************> 345446
[M::ha_hist_line] 5: ****************************************************************************************************> 274879
[M::ha_hist_line] 6: ****************************************************************************************************> 223168
[M::ha_hist_line] 7: ******************************************************************************* 175033
[M::ha_hist_line] 8: ***************************************************************** 144146
[M::ha_hist_line] 9: ********************************************************* 126427
[M::ha_hist_line] 10: ******************************************************* 121171
[M::ha_hist_line] 11: ************************************************************ 133867
[M::ha_hist_line] 12: ******************************************************************** 150913
[M::ha_hist_line] 13: ***************************************************************************** 171504
[M::ha_hist_line] 14: **************************************************************************************************** 221771
[M::ha_hist_line] 15: ******************************************** 97971
[M::ha_hist_line] 16: *********** 24641
[M::ha_hist_line] 17: ** 4875
[M::ha_hist_line] rest: * 1272
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 14; peak_het: -1
[M::ha_pt_gen::1155.817*14.20] ==> indexed 21791689 positions
[M::ha_assemble::1324.871*[email protected]] ==> corrected reads for round 2
[M::ha_assemble] # bases: 8219204883; # corrected bases: 755893; # recorrected bases: 27755
[M::ha_assemble] size of buffer: 0.492GB
[M::ha_pt_gen::1425.545*16.32] ==> counted 10883111 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[10] = 120900
[M::ha_analyze_count] highest: count[14] = 219799
[M::ha_hist_line] 1: ****************************************************************************************************> 7720292
[M::ha_hist_line] 2: ****************************************************************************************************> 538136
[M::ha_hist_line] 3: ****************************************************************************************************> 408123
[M::ha_hist_line] 4: ****************************************************************************************************> 341917
[M::ha_hist_line] 5: ****************************************************************************************************> 275581
[M::ha_hist_line] 6: ****************************************************************************************************> 223669
[M::ha_hist_line] 7: ******************************************************************************* 174658
[M::ha_hist_line] 8: ****************************************************************** 144019
[M::ha_hist_line] 9: ********************************************************* 125414
[M::ha_hist_line] 10: ******************************************************* 120900
[M::ha_hist_line] 11: ************************************************************ 132795
[M::ha_hist_line] 12: ******************************************************************** 149423
[M::ha_hist_line] 13: ***************************************************************************** 169641
[M::ha_hist_line] 14: **************************************************************************************************** 219799
[M::ha_hist_line] 15: *********************************************** 102332
[M::ha_hist_line] 16: ************* 27982
[M::ha_hist_line] 17: *** 6462
[M::ha_hist_line] 18: * 1426
[M::ha_hist_line] rest: 542
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 14; peak_het: -1
[M::ha_pt_gen::1528.858*16.65] ==> indexed 21765787 positions
[M::ha_assemble::1695.380*[email protected]] ==> corrected reads for round 3
[M::ha_assemble] # bases: 8219055948; # corrected bases: 258593; # recorrected bases: 27049
[M::ha_assemble] size of buffer: 0.480GB
[M::ha_pt_gen::1798.606*17.92] ==> counted 10811427 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[10] = 120732
[M::ha_analyze_count] highest: count[14] = 219552
[M::ha_hist_line] 1: ****************************************************************************************************> 7658951
[M::ha_hist_line] 2: ****************************************************************************************************> 531565
[M::ha_hist_line] 3: ****************************************************************************************************> 404901
[M::ha_hist_line] 4: ****************************************************************************************************> 340623
[M::ha_hist_line] 5: ****************************************************************************************************> 275702
[M::ha_hist_line] 6: ****************************************************************************************************> 223799
[M::ha_hist_line] 7: ******************************************************************************* 174452
[M::ha_hist_line] 8: ****************************************************************** 144051
[M::ha_hist_line] 9: ********************************************************* 125201
[M::ha_hist_line] 10: ******************************************************* 120732
[M::ha_hist_line] 11: ************************************************************ 132550
[M::ha_hist_line] 12: ******************************************************************** 149277
[M::ha_hist_line] 13: ***************************************************************************** 168969
[M::ha_hist_line] 14: **************************************************************************************************** 219552
[M::ha_hist_line] 15: *********************************************** 103278
[M::ha_hist_line] 16: ************* 28638
[M::ha_hist_line] 17: *** 6901
[M::ha_hist_line] 18: * 1597
[M::ha_hist_line] rest: 688
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: none
[M::ha_pt_gen] peak_hom: 14; peak_het: -1
[M::ha_pt_gen::1905.213*18.06] ==> indexed 21755852 positions
[M::ha_assemble::1995.330*[email protected]] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 3361840
[M::ha_print_ovlp_stat] # strong overlaps: 224248
[M::ha_print_ovlp_stat] # weak overlaps: 3137592
[M::ha_print_ovlp_stat] # exact overlaps: 2502940
[M::ha_print_ovlp_stat] # inexact overlaps: 858900
[M::ha_print_ovlp_stat] # overlaps without large indels: 3346936
[M::ha_print_ovlp_stat] # reverse overlaps: 339323
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: 12
[M::purge_dups] purge duplication coverage threshold: 12
[M::adjust_utg_by_primary] primary contig coverage range: [7, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
[M::main] Version: 0.12-r304
[M::main] CMD: /software/vertres/bin-external/hifiasm-0.12/hifiasm -t28 -o wrPisGeom1 /lustre/scratch116/tol/projects/darwin/data/annelids/Piscicola_geometra/working/wrPisGeom1.hicanu.20201015/wrPisGeom1.trimmedReads.fasta.gz
[M::main] Real time: 2020.912 sec; CPU: 36935.352 sec; Peak RSS: 18.649 GB
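As an aside, histograms like the one in the log above can be pulled into a plottable two-column table with a one-liner. This is just a convenience sketch over the log format shown (`[M::ha_hist_line] <occ>: <bars> <count>`); the function name is made up:

```shell
# Extract "occurrence count" pairs from hifiasm's ha_hist_line records;
# field 2 is the occurrence (with a trailing colon), the last field is
# the raw count.
hist_table() {
  awk '/\[M::ha_hist_line\]/ { sub(":", "", $2); print $2, $NF }' "$1"
}
# usage: hist_table hifiasm.log.txt > hist.txt   (then plot hist.txt)
```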
Dear Haoyu,
It has been a very nice experience using hifiasm. Thanks for your excellent work.
May I ask for your suggestion on how to combine HiC data with hifiasm assemblies in general, for both phasing and scaffolding?
Many thanks in advance.
Kind regards,
Weihong
Hello,
Thanks for this very useful tool. One missing feature (or perhaps I just could not find it?) is that there is no coverage-depth information available in the GFAs produced. Is there a way to access this information (or may I request that it be added in a future update of hifiasm)?
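Until such a feature exists, an approximate per-unitig depth can be recovered from the GFA itself, since each A-line records one read laid out on the unitig. This is a sketch, not an official hifiasm feature; the field positions (an `LN:i:` length tag on S-lines, read start/end in columns 6 and 7 of A-lines) are assumed from hifiasm 0.12's GFA output and should be verified against your files:

```shell
# Approximate depth = (total read bases laid on the unitig) / (unitig length).
gfa_depth() {
  awk '
    $1 == "S" { for (i = 3; i <= NF; i++)
                  if ($i ~ /^LN:i:/) { sub("LN:i:", "", $i); len[$2] = $i } }
    $1 == "A" { cov[$2] += $7 - $6 }   # one read span per A-line
    END { for (u in len) printf "%s\t%.1f\n", u, cov[u] / len[u] }
  ' "$1"
}
# usage: gfa_depth asm.p_ctg.gfa
```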
Hi
I am running HIfiasm on a dataset of D.melanogaster HiFi reads that was downloaded from NCBI.
I am using this command line:
time hifiasm -o -t 36
It was meant to create a folder in my home directory and place the generated files in it. However, after hours of running I get:
'writing reads to disk ....
Segmentation fault (core dumped)'
There is no new folder created and I cannot find the specified output files.
What am I doing wrong? I am rather new to assembling these genomes.
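For reference, the problem is most likely the command itself: `-o` consumes the next token as its argument, so in `hifiasm -o -t 36` the output prefix becomes the literal string `-t` and `36` is treated as the input file. hifiasm uses its own option parser (ketopt), but POSIX `getopts` shows the same behaviour; the following is a hypothetical demonstration, not hifiasm code:

```shell
# Mimic option handling for "-o PREFIX -t THREADS INPUT..." and show
# how "-o -t 36" is misparsed.
parse() {
  OPTIND=1; prefix=; threads=
  while getopts "o:t:" opt; do
    case $opt in
      o) prefix=$OPTARG ;;
      t) threads=$OPTARG ;;
    esac
  done
  shift $((OPTIND - 1))
  echo "prefix=$prefix threads=${threads:-unset} input=${1:-none}"
}
parse -o -t 36   # prefix becomes "-t"; "36" is taken as the input file
```

A working invocation needs an explicit prefix, e.g. `hifiasm -o dmel.asm -t 36 reads.fq.gz` (placeholder filenames).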
Hi, thanks a lot for the great assembler.
I am trying to assemble a homozygous plant genome of 1.6 Gb with 40X CCS reads. Interestingly, this plant seems to have accumulated large segmental duplications that share high similarity, in some cases over 95%. We have tried hifiasm with and without the purging option (-l0). The assembly looks much better without the purging step (N50 20 Mb) than with it (N50 12 Mb). However, even without purging, some regions still seem to contain collapsed segmental duplications, as checked by mapping the reads back: some regions have double coverage.
Any parameter suggestion to improve the assembly?
Thanks!
André
Hi,
I'm running into some trouble running hifiasm on CCS reads. Specifically, whenever I run the program it proceeds for a while without ever outputting any kind of error message (not even a segmentation fault). When it finishes, it creates all the output files, but they're all empty except for the .bin files. I am working with a species with a big genome (estimated size >7 Gb), and I have 260 Gbp of data (average subread length 15 kb), so coverage should be over 30x.
The command I used was:
./hifiasm -o assembly.asm -t 40 -f38 -z20 subreads.fq.gz
I tried varying the -f option and excluding the -z option, but the results remain unchanged. I ran this both on a computer with 1 TB of RAM and on a different server with Slurm, allocating 200 GB to the process; the results were the same on both machines. I ran the program with the test data and everything went well, so I assume the installation is fine. The output mentions zero overlaps; is this a sign something went wrong with the sequencing, or is this a problem with the options I'm using?
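One quick sanity check before tuning options: zero overlaps can also happen when the input is not actually HiFi-length CCS data (e.g. raw subreads fed in by mistake), so it is worth confirming read count and mean length first. A small helper, assuming plain 4-line FASTQ records:

```shell
# Report read count and mean read length of a FASTQ file.
fq_stats() {
  awk 'NR % 4 == 2 { n++; s += length($0) }
       END { printf "reads=%d mean_len=%.0f\n", n, s / n }' "$1"
}
# usage: zcat subreads.fq.gz | fq_stats /dev/stdin
```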
Thanks for the help!
[M::ha_analyze_count] lowest: count[86] = 1600
[M::ha_analyze_count] highest: count[87] = 1611
[M::ha_hist_line] 2: ****************************************************************************************************> 40070115
[M::ha_hist_line] 3: ****************************************************************************************************> 9280334
[M::ha_hist_line] 4: ****************************************************************************************************> 3692809
[M::ha_hist_line] 5: ****************************************************************************************************> 1934045
[M::ha_hist_line] 6: ****************************************************************************************************> 1185357
[M::ha_hist_line] 7: ****************************************************************************************************> 796019
[M::ha_hist_line] 8: ****************************************************************************************************> 570160
[M::ha_hist_line] 9: ****************************************************************************************************> 427354
[... lines omitted ...]
[M::ha_analyze_count] left: none
[M::ha_analyze_count] right: count[44] = 376
[M::ha_pt_gen] peak_hom: 44; peak_het: 40
[M::ha_pt_gen::1307.635*18.51] ==> indexed 7590996 positions
[M::ha_assemble::1355.419*[email protected]] ==> found overlaps for the final round
[M::ha_print_ovlp_stat] # overlaps: 0
[M::ha_print_ovlp_stat] # strong overlaps: 0
[M::ha_print_ovlp_stat] # weak overlaps: 0
[M::ha_print_ovlp_stat] # exact overlaps: 0
[M::ha_print_ovlp_stat] # inexact overlaps: 0
[M::ha_print_ovlp_stat] # overlaps without large indels: 0
[M::ha_print_ovlp_stat] # reverse overlaps: 0
Writing reads to disk...
Reads has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
Writing ma_hit_ts to disk...
ma_hit_ts has been written.
bin files have been written.
Writing raw unitig GFA to disk...
Writing processed unitig GFA to disk...
[M::purge_dups] purge duplication coverage threshold: 55
[M::purge_dups] purge duplication coverage threshold: 55
[M::adjust_utg_by_primary] primary contig coverage range: [35, infinity]
Writing primary contig GFA to disk...
Writing alternate contig GFA to disk...
[M::main] Version: 0.12-r304
[M::main] CMD: hifiasm/hifiasm -o assembly.asm -t 40 -f38 subreads.fq.gz
[M::main] Real time: 1384.462 sec; CPU: 26092.516 sec; Peak RSS: 41.007 GB
Dear hifiasm team
I'm wondering how much memory the package needs for a read dataset of 30x coverage from a single human chromosome?
Hello there,
Thanks for developing such a wonderful tool. There is an idea that I want to test.
I would like to know whether the resulting haplotype-specific assembly from hifiasm using '-3 -4' is better than a diploid hifiasm assembly. For instance, for a region of interest, I already have information (say from Hi-C) that allows me to separate the HiFi reads into two haplotypes.
so I ran hifiasm using the following command:
hifiasm -o haplotype_binning_-3-4out -t 8 -3 haplotype1_read_list -4 haplotype2_read_list comb.fa
I wanted to know if this assembly is better than an assembly from default hifiasm, so I ran hifiasm with the following command:
hifiasm -o comb.hifiasm -t8 comb.fa
It produces the output files as shown in the attachment.
I wonder which files I should look at for a direct comparison, say to check N50?
I am very curious to know what hifiasm did, specifically, how the '-3 -4' mode behaves differently from the default diploid assembly.
Hope my questions are clear :-).
thanks,
zhenzhen
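For a direct comparison, the contigs to look at are the `*.p_ctg.gfa` outputs (plus `*.a_ctg.gfa` for the default run); after converting each to FASTA with `gfatools gfa2fa`, N50 can be computed with a small helper like this sketch:

```shell
# N50 of a (possibly multi-line) FASTA: sort contig lengths descending
# and report the length at which the cumulative sum reaches half the total.
n50() {
  awk '/^>/ { if (l) print l; l = 0; next }
       { l += length($0) }
       END { if (l) print l }' "$1" |
  sort -rn |
  awk '{ a[NR] = $1; s += $1 }
       END { h = s / 2; c = 0
             for (i = 1; i <= NR; i++) { c += a[i]; if (c >= h) { print a[i]; exit } } }'
}
# usage: n50 p.contigs.fasta
```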
Hi, chhylp123
For species with a high heterozygosity rate (e.g., butterflies), the different haplotypes can be fully separated. It is important to remove small bubbles from the haplotype-resolved unitig graph, because some small bubbles are caused by somatic mutations or noise in the data rather than real haplotype information. In this case, the haplotype-resolved processed unitig graph without small bubbles should be better.
The haplotype-resolved unitig FASTA (490M) seems to be exactly twice my GenomeScope-estimated haploid size (260M, 1.8% het), but the result is just unitigs. How can I get the paternal and maternal haplotypes? Or does hifiasm only output the haplotype-resolved unitigs and need other information (such as Hi-C or trio data) to separate the two haplotypes of a diploid?
Zhigui Bao
Hi,
How can I rerun Hifiasm with different parameters based on the generated bin files?
Thanks,
André
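For anyone finding this later: hifiasm reloads the `*.bin` files automatically when rerun with the same output prefix, so error correction and overlap computation are skipped and only the later-stage parameters take effect. A sketch (prefix, thread count, and filenames are placeholders; guarded so it is a no-op where hifiasm is not installed):

```shell
# First run:  hifiasm -o asm -t 32 reads.fq.gz        (writes asm.*.bin)
# Re-run with a changed parameter (-l0 here as an example) but the same
# -o prefix; the existing bin files are reused and correction is skipped.
if command -v hifiasm >/dev/null 2>&1; then
  hifiasm -o asm -t 32 -l0 reads.fq.gz
fi
```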
Hi,
Do you have recommended settings for the PacBio program ccs (https://github.com/PacificBiosciences/ccs) for converting raw subreads to HiFi reads for downstream use with hifiasm?
Best,
Jean Elbers
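Not speaking for the hifiasm authors, but the ccs defaults are generally what HiFi assemblers expect; the commonly adjusted knobs are `--min-passes` and `--min-rq` (both real pbccs options; the values below are only illustrative). A sketch with placeholder filenames, guarded so it is a no-op without ccs installed:

```shell
# Generate HiFi reads from subreads with explicit pass/quality cutoffs.
if command -v ccs >/dev/null 2>&1; then
  ccs subreads.bam hifi.bam --min-passes 3 --min-rq 0.99 --num-threads 32
fi
```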
Hi chhylp123,
I have sequenced a diploid genome (repeat content >70%) with 25X-coverage HiFi reads. Luckily, I got wonderful contigs with an N50 of 44 Mb using hifiasm 0.12.
Then I anchored the contigs to chromosomes with ALLMAPS using ~1500 high-quality genetic markers and obtained 10 pseudomolecules. However, I found a 20-Mb region in chr7 that is supported neither by genetic markers nor by synteny with other homologous species.
Furthermore, I mapped the CCS reads back to the final assembly and found that the 20-Mb region is covered at half depth.
Meanwhile, I also mapped RNA-seq reads to the genome, and no reads covered this region. So I think this 20-Mb region may be misassembled.
However, the 20-Mb region lies within a single contig (108 Mb) constructed from several unitigs (the two terminal unitigs, utg000064l and utg000017l, are 29 Mb and 47 Mb long, respectively), and there is no obvious evidence supporting breaking this contig.
Therefore, I am wondering whether there are other possibilities for this assembly. Have you ever seen assembled regions covered by half-depth reads before? Could the 20-Mb region be highly heterozygous?
Thanks!
Dong An
Hello,
I wonder if there is a way to output the corrected HiFi reads in .fq format. Right now they are in binary format, right?
thanks,
Zhenzhen
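Recent hifiasm versions provide `--write-ec`, which dumps the error-corrected reads as FASTA rather than FASTQ (base qualities are not tracked through correction); check `hifiasm -h` in your version for availability and the exact output filename. A sketch with placeholder filenames, guarded so it is a no-op without hifiasm installed:

```shell
# Dump corrected reads as FASTA alongside the other outputs.
if command -v hifiasm >/dev/null 2>&1; then
  hifiasm -o asm -t 32 --write-ec reads.fq.gz
fi
```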
Hi,
I am trying to assemble a heterotetraploid genome. I added -l0 to the assembly parameters and combined the information from the a_ctg and p_ctg files as the result. If I want to distinguish between haplotypes, do I need to remove redundancy from the genome?
Thanks,
Yulin Bai
I'm in the process of contributing to a paper that used hifiasm. Do you have a preprint or something with a DOI that I can use to cite the program?
Hi,
Very nice tool, as it is easy to run and quick, which can be a major headache with some assemblers. I posted below and just realised I had put in both SMRT cells as fasta and fastq, so I duplicated the data. I'll try again and hopefully this helps.
Results look significantly better than HiCanu's. I have 10,000 contigs. My problem is that the longest contig is 3.5 Mbp, and the BUSCO result has high duplication, which corresponds to the genome size being 80% too large. I think I only have 15x coverage; I have requested more data and might be able to push it to 20x, but will have to wait. Is there a parameter that I could tweak to get this duplication down, rather than using an additional tool for that? It is early days and I need to play around and read more, but I am asking in case I am missing anything obvious.
C:99.9%[S:19.7%,D:80.2%],F:0.1%,M:-0.0%,n:1658
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 24233 sequences, total length 2033386837 bp (including 439 repeats of total length 7159438 bp).
-- bubbles: 6428 sequences, total length 192094085 bp.
-- unassembled: 94184 sequences, total length 1000046846 bp.
--
-- Contig sizes based on genome size 1.3gbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 968422 104 130308813
-- 20 660064 270 260179731
-- 30 498917 500 390143923
-- 40 405200 791 520051788
-- 50 328828 1148 650274462
-- 60 275039 1581 780008382
-- 70 228439 2100 910187095
-- 80 188891 2723 1040077841
-- 90 154826 3482 1170033072
-- 100 126343 4413 1300104055
-- 110 100700 5565 1430064823
-- 120 75672 7052 1560023035
-- 130 53416 9090 1690051034
-- 140 31413 12252 1820020140
-- 150 17660 17857 1950017255
--
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 24233 sequences, total length 3030398075 bp (including 439 repeats of total length 10703713 bp).
-- bubbles: 6428 sequences, total length 286330754 bp.
-- unassembled: 94184 sequences, total length 1482274716 bp.
--
-- Contig sizes based on genome size 1.3gbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 1636484 63 131210376
-- 20 1228508 155 260856458
-- 30 973050 274 390584764
-- 40 800740 422 520250313
-- 50 677931 599 650558423
-- 60 598845 802 780360444
-- 70 520751 1034 910162938
-- 80 454585 1302 1040427748
-- 90 406316 1604 1170167952
-- 100 359785 1944 1300091512
-- 110 317659 2330 1430305495
-- 120 277794 2766 1560173378
-- 130 244161 3264 1690067561
-- 140 213205 3835 1820172826
-- 150 185735 4490 1950080750
-- 160 158965 5246 2080102254
-- 170 133649 6135 2210046641
-- 180 109905 7205 2340006684
-- 190 87483 8527 2470045576
-- 200 65031 10236 2600029303
-- 210 44371 12664 2730012789
-- 220 30249 16250 2860018091
-- 230 18978 21591 2990006743
--
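On the duplication question above: the knobs hifiasm exposes for haplotypic duplication are the purge level `-l` (0 disables purging; higher values are more aggressive) and the similarity threshold `-s` (default 0.75; lowering it also purges less-similar haplotigs). With only ~15x coverage the automatically detected coverage cutoff may also be off, so any change should be checked with BUSCO again. A sketch with placeholder filenames and illustrative values, guarded so it is a no-op without hifiasm installed:

```shell
# More aggressive purging of duplicated haplotigs (values illustrative).
if command -v hifiasm >/dev/null 2>&1; then
  hifiasm -o asm -t 32 -l2 -s 0.55 reads.fq.gz
fi
```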
Any idea what this error could be about? Or if it would impact the output fasta?
Thanks,
KF
gfatools gfa2fa asm.p_ctg.gfa > p.contigs.fasta
[W::gfa_fix_utg] one or multiple A-lines on 'utg000549l' are incorrect
[W::gfa_fix_utg] one or multiple A-lines on 'utg001025l' are incorrect
[W::gfa_fix_utg] one or multiple A-lines on 'utg001466l' are incorrect
[W::gfa_fix_utg] one or multiple A-lines on 'utg003357l' are incorrect
Hi there, trying out my first assemblies with hifiasm on an animal whose genome is around 170 Mb and over 4% heterozygous. We know this from >100x-coverage k-mer spectra from Illumina WGS reads.
The k-mer spectrum from the PacBio HiFi data tells the same story. I made a k=51 spectrum since I saw that is what hifiasm uses by default. The data are around 44x coverage, and you can see the peak from the het k-mers around 20 and the homozygous k-mers around 40.
The k-mer spectrum from hifiasm matches the above spectrum:
[M::ha_ft_gen::499.056*[email protected]] ==> filtered out 3811358 k-mers occurring 95 or more times
[M::ha_opt_update_cov] updated max_n_chain to 100
[M::ha_pt_gen::698.676*5.57] ==> counted 18097002 distinct minimizer k-mers
[M::ha_pt_gen] count[4095] = 0 (for sanity check)
[M::ha_analyze_count] lowest: count[6] = 152684
[M::ha_analyze_count] highest: count[19] = 451630
[M::ha_hist_line] 1: ****************************************************************************************************> 7830671
[M::ha_hist_line] 2: ****************************************************************************************************> 770193
[M::ha_hist_line] 3: ******************************************************************** 309013
[M::ha_hist_line] 4: ****************************************** 191939
[M::ha_hist_line] 5: ********************************** 154466
[M::ha_hist_line] 6: ********************************** 152684
[M::ha_hist_line] 7: *********************************** 156256
[M::ha_hist_line] 8: ************************************** 170569
[M::ha_hist_line] 9: ****************************************** 191057
[M::ha_hist_line] 10: ************************************************ 215175
[M::ha_hist_line] 11: ******************************************************* 247482
[M::ha_hist_line] 12: *************************************************************** 283626
[M::ha_hist_line] 13: *********************************************************************** 319430
[M::ha_hist_line] 14: ****************************************************************************** 354206
[M::ha_hist_line] 15: ************************************************************************************** 388878
[M::ha_hist_line] 16: ******************************************************************************************** 417222
[M::ha_hist_line] 17: ************************************************************************************************** 441975
[M::ha_hist_line] 18: **************************************************************************************************** 450105
[M::ha_hist_line] 19: **************************************************************************************************** 451630
[M::ha_hist_line] 20: *************************************************************************************************** 445987
[M::ha_hist_line] 21: ********************************************************************************************** 426016
[M::ha_hist_line] 22: **************************************************************************************** 397584
[M::ha_hist_line] 23: ******************************************************************************** 360297
[M::ha_hist_line] 24: ********************************************************************** 315880
[M::ha_hist_line] 25: ************************************************************* 276191
[M::ha_hist_line] 26: **************************************************** 234698
[M::ha_hist_line] 27: ******************************************** 196798
[M::ha_hist_line] 28: ************************************ 163054
[M::ha_hist_line] 29: ***************************** 132947
[M::ha_hist_line] 30: ************************* 111858
[M::ha_hist_line] 31: ********************* 93468
[M::ha_hist_line] 32: ****************** 81804
[M::ha_hist_line] 33: **************** 73618
[M::ha_hist_line] 34: *************** 68704
[M::ha_hist_line] 35: ************** 64746
[M::ha_hist_line] 36: ************** 63374
[M::ha_hist_line] 37: ************** 62412
[M::ha_hist_line] 38: ************** 62932
[M::ha_hist_line] 39: ************** 62554
[M::ha_hist_line] 40: ************** 61114
[M::ha_hist_line] 41: ************* 58898
[M::ha_hist_line] 42: ************* 57009
[M::ha_hist_line] 43: ************ 54254
[M::ha_hist_line] 44: *********** 51216
[M::ha_hist_line] 45: *********** 48483
[M::ha_hist_line] 46: ********** 45364
[M::ha_hist_line] 47: ********* 41100
[M::ha_hist_line] 48: ******** 37727
[M::ha_hist_line] 49: ******** 34807
[M::ha_hist_line] 50: ******* 31219
[M::ha_hist_line] 51: ****** 27505
[M::ha_hist_line] 52: ***** 24763
[M::ha_hist_line] 53: ***** 22128
[M::ha_hist_line] 54: **** 20128
[M::ha_hist_line] 55: **** 17958
[M::ha_hist_line] 56: **** 15938
[M::ha_hist_line] 57: *** 14744
[M::ha_hist_line] 58: *** 13672
[M::ha_hist_line] 59: *** 12817
[M::ha_hist_line] 60: *** 12142
[M::ha_hist_line] 61: *** 11357
[M::ha_hist_line] 62: ** 10574
[M::ha_hist_line] 63: ** 10200
[M::ha_hist_line] 64: ** 9739
[M::ha_hist_line] 65: ** 9206
[M::ha_hist_line] 66: ** 8849
However, hifiasm gets the position of the homozygous peak wrong, calling it at 19 when it should be around 40. The result of running this assembly with --high-het was that the primary assembly came out twice as big as it should have been (400 Mb instead of 200 Mb), so both haplotypes are ending up in the same primary assembly.
Main genome scaffold total: 753
Main genome contig total: 753
Main genome scaffold sequence total: 422.8 MB
Main genome contig sequence total: 422.8 MB (-> 0.0% gap)
Main genome scaffold N/L50: 95/1.4 MB
Main genome contig N/L50: 95/1.4 MB
Number of scaffolds > 50 KB: 558
% main genome in scaffolds > 50 KB: 98.5%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- --------- ------- ----------- ----------- --------
All 753 753 422,765,292 422,765,292 100.00%
1 kb 753 753 422,765,292 422,765,292 100.00%
2.5 kb 753 753 422,765,292 422,765,292 100.00%
5 kb 753 753 422,765,292 422,765,292 100.00%
10 kb 753 753 422,765,292 422,765,292 100.00%
25 kb 727 727 422,203,962 422,203,962 100.00%
50 kb 558 558 416,220,132 416,220,132 100.00%
100 kb 458 458 409,207,212 409,207,212 100.00%
250 kb 361 361 393,262,727 393,262,727 100.00%
500 kb 274 274 361,301,464 361,301,464 100.00%
1 mb 140 140 263,063,929 263,063,929 100.00%
2.5 mb 25 25 83,986,696 83,986,696 100.00%
5 mb 2 2 10,227,160 10,227,160 100.00%
The secondary assembly is too small:
Main genome scaffold total: 1011
Main genome contig total: 1011
Main genome scaffold sequence total: 25.6 MB
Main genome contig sequence total: 25.6 MB (-> 0.0% gap)
Main genome scaffold N/L50: 270/26.2 KB
Main genome contig N/L50: 270/26.2 KB
Number of scaffolds > 50 KB: 49
% main genome in scaffolds > 50 KB: 21.7%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- --------- ------- ----------- ----------- --------
All 1,011 1,011 25,632,451 25,632,451 100.00%
1 kb 1,011 1,011 25,632,451 25,632,451 100.00%
2.5 kb 1,011 1,011 25,632,451 25,632,451 100.00%
5 kb 1,011 1,011 25,632,451 25,632,451 100.00%
10 kb 1,008 1,008 25,607,535 25,607,535 100.00%
25 kb 298 298 13,556,140 13,556,140 100.00%
50 kb 49 49 5,556,899 5,556,899 100.00%
100 kb 16 16 3,360,515 3,360,515 100.00%
250 kb 4 4 1,596,952 1,596,952 100.00%
500 kb 1 1 798,418 798,418 100.00%
1 mb 0 0 0 0 0.00%
Is there anything I can do to specify where hifiasm should expect the heterozygous and homozygous peaks? I think this would be helpful to others working with highly heterozygous species. Thanks so much!
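For what it's worth, more recent hifiasm releases expose a --hom-cov option that pins the homozygous coverage peak directly instead of relying on automatic detection. Assuming a build that supports it, a re-run might look like this (output prefix, thread count, and read file are placeholders):

```shell
# Hypothetical re-run forcing the homozygous peak to ~40x.
# Check `hifiasm --help` to confirm --hom-cov is available in your version.
hifiasm -o asm -t 32 --hom-cov 40 reads.hifi.fq.gz 2> asm.log
```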
Hi,
Recently, I have used hifiasm to generate haplotype-resolved assemblies with trio binning. I would like to know which read names were used to assemble each of the two haplotype-resolved genomes, so that I can extract those two read sets from the raw PacBio HiFi reads and feed them to HiCanu to compare the results.
Can you give me any suggestions or point me to a parameter?
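Hifiasm itself does not write out per-haplotype read lists, but yak (the same toolkit used to build the parental k-mer tables) can classify reads against them. A sketch, with placeholder file names, assuming the parental `.yak` files were built with `yak count`; inspect the output format of your yak version before filtering on specific columns:

```shell
# Hypothetical: classify each HiFi read against the parental hash tables.
# binned.txt has one line per read with its parental assignment.
yak triobin -t16 pat.yak mat.yak hifi.reads.fq.gz > binned.txt
head binned.txt   # check the column layout before extracting read names
```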
Pardon the ignorance, but how did you generate HiFi reads from a butterfly? We were told by our PacBio provider that we needed 15 ug of DNA (and we are working with a dipteran fly, so that was not possible from a single individual, and thus we could only use the low-input DNA protocol with continuous long reads). We were also told that we couldn't use the PacBio Sequel I and could only generate HiFi reads on the PacBio Sequel II, but that doesn't seem to be the case according to PacBio's website.
Edit: it does seem possible to use the low-DNA-input workflow for HiFi read generation:
https://youtu.be/PJXDhusSexY
Hello, thank you for your software. I ran into a problem: my FASTA input is 200G and I used version 0.5, but no log file is generated and the program is automatically killed. What could be the reason for this? Is it a lack of memory?
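When a process dies with no error message, the kernel's OOM killer is the usual suspect, and it leaves a trace in the kernel log. A minimal check; the heredoc is a mocked dmesg excerpt (hypothetical process ID and sizes), and on the real machine you would query `dmesg -T` in its place:

```shell
# Count OOM-kill messages. On a real system replace the heredoc with:
#   dmesg -T | grep -ci 'out of memory'
hits=$(grep -ci 'out of memory' <<'EOF'
[Mon Jan  1 00:00:00 2024] Out of memory: Killed process 12345 (hifiasm) total-vm:210000000kB
EOF
)
echo "OOM kill messages found: $hits"
```

A non-zero count confirms the run needs more RAM (or fewer threads, since peak memory scales with thread count during k-mer counting).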
Curious if there is a recommendation on what level of heterozygosity warrants the --high-het option?
Hi,
I used hifiasm for trio-binning assembly and it works very well, giving reference-size haplotypes.
However, one contig shows a higher number of switch errors, as exemplified in the blob plot below.
And I was able to identify the contig with yak trioeval.
The output from yak is "S h1tg000081l 13640 5824 13621 19 18 5805"
What is the best way to correct this contig, and how should I do it?
Thanks,
Zhen