lh3 / miniasm Goto Github PK

Ultrafast de novo assembly for long noisy reads (though having no consensus step)

License: MIT License

Makefile 0.48% C 38.15% Objective-C 1.67% JavaScript 7.46% Shell 2.93% TeX 45.25% Perl 1.59% Roff 2.47%

miniasm's Introduction

Getting Started

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap2 && (cd minimap2 && make)
git clone https://github.com/lh3/miniasm  && (cd miniasm  && make)
# Overlap for PacBio reads (or use "-x ava-ont" for nanopore read overlapping)
minimap2/minimap2 -x ava-pb -t8 pb-reads.fq pb-reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Introduction

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically by minimap) as input and outputs an assembly graph in the GFA format. Different from mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate is similar to the raw input reads.

So far miniasm is in early development stage. It has only been tested on a dozen of PacBio and Oxford Nanopore (ONT) bacterial data sets. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default setting, miniasm assembles 9 out of 12 PacBio datasets and 3 out of 4 ONT datasets into a single contig. The 12 PacBio data sets are PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set. ONT data are acquired from the Loman Lab.

For a C. elegans PacBio data set (only 40X are used, not the whole dataset), miniasm finishes the assembly, including reads overlapping, in ~10 minutes with 16 CPUs. The total assembly size is 105Mb; the N50 is 1.94Mb. In comparison, the HGAP3 produces a 104Mb assembly with N50 1.61Mb. This dotter plot gives a global view of the miniasm assembly (on the X axis) and the HGAP3 assembly (on Y). They are broadly comparable. Of course, the HGAP3 consensus sequences are much more accurate. In addition, on the whole data set (assembled in ~30 min), the miniasm N50 is reduced to 1.79Mb. Miniasm still needs improvements.

Miniasm confirms that at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio or ONT reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as the more sophisticated overlapers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful to produce high-quality assemblies.

Algorithm Overview

Crude read selection. For each read, find the longest contiguous region covered by three good mappings. Get an approximate estimate of read coverage.
Fine read selection. Use the coverage information to find the good regions again but with more stringent thresholds. Discard contained reads.
Generate a string graph. Prune tips, drop weak overlaps and collapse short bubbles. These procedures are similar to those implemented in short-read assemblers.
Merge unambiguous overlaps to produce unitig sequences.

Limitations

Consensus base quality is similar to input reads (may be fixed with a consensus tool).
Only tested on a dozen of high-coverage PacBio/ONT data sets (more testing needed).
Prone to collapse repeats or segmental duplications longer than input reads (hard to fix without error correction).

miniasm's People

Contributors

Stargazers

Watchers

miniasm's Issues

Reassembling polished unitigs with the raw reads

I had the idea of polishing the unitigs using Racon, then reassembling these polished unitigs with the original raw reads to possibly hopefully reads that might join some of the dead-end unitigs. It didn't quite have the intended result. The three circular unitigs in the assembly were assembled, and all of the linear unitigs disappeared. I'm just experimenting here. Feel free to close this issue.

assembling canu corrected reads

Hi,
So I am trying to to assemble ONT reads that have been corrected with canu. But I am getting a strange error:
Laurens-MacBook-Pro:Ecoli_644 laurencowley$ ../git_repo/miniasm/miniasm -f source_canu_lowcov/Ecoli644.correctedReads.fasta.gz Ecoli644_correctedminimap.paf.gz > Ecoli644_miniasm.gfa
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::1.850_1.00] read 1687161 hits; stored 1257327 hits and 14342 sequences (124608869 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::2.042_1.00] 14340 query sequences remain after sub
[M::ma_hit_cut::2.068_1.00] 1257303 hits remain after cut
[M::ma_hit_flt::2.096_0.99] 1112831 hits remain after filtering; crude coverage after filtering: 54.73
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::2.153_0.99] 14335 query sequences remain after sub
[M::ma_hit_cut::2.179_0.99] 1112203 hits remain after cut
[M::ma_hit_chimeric::2.207_0.99] identified 143 chimeric reads
[M::ma_hit_contained::2.236_0.99] 749 sequences and 10538 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 10294 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 3498 arcs
[M::asg_arc_del_multi] removed 1652 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 8 tips
[M::asg_pop_bubble] popped 11 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 345 asymmetric arcs
[M::asg_arc_del_short] removed 1725 short overlaps
[M::asg_cut_tip] cut 8 tips
[M::asg_pop_bubble] popped 127 bubbles and trimmed 0 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 29 asymmetric arcs
[M::asg_arc_del_short] removed 47 short overlaps
[M::asg_cut_tip] cut 1 tips
[M::asg_pop_bubble] popped 15 bubbles and trimmed 0 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 9 asymmetric arcs
[M::asg_arc_del_short] removed 11 short overlaps
[M::asg_cut_tip] cut 1 tips
[M::asg_pop_bubble] popped 4 bubbles and trimmed 0 tips
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 1 asymmetric arcs
[M::asg_arc_del_short] removed 1 short overlaps
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 5: generating unitigs <===
Assertion failed: (sub[id].e - sub[id].s <= ks->seq.l), function ma_ug_seq, file asm.c, line 267.
Abort trap: 6
Do you know why this might be?

miniasm for large genome with daligner overlaps

Hi,
I just ran a falcon assembly on a large plant genome (1.4 Gb) using 50 smrtcells.
I kept the .las files and was wondering if i could use them with your new las2paf.pl script.
Or would you rather compute the overlaps from scratch?

Is a similar thing possible from Celera/PBcR assemblies? So far i couldnt find overlap information stored for the error-correction.

I was also wondering what resources miniasm needs in terms of memory and time to finish on a large dataset. Or would you advise miniasm not to use on such data yet?
Thanks,
Michel

Assembling mixtures of PacBio and Nanopore reads

Dear Heng,
I am trying to assemble a mixture of PacBio and Nanopore reads using minimap2 and miniasm, what would be the best switch for the overlap step? I tried minimap2 -t 32 -X -Hk15 -w5 -Xp0 -m100 -g10000 --max-chain-skip 25 but it seems to take forever on my 32 Gbp long-read dataset... Would you recommend another combination of switches?
So far my .paf file is 334 Gbp and growing. Is it possible to find out what percentage of the overlap search is already done (so that I can know whether it is near completion or will still be running for weeks)? When I look at the .paf output I cannot find an obvious logic in the order of the reads (perhaps because of the multithreading?).
Thanks a lot in advance, and best regards,
Jean-François

No output of Minisam?

Hi lh3,
I have minimap output, but no minisam output. The error may be because the first step of minisam: read 0 hits; stored 0 hits and 0 sequences.
Can you help to figure out the problem? Thank you!

[hujie@YanfaCenter01 miniasm]$ minimap2 --cs=long -w5 -L100 -m0 -t8 /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/.fastq | gzip -1 > micro-bc1.paf.gz
[M::mm_idx_gen::1.2161.00] collected minimizers
[M::mm_idx_gen::1.4771.65] sorted minimizers
[M::main::1.4771.65] loaded/built the index for 4000 target sequence(s)
[M::mm_mapopt_update::1.5721.61] mid_occ = 24
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 4000
[M::mm_idx_stat::1.6391.58] distinct minimizers: 3219472 (79.81% are singletons); average occurrences: 1.354; average spacing: 2.949
[M::worker_pipeline::8.4365.45] mapped 4000 sequences
[M::worker_pipeline::16.5626.01] mapped 4000 sequences
[M::worker_pipeline::24.8736.23] mapped 4000 sequences
[M::worker_pipeline::29.1506.21] mapped 2033 sequences
[M::worker_pipeline::31.3966.20] mapped 4000 sequences
[M::worker_pipeline::37.8926.28] mapped 4000 sequences
[M::worker_pipeline::45.9146.35] mapped 4000 sequences
[M::worker_pipeline::53.7966.37] mapped 4000 sequences
[M::worker_pipeline::57.9586.35] mapped 2033 sequences
[M::main] Version: 2.10-r770-dirty
[M::main] CMD: minimap2 --cs=long -w5 -L100 -m0 -t8 /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_0.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_1.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_2.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_3.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_4.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_0.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_1.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_2.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_3.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_4.fastq
[M::main] Real time: 57.998 sec; CPU: 368.039 sec

[hujie@YanfaCenter01 miniasm]$ miniasm -f /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/.fastq micro-bc1.paf.gz > micro-bc1.gfa
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::0.1761.00] read 0 hits; stored 0 hits and 0 sequences (0 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::0.1761.00] 0 query sequences remain after sub
[M::ma_hit_cut::0.1761.00] 0 hits remain after cut
[M::ma_hit_flt::0.1761.00] 0 hits remain after filtering; crude coverage after filtering: -nan
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::0.1761.00] 0 query sequences remain after sub
[M::ma_hit_cut::0.1761.00] 0 hits remain after cut
[M::ma_hit_contained::0.1761.00] 0 sequences and 0 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 0 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 0 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: 0.2-r168-dirty
[M::main] CMD: miniasm -f /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_0.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_1.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_2.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_3.fastq /home/hujie/nanopore/albacore/micro-S9/workspace/pass/barcode01/fastq_runid_9d4de8c78879334fb7318ec861eb6d27b74228c6_4.fastq micro-bc1.paf.gz
[M::main] Real time: 0.240 sec; CPU: 0.241 sec

Add read count (RC) attribute to GFA [feature request]

Bandage will draw the thickness of the unitig segments according to their depth when the segments include the read count (RC:i) attribute.
See https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md#optional-fields-2

Miniasm not making it past Step 1

Hello, I'm trying to assemble some reads with minimap and miniasm. Minimap seems to have worked, but my process is failing on miniasm:

[M::mm_idx_gen::0.010*2.97] collected minimizers
[M::mm_idx_gen::0.022*2.77] sorted minimizers
[M::main::0.022*2.77] loaded/built the index for 25 target sequence(s)
[M::main] max occurrences of a minimizer to consider: 7
[M::main] Version: 0.2-r124-dirty
[M::main] CMD: minimap -Sw5 -L100 -m0 Gap_Support/Contig17_1_1011970_pilon|arrow.gap.3/reads.fa Gap_Support/Contig17_1_1011970_pilon|arrow.gap.3/reads.fa
[M::main] Real time: 0.035 sec; CPU: 0.082 sec
[M::main] ===> Step 1: reading read mappings <===
awk: '/^S/{print(">"$2"
awk: ^ invalid char ''' in expression

Any pointers on how to get around this would be appreciated. Could it have something to do with the headers in my fasta?

Document GFA `a` record

Hi, Heng. If you think that the a record that you've used in the GFA output of miniasm would be useful to others, would you like to send a pull request documenting it to https://github.com/pmelsted/GFA-spec, or document it here and I'll add it to the GFA spec? Thanks, Shaun

output string graph in GFA format

Would it be straightforward to output the string graph obtained in step 3 of the algorithm in GFA format?

Short read lengths - genome assembly

Hello,

I came across your software mentioned in a pre-print on assembly a plant genome. I was wondering if miniasm could be used to get a rough assemble of large (1 Gb+) genome with short reads.

I was able to run minimap:
./minimap -Sw5 -L100 -m0 -t8 orc13-A2_R1.fastq.gz orc13-A2_R2.fastq.gz | gzip -1 > reads.paf.gz
to generate the paf.gz.

But I am not sure on what might be good starting values for the parameters in miniasm. I've tried a few different permutations and always end up with an out file that is 0 kb.

Thanks for any input or help.
ara

layout step produces empty gfa

Hi guys.

I am using minimap/miniasm to get the raw contigs by rapid assembly as preliminary step in order to perform error correction using @isovic 's racon.

My input file is: r7_2d.fastq, a file containing 2D MinION reads (RNA, not DNA sequencing).

Following your instructions, I installed latest minimap (Version: 0.2-r124-dirty) & miniasm (Version: 0.2-r137-dirty). Overlap step was completed succesfully (and got a non-empty paf.gz file), but output for layout (gfa file) is empty.

Here is my log:

# overlap: successful.
./minimap -Sw5 -L100 -m0 -t8 r7_2d.fastq r7_2d.fastq | gzip -1 > r7_2d.paf.gz
[M::mm_idx_gen::8.484*1.20] collected minimizers
[M::mm_idx_gen::9.471*1.72] sorted minimizers
[M::main::9.471*1.72] loaded/built the index for 113249 target sequence(s)
[M::main] max occurrences of a minimizer to consider: 37
[M::main] Version: 0.2-r124-dirty
[M::main] CMD: /home/aechchik/wgets/minimap/minimap -Sw5 -L100 -m0 -t8 r7_2d.fastq r7_2d.fastq
[M::main] Real time: 17.636 sec; CPU: 60.304 sec

# layout: empty gfa file
./miniasm -f r7_2d.fastq r7_2d.paf.gz > r7_2d.gfa
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::1.021*0.99] read 664850 hits; stored 219221 hits and 16136 sequences (50720456 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::1.049*0.99] 11922 query sequences remain after sub
[M::ma_hit_cut::1.055*0.99] 206693 hits remain after cut
[M::ma_hit_flt::1.059*0.99] 206477 hits remain after filtering; crude coverage after filtering: 10.55
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::1.069*0.99] 11026 query sequences remain after sub
[M::ma_hit_cut::1.074*0.99] 202966 hits remain after cut
[M::ma_hit_chimeric::1.078*0.99] identified 1 chimeric reads
[M::ma_hit_contained::1.083*0.99] 756 sequences and 556 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 541 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 97 arcs
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 2 asymmetric arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 547 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: 0.2-r137-dirty
[M::main] CMD: /home/aechchik/wgets/miniasm/miniasm -f r7_2d.fastq r7_2d.paf.gz
[M::main] Real time: 1.920 sec; CPU: 1.745 sec

Just to let you know, I tried the very same commands with the toy example on [PB data](wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz) (in your readme) and the gfa file is not empty.

Do you happen to know what is the source of this unexpected gfa file?

Thanks in advance for your help,
Amina

Do you keep on tagging releases

Hi,
Debian has packaged the released version 0.2 of miniasm since we (the Debian maintainers) trust what software authors are marking as "release with distribution quality". Recently I've found some software (Unicycler) which ships with a code copy of the latest miniasm release. This made me wonder whether you consider every commit as some releasable state that can be distributed to the users. It would be nice if you could clarify this situation by either keep on marking releases with tags or explain somewhere in the docs that users are supposed to use the latest Git commit. I personally would prefer the first option (release tags) since it is more explicit.
Kind regards, Andreas.

Segmentation fault

Hi Liheng,
When I use miniasm to assembly an 100x Pacbio genome, there was an error as follow:
Segmentation fault
What is the matter? I followed the mannual, and the logs were:

 [M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::40487.179*0.93] read 11945248497 hits; stored 12916879108 hits and 10398788 sequences (101800136514 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::45821.251*0.94] 9975919 query sequences remain after sub
[M::ma_hit_cut::48542.296*0.94] 11845662951 hits remain after cut
[M::ma_hit_flt::50820.915*0.94] 7422102329 hits remain after filtering; crude coverage after filtering: 476.06
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::51346.118*0.94] 9907831 query sequences remain after sub
[M::ma_hit_cut::52976.403*0.95] 7118498960 hits remain after cut
[M::ma_hit_chimeric::53791.988*0.95] identified 474580 chimeric reads

Can you give me some advice to solve this problems? Thank you very much!

Minimap or Minimap2

The readme is talking about minimap but minimap2 has been released. Is it possible to just replace minimap by minimap2 or does it require more tweaks?

# Download sample PacBio from the PBcR website
wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq
# Install minimap and miniasm (requiring gcc and zlib)
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlap
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz
# Layout
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

miniasm output with 0 with test data

Hello,

I have tried miniasm with the test dataset and there seems to be something wrong with miniasm in my setting.
I have followed the "getting started" scripts as to test whether the tool works properly

Download sample PacBio from the PBcR website

wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -
ln -s selfSampleData/pacbio_filtered.fastq reads.fq

Install minimap and miniasm (requiring gcc and zlib)

git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)

Overlap

minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz

Layout

miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

And I ended up with the final output reads.gfa with 0 (size).
Also 0 sized file of reads.paf.gz were obtaied, and I do not understand why it's getting output like this.
I have also tried with other datasets and still getting very small or almost non gfa files.

I am asking for your generous help. Thank you

No sequences in gfa - unable to convert to fasta

Hi, how do you convert .gfa to .fasta? I've found a few posts recommending awk one-liner, but there are no sequences in .gfa.

awk '/^S/{print ">"$2"\n"$3}' run.gfa | head
>utg000001l
*
>utg000002l
*
>utg000003l
*
>utg000004l
*
>utg000005l
*

bad gateway for test data

Hi @lh3,

I get this error when trying to download the test data:

➜  minis  wget -O reads.fa http://nanopore.climb-radosgw01.bham.ac.uk/MAP006-1_2D_pass.fasta
--2015-11-27 15:23:33--  http://nanopore.climb-radosgw01.bham.ac.uk/MAP006-1_2D_pass.fasta
Resolving nanopore.climb-radosgw01.bham.ac.uk (nanopore.climb-radosgw01.bham.ac.uk)... 147.188.173.15
Connecting to nanopore.climb-radosgw01.bham.ac.uk (nanopore.climb-radosgw01.bham.ac.uk)|147.188.173.15|:80... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2015-11-27 15:23:33 ERROR 502: Bad Gateway.

Looks like this isn't up: http://nanopore.climb-radosgw01.bham.ac.uk/

Is there another suitable public dataset we could use, or could we host this elsewhere?

failed at index.c

hi, I tried miniasm to deal with one cell pacbio data, it was OK. however, when i used miniasm to assemble two cells pacbio data, it failed, and the error information was listed below:
4 minimap: index.c:205: worker_pipeline: Assertion `strlen(s->seq[i].name) <= 254' failed.
is the seq[i].name mean the name of the sequence? i check the name of the sequence, and it was like that "@m170607_042556_42221_c101205872550000001823287710171764_s1_p0/12/0_2474 RQ=0.843". any advice? Thank you very much!

Xiaoting

Miniasm outputs genome with 100% Ns

Hello @lh3,

I'm testing miniasm using mhap overlaps generated with canu. I converted canu internal read IDs back to the original ones, and miniasm doesn't complain about any potential issues. It actually writes the unitigs as expected (different lengths,etc), but they are all filled with Ns.

Ps: I didn't have any problem with the first trial using minimap/miniasm approach.

Any help?
Thanks,
Pedro

P6C4 devnet: Assertion error

E coli data from https://github.com/PacificBiosciences/DevNet/wiki/E.-coli-Bacterial-Assembly caused minasm to crash for me.

wget https://s3.amazonaws.com/files.pacb.com/datasets/secondary-analysis/e-coli-k12-P6C4/p6c4_ecoli_RSII_DDR2_with_15kb_cut_E01_1.tar.gz
Uncompress
Convert to fastq using smrtanalysis 2.3.0:

ls *.bax.h5 >input.fofn
pls2fasta input.fofn -trimByRegion -minSubreadLength 50 -fastq -minReadScore 750 m141013_011508_sherri_c100709962550000001823135904221533_s1_p0.filtered_subreads.fastq

Convert to fasta using an in- house perl script

used minimap@c137b17 and miniasm@24ddd20 compiled with gcc 5.2.0

minimap/minimap -Sw5 -L100 -m0 -t8 P6C4.fasta P6C4.fasta | gzip -1 >P6C4.paf.gz
miniasm/miniasm -f P6C4.fasta P6C4.paf.gz >P6C4.gfa

miniasm output:

[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::34.899*0.98] read 15801032 hits; stored 11922618 hits and 66110 sequences (690111467 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::37.896*0.98] 64162 query sequences remain after sub
[M::ma_hit_cut::38.285*0.98] 11520836 hits remain after cut
[M::ma_hit_flt::38.734*0.98] 10989741 hits remain after filtering; crude coverage after filtering: 116.71
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::39.762*0.98] 61410 query sequences remain after sub
[M::ma_hit_cut::40.115*0.98] 10727924 hits remain after cut
[M::ma_hit_contained::40.496*0.98] 1581 sequences and 22558 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 21900 arcs
[M::asg_arc_del_trans] transitively reduced 17740 arcs
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 630 asymmetric arcs
[M::asg_pop_bubble] popped 107 bubbles and trimmed 13 tips
[M::asg_cut_short_utg] dropped [0,0,0] short unitigs
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 22 asymmetric arcs
[M::asg_arc_del_short] removed 52 short overlaps
[M::asg_pop_bubble] popped 17 bubbles and trimmed 7 tips
[M::asg_cut_short_utg] dropped [6,0,0] short unitigs
[M::asg_cut_short_utg] dropped [1,0,0] short unitigs
[M::asg_cut_short_utg] dropped [0,0,0] short unitigs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 5: generating unitig graph <===
miniasm: asm.c:265: ma_ug_seq: Assertion `sub[id].e - sub[id].s < ks->seq.l' failed.

GFA file input to Bandage

Is the GFA file produced by miniasm compatible with Bandage? Thanks!

Assertion error

Read files from:
http://www.ncbi.nlm.nih.gov/sra/SRX1433261%5Baccn%5D

Converted to fasta using seqtk seq -A.

$ cat SRR2917853_11.fasta SRR2917853_13.fasta SRR2917853_15.fasta SRR2917853_17.fasta SRR2917853_19.fasta SRR2917853_1.fasta SRR2917853_21.fasta SRR2917853_23.fasta SRR2917853_25.fasta SRR2917853_27.fasta SRR2917853_29.fasta SRR2917853_2.fasta SRR2917853_31.fasta SRR2917853_3.fasta SRR2917853_4.fasta SRR2917853_5.fasta SRR2917853_7.fasta SRR2917853_9.fasta | gzip -1 >reads.fa.gz

$ minimap -Sw5 -L100 -m0 -t4 reads.fa.gz reads.fa.gz | gzip -1 > reads.paf.gz

$ miniasm -f reads.fa.gz reads.paf.gz > reads.gfa

[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::26.445*1.00] read 19770006 hits; stored 33249361 hits and 69144 sequences (1261322966 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::32.234*1.00] 61791 query sequences remain after sub
[M::ma_hit_cut::32.857*1.00] 28346032 hits remain after cut
[M::ma_hit_flt::33.466*1.00] 19148813 hits remain after filtering; crude coverage after filtering: 211.85
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::34.669*1.00] 61081 query sequences remain after sub
[M::ma_hit_cut::35.021*1.00] 18761267 hits remain after cut
[M::ma_hit_contained::35.507*1.00] 3085 sequences and 79066 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 64665 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 35554 arcs
[M::asg_arc_del_multi] removed 4 multi-arcs
[M::asg_arc_del_asymm] removed 2273 asymmetric arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 310 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 3034 asymmetric arcs
[M::asg_arc_del_short] removed 8080 short overlaps
[M::asg_cut_tip] cut 432 tips
[M::asg_pop_bubble] popped 6 bubbles and trimmed 3 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 975 asymmetric arcs
[M::asg_arc_del_short] removed 1177 short overlaps
[M::asg_cut_tip] cut 233 tips
[M::asg_pop_bubble] popped 28 bubbles and trimmed 4 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 814 asymmetric arcs
[M::asg_arc_del_short] removed 908 short overlaps
[M::asg_cut_tip] cut 272 tips
[M::asg_pop_bubble] popped 51 bubbles and trimmed 3 tips
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 215 internal sequences
[M::asg_cut_biloop] cut 384 small bi-loops
[M::asg_cut_tip] cut 71 tips
[M::asg_pop_bubble] popped 21 bubbles and trimmed 4 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 269 asymmetric arcs
[M::asg_arc_del_short] removed 317 short overlaps
[M::asg_cut_tip] cut 130 tips
[M::asg_pop_bubble] popped 43 bubbles and trimmed 14 tips
[M::main] ===> Step 5: generating unitigs <===
miniasm: asm.c:267: ma_ug_seq: Assertion `sub[id].e - sub[id].s <= ks->seq.l' failed.

Used minimap 7a2e4df and miniasm c0a8e44.

miniasm seems to not work with minimap2 -x ava-ont PAF files of latest minimap2 version?

Using a .paf file generated with the latest minimap2 version by:
minimap2 -x ava-ont ../reads.fq ../reads.fq > hcov_only_ava.paf

and trying to use miniasm on that:
miniasm -f ../reads.fq hcov_only_ava.paf > hcov_only_asm.gfa

yields an empty (0 bytes) file.

Using minimap2 -> miniasm has worked for me before (about 1 month ago).

miniasm log:
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::27.5911.00] read 22177213 hits; stored 9112072 hits and 11204 sequences (32198655 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::29.8381.00] 11042 query sequences remain after sub
[M::ma_hit_cut::30.0281.00] 9104542 hits remain after cut
[M::ma_hit_flt::30.3111.00] 8958079 hits remain after filtering; crude coverage after filtering: 521.79
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::31.1741.00] 10995 query sequences remain after sub
[M::ma_hit_cut::31.3561.00] 8955787 hits remain after cut
[M::ma_hit_contained::31.723*1.00] 23 sequences and 184 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 74 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 32 arcs
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 6 asymmetric arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 12 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 3 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: 0.2-r168-dirty
[M::main] CMD: miniasm -f ../reads.fq hcov_only_ava.paf
[M::main] Real time: 32.243 sec; CPU: 32.240 sec

Core Dump error while generating unitigs(Step 5)

Hi,

I am doing a denovo assembling a nanopore data of R7.3 chemistry. At first I converted the fast5 to fasta using poretools. After that I followed the pipeline, for the denovo assembly using minimap and miniasm.

The commands I ran was:

/opt/programs/minimap-master/minimap -Sw5 -L100 -m0 Ecoli_R73.fasta Ecoli_R73.fasta | gzip -1 > Ecoli_R73.paf.gz

/opt/programs/miniasm-master/miniasm -f Ecoli_R73.fasta Ecoli_R73.paf.gz > Ecoli_R73.gfa

but the error pops up when using miniasm and the error is 👍
[M::main] ===> Step 1: reading read mappings <=== [M::ma_hit_read::0.144*0.92] read 92809 hits; stored 127633 hits and 8123 sequences (59841178 bp) [M::main] ===> Step 2: 1-pass (crude) read selection <=== [M::ma_hit_sub::0.159*0.92] 5645 query sequences remain after sub [M::ma_hit_cut::0.164*0.92] 100054 hits remain after cut [M::ma_hit_flt::0.167*0.93] 93030 hits remain after filtering; crude coverage after filtering: 9.35 [M::main] ===> Step 3: 2-pass (fine) read selection <=== [M::ma_hit_sub::0.171*0.93] 5336 query sequences remain after sub [M::ma_hit_cut::0.175*0.93] 80419 hits remain after cut [M::ma_hit_chimeric::0.178*0.93] identified 1 chimeric reads [M::ma_hit_contained::0.181*0.93] 855 sequences and 5139 hits remain after containment removal [M::main] ===> Step 4: graph cleaning <=== [M::ma_sg_gen] read 1953 arcs [M::main] ===> Step 4.1: transitive reduction <=== [M::asg_arc_del_trans] transitively reduced 538 arcs [M::asg_arc_del_multi] removed 180 multi-arcs [M::asg_arc_del_asymm] removed 13 asymmetric arcs [M::main] ===> Step 4.2: initial tip cutting and bubble popping <=== [M::asg_cut_tip] cut 420 tips [M::asg_pop_bubble] popped 25 bubbles and trimmed 0 tips [M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <=== [M::asg_arc_del_multi] removed 0 multi-arcs [M::asg_arc_del_asymm] removed 26 asymmetric arcs [M::asg_arc_del_short] removed 28 short overlaps [M::asg_cut_tip] cut 32 tips [M::asg_pop_bubble] popped 2 bubbles and trimmed 0 tips [M::asg_arc_del_short] removed 0 short overlaps [M::asg_arc_del_short] removed 0 short overlaps [M::main] ===> Step 4.4: removing short internal sequences and bi-loops <=== [M::asg_cut_internal] cut 2 internal sequences [M::asg_cut_biloop] cut 0 small bi-loops [M::asg_cut_tip] cut 2 tips [M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips [M::main] ===> Step 4.5: aggressively cutting short overlaps <=== [M::asg_arc_del_short] removed 0 short overlaps [M::main] ===> Step 5: generating unitigs <=== miniasm: asm.c:267: ma_ug_seq: Assertionsub[id].e - sub[id].s <= ks->seq.l' failed.
Aborted (core dumped)
`

Can you please tell me why this happen?

Thanks in advance.
Athul

Add version flag -V

Similar to -V in minimap

Can minimap and miniasm be used for human genome assembly?

Dear Heng,
I would like to ask if minimap and miniasm are scalable for human genome assembly? Or would you recommend using Falcon or other softwares?

Best,
Siyang

multithreaded

Hi,

I am looking for the multithread flag in the manual, but cannot find one. Should I assume that miniasm is still single-threaded? Or does it use all the threads available in the node?

Cheers

Working with corrected Nanopore reads?

Hi,
Could you suggest some preset options of minimap2 for corrected Nanopore reads (about 1-3% error rate ) to do assembly?

Thanks.

Error message when providing wrong filenames

First of all thanks for an amazing set of tools! They really make genome assembly much more fun.

I have noticed that miniasm does not handle it as I would expect when I mistakenly run it with a filename that does not match a file (I know I should just provide the right name but typos happen).
When I run:
"miniasm -f nanopore.fastq nonexisting.paf > miniasm_assembly.gfa"
I get
"[M::main] ===> Step 1: reading read mappings <===
Segmentation fault (core dumped)"
I would have expected something like "nonexisting.paf: No such file or directory"
If I on the other hand run
"miniasm -f nonexisting.fastq reads.paf > miniasm_assembly.gfa"
It seems as if miniasm does not check if the file exists before it starts running. Here again it would be nice to get something like "nonexisting.fastq: No such file or directory".

Best regards
Rasmus Kirkegaard

High scoring segment pairs

How to calculate high scoring segment pairs with minimap2 for nanopore cDNA reads?

improving assembly graph contiguity ("connected component N50")

In many assemblies we are finding that assembly graph contiguity is not much higher than that of contigs extracted from it. This is true not just for miniasm but also other assemblers.

Why might this be, and is there any way to improve contiguity of the graph?

A big hit to the N50 of the output would be acceptable in exchange for an improved "connected component N50".

miniasm segmentation fault

Hi LiHeng,

I use miniasm and run into segmentation fault. I reproduced the error. Both times it only used 52% of the total 239G memory before it fail. The ovlp_corn.paf.gz is 511GB and reads.fa is 43GB raw Pacbio long-read for about 20X coverage with genome size about 2.2Gb. Please note the second time when I repeat the error, I only did miniasm step. Is there anything I missed? What it can go wrong?

$ tail log-0605.out
[M::worker_pipeline::39280.93813.59] mapped 64199 sequences
[M::worker_pipeline::39324.00913.58] mapped 69836 sequences
[M::worker_pipeline::39356.45913.58] mapped 70597 sequences
[M::worker_pipeline::39368.77713.58] mapped 73223 sequences
[M::worker_pipeline::39369.057*13.58] mapped 10406 sequences
[M::main] Version: 2.10-r761
[M::main] CMD: /opt/conda/bin/minimap2 -x ava-pb -t 20 reads.fa reads.fa
[M::main] Real time: 39370.033 sec; CPU: 534467.581 sec
[M::main] ===> Step 1: reading read mappings <===
miniasm_cluster_corn.sh: line 20: 22168 Segmentation fault (core dumped) /opt/conda/bin/miniasm -f reads.fa ovlp_corn.paf.gz > reads_corn.gfa

The repeated log as following:
Wed Jun 6 15:42:21 UTC 2018
/opt/conda/bin/miniasm -f reads.fa ovlp_corn.paf.gz > reads_corn.gfa
Wed Jun 6 18:50:23 UTC 2018

The repeated error as following:
==> log_0606_2.out <==
[M::main] ===> Step 1: reading read mappings <===
miniasm_cluster_corn.sh: line 20: 27 Segmentation fault (core dumped) /opt/conda/bin/miniasm -f reads.fa ovlp_corn.paf.gz > reads_corn.gfa

I hope to hear from you soon.
Best,
Xin

Segmentation fault at step 1

Hi Heng,

miniasm throws the following segmentation fault error already during step 1:
[M::main] ===> Step 1: reading read mappings <===
/var/spool/slurmd/job9221389/slurm_script: line 26: 3636 Segmentation fault (core dumped) miniasm -f $FASTQ minimap_GRW_reads.paf.gz > miniasm_GRW_contings.gfa

I successfully ran minimap and got minimap_GRW_reads.paf.gz file. I allocated 128GB of RAM and using a profiler script checked that the RAM consumption was never above 80GB including the moment of crush. Could you please advise me on how to fix it? Thanks!

Best,
Nikolay

How to estimate computational resource?

Hi there,

I would like to use miniasm and minimap2 to assembly ~20X coverage Pacbio long reads corn genome (genome size = 2G ). May I know how to estimate memory, cpu and disk space it will use not only at the end but also during the process of assembly? I will have more genome coming on the way. Therefore, it will be great I can get some guideline from you. Thank you.

Best,
Xin

GFA usage help

Hi, pliz help me understand how I use the GFA file. I want to assemble ONT reads but I have no idea how to get statistics on the miniasm assembly.

Color output

Dear Heng,
Since minimap2 now estimates per-base sequence divergence, it would be great if there was a way to represent this divergence level with some color in minidot; either as a threshold (e.g. everything below x% divergence would be in blue and everything above in red) or a color scale. Is there a chance you might implement something like this in the future?
Best regards,
Jean-François

Please tag a stable release

The option -R was included in r131, and the most recent tagged release is 0.2 r128. Could you please tag a new stable release that includes the -R option?

miniasm: illegal option -- R

Currently miniasm -R prints a warning and continues. Could this please instead exit with an error?

Thanks, Heng!

sam2paf javascript code

Hi,

I am trying to run the sam2paf javascript code to do the format conversion. Due to no prior experience with javascript, I have been following multiple ways, mentioned online to get the code running on mac/linux, but saw no success so far.
Do you have the conversion script written in other language? or can you supply some tips on how to compile/execute this file?

On mac, I tried the following:

/System/Library/Frameworks/JavaScriptCore.framework/Versions/Current/Resources/jsc miniasm/misc/sam2paf.js bwa.30K.sam

Exception: ReferenceError: Can't find variable: File
global code@miniasm/misc/sam2paf.js:1:63
Could not open file: bwa.30K.sam

Parameter tweaking for assembling error-corrected reads

I am assembling error corrected pacbio reads from canu for a heterogeneous sample (pool of sexually reproducing individuals) using parameters in the README. Raw coverage was 80X that reduces to 26X after correction. I don't get any better assemblies with the error corrected reads.

Would you recommend any alternative parameter choices when handling error corrected pacbio reads? Thanks

Core Dumped while Generating Unitigs

Hello: I have checked my reads for duplicate entries as several previous entries have recommended and nothing has printed to STDOUT-- so I do not have duplicate entries in either my Self-to-Self minimap mapping (*.paf.gz) or my input fastq files.

I am getting the following error during de novo assembly with miniasm during Step 5: generating unitigs:

M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::318.0801.00] read 112487066 hits; stored 163806093 hits and 252748 sequences (3081524130 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::364.3611.00] 226439 query sequences remain after sub
[M::ma_hit_cut::368.5081.00] 160297166 hits remain after cut
[M::ma_hit_flt::374.0541.00] 145797682 hits remain after filtering; crude coverage after filtering: 460.63
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::389.2671.00] 225559 query sequences remain after sub
[M::ma_hit_cut::392.9511.00] 145087145 hits remain after cut
[M::ma_hit_contained::399.349*1.00] 3193 sequences and 58708 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 39392 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 20135 arcs
[M::asg_arc_del_multi] removed 1 multi-arcs
[M::asg_arc_del_asymm] removed 1586 asymmetric arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 1119 tips
[M::asg_pop_bubble] popped 1 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 1381 asymmetric arcs
[M::asg_arc_del_short] removed 2873 short overlaps
[M::asg_cut_tip] cut 941 tips
[M::asg_pop_bubble] popped 10 bubbles and trimmed 7 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 155 asymmetric arcs
[M::asg_arc_del_short] removed 197 short overlaps
[M::asg_cut_tip] cut 387 tips
[M::asg_pop_bubble] popped 16 bubbles and trimmed 8 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 63 asymmetric arcs
[M::asg_arc_del_short] removed 65 short overlaps
[M::asg_cut_tip] cut 149 tips
[M::asg_pop_bubble] popped 4 bubbles and trimmed 3 tips
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 23 internal sequences
[M::asg_cut_biloop] cut 17 small bi-loops
[M::asg_cut_tip] cut 31 tips
[M::asg_pop_bubble] popped 1 bubbles and trimmed 1 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 5 asymmetric arcs
[M::asg_arc_del_short] removed 5 short overlaps
[M::asg_cut_tip] cut 8 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 5: generating unitigs <===
miniasm: asm.c:267: ma_ug_seq: Assertion `sub[id].e - sub[id].s <= ks->seq.l' failed.
Aborted (core dumped)

Any assistance with respect to this error would be greatly appreciated.

Thanks!

What are the usage parameter of miniasm for aligning a reference.fasta and a PAF file?

miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa
here reads.fq is a read file. But I use a reference file in stead of read file what parameters would miniasm need?

Segmentation fault happened at step3

Hi Heng,

I have a problem while running the miniasm to assemble a plant genome. My genome is nearly 400MB. It was sequenced by PacBio with depth of 85x. Before running miniasm, I used the minimap to do the overlap. Do you have some suggestions to solve this problem?

Thanks,
Wen-Biao

[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::10096.666_1.00] read 2876324180 hits; stored 4112957381 hits and 3335749 sequences (31132741397 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::11325.809_1.00] 3223555 query sequences remain after sub
[M::ma_hit_cut::11845.499_1.00] 3940583267 hits remain after cut
[M::ma_hit_flt::12142.079_1.00] 2353809445 hits remain after filtering; crude coverage after filtering: 465.37
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::12272.969_1.00] 3203496 query sequences remain after sub
[M::ma_hit_cut::12411.309_1.00] 2326599127 hits remain after cut
Segmentation fault

empty .gfa

Hello,
I would like to assemble my MinIon output (fastq file) data with minimap and miniasm. I was able to create the .paf.gz file with minimap which looks good. However when I’m trying to assemble with miniasm, I get an empty .gfa file.
The command lines I am running are :

minimap/minimap -Sw5 -L100 -m0 -t8 reads.fq reads.fq | gzip -1 > reads.paf.gz #this one works
and
miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa #here reads.gfa is empty

Would you any idea what is the problem and how to solve it ?

Best regards,

Emma Corre

Unitig id suffix 'l' vs 'c'

Hello Heng,

I ran miniasm on some sets of reads of different lengths and found that in some cases the same unitig would have an 'l' suffix in one file and a 'c' suffix in another. For example, utg000028c in assemblyA.gfa and utg000028l in assemblyB.gfa. I was just wondering what these identifiers stand for.

Thanks in advance!
Audrina

miniasm: asm.c:267: ma_ug_seq: Assertion `sub[id].e - sub[id].s <= ks->seq.l' failed.

Hi, I have got miniasm: asm.c:267: ma_ug_seq: Assertion sub[id].e - sub[id].s <= ks->seq.l' failed.`

[M::main] ===> Step 0: removing contained reads <===
[M::ma_hit_no_cont::2767.695*1.00] dropped 609653 contained reads
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::7011.622*1.00] read 2185758957 hits; stored 2333379196 hits and 2773338 sequences (40113856176 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::7353.708*1.00] 2659876 query sequences remain after sub
[M::ma_hit_cut::7483.210*1.00] 2139917924 hits remain after cut
[M::ma_hit_flt::7547.623*1.00] 547748330 hits remain after filtering; crude coverage after filtering: 109.85
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::7562.296*1.00] 2639454 query sequences remain after sub
[M::ma_hit_cut::7590.590*1.00] 488640251 hits remain after cut
[M::ma_hit_contained::7619.274*1.00] 339002 sequences and 8459816 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 5136921 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 2569347 arcs
[M::asg_arc_del_multi] removed 6952 multi-arcs
[M::asg_arc_del_asymm] removed 161996 asymmetric arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 13178 tips
[M::asg_pop_bubble] popped 3970 bubbles and trimmed 84 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 194125 asymmetric arcs
[M::asg_arc_del_short] removed 1070031 short overlaps
[M::asg_cut_tip] cut 15410 tips
[M::asg_pop_bubble] popped 5473 bubbles and trimmed 415 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 58017 asymmetric arcs
[M::asg_arc_del_short] removed 91873 short overlaps
[M::asg_cut_tip] cut 8879 tips
[M::asg_pop_bubble] popped 2763 bubbles and trimmed 337 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 34260 asymmetric arcs
[M::asg_arc_del_short] removed 49438 short overlaps
[M::asg_cut_tip] cut 9207 tips
[M::asg_pop_bubble] popped 2694 bubbles and trimmed 555 tips
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 8866 internal sequences
[M::asg_cut_biloop] cut 16632 small bi-loops
[M::asg_cut_tip] cut 919 tips
[M::asg_pop_bubble] popped 308 bubbles and trimmed 104 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 15051 asymmetric arcs
[M::asg_arc_del_short] removed 19751 short overlaps
[M::asg_cut_tip] cut 3665 tips
[M::asg_pop_bubble] popped 1094 bubbles and trimmed 601 tips
[M::main] ===> Step 5: generating unitigs <===
miniasm: asm.c:267: ma_ug_seq: Assertion `sub[id].e - sub[id].s <= ks->seq.l' failed.
/var/spool/PBS/mom_priv/jobs/2106550.pbs.SC: line 23: 38560 Aborted                 miniasm -Rc2 -f /work/waterhouse_team/fruit/contamination-free-pacbio/RSnSeQ_110fruit-rm-clean-rmsmrtbell-gt-10k.fastq fruit-gt-10k/fruit.paf.gz > fruit-gt-10k/fruit.gfa
[M::mm_idx_gen::0.005*0.00] collected minimizers

What did I miss?

Thank you in advance.

Michal

miniasm over corrected long reads

Hi Heng,

If we want to run miniasm over a set of error-corrected long reads (~99% accuracy),
which parameters would you recommend us to adjust?
as mniasm seems to be optimized for overlapping noisy long reads.
It looked to us the -k (kmer=15bp) and -r (bandwidth=500) in minimap stage,
and -i (min identity [0.05]) and -I (min end-to-end match ratio [0.8]) in miniasm stage, might be worth trying but we are not sure.
Your comments will be highly appreciated.

Thanks,
Yao-Ting

assembling short references

I'm curious if miniasm works for the assembly of multiple high-quality sequences. For instance, the GRCh38 ALTs that are being used in the graph challenge in the ga4gh.

So, we tried to assemble some short genes in the MHC. I store some in the vg/test directory. For instance,

wget https://raw.githubusercontent.com/ekg/vg/master/test/GRCh38_alts/FASTA/HLA/A-3105.fa
minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa | gzip -1 > reads.paf.gz
miniasm -f reads.fq reads.paf.gz > reads.gfa

reads.gfa is empty.

It looks like the mapping works as expected,

➜  x  minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa | gzip -1 > reads.paf.gz
[M::mm_idx_gen::0.007*0.59] collected minimizers
[M::mm_idx_gen::0.014*1.47] sorted minimizers
[M::main::0.014*1.47] loaded/built the index for 11 target sequence(s)
[M::main] max occurrences of a minimizer to consider: 16
[M::main] Version: r122
[M::main] CMD: minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa
[M::main] Real time: 0.038 sec; CPU: 0.064 sec

but the graph shrinks dramatically during "containment removal":

➜  x  miniasm -f reads.fq reads.paf.gz > reads.gfa
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::0.000*0.00] read 186 hits; stored 211 hits and 11 sequences (147718 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::0.000*0.00] 11 query sequences remain after sub
[M::ma_hit_cut::0.000*0.00] 111 hits remain after cut
[M::ma_hit_flt::0.000*0.00] 111 hits remain after filtering; crude coverage after filtering: 7.41
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::0.000*0.00] 11 query sequences remain after sub
[M::ma_hit_cut::0.000*0.00] 111 hits remain after cut
[M::ma_hit_contained::0.000*0.00] 2 sequences and 2 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 2 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 1 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: r104
[M::main] CMD: miniasm -f reads.fq reads.paf.gz
[M::main] Real time: 0.001 sec; CPU: 0.000 sec

Out of curiosity, I poked around in the code to try to get a sense of the state of the assembly graph at the point where containment reduction happens, but I don't have a good enough sense of how it is working to know what I'm looking at.

Have you tried this kind of assembly with miniasm? Is it possible? If so how should miniasm be parameterized to get it to happen?

I think it would be very useful to get this going. In the abstract, it seems it should work. Tolerating high error rates between between reads is analogous to the same problem between homologous but divergent haplotypes.

minidot plot in specific order

When mapping contigs VS a reference, is there a way to plot in a specific order?

Eg: From reference start position, column 8 of PAF

Repeat masking

Hi,
Is it possible to use daligner output in order to be able to take advantages of:

Thank you in advance.

Michal

genomes longer than 100MB

Is there an upper limit on genome size for miniasm? which parameters can be adjusted to have a better chance to get a successful run?

lh3 / miniasm Goto Github PK

miniasm's Introduction

Getting Started

Introduction

Algorithm Overview

Limitations

miniasm's People

Contributors

Stargazers

Watchers

Forkers

miniasm's Issues

I have tried miniasm with the test dataset and there seems to be something wrong with miniasm in my setting. I have followed the "getting started" scripts as to test whether the tool works properly

Download sample PacBio from the PBcR website

Install minimap and miniasm (requiring gcc and zlib)

Overlap

Layout

miniasm/miniasm -f reads.fq reads.paf.gz > reads.gfa

Recommend Projects

Recommend Topics

Recommend Org

I have tried miniasm with the test dataset and there seems to be something wrong with miniasm in my setting.
I have followed the "getting started" scripts as to test whether the tool works properly