
haslr's People

Contributors

haghshenas, hosseinasghari


haslr's Issues

Assembling the test data does not generate an assembly

Neither the conda nor the make version generates an assembly for the E. coli dataset.

ll ecoli
total 224M
drwxr-xr-x 2 cjh info 4.0K Jul 25 19:00 asm_contigs_k49_a3_lr25x_b500_s3_sim0.85
-rw-r--r-- 1 cjh info 2.2K Jul 25 19:00 asm_contigs_k49_a3_lr25x_b500_s3_sim0.85.err
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 asm_contigs_k49_a3_lr25x_b500_s3_sim0.85.out
-rw-r--r-- 1 cjh info 111M Jul 25 16:32 lr25x.fasta
-rw-r--r-- 1 cjh info 736 Jul 25 19:00 map_contigs_k49_a3_lr25x.log
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 map_contigs_k49_a3_lr25x.paf
-rw-r--r-- 1 cjh info 86 Jul 25 16:32 sr.fofn
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 sr_k49_a3.contigs.nooverlap.250.fa
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 sr_k49_a3.contigs.nooverlap.fa
-rw-r--r-- 1 cjh info 114M Jul 25 19:00 sr_k49_a3.h5
-rw-r--r-- 1 cjh info 37K Jul 25 19:00 sr_k49_a3.log
-rw-r--r-- 1 cjh info 0 Jul 25 17:06 sr_k49_a3.unitigs.fa.doubledKmers.4
-rw-r--r-- 1 cjh info 0 Jul 25 17:06 sr_k49_a3.unitigs.fa.doubledKmers.5
-rw-r--r-- 1 cjh info 0 Jul 25 17:06 sr_k49_a3.unitigs.fa.doubledKmers.6
-rw-r--r-- 1 cjh info 0 Jul 25 17:06 sr_k49_a3.unitigs.fa.doubledKmers.7

ll ecoli/asm_contigs_k49_a3_lr25x_b500_s3_sim0.85
total 28M
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 asm.final.ann
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 asm.final.fa
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.01.init.gfa
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.01.init.stat
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.02.weakEdge.gfa
-rw-r--r-- 1 cjh info 42 Jul 25 19:00 backbone.02.weakEdge.stat
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.03.tip.gfa
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.03.tip.log
-rw-r--r-- 1 cjh info 42 Jul 25 19:00 backbone.03.tip.stat
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.04.simplebubble.gfa
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.04.simplebubble.log
-rw-r--r-- 1 cjh info 42 Jul 25 19:00 backbone.04.simplebubble.stat
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.05.superbubble.gfa
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.05.superbubble.log
-rw-r--r-- 1 cjh info 42 Jul 25 19:00 backbone.05.superbubble.stat
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.06.smallbubble.gfa
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.06.smallbubble.log
-rw-r--r-- 1 cjh info 42 Jul 25 19:00 backbone.06.smallbubble.stat
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 backbone.branching.log
-rw-r--r-- 1 cjh info 44K Jul 25 19:00 compact_uniq.txt
-rw-r--r-- 1 cjh info 16 Jul 25 19:00 index.contig
-rw-r--r-- 1 cjh info 28M Jul 25 19:00 index.longread
-rw-r--r-- 1 cjh info 0 Jul 25 19:00 log_asmfinal.txt

cat ecoli/asm_contigs_k49_a3_lr25x_b500_s3_sim0.85.err
[NOTE] number of threads: 8

[NOTE] loading contig sequences...
processing file: ecoli/sr_k49_a3.contigs.nooverlap.fa... Done in 0.00 CPU seconds (0.00 real seconds)
loaded 0 contigs
elapsed time 0.00 CPU seconds (0.00 real seconds)

[NOTE] calculating kmer frequency of unique contigs
mean: -nan
elapsed time 0.00 CPU seconds (0.00 real seconds)

[NOTE] loading long read sequences...
processing file: ecoli/lr25x.fasta... Done in 0.40 CPU seconds (0.40 real seconds)
loaded 6453 long reads
elapsed time 0.41 CPU seconds (0.41 real seconds)

[NOTE] loading alignment between contigs and long reads...
processing file: ecoli/map_contigs_k49_a3_lr25x.paf... Done in 0.00 CPU seconds (0.00 real seconds)
loaded 0 alignments
elapsed time 0.44 CPU seconds (0.45 real seconds)

[NOTE] fixing overlapping alignments...
elapsed time 0.44 CPU seconds (0.45 real seconds)

[NOTE] building compact long reads...
elapsed time 0.45 CPU seconds (0.45 real seconds)

[NOTE] building the backbone graph...
elapsed time 0.45 CPU seconds (0.45 real seconds)

[NOTE] cleaning weak edges...
removed 0 edges
elapsed time 0.45 CPU seconds (0.45 real seconds)

[NOTE] cleaning tips...
removed 0 tips
elapsed time 0.45 CPU seconds (0.46 real seconds)

[NOTE] cleaning simple bubbles...
removed 0 simple bubbles
elapsed time 0.45 CPU seconds (0.46 real seconds)

[NOTE] cleaning super bubbles...
removed 0 super bubbles
elapsed time 0.45 CPU seconds (0.46 real seconds)

[NOTE] cleaning small bubbles...
removed 0 small bubbles
elapsed time 0.45 CPU seconds (0.46 real seconds)

[NOTE] calculating long read coordinates between anchors...
elapsed time 0.45 CPU seconds (0.46 real seconds)

[NOTE] calling consensus sequence between anchors...
elapsed time 0.45 CPU seconds (0.46 real seconds)

[NOTE] generating the assembly from the cleaned backbone graph...
elapsed time 0.45 CPU seconds (0.46 real seconds)

[NOTE] cleaning up the memory!
[NOTE] elapsed time 0.45 CPU seconds (0.47 real seconds)

*** BYE ***

Short Homo sapiens Assembly from Genome in a Bottle Data

Hi, I'm trying HASLR using data from GIAB: https://github.com/genome-in-a-bottle/giab_data_indexes/tree/master/NA12878

  • All PacBio HiFi reads (~30X), concatenated into a single fastq.gz file.
  • Illumina short reads subsampled from 300X to 55X, concatenated into 2 paired-end fastq files.

I used haslr with this command

haslr.py \
    -t 14 \
    -g 3g \
    -l $long_read \
    -x corrected \
    -s $short_read \
    -o ~/NA12878/Assembly_HASLR_55X_30X

However, the resulting asm.final.fa is only 576 MB in size and covers only around 10% of GRCh38, as reported by QUAST. I tried increasing the genome_size option to 4G and --cov-l from 25 to 30, but HASLR still generates exactly the same lr*x.fasta and asm.final.fa. I even tried simply concatenating the 2 original long read files, but the lr*x.fasta is still the same.

What could have gone wrong in my case?

kmer determination

Hi,
What would be a good way to determine the k-mer size?

Thank you in advance,

Michal

minimum coverage requirements?

Do you have a sense of the minimum coverage required? It seems that you used 50X long-read coverage in your manuscript, and other long-read assemblers want 30-60X. But what about 10X long-read coverage? Any chance that might work?

assembling long reads using HASLR... failed

Hi,
I'm trying to assemble 10x Chromium data together with nanopore data.

$ /home/std/hyli/haslr/bin/haslr.py -t 4 -g 1g -l /home/std/hyli/nanopore/reEEG357mix.fastq.gz /home/std/hyli/nanopore/reEEG376mix.fastq.gz /home/std/hyli/nanopore/EEG388_100fmol_KAPA60_mix.fastq.gz /home/std/hyli/nanopore/reEEG388_100fmol_ONT30mix.fastq.gz /home/std/hyli/nanopore/EEG389_100fmol_LNB15_mix.fastq.gz /home/std/hyli/nanopore/EEG389_100fmol_LNB60_mix.fastq.gz /home/std/hyli/nanopore/EEG389_200fmol_LNB30_mix.fastq.gz /home/std/hyli/nanopore/EEG850_EB_15min_mix.fastq.gz /home/std/hyli/nanopore/EEG850_H2O_15min_mix.fastq.gz /home/std/hyli/nanopore/EEG851_EB_30min_mix.fastq.gz /home/std/hyli/nanopore/EEG851_H2O_30min_mix.fastq.gz -x nanopore -s /home/std/hyli/WGS10X/EEG50_51/raw_data/EEG-50-51_S1_L005_R1_001.fastq.gz /home/std/hyli/WGS10X/EEG50_51/raw_data/EEG-50-51_S1_L005_R2_001.fastq.gz 

Here is the error information,

[18-Jun-2020 16:40:17] subsampling 25x long reads to /home/std/hyli/hybrid_assembly/miniaceus/HASLR/miniaceus/lr25x.fasta... done
[18-Jun-2020 16:57:53] assembling short reads using Minia... done
[18-Jun-2020 22:48:58] removing overlaps in short read assembly... done
[18-Jun-2020 22:50:29] removing short sequences in short read assembly... done
[18-Jun-2020 22:50:40] aligning long reads to short read assembly using minimap2... done
[19-Jun-2020 01:57:58] assembling long reads using HASLR... failed
ERROR: "haslr_assemble" returned non-zero exit status

Any suggestions for fixing this problem?
Thank you in advance for your time!

Possible to lower memory usage for haslr_assemble?

Hi,

I have a single-node Ubuntu 16.04 system with 378 GB RAM and 40 cores (80 threads). During the haslr_assemble stage, memory usage jumps to 100% and haslr_assemble starts using swap, so I dropped --cov-lr from 25 to 20 to 15 and now to 10. The genome is 450 Mbp with ~30x PacBio CLR (simulated reads) and ~60x Illumina short reads (simulated). I will see if the --cov-lr 10 setting works on my system (i.e., it doesn't use too much RAM before completion), but I was wondering if there might be some way to minimize RAM usage during this step. Any ideas?

Update: --cov-lr 10 ran out of memory as well, so I am playing around with the --aln-block and --aln-sim settings (I was using the defaults).

AttributeError: 'module' object has no attribute 'run'

Hi,

git clone https://github.com/vpc-ccg/haslr.git
cd haslr
make
export PATH=/work/team/apps/haslr/bin/:$PATH

checking /work/team/apps/haslr/bin/haslr_assemble: Traceback (most recent call last):
  File "/work/team/apps/haslr/bin/haslr.py", line 381, in <module>
    main()
  File "/work/team/apps/haslr/bin/haslr.py", line 25, in main
    check_program(path_haslr_assemble)
  File "/work/team/apps/haslr/bin/haslr.py", line 271, in check_program
    completed = subprocess.run([prog, '-h'], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
AttributeError: 'module' object has no attribute 'run'

What did I miss?
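The traceback above points at subprocess.run, which was only added in Python 3.5; invoking haslr.py with an older interpreter (e.g. Python 2) produces exactly this AttributeError. A version-portable sketch of the same quiet-run pattern (illustrative only, not HASLR's actual code; run_quiet is a hypothetical helper name):

```python
import os
import subprocess
import sys

def run_quiet(args):
    """Run a command with stdout/stderr suppressed; portable across Python versions.

    haslr.py's check_program() calls subprocess.run(), which does not exist
    before Python 3.5 -- hence the AttributeError in the report above.
    """
    if hasattr(subprocess, "run"):  # Python 3.5+
        return subprocess.run(args,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL).returncode
    # Fallback for older interpreters
    with open(os.devnull, "wb") as devnull:
        return subprocess.call(args, stdout=devnull, stderr=devnull)

# e.g. run_quiet([sys.executable, "--version"]) returns the exit status 0
```

In practice the simplest fix is to run haslr.py with a Python 3.5+ interpreter rather than patching the script.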

Why are all backbone and asm files empty?

Hi,

I used ~30X ONT data and ~300X Illumina data to assemble a highly heterozygous genome. All backbone and asm files are empty; why?
The last several lines of the sr_k49_a3.log file are:
traversal : contig
nb_contigs : 563614877
nb_small_contigs_discarded : 45039966
nt_assembled : 38008131973
max_length : 22004
graph simpification stats
tips removed : 367186126 + 17766381 + 1374406 + 191579
bulges removed : 2859263 + 63029 + 4006
EC removed
assembly traversal stats
time : 194384.097
assembly : 27581.812
graph construction : 166802.285

Best,
Kun


assembling long reads using HASLR... failed ERROR

Hi:
I downsampled the ONT data to 60X and the MGI data to 40X, and ran the script as follows:
sh ~/USER/Assembly/haslr/script/lowdepth_script/haslr_base.sh ~/USER/Assembly/outdir2/fastq/ONT_60X/NA24385_ONT.60X.fastq.gz ~/USER/Assembly/outdir2/fastq/T7_40X/NA24385_T7.40X.clean_1.fq.gz ~/USER/Assembly/outdir2/fastq/T7_40X/NA24385_T7.40X.clean_2.fq.gz ~/Assembly/outdir2/haslr_out/T740_ONT60

and my haslr_base.sh is:
ontfq=$1
r1=$2
r2=$3
outdir=$4
time ~/backup_data/anaconda3/haslr/bin/haslr.py -t 64 -o $outdir -g 3.1g -l $ontfq -x nanopore -s $r1 $r2

The error is:
checking /home/ubuntu/backup_data/anaconda3/haslr/bin/haslr_assemble: ok
checking /home/ubuntu/backup_data/anaconda3/haslr/bin/minia_nooverlap: ok
checking /home/ubuntu/backup_data/anaconda3/haslr/bin/fastutils: ok
checking /home/ubuntu/backup_data/anaconda3/haslr/bin/minia: ok
checking /home/ubuntu/backup_data/anaconda3/haslr/bin/minimap2: ok
number of threads: 64
output directory: /home/ubuntu/USER/lizhichao/Assembly/outdir2/haslr_out/T740_ONT60
[27-Sep-2020 19:23:57] subsampling 25x long reads to /home/ubuntu/USER/lizhichao/Assembly/outdir2/haslr_out/T740_ONT60/lr25x.fasta... done
[27-Sep-2020 20:39:33] assembling short reads using Minia... done
[28-Sep-2020 03:27:39] removing overlaps in short read assembly... done
[28-Sep-2020 03:28:04] removing short sequences in short read assembly... done
[28-Sep-2020 03:28:11] aligning long reads to short read assembly using minimap2... done
[28-Sep-2020 04:11:15] assembling long reads using HASLR... failed
ERROR: "haslr_assemble" returned non-zero exit status
Command exited with non-zero status 70
366544.02user 252752.33system 12:17:11elapsed 1400%CPU (0avgtext+0avgdata 309192052maxresident)k
977298512inputs+838263248outputs (163major+24284513190minor)pagefaults 0swaps

So, could you tell me how I can handle this?

Error - std::bad_alloc during "calling consensus sequence between anchors"

[NOTE] calling consensus sequence between anchors...
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

Compute resources seem to be ample from a memory perspective; the node has 768 GB of RAM. Monitoring with top, the task seemed to be around 330 GB when the error occurred.

Input data:
(Haslr) pinnacle-l4:jpummil:/scrfs/storage/jpummil/C.vittatus$ ls -lh *.fastq
-rw-r--r--. 1 jpummil jpummil 13G Feb 6 11:07 NU2WGS_R1.fastq
-rw-r--r--. 1 jpummil jpummil 13G Feb 6 11:09 NU2WGS_R2.fastq
-rw-r--r--. 1 jpummil jpummil 31G Feb 6 11:16 Q1133andQ1171.fastq

input script:
haslr.py -t 24 -g 700m -o RUN1 -l Q1133andQ1171.fastq -x pacbio -s NU2WGS_R1.fastq NU2WGS_R2.fastq

Does this tool do phasing?

Hi

I am interested in using your tool. In your example, is backbone.06.smallbubble.gfa the final graph of the assembly? And does it include duplications and haplotigs that can be used for subsequent phasing?

I was wondering whether polishing with short reads (which happens as part of HASLR) will remove duplications or not, especially at the .gfa stage/files.

Arranging files with paired end short-reads

Dear developers,

This looks like a great tool!

With regards to Forward/Reversed PE reads:

Can the F/R reads be provided one after another in the same fastq file or do they have to be provided as two separate files?

Similarly, can the program read two or more pairs of F/R files?

Assembling pacbio, nanopore and illumina reads

Thanks for providing and maintaining HASLR. Is it possible to use HASLR to assemble sequences from pacbio, nanopore, and Illumina all together at the same time? I have used HASLR to assemble pacbio and Illumina data before and it worked well. But now I have nanopore data from the same individuals. This is a bat genome estimated to be 3 Gb.
Humble regards
Charles

Much Shorter Assembly than Expected

Hello, I'm using HASLR with nanopore and Illumina data to assemble a P. falciparum genome.

All nanopore data has around ~50x coverage.
All Illumina short reads are set as 2 paired-end fastq files (the pairing information doesn't matter for haslr, I believe?)

I used this command: haslr.py -t 10 -o pfalciparum -g 23m -l nanopore_data.fasta nanopore -s illumina_data.fasta

The resulting asm.final.fa contains about 10 million base pairs, which is much shorter than the expected 23 million base pairs for Plasmodium falciparum. I've run this on several different nanopore samples and gotten the same result: a much shorter assembly than expected.

Do you have any suggestions? Thank you so much.

make clean required before make

Hi,

If I follow

git clone https://github.com/vpc-ccg/haslr.git
cd haslr
make

For commit e67b1eb

I get the following error

src/Assemble.cpp:6:20: fatal error: spoa.hpp: No such file or directory

Unless I do

make clean
make

Not a big issue, I am using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609

Update: Actually, I used make -j 75 and got the above error, but I don't get the error with a plain make.

Conda version does not produce final assembly

Hi,

Nice work on this! I initially installed the conda version haslr=0.8a1=py38h1c8e9b9_1 from bioconda. It seems to run fine: there are no error messages, and it even says that the long read assembly is done, but I can't find the assembly file anywhere.
I then cloned the git repo and tried that version, and it works with the same exact command.

Cheers

Multiple long read files in the latest version

Hi, it seems that the latest version of haslr allows the user to provide multiple fastq files as input, based on this code:

parser_req.add_argument('-l', '--long', type=str, help='long read file', nargs='+')

This is not allowed in https://github.com/vpc-ccg/haslr/releases/tag/v0.8a1.

Could you please update the bioconda version (https://anaconda.org/bioconda/haslr)? As of today, it's v0.8a1.
Could you please also let me know whether gzipped fastq files are allowed? I didn't find this information in the help message. Thanks in advance!
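The nargs='+' in the argparse line quoted above is what makes the option collect one or more file paths into a list. A minimal sketch of that behavior (only the quoted option is reproduced; the rest of haslr.py's parser is omitted):

```python
import argparse

# Minimal reproduction of the quoted option definition from haslr.py.
parser = argparse.ArgumentParser()
parser.add_argument('-l', '--long', type=str, help='long read file', nargs='+')

# nargs='+' gathers every value after -l into a single list, in order.
args = parser.parse_args(['-l', 'reads_a.fastq', 'reads_b.fastq.gz'])
print(args.long)  # ['reads_a.fastq', 'reads_b.fastq.gz']
```

With nargs='+' argparse also rejects a bare `-l` with no files at all, which is why a single-file invocation keeps working unchanged.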

Set coverage to use all

Is it possible to specify that you don't want to subsample to a certain coverage?

I can obviously set --cov-lr to something huge, but is there something like setting it to 0 to tell it to just use all?

speed running

Hi,
I'm running the software to assemble the human genome. It has been running for a day and is still going; how can I speed it up? Generally speaking, how much memory is needed per thread? If I have sufficient memory, can I set a higher thread count? My machine has 64 cores and 500 GB of memory. Here is my script:
~/backup_data/anaconda3/haslr/bin/haslr.py -t 8 -o ~/USER/lizhichao/Assembly/outdir/Assemblyoutput -g 3g -l ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_ONT.fastq.gz -x nanopore -s ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_1.fq.gz ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_2.fq.gz &&
echo "haslr finished"

ERROR: "haslr_assemble" returned non-zero exit status

Hi,

$ haslr.py --contig megahit-final.contigs.fa --cov-lr 0 -t 8 -o results -g 2.7g -l N_sylvestris_UL.fastq -x nanopore

checking /work/waterhouse_team/apps/haslr/bin/haslr_assemble: ok
checking /work/waterhouse_team/apps/haslr/bin/minia_nooverlap: ok
checking /work/waterhouse_team/apps/haslr/bin/fastutils: ok
checking /work/waterhouse_team/apps/haslr/bin/minia: ok
checking /work/waterhouse_team/apps/haslr/bin/minimap2: ok
number of threads: 8
output directory: /lustre/scratch/waterhouse_team/Nsylvestris/megahit/haslr/results
[24-Feb-2022 06:55:55] renaming long reads and storing in /lustre/scratch/waterhouse_team/Nsylvestris/megahit/haslr/results/lrall.fasta... done
[24-Feb-2022 06:58:13] removing overlaps in short read assembly... done
[24-Feb-2022 06:58:26] removing short sequences in short read assembly... done
[24-Feb-2022 06:58:37] aligning long reads to short read assembly using minimap2... done
[24-Feb-2022 09:39:49] assembling long reads using HASLR... failed
ERROR: "haslr_assemble" returned non-zero exit status

What did I miss?

Thank you in advance,

Michal

Bug while running script: "could not find file"

If I run haslr.py after it was aborted, the shell returns "could not find file log". If I run a freshly compiled haslr, the shell returns "haslr.py: error: could not find file haslr".

Once, I passed an export into the ass directory in the command line arguments, but haslr raised the errors "could not find file ass" and "ass is not file or directory". If I change the command line arguments, it still returns such errors.

final.fa is 0 kb

Hello,
I assembled the plant mitochondrial genome with the following command:
haslr.py -t 40 -o pl_mt -g 1.36 -l ONT.fastq.gz -x nanopore -s 1.fastq.gz 2.fastq.gz

assembling short reads using Minia... done
removing overlaps in short read assembly... done
removing short sequences in short read assembly... done
aligning long reads to short read assembly using minimap2... done
assembling long reads using HASLR... done

When I use the above command, it finishes in two minutes. I checked the output folder and found the generated final.fa file, but it does not contain any data.
How to solve this issue?
Thank you.
With best regards,
Raman. G
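One thing worth noting in the command above: -g is given as 1.36 with no unit suffix, whereas other invocations in these issues use values like 4.6m or 3g. A sketch of how such suffixed sizes are conventionally interpreted (an assumption about the convention only; parse_genome_size is a hypothetical helper, not haslr.py's actual parser):

```python
def parse_genome_size(text):
    """Interpret sizes like '4.6m' or '3g' as base pairs.

    Illustrative sketch of the k/m/g suffix convention; haslr.py's real
    parsing may differ. A bare number like '1.36' carries no multiplier,
    so it would be read as roughly one base pair.
    """
    multipliers = {'k': 10**3, 'm': 10**6, 'g': 10**9}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(float(text))

print(parse_genome_size('4.6m'))  # 4600000
print(parse_genome_size('3g'))    # 3000000000
```

Under this reading, '1.36' without a suffix would not mean 1.36 Mb or Gb, which could explain a near-instant run and an empty output.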

HASLR on heterozygous genomes

Hi,

I wonder if you have any idea of how your assembler will work with highly heterozygous genomes: do you think it will be able to recognize allelic long reads and maintain the phasing within a read?
My plant genome is ~2.5 Gb, but assembling it I get almost a 4 Gb assembly, i.e. lots of sites are very different and don't collapse into a single contig per locus, but rather one for each allele.
Do you think it will be worth giving it a try, or will HASLR smash both alleles together anyway?
Thanks,

Dario

Failing while checking minia

Hello, I have installed haslr and my bin includes all 6 files:

bin/fastutils
bin/haslr_assemble
bin/haslr.py
bin/minia
bin/minimap2
bin/minia_nooverlap

I tried with the example command:

python haslr.py -t 8 -o /haslr/ecoli -g 4.6m -l /haslr/ecoli_filtered.fastq.gz -x pacbio -s /haslr/ecoli_miseq.1.fastq.gz /haslr/ecoli_miseq.2.fastq.gz

And I am getting this error:

checking /haslr/bin/haslr_assemble: ok
checking /haslr/bin/minia_nooverlap: ok
checking /haslr/bin/fastutils: ok
checking /haslr/bin/minia: failed

I know it is failing to locate minia, but minia is installed. After make I got this (just the minia part):

Saving to: "minia-v3.2.1-bin-Linux.tar.gz"

100%[=============================================================================================================>] 6,978,468 39.3M/s in 0.2s

2021-02-20 19:33:54 (39.3 MB/s) - "minia-v3.2.1-bin-Linux.tar.gz" saved [6978468/6978468]

minia-v3.2.1-bin-Linux/LICENSE
minia-v3.2.1-bin-Linux/lib/
minia-v3.2.1-bin-Linux/lib/libhdf5.settings
minia-v3.2.1-bin-Linux/bin/
minia-v3.2.1-bin-Linux/bin/gatb-h5dump
minia-v3.2.1-bin-Linux/bin/minia
minia-v3.2.1-bin-Linux/test/
minia-v3.2.1-bin-Linux/test/genome10K.fasta
minia-v3.2.1-bin-Linux/test/10k_test.sh
minia-v3.2.1-bin-Linux/test/bubble.solution.fa
minia-v3.2.1-bin-Linux/test/1seq_90bp.reads.fq
minia-v3.2.1-bin-Linux/test/buchnera_test.sh
minia-v3.2.1-bin-Linux/test/buchnera.fasta
minia-v3.2.1-bin-Linux/test/ERR039477.md5
minia-v3.2.1-bin-Linux/test/1seq_90bp_circ.fa
minia-v3.2.1-bin-Linux/test/compare_fasta.py
minia-v3.2.1-bin-Linux/test/1seq.fa
minia-v3.2.1-bin-Linux/test/ERR039477.md5-gcc47
minia-v3.2.1-bin-Linux/test/bubble.fa
minia-v3.2.1-bin-Linux/test/X.solution.fa
minia-v3.2.1-bin-Linux/test/1seq_90bp.fa
minia-v3.2.1-bin-Linux/test/README
minia-v3.2.1-bin-Linux/test/tip.fa
minia-v3.2.1-bin-Linux/test/1seq_90bp_simulate_reads.sh
minia-v3.2.1-bin-Linux/test/ec.solution.fa
minia-v3.2.1-bin-Linux/test/X.fa
minia-v3.2.1-bin-Linux/test/high_abundance.fa
minia-v3.2.1-bin-Linux/test/ec.fa
minia-v3.2.1-bin-Linux/test/read50x_ref10K_e001.fa
minia-v3.2.1-bin-Linux/test/simple_test.sh
minia-v3.2.1-bin-Linux/test/buchnera_simulate_reads.sh
minia-v3.2.1-bin-Linux/test/test_ERR039477.sh
minia-v3.2.1-bin-Linux/test/tip.solution.fa
minia-v3.2.1-bin-Linux/README.md

May I know where I am going wrong?

Polishing example

Hi (not really an issue).

Could you be so kind as to add a little info/example on polishing the assembly? It was not clear from the manuscript how you did this.

It's probably just me (haven't assembled large genomes in a while - I'm used to unicycler which just takes care of everything).

lr0x.fasta... [ERROR] option -d/--depth is required

Hi,
I have got lr0x.fasta... [ERROR] option -d/--depth is required

conda activate haslr
haslr.py --minia-kmer $k --cov-lr 0 -t 8 -o results -g 3.2g -l ../allPacBio.fasta -x pacbio -s ../1740D-43-03_S0_L001_R1.fastq.gz ../1740D-43-03_S0_L001_R2.fastq.gz

checking /work/waterhouse_team/miniconda2/envs/haslr/bin/haslr_assemble: ok
checking /work/waterhouse_team/miniconda2/envs/haslr/bin/minia_nooverlap: ok
checking /work/waterhouse_team/miniconda2/envs/haslr/bin/fastutils: ok
checking /work/waterhouse_team/miniconda2/envs/haslr/bin/minia: ok
checking /work/waterhouse_team/miniconda2/envs/haslr/bin/minimap2: ok
number of threads: 8
output directory: /lustre/scratch/waterhouse_team/NWA/haslr-ks/95/results
subsampling 0x long reads to /lustre/scratch/waterhouse_team/NWA/haslr-ks/95/results/lr0x.fasta... [ERROR] option -d/--depth is required

done
assembling short reads using Minia... done
removing overlaps in short read assembly... done
removing short sequences in short read assembly... done
aligning long reads to short read assembly using minimap2... done
assembling long reads using HASLR... done

What did I miss?

Thank you in advance,

Michal

Clean raw data prior to assembly

Hi

I am planning to use Haslr for my project. I have ~100X Illumina short-read data and ~15X long-read data from Oxford Nanopore. I was wondering whether Haslr performs any cleanup prior to assembly, or whether we should first clean the raw data and then use Haslr for genome assembly.

Thank you
Best regards
Kritika

Using CCS/HiFi PacBio reads

Is there anything "wrong" with me using CCS reads for the long reads? i.e. is there some kind of inferred error rate attached to the long reads parameter that will mess this up? (CCS reads have a very low error rate compared to raw ont/pacbio)

Missing result files

Hi,

Everything runs very well and no error message is given. However, the final result files are just missing. I tried it twice on an Arabidopsis thaliana data set.

PacBio reads from this study: https://doi.org/10.1371/journal.pone.0216233
Illumina reads from this study: https://doi.org/10.1371/journal.pone.0164321

There is no time given for the last step listed in "asm_contigs_k49_a3_lr25x_b500_s3_sim0.85.err":
[...]
[NOTE] cleaning small bubbles...
removed 228 small bubbles
elapsed time 325.79 CPU seconds (512.21 real seconds)

[NOTE] calculating long read coordinates between anchors...
elapsed time 670.53 CPU seconds (548.19 real seconds)

[NOTE] calling consensus sequence between anchors...

Cheers,
Boas

Changing the k-mer size of Minia leads to a final fasta of zero bytes

Hi,
I am trying to assemble a genome of ~2.5 Gb with ~15X of Nanopore data and ~65X of 100 bp short-read data.
I did one run of ABySS and got a pretty bad result; I then learned that I had used the wrong k-mer size, too far from the optimum (95). I then found the software KmerGenie for finding the optimal k-mer size; for my short-read data it reported 63, while MaSuRCA said 67. For HASLR I did not tune anything the first time and used the default parameters; with k=43 I got an assembly of 672 MB, but the BUSCO result was very bad: 13.3% complete [S:13.3%, D:0.0%], 3.4% fragmented, and 83.3% missing.
The configuration I used was:
haslr.py -t 48 --minia-kmer 43 -x nanopore plus the Nanopore dataset in fasta and the short-reads _1 and _2 in fq.gz

  • I gave it 600 GB of memory and 48 cores to run.
  • Since HASLR uses Minia, I looked in the Minia manual for parameter optimization and found this:
    "kmer-size: The k-mer length is the length of the nodes in the de Bruijn graph. It strongly depends on the input dataset. For proper assembly, we recommend that you use the Minia-pipeline that runs Minia multiple times, with an iterative multi-k algorithm. That way, you won't need to choose k. If you insist on running with a single k value, the KmerGenie software can automatically find the best k for your dataset."
  • The second configuration: compared to the first one, I only changed the k-mer size from 43 to 63.

And so I did: I used the k-mer size found by KmerGenie, and the final file asm.final.fa was zero bytes!

Could you help me understand why I ended up with a zero-byte fasta file?
The log file of the first run showed no problem:

checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/haslr_assemble: ok
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/minia_nooverlap: ok
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/fastutils: ok
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/minia: ok
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/minimap2: ok
number of threads: 48
output directory: /fs/scratch/PHS0338/appz/haslr/ONTq_NoSplit_RAPL-k43
subsampling 25x long reads to /fs/scratch/PHS0338/appz/haslr/ONTq_NoSplit_RAPL-k43/lr25x.fasta... done
assembling short reads using Minia... done
removing overlaps in short read assembly... done
removing short sequences in short read assembly... done
aligning long reads to short read assembly using minimap2... done
assembling long reads using HASLR... done

The second was pretty much the same:
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/haslr_assemble: ok
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/minia_nooverlap: ok
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/fastutils: ok
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/minia: ok
checking /users/PHS0338/jpac1984/.conda/envs/assembly-Y/bin/minimap2: ok
number of threads: 48
output directory: /fs/scratch/PHS0338/appz/haslr/ONTq_NoSplit_RAPL-k63
subsampling 25x long reads to /fs/scratch/PHS0338/appz/haslr/ONTq_NoSplit_RAPL-k63/lr25x.fasta... done
assembling short reads using Minia... done
removing overlaps in short read assembly... done
removing short sequences in short read assembly... done
aligning long reads to short read assembly using minimap2... done
assembling long reads using HASLR... done

Thanks;
