dashing2's Introduction

Hi, I'm Daniel 👋

Senior Scientist at Pacific Biosciences (PacBio). Previously, I was a PhD Candidate at Johns Hopkins University in the department of Computer Science. Before that I was a Bioinformatics Scientist at ARUP Laboratories, where I worked on cell-free circulating tumor DNA (ctDNA) analysis and clinical genomics after my training in Physics [BS] and Biophysics/Computational Biology [MS]. I've worked with biological data (sequence, molecular modeling, metabolomics, transcriptomics, metagenomics), telecommunications data, as well as graph algorithms, machine learning, and numerical optimization.

🔭 I've worked on similarity search, clustering, and indexing for large-scale biological data, as well as SIMD/GPU-accelerated and randomized algorithms. Most recently, I've been developing methods for human genetics, including long RNA-seq, VNTRs, and haplotype phasing.

😄 Pronouns: He/Him/His

A quick tour of my interests

  1. Practical randomized algorithms

This ranges from libraries providing sketch data structures and coresets to projects using random projections and DCI.

My work on coresets and clustering is primarily part of the minicore project, with the aims of providing a standard utility for coreset construction and weighted clustering, especially for exponential family models and shortest-paths metrics.

  2. Computational Biology

The bonsai project provides methods for metagenomic analysis, along with k-mer encoding/decoding and I/O, while Dashing performs scalable sketching and comparison of sequence data.

BMFtools performs molecular demultiplication over barcoded sequencing data, reducing error rates while eliminating redundant information. Designed for ctDNA, this method can reduce error rates by orders of magnitude, allowing confident detection of very rare events.

scavenger provides Rust implementations of VAEs for count-based data, built on tch-rs and applied to single-cell transcriptomics.

I also co-developed pbfusion, a fast tool for characterizing transcriptional abnormalities.

  3. General C++

Most of my projects fall into this category, serving as tools I can reuse in various projects.

Some of my favorites:

  • vec provides type-generic abstractions over x86-64 vectorization, making it easy to write fast, portable code.
  • kspp is an RAII-based variant of kstring from klib, with extra niceties that make appending and printf-style formatting easy.
  • aesctr provides STL-style random number generators built on fast AES-CTR and wyhash.
  • circularqueue provides a range-based circular queue container that uses power-of-two sizes.

dashing2's People

Contributors

benlangmead, dnbaker

dashing2's Issues

no dashing2 cmp help info

Hello Daniel,

I computed sketches using dashing2 sketch and stored them using --outfile. This produces a sketch file and a sketch.name.txt file, but I am not sure how to feed these to dashing2 cmp, since no help is provided. I looked into the code, and it confuses me. I can use --cmpout to get the distances, but I want to measure how long sketching and comparison each take.

Thanks,

Jianshu

Print cardinality estimate

It would be great if there was a command that could print a cardinality estimate, perhaps from a single cached sketch (i.e. with --cache functionality). This was possible before with dashing hll.

Multi-fasta functionality

Are there plans to add multi-FASTA support, so that each sequence does not need its own FASTA file?

--mash-distance outputs similarity with --set option

Hi, with the version 2.1.9 binaries, I keep getting a similarity matrix instead of a distance matrix when using:
./dashing2_s128 sketch --cmpout distance.txt -F paths.txt -p16 -k31 --set --mash-distance
That is, the distance.txt output is identical (with ones on the diagonal) to the similariry.txt produced with:
./dashing2_s128 sketch --cmpout similariry.txt -F paths.txt -p16 -k31 --set
If I compute a weighted distance:
./dashing2_s128 sketch --cmpout distance.txt -F paths.txt -p16 -k31 --prob --mash-distance
it seems to work correctly: I receive a distance matrix (with -0 on the diagonal).
How can I get the Mash distance with --set?
Thank you very much, dasa
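As a possible stopgap, a similarity matrix can be converted to Mash distances after the fact. The Python sketch below applies the transformation from the Mash paper, D = -(1/k) * ln(2J / (1 + J)), to each Jaccard value. The function names are hypothetical, returning 1.0 when J = 0 is a convention assumed here, and reading the matrix out of the --cmpout file is left out because the file layout is not fully shown in this issue.

```python
import math


def mash_distance(jaccard: float, k: int) -> float:
    """Mash distance D = -(1/k) * ln(2J / (1 + J)) for Jaccard similarity J."""
    if not 0.0 <= jaccard <= 1.0:
        raise ValueError("Jaccard similarity must lie in [0, 1]")
    if jaccard == 0.0:
        return 1.0  # assumed convention: no shared k-mers maps to the maximum distance
    # max() clamps the -0.0 produced at J = 1 to a plain 0.0.
    return max(0.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)


def similarity_to_distance(matrix, k: int):
    """Apply the Mash transformation to every entry of a similarity matrix."""
    return [[mash_distance(j, k) for j in row] for row in matrix]
```

For example, `similarity_to_distance([[1.0, 0.5], [0.5, 1.0]], k=31)` puts zeros on the diagonal and the same positive distance in both off-diagonal cells.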

Differences in mash distances between dashing2 vs dashing1/mash

I am puzzled about the (for us) large dashing1/dashing2 differences for mash distances for some genome pairs.
In the example below with dashing1 and mash we get values around 0.25 - which reflects that these genomes are not related.

However, dashing2 reports a much lower distance value of 0.11 which strongly suggests a relationship - which is not there.

Is there a way to reproduce dashing1 results with dashing2?

Below are the steps to reproduce with the genomes from NCBI
genomeA https://www.ncbi.nlm.nih.gov/nuccore/KC139520.1?report=fasta

genomeB https://www.ncbi.nlm.nih.gov/nuccore/MZ375344.1?report=fasta

dashing sketch -k 15 -S16 -p 1 -F list_of_genomes
dashing cmp -p 1  -M -W -k15 -S16 -F list_of_genomes -Q list_of_genomes  -T -O result.tab -o result.labels

cat result.tab
genomeA.fasta   -0      0.252614
genomeB.fasta   0.252614        -0


dashing2 sketch -p 1 -k 15 -S 65536 -F list_of_genomes

dashing2 cmp -p 1 --cmpout result_D2.out  -F list_of_genomes -Q list_of_genomes --mash-distance

cat result_D2.out
#Dashing2 Panel (Query/Refernce) Output
#Dashing2Options: Dashing2Options;k:32;parsebyfile;trimchr;sketchsize:1024;sketchtype:onepermsetsketch;Fastx;canon
#Sources        genomeA.fasta   genomeB.fasta   genomeA.fasta   genomeB.fasta
genomeA.fasta   -0      0.115187205
genomeB.fasta   0.115187205     -0
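One thing that may be worth checking is the #Dashing2Options header in the output above: it reports k:32 and sketchsize:1024, whereas the sketch step was invoked with -k 15 -S 65536 and the cmp invocation does not repeat those flags. Whether this explains the discrepancy is not established here. The Python sketch below parses such a header line so the parameters actually used can be compared programmatically; the `key:value;flag;...` layout is inferred only from the sample line shown, and `parse_dashing2_options` is a hypothetical helper.

```python
def parse_dashing2_options(line: str) -> dict:
    """Parse a '#Dashing2Options:' header into a dict.

    Fields of the form 'key:value' map to their string value;
    bare fields such as 'canon' map to True.
    """
    prefix = "#Dashing2Options:"
    if not line.startswith(prefix):
        raise ValueError("not a Dashing2Options header line")
    options = {}
    for field in line[len(prefix):].strip().split(";"):
        if not field:
            continue
        # Split on the first colon only, so values may themselves contain colons.
        key, sep, value = field.partition(":")
        options[key] = value if sep else True
    return options
```

Comparing `options["k"]` and `options["sketchsize"]` against the intended values is a quick sanity check before comparing distances across tools.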

`sketch -L` doesn't work

The -L parameter for the sketch subcommand does not work as advertised. I get:

sketch: invalid option -- 'L'

Followed by usage message. If I replace -L with the long version of the parameter, --sketch-size-l2, it works fine. I took a brief look at the code and found the getopt struct for this option but didn't immediately see what might be wrong with it.

DartMinHash implementation

Hello Daniel,

I just want to ask whether there are plans to also implement DartMinHash (https://arxiv.org/abs/2005.11547), which is very fast for sparse datasets (genomes, metagenomes) and, according to my reading, much faster than BagMinHash for very sparse datasets and large k (minhashes).

Thanks,

Jianshu

how to get asymmetric matrix for containment index?

Dear Daniel,
(first thank you very much for new binaries which emit distances with --set /--countdict, it works perfectly.)

Could you please help me with another issue - the containment index?
I cannot figure out the correct settings to emit the asymmetric matrix for the containment index.
Running this:
./dashing2_s128 sketch -k31 -p8 --cmpout containment.txt -o card -F paths.txt --set --containment --asymmetric-all-pairs
I get a full square matrix, but it is symmetric: the values are the same in the upper and lower parts. They are indeed containment indexes, i.e. they match (A intersection B)/A, but the values for (A intersection B)/B are not reported.
The matrix is identical to the one produced by:
./dashing2_s128 sketch -k31 -p8 --cmpout containment.txt -o card -F paths.txt -Q paths.txt --set --containment
I tried also
./dashing2_s128 sketch -k31 -p8 --cmpout containment.txt -o card -F paths.txt -Q paths.txt --set --containment --asymmetric-all-pairs
This results in two full symmetric matrices (each the same as those mentioned above) stuck together, i.e. each F-Q sample pair is compared twice, but still in only "one direction".

Could you please help me with that?
Thank you, dasa
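For reference, the asymmetry being asked for can be illustrated on exact sets. The Python sketch below computes (A intersection B)/A and (A intersection B)/B for two toy sets of hashed k-mers and fills a full matrix whose entry [i][j] is the containment of set i in set j. It works on exact sets rather than sketches, and the names `containment` and `containment_matrix` are hypothetical; it shows the expected shape of the output, not how Dashing2 computes it.

```python
def containment(a: set, b: set) -> float:
    """Containment of a in b: |a & b| / |a| (defined as 0.0 for an empty a)."""
    if not a:
        return 0.0
    return len(a & b) / len(a)


def containment_matrix(sets):
    """Full square matrix with entry [i][j] = containment of sets[i] in sets[j]."""
    return [[containment(a, b) for b in sets] for a in sets]


# Toy example: A has 4 elements, B has 8, and they share 2.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7, 8, 9, 10}
matrix = containment_matrix([A, B])
# matrix[0][1] is 2/4 and matrix[1][0] is 2/8, so the matrix is not symmetric.
```

A correct asymmetric output would therefore differ across the diagonal whenever the two sets have different sizes.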

Please include license file

I see no license file. I note that the dependencies (bonsai, fat, libBigWig), which are included by reference, are all MIT-licensed.

please allow builds without recursive cloning

Recursive cloning results in a mess for distro packagers and can be a headache for both security and functionality. An example is bonsai pulling in sketch, which pulls in a pybind11 from 2 years ago. That version of pybind11 won't even work with python-3.11.

Please list the direct dependencies in the documentation and make sure the dependencies can actually be installed. As a bonus, your CI scripts will only show problems from the package being tested.

Sketching/comparing never completes on specific Fasta file

Processing the following FASTA file never completes (it has been running for 5 days). The following command was used: dashing2_binaries/linux/v2.1.16/dashing2_savx2 sketch --cmpout ../ab_data/ -k 7 --parse-by-seq -p 8 ../ab_data/ab.fa

ab.fa.xz.gz

If I only sketch, the run completes but does not output any sketch file: dashing2_binaries/linux/v2.1.16/dashing2_savx2 sketch -k 7 --parse-by-seq -p 8 ../ab_data/ab.fa --cache --prefix ../ab_data/

Sorry about the weird file format; xz compresses better, but GitHub won't let me upload xz files.

compilation fails under gcc > 12

Works under gcc-12.3.1_p20230623
First failure under gcc-13.1.1_p20230527 as below:

x86_64-pc-linux-gnu-g++ -IlibBigWig -Ibonsai/include -Ibonsai -Ibonsai/hll -Ibonsai/hll/include -Ibonsai -I. -Isrc -Ifmt/include -O3 -march=native -fopenmp -pipe -DD2_CACHE_SIZE=4194304 -std=c++20 -Wall -Wextra -Wno-unused-function -Wno-char-subscripts -pedantic -Wno-array-bounds src/emitrect.cpp -c -o src/emitrect.o -DNOCURL -DDASHING2_VERSION="v2.1.16-1-gbf82" -DFMT_HEADER_ONLY -DNDEBUG -flto -O3
src/emitrect.cpp: In function ‘void dashing2::print_tabs(size_t, std::back_insert_iterator<fmt::v8::basic_memory_buffer >&)’:
src/emitrect.cpp:51:28: error: call of overloaded ‘format_to(std::back_insert_iterator<fmt::v8::basic_memory_buffer >&, const std::string_view&)’ is ambiguous
51 | for(;n > 256; format_to(biof, tabstr), n -= 256);
| ~~~~~~~~~^~~~~~~~~~~~~~
In file included from fmt/include/fmt/format.h:44,
from src/emitrect.cpp:2:
fmt/include/fmt/core.h:3073:17: note: candidate: ‘OutputIt fmt::v8::format_to(OutputIt, format_string<T ...>, T&& ...) [with OutputIt = std::back_insert_iterator<basic_memory_buffer >; T = {}; typename std::enable_if<detail::is_output_iterator<OutputIt, char>::value, int>::type = 0; format_string<T ...> = basic_format_string]’
3073 | FMT_INLINE auto format_to(OutputIt out, format_string<T...> fmt, T&&... args)
| ^~~~~~~~~
In file included from /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/bits/chrono_io.h:39,
from /usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/chrono:3330,
from bonsai/include/bonsai/util.h:4,
from bonsai/include/bonsai/kmerutil.h:7,
from bonsai/include/bonsai/entropy.h:3,
from bonsai/include/bonsai/encoder.h:10,
from src/d2.h:11,
from src/cmp_main.h:3,
from src/emitrect.cpp:1:
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/format:3745:5: note: candidate: ‘_Out std::format_to(_Out, format_string<_Args ...>, _Args&& ...) [with _Out = back_insert_iterator<fmt::v8::basic_memory_buffer >; _Args = {}; format_string<_Args ...> = basic_format_string]’
3745 | format_to(_Out __out, format_string<_Args...> __fmt, _Args&&... __args)
| ^~~~~~~~~
src/emitrect.cpp:53:14: error: call of overloaded ‘format_to(std::back_insert_iterator<fmt::v8::basic_memory_buffer >&, const char [3], const std::basic_string_view&)’ is ambiguous
53 | format_to(biof, "{}", substr);
| ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
fmt/include/fmt/core.h:3073:17: note: candidate: ‘OutputIt fmt::v8::format_to(OutputIt, format_string<T ...>, T&& ...) [with OutputIt = std::back_insert_iterator<basic_memory_buffer >; T = {const std::basic_string_view<char, std::char_traits >&}; typename std::enable_if<detail::is_output_iterator<OutputIt, char>::value, int>::type = 0; format_string<T ...> = basic_format_string<char, const std::basic_string_view<char, std::char_traits >&>]’
3073 | FMT_INLINE auto format_to(OutputIt out, format_string<T...> fmt, T&&... args)
| ^~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/format:3745:5: note: candidate: ‘_Out std::format_to(_Out, format_string<_Args ...>, _Args&& ...) [with _Out = back_insert_iterator<fmt::v8::basic_memory_buffer >; _Args = {const basic_string_view<char, char_traits >&}; format_string<_Args ...> = basic_format_string<char, const basic_string_view<char, char_traits >&>]’
3745 | format_to(_Out __out, format_string<_Args...> __fmt, _Args&&... __args)
| ^~~~~~~~~
src/emitrect.cpp: In function ‘void dashing2::emit_rectangular(const Dashing2DistOptions&, const SketchingResult&)’:
src/emitrect.cpp:351:30: error: call of overloaded ‘format_to(std::back_insert_iterator<fmt::v8::basic_memory_buffer >&, const char [3], std::string&)’ is ambiguous
351 | format_to(biof, "{}", fn);
| ~~~~~~~~~^~~~~~~~~~~~~~~~
fmt/include/fmt/core.h:3073:17: note: candidate: ‘OutputIt fmt::v8::format_to(OutputIt, format_string<T ...>, T&& ...) [with OutputIt = std::back_insert_iterator<basic_memory_buffer >; T = {std::__cxx11::basic_string<char, std::char_traits, std::allocator >&}; typename std::enable_if<detail::is_output_iterator<OutputIt, char>::value, int>::type = 0; format_string<T ...> = basic_format_string<char, std::__cxx11::basic_string<char, std::char_traits, std::allocator >&>]’
3073 | FMT_INLINE auto format_to(OutputIt out, format_string<T...> fmt, T&&... args)
| ^~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/13/include/g++-v13/format:3745:5: note: candidate: ‘_Out std::format_to(_Out, format_string<_Args ...>, _Args&& ...) [with _Out = back_insert_iterator<fmt::v8::basic_memory_buffer >; _Args = {__cxx11::basic_string<char, char_traits, allocator >&}; format_string<_Args ...> = basic_format_string<char, __cxx11::basic_string<char, char_traits, allocator >&>]’
3745 | format_to(_Out __out, format_string<_Args...> __fmt, _Args&&... __args)
| ^~~~~~~~~
make: *** [Makefile:117: src/emitrect.o] Error 1

Expected memory usage

What's the expected memory usage per genome for Dashing2? I'm trying to run it on 500,000 viral isolates, and am running out of memory even with 500GB
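A back-of-the-envelope calculation helps separate the cost of the sketches from the cost of an all-pairs result. The Python sketch below is only arithmetic under stated assumptions: 4-byte values in the distance matrix, 8-byte sketch registers, and a sketch size of 1024. None of these are confirmed for this run, and whether Dashing2 keeps the whole matrix in memory is not stated in this issue.

```python
def pairwise_matrix_bytes(n: int, bytes_per_value: int = 4, full_square: bool = False) -> int:
    """Bytes needed to hold all pairwise values for n items.

    By default only the strict upper triangle, n * (n - 1) / 2 values, is counted.
    """
    pairs = n * n if full_square else n * (n - 1) // 2
    return pairs * bytes_per_value


def sketch_bytes(n: int, sketch_size: int, bytes_per_register: int = 8) -> int:
    """Bytes needed to hold n sketches of sketch_size registers each."""
    return n * sketch_size * bytes_per_register


n = 500_000
triangle = pairwise_matrix_bytes(n)                 # about 0.5 TB
square = pairwise_matrix_bytes(n, full_square=True)  # 1 TB
sketches = sketch_bytes(n, 1024)                     # about 4.1 GB
```

Under these assumptions the sketches themselves are small, while a dense all-pairs result for 500,000 inputs is on the order of 500 GB even as a triangle.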

--outprefix option and cmp subcommand are invalid

Hi Daniel,
I ran dashing2 with "./dashing2_savx2 sketch -F bacteria.list -S 1024 --threads 48 -o bacteria.sketch" to get the sketches, and bacteria.sketch and bacteria.sketch.name.txt were generated.
The cached sketch files are saved next to the input files. I tried to specify a directory for the cached files with "--outprefix" or "--prefix", but it does not work, which clutters the directory holding the original input genomes.
Even without "--cache", the cached files are written to the input genome directory. How can I disable the cached files?

In addition, I want to use the cmp or dist subcommand to compute all-vs-all pairwise distances from bacteria.sketch, but "./dashing2_savx2 dist --help" prints no help information, and I do not know how to use the subcommand.

Best,
Xiaoming

support for proteomes, not just nucleotide genomes?

Hello Dashing2 developer,

Dashing can be used for genome dereplication in nucleotide format, but I am wondering whether it could also work for amino acid sequences (e.g. all genes of a genome in amino acid format). This is widely used by microbiologists.

Thanks,

Jianshu

aminoacid distance to AAI?

Hello Daniel,

For the nucleotide Jaccard index estimated by MinHash (e.g. ProbMinHash), we can follow the Mash paper and apply a log transformation (D = -1/k * ln(2J/(1+J))) to approximate ANI. What if it is the Jaccard index of amino acid/protein sequences? Should we make some adjustment to approximate AAI (average amino acid identity)?

Thanks,

Jianshu
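For context, the transformation in the Mash paper follows from a simple Poisson mutation model, sketched below in LaTeX. Here $n$ is the (average) number of k-mers per genome, $w$ the number of shared k-mers, and $d$ the per-site mutation rate. The derivation itself never refers to the alphabet; whether a uniform per-site substitution model is adequate for amino acid sequences is the open part of the question, and this sketch does not settle it.

```latex
% Probability that a k-mer contains no mutation under a Poisson model:
\Pr[\text{k-mer unmutated}] = e^{-kd}
% Expected fraction of shared k-mers:
\frac{w}{n} = e^{-kd}
% Jaccard index in terms of w and n, solved for w/n:
J = \frac{w}{2n - w}
\quad\Longrightarrow\quad
\frac{w}{n} = \frac{2J}{1 + J}
% Solving for d gives the Mash distance:
D = -\frac{1}{k} \ln\!\left(\frac{2J}{1 + J}\right)
```

ANI is then approximated as $1 - D$ in the nucleotide case.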

matrix wrong format (?)

Hi Dan,

While updating my version of dashing I saw this one. I successfully compiled it with g++-11. However, the matrix is weird, twisted, hard to tell what goes with what:

% ../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.dist -F test.list
[Dashing2] Invocation: /Users/gmh/TestGenomes/../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.dist -F test.list Dashing2 made with k = 21, w = -1, DNA target, space = SetSpace, datatype = Fastx and result = FullSetSketch
Nothing written to disk, as no output file provided.
Emitting human-readable: HumanReadable
%
% cat test.dist 
#Dashing2 PHYLIP pairwise Output
#Dashing2Options: Dashing2Options;k:21;parsebyfile;trimchr;sketchsize:16384;sketchtype:fullsetsketch;Fastx;canon;command:"/Users/gmh/TestGenomes/../dashing2/bin/dashing2 sketch -k 21 -S 16384 --cmpout test.dist -F test.list"
#Sources	FNA/GCF_000005845.fna.gz	FNA/GCF_000006925.fna.gz	FNA/GCF_000009885.fna.gz
3
FNA/GCF_000005845.fna.gz	0.484191895	0.0108032227
FNA/GCF_000006925.fna.gz	0.0113525391
FNA/GCF_000009885.fna.gz

Self-versus-self comparisons should be 1 or - (if this is Jaccard), but no such entries appear; instead, the self-versus-self position seems to hold a result for something else. For example, GCF_000005845 with GCF_000005845 has a result of 0.484191895.

With dashing these are correctly positioned:

% dashing cmp -k 21 -S 20 -O test.dist -F test.list
Dashing version: v1.draft-3-g90f0
#Path	Size (est.)
FNA/GCF_000009885.fna.gz	5394357
FNA/GCF_000006925.fna.gz	4370118
FNA/GCF_000005845.fna.gz	4544097
%
% cat test.dist 
##Names	FNA/GCF_000009885.fna.gz	FNA/GCF_000006925.fna.gz	FNA/GCF_000005845.fna.gz
FNA/GCF_000009885.fna.gz	-	0.010112	0.00974668
FNA/GCF_000006925.fna.gz	-	-	0.479668
FNA/GCF_000005845.fna.gz	-	-	-

There's also an inf result (maybe should be 1) with --mash when genomes are very ... different? but for that the matrix is larger and harder to show here.

I know it's still under development, so I'll be patient. I just wanted you to know these issues before it becomes too complicated.
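One possible reading of the Dashing2 output above is that each row lists only the values against the entries that follow it, with no diagonal: the first row has two values, the second has one, and the last has none. Under that reading, 0.484191895 belongs to the pair GCF_000005845/GCF_000006925, which is close to the 0.479668 that dashing reports for the same pair. The Python sketch below parses the file under that assumption; the layout is inferred only from the sample shown, and `parse_upper_triangle` is a hypothetical helper.

```python
def parse_upper_triangle(text: str):
    """Parse a strict upper-triangular, tab-separated pairwise file.

    Assumed layout: '#' comment lines, then a line with the count n,
    then n rows of 'name<TAB>v(i,i+1)<TAB>...<TAB>v(i,n-1)'.
    Returns (names, {(name_i, name_j): value}) for i < j.
    """
    lines = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("#")]
    n = int(lines[0])
    rows = [ln.rstrip().split("\t") for ln in lines[1:1 + n]]
    names = [row[0] for row in rows]
    values = {}
    for i, row in enumerate(rows):
        entries = [float(v) for v in row[1:]]
        if len(entries) != n - i - 1:
            raise ValueError(f"row {i} has {len(entries)} values, expected {n - i - 1}")
        for offset, value in enumerate(entries):
            # The first value in row i pairs with row i + 1, not with row i itself.
            values[(names[i], names[i + 1 + offset])] = value
    return names, values
```

If this reading is right, the matrix is not twisted; it simply omits the diagonal and shifts each row's first value to the next entry.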

Specifying `--verbose` tends to give segmentation fault

$ dashing2 sketch --verbose
#Calling Dashing2 version v2.1.11-1-g3b71 with command '/home/blangme2/scr16_blangme2/langmead/dashing2/dashing2-64 /home/blangme2/scr16_blangme2/langmead/dashing2/dashing2-64 sketch --verbose'
Segmentation fault (core dumped)
$ dashing2 cmp --verbose
#Calling Dashing2 version v2.1.11-1-g3b71 with command '/home/blangme2/scr16_blangme2/langmead/dashing2/dashing2-64 /home/blangme2/scr16_blangme2/langmead/dashing2/dashing2-64 cmp --verbose'
Segmentation fault (core dumped)

dashing2 sketch --parse-by-seq doesn't work with --save-kmers

Hi,

To use the dashing2 containment command, I tried to use the sketch options "--parse-by-seq" and "--save-kmers" together, but this doesn't work.
Whenever I use "--parse-by-seq", the ".kmer64" file is not produced.
Kind Regards !

large Metagenome comparison bad_alloc

Hello Daniel,

I am comparing 100 metagenomes with a total size of about 1 TB. I allocated 2 TB of memory, but I always get the following error after a few hours:

#Calling Dashing2 version v2.1.9 with command '/scratch/jianshu/interleaved_GWMC2/dashing2 dashing2 sketch --threads 24 --pminhash -k 21 -S 12000 AUBR2B_rmdup_trim_filter.interleaved.fa AUEP2AB_rmdup_trim_filter.interleaved.fa AUEP2AC_rmdup_trim_filter.interleaved.fa AUEP2BC_rmdup_trim_filter.interleaved.fa BRBH3C_rmdup_trim_filter.interleaved.fa CLVPT2_rmdup_trim_filter.interleaved.fa CNCD2C_rmdup_trim_filter.interleaved.fa CNCD4C_rmdup_trim_filter.interleaved.fa CNDL1AC_rmdup_trim_filter.interleaved.fa CNDL1BC_rmdup_trim_filter.interleaved.fa CNJN2C_rmdup_trim_filter.interleaved.fa CNJN4C_rmdup_trim_filter.interleaved.fa CNSH1C_rmdup_trim_filter.interleaved.fa CNSH2C_rmdup_trim_filter.interleaved.fa CNSH3C_rmdup_trim_filter.interleaved.fa CNSH4C_rmdup_trim_filter.interleaved.fa CNSH5C_rmdup_trim_filter.interleaved.fa CNSY3A_rmdup_trim_filter.interleaved.fa CNSY3B_rmdup_trim_filter.interleaved.fa CNSY3C_rmdup_trim_filter.interleaved.fa CNSZ1C_rmdup_trim_filter.interleaved.fa CNSZ2C_rmdup_trim_filter.interleaved.fa CNSZ3AB_rmdup_trim_filter.interleaved.fa CNSZ3AC_rmdup_trim_filter.interleaved.fa CNSZ4AC_rmdup_trim_filter.interleaved.fa CNWH1C_rmdup_trim_filter.interleaved.fa CNWH2C_rmdup_trim_filter.interleaved.fa CNWH4C_rmdup_trim_filter.interleaved.fa CNWX1AC_rmdup_trim_filter.interleaved.fa CNWX2C_rmdup_trim_filter.interleaved.fa CNWX3AC_rmdup_trim_filter.interleaved.fa CNWX3BC_rmdup_trim_filter.interleaved.fa CNWX4C_rmdup_trim_filter.interleaved.fa CNXA2C_rmdup_trim_filter.interleaved.fa CNXA4C_rmdup_trim_filter.interleaved.fa CNXM1C_rmdup_trim_filter.interleaved.fa CNXM3C_rmdup_trim_filter.interleaved.fa DEKS1B_rmdup_trim_filter.interleaved.fa ITLF2B_rmdup_trim_filter.interleaved.fa SAKB5_rmdup_trim_filter.interleaved.fa SEGL1C_rmdup_trim_filter.interleaved.fa SESD1C_rmdup_trim_filter.interleaved.fa TWKS2C_rmdup_trim_filter.interleaved.fa TWTN2C_rmdup_trim_filter.interleaved.fa USAG1C_rmdup_trim_filter.interleaved.fa USAG2C_rmdup_trim_filter.interleaved.fa 
USAK1D2_rmdup_trim_filter.interleaved.fa USAK1D3_rmdup_trim_filter.interleaved.fa USAT02C_rmdup_trim_filter.interleaved.fa USAT04C_rmdup_trim_filter.interleaved.fa USBT2C_rmdup_trim_filter.interleaved.fa USBT5C_rmdup_trim_filter.interleaved.fa USCB1AC_rmdup_trim_filter.interleaved.fa USCB1CC_rmdup_trim_filter.interleaved.fa USCG2C_rmdup_trim_filter.interleaved.fa USCG3C_rmdup_trim_filter.interleaved.fa USCG4AC_rmdup_trim_filter.interleaved.fa USCG4BC_rmdup_trim_filter.interleaved.fa USDC2C_rmdup_trim_filter.interleaved.fa USFT4C_rmdup_trim_filter.interleaved.fa USFT5C_rmdup_trim_filter.interleaved.fa USHS3A_rmdup_trim_filter.interleaved.fa USKN1AB_rmdup_trim_filter.interleaved.fa USMD1C_rmdup_trim_filter.interleaved.fa USMD2C_rmdup_trim_filter.interleaved.fa USMD3C_rmdup_trim_filter.interleaved.fa USMD4C_rmdup_trim_filter.interleaved.fa USMI1C_rmdup_trim_filter.interleaved.fa USMI4C_rmdup_trim_filter.interleaved.fa USNO2D13_rmdup_trim_filter.interleaved.fa USOK01C1_rmdup_trim_filter.interleaved.fa USOK03A1_rmdup_trim_filter.interleaved.fa USOK06C_rmdup_trim_filter.interleaved.fa USOP1AC_rmdup_trim_filter.interleaved.fa USOP1BC_rmdup_trim_filter.interleaved.fa USPT1C_rmdup_trim_filter.interleaved.fa USPT3C_rmdup_trim_filter.interleaved.fa USRE6C_rmdup_trim_filter.interleaved.fa USSD2BC_rmdup_trim_filter.interleaved.fa USTE3B_rmdup_trim_filter.interleaved.fa USTF1AC_rmdup_trim_filter.interleaved.fa USTF1BC_rmdup_trim_filter.interleaved.fa USVA2C_rmdup_trim_filter.interleaved.fa USVA3C_rmdup_trim_filter.interleaved.fa USVA5C_rmdup_trim_filter.interleaved.fa USVA6C_rmdup_trim_filter.interleaved.fa USVA8C_rmdup_trim_filter.interleaved.fa USVA9C_rmdup_trim_filter.interleaved.fa USVD1AC_rmdup_trim_filter.interleaved.fa USVD1BC_rmdup_trim_filter.interleaved.fa USVD1CC_rmdup_trim_filter.interleaved.fa USVD1DC_rmdup_trim_filter.interleaved.fa USWR1BC_rmdup_trim_filter.interleaved.fa USWR2BC_rmdup_trim_filter.interleaved.fa UYUC06_rmdup_trim_filter.interleaved.fa --cmpout 
../GWMC_pminhash.txt'
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc

It seems that a large sketch size for metagenomes takes a lot of memory. Do you have any suggestions?

Many Thanks,

Jianshu

dashing2 for metagenome

Hello Daniel,

I am comparing dashing2 with BinDash and Mash on metagenomes. I am aware that Mash uses canonical k-mers. Metagenomic reads are almost always paired-end, and paired-end reads can be merged into a single read by overlap detection. Since canonical k-mers are strand-independent, merging first (which is very fast) should let us process roughly half as many reads and cut computation time about in half without changing the results. I have not seen a suggestion from Mash or Dashing to merge first. What do you think?

Thanks,

Jianshu
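The property this question relies on, that canonical k-mers do not depend on strand, can be checked with a few lines of Python. The sketch below takes the canonical form of a k-mer to be the lexicographically smaller of the k-mer and its reverse complement, which is one common convention and is assumed here rather than taken from Mash or Dashing2. It illustrates strand invariance only; it does not show what merging overlapping read pairs does to a sketch.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence over A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(seq: str, k: int) -> set:
    """Set of canonical k-mers: the smaller of each k-mer and its reverse complement."""
    result = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        result.add(min(kmer, reverse_complement(kmer)))
    return result


read = "ACGTTGCATGGA"
# A read and its reverse complement yield the same canonical k-mer set.
same = canonical_kmers(read, 5) == canonical_kmers(reverse_complement(read), 5)
```

Because the forward and reverse reads of a pair come from opposite strands, this is the reason strand orientation alone does not change a canonical k-mer set.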

Segmentation fault error

Hi,
I use dashing2 on two different files:

  1. Genome file: I used the basic dashing2 sketch command, which took nearly 2 days to process.

$repo/dashing2 sketch --parse-by-seq --cmpout $outfile $genome

Can you suggest something for improving runtime efficiency?

  2. Protein file: This input file was converted from the genome file in (1). After processing for 96 hours, I get this error:

[Image: screenshot of the error]

Can you please advise regarding this error?

Silicon/M1/M2/arm64?

Is there any chance that dashing and dashing2 will be compiled for Apple's "new" processors (arm64/M1/M2)?
Apparently the NEON/SIMDe libraries take care of threads for all kinds of systems.
