
dashing's Introduction

Hi, I'm Daniel 👋

Software Engineer at Roche. Previously, I was a Senior Scientist at Pacific Biosciences (PacBio) after earning my PhD at Johns Hopkins University in the department of Computer Science. Before that I was a Bioinformatics Scientist at ARUP Laboratories, where I worked on cell-free circulating tumor DNA (ctDNA) analysis and clinical genomics after my training in Physics [BS] and Biophysics/Computational Biology [MS]. I've worked with biological data (sequence, molecular modeling, metabolomics, transcriptomics, metagenomics), telecommunications data, as well as graph algorithms, machine learning, and numerical optimization.

🔭 I've worked on similarity search, clustering, and indexing for large-scale biological data, as well as SIMD/GPU-accelerated and randomized algorithms. Most recently, I've been developing methods for human genetics, including long RNA-seq, VNTRs, and haplotype phasing.

😄 Pronouns: He/Him/His

A quick tour of my interests

  1. Practical randomized algorithms

This ranges from libraries providing sketch data structures and coresets to projects using random projections and DCI.

My work on coresets and clustering is primarily part of the minicore project, with the aims of providing a standard utility for coreset construction and weighted clustering, especially for exponential family models and shortest-paths metrics.

  2. Computational Biology

The bonsai project provides methods for metagenomic analysis, along with k-mer encoding/decoding and I/O, while Dashing performs scalable sketching and comparison of sequence data.

BMFtools performs molecular demultiplication over sequencing barcoded data, reducing error rates while eliminating redundant information. Designed for ctDNA, this method can reduce error rates by orders of magnitude, allowing confident detection of very rare events.

scavenger has Rust implementations, using tch-rs, of VAEs for count-based data, applied to single-cell transcriptomics.

I also co-developed pbfusion, a fast tool for characterizing transcriptional abnormalities.

  3. General C++

Most of my projects fall into this category, serving as tools I can reuse in various projects.

Some of my favorites:

  • vec provides type-generic abstractions over x86-64 vectorization, making it easy to write fast, portable code.
  • kspp is an RAII-based variant of kstring from klib with extra niceties making appending printf-style formatting easy.
  • aesctr provides STL-style random number generators built on fast aes-ctr and wyhash
  • circularqueue provides a range-based circular queue container that uses power-of-two sizes

dashing's People

Contributors

benlangmead, dnbaker, olgabot


dashing's Issues

Standard Mash-like DB lookup

Hi, thanks for this fast alternative!
I'd like to test dashing as a replacement for Mash in an inhouse-pipeline and maybe to replace Kraken in https://github.com/oschwengers/asap.

I'm a little bit puzzled by the different subcommands and parameters. What is the best way to:

  1. create a sketch database from a set of fasta files (bacterial genomes, 1 genome per file)
  2. compute and get a list of the closest genomes from this db against a query genome (preferably fasta or pre-sketched) fulfilling a certain dist threshold.

Could you comment on this and maybe provide a short list of commands?
Thanks a lot and best regards!
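Once pairwise values exist, picking the closest references under a threshold is a small post-processing step. The following is a minimal Python sketch, not a Dashing feature. It assumes the whitespace-separated, upper-triangular table with a "##Names" header that `dashing dist` prints in another issue on this page. The function names are hypothetical, and whether the values are distances or similarities depends on the options used, so the filter is passed in as a predicate.

```python
def parse_dist_table(text):
    """Return [(name_a, name_b, value)] from an upper-triangular table.

    Assumed layout: a '##Names' header line, then one row per input
    whose cells are '-' (not computed) or a number.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    start = next(i for i, ln in enumerate(lines) if ln.startswith("##Names"))
    names = lines[start].split()[1:]
    pairs = []
    for row in lines[start + 1:]:
        cells = row.split()
        row_name, values = cells[0], cells[1:]
        for col_name, cell in zip(names, values):
            if cell != "-":
                pairs.append((row_name, col_name, float(cell)))
    return pairs


def hits_for(query, pairs, keep):
    """Partners of `query` whose value satisfies `keep`, sorted ascending by value."""
    hits = [(b if a == query else a, v) for a, b, v in pairs
            if query in (a, b) and keep(v)]
    return sorted(hits, key=lambda h: h[1])
```

For example, `hits_for("query.fna", pairs, lambda v: v <= 0.05)` would list every genome within 0.05 of `query.fna`, if the table holds distances.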

compile error: ./bonsai/bonsai/include/util.h:32:25: fatal error: lazy/vector.h: No such file or directory

When I run "make update dashing", it reports errors:

cc -Ibonsai/clhash/include -I. -Ibonsai/zlib -Ibonsai/libpopcnt -Iinclude -Ibonsai/circularqueue -Ibonsai/zstd/zlibWrapper -Ibonsai/zstd/lib/common -Ibonsai/zstd/lib -Ibonsai/hll -Ibonsai/hll/vec -Ibonsai -Ibonsai/bonsai/include/ -L. -Lbonsai/zlib -DNDEBUG -c bonsai/zstd/zlibWrapper/gzclose.c -o bonsai/zstd/zlibWrapper/gzclose.o -lz
cc -Ibonsai/clhash/include -I. -Ibonsai/zlib -Ibonsai/libpopcnt -Iinclude -Ibonsai/circularqueue -Ibonsai/zstd/zlibWrapper -Ibonsai/zstd/lib/common -Ibonsai/zstd/lib -Ibonsai/hll -Ibonsai/hll/vec -Ibonsai -Ibonsai/bonsai/include/ -L. -Lbonsai/zlib -DNDEBUG -c bonsai/zstd/zlibWrapper/gzlib.c -o bonsai/zstd/zlibWrapper/gzlib.o -lz
cc -Ibonsai/clhash/include -I. -Ibonsai/zlib -Ibonsai/libpopcnt -Iinclude -Ibonsai/circularqueue -Ibonsai/zstd/zlibWrapper -Ibonsai/zstd/lib/common -Ibonsai/zstd/lib -Ibonsai/hll -Ibonsai/hll/vec -Ibonsai -Ibonsai/bonsai/include/ -L. -Lbonsai/zlib -DNDEBUG -c bonsai/zstd/zlibWrapper/zstd_zlibwrapper.c -o bonsai/zstd/zlibWrapper/zstd_zlibwrapper.o -lz
cc -Ibonsai/clhash/include -I. -Ibonsai/zlib -Ibonsai/libpopcnt -Iinclude -Ibonsai/circularqueue -Ibonsai/zstd/zlibWrapper -Ibonsai/zstd/lib/common -Ibonsai/zstd/lib -Ibonsai/hll -Ibonsai/hll/vec -Ibonsai -Ibonsai/bonsai/include/ -L. -Lbonsai/zlib -DNDEBUG -c bonsai/zstd/zlibWrapper/gzread.c -o bonsai/zstd/zlibWrapper/gzread.o -lz
cc -Ibonsai/clhash/include -I. -Ibonsai/zlib -Ibonsai/libpopcnt -Iinclude -Ibonsai/circularqueue -Ibonsai/zstd/zlibWrapper -Ibonsai/zstd/lib/common -Ibonsai/zstd/lib -Ibonsai/hll -Ibonsai/hll/vec -Ibonsai -Ibonsai/bonsai/include/ -L. -Lbonsai/zlib -DNDEBUG -c bonsai/zstd/zlibWrapper/gzwrite.c -o bonsai/zstd/zlibWrapper/gzwrite.o -lz
g++ -O3 -funroll-loops -pipe -fno-strict-aliasing -march=native -mpclmul -DUSE_PDQSORT -DNOT_THREADSAFE -DENABLE_COMPUTED_GOTO -fopenmp -fno-rtti -std=c++14 -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -Wunused-variable -Wno-attributes -Wno-pedantic -Ibonsai/clhash/include -I. -Ibonsai/zlib -Ibonsai/libpopcnt -Iinclude -Ibonsai/circularqueue -Ibonsai/zstd/zlibWrapper -Ibonsai/zstd/lib/common -Ibonsai/zstd/lib -Ibonsai/hll -Ibonsai/hll/vec -Ibonsai -Ibonsai/bonsai/include/ -L. -Lbonsai/zlib bonsai/zstd/zlibWrapper/gzclose.o bonsai/zstd/zlibWrapper/gzlib.o bonsai/zstd/zlibWrapper/zstd_zlibwrapper.o bonsai/zstd/zlibWrapper/gzread.o bonsai/zstd/zlibWrapper/gzwrite.o libzstd.a bonsai/bonsai/clhash.o bonsai/klib/kthread.o -DNDEBUG src/dashing.cpp -o dashing -DZWRAP_USE_ZSTD=1 -lzstd -lz
In file included from src/dashing.cpp:4:0:
./bonsai/bonsai/include/util.h:32:25: fatal error: lazy/vector.h: No such file or directory
compilation terminated.
Makefile:103: recipe for target 'dashing' failed
make: *** [dashing] Error 1

Updated releases

Hi,

On my machine, release/linux/dashing_s512 fails but dashing_s256 works, which I imagine is in line with my particular CPU.

However,

$ ./dashing_s256
Dashing version: v0.4.7-2-g6ad3
...

Which is a few versions behind, and also the bioconda package is out of date - any chance of an update on both fronts?
Thanks.

unique exact matches hll

Hi,...

I have a question about the unique exact matches. Can I use ./dashing hll to count all matches rather than only the unique exact ones? For my sensitivity analysis I need the total number of matches; the count of unique exact matches alone is not really useful in my experiment.

Thanks!

Cheers
Ahmad

Output file name when empty sketch error

Hi!

A small request to help with debugging.

When a file does not exist when using -F/-Q Dashing outputs the following error:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Could not densify empty sketch

It would be very useful to know which file caused the error. I'm currently facing a situation of finding the bad file among thousands of files.

Edit: It turned out that, in this case, assuming the whole directory consisted of FASTA files and running find "$FASTA_DIR_PATH" -type f > $FASTA_PATHS was not a good idea together with --cache-sketches: both FASTA files and sketches ended up in the -F input file, and Dashing tried to create a sketch from a sketch file, which produced an empty sketch.

All the best,
Mihkel
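One way to avoid feeding cached sketches back in as inputs is to build the path list from an explicit set of sequence-file suffixes instead of every file in the directory. This is a minimal Python sketch. The suffix list and function names are assumptions to adapt, and it only prepares the list file; it does not change how Dashing reports errors.

```python
from pathlib import Path

# Assumption: these suffixes identify sequence files in your directory.
FASTA_SUFFIXES = (".fa", ".fna", ".fasta", ".fa.gz", ".fna.gz", ".fasta.gz")


def list_fastas(root):
    """Return sorted, non-empty FASTA paths under `root`, skipping everything else."""
    paths = []
    for path in Path(root).rglob("*"):
        if (path.is_file() and path.name.endswith(FASTA_SUFFIXES)
                and path.stat().st_size > 0):
            paths.append(str(path))
    return sorted(paths)


def write_path_list(root, out_file):
    """Write one path per line, the same layout `find ... -type f > file` produces."""
    paths = list_fastas(root)
    Path(out_file).write_text("\n".join(paths) + "\n")
    return paths
```

Skipping zero-byte files also catches one obvious source of empty sketches before the run starts.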

K>12 results in too many distances being 0

Hi
I have been using 'dashing' since November and am quite impressed.
I mainly calculate full symmetric distance matrices for large datasets downloaded from genbank, i.e. the 'plant' or 'fungi' clades.
I am using a k-mer length of k=12 or k=13 to get reasonable distance matrices: whenever I increase k beyond 15, the number of distance values equal to 0 becomes too large. Likewise, when k gets much smaller, the number of distance values equal to 1 becomes too large.
Running with k=31 is entirely out of the question.
So I am asking myself, is that to be expected or is this maybe due to some error I made during setup? I am using precompiled binaries.
I'd be glad for a hint. Thanks and
Cheers
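One effect that is independent of any setup error is k-mer saturation: for small k, almost every possible k-mer occurs in a large genome by chance, so any two genomes appear to share nearly everything. The Python sketch below uses the standard random-sequence model, in which a fixed k-mer misses each of n positions with probability 1 - 4^-k. The 500 Mbp genome length is an illustrative assumption, not a figure from this issue. If the reported values are similarities rather than distances (the issue does not say which options were used), this would fit the pattern of values near 1 at small k.

```python
import math


def chance_kmer_hit(k, genome_length, alphabet=4):
    """Probability that a fixed k-mer occurs at least once in a random sequence.

    Model: `genome_length` independent, uniformly random k-mers.
    """
    p_single = alphabet ** -k
    # log1p/expm1 keep precision when p_single is tiny.
    return -math.expm1(genome_length * math.log1p(-p_single))


# Illustrative genome length (an assumption): 500 Mbp.
for k in (12, 15, 21, 31):
    print(k, chance_kmer_hit(k, 500_000_000))
```

The model says nothing about real sequence divergence, which is what drives shared k-mers toward zero at large k between distantly related genomes.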

Distances > 0.05 but < 1 are unreliable?

Hi again,

I've been using dashing as a prefilter for genome dereplication, since it is much faster than FastANI. I'd previously been using mash for this. I've noticed that some genomes are given distances that are between 0.05 and 0.10, but seem to be spurious. For instance, here's mash distance vs. dashing distance calculated with -M:

[Image: mash distance vs. dashing distance (-M)]

I tested 10 randomly chosen genomes from that top stripe where mash=1 and dashing<1, and none seemed closely related genomes, so it doesn't seem that dashing is simply producing better estimates. The issue does seem to be reasonably widespread at least in this dataset - dashing predicts 49% of genome pairs < 1, where mash predicts 4%.

Is this a known issue? Am I not using dashing correctly? Is there some way I can detect these cases?

Thanks, ben.

No rule to make target `zlib/libz.a`

Dear Daniel,
I have some problems installing dashing. I have listed the errors below. Could you give me some suggestions?
Thanks.

make dashing

cd bonsai && make zlib/libz.a && cd ..
make[1]: Entering directory `/data/6190113/biosoft/dashing_git/bonsai'
make[1]: *** No rule to make target `zlib/libz.a'.  Stop.
make[1]: Leaving directory `/data/6190113/biosoft/dashing_git/bonsai'
make: *** [bonsai/zlib/libz.a] Error 2

I checked the Makefile in bonsai and found no target named `zlib/libz.a`,
so I then typed:

cd bonsai
make all

Another error occurred.

g++ -O3 -funroll-loops -pipe -fno-strict-aliasing -march=native -mpclmul   -fopenmp -fno-rtti -std=c++14 -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -DUSE_PDQSORT -Wunused-variable -Wno-attributes -Wno-cast-align -Wno-gnu-zero-variadic-macro-arguments -Wno-ignored-attributes -Wno-missing-braces -DBONSAI_VERSION=\"v0.2.4-39-ga168\" -DNDEBUG -Iclhash/include -I. -I.. -Ihll/libpopcnt -I.. -Iinclude -Icircularqueue -Izstd/zlibWrapper -Izstd/lib/common -Izstd/lib  -Ihll/vec -Ihll -Ihll/include -Ipdqsort -Iinclude/bonsai -Iinclude -Ihll/vec/blaze -L.  clhash.o klib/kthread.o -DNDEBUG bin/bonsai.cpp -o bin/bonsai -lz
In file included from hll/include/sketch/common.h:62,
                 from hll/include/sketch/hll.h:3,
                 from include/bonsai/./popcnt.h:13,
                 from include/bonsai/./util.h:30,
                 from include/bonsai/encoder.h:3,
                 from include/bonsai/feature_min.h:3,
                 from bin/bonsai.cpp:4:
hll/include/sketch/./exception.h: In static member function ‘static const char* sketch::exception::ZlibError::es(int)’:
hll/include/sketch/./exception.h:67:14: error: ‘z_const’ was not declared in this scope
             (z_const char *)"need dictionary",     /* Z_NEED_DICT       2  */
              ^~~~~~~
hll/include/sketch/./exception.h:67:21: error: expected ‘)’ before ‘char’
             (z_const char *)"need dictionary",     /* Z_NEED_DICT       2  */
             ~       ^~~~~
                     )
At global scope:
cc1plus: warning: unrecognized command line option ‘-Wno-gnu-zero-variadic-macro-arguments’
make: *** [bin/bonsai] Error 1

dist: with a lot of data, binary output "-b" crashes but csv output "-T" does not

When running dashing dist with 400,000 genomes, the program succeeds when asked for csv "-T" output but crashes when asked for binary "-b" output. Both commands succeed without problems when using only 10,000 genomes.

This works:

cat list_of_genomes|wc -l
412656
./dashing_s512 dist -F list_of_genomes  -k 15 -S16 -p 39 -M -T --use-nthash --cache-sketches |pigz > NGOT.dashdist.tsv.gz

Memory usage 330Gb over ca. 3 days on a 1.5TB machine with 40 processors (most of the time is spent writing the matrix to disk)

This crashes after ca. 3 minutes

 ./dashing_s512 dist -F list_of_genomes  -k 15 -S16 -p 39 -M -b -o labelsB --use-nthash -O  NGOT.dashdists.bin

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
[1]    40318 abort (core dumped)  ./dashing_s512 dist -F list_of_genomes -k 15 -S16 -p 39 -M -b -o labelsB  -O

Peak memory 30.6Gb

Update (genome sizes are around 30kb):
300,000 genomes does not seem to crash (8_762_264_277 nucleotides)
310,000 genomes crash (8_958_253_751 nucleotides)
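For scale, the number of pairwise values grows quadratically with the number of genomes. The Python sketch below computes the size of an upper triangle of 4-byte values for the genome counts mentioned here. The 4-byte width is an assumption (it matches the float32 data described in another issue on this page), and the sketch does not identify the cause of the std::bad_alloc.

```python
def condensed_matrix_bytes(n, bytes_per_value=4):
    """Bytes needed for the n*(n-1)/2 pairwise values of an upper triangle."""
    return n * (n - 1) // 2 * bytes_per_value


for n in (10_000, 300_000, 310_000, 412_656):
    print(f"{n:>8} genomes: {condensed_matrix_bytes(n) / 1e9:8.1f} GB")
```

Comparing these figures with the memory available on the machine shows whether an output path that holds the whole matrix in memory could fit at all.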

Processing each sequence in a file separately

First of all, thank you for the great tool!

Being used to Mash, I currently miss the option to tell the tool to process the sequences in a FASTA file separately rather than belonging to one genome. I checked the available flags multiple times but could not find anything related to this functionality.

Thank you in advance!
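Until such an option exists, one workaround is to split the multi-FASTA file into one file per record and pass the resulting paths as separate inputs. This is a minimal Python sketch for uncompressed FASTA. The function name and output naming scheme are hypothetical, and it costs extra disk space, which may matter for large inputs.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, handle = [], None
    with open(fasta_path) as src:
        for line in src:
            if line.startswith(">"):
                if handle:
                    handle.close()
                # Numbered names avoid clashes and unsafe characters in headers.
                path = out_dir / f"record_{len(written):06d}.fa"
                handle = open(path, "w")
                written.append(str(path))
            if handle:
                handle.write(line)
    if handle:
        handle.close()
    return written
```

Writing the returned paths one per line gives a list file in the layout that the -F option takes in other issues on this page.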

`setdist` doesn't output distances

Hello,
When running setdist on these four fastq.gz files, the program calculates the absolute sizes but doesn't show the distances as it did with dashing dist

Here's the dashing dist output:

oot@f5b123e1af02:/data/catted_reads# time dashing dist *.fastq.gz
#Path    Size (est.)
A1-MAA000487-3_10_M-1-1.fastq.gz    25503239.355288
A1-B002764-3_38_F-1-1.fastq.gz    6140027.666877
A1-D042253-3_9_M-1-1.fastq.gz    12316236.863717
A1-MAA000779-3_11_M-1-1.fastq.gz    6934912.683602
##Names     A1-MAA000487-3_10_M-1-1.fastq.gz    A1-B002764-3_38_F-1-1.fastq.gz    A1-D042253-3_9_M-1-1.fastq.gz    A1-MAA000779-3_11_M-1-1.fastq.gz
A1-MAA000487-3_10_M-1-1.fastq.gz    -    0.049051    0.074235    0.000000
A1-B002764-3_38_F-1-1.fastq.gz    -    -    0.039803    0.055432
A1-D042253-3_9_M-1-1.fastq.gz    -    -    -    0.014889
A1-MAA000779-3_11_M-1-1.fastq.gz    -    -    -    -

real    0m13.896s
user    0m13.630s
sys    0m0.220s

Here are the reads:

$ aws s3 ls s3://olgabot-maca/dashing-test/
2019-01-23 09:24:57   73938115 A1-B002764-3_38_F-1-1.fastq.gz
2019-01-23 09:24:57   52176348 A1-D042253-3_9_M-1-1.fastq.gz
2019-01-23 09:24:57  168623288 A1-MAA000487-3_10_M-1-1.fastq.gz
2019-01-23 09:24:57   51031125 A1-MAA000779-3_11_M-1-1.fastq.gz

And here's the dashing setdist output:

$ /home/olga/code/dashing/dashing setdist *.fastq.gz
#Path	Size (est.)
A1-B002764-3_38_F-1-1.fastq.gz	6122553
A1-D042253-3_9_M-1-1.fastq.gz	12569210
A1-MAA000487-3_10_M-1-1.fastq.gz	26727514
A1-MAA000779-3_11_M-1-1.fastq.gz	7587041
##Names 	A1-B002764-3_38_F-1-1.fastq.gz	A1-D042253-3_9_M-1-1.fastq.gz	A1-MAA000487-3_10_M-1-1.fastq.gz	A1-MAA000779-3_11_M-1-1.fastq.gz

Do you know what may be happening?
Thank you!
Warmest,
Olga

Typo in output with -U?

dashing dist -k16 -p7 *.gz -U -O myout
produces

3
genome1.fna.gz t0 t0.01022848
genome2.fna.gz t0.0408798
genome3.fna.gz

notice the "t"-s. A missing backslash for tab perhaps?
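Until the output is fixed, files already written this way can still be read by stripping the stray "t" from each value. This is a minimal Python sketch. It assumes the layout shown above (a count line, then one row per genome holding the name followed by the values against the later genomes), and the function name is hypothetical.

```python
def parse_triangular_output(text):
    """Return (names, rows) from the count-prefixed triangular output.

    rows[i] holds the values of genome i against genomes i+1, i+2, ...
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    names, rows = [], []
    for ln in lines[1:1 + n]:
        tokens = ln.split()
        names.append(tokens[0])
        # Drop the stray leading 't' that stands where a tab was intended.
        rows.append([float(tok.lstrip("t")) for tok in tokens[1:]])
    return names, rows
```

Only value tokens are stripped, so genome names that happen to start with "t" are left intact.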

Using bloom filter to perform many to many comparison

Hi!

Problem: compare a lot of bigger k-mer lists (n) with somewhat smaller k-mer lists (also n).

The goal is to find how many k-mers are in both lists (intersection size) and bloom filters seem to be ideal for this as the same bit array can be used multiple times and the intersection size is expected to be very small (usually 0).
I can convert the k-mer lists into fastas so Dashing can read it but working with bf sketches is a bit obscure.

Should the workflow be something like this?:

  1. Create bloom filters for every big list
  2. Create a sketch for every small list
  3. Compare each bloom filter with every small list sketch (how?)
  4. (Optional) check if resulting probably-yes k-mers are actually present (how?)
  5. Output the intersection sizes (number of k-mers present in both lists) (--full-containment-dist??)

Note: I want to use all of the k-mers, so no subsetting. Is --use-full-khash-sets sufficient, or are any other parameters needed?

This is part of a larger problem where I'm trying to find the count 1 mismatch edit distance between two sets of k-mer lists. Currently, just creating all possible mismatched k-mers (bigger lists) and intersecting them with smaller lists seems to be the fastest approach, which is still slow.
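For reference, the approach described in the last paragraph can be written with exact Python sets, which gives a ground truth to check any Bloom-filter result against on small inputs. This is a minimal sketch using plain sets rather than Bloom filters or Dashing sketches. The function names are hypothetical, and it assumes all k-mers have the same length.

```python
def one_mismatch_neighbors(kmer, alphabet="ACGT"):
    """Yield every k-mer at Hamming distance exactly 1 from `kmer`."""
    for i, base in enumerate(kmer):
        for alt in alphabet:
            if alt != base:
                yield kmer[:i] + alt + kmer[i + 1:]


def count_one_mismatch_pairs(big, small):
    """Count (a, b) pairs, a in `big` and b in `small`, differing at exactly one position."""
    small_set = set(small)
    return sum(1 for a in set(big) for nb in one_mismatch_neighbors(a)
               if nb in small_set)
```

Each k-mer has 3k neighbors, so expanding the smaller of the two lists and probing the larger one keeps the number of generated strings down.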

Can't build using gcc 5.4 (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

Hi Daniel,
When I tried to build dashing on my server, I got these errors:

In file included from src/dashing.cpp:8:0:
bonsai/hll/mh.h: In member function ‘sketch::minhash::HyperMinHash<T, Hasher>& sketch::minhash::HyperMinHash<T, Hasher>::operator+=(const sketch::minhash::HyperMinHash<T, Hasher>&)’:
bonsai/hll/mh.h:985:32: error: ‘max_fn16’ is not a member of ‘sketch::hll::detail::SIMDHolder’
                     do {*ptr = SIMDHolder::op(*ptr, *optr++);} while(++ptr != eptr); break;
                                ^
bonsai/hll/mh.h:987:17: note: in expansion of macro ‘CASE_U’
                 CASE_U(U16, max_fn16)
                 ^
bonsai/hll/mh.h:985:32: error: ‘max_fn32’ is not a member of ‘sketch::hll::detail::SIMDHolder’
                     do {*ptr = SIMDHolder::op(*ptr, *optr++);} while(++ptr != eptr); break;
                                ^
bonsai/hll/mh.h:988:17: note: in expansion of macro ‘CASE_U’
                 CASE_U(U32, max_fn32)
                 ^
bonsai/hll/mh.h:985:32: error: ‘max_fn64’ is not a member of ‘sketch::hll::detail::SIMDHolder’
                     do {*ptr = SIMDHolder::op(*ptr, *optr++);} while(++ptr != eptr); break;
                                ^
bonsai/hll/mh.h:989:17: note: in expansion of macro ‘CASE_U’
                 CASE_U(U64, max_fn64)

I'm not sure whether it has anything to do with gcc5.4. But the README says it has been tested using this version of gcc.

installation wasn't a pleasant experience

Hi, I very much look forward to using dashing; however, three small issues with how it is distributed cost me some time during installation:

  1. source code is huge due to its dependencies, took me more than 10mins to git clone
  2. releases didn't work on my (old) cluster, so I needed to build from source. Could you please provide more portable binaries?
  3. bioconda install worked :) but it wasn't mentioned on the README. EDIT: ahh I take it back, there is a badge! I didn't see it

Produce single sketch from multiple input files

I'd like to produce a single sketch from multiple input files – forward and reverse reads in this case. Is this possible with Dashing?

Mash lets me do it like this:

mash sketch -m 5 -I ec_r12 -o ec_r12_sketch ec_1.fq.gz ec_2.fq.gz

Else by piping to stdin:

cat ec_1.fq.gz ec_2.fq.gz | mash sketch -m 5 -I ec_r12 -o ec_r12_sketch -
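Whether Dashing can merge inputs itself is not answered here. A tool-independent fallback is to concatenate the files first: gzip files can be joined byte for byte, because a gzip stream may consist of several members. This is a minimal Python sketch with a hypothetical function name. It assumes the combined file is then given to the sketching step as a single input.

```python
import shutil


def concat_files(inputs, output):
    """Byte-concatenate `inputs` into `output` (valid for gzip: members just stack)."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return output
```

For example, `concat_files(["ec_1.fq.gz", "ec_2.fq.gz"], "ec_r12.fq.gz")` writes one file containing both read sets.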

Binary format differs from expectation

I used Dashing v0.5.6 s128 on a Linux machine to compare pre-hashed genomes. the command was:

./dashing_s128 cmp -p78 --presketched  -b -Ofull_dashing_S16_k31_dist.bin -F fullpath_hll_filelist.txt -Q fullpath_hll_filelist.txt

From the specification here I was expecting a half matrix output with 1 byte specifying full or half matrix, 8 bytes specifying the length in np.float64, and (n*(n-1)/2)*4 bytes of data in np.float32. Note that supplying -Q only for the file path did not work.

Instead, I get a file of exactly (n**2)*4 bytes so I'm assuming I just got a square matrix of 4-byte float32 values.

The file is 422,393,406,724 bytes for n = 324,959.

I can import the data as a Numpy memory map doing this:

import numpy as np
val = np.memmap('full_dashing_S16_k31_dist.bin', dtype=np.float32, shape=(324959,324959))

I just wanted to know if this import was correct and also make you aware that the output was not what I expected. I saw in the previous issues you are working on documenting the binary format so I thought I'd pass this along. Overall, Dashing is fantastic and I really appreciate your team's hard work.
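The file size quoted above can be checked against both candidate layouts before mapping the data. This is a minimal Python sketch with hypothetical function names. The "triangle_with_header" layout is the one this issue expected from the specification, not a confirmed description of the format. The map is opened read-only so the file cannot be modified by accident.

```python
import os


def expected_sizes(n, itemsize=4):
    """Candidate file sizes in bytes for n inputs."""
    return {
        "square": n * n * itemsize,
        # Layout the issue expected: 1 flag byte + 8-byte length + upper triangle.
        "triangle_with_header": 1 + 8 + n * (n - 1) // 2 * itemsize,
    }


def open_square(path, n):
    """Memory-map a square float32 matrix read-only after checking the file size."""
    import numpy as np  # imported lazily so the size check works without NumPy

    if os.path.getsize(path) != expected_sizes(n)["square"]:
        raise ValueError("file size does not match an n x n float32 matrix")
    return np.memmap(path, dtype=np.float32, mode="r", shape=(n, n))
```

For n = 324,959 the "square" entry equals the 422,393,406,724 bytes reported above, which is consistent with a full n x n matrix of 4-byte values and no header.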

bioconda install: Illegal instruction

Hi,

Trying to use dashing as it seems very useful, thanks.

Not sure if this issue belongs here but on my system the bioconda install gives an illegal instruction error:

$ dashing cmp -M tests/data/abisko4/*fna
Dashing version: v0.4.0
Illegal instruction (core dumped)

Specifically I have

  dashing            bioconda/linux-64::dashing-0.4.0-hfc8b89e_0

installed. I presume this is an issue with dashing compilation itself, but not 100% sure on that.

Any suggestions? On my other computer it works without issue. If it helps this is my cpu specs

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               42
Model name:          Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Stepping:            7
CPU MHz:             1714.000
CPU max MHz:         3800.0000
CPU min MHz:         1600.0000
BogoMIPS:            6784.26
Virtualisation:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8192K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d

Thanks, ben.
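The Flags line above lists avx but not avx2, so any build that uses AVX2 or wider instructions would stop with an illegal instruction on this CPU. This is a minimal Python sketch for picking a release binary from the CPU flags. The mapping from the dashing_s128/s256/s512 names to sse2/avx2/avx512bw is an assumption based on the suffixes, and which instruction sets the bioconda build uses is not stated in this issue.

```python
# Assumption: the release binary suffixes map to these SIMD requirements.
REQUIRED_FLAG = [("dashing_s512", "avx512bw"), ("dashing_s256", "avx2"),
                 ("dashing_s128", "sse2")]


def widest_supported(flags_line):
    """Return the first binary whose required CPU flag appears in `flags_line`."""
    flags = set(flags_line.split())
    for binary, flag in REQUIRED_FLAG:
        if flag in flags:
            return binary
    return None


def host_flags(cpuinfo="/proc/cpuinfo"):
    """Read the first 'flags' line from Linux /proc/cpuinfo."""
    with open(cpuinfo) as fh:
        for line in fh:
            if line.startswith("flags"):
                return line.split(":", 1)[1]
    return ""
```

On a Linux host, `widest_supported(host_flags())` gives the suggestion for the current machine.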

Querying with presketches: 'std::bad_alloc'

Hi!

I'm using the same references (-F) multiple times and thought it would be faster if I'd sketch them once and use only the sketches in the future for querying.

Sketching:
dashing sketch -F references.fasta_paths.txt -k 32 -p 2 --sketch-size 20 --use-bb-minhash

Querying:
dashing dist -F references.sketch_paths.txt -k 32 -p 2 --sizes --sketch-size 20 --use-bb-minhash -T -Q testdata_path.txt --presketched

Query error:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

Removing the -Q testdata_path.txt --presketched outputs square matrix without errors.
Dashing version: v0.5-3-g03c10

Minor comments/questions:

  1. Is the missing "#Names" line intentional in query output? Asking because it is present in the square matrix and getting the order from the "sizes section/file" seems a bit odd.
  2. Is there any way to specify sketch output dir/name? -o seems to put them all into a single file.
  3. dashing -v outputs the version twice. Probably one time as always, the other as a response to the command.

Output HyperLogLog

Hi,

I have a question about the output from HLL, when I use Dashing with HyperLogLog i.e.:
./dashing hll -k15 -p2 -S24 read.fastq reference.fasta

The output from HLL is then:
Estimated number of unique exact matches: 2925637.000000

Which kind of matches does HLL count? I thought it counted the k-mer matches between the inputs (read and reference).
If HLL counts k-mer matches, the result shouldn't be 2925637, because my read is 1628 bp long and the reference is about 3000000 bp.

My goal is to count the k-mer matches between read and reference. Are the matches counted by HLL k-mer matches or some other kind of match?

Best,

Ahmad
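For small inputs, the quantities in question can be computed exactly, which helps in interpreting the estimate. A 1628 bp read contains at most 1628 - 15 + 1 = 1614 k-mers at k=15, so a figure near 2.9 million is on the scale of the distinct k-mers in the ~3 Mbp reference, or in both inputs combined, rather than of read-to-reference matches. That reading is an inference from the numbers, not a statement of what `dashing hll` reports. This is a minimal Python sketch with hypothetical function names, counting the forward strand only.

```python
def kmers(seq, k):
    """Set of distinct k-mers in `seq` (forward strand only, for simplicity)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_summary(read, reference, k):
    """Exact distinct k-mer counts for each input, their overlap, and their union."""
    a, b = kmers(read, k), kmers(reference, k)
    return {"read": len(a), "reference": len(b),
            "shared": len(a & b), "union": len(a | b)}
```

The "shared" entry is the read-to-reference k-mer match count described as the goal above, counted once per distinct k-mer.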

make clean does not propagate

This is rather a minor issue. When I am experimenting with different compilation options, I need to clean all intermediate files. However, make clean works only on the top level and does not propagate. This can be done by adding lines like $(MAKE) -C bonsai clean. I would create a pull request, but I am not sure if make clean is done properly in bonsai and distmat. Also, git clean cannot work recursively so this wouldn't be a solution.

Sketch from STDIN

Is it possible to create a sketch using fastas streamed through a pipe to dashing?

I'm manipulating both assembled genomes and k-mers and would like to compare them in the end multiple times. I could write them on the disk as an additional step but given the high volume, it is really cumbersome.

Thanks.

Recursive checkout failure

As mentioned in the readme I did a git clone --recursive https://github.com/dnbaker/dashing which downloaded a bunch of stuff but then ended with

…
Submodule path 'bonsai/zlib': checked out 'cacf7f1d4e3d44d871b605da3b647f07d718623f'
error: Server does not allow request for unadvertised object be71acca9721786f6a3bc3c5b4a42f699721a057
Fetched in submodule path 'bonsai/zstd', but it did not contain be71acca9721786f6a3bc3c5b4a42f699721a057. Direct fetching of that commit failed.
Submodule path 'distmat': checked out '3ffee11dc51a4cfeb562ca6d04ce46d3cb42db39'
Failed to recurse into submodule path 'bonsai'

Apart from the error I am wondering why dashing/bonsai comes with copies of zlib, zstd, divsufsort and others at all? My computer has perfectly fine versions of them installed already.

Union command failing

Hello! I'm trying to use the tool and was interested in the union subcommand.

I first create sketches of 2 read files using -k31 -S10. Then, I run

dashing union ./achromobacter_xylosoxidans__01/DRR015625.fa.gz.w.31.spacing.10.hll ./achromobacter_xylosoxidans__01/DRR015626.fa.gz.w.31.spacing.10.hll -o test.hll

And I receive the following error

terminate called after throwing an instance of 'std::runtime_error'
  what():  Could not open file at '@r' for reading
[1]    27344 abort      /n/data1/hms/dbmi/baym/arya/tools/dashing/dashing union

If I keep rerunning the same command I receive different strings for @r and sometimes receive the following backtrace:

*** Error in `/n/data1/hms/dbmi/baym/arya/tools/dashing/dashing': double free or corruption (out): 0x0000000001e9d250 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81679)[0x7f3ad2c52679]
/n/data1/hms/dbmi/baym/arya/tools/dashing/dashing[0x4f78b6]
/n/data1/hms/dbmi/baym/arya/tools/dashing/dashing[0x4eac84]
/n/data1/hms/dbmi/baym/arya/tools/dashing/dashing[0x40c605]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f3ad2bf3505]
/n/data1/hms/dbmi/baym/arya/tools/dashing/dashing[0x41d0ae]
======= Memory map: ========
00400000-009ae000 r-xp 00000000 00:34 52331274504                        /n/data1/hms/dbmi/baym/arya/tools/dashing/dashing
00bad000-00bae000 r--p 005ad000 00:34 52331274504                        /n/data1/hms/dbmi/baym/arya/tools/dashing/dashing
00bae000-00bb1000 rw-p 005ae000 00:34 52331274504                        /n/data1/hms/dbmi/baym/arya/tools/dashing/dashing
00bb1000-00bb2000 rw-p 00000000 00:00 0
01e6d000-01eaf000 rw-p 00000000 00:00 0                                  [heap]
7f3acc000000-7f3acc021000 rw-p 00000000 00:00 0
7f3acc021000-7f3ad0000000 ---p 00000000 00:00 0
7f3ad2bd1000-7f3ad2d94000 r-xp 00000000 fd:00 12583521                   /usr/lib64/libc-2.17.so
7f3ad2d94000-7f3ad2f94000 ---p 001c3000 fd:00 12583521                   /usr/lib64/libc-2.17.so
7f3ad2f94000-7f3ad2f98000 r--p 001c3000 fd:00 12583521                   /usr/lib64/libc-2.17.so
7f3ad2f98000-7f3ad2f9a000 rw-p 001c7000 fd:00 12583521                   /usr/lib64/libc-2.17.so
7f3ad2f9a000-7f3ad2f9f000 rw-p 00000000 00:00 0
7f3ad2f9f000-7f3ad2fb6000 r-xp 00000000 fd:00 12583547                   /usr/lib64/libpthread-2.17.so
7f3ad2fb6000-7f3ad31b5000 ---p 00017000 fd:00 12583547                   /usr/lib64/libpthread-2.17.so
7f3ad31b5000-7f3ad31b6000 r--p 00016000 fd:00 12583547                   /usr/lib64/libpthread-2.17.so
7f3ad31b6000-7f3ad31b7000 rw-p 00017000 fd:00 12583547                   /usr/lib64/libpthread-2.17.so
7f3ad31b7000-7f3ad31bb000 rw-p 00000000 00:00 0
7f3ad31bb000-7f3ad31d1000 r-xp 00000000 00:2b 5898226537                 /n/app/gcc/6.2.0/lib64/libgcc_s.so.1
7f3ad31d1000-7f3ad33d0000 ---p 00016000 00:2b 5898226537                 /n/app/gcc/6.2.0/lib64/libgcc_s.so.1
7f3ad33d0000-7f3ad33d1000 r--p 00015000 00:2b 5898226537                 /n/app/gcc/6.2.0/lib64/libgcc_s.so.1
7f3ad33d1000-7f3ad33d2000 rw-p 00016000 00:2b 5898226537                 /n/app/gcc/6.2.0/lib64/libgcc_s.so.1
7f3ad33d2000-7f3ad33fe000 r-xp 00000000 00:2b 5898355825                 /n/app/gcc/6.2.0/lib64/libgomp.so.1.0.0
7f3ad33fe000-7f3ad35fd000 ---p 0002c000 00:2b 5898355825                 /n/app/gcc/6.2.0/lib64/libgomp.so.1.0.0
7f3ad35fd000-7f3ad35fe000 r--p 0002b000 00:2b 5898355825                 /n/app/gcc/6.2.0/lib64/libgomp.so.1.0.0
7f3ad35fe000-7f3ad35ff000 rw-p 0002c000 00:2b 5898355825                 /n/app/gcc/6.2.0/lib64/libgomp.so.1.0.0
7f3ad35ff000-7f3ad3700000 r-xp 00000000 fd:00 12583529                   /usr/lib64/libm-2.17.so
7f3ad3700000-7f3ad38ff000 ---p 00101000 fd:00 12583529                   /usr/lib64/libm-2.17.so
7f3ad38ff000-7f3ad3900000 r--p 00100000 fd:00 12583529                   /usr/lib64/libm-2.17.so
7f3ad3900000-7f3ad3901000 rw-p 00101000 fd:00 12583529                   /usr/lib64/libm-2.17.so
7f3ad3901000-7f3ad3a72000 r-xp 00000000 00:2b 5898402838                 /n/app/gcc/6.2.0/lib64/libstdc++.so.6.0.22
7f3ad3a72000-7f3ad3c72000 ---p 00171000 00:2b 5898402838                 /n/app/gcc/6.2.0/lib64/libstdc++.so.6.0.22
7f3ad3c72000-7f3ad3c7c000 r--p 00171000 00:2b 5898402838                 /n/app/gcc/6.2.0/lib64/libstdc++.so.6.0.22
7f3ad3c7c000-7f3ad3c7e000 rw-p 0017b000 00:2b 5898402838                 /n/app/gcc/6.2.0/lib64/libstdc++.so.6.0.22
7f3ad3c7e000-7f3ad3c82000 rw-p 00000000 00:00 0
7f3ad3c82000-7f3ad3c97000 r-xp 00000000 fd:00 12584762                   /usr/lib64/libz.so.1.2.7
7f3ad3c97000-7f3ad3e96000 ---p 00015000 fd:00 12584762                   /usr/lib64/libz.so.1.2.7
7f3ad3e96000-7f3ad3e97000 r--p 00014000 fd:00 12584762                   /usr/lib64/libz.so.1.2.7
7f3ad3e97000-7f3ad3e98000 rw-p 00015000 fd:00 12584762                   /usr/lib64/libz.so.1.2.7
7f3ad3e98000-7f3ad3e9a000 r-xp 00000000 fd:00 12583527                   /usr/lib64/libdl-2.17.so
7f3ad3e9a000-7f3ad409a000 ---p 00002000 fd:00 12583527                   /usr/lib64/libdl-2.17.so
7f3ad409a000-7f3ad409b000 r--p 00002000 fd:00 12583527                   /usr/lib64/libdl-2.17.so
7f3ad409b000-7f3ad409c000 rw-p 00003000 fd:00 12583527                   /usr/lib64/libdl-2.17.so
7f3ad409c000-7f3ad40be000 r-xp 00000000 fd:00 12583514                   /usr/lib64/ld-2.17.so
7f3ad42a4000-7f3ad42ac000 rw-p 00000000 00:00 0
7f3ad42bb000-7f3ad42bd000 rw-p 00000000 00:00 0
7f3ad42bd000-7f3ad42be000 r--p 00021000 fd:00 12583514                   /usr/lib64/ld-2.17.so
7f3ad42be000-7f3ad42bf000 rw-p 00022000 fd:00 12583514                   /usr/lib64/ld-2.17.so
7f3ad42bf000-7f3ad42c0000 rw-p 00000000 00:00 0
7fff7bf18000-7fff7bf3a000 rw-p 00000000 00:00 0                          [stack]
7fff7bfc1000-7fff7bfc3000 r-xp 00000000 00:00 0                          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
[1]    29275 abort      /n/data1/hms/dbmi/baym/arya/tools/dashing/dashing union

This occurs when using the most recent precompiled release version of dashing or when I compile from source.

Any help would be appreciated!

  • Arya

Compiling from release - empty module dirs and running in Singularity container

Hi,

I'm planning to install dashing into a Singularity container (CentOS) but tried to install it on a server first (also CentOS).

-bash-4.2$ wget https://github.com/dnbaker/dashing/archive/v0.4.2.tar.gz
-bash-4.2$ tar -zxvf v0.4.2.tar.gz
-bash-4.2$ make
fatal: Not a git repository (or any parent up to mount point /serverhome)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
make: *** No rule to make target `sketch/include/sketch/cbf.h', needed by `src/dashing.o'. Stop.
-bash-4.2$ ls bonsai/ | wc -l
0
-bash-4.2$ ls distmat/ | wc -l
0
-bash-4.2$ ls khset | wc -l
0
-bash-4.2$ ls sketch/ | wc -l
0

The server has an old gcc (4.8.5), but this is probably not the issue, because building from a cloned master fails much later.

Unrelated: if no temporary files are created while building the distance matrix, is everything held in memory? How much memory should I expect to need when running on thousands of assembled bacterial genomes (~5MB each)? I'm asking so I can request the right HPC resources.

Regards,
Mihkel
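
For the memory question, a back-of-the-envelope estimate can be sketched as below. It is only a rough lower bound under stated assumptions: each sketch takes 2**S bytes (as the `-S` help text elsewhere on this page says, with a default of 10), and the pairwise results are kept as one 8-byte value per pair (4 bytes with `-f`). Whether dashing actually keeps the whole matrix in memory is not confirmed here, and parsing buffers and per-thread state are ignored. The function name is hypothetical.

```python
def estimate_memory_bytes(n_genomes, log2_sketch_bytes=10, bytes_per_distance=8):
    """Rough lower bound: all sketches plus one value per unordered pair.

    Assumptions (not confirmed by the dashing authors):
      * each sketch occupies 2**log2_sketch_bytes bytes
      * each pairwise result is stored as bytes_per_distance bytes
    """
    sketch_bytes = n_genomes * (1 << log2_sketch_bytes)
    n_pairs = n_genomes * (n_genomes - 1) // 2
    matrix_bytes = n_pairs * bytes_per_distance
    return sketch_bytes, matrix_bytes


if __name__ == "__main__":
    for n in (1000, 5000, 10000):
        sketches, distances = estimate_memory_bytes(n)
        print(f"{n} genomes: sketches ~{sketches / 1e6:.1f} MB, "
              f"pairwise values ~{distances / 1e6:.1f} MB")
```

Under these assumptions the pairwise values dominate: they grow quadratically with the number of genomes, while the sketches grow linearly.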

Add protein k-mers with 6-frame translation

Hello,
I've been using protein k-mers from sourmash to compare single-cell RNA-seq profiles across different species. I'm curious to benchmark dashing for this purpose as well.
Warmest,
Olga
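
To make the request concrete, here is a minimal Python sketch of what "protein k-mers with 6-frame translation" could mean: translate the three forward and three reverse-complement reading frames with the standard genetic code, then collect k-mers from the resulting peptides. This is an illustration only, not how sourmash or dashing implement it. The function names, the choice to split peptides at stop codons, and the use of `X` for codons containing ambiguous bases are all assumptions.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with non-ACGT bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq):
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    return [translate(strand[offset:])
            for strand in (seq, reverse_complement(seq))
            for offset in range(3)]


def protein_kmers(seq, k):
    """Set of amino-acid k-mers over all six frames, never spanning a stop codon."""
    kmers = set()
    for protein in six_frame_translations(seq):
        for segment in protein.split("*"):  # assumption: break at stop codons
            for i in range(len(segment) - k + 1):
                kmers.add(segment[i:i + k])
    return kmers
```

The resulting k-mer set could then be hashed and sketched like nucleotide k-mers.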

Emit Intersection Size

Currently, dashing emits union sizes with --sizes, but not intersection size. You can get to that by subtracting the estimated set cardinality, but it would be preferable to emit it directly.
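
Until that exists, the subtraction mentioned above is inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|. A small sketch follows, assuming you already have the two estimated cardinalities and the estimated union size as numbers. The function names are hypothetical, and clamping at zero is an assumption to guard against estimator noise.

```python
def intersection_size(size_a, size_b, union_size):
    """Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|.

    Estimated sizes are noisy, so the result is clamped at zero.
    """
    return max(0.0, size_a + size_b - union_size)


def jaccard(size_a, size_b, union_size):
    """Jaccard index |A ∩ B| / |A ∪ B| from the same three estimates."""
    if union_size <= 0:
        return 0.0
    return intersection_size(size_a, size_b, union_size) / union_size
```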

Can't build dashing on mac (fatal error: string.h: No such file or directory)

I couldn't find a way to compile dashing on a Mac. Would it be possible to add an install section to the README with more details for individual platforms (including dependencies)?

Some of the dependencies require OpenMP, which the clang shipped with macOS does not support out of the box. Therefore, one needs to use gcc. However, gcc (installed using brew) then can't find certain header files.

gcc 4.9

$ make CXX=c++-4.9 CC=gcc-4.9 V=1
git submodule update --init --remote --recursive . && cd bonsai && git checkout master && git pull && make update && \
    cd linear && git checkout master && git pull && cd .. && cd .. && cd distmat && git checkout master && git pull && cd ..

...

cd bonsai/bonsai && make libzstd.a && cp libzstd.a ../../
cd ../zstd && /Applications/Xcode.app/Contents/Developer/usr/bin/make && mv lib/libzstd.a ../bonsai && cd ../bonsai
gcc-4.9 -O3   -I. -I./common -DXXH_NAMESPACE=ZSTD_ -I./legacy -DZSTD_LEGACY_SUPPORT=5  -c -o common/entropy_common.o common/entropy_common.c
In file included from common/entropy_common.c:38:0:
common/mem.h:22:37: fatal error: string.h: No such file or directory
 #include <string.h>     /* memcpy */
                                     ^
compilation terminated.
make[3]: *** [common/entropy_common.o] Error 1
make[2]: *** [lib-release] Error 2
make[1]: *** [libzstd.a] Error 2
make: *** [libzstd.a] Error 2

gcc 5

$ make CXX=c++-5 CC=gcc-5 V=1
git submodule update --init --remote --recursive . && cd bonsai && git checkout master && git pull && make update && \
    cd linear && git checkout master && git pull && cd .. && cd .. && cd distmat && git checkout master && git pull && cd ..

...

cd bonsai/bonsai && make libzstd.a && cp libzstd.a ../../
cd ../zstd && /Applications/Xcode.app/Contents/Developer/usr/bin/make && mv lib/libzstd.a ../bonsai && cd ../bonsai
gcc-5 -O3   -I. -I./common -DXXH_NAMESPACE=ZSTD_ -I./legacy -DZSTD_LEGACY_SUPPORT=5  -c -o common/entropy_common.o common/entropy_common.c
In file included from common/entropy_common.c:38:0:
common/mem.h:22:37: fatal error: string.h: No such file or directory
compilation terminated.
make[3]: *** [common/entropy_common.o] Error 1
make[2]: *** [lib-release] Error 2
make[1]: *** [libzstd.a] Error 2
make: *** [libzstd.a] Error 2

Output tsv from `setdist` (`setdist -T` isn't working)

Hello!
Thanks again for adding the output TSV feature for dist. The same command doesn't seem to be working for me for setdist. Is it possible to add the same feature?
Thank you!
Warmest,
Olga

dist -T works fine

(kmer-hashing)
 ✘  Wed 13 Feb - 08:08  ~/rcfiles   origin ☊ master ✔ 
  time /home/olga/code/dashing/dashing dist \
    -T -b -O \
    /home/olga/code/kmer-hashing/data/100_test_dashing/dashing_dist_k21_sketch10.tsv \
    -k 21 \
    /mnt/pureScratch/olga/dashing-test/catted_reads/*.fastq.gz
#Path   Size (est.)
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-MAA000577-3_8_M-1-1.fastq.gz        45941141.617559
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-MAA100039-3_11_M-1-1.fastq.gz        28812026.843665
/mnt/pureScratch/olga/dashing-test/catted_reads/A10-B000971-3_39_F-1-1.fastq.gz 17248371.587632
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-D041914-3_8_M-1-1.fastq.gz  19983749.392062
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-B001717-3_38_F-1-1.fastq.gz 16453228.911596
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-MAA000487-3_10_M-1-1.fastq.gz        26391118.812770
/mnt/pureScratch/olga/dashing-test/catted_reads/A10-B002775-3_39_F-1-1.fastq.gz 13471740.831427
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-D041914-3_8_M-1-1.fastq.gz  18346446.763078
/mnt/pureScratch/olga/dashing-test/catted_reads/A10-D041914-3_8_M-1-1.fastq.gz  17065167.466045
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-MAA100041-3_9_M-1-1.fastq.gz        17565493.832883
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-MAA100140-3_57_F-1-1.fastq.gz       8458176.796920
/mnt/pureScratch/olga/dashing-test/catted_reads/A10-MAA000559-3_8_M-1-1.fastq.gz        13580440.404920
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-D042253-3_9_M-1-1.fastq.gz  18532120.609711
/mnt/pureScratch/olga/dashing-test/catted_reads/A11-MAA000559-3_8_M-1-1.fastq.gz        11405142.776925
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-MAA000559-3_8_M-1-1.fastq.gz        12071181.284648
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-B002764-3_38_F-1-1.fastq.gz  4868938.990295
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-D042253-3_9_M-1-1.fastq.gz   12064152.714943
/mnt/pureScratch/olga/dashing-test/catted_reads/A1-MAA000779-3_11_M-1-1.fastq.gz        6605897.783324
/mnt/pureScratch/olga/dashing-test/catted_reads/A12-MAA000508-3_9_M-1-1.fastq.gz        6819225.271006
/home/olga/code/dashing/dashing dist -T -b -O  -k 21   106.14s user 1.10s system 99% cpu 1:47.60 total

setdist -T errors out

 Tue 12 Feb - 18:53  ~/rcfiles   origin ☊ master ✔ 
  time /home/olga/code/dashing/dashing setdist \
    -T -b -O \
    /home/olga/code/kmer-hashing/data/100_test_dashing/dashing_setdist_k21_sketch10.tsv \
    -k 21 \
    /mnt/pureScratch/olga/dashing-test/catted_reads/*.fastq.gz
setdist: invalid option -- 'T'
Usage: setdist <opts> [genomes if not provided from a file with -F]
Flags:
-h/-?   Usage
-k      Set kmer size [31]
-W      Cache sketches/use cached sketches
-p      Set number of threads [1]
-b      Emit distances in binary (default: human-readable, upper-triangular)
-U      Emit distances in PHYLIP upper triangular format(default: human-readable, upper-triangular)
-s      add a spacer of the format <int>x<int>,<int>x<int>,..., where the first integer corresponds to the space between bases repeated the second integer number of times
-w      Set window size [max(size of spaced kmer, [parameter])]
-S      Set sketch size [10, for 2**10 bytes each]
-H      Treat provided paths as pre-made sketches.
-C      Do not canonicalize. [Default: canonicalize]
-P      Set prefix for sketch file locations [empty]
-x      Set suffix in sketch file names [empty]
-o      Output for genome size estimates [stdout]
-I      Use Ertl's Improved Estimator
-E      Use Ertl's Original Estimator
-J      Use Ertl's JMLE Estimator [default      Uses Ertl-MLE]
-O      Output for genome distance matrix [stdout]
-L      Clamp estimates below expected variance to 0. [Default: do not clamp]
-e      Emit in scientific notation
-f      Report results as float. (Only important for binary format.) This halves the memory footprint at the cost of precision loss.
-F      Get paths to genomes from file rather than positional arguments
-M      Emit Mash distance (default: jaccard index)
-T      postprocess binary format to human-readable TSV (not upper triangular)
-Z      Emit genome sizes (default: jaccard index)
-N      Autodetect fastq or fasta data by filename (.fq or .fastq within filename).
-y      Filter all input data by count-min sketch.
-q      Set count-min number of hashes. Default: [4]
-c      Set minimum count for kmers to pass count-min filtering.
-t      Set count-min sketch size (log2). Default: ceil(log2(max_filesize)) + 2
-R      Set seed for seeds for count-min sketches
/home/olga/code/dashing/dashing setdist -T -b -O  -k 21   0.00s user 0.00s system 0% cpu 0.004 total

linking error in zlibWrapper

I have cloned the repository recursively and am trying to compile dashing using gcc 8.2.0 on centos linux. It compiles for a while but dies with an error while linking zlibWrapper:

g++ -O3 -funroll-loops -pipe -fno-strict-aliasing -DUSE_PDQSORT -DNOT_THREADSAFE -mpopcnt -flto -fopenmp -fno-rtti -std=c++14 -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -Wunused-variable -Wno-attributes -Wno-pedantic -Wno-ignored-attributes -Wno-missing-braces -Wno-unknown-pragmas -DDASHING_VERSION="v0.4.6" -fdiagnostics-color=always -Ibonsai/clhash/include -I. -Ibonsai/zlib -Ibonsai/libpopcnt -Iinclude -Ibonsai/circularqueue -Ibonsai/zstd/zlibWrapper -Ibonsai/zstd/lib/common -Ibonsai/zstd/lib -Isketch/vec -Ibonsai -Ibonsai/include/ -Isketch/include -Isketch -Isketch/include/sketch -Isketch/vec -L. bonsai/zstd/zlibWrapper/gzclose.o bonsai/zstd/zlibWrapper/gzlib.o bonsai/zstd/zlibWrapper/gzread.o bonsai/zstd/zlibWrapper/gzwrite.o bonsai/zstd/zlibWrapper/zstd_zlibwrapper.o libzstd.a bonsai/clhash.o bonsai/klib/kthread.o src/main.o src/union.o src/hllmain.o src/mkdistmain.o src/finalizers.o src/cardests.o src/distmain.o src/construct.o src/flatten_all.o src/sketchcmpbbmh.o src/sketchcmpkhs.o src/sketchcmprmh.o src/sketchcmpcrmh.o src/sketchcmphll.o src/sketchcmpbf.o src/sketchcmpsmh.o src/sketchcorekhs.o src/sketchcorecbbmh.o src/sketchcoresmh.o src/sketchcorebbmh.o src/sketchcorermh.o src/sketchcorebf.o src/sketchcorehll.o src/sketchcorecrmh.o src/background.o -O3 src/dashing.o -o dashing -DZWRAP_USE_ZSTD=1 -lzstd -lz -ldl -march=native -DNDEBUG # -DNDEBUG
bonsai/zstd/zlibWrapper/zstd_zlibwrapper.o: In function `z_inflateGetDictionary':
zstd_zlibwrapper.c:(.text+0x1ab9): undefined reference to `inflateGetDictionary'
collect2: error: ld returned 1 exit status
make: *** [dashing] Error 1

I also tried running each of the three pre-compiled versions of dashing on a machine with AVX2 but each binary dies with a core dump:
Dashing version: v0.4.5-8-g370d
terminate called recursively
terminate called recursively
run_dashing.sh: line 6: 39845 Aborted (core dumped) ./dashing_s128 cmp -k 31 -p 8 -O distance_matrix.txt -o size_estimates.txt -F inputfiles.txt

Unable to install (error with make file)

Hi there,

I'm having an issue when attempting to install dashing via
make dashing

I get the following error:

make: *** No rule to make target `bonsai/hll/include/cbf.h', needed by `src/dashing.o'. Stop.

Any suggestions?

'sketch::exception::ZlibError'

Hello again! Thanks for helping with my other issue, I'm now encountering the following exception.

The command I run is:

start=$(date +%s)
log 'Generating union of SRR sketches' '...' 'STEP 2'
$DASHING union -p 6 -o $OUTPUT/unionofSRR.hll $OUTPUT/SRR*.hll
end=$(date +%s)
runtime=$(((end-start)/60))
log "Time Taken" "$runtime minutes"

In my job output I receive the following:

STEP 2 Generating union of SRR sketches ...
Dashing version: v0.5.6
terminate called without an active exception
/var/spool/slurmd/job28030963/slurm_script: line 318: 18202 Aborted   dashing union -p 6 -o ./test/unionofSRR.hll ./test/SRR*.hll
-> Time Taken 0 minutes

And when I view the SRR union file generated, I receive:

➜ dashing view unionofSRR.hll
Dashing version: v0.5.6
terminate called after throwing an instance of 'sketch::exception::ZlibError'
  what():  zlibError [file error][E:sketch/include/sketch/hll.h:1085:void sketch::hll::hllbase_t<HashStruct>::read(z_gzFile) [with HashStruct = sketch::hash::WangHash; z_gzFile = gzFile_s*]] Error reading from file

[1]    27242 abort      dashing view unionofSRR.hll

I've confirmed I can correctly generate unionofSRR.hll by running the same expression on the command line, so I believe it's some issue with calling the command in a job script. I am able to correctly generate sketches in a job through the dashing sketch command (done in a very similar way) so I'm not sure why dashing union would fail. Let me know if you can think of anything for me to try.

Best,
Arya

example files

Hi,

Are there example files somewhere I can download to test dashing? Thanks.
Jean

README clarification & performance

Hi again,
sorry to only be leaving 'negative' issues, but my dashing run has been going on for a day (2 threads) so I suppose I did something wrong when following the README instructions.
I have 15k datasets, each dataset is neither a complete genome nor a set of reads but a .fa.gz human transcriptome assembly.

  1. I went with the first cmdline of the README: dashing dist -k31 -p2. Is this an appropriate one for my setting?
  2. README says "unspaced, unminimized kmers". But these are still minimizers, right?
  3. If I understand correctly, the "dist" section is for all-against-all comparison, and the "dist (asymmetric mode)" section is all A's versus all B's comparison, right?

thanks in advance,
Rayan

Can't build dashing

After a huge recursive module fetching round, I get this:

In file included from bonsai/cppitertools/internal/iter_tuples.hpp:4:0,
                 from bonsai/cppitertools/product.hpp:4,
                 from src/testsat.cpp:8:
bonsai/cppitertools/internal/iterator_wrapper.hpp:6:19: fatal error: variant: No such file or directory
compilation terminated.
make: *** [Makefile:100: testsat] Error 1
make: *** Waiting for unfinished jobs....
In file included from ./bonsai/hll/cbf.h:3:0,
                 from src/testcbf.cpp:2:
./bonsai/hll/bf.h: In member function 'void sketch::bf::bfbase_t<HashStruct>::reseed(uint64_t)':
./bonsai/hll/bf.h:110:31: error: expected ')' before ';' token
             if(auto val = mt(); std::find(seeds_.cbegin(), seeds_.cend(), val) == seeds_.cend())

Add support for 10x bam files

Hello,
For computing signatures of single-cell RNA-seq data, it's very convenient to use the 10x bam file directly. This was implemented in sourmash using the Python package bamnostic. For C/C++, I presume the htslib library would be used directly. It's been a while since I stretched my C muscles but I could give it a shot.
Warmest,
Olga

Build failure

As of v0.1.1-38-g5ab9bec a new clone and build fails with the following message.

bonsai/hll/vec/vec.h: In function ‘constexpr __m128i vec::_mm_mullo_epi64x(__m128i, uint64_t)’:
bonsai/hll/vec/vec.h:115:27: error: call to non-‘constexpr’ function ‘__m128i _mm_mullo_epi64(__m128i, __m128i)’
     return _mm_mullo_epi64(a, _mm_set1_epi64x(b));

(also bonsai/hll/vec/vec.h:119:30)

Option to produce PHYLIP distance matrix format

dashing dist produces a distance matrix in its own new file format that is incompatible with any downstream analysis tool. Most of them expect a matrix in PHYLIP format. The conversion can be done with a simple awk script but is tedious. It would be great if dashing could produce a phylip distance matrix for improved user experience.

See also marbl/Mash#9.
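
As a stopgap, the conversion could be sketched in Python instead of awk. This sketch assumes you have already parsed dashing's output into a list of names plus the upper-triangular values, where row i holds the distances from sample i to samples i+1..n-1. That parsing step is not shown, and the function name is hypothetical. It writes a square matrix in "relaxed" PHYLIP (names are not padded or truncated to 10 characters), which some tools may not accept. PHYLIP expects distances, so the input should be distances (for example Mash distance) rather than Jaccard similarities.

```python
def upper_to_phylip(names, upper_rows):
    """Build a square, relaxed-PHYLIP distance matrix as a string.

    names:      list of n sample labels
    upper_rows: upper_rows[i] holds distances from sample i to samples
                i+1 .. n-1 (so the last row is empty)
    """
    n = len(names)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(upper_rows):
        for offset, dist in enumerate(row):
            j = i + 1 + offset
            full[i][j] = full[j][i] = dist  # mirror into the lower triangle
    lines = [str(n)]
    for name, row in zip(names, full):
        lines.append(name + "  " + " ".join(f"{d:.6f}" for d in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(upper_to_phylip(["a", "b", "c"], [[0.1, 0.2], [0.3], []]), end="")
```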

update of hll breaks the build

Hello,

I was able to compile dashing on Monday the 19th, but I am not able to recompile it today (Thursday the 22nd); the build fails with the following error:

/local/gensoft2/exe/gcc/7.2.0/bin/g++ -O3 -funroll-loops -pipe -fno-strict-aliasing -march=native -mpclmul -DUSE_PDQSORT -DNOT_THREADSAFE -DENABLE_COMPUTED_GOTO   -fopenmp  -fno-rtti -std=c++14 -Wall -Wextra -Wno-char-subscripts -Wpointer-arith -Wwrite-strings -Wdisabled-optimization -Wformat -Wcast-align -Wno-unused-function -Wno-unused-parameter -pedantic -Wunused-variable -Wno-attributes -Wno-pedantic  -Wno-ignored-attributes  -DDASHING_VERSION=\"v0.3.5\"  -Ibonsai/clhash/include -I.  -Ibonsai/zlib -Ibonsai/libpopcnt -Iinclude -Ibonsai/circularqueue -Ibonsai/zstd/zlibWrapper -Ibonsai/zstd/lib/common -Ibonsai/zstd/lib  -Ibonsai/hll -Ibonsai/hll/vec -Ibonsai -Ibonsai/bonsai/include/ -L.  -Lbonsai/zlib -DNDEBUG -c src/dashing.cpp -o src/dashing.o -lz
In file included from ./bonsai/bonsai/include/database.h:4:0,
                 from src/dashing.cpp:5:
./bonsai/bonsai/include/encoder.h: In constructor 'bns::RollingHasher<IntType, HashClass>::RollingHasher(unsigned int, bool, bns::RollingHashingType, uint64_t, uint64_t)':
./bonsai/bonsai/include/encoder.h:550:58: error: 'NotImplementedError' is not a member of 'sketch::common'
         if(enc == PROTEIN_6_FRAME) throw sketch::common::NotImplementedError("Protein 6-frame not implemented.");
                                                          ^~~~~~~~~~~~~~~~~~~
./bonsai/bonsai/include/encoder.h:550:58: note: suggested alternatives:
In file included from bonsai/hll/common.h:44:0,
                 from src/dashing.cpp:2:
bonsai/hll/exception.h:10:7: note:   'sketch::exception::NotImplementedError'
 class NotImplementedError: public std::runtime_error {
       ^~~~~~~~~~~~~~~~~~~
bonsai/hll/exception.h:10:7: note:   'sketch::exception::NotImplementedError'
bonsai/hll/exception.h:10:7: note:   'sketch::exception::NotImplementedError'
src/dashing.cpp: At global scope:
src/dashing.cpp:29:23: error: 'sketch::common::NotImplementedError' has not been declared
 using sketch::common::NotImplementedError;
                       ^~~~~~~~~~~~~~~~~~~
src/dashing.cpp: In function 'int bns::union_main(int, char**)':
src/dashing.cpp:1717:40: error: 'NotImplementedError' is not a member of 'sketch::common'
         default: throw sketch::common::NotImplementedError(ks::sprintf("Union not implemented for %s\n", sketch_names[sketch_type]).data());
                                        ^~~~~~~~~~~~~~~~~~~
src/dashing.cpp:1717:40: note: suggested alternatives:
In file included from bonsai/hll/common.h:44:0,
                 from src/dashing.cpp:2:
bonsai/hll/exception.h:10:7: note:   'sketch::exception::NotImplementedError'
 class NotImplementedError: public std::runtime_error {
       ^~~~~~~~~~~~~~~~~~~
bonsai/hll/exception.h:10:7: note:   'sketch::exception::NotImplementedError'
bonsai/hll/exception.h:10:7: note:   'sketch::exception::NotImplementedError'
bonsai/hll/exception.h:10:7: note:   'sketch::exception::NotImplementedError'
make: *** [src/dashing.o] Error 1

This is due to updates in the hll submodule.

On Monday I cloned dashing (commit e002e3e),
which had the following revisions of submodules and nested submodules after make update:

[gensoft@2b553ae80831 dashing-v0.3.5]$ git submodule
 2b1a48e49f19baeddf13384e528ee3bd2ff85cb6 bonsai (v0.2.2-9-g2b1a48e)
 ce5a57777cad45bb412878089afb3e30a9bccbce distmat (heads/master)
 ee5300f5c1266c963129a7f152f9c82c3a0e8b12 khset (heads/master)
[gensoft@2b553ae80831 dashing-v0.3.5]$ cd bonsai/
[gensoft@2b553ae80831 bonsai]$ git submodule
 5f48537e54e391ba1c249d2c4eddd2d1debe1f47 circularqueue (heads/master)
 742f81a66c8e2ae7889d1bc4c4b4d8734bdcd5af clhash (v0.1.0-8-g742f81a)
 2c4687431f978f02a3780e24b8b701d22aa32d9c flat_hash_map (heads/master)
 d516b3ed4278013e42963f551b53ecb68950dfd0 hll (v0.6-242-gd516b3e)
 f719aad5fa273424fab4b0d884d68375b7cc2520 klib (spawn-final-166-gf719aad)
 eeaaf19959c19644db134b4063aff11cc188b3bb kspp (heads/master)
 e7c5dddaf16e8f17d629f74b8f11d6d7eb258f93 lazy (heads/master)
 0de541e96ee4c88e5bd28a0e941ac91e99c3b62f libpopcnt (v2.2-20-g0de541e)
 0aa224367a0f09b5c3e5c90c715e3ff8fb2ffad3 linear (heads/master)
 65325bac67a840d2488908cfa266737297c159ff ntHash (v1.0.4-6-g65325ba)
 08879029ab8dcb80a70142acb709e3df02de5d37 pdqsort (heads/master)
 e68381300c074cbc6609a32367b9b7d93c65702e rollinghashcpp (heads/master)
 cacf7f1d4e3d44d871b605da3b647f07d718623f zlib (v1.2.11)
[gensoft@2b553ae80831 bonsai]$ cd ../distmat/
[gensoft@2b553ae80831 distmat]$ git submodule
 c9d32a81f40ad540015814edf13b29980c63e39c pybind11 (v2.2.0-223-gc9d32a8)
[gensoft@2b553ae80831 distmat]$ cd ../khset/
[gensoft@2b553ae80831 khset]$ git submodule
 f719aad5fa273424fab4b0d884d68375b7cc2520 klib (spawn-final-166-gf719aad)

Today, in order to recompile it on another machine, I cloned dashing again;
it has the following revisions of submodules and nested submodules after make update:

[gensoft@2b553ae80831 dashing-v0.3.5]$ git submodule
 2b1a48e49f19baeddf13384e528ee3bd2ff85cb6 bonsai (v0.2.2-9-g2b1a48e)
 ce5a57777cad45bb412878089afb3e30a9bccbce distmat (heads/master)
 ee5300f5c1266c963129a7f152f9c82c3a0e8b12 khset (heads/master)
[gensoft@2b553ae80831 dashing-v0.3.5]$ cd bonsai/
[gensoft@2b553ae80831 bonsai]$ git submodule
 5f48537e54e391ba1c249d2c4eddd2d1debe1f47 circularqueue (heads/master)
 742f81a66c8e2ae7889d1bc4c4b4d8734bdcd5af clhash (v0.1.0-8-g742f81a)
 2c4687431f978f02a3780e24b8b701d22aa32d9c flat_hash_map (heads/master)
+78e20281a9f5495d09a9771ff2bbd324dbf49fc9 hll (v0.6-248-g78e2028)
 f719aad5fa273424fab4b0d884d68375b7cc2520 klib (spawn-final-166-gf719aad)
 eeaaf19959c19644db134b4063aff11cc188b3bb kspp (heads/master)
 e7c5dddaf16e8f17d629f74b8f11d6d7eb258f93 lazy (heads/master)
 0de541e96ee4c88e5bd28a0e941ac91e99c3b62f libpopcnt (v2.2-20-g0de541e)
 0aa224367a0f09b5c3e5c90c715e3ff8fb2ffad3 linear (heads/master)
 65325bac67a840d2488908cfa266737297c159ff ntHash (v1.0.4-6-g65325ba)
 08879029ab8dcb80a70142acb709e3df02de5d37 pdqsort (heads/master)
 e68381300c074cbc6609a32367b9b7d93c65702e rollinghashcpp (heads/master)
 cacf7f1d4e3d44d871b605da3b647f07d718623f zlib (v1.2.11)
[gensoft@2b553ae80831 bonsai]$ cd ../distmat/
[gensoft@2b553ae80831 distmat]$ git submodule
+12e8774bc9aa4603136f2979088619b495850ca2 pybind11 (v2.2.0-234-g12e8774)
[gensoft@2b553ae80831 distmat]$ cd ../khset/
[gensoft@2b553ae80831 khset]$ git submodule
 f719aad5fa273424fab4b0d884d68375b7cc2520 klib (spawn-final-166-gf719aad)

Note that hll and pybind11 differ:

11c10
<  d516b3ed4278013e42963f551b53ecb68950dfd0 hll (v0.6-242-gd516b3e)
---
> +78e20281a9f5495d09a9771ff2bbd324dbf49fc9 hll (v0.6-248-g78e2028)
23c22
<  c9d32a81f40ad540015814edf13b29980c63e39c pybind11 (v2.2.0-223-gc9d32a8)
---
> +12e8774bc9aa4603136f2979088619b495850ca2 pybind11 (v2.2.0-234-g12e8774)
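
The comparison above can also be scripted. A minimal sketch, assuming the two listings were saved as text from `git submodule` (each line is a one-character status flag, the commit SHA, the path, and an optional description in parentheses). The function names are hypothetical.

```python
def parse_submodule_status(text):
    """Map submodule path -> (flag, sha) from `git submodule` status output."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        flag, rest = line[0], line[1:]
        if flag not in " +-U":
            # The leading status column was stripped; treat as "in sync".
            flag, rest = " ", line
        fields = rest.split()
        sha, path = fields[0], fields[1]
        result[path] = (flag, sha)
    return result


def changed_submodules(before, after):
    """Return {path: (old_sha, new_sha)} for submodules whose commit differs."""
    old = parse_submodule_status(before)
    new = parse_submodule_status(after)
    return {path: (old[path][1], new[path][1])
            for path in sorted(old.keys() & new.keys())
            if old[path][1] != new[path][1]}
```

A leading `+` in the status output means the checked-out commit differs from the one recorded in the superproject, which is what the second listing shows for hll and pybind11.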

After checking out the previous hll revision,

[gensoft@2b553ae80831 dashing-v0.3.5]$ cd bonsai/hll
[gensoft@2b553ae80831 hll]$ git checkout d516b3ed4278013e42963f551b53ecb68950dfd0

make dashing completes correctly.

The update rule breaks the submodule behavior, because it performs the following:
git checkout master && git pull

Regards,

Eric

compare with HLL

Hi,
I have a question about the output from
./dashing cmp -k10 -p5 -C -S24 --wj-exact reference_.fasta reads.fastq
An output of, for example, 0.0005 says that 0.05% of the k-mers in the union are shared.

I was surprised: when I ran the same command line with different k-mer sizes, the fraction of shared k-mers increased, although it should decrease. The results are:

k-mer size = 9 --> 06.66655 % are shared
k-mer size = 10 --> 24.9956 % are shared
k-mer size = 11 --> 60.9045 % are shared
k-mer size = 12 --> 97.622 % are shared
k-mer size = 13 --> 76.7513 % are shared
k-mer size = 14 --> 39.642 % are shared

I am puzzled that the fraction of shared k-mers increases between k-mer sizes 9 and 12. I expected it to be at its maximum at 9 and to decrease from there.

Dataset: the fastq file contains many reads from Listeria, and the fasta file contains the human reference genome.

When I ran the program with Listeria reads and a Listeria reference genome, I got the expected results: the number of shared k-mers was highest at k-mer size 9 and decreased from there.

Could you please explain why the fraction of shared k-mers increases between k-mer sizes 9 and 12?

Thanks,
Ahmad
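
For reference, the quantity described above ("fraction of the k-mers in the union that are shared") is the Jaccard index, |A ∩ B| / |A ∪ B|. A minimal exact version on plain strings is sketched below. It is an illustration on explicit k-mer sets, not dashing's sketch-based estimator, and the function names are hypothetical. K-mers are not canonicalized here.

```python
def kmer_set(seq, k):
    """All distinct k-mers of seq (no canonicalization)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def exact_jaccard(seq_a, seq_b, k):
    """|A ∩ B| / |A ∪ B| over the two k-mer sets."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    for k in (2, 3, 4):
        print(k, exact_jaccard("ACGTACGTTAGC", "TTAGCACGTA", k))
```

Running an exact computation like this on a small subset of the data is one way to check whether an unexpected trend comes from the data or from the estimator.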

Distance output formatting inconsistent - maybe output csv?

Hello,
When comparing 4 samples, the string output by dashing dist has 5 columns: ##Names (with a space!! in a tab-formatted file!! arhghghghgh) and the 4 sample names, which allows for easy parsing with pandas + StringIO:

[Screenshot: dashing dist output for 4 samples]

However, when comparing 19 samples, somehow an extra column gets added to the output so ##Names is a column with all empty distances (-) and can be readily dropped, but the other columns are informative.

[Screenshot: dashing dist output for 19 samples]

For these comparisons, I use a symmetric matrix and while it is redundant, it helps to clean up downstream calculations. To fully clean the matrix into a usable format for myself, this is my code:

from io import StringIO

import numpy as np
import pandas as pd


s = '''##Names 	A11-MAA000577-3_8_M-1-1.fastq.gz	A1-MAA100039-3_11_M-1-1.fastq.gz	A10-B000971-3_39_F-1-1.fastq.gz	A12-D041914-3_8_M-1-1.fastq.gz	A12-B001717-3_38_F-1-1.fastq.gz	A1-MAA000487-3_10_M-1-1.fastq.gz	A10-B002775-3_39_F-1-1.fastq.gz	A11-D041914-3_8_M-1-1.fastq.gz	A10-D041914-3_8_M-1-1.fastq.gz	A11-MAA100041-3_9_M-1-1.fastq.gz	A11-MAA100140-3_57_F-1-1.fastq.gz	A10-MAA000559-3_8_M-1-1.fastq.gz	A12-D042253-3_9_M-1-1.fastq.gz	A11-MAA000559-3_8_M-1-1.fastq.gz	A12-MAA000559-3_8_M-1-1.fastq.gz	A1-B002764-3_38_F-1-1.fastq.gz	A1-D042253-3_9_M-1-1.fastq.gz	A1-MAA000779-3_11_M-1-1.fastq.gz	A12-MAA000508-3_9_M-1-1.fastq.gz
A11-MAA000577-3_8_M-1-1.fastq.gz	-		0.057916	0.002939	0.022363	0.011671	0.094893	0.037217	0.075855	0.067865	0.064702	0.035617	0.004213	0.048542	0.016092	0.064480	0.032748	0.034813	0.013605	0.03673
A1-MAA100039-3_11_M-1-1.fastq.gz	-	-		0.036855	0.048998	0.045185	0.049695	0.055693	0.077198	0.066881	0.187610	0.030762	0.000000	0.072976	0.000000	0.030457	0.000835	0.070249	0.021217	0.03125
A10-B000971-3_39_F-1-1.fastq.gz	-	-	-		0.028069	0.000000	0.052360	0.032645	0.042451	0.058878	0.037838	0.021614	0.000000	0.012244	0.017612	0.023235	0.035272	0.044574	0.014970	0.00000
A12-D041914-3_8_M-1-1.fastq.gz	-	-	-	-		0.001242	0.073876	0.073333	0.120622	0.080829	0.097877	0.035713	0.025949	0.063141	0.000000	0.064814	0.032029	0.045404	0.024794	0.02730
A12-B001717-3_38_F-1-1.fastq.gz	-	-	-	-	-		0.023479	0.002168	0.057410	0.019085	0.036123	0.000000	0.010161	0.020215	0.033729	0.015095	0.025822	0.000000	0.012444	0.02335
A1-MAA000487-3_10_M-1-1.fastq.gz	-	-	-	-	-	-		0.020419	0.111309	0.117554	0.068591	0.049916	0.033270	0.081094	0.019418	0.061806	0.049051	0.074235	0.000000	0.04432
A10-B002775-3_39_F-1-1.fastq.gz	-	-	-	-	-	-	-		0.075102	0.097076	0.061366	0.023590	0.016296	0.067000	0.000000	0.054365	0.023102	0.067332	0.026361	0.02222
A11-D041914-3_8_M-1-1.fastq.gz	-	-	-	-	-	-	-	-		0.132616	0.099790	0.048618	0.035936	0.110347	0.032366	0.060814	0.030843	0.085139	0.047918	0.03960
A10-D041914-3_8_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-		0.085162	0.048054	0.055217	0.115842	0.055140	0.054070	0.028158	0.077156	0.049819	0.03790
A11-MAA100041-3_9_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-		0.065413	0.047860	0.118932	0.045932	0.052743	0.027070	0.080519	0.019320	0.03535
A11-MAA100140-3_57_F-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-		0.043141	0.014001	0.013765	0.039060	0.055096	0.048396	0.009868	0.01870
A10-MAA000559-3_8_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-	-		0.047940	0.014161	0.075492	0.014848	0.000000	0.036942	0.01477
A12-D042253-3_9_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-	-	-		0.000517	0.051918	0.000000	0.095890	0.000000	0.03791
A11-MAA000559-3_8_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-		0.009891	0.000000	0.019100	0.045230	0.00565
A12-MAA000559-3_8_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-		0.012562	0.061699	0.041696	0.04132
A1-B002764-3_38_F-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-		0.039803	0.055432	0.05948
A1-D042253-3_9_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-		0.014889	0.02094
A1-MAA000779-3_11_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-		0.03694
A12-MAA000508-3_9_M-1-1.fastq.gz	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-	-'''


# Read string as a file handle with StringIO
dashing_similarities = pd.read_table(StringIO(s), index_col=0)

# Put true NaNs into place
dashing_similarities = dashing_similarities.replace('-', np.nan)

# Force all data to be float (since there were '-'s before, these were interpreted as strings and the whole column was formatted as a string object)
dashing_similarities = dashing_similarities.astype(float)

# Remove uninformative column
dashing_similarities = dashing_similarities.drop(['##Names '], axis=1)

# Remove .fastq.gz from column name
dashing_similarities.columns = dashing_similarities.columns.str.split('.').str[0]

# Unify column and row name (index = row names in pandas)
dashing_similarities.index = dashing_similarities.columns

# Use upper triangle of matrix to fill lower triangle and make a symmetric matrix
data = np.triu(dashing_similarities) + np.triu(dashing_similarities).T

# Make a pandas dataframe with the symmetric matrix
dashing_similarities = pd.DataFrame(data, index=dashing_similarities.index, 
                                              columns=dashing_similarities.columns)

# Remaining NAs are on the diagonal, so replace with 1 since self-similarity is perfect
dashing_similarities = dashing_similarities.fillna(1)

# Show dataframe
dashing_similarities

And here's the reformatted dataframe:

[Screenshot: reformatted dataframe]

To (hopefully) alleviate formatting issues, would it be possible to output a ready-formatted csv of the comparisons?

Since the pairwise distance matrices can get enormous, another option would be to output a "tidy" formatted table of just the n(n-1)/2 pairwise comparisons, instead of the whole n^2 matrix. For example, using dashing_similarities from above, one can reformat to a tidy data table of cell_a, cell_b and the similarity this way:

[Screenshot: tidy table of cell_a, cell_b and similarity]

(This particular dataframe still includes the (n-1)^2 comparisons, but doesn't have to)

What do you think?
Warmest,
Olga
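
For anyone wanting the tidy layout without the redundant rows, here is a minimal standard-library sketch that emits one row per unordered pair, n(n-1)/2 rows in total. It assumes a list of sample names and a square symmetric matrix as nested lists (for example `dashing_similarities.values.tolist()` from the code above). The function name is hypothetical.

```python
from itertools import combinations


def tidy_pairs(names, matrix):
    """Return (cell_a, cell_b, similarity) for each unordered pair of samples.

    matrix is a square, symmetric list of lists aligned with names; only the
    upper triangle (i < j) is read, so the diagonal and mirrored entries are
    skipped.
    """
    return [(names[i], names[j], matrix[i][j])
            for i, j in combinations(range(len(names)), 2)]


if __name__ == "__main__":
    names = ["a", "b", "c"]
    matrix = [[1.0, 0.1, 0.2],
              [0.1, 1.0, 0.3],
              [0.2, 0.3, 1.0]]
    for cell_a, cell_b, similarity in tidy_pairs(names, matrix):
        print(cell_a, cell_b, similarity, sep="\t")
```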
