jermp / lphash

Fast and compact locality-preserving minimal perfect hashing for k-mer sets.

License: MIT License

CMake 1.39% C++ 85.61% Shell 13.01%

bioinformatics hashing k-mers minimal-perfect-hash locality-preserving

lphash's Issues

Code style / refactoring

Use #pragma once directive instead of if-define guards.
Move all code not strictly related to LPHash to external/ (or another directory), such as kseq.h and BooPHF.hpp.
Delete unused scripts and code (such as prettyprint.hpp)
Properly format all the code using the rules in .clang-format. For example: do not use if-else without braces. Only one-liner ifs may omit braces.
Keep in src/ only the executables for building and performing queries. We can also have a single driver program named lphash that takes as input an argument build or query specifying the sub-tool to use. See an example here: https://github.com/jermp/sshash-lite/blob/main/src/sshash-lite.cpp. This also implies having a single tool to build both data structures, partitioned and un-partitioned (currently called -alt).
Move tests into a separate folder called tests/.
Use a build_configuration class to build the data structures with default parameters, as used here and here for example. A build_configuration object is then passed as input to the constructor of the mphf.
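
The last item could look roughly like the sketch below. This is only an illustration: the field names mirror the command-line flags that appear elsewhere on this page (-k, -m, -c, -t, --check, --verbose), while the default values and the class mphf_sketch are hypothetical and are not taken from the lphash code base.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: field names follow the CLI flags quoted in the issues,
// but every default value here is an assumption chosen for illustration.
struct build_configuration {
    std::uint64_t k = 31;            // k-mer length (-k)
    std::uint64_t m = 15;            // minimizer length (-m)
    double c = 3.0;                  // -c parameter (assumed default)
    std::uint64_t num_threads = 1;   // -t
    std::string tmp_dirname = ".";   // assumed: directory for temporary files
    bool check = false;              // --check
    bool verbose = false;            // --verbose
};

// Hypothetical stand-in for the mphf class: it only stores the configuration
// to show how the object would be passed to the constructor.
class mphf_sketch {
public:
    explicit mphf_sketch(build_configuration const& config) : m_config(config) {}
    build_configuration const& config() const { return m_config; }

private:
    build_configuration m_config;
};
```

A caller would default-construct a build_configuration, override only the fields it cares about (for example k = 63 and m = 28), and pass it to the constructor.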

A kmer and its reverse complement hash to different values

Hello,
Based on my testing, lphash doesn't convert k-mers into their canonical representation. I would expect a k-mer and its reverse complement to hash to the same value.

For example, BCALM 2 converts all k-mers into their canonical representation with respect to reverse-complements.

I believe that the kmers will need to be converted to their canonical form both during the building and the querying steps. Thanks for any insight.
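
A minimal sketch of caller-side canonicalization, assuming the usual convention that the canonical form is the lexicographically smaller of a k-mer and its reverse complement. The function names are hypothetical and this is not lphash code; whether canonicalized input fits lphash's build pipeline is not confirmed in this thread.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Complement of a single uppercase base; anything else is rejected.
inline char complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default: throw std::invalid_argument("unsupported base");
    }
}

// Reverse the k-mer, then complement every base.
inline std::string reverse_complement(std::string const& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}

// Canonical form: the lexicographic minimum of the k-mer and its reverse complement.
inline std::string canonical(std::string const& kmer) {
    std::string rc = reverse_complement(kmer);
    return std::min(kmer, rc);
}
```

Applying canonical() to every k-mer before both building and querying would make a k-mer and its reverse complement map to the same key.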

Also, what is the expected output when you query lphash for a kmer that is not in the database? Does lphash return a particular value? I'm not 100% sure, but I think right now it will still output a hash number (that will collide with another kmer actually in the database). Thanks for any insights.
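
In general, a minimal perfect hash function only guarantees distinct values for the keys it was built on; for any other key the returned value is arbitrary. One common workaround is to keep the indexed k-mers in an array addressed by hash value and compare on lookup. The sketch below assumes a generic hash callable and a hypothetical class verified_index; it does not use lphash's actual API.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical wrapper: `hash` stands in for any MPHF built over `kmers`,
// i.e. it maps the n indexed k-mers to distinct values in [0, n).
class verified_index {
public:
    using hash_fn = std::function<std::uint64_t(std::string const&)>;

    verified_index(std::vector<std::string> const& kmers, hash_fn hash)
        : m_hash(std::move(hash)), m_keys(kmers.size()) {
        // Store each k-mer in the slot given by its hash value.
        for (auto const& kmer : kmers) m_keys[m_hash(kmer)] = kmer;
    }

    // Returns true and sets `out` only if `kmer` is one of the indexed k-mers.
    bool lookup(std::string const& kmer, std::uint64_t& out) const {
        std::uint64_t h = m_hash(kmer);
        if (h < m_keys.size() && m_keys[h] == kmer) {
            out = h;
            return true;
        }
        return false;  // arbitrary hash of a k-mer that was never indexed
    }

private:
    hash_fn m_hash;
    std::vector<std::string> m_keys;
};
```

Storing the full keys costs extra space; a shorter fingerprint per slot would reduce that cost while allowing a small false-positive rate.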

Associating metadata (satellite values) with kmers

Hello and thank you for this great tool! I was wondering if it would be possible to associate metadata with the kmers. The paper mentions abundance counts, reference identifiers, or contig identifiers as possible satellite values. How would one go about querying your lphash database for this information? Thanks for any insights.

(I understand that the database size and database construction time would not be optimized for this use case, but I am most interested in keeping a fast lookup time.)
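
A common pattern for satellite values with any minimal perfect hash function is to keep them in a plain array and use the hash value as the index. The sketch below assumes a generic hash callable and a hypothetical class satellite_table with 32-bit values; it is not part of lphash.

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: `hash` stands in for an MPHF built over exactly the
// k-mers in `entries`, mapping them to distinct values in [0, entries.size()).
class satellite_table {
public:
    using hash_fn = std::function<std::uint64_t(std::string const&)>;

    satellite_table(std::vector<std::pair<std::string, std::uint32_t>> const& entries,
                    hash_fn hash)
        : m_hash(std::move(hash)), m_values(entries.size(), 0) {
        // Place each value (e.g. an abundance count) in the slot of its k-mer.
        for (auto const& e : entries) m_values[m_hash(e.first)] = e.second;
    }

    // One hash evaluation plus one array access. Only meaningful for k-mers
    // that were indexed; at() throws instead of reading out of bounds.
    std::uint32_t value_of(std::string const& kmer) const {
        return m_values.at(m_hash(kmer));
    }

private:
    hash_fn m_hash;
    std::vector<std::uint32_t> m_values;
};
```

Lookup time is then dominated by the hash evaluation itself, which matches the stated priority of keeping queries fast.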

Building Time

Hello,

The building time for me seems to be much higher than the results in Table 5 of the paper. What value of k corresponds to the results in Table 5? I downloaded both the k=31 and k=63 human files from Zenodo and ran lphash to build the database.

Specifically, I ran:

lphash build-p -i human.k63.unitigs.fa.ust.fa.gz -c 5 -k 63 -m 28 --check -o human_c5_k63_m28.lph --verbose -t 4

How long is this expected to take? Thank you.

Different hashes error for S. cerevisiae

Hello,

I ran lphash on the S. cerevisiae reference genome and received the following error:

[Error] different hashes, maybe there were some Ns in the input (not supported as of now)

Specifically, I first ran bcalm with -kmer-size 31 and -abundance-min 1, then I ran ust with -k 31, then I ran lphash build-p with -k 31 and -m 15. As far as I can tell, there are no Ns in the genome.

Is this an error I should be concerned about? Thanks for your assistance.

I do notice that the genome has upper and lower case letters. Does lphash convert everything to uppercase for both building and querying?
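
The thread does not say whether lphash normalizes case. As a precaution, the input can be checked before building with something like the sketch below, which uppercases a sequence line and reports the first character outside A, C, G, T. The function names are hypothetical, and FASTA header lines are assumed to be skipped by the caller.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Return an uppercase copy of a sequence line.
inline std::string to_upper_dna(std::string seq) {
    for (char& ch : seq) {
        ch = static_cast<char>(std::toupper(static_cast<unsigned char>(ch)));
    }
    return seq;
}

// Position of the first character that is not A, C, G or T after uppercasing,
// or std::string::npos if the whole line is supported.
inline std::size_t first_unsupported_base(std::string const& seq) {
    return to_upper_dna(seq).find_first_not_of("ACGT");
}
```

Running first_unsupported_base over every sequence line would confirm whether the genome really contains no Ns or other IUPAC codes.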

Multithreading for query-p command

Print statistics in JSON format rather than CSV

Update external dependencies
