lphash's People

Contributors

jermp

Forkers

clayne tbenavi1

lphash's Issues

Code style / refactoring

  1. Use the #pragma once directive instead of #ifndef/#define include guards.
  2. Move all code not strictly related to LPHash to external/ (or another directory), such as kseq.h and BooPHF.hpp.
  3. Delete unused scripts and code (such as prettyprint.hpp).
  4. Format all code according to the rules in .clang-format. For example, do not write if-else without braces; only one-line ifs may omit them.
  5. Keep in src/ only the executables for building and querying. We could also have a single driver program named lphash that takes an argument, build or query, specifying the sub-tool to use. See an example here: https://github.com/jermp/sshash-lite/blob/main/src/sshash-lite.cpp. This also implies having a single tool to build both data structures, partitioned and un-partitioned (currently called -alt).
  6. Move tests into a separate folder called tests/.
  7. Use a build_configuration class to build the data structures with default parameters, as used here and here for example. A build_configuration object is then passed as input to the constructor of the mphf.
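
A minimal sketch of items 5 and 7, assuming nothing about the actual lphash sources: every field name, default value, and helper below (build_configuration's members, parse_subtool, is_valid, mphf_stub) is hypothetical and only illustrates the shape of the proposal. In a header, the first line would be #pragma once (item 1).

```cpp
// In a header file this would start with: #pragma once
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical parameter bundle; names and defaults are illustrative only.
struct build_configuration {
    uint64_t k = 31;            // k-mer length
    uint64_t m = 15;            // minimizer length
    uint64_t num_threads = 1;
    bool partitioned = true;    // one tool for both variants (item 5)
    bool verbose = false;
};

// Hypothetical sanity check on the parameters.
inline bool is_valid(build_configuration const& c) {
    return c.m > 0 && c.m <= c.k && c.num_threads > 0;
}

// Sub-tools selected by the first argument of a single driver program.
enum class subtool { build, query, unknown };

inline subtool parse_subtool(std::string const& name) {
    if (name == "build") return subtool::build;
    if (name == "query") return subtool::query;
    return subtool::unknown;
}

// Stand-in for the mphf: the configuration is passed to the constructor (item 7).
struct mphf_stub {
    explicit mphf_stub(build_configuration const& config) : m_config(config) {
        assert(is_valid(config));
    }
    build_configuration const& config() const { return m_config; }

private:
    build_configuration m_config;
};
```

A driver's main would then call parse_subtool(argv[1]) and forward the remaining arguments to the chosen sub-tool.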

A kmer and its reverse complement hash to different values

Hello,
Based on my testing, lphash doesn't convert kmers into their canonical representation. I would expect a kmer and its reverse complement to hash to the same value.

For example, BCALM 2 converts all k-mers into their canonical representation with respect to reverse-complements.

I believe that the kmers will need to be converted to their canonical form both during the building and the querying steps. Thanks for any insight.
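
To make "canonical form" concrete, here is a small string-based sketch, assuming the common convention that the canonical kmer is the lexicographically smaller of a kmer and its reverse complement. This is not lphash code and says nothing about how lphash handles strands; the function names are my own.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Complement of a single upper-case nucleotide.
inline char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:
            assert(false && "expected A, C, G or T");
            return 'N';
    }
}

// Reverse the kmer, then complement every base.
inline std::string reverse_complement(std::string const& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}

// Canonical form: the smaller of the kmer and its reverse complement.
inline std::string canonical(std::string const& kmer) {
    std::string rc = reverse_complement(kmer);
    return std::min(kmer, rc);
}
```

With this convention, canonical("GATTACA") and canonical("TGTAATC") give the same string, so the two strands would be presented to the hash function as one key.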

Also, what is the expected output when you query lphash for a kmer that is not in the database? Does lphash return a particular value? I'm not 100% sure, but I think right now it will still output a hash number (that will collide with another kmer actually in the database). Thanks for any insights.

Associating metadata (satellite values) with kmers

Hello and thank you for this great tool! I was wondering if it would be possible to associate metadata with the kmers. The paper mentions abundance counts, reference identifiers, or contig identifiers as possible satellite values. How would one go about querying your lphash database for this information? Thanks for any insights.

(I understand that the database size and database construction time would not be optimized for this use case, but I am most interested in keeping a fast lookup time.)
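
For what it's worth, the usual way to attach satellite values to a minimal perfect hash function is to keep them in a plain array indexed by the hash value. A minimal sketch, assuming a hasher that maps each of the n indexed kmers to a distinct integer in [0, n); the satellite_table type and the Hasher parameter are hypothetical and not part of lphash.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Values stored in an array whose positions are given by the hash function.
// Hasher is any callable taking a kmer string and returning an index in [0, num_keys).
template <typename Hasher, typename Value>
struct satellite_table {
    satellite_table(Hasher hasher, uint64_t num_keys)
        : m_hasher(hasher), m_values(num_keys) {}

    void set(std::string const& kmer, Value const& value) {
        m_values[index(kmer)] = value;
    }

    Value const& get(std::string const& kmer) const {
        return m_values[index(kmer)];
    }

private:
    uint64_t index(std::string const& kmer) const {
        uint64_t i = m_hasher(kmer);
        assert(i < m_values.size());
        return i;
    }

    Hasher m_hasher;
    std::vector<Value> m_values;
};
```

Lookup stays a single hash evaluation plus one array access. Note that a minimal perfect hash function is, in general, only defined on the keys it was built from, so a kmer outside the set would return the value of some other kmer unless membership is checked separately.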

Building Time

Hello,

The building time for me seems to be much higher than the results in Table 5 of the paper. What value of k corresponds to the results in Table 5? I downloaded both the k=31 and k=63 human files from Zenodo and ran lphash to build the database.

Specifically, I ran:

lphash build-p -i human.k63.unitigs.fa.ust.fa.gz -c 5 -k 63 -m 28 --check -o human_c5_k63_m28.lph --verbose -t 4

How long is this expected to take? Thank you.

Different hashes error for S. cerevisiae

Hello,

I ran lphash on the S. cerevisiae reference genome and received the following error:

[Error] different hashes, maybe there were some Ns in the input (not supported as of now)

Specifically, I first ran bcalm with -kmer-size 31 and -abundance-min 1, then I ran ust with -k 31, then I ran lphash build-p with -k 31 and -m 15. As far as I can tell, there are no Ns in the genome.

Is this an error I should be concerned about? Thanks for your assistance.

I do notice that the genome has upper and lower case letters. Does lphash convert everything to uppercase for both building and querying?
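
In case it helps with debugging, this is the kind of pre-check I have in mind, written as a standalone sketch (not lphash code; the helper names are mine): upper-case the sequence, report the first character outside A, C, G, T, and count the k-length windows that contain only those four letters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Upper-case copy of a sequence (soft-masked genomes use lower-case letters).
inline std::string to_upper(std::string s) {
    for (char& c : s) c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return s;
}

// Position of the first character that is not A, C, G or T after
// upper-casing, or std::string::npos if there is none.
inline std::size_t first_non_acgt(std::string const& seq) {
    return to_upper(seq).find_first_not_of("ACGT");
}

// Number of k-length windows made only of A, C, G, T (after upper-casing).
inline std::size_t count_clean_kmers(std::string const& seq, std::size_t k) {
    assert(k > 0);
    std::size_t run = 0;    // length of the current run of valid bases
    std::size_t count = 0;
    for (char c : to_upper(seq)) {
        bool valid = (c == 'A' || c == 'C' || c == 'G' || c == 'T');
        run = valid ? run + 1 : 0;
        if (run >= k) ++count;
    }
    return count;
}
```

Running first_non_acgt over each input record would show whether any N or other IUPAC code survives in the unitig file.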

Multithreading for query-p command

Hello,

Would it be possible to implement multiple threads for the query-p command? In particular, a very simple implementation would just be to have each thread process a different subset of the sequences in the input fasta file. Thanks for any assistance.
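
Roughly what I have in mind, as a sketch only: the sequences are assumed to be already loaded in memory, and the query callable stands in for whatever query-p does per sequence (it must be safe to call from several threads at once). parallel_query is a hypothetical name, not an existing lphash function.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Apply `query` to every sequence using num_threads threads.
// Thread t handles sequences t, t + num_threads, t + 2 * num_threads, ...
// results[i] holds the output for sequences[i], so the input order is kept.
template <typename Query>
std::vector<std::size_t> parallel_query(std::vector<std::string> const& sequences,
                                        std::size_t num_threads, Query query) {
    assert(num_threads > 0);
    std::vector<std::size_t> results(sequences.size());
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t != num_threads; ++t) {
        workers.emplace_back([&, t]() {
            // Each thread writes to distinct slots, so no locking is needed.
            for (std::size_t i = t; i < sequences.size(); i += num_threads) {
                results[i] = query(sequences[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Since the built function is only read during queries, the threads would not need to synchronize with each other, assuming the query path itself keeps no shared mutable state.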
