Git Product home page Git Product logo

Comments (13)

jermp avatar jermp commented on September 26, 2024

Your command seems correct. Note that the "checking" time should not count towards the construction time.
For the benchmark in Table 5 of the paper does not include this time (which is needed to check the correctness of the function). Also, be sure you compiled and run the code in Release mode.

How long is this expected to take?

Approximately what we reported in Table 5. (The actual runtime obviously depends on the specific processor.)

Let me know!

from lphash.

tbenavi1 avatar tbenavi1 commented on September 26, 2024

It's been running for over 4 hours so far. The latest line in the log:

2023-07-30 15:14:16: 1259379 buckets done in 8207 seconds (14.7426% of keys, 5% of buckets, 3.72143 trials per bucket, 3.72144 expected trials per bucket)

I also compiled in Release mode according to the instructions on github. I'm running on a computing cluster so the processors should be fast.

from lphash.

jermp avatar jermp commented on September 26, 2024

That's strange. I cannot replicate your issue on my end. I've just re-built a lphash (partitioned) function with

/usr/bin/time ./lphash build-p -i /data2/DNA/lphash_datasets/human.k63.unitigs.fa.ust.fa.gz -k 63 -m 28 -t 4 -c 5.0 -d tmp_dir -o human.k64.lphash --verbose

and got

1154.18user 17.66system 7:26.27elapsed

so exactly what I reported in the paper.

The line that your reported looks like this on my end

2023-07-31 07:48:17: 1259379 buckets done in 0 seconds (14.7426% of keys, 5% of buckets, 3.72143 trials per bucket, 3.72144 expected trials per bucket) 

0 seconds!

I would do a fresh clone of the repo and re-compilation.

from lphash.

tbenavi1 avatar tbenavi1 commented on September 26, 2024

Hello,

Based on some testing, I think there may be several issues.

First, I think increasing the number of threads above some threshold eventually leads to a performance downgrade.

At one point, when I tried to run with 64 threads, I received the following error after it ran for some time:

terminate called after throwing an instance of 'std::invalid_argument'
  what():  parallel search should use at most 36 threads

However, when I run with 32 threads I no longer receive the error, but the runtime is longer than when I run with 4 threads.

I also receive another error when running the following code on T2T human genome input:

lphash build-p -i chm13v2.0.k25.unitigs.fa.ust.fa.gz -k 25 -m 17 -t 4 -c 5.0 -d tmp3 -o chm13v2.0_c5_k25_m17.lph --verbose
2023-07-31 08:58:06: search ends
 == 1743363 empty buckets (2.71267%)
 == total trials = 13063020969
 == total expected trials = 13062410860
 == search took 304.757 seconds
Part 3: build inverted index
Percentage of maximal super-k-mers: 26.1134%
Percentage of left-maximal super-k-mers: 26.7672%
Percentage of right-maximal super-k-mers : 26.726%
Percentage of unclassified super-k-mers: 20.3935%
Total = 100%
Percentage of ambiguous minimizers: 16.9299%
Part 4: build fallback MPHF
c = 5
alpha = 0.94
num_keys = 955743096
table_size = 1016747974
num_buckets = 160187311
using 8 GB of RAM (0.127094 GB occupied by the bitmap)
using a peak of 28.0423 GB of disk space
 == processed 49% keys from inputterminate called after throwing an instance of 'std::runtime_error'                                                
  what():  seed did not work
Aborted (core dumped)

Additionally, I am wondering if lphash will get confused if you start multiple runs in the same directory (and do not set the temp directory with -d). It seems that when I set the temp directory, lphash seemed to perform better.

The only possible other difference I could see for the performance downgrade would be the order of the commands between the command I ran and the command you ran, but I doubt that that is the issue.

In summary: I can now get lphash to run quicker on your human dataset, though I am only using 4 threads. I cannot get lphash to run on a T2T human data I generated, due to a "seed did not work" error.

from lphash.

jermp avatar jermp commented on September 26, 2024

However, when I run with 32 threads I no longer receive the error, but the runtime is longer than when I run with 4 threads.

Yes, it could be, due to contention. PTHash (the general purpose MPHF that LPHash is using for the minimizers) uses a complicated parallel search procedure. If you want to boost scalability, you should use a PTHash-HEM function. I can explain to you who to use it if you want.

I never tried LPHash on k=25. Could you try to use the uint64_t type here https://github.com/jermp/lphash/blob/main/include/compile_constants.tpd instead of __uint128_t ?

from lphash.

tbenavi1 avatar tbenavi1 commented on September 26, 2024

I switched to uint64_t in compile_constants.tpd and recompiled. This time, it worked! I did not get a "seed did not work" error.

I would also be interested in the PTHash-HEM function. Thanks.

from lphash.

jermp avatar jermp commented on September 26, 2024

Ok, cool! I can explain via emails if you prefer.
In the meanwhile, refer to the HEM description here https://arxiv.org/pdf/2106.02350.pdf.

from lphash.

tbenavi1 avatar tbenavi1 commented on September 26, 2024

Thanks. Now I am running into another issue. It seems like lphash is not generating a minimal perfect hash. Specifically, there are several different kmers which are all hashing to the same value.
For example: TAACTGTGAGATGTAAGCACACATC, TTTTATTTTGCAGTAATGGCCAATA, and TTTTCCTGAAGATGTATCTATGACC all hash to 2391456533. As far as I'm aware, these three kmers were all part of the original input that was used to generate the lphash file. It seems like the hashes are close to a minimal perfect hash, but there are still some gaps and repeats.

Please let me know if you would like me to send you any input files for testing. I am currently just running k=25 with the T2T human genome. I can also see if k=31 has the same problem.

from lphash.

jermp avatar jermp commented on September 26, 2024

We never tested any k < 31, so please send me your input file.
CC: @yhhshb.

Also, why can't you use k=31?

from lphash.

tbenavi1 avatar tbenavi1 commented on September 26, 2024

I just emailed you my input file. I am also in the process of running with k=31. I will let you know my results. Thanks.

from lphash.

jermp avatar jermp commented on September 26, 2024

Hello, I tried your file with

/usr/bin/time -v ./lphash build-p -i chm13v2.0.k25.unitigs.fa.ust.fa.gz -k 25 -m 17 -t 8 -c 7.0 -d tmp_dir -o test.lphash --verbose --check

and everything looks good. No hash collisions: the --check flag enables the correctness checks that did not report any problem.

I've also indexes all kmers using SSHash and again I did not see any collisions. The SSHash command used was:

./sshash build -i ~/lphash/build/chm13v2.0.k25.unitigs.fa.ust.fa.gz -k 25 -m 17 -o chm13.sshash -d tmp --verbose --check

from lphash.

tbenavi1 avatar tbenavi1 commented on September 26, 2024

Thank you so much. I am able to get lphash to work properly on my files now. I believe my issue was the compilation constant as described above.

I did want to let you know of two other bugs I found: If you run the query command on a fasta file containing a sequence that is smaller than the kmer size k, lphash crashes. Also, I tried to install lphash on a different cluster with an older operating system, but I get the following error when trying to run lphash:

Illegal instruction (core dumped)

I'm not sure if there is some os/architecture requirement. lphash works for "Intel(R) Xeon(R) Gold 6154 CPU @ 3.00GHz" running "Rocky Linux 8.4 (Green Obsidian)" but not "Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz" running "CentOS Linux 7 (Core)".

from lphash.

jermp avatar jermp commented on September 26, 2024

I did want to let you know of two other bugs I found: If you run the query command on a fasta file containing a sequence that is smaller than the kmer size k, lphash crashes.

Thanks for reporting. Yeah, most likely a trivial edge case that we did not care about. We will fix it.

The Illegal Instruction error is common nowadays, due to the different architectures. That could be caused, e.g., by PDEP or POPCNT missing.

Let me know if you need further assistance.

from lphash.

Related Issues (8)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.