Comments (9)

tmaklin commented on July 28, 2024

Thanks for the quick reply! I think I changed the wrong line (https://github.com/jermp/sshash/blob/master/include/kmer.hpp#L6); I will let you know if this works.

Just testing fulgor vs themisto v3 with mSWEEP for a certain downstream application for now.

from fulgor.

tmaklin commented on July 28, 2024

I forgot to mention that this happens after fulgor enters the following step:

building minimizers MPHF (PTHash) with 8 threads...

jermp commented on July 28, 2024

Hi,
yes, it means there are more than 1B minimizers for m=20, hence the minimal perfect hash function used in SSHash (which is PTHash) would require 128-bit hash codes to be safe.
It should be enough to change the default hasher in SSHash for minimizers, here: https://github.com/jermp/sshash/blob/master/include/hash_util.hpp#L51.
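
To make the 64-bit versus 128-bit reasoning concrete, here is a minimal sketch of the standard birthday-bound estimate. It assumes hash codes are uniformly random, and the function names are hypothetical helpers for illustration only; they are not part of SSHash or PTHash.

```python
import math


def expected_collisions(num_keys: int, hash_bits: int) -> float:
    """Expected number of colliding key pairs, assuming uniform random hashes."""
    pairs = num_keys * (num_keys - 1) / 2
    return pairs / 2.0 ** hash_bits


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate probability of at least one collision (birthday bound)."""
    # 1 - exp(-x), computed with expm1 to stay accurate for very small x.
    return -math.expm1(-expected_collisions(num_keys, hash_bits))


if __name__ == "__main__":
    for bits in (64, 128):
        p = collision_probability(10**9, bits)
        print(f"{bits}-bit codes, 1e9 keys: P(collision) ~ {p:.3g}")
```

Under these assumptions, one billion keys with 64-bit codes give a collision probability of a few percent, while 128-bit codes make it negligible. Two distinct keys with the same hash code cannot be told apart by a hash-based construction, which is why the wider codes are the safe choice at this scale.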

jermp commented on July 28, 2024

Should you have some update on the Themisto index, I'd love to know. Thanks!

rob-p commented on July 28, 2024

Thanks for the info @tmaklin :). One thing (of which I'm certain you are aware) is that the current approaches for reporting the pseudoalignment results are far from space-optimal.

I'm aware this is something you've worked on in alignment-writer. If it makes sense to design and converge on a more standard and compact output format for these tools, this is something that we'd certainly be interested in. For example, we have a binary format (the RAD format) that we use in alevin-fry that addresses a variant of this problem. However, it would be nice to generalize and to understand what information makes sense for different use cases.

Cheers,
Rob

tmaklin commented on July 28, 2024

Indexing the data overnight worked, so closing this. Thanks!

Re the file format, a standardized and compact output for all the different tools sounds great! Alignment-writer (I think I need to figure out a better name) is a wrapper around BitMagic and achieves anything from 10x to 100x compression on the test cases I used while developing, but the efficiency naturally depends on the complexity of the alignment. I'll be adding support for the format fulgor currently uses soon.

Some issues I've noticed with the formats while developing and using the various tools, roughly in order of headaches caused:

  • Total number of alignment targets can't be inferred with certainty from the format.
  • Not printing empty lines for no alignments.
  • Fragment names instead of the position of the read in the fastq files (makes sorting the file difficult and slows down matching the alignments with the reads if the file is not sorted).
  • Total number of reads can't be inferred.
  • Multiple files to store the results (for example unique alignments + their counts).
  • I tend to prefer formats that support streaming the results rather than having to wait for the whole alignment to finish, or conversely read in the whole file before processing the results.
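
As a starting point for discussion, the points above could be addressed by something like the following sketch. It is purely hypothetical and is not the format used by fulgor, Themisto, or alignment-writer: the `#targets` and `#reads` marker lines and both function names are assumptions made for illustration. It writes a single stream, states the number of targets up front, emits one record per read keyed by the read's position in the input (including an empty record when there are no alignments), and appends the read count once the stream ends.

```python
import io
from typing import Iterable, Iterator, List, TextIO, Tuple


def write_alignments(out: TextIO, num_targets: int,
                     hits_per_read: Iterable[List[int]]) -> None:
    """Stream one tab-separated record per read, in input order."""
    # Header: the total number of alignment targets is stated explicitly.
    out.write(f"#targets\t{num_targets}\n")
    num_reads = 0
    for read_pos, hits in enumerate(hits_per_read):
        # A read with no hits still gets a record holding only its position.
        out.write("\t".join([str(read_pos), *map(str, hits)]) + "\n")
        num_reads += 1
    # Trailer: the read count is only known once the stream is exhausted.
    out.write(f"#reads\t{num_reads}\n")


def read_alignments(inp: TextIO) -> Iterator[Tuple[int, List[int]]]:
    """Yield (read position, target ids) lazily, skipping marker lines."""
    for line in inp:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith("#"):
            continue
        yield int(fields[0]), [int(x) for x in fields[1:]]


if __name__ == "__main__":
    buf = io.StringIO()
    write_alignments(buf, 3, [[0, 2], [], [1]])
    buf.seek(0)
    for read_pos, hits in read_alignments(buf):
        print(read_pos, hits)
```

Both functions work record by record, so a consumer can start processing before the aligner has finished. A real format would likely be binary and compressed; plain text is used here only to keep the sketch short.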

Would be nice to discuss/work on this at some point.

jermp commented on July 28, 2024

Hi @tmaklin,
yes, a common alignment format would be very nice. Happy to work on this together if you like.

rob-p commented on July 28, 2024

Likewise. I have some thoughts on this as well :). Shall we open a separate issue for it? Or, even better, @jermp, could you enable “discussions” on this repo?

jermp commented on July 28, 2024

Good idea, discussions enabled!
