
raptor's Introduction


A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

Download and Installation

See the documentation for detailed information on how to download, install, and compile Raptor.

There may be performance benefits when compiling from source, as Raptor can be optimized for the host system.

Install with bioconda (Linux)

conda install -c bioconda -c conda-forge raptor

Example Data and Usage

See the documentation for usage.

raptor --help will show available commands. Each command will have a respective help page that can be shown via, e.g., raptor build --help.

Authorship and Copyright

Raptor is being developed by Enrico Seiler, but also incorporates much work from other members of the SeqAn team.

Citation

Please cite the following in academic works, including comparisons and pipelines:

  • Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences; Enrico Seiler, Svenja Mehringer, Mitra Darvish, Etienne Turc, and Knut Reinert; iScience 2021 24 (7): 102782. doi: https://doi.org/10.1016/j.isci.2021.102782
  • Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries; Svenja Mehringer, Enrico Seiler, Felix Droop, Mitra Darvish, René Rahn, Martin Vingron, and Knut Reinert; Genome Biol 24, 131 (2023). doi: https://doi.org/10.1186/s13059-023-02971-4

RECOMB 2021

Raptor was presented at the 25th International Conference on Research in Computational Molecular Biology:

Please see the License section for information on allowed use.

Supplementary

The subdirectory util contains applications and scripts related to papers.

License

Raptor is open-source software. However, certain conditions apply when you (re-)distribute and/or modify Raptor; please see the license for details.

raptor's People

Contributors

dependabot[bot], eseiler, irallia, mitradarja, mr-c, seqan-actions, sgssgene, smehringer


raptor's Issues

Rebase on upstream SeqAn3

  • Rebase and use seqan3::interleaved_bloom_filter and the counting_agent (#19)
  • Allow compression formats for input (lost in rebase) (#20)

Increase code coverage

Make sure everything that should be tested, is tested. Exclude the other code from analysis.

CMake

  • Add find_package for SeqAn3
  • Add find_package for googletest/googlebenchmark, use fetch_content if not found

Index should store k and w

The build step should store the k-mer size and window size so that the search step no longer needs these parameters.

  • Introduce class raptor_index that wraps a seqan3::interleaved_bloom_filter and stores additional information.

Overestimated minimal number of minimizers

Hi @eseiler

I was wondering if the minimal number of minimizers is being overestimated here:

size_t const minimal_number_of_minimizers = std::ceil(kmers_per_pattern / static_cast<double>(kmers_per_window));

I think the floor should be used instead of ceil, for example:

l=14, w=6, k=3
--------------
wwwwww
   kkk
    wwwwww
       kkk
        wwwwww
           kkk
(14-3+1)/(6-3+1) = 3

l=15, w=6, k=3
---------------
wwwwww
   kkk
    wwwwww
       kkk
        wwwwww
           kkk
         wwwwww
           kkk
(15-3+1)/(6-3+1) = 3.25

l=17, w=6, k=3
-----------------
wwwwww
   kkk
    wwwwww
       kkk
        wwwwww
           kkk
           wwwwww
           kkk
(17-3+1)/(6-3+1) = 3.75

I tested with the example data and raptor v2.0.0 and the change increased the number of sequences with a bin assigned when using --error 1 (but no changes with --error 2). Let me know if you agree or if I'm missing something. If yes, I can open a PR.

Backlog

  • Add find_package for SeqAn3
  • Add find_package for googletest/googlebenchmark, use fetch_content if not found
  • Add a {ctest,cmake} flag that results in printing result.command
  • Add option to store basename in index
  • Check whether index always stores absolute paths
  • Make counting_agent template parameter auto-detected. E.g., when using short queries, use uint8_t, etc.

Adapt chopper output

  • Header with mapping of bin id to file path
  • No header in socks
  • No trailing comma

No --hibf with actual hibf index and empty query file segfaults

Platform

  • Raptor version: 2.0.1
  • Operating system: Linux 4.19.128-microsoft-standard SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Source: from source
  • Compiler: gcc 10

Description

I originally observed that raptor segfaults when an empty fastq file is given. In the process of finding what causes the problem, I also found out that supplying a hibf with --index without the flag --hibf also segfaults.

How to repeat the problem

Give raptor an empty .fastq query file, or give raptor a serialized HIBF without supplying the --hibf flag.

Expected behaviour

For the empty input file I would expect: an error, a warning and/or simply an output file with no read classifications.
For the missing --hibf I would expect: an error

Actual behaviour

Segfaults

[BUG] The default parameter size throws an error, if not set.

Platform

  • Raptor version: 2.0.1
  • Operating system: macOS Catalina
  • Source: Compiled from source
  • Compiler: GCC 12.2.0

Description

How to repeat the problem

Run
raptor build --output raptor.index all_bin_paths.txt, e.g. with the example_data.

Expected behaviour

As the size has a default value, this should work.

Actual behaviour

[Error] Option --size is required but not set.

Implement SOCKS Interface

SOCKS - software for colored k-mer sets - will benchmark different applications in the field of computational pangenomics.
A common interface was defined for this purpose, which now needs to be implemented.

Index construction

Input

One plain text file, for example:

COLOR_NAME_1: /PATH/TO/GENOME.FASTA
COLOR_NAME_2: /PATH/TO/READ_1.FASTQ /PATH/TO/READ_2.FASTQ
  • One line per color
  • : separates color name from path
  • There may be multiple paths
  • Paths are whitespace-separated

This results in something along the lines of:

([^:]+):\s*([^\s]+)(\s+([^\s]+))*

Questions

  • What about paths with whitespaces?

Output

Only the first option will be implemented.

K-mer lookup

Input

One plain text file, for example:

ACGTACGT
ACCTAGGT
  • One line per query

Questions

  • What is the alphabet? Is it always dna4?

Output

One plain text file, containing either color names or a binary vector:

ACGTACGT: COLOR_NAME_1 COLOR_NAME_4 COLOR_NAME_7 ...
ACCTAGGT: COLOR_NAME_1 COLOR_NAME_5 COLOR_NAME_8 ...

or

ACGTACGT: 10010010...
ACCTAGGT: 10001001...

We will implement the last option.

Questions

  • Canonical kmers?

Color lookup

Not applicable.

Validation error when using multiple parts

Platform

  • Raptor version: 1.1.0
  • Operating system: 18.04.1-Ubuntu
  • Source: tried both conda and compiling from source

Description

How to repeat the problem

Scenario 1: querying all subindices at once, using the index name without the _0 suffix:
raptor build bin_paths.txt --threads 2 --parts 4 --output index --size 10m
raptor search --query reads_e5_150/all.fastq --output out_0 --index index --error 5

[Error] Validation failed for option --index: The file "index" does not exist!

Scenario 2: querying each subindex separately with:
raptor search --query reads_e5_150/all.fastq --output out_0 --index index_0 --error 5
raptor search --query reads_e5_150/all.fastq --output out_1 --index index_1 --error 5
etc.

Expected behaviour

The union of the 4 output files would be the same as when querying a single index built with the same parameters.

Actual behaviour

Scenario two does result in output files but many bin matches are not found.

For example:

  • Querying a single IBF resulted in 1536 reads matching bin 50
  • Querying 4 parts resulted in a total of 677 reads matching bin 50 (counted via grep "50," out_* | wc -l)

Searching the IBF parts with multiple threads also results in unexpected behavior.

HIBF creates a very large index

Hi

I have been trying to build an index of a large collection of microbial genomes (102999) using HIBF and the resulting index is way larger than when I create the same index using IBF.

The raptor version I used:

VERSION
    Last update: 2023-08-30
    Raptor version: 3.1.0-rc.1 (raptor-v3.0.0-146-gedec71b5a2c19a2203278db814b3362ddb98e9e6)
    Sharg version: 1.1.1
    SeqAn version: 3.4.0-rc.1

The layout stat file:

## ### Parameters ###
## number of user bins = 102999
## number of hash functions = 2
## false positive rate = 0.05
## ### Notation ###
## X-IBF = An IBF with X number of bins.
## X-HIBF = An HIBF with tmax = X, e.g a maximum of X technical bins on each level.
## ### Column Description ###
## tmax : The maximum number of technical bin on each level
## c_tmax : The technical extra cost of querying an tmax-IBF, compared to 64-IBF
## l_tmax : The estimated query cost for an tmax-HIBF, compared to an 64-HIBF
## m_tmax : The estimated memory consumption for an tmax-HIBF, compared to an 64-HIBF
## (l*m)_tmax : Computed by l_tmax * m_tmax
## size : The expected total size of an tmax-HIBF
# tmax  c_tmax  l_tmax  m_tmax  (l*m)_tmax      size
64      1.00    0.00    1.00    0.00    424.3GiB
384     1.51    3.34    1.48    4.96    630.0GiB
# Best t_max (regarding expected query runtime): 64

The prepare and layout and build commands I used:

raptor prepare --input genomes.lst --output genomes_k20_w20 --kmer 20 --window 20 --threads 32
raptor layout --input-file genomes_k20_w20/minimiser.list --output-sketches-to genomes_k20_w20 \
    --determine-best-tmax --kmer-size 20 --false-positive-rate 0.05 --threads 32 \
    --output-filename genomes_k20_w20_binning
raptor build --input genomes_k20_w20_binning --output genomes_k20_w20.index --threads 32

The final index is ~1 TiB, and these are the timings of building the index, where it had a peak memory usage of ~3 TiB:

============= Timings =============
Wall clock time [s]: 40397.13
Peak memory usage [TiB]: 2.9
Index allocation [s]: 0.00
User bin I/O avg per thread [s]: 0.00
User bin I/O sum [s]: 0.00
Merge kmer sets avg per thread [s]: 0.00
Merge kmer sets sum [s]: 0.00
Fill IBF avg per thread [s]: 0.00
Fill IBF sum [s]: 0.00
Store index [s]: 0.00

The IBF index is ~750 GiB and required a fraction of the memory to build. Shouldn't the HIBF be smaller than the IBF index? Any suggestions are much appreciated :-)

Thanks
Antonio

Not properly parsing NCBI assembly filenames

Platform

  • Raptor version: 3.0.0
  • Operating system: Linux 5.10.0-23-amd64 Debian 5.10.179-2 (2023-07-14) x86_64 GNU/Linux
  • Source: bioconda
  • Compiler:

Description

raptor prepare eats parts of any filename containing a dot. E.g., the common file name for genome assemblies from NCBI, GCF_029338575.1_ASM2933857v1_genomic.fna.gz, turns into GCF_029338575.minimiser and .header. That not only makes it impossible to track back to the file, but may also break or behave unexpectedly when using two versions of the same assembly (GCF_029338575.1 and GCF_029338575.2).

How to repeat the problem

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/338/575/GCF_029338575.1_ASM2933857v1/GCF_029338575.1_ASM2933857v1_genomic.fna.gz
find . -name "*.fna.gz" > files.txt
raptor prepare --input files.txt --output tmp

Expected behaviour

GCF_029338575.1_ASM2933857v1_genomic.minimiser and GCF_029338575.1_ASM2933857v1_genomic.header should be created

Actual behaviour

GCF_029338575.minimiser and GCF_029338575.header are created

Read parameters from layout file

When using a layout file, parameters (k-mer, number of hash functions, fpr) should be read from the layout.
There should be a warning (no error) when some parameter is overwritten.

HIBF improvement through saving ubiquitous k-mers only once

There might be experiments where all bins contain a certain set of ubiquitous k-mers. These can be more efficiently stored in a simple lookup table, but there might also be experiments which contain k-mers that are not present in all bins, but in all bins of a merged bin. This leads to the question, if there is a possibility to store these k-mers only once.
One possible way would be:

If a merged bin contains a k-mer in all its bins, only store it at the level of the merged bin, not on any lower level. For the search, this would mean: if a k-mer is found in a merged bin but not in any bin on the lower level, it is a ubiquitous k-mer and is therefore reported as found for all bins in the merged bin.

How would this approach impact the accuracy?
Let's call the probability for a FP on the level, where the merged bin is, p_m, and the probability for a FP on the lower level p_l. The level containing the merged bin is called merged and the one bin we are interested here is called merged bin. The bins in the merged bin on the lower level are called lower bins.

If a k-mer is found as a TP in the merged bin and as a FP in one of the lower bins, then only the bin with the FP would report the k-mer; for all other bins the k-mer is reported as not present and is therefore a FN. This happens with a probability of p_m * p_l * (number of lower bins), as the probability of a FP in one lower bin is independent of the probability of a FP in another lower bin.

This probability would get high quite quickly and is therefore not a good solution.

But maybe we can correct for this?
A k-mer is considered present in all lower bins if it is found in the merged bin and the number of k-mer hits on the lower level is smaller than (number of lower bins) * p_l. This has the disadvantage that k-mers present in only a few lower bins are reported as found for all lower bins.

Alternatively, any merged bin could own its own lookup table, but this seems like a lot of overhead.

Any other ideas, how ubiquitous k-mers in one merged bin could be stored not multiple times?

Wrong description for --window

Description

When calling raptor build --help it says:

--window (unsigned 32 bit integer)
          The window size. Default: 20. Value must be a positive integer.

The actual default sets the window size to the value of --kmer, which is equivalent to not using a window at all.

Timing flag for raptor build and prepare

Build

  • Index allocation (raptor_index<> index{*arguments};)
  • User bin I/O (reader.hash_into(file_names, std::back_inserter(hashes));)
  • Filling the IBF (ibf.emplace(value, seqan3::bin_index{bin_number});)
  • Storing the index (store_index(arguments.out_path, std::move(index), arguments);)

Prepare

  • Reading hashes into map (reader.for_each_hash)
  • Writing minimiser (outfile.write(reinterpret_cast<const char *>(&hash), sizeof(hash));)
  • Writing header (headerfile << ...;)

Search (already offers feature)

  • Moving query records into vector (std::ranges::move(chunked_records, std::back_inserter(records));)
  • Loading the index (load_index(index, arguments, index_io_time);)
  • Computation:
    • Reading hashes into vector (minimiser.assign(minimiser_view.begin(), minimiser_view.end());
    • Querying IBF (auto & result = counter.bulk_count(minimiser);)
    • Writing result (synced_out.write(result_string);)

Ideas

  • Use std::chrono::steady_clock instead of std::chrono::high_resolution_clock
  • If we need atomics, we can cast to the internal representation and cast back upon printing https://godbolt.org/z/63TKq5P6z

[BUG] --compute-minimiser function is broken

When running raptor build --kmer 19 --window 23 --size 8m --compute-minimiser --output sequence_file
I get a seqan3::file_open_error.
My assumption is that the parallel call does not work correctly and does not give the worker the correct file name. I tried to print the file name and got nonsense as output.

Remove `--disable-sketch-output`

This flag saved the intermediate state that chopper count passed on to chopper layout. With the merging of both functions, that hand-over has to happen anyway, so there is no reason to (and no way to) turn it off.

Streamline test infrastructure

  • Add a {ctest,cmake} flag that results in printing result.command
  • Add option to store basename in index
  • Check whether index always stores absolute paths

After-SOCKS cleanup

Follow-up of #35

Look into:

  • Refactor sequence I/O; custom format?
  • Unify partitioned and unpartitioned use case

String identifier for user bins

I suggest adding an option to tag user bins with a name/identifier. That would facilitate integration with other tools and downstream analysis. Currently, AFAIK, raptor prepare --input accepts:

filename
or
filename1 <space> filename2 <space> filename3

The first filename of each line is used as the "identifier" of the bin. I would suggest adding two new modes:

filename <tab> identifier
and
filename1 <space> filename2 <space> filename3 <tab> identifier

Where, instead of the first filename, the last column (tab-separated) is used as the identifier for each bin.

Example: building the HIBF at species level (each species is formed by several files). Here, the species taxid would be used as the identifier. When I do the search, I can directly get the species out.

Search: Cannot select monolithic index when partitioned index with same prefix exists in directory

E.g.,

- raptor.index // Monolithic
- raptor.index_0 // Partitioned
- raptor.index_1 // Partitioned
- raptor.index_2 // Partitioned
- raptor.index_3 // Partitioned

And running

raptor search --index raptor.index [...]

Because of

try
{
    validator(arguments.index_file.string() + std::string{"_0"});
    partitioned = true;
}
catch (sharg::validation_error const & e)
{
    validator(arguments.index_file);
}

The partitioned index will be used.

Solution: Probably just add a check whether the monolithic index exists. If yes, either throw or emit warning, and suggest renaming the monolithic index.

new release?

I'd be happy to package raptor for Debian, especially once #167 and #168 are merged.

Can we have a new release with a tag?

Providing a list of input files

Hi,

Thanks for developing Raptor!

I am trying to use it on a list of contigs, for which I've precomputed a list of minimisers as indicated in your nice tutorial.

However, I'm hitting the bash command-line argument-length limit when providing the minimiser files this way.
I think accepting a list file (one minimiser file per line) would prevent this.

I've been looking at the code in src/raptor.cpp, and it seems that the input validation is done by the bin_validation class.
I'm guessing it shouldn't be too difficult to make it iterate over a list of file names, but I can't figure it out as I can barely read C++. (And SeqAn3-level C++20 template metaprogramming is pure wizardry to me!)

thanks again,

Some extra information about the subcommands

For the help/man pages it would be awesome if the subcommands also have description text, similar to the top parser.

raptor/src/raptor.cpp

Lines 23 to 31 in fecfbca

raptor::init_shared_meta(top_level_parser);
top_level_parser.info.description.emplace_back(
    "Raptor is a system for approximately searching many queries such as "
    "next-generation sequencing reads or transcripts in large collections of "
    "nucleotide sequences. Raptor uses winnowing minimizers to define a set of "
    "representative k-mers, an extension of the interleaved Bloom filters (IBFs) "
    "as a set membership data structure and probabilistic thresholding for "
    "minimizers. Our approach allows compression and partitioning of the IBF to "
    "enable the effective use of secondary memory.");

Currently there is no description of what build, search or upgrade do.

destroyed_indirectly_by_error crashes due to out-of-bounds error

The following parameters crash the execution of destroyed_indirectly_by_error:

destroyed_indirectly_by_error::pattern_size: 15
destroyed_indirectly_by_error::window_size: 6
destroyed_indirectly_by_error::kmer_size: 4

destroyed_indirectly_by_error::iteration: 2391
destroyed_indirectly_by_error::sequence: TAGAGGGGGGCCCAG
destroyed_indirectly_by_error::mins: [0,1,0,1,0,0,1,1,1,1,0,0,0,0,0]
destroyed_indirectly_by_error::minse: [0,1,0,1,1,0,0,1,1,1,1,1,0,0,0]
destroyed_indirectly_by_error::error_pos: 9
destroyed_indirectly_by_error::count: 3
destroyed_indirectly_by_error::result.size(): 3

(original bug opened in eaasna/valik#42)
