
raptor's Introduction


A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences

Download and Installation

See the documentation for detailed information on how to download, install, and compile Raptor.

There may be performance benefits when compiling from source, as Raptor can be optimized for the host system.

Install with bioconda (Linux)

conda install -c bioconda -c conda-forge raptor

Example Data and Usage

See the documentation for usage.

raptor --help will show available commands. Each command will have a respective help page that can be shown via, e.g., raptor build --help.

Authorship and Copyright

Raptor is being developed by Enrico Seiler, but also incorporates much work from other members of the SeqAn team.

Citation

Please cite the following in academic works, including comparisons and pipelines:

  • Raptor: A fast and space-efficient pre-filter for querying very large collections of nucleotide sequences; Enrico Seiler, Svenja Mehringer, Mitra Darvish, Etienne Turc, and Knut Reinert; iScience 2021 24 (7): 102782. doi: https://doi.org/10.1016/j.isci.2021.102782
  • Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries; Svenja Mehringer, Enrico Seiler, Felix Droop, Mitra Darvish, René Rahn, Martin Vingron, and Knut Reinert; Genome Biol 24, 131 (2023). doi: https://doi.org/10.1186/s13059-023-02971-4

RECOMB 2021

Raptor was presented at the 25th International Conference on Research in Computational Molecular Biology:

Please see the License section for information on allowed use.

Supplementary

The subdirectory util contains applications and scripts related to papers.

License

Raptor is open-source software. However, certain conditions apply when you (re-)distribute and/or modify Raptor; please see the license for details.

raptor's People

Contributors

dependabot[bot], eseiler, irallia, mitradarja, mr-c, seqan-actions, sgssgene, smehringer


raptor's Issues

Rebase on upstream SeqAn3

  • Rebase and use seqan3::interleaved_bloom_filter and the counting_agent (#19)
  • Allow compression formats for input (lost in rebase) (#20)

Increase code coverage

Make sure everything that should be tested, is tested. Exclude the other code from analysis.

CMake

  • Add find_package for SeqAn3
  • Add find_package for googletest/googlebenchmark, use fetch_content if not found

Index should store k and w

The build step should store the k-mer size and window size so that the search step no longer needs these parameters.

  • Introduce class raptor_index that wraps a seqan3::interleaved_bloom_filter and stores additional information.

Overestimated minimal number of minimizers

Hi @eseiler

I was wondering if the minimal number of minimizers is being overestimated here:

size_t const minimal_number_of_minimizers = std::ceil(kmers_per_pattern / static_cast<double>(kmers_per_window));

I think the floor should be used instead of ceil, for example:

l=14, w=6, k=3
--------------
wwwwww
   kkk
    wwwwww
       kkk
        wwwwww
           kkk
(14-3+1)/(6-3+1) = 3

l=15, w=6, k=3
---------------
wwwwww
   kkk
    wwwwww
       kkk
        wwwwww
           kkk
         wwwwww
           kkk
(15-3+1)/(6-3+1) = 3.25

l=17, w=6, k=3
-----------------
wwwwww
   kkk
    wwwwww
       kkk
        wwwwww
           kkk
           wwwwww
           kkk
(17-3+1)/(6-3+1) = 3.75

I tested with the example data and raptor v2.0.0 and the change increased the number of sequences with a bin assigned when using --error 1 (but no changes with --error 2). Let me know if you agree or if I'm missing something. If yes, I can open a PR.

Backlog

  • Add find_package for SeqAn3
  • Add find_package for googletest/googlebenchmark, use fetch_content if not found
  • Add a {ctest,cmake} flag that results in printing result.command
  • Add option to store basename in index
  • Check whether index always stores absolute paths
  • Make counting_agent template parameter auto-detected. E.g., when using short queries, use uint8_t, etc.

Adapt chopper output

  • Header with mapping of bin id to file path
  • No header in socks
  • No trailing comma

No --hibf with actual hibf index and empty query file segfaults

Platform

  • Raptor version: 2.0.1
  • Operating system: Linux 4.19.128-microsoft-standard SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Source: from source
  • Compiler: gcc 10

Description

I originally observed that raptor segfaults when an empty fastq file is given. In the process of finding what causes the problem, I also found out that supplying a hibf with --index without the flag --hibf also segfaults.

How to repeat the problem

Give raptor an empty .fastq query file, or give raptor a serialized HIBF without supplying the --hibf flag.

Expected behaviour

For the empty input file I would expect: an error, a warning and/or simply an output file with no read classifications.
For the missing --hibf I would expect: an error

Actual behaviour

Segfaults

[BUG] The default parameter size throws an error, if not set.

Platform

  • Raptor version: 2.0.1
  • Operating system: macOS Catalina
  • Source: Compiled from source
  • Compiler: GCC 12.2.0

Description

How to repeat the problem

Run
raptor build --output raptor.index all_bin_paths.txt, e.g. with the example_data.

Expected behaviour

As the size has a default value, this should work.

Actual behaviour

[Error] Option --size is required but not set.

Implement SOCKS Interface

SOCKS - software for colored k-mer sets - will benchmark different applications in the field of computational pangenomics.
A common interface was defined for this purpose, which now needs to be implemented.

Index construction

Input

One plain text file, for example:

COLOR_NAME_1: /PATH/TO/GENOME.FASTA
COLOR_NAME_2: /PATH/TO/READ_1.FASTQ /PATH/TO/READ_2.FASTQ
  • One line per color
  • : separates color name from path
  • There may be multiple paths
  • Paths are whitespace-separated

This results in something along the lines of:

([^:]+):\s*([^\s]+)(\s+([^\s]+))*

Questions

  • What about paths with whitespaces?

Output

Only the first option will be implemented.

K-mer lookup

Input

One plain text file, for example:

ACGTACGT
ACCTAGGT
  • One line per query

Questions

  • What is the alphabet? Is it always dna4?

Output

One plain text file, containing either color names or a binary vector:

ACGTACGT: COLOR_NAME_1 COLOR_NAME_4 COLOR_NAME_7 ...
ACCTAGGT: COLOR_NAME_1 COLOR_NAME_5 COLOR_NAME_8 ...

or

ACGTACGT: 10010010...
ACCTAGGT: 10001001...

We will implement the last option.

Questions

  • Canonical kmers?

Color lookup

Not applicable.

Validation error when using multiple parts

Platform

  • Raptor version: 1.1.0
  • Operating system: 18.04.1-Ubuntu
  • Source: tried both conda and compiling from source

Description

How to repeat the problem

Scenario 1: querying all subindices at once, using the index name without the _0 suffix:
raptor build bin_paths.txt --threads 2 --parts 4 --output index --size 10m
raptor search --query reads_e5_150/all.fastq --output out_0 --index index --error 5

[Error] Validation failed for option --index: The file "index" does not exist!

Scenario 2: querying each subindex separately with:
raptor search --query reads_e5_150/all.fastq --output out_0 --index index_0 --error 5
raptor search --query reads_e5_150/all.fastq --output out_1 --index index_1 --error 5
etc.

Expected behaviour

The union of the 4 output files would be the same as when querying a single index built with the same parameters.

Actual behaviour

Scenario two does result in output files but many bin matches are not found.

For example:

  • Querying a single IBF resulted in 1536 reads matching bin 50
  • Querying 4 parts resulted in a total of 677 reads matching bin 50 (counted via grep "50," out_* | wc -l)

Searching the IBF parts with multiple threads also results in unexpected behavior.

HIBF creates a very large index

Hi

I have been trying to build an index of a large collection of microbial genomes (102999) using HIBF and the resulting index is way larger than when I create the same index using IBF.

The raptor version I used:

VERSION
    Last update: 2023-08-30
    Raptor version: 3.1.0-rc.1 (raptor-v3.0.0-146-gedec71b5a2c19a2203278db814b3362ddb98e9e6)
    Sharg version: 1.1.1
    SeqAn version: 3.4.0-rc.1

The layout stat file:

## ### Parameters ###
## number of user bins = 102999
## number of hash functions = 2
## false positive rate = 0.05
## ### Notation ###
## X-IBF = An IBF with X number of bins.
## X-HIBF = An HIBF with tmax = X, e.g a maximum of X technical bins on each level.
## ### Column Description ###
## tmax : The maximum number of technical bin on each level
## c_tmax : The technical extra cost of querying an tmax-IBF, compared to 64-IBF
## l_tmax : The estimated query cost for an tmax-HIBF, compared to an 64-HIBF
## m_tmax : The estimated memory consumption for an tmax-HIBF, compared to an 64-HIBF
## (l*m)_tmax : Computed by l_tmax * m_tmax
## size : The expected total size of an tmax-HIBF
# tmax  c_tmax  l_tmax  m_tmax  (l*m)_tmax      size
64      1.00    0.00    1.00    0.00    424.3GiB
384     1.51    3.34    1.48    4.96    630.0GiB
# Best t_max (regarding expected query runtime): 64

The prepare and layout and build commands I used:

raptor prepare --input genomes.lst --output genomes_k20_w20 --kmer 20 --window 20 --threads 32
raptor layout --input-file genomes_k20_w20/minimiser.list --output-sketches-to genomes_k20_w20 \
    --determine-best-tmax --kmer-size 20 --false-positive-rate 0.05 --threads 32 \
    --output-filename genomes_k20_w20_binning
raptor build --input genomes_k20_w20_binning --output genomes_k20_w20.index --threads 32

The final index is ~1 TiB, and these are the timings of building the index, where it had a peak memory usage of ~3 TiB:

============= Timings =============
Wall clock time [s]: 40397.13
Peak memory usage [TiB]: 2.9
Index allocation [s]: 0.00
User bin I/O avg per thread [s]: 0.00
User bin I/O sum [s]: 0.00
Merge kmer sets avg per thread [s]: 0.00
Merge kmer sets sum [s]: 0.00
Fill IBF avg per thread [s]: 0.00
Fill IBF sum [s]: 0.00
Store index [s]: 0.00

The IBF index is ~750 GiB and required a fraction of the memory to build. Shouldn't the HIBF be smaller than the IBF index? Any suggestions are much appreciated :-)

Thanks
Antonio

Not properly parsing NCBI assembly filenames

Platform

  • Raptor version: 3.0.0
  • Operating system: Linux 5.10.0-23-amd64 Debian 5.10.179-2 (2023-07-14) x86_64 GNU/Linux
  • Source: bioconda
  • Compiler:

Description

raptor prepare eats parts of any filename containing a dot. E.g., the common file name for genome assemblies from NCBI, GCF_029338575.1_ASM2933857v1_genomic.fna.gz, turns into GCF_029338575.minimiser and .header. That not only makes it impossible to track back to the file, but may also break or behave unexpectedly when using two versions of the same assembly (GCF_029338575.1 and GCF_029338575.2).

How to repeat the problem

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/029/338/575/GCF_029338575.1_ASM2933857v1/GCF_029338575.1_ASM2933857v1_genomic.fna.gz
find . -name "*.fna.gz" > files.txt
raptor prepare --input files.txt --output tmp

Expected behaviour

GCF_029338575.1_ASM2933857v1_genomic.minimiser and GCF_029338575.1_ASM2933857v1_genomic.header should be created

Actual behaviour

GCF_029338575.minimiser and GCF_029338575.header are created

Read parameters from layout file

When using a layout file, parameters (k-mer, number of hash functions, fpr) should be read from the layout.
There should be a warning (no error) when some parameter is overwritten.

HIBF improvement through saving ubiquitous k-mers only once

There might be experiments where all bins contain a certain set of ubiquitous k-mers. These can be more efficiently stored in a simple lookup table, but there might also be experiments which contain k-mers that are not present in all bins, but in all bins of a merged bin. This leads to the question, if there is a possibility to store these k-mers only once.
One possible way would be:

If a merged bin contains a k-mer in all its bins, only store it at the level of the merged bin, not on any lower level. For the search, this would mean: if a k-mer is found in a merged bin but not in any bin on the lower level, it is a ubiquitous k-mer and is therefore reported as found for all bins in the merged bin.

How would this approach impact the accuracy?
Let's call the probability for a FP on the level, where the merged bin is, p_m, and the probability for a FP on the lower level p_l. The level containing the merged bin is called merged and the one bin we are interested here is called merged bin. The bins in the merged bin on the lower level are called lower bins.

If a k-mer is found as a TP in the merged bin and as a FP in one of the lower bins, then only the bin with the FP would report the k-mer; for all other bins the k-mer is reported as not present and is therefore a FN. This happens with a probability of p_m * p_l * (number of lower bins), as the probability of a FP in one lower bin is independent of the probability of a FP in another lower bin.

This probability would get high quite quickly and is therefore not a good solution.

But maybe we can correct for this?
A k-mer is considered present in all lower bins if it is found in the merged bin and the number of k-mer hits on the lower level is smaller than (number of lower bins) * p_l. This has the disadvantage that k-mers present in only a few lower bins are reported as found for all lower bins.

Alternatively, any merged bin could own its own lookup table, but this seems like a lot of overhead.

Any other ideas, how ubiquitous k-mers in one merged bin could be stored not multiple times?

Wrong description for --window

Description

When calling raptor build --help it says:

--window (unsigned 32 bit integer)
          The window size. Default: 20. Value must be a positive integer.

The actual default sets the window size to the value of --kmer, which is equivalent to not using a window at all.

Timing flag for raptor build and prepare

Build

  • Index allocation (raptor_index<> index{*arguments};)
  • User bin I/O (reader.hash_into(file_names, std::back_inserter(hashes));)
  • Filling the IBF (ibf.emplace(value, seqan3::bin_index{bin_number});)
  • Storing the index (store_index(arguments.out_path, std::move(index), arguments);)

Prepare

  • Reading hashes into map (reader.for_each_hash)
  • Writing minimiser (outfile.write(reinterpret_cast<const char *>(&hash), sizeof(hash));)
  • Writing header (headerfile << ...;)

Search (already offers feature)

  • Moving query records into vector (std::ranges::move(chunked_records, std::back_inserter(records));)
  • Loading the index (load_index(index, arguments, index_io_time);)
  • Computation:
    • Reading hashes into vector (minimiser.assign(minimiser_view.begin(), minimiser_view.end());
    • Querying IBF (auto & result = counter.bulk_count(minimiser);)
    • Writing result (synced_out.write(result_string);)

Ideas

  • Use std::chrono::steady_clock instead of std::chrono::high_resolution_clock
  • If we need atomics, we can cast to the internal representation and cast back upon printing https://godbolt.org/z/63TKq5P6z

[BUG] --compute-minimiser function is broken

When running raptor build --kmer 19 --window 23 --size 8m --compute-minimiser --output sequence_file
I get a seqan3::file_open_error.
My assumption is that the parallel call does not work correctly and does not give the worker the correct file name. I tried to print the file name and got nonsense as output.

Remove `--disable-sketch-output`

This flag saved the intermediate state that chopper count passed on to chopper layout. With the merging of both functions, that hand-over has to happen anyway, so there is no reason to (and no way to) turn it off.

Streamline test infrastructure

  • Add a {ctest,cmake} flag that results in printing result.command
  • Add option to store basename in index
  • Check whether index always stores absolute paths

After-SOCKS cleanup

Follow-up of #35

Look into:

  • Refactor sequence I/O; custom format?
  • Unify partitioned and unpartitioned use case

String identifier for user bins

I suggest adding an option to tag user bins with a name/identifier. That would facilitate integration with other tools and downstream analysis. Currently, AFAIK, raptor prepare --input accepts:

filename
or
filename1 <space> filename2 <space> filename3

The first filename of each line is used as the "identifier" of the bin. I would suggest adding two new modes:

filename <tab> identifier
and
filename1 <space> filename2 <space> filename3 <tab> identifier

Where, instead of the first filename, the last column (tab-separated) is used as the identifier for each bin.

Example: building the HIBF at species level (each species is formed by several files). Here, the species taxid would be used as the identifier. When I do the search, I can directly get the species out.

Search: Cannot select monolithic index when partitioned index with same prefix exists in directory

E.g.,

- raptor.index // Monolithic
- raptor.index_0 // Partitioned
- raptor.index_1 // Partitioned
- raptor.index_2 // Partitioned
- raptor.index_3 // Partitioned

And running

raptor search --index raptor.index [...]

Because of

try
{
    validator(arguments.index_file.string() + std::string{"_0"});
    partitioned = true;
}
catch (sharg::validation_error const & e)
{
    validator(arguments.index_file);
}

The partitioned index will be used.

Solution: Probably just add a check whether the monolithic index exists. If yes, either throw or emit warning, and suggest renaming the monolithic index.

new release?

I'd be happy to package raptor for Debian, especially once #167 and #168 are merged.

Can we have a new release with a tag?

Providing a list of input files

Hi,

Thanks for developing Raptor!

I am trying to use it on a list of contigs, for which I've precomputed a list of minimisers as indicated in your nice tutorial.

However, I'm hitting the bash command-line argument-length limit when providing the minimiser files this way.
I think accepting a list file (one minimiser file per line) would prevent this.

I've been looking at the code in src/raptor.cpp, and it seems that the input validation is done by the bin_validation class.
I'm guessing it shouldn't be too difficult to make it iterate over a list of file names, but I can't figure it out as I can barely read C++. (And SeqAn3-level C++20 template metaprogramming is pure wizardry to me!)

thanks again,

Some extra information about the subcommands

For the help/man pages it would be awesome if the subcommands also have description text, similar to the top parser.

raptor/src/raptor.cpp

Lines 23 to 31 in fecfbca

raptor::init_shared_meta(top_level_parser);
top_level_parser.info.description.emplace_back(
    "Raptor is a system for approximately searching many queries such as "
    "next-generation sequencing reads or transcripts in large collections of "
    "nucleotide sequences. Raptor uses winnowing minimizers to define a set of "
    "representative k-mers, an extension of the interleaved Bloom filters (IBFs) "
    "as a set membership data structure and probabilistic thresholding for "
    "minimizers. Our approach allows compression and partitioning of the IBF to "
    "enable the effective use of secondary memory.");

Currently there is no description of what build, search or upgrade do.

destroyed_indirectly_by_error crashes due to out-of-bounds error

The following parameters crash the execution of destroyed_indirectly_by_error:

destroyed_indirectly_by_error::pattern_size: 15
destroyed_indirectly_by_error::window_size: 6
destroyed_indirectly_by_error::kmer_size: 4

destroyed_indirectly_by_error::iteration: 2391
destroyed_indirectly_by_error::sequence: TAGAGGGGGGCCCAG
destroyed_indirectly_by_error::mins: [0,1,0,1,0,0,1,1,1,1,0,0,0,0,0]
destroyed_indirectly_by_error::minse: [0,1,0,1,1,0,0,1,1,1,1,1,0,0,0]
destroyed_indirectly_by_error::error_pos: 9
destroyed_indirectly_by_error::count: 3
destroyed_indirectly_by_error::result.size(): 3

(original bug opened in eaasna/valik#42)
