R-comp: Online RLBWT compression in optimal-time and BWT-runs bounded space

This is an experimental library of R-comp, an online RLBWT compression algorithm in optimal-time and BWT-runs bounded space, described in the paper: An Optimal-Time RLBWT Construction in BWT-runs Bounded Space (ICALP 2022) by Takaaki Nishimoto, Shunsuke Kanda, and Yasuo Tabei.

Build instructions

You can download and compile the library with the following commands:

$ git clone https://github.com/kampersanda/rcomp.git
$ cd rcomp
$ mkdir build
$ cd build
$ cmake ..
$ make -j

The code is written in C++17, so please install g++ >= 7.0 or clang >= 4.0. For the build system, CMake >= 3.0 have to be installed to compile the library.

The library employs the third-party libraries cmd_line_parser, doctest, nameof and tinyformat, whose header files are contained in this repository.

The code has been tested only on Mac OS X and Linux. That is, this library considers only UNIX-compatible OS.

Implementations

The library implements several data structures and provides the following variants of R-comp defined in rlbwt_types:

rlbwt_types::lfig_naive is a straightforward implementation of the LF-interval graph with O(r) nodes, and
rlbwt_types::glfig_serialized<g> is a spece-efficient implementation of the LF-interval graph with O(r/g) nodes,

where r is the number of BWT-runs.

Also, the library implements r-index on these data structures, providing count and locate queries in the compressed space. In the same manner as rlbwt_types, the variants are defined in rindex_types.

Limitations

An input text must NOT contain the 0x00 character because it is used as a special end marker.
In the current version, class GroupedFIndex resorts to static global variables. Please do NOT create multiple instances of glfig_serialized.

Sample usage

RLBWT

sample/sample_rlbwt.cpp provides a sample usage.

#include <string>

#include "rlbwt_types.hpp"
#include "utils.hpp"

using namespace rcomp;

int main(int argc, char** argv) {
    // Input text
    const std::string text = "abaababaab";

    // Construct the RLBWT by appending characters (with end-marker $) in reverse.
    // Note that '\0' is used for the end marker (i.e., the text should not contain '\0').
    rlbwt_types::glfig_serialized<8>::type rlbwt;
    for (size_t i = 1; i <= text.size(); i++) {
        rlbwt.extend(text[text.size() - i]);
    }

    // Extract the resulted BWT-runs
    std::cout << "BWT-runs: ";
    rlbwt.output_runs([](const run_type& r) { std::cout << r << ","; });
    std::cout << std::endl;

    // Decode the original text (except $) from the RLBWT
    std::string decoded;
    rlbwt.decode_text([&](uchar_type c) { decoded.push_back(c); });
    std::reverse(decoded.begin(), decoded.end());  // need to be reversed
    std::cout << "text:    " << text << std::endl;
    std::cout << "decoded: " << decoded << std::endl;

    return 0;
}

The output will be

BWT-runs: (b,3),(a,1),(b,1),($,1),(a,5),
text:    abaababaab
decoded: abaababaab

r-index

sample/sample_rindex.cpp provides a sample usage.

#include <string>

#include "rindex_types.hpp"
#include "utils.hpp"

using namespace rcomp;

int main(int argc, char** argv) {
    // Input text
    const std::string text = "abaababaab";

    // Construct the r-index by appending characters (with end-marker $) in reverse.
    // Note that '\0' is used for the end marker (i.e., the text should not contain '\0').
    rindex_types::glfig_naive<8>::type rindex;
    for (size_t i = 1; i <= text.size(); i++) {
        rindex.extend(text[text.size() - i]);
    }

    // Extract the resulted BWT-runs
    std::cout << "BWT-runs: ";
    rindex.output_runs([](const run_type& r) { std::cout << r << ","; });
    std::cout << std::endl;

    // Decode the original text (except $) from the index
    std::string decoded;
    rindex.decode_text([&](uchar_type c) { decoded.push_back(c); });
    std::reverse(decoded.begin(), decoded.end());  // need to be reversed
    std::cout << "text:    " << text << std::endl;
    std::cout << "decoded: " << decoded << std::endl;

    // Count the occurrences of the (reversed) query
    const std::string query = "aaba";  // i.e., "abaa" in the original order
    const size_type occ = rindex.count(make_range(query));
    std::cout << "count(" << query << ") = " << occ << std::endl;

    // Locate the (reversed) query
    std::cout << "locate(" << query << ") = {";
    rindex.locate(make_range(query), [&](size_type pos) {
        // We will get pos such that text_r[..pos] = "..aaba", where text_r = reverse(text+'$').
        // In other words, we can extract the original position starting at "abaa" as follows.
        std::cout << text.size() - pos << ",";
    });
    std::cout << "}" << std::endl;

    return 0;
}

The output will be

BWT-runs: (b,3),(a,1),(b,1),($,1),(a,5),
text:    abaababaab
decoded: abaababaab
count(aaba) = 2
locate(aaba) = {5,0,}

Performance test

RLBWT

The executable perf/perf_rlbwt measures the performance of R-comp. The command line options are printed by specifying the parameter -h.

$ ./perf/perf_rlbwt -h
Usage: ./perf/perf_rlbwt [-h,--help] input_path [-t rlbwt_type] [-r reverse_mode] [-T enable_test]

 input_path
    Input file path of text
 [-t rlbwt_type]
    Rlbwt data structure type: lfig | glfig_[8|16|32|64] (default=glfig_16)
 [-r reverse_mode]
    Load the text in reverse? (default=1)
 [-T enable_test]
    Test the data structure? (default=1)
 [-h,--help]
    Print this help text and silently exits.

For data structure type t, lfig is a straight forward implementation, and glfig_g is a memory-efficient implementation by the grouping technique with group size g.
When r is set to 1, the text will be input in reverse order to build the RLBWT for the text in the original order.
When T is set to 1, the correctness of the result will be tested.

For example, for dataset alice29.txt, the following command measures the performace of R-comp with data structure glfig_16 and shows the detailed statistics.

$ ./perf/perf_rlbwt alice29.txt
[Input_Params]
input_path:     alice29.txt
rlbwt_type:     rcomp::Rlbwt_GLFIG<rcomp::GroupedLFIntervalGraph<rcomp::GroupedLData_Serialized<16, false, 2, true, true>, rcomp::GroupedFData_Serialized<rcomp::GroupedLData_Serialized<16, false, 2, true, true> >, 7> >
reverse_mode:   1
[Progress_Report]
num_chars:      10000
construction_sec:       0.002
[Progress_Report]
num_chars:      100000
construction_sec:       0.03
[Final_Report]
construction_sec:       0.048
num_runs:       66903
num_chars:      152090
compression_ratio:      0.439891
alloc_memory_in_bytes:  2892218
alloc_memory_in_MiB:    2.75823
peak_memory_in_bytes:   3604480
peak_memory_in_MiB:     3.4375
Testing the data structure now...
No Problem!
Testing the decoded text now...
No Problem!

r-index

The executable perf/perf_rindex measures the performance of r-index on R-comp. The command line options are printed by specifying the parameter -h.

$ ./perf/perf_rindex -h
Usage: ./perf/perf_rindex [-h,--help] input_path [-t rindex_type] [-r reverse_mode] [-T enable_test]

 input_path
        Input file path of text
 [-t rindex_type]
        Rindex data structure type: lfig | glfig_[8|16|32|64] (default=glfig_16)
 [-r reverse_mode]
        Loading the text in reverse? (default=1)
 [-T enable_test]
        Testing the data structure? (default=1)
 [-h,--help]
        Print this help text and silently exits.

The parameter settings are the same as perf_rlbwt, and the following command measures the performace of r-index on R-comp with data structure glfig_16.

$ ./perf/perf_rindex alice29.txt 
[Input_Params]
input_path:     alice29.txt
rindex_type:    rcomp::Rindex_GLFIG<rcomp::GroupedLFIntervalGraph<rcomp::GroupedLData_Serialized<16, true, 2, true, true>, rcomp::GroupedFData_Serialized<rcomp::GroupedLData_Serialized<16, true, 2, true, true> >, 7> >
reverse_mode:   1
[Progress_Report]
num_chars:      10000
construction_sec:       0.005
[Progress_Report]
num_chars:      100000
construction_sec:       0.07
[Final_Report]
construction_sec:       0.116
num_runs:       66903
num_chars:      152090
compression_ratio:      0.439891
alloc_memory_in_bytes:  6408190
alloc_memory_in_MiB:    6.11133
peak_memory_in_bytes:   8417280
peak_memory_in_MiB:     8.02734
[Search_Settings]
num_trials:     10
num_queries:    1000
query_length:   8
query_seed:     13
Warming up now...
dummy:  18858
[Count_Query]
occ_per_query:  18.858
microsec_per_query:     4.19685
[Locate_Query]
occ_per_query:  37.716
microsec_per_query:     7.14312
microsec_per_occ:       0.189392
Testing the data structure now...
No Problem!
Testing the decoded text now...
No Problem!

BWT tool

The executable tool/transform constructs the BWT text from a given text. You need to set r to 1 to output the BWT for the text in the original order.

$ ./tool/transform alice29.txt alice29.bwt -r 1
Constructing now...
RLBWT was constructed for 152090 chars.
Outputting now...
BWT-text was output.
The number of resulting runs was 66903.

Note that, when r is set to 0, the BWT for the reversed text will be constructed.

$ ./tool/transform alice29.txt alice29.bwt -r 0
WARNING: Since the text will be input in the original order, the BWT for the reversed text will be constructed.
Constructing now...
RLBWT was constructed for 152090 chars.
Outputting now...
BWT-text was output.
The number of resulting runs was 66186.

r-index demo

The executable tool/demo_rindex offers a demo of count and locate queries using r-index.

$ ./tool/demo_rindex alice29.txt 
Constructing r-index...
152090 characters indexed in 6327169 bytes = 6178.88 KiB = 6.03406 MiB.
1. Enter query string to search.
2. Enter "exit" to continue indexing.
> Dinah
Count("Dinah") = 14, done in 36.6 micro sec.
1. Enter '1' to run locate with print.
2. Enter '2' to run locate without print.
3. Enter another not to run locate.
> 1
Locate("Dinah") = {43681, 4612, 5237, 33563, 35845, 5189, 33713, 32627, 32751, 21324, 36155, 4532, 4475, 32894, }
Locate query, done in 76 micro sec, 5.42857 micro sec per occ.
1. Enter query string to search.
2. Enter "exit" to continue indexing.
> exit
Thanks!

Unit test

The unit tests are written using doctest. After compiling, you can run tests with the following command.

$ make test

Authors

Licensing

This program is available for only academic use, basically. For the academic use, please keep MIT License. For the commercial use, please keep GPL 2.0 and make a contact to one of the authors.

If you use the library, please cite the following paper:

@inproceedings{nishimoto2022optimal,
  author =	{Nishimoto, Takaaki and Kanda, Shunsuke and Tabei, Yasuo},
  title =	{{An Optimal-Time RLBWT Construction in BWT-Runs Bounded Space}},
  booktitle =	{49th International Colloquium on Automata, Languages, and Programming (ICALP 2022)},
  pages =	{99:1--99:20},
  year =	{2022},
  doi =		{10.4230/LIPIcs.ICALP.2022.99},
}

Related software

renum is a C++ implementation of enumeration of characteristic substrings in BWT-runs bounded space.
rlbwt_iterator is a C++ implementation of some iterators in BWT-runs bounded space.

kampersanda / rcomp Goto Github PK

rcomp's Introduction

R-comp: Online RLBWT compression in optimal-time and BWT-runs bounded space

Build instructions

Implementations

Limitations

Sample usage

RLBWT

r-index

Performance test

RLBWT

r-index

BWT tool

r-index demo

Unit test

Authors

Licensing

Related software

rcomp's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org