Git Product home page Git Product logo

ilyagrebnov / libsais Goto Github PK

View Code? Open in Web Editor NEW
170.0 14.0 20.0 690 KB

libsais is a library for linear time suffix array, longest common prefix array and burrows wheeler transform construction based on induced sorting algorithm.

License: Apache License 2.0

C 99.91% CMake 0.09%
bwt burrows-wheeler-transform saca suffix-array sais divsufsort suffixarray lcp lcp-array longest-common-prefix

libsais's Introduction

libsais

The libsais is a library for fast (see Benchmarks below) linear time suffix array, longest common prefix array and Burrows-Wheeler transform construction based on induced sorting algorithm described in the following papers:

  • Ge Nong, Sen Zhang, Wai Hong Chan Two Efficient Algorithms for Linear Suffix Array Construction, 2009
  • Juha Karkkainen, Giovanni Manzini, Simon J. Puglisi Permuted Longest-Common-Prefix Array, 2009
  • Nataliya Timoshevskaya, Wu-chun Feng SAIS-OPT: On the characterization and optimization of the SA-IS algorithm for suffix array construction, 2014
  • Jing Yi Xie, Ge Nong, Bin Lao, Wentao Xu Scalable Suffix Sorting on a Multicore Machine, 2020

Copyright (c) 2021-2024 Ilya Grebnov [email protected]

The libsais is inspired by libdivsufsort, sais libraries by Yuta Mori and msufsort by Michael Maniscalco.

libcubwt

If you are looking for even faster construction times, you can try libcubwt a library for GPU-based suffix array, inverse suffix array and Burrows-Wheeler transform construction.

Introduction

The libsais provides simple C99 API to construct suffix array and Burrows-Wheeler transformed string from a given string over constant-size alphabet. The algorithm runs in a linear time using typically only ~16KB of extra memory (with 2n bytes as absolute worst-case; where n is the length of the string). OpenMP acceleration uses 200KB of addition memory per thread.

  • The libsais works with compilers from GNU, Microsoft and Intel, but I recommend Clang for best performance.
  • The libsais is sensitive to fast memory and software prefetching and might not be suitable for some workloads. Please benchmark yourself.

License

The libsais is released under the Apache License Version 2.0

Changes

  • June 13, 2024 (2.8.4)
    • Additional OpenMP acceleration (libsais16 & libsais16x64).
  • June 11, 2024 (2.8.3)
    • Implemented suffix array construction of a long 16-bit array (libsais16x64).
  • May 27, 2024 (2.8.2)
    • Implemented suffix array construction of a long 64-bit array (libsais64).
  • April 5, 2024 (2.8.1)
    • Fixed out-of-bound memory access issue for large inputs (libsais64).
  • March 3, 2024 (2.8.0)
    • Implemented permuted longest common prefix array (PLCP) construction of an integer array.
    • Fixed compilation error when compiling the library with OpenMP enabled.
  • February 26, 2024 (2.7.5)
    • Improved performance of suffix array and burrows wheeler transform construction on degenerate inputs.
  • February 23, 2024 (2.7.4)
    • Resolved strict aliasing violation resulted in invalid code generation by Intel compiler.
  • April 21, 2023 (2.7.3)
    • CMake script for library build and integration with other projects.
  • April 18, 2023 (2.7.2)
    • Fixed out-of-bound memory access issue for large inputs (libsais64).
  • June 19, 2022 (2.7.1)
    • Improved cache coherence for ARMv8 architecture.
  • April 12, 2022 (2.7.0)
    • Support for longest common prefix array (LCP) construction.
  • January 1, 2022 (2.6.5)
    • Exposed functions to construct suffix array of a given integer array.
    • Improved detection of various compiler intrinsics.
    • Capped free space parameter to avoid crashing due to 32-bit integer overflow.
  • October 21, 2021 (2.6.0)
    • libsais16 for 16-bit inputs.
  • October 15, 2021 (2.5.0)
    • Support for optional symbol frequency tables.
  • July 14, 2021 (2.4.0)
    • Reverse Burrows-Wheeler transform.
  • June 23, 2021 (2.3.0)
    • Burrows-Wheeler transform with auxiliary indexes.
  • April 27, 2021 (2.2.0)
    • libsais64 for inputs larger than 2GB.
  • April 19, 2021 (2.1.0)
    • Additional OpenMP acceleration.
  • April 4, 2021 (2.0.0)
    • OpenMP acceleration.
  • February 23, 2021 (1.0.0)
    • Initial release.

Versions of the libsais

  • libsais.c (and corresponding libsais.h) is for suffix array, PLCP, LCP, forward BWT and reverse BWT construction over 8-bit inputs smaller than 2GB (2147483648 bytes).
    • libsais64.c (and corresponding libsais64.h) is optional extension of the library for inputs larger or equlas to 2GB (2147483648 bytes).
    • This versions of the library could also be used to construct suffix array of an integer array (with a caveat that input array must be mutable).
  • libsais16.c + libsais16x64.c (and corresponding libsais16.h + libsais16x64.h) is independent version of the library for 16-bit inputs.
    • This version of the library could also be used to construct suffix array and BWT of a set of strings by adding a unique end-of-string symbol to each string and then computing the result for the concatenated string.

Examples of APIs (see libsais.h, libsais16.h, libsais16x64.h and libsais64.h for complete APIs list)

    /**
    * Constructs the suffix array of a given string.
    * @param T [0..n-1] The input string.
    * @param SA [0..n-1+fs] The output array of suffixes.
    * @param n The length of the given string.
    * @param fs Extra space available at the end of SA array (0 should be enough for most cases).
    * @param freq [0..255] The output symbol frequency table (can be NULL).
    * @return 0 if no error occurred, -1 or -2 otherwise.
    */
    int32_t libsais(const uint8_t * T, int32_t * SA, int32_t n, int32_t fs, int32_t * freq);

    /**
    * Constructs the suffix array of a given integer array.
    * Note, during construction input array will be modified, but restored at the end if no errors occurred.
    * @param T [0..n-1] The input integer array.
    * @param SA [0..n-1+fs] The output array of suffixes.
    * @param n The length of the integer array.
    * @param k The alphabet size of the input integer array.
    * @param fs Extra space available at the end of SA array (can be 0, but 4k or better 6k is recommended for optimal performance).
    * @return 0 if no error occurred, -1 or -2 otherwise.
    */
    int32_t libsais_int(int32_t * T, int32_t * SA, int32_t n, int32_t k, int32_t fs);

    /**
    * Constructs the burrows-wheeler transformed string of a given string.
    * @param T [0..n-1] The input string.
    * @param U [0..n-1] The output string (can be T).
    * @param A [0..n-1+fs] The temporary array.
    * @param n The length of the given string.
    * @param fs Extra space available at the end of A array (0 should be enough for most cases).
    * @param freq [0..255] The output symbol frequency table (can be NULL).
    * @return The primary index if no error occurred, -1 or -2 otherwise.
    */
    int32_t libsais_bwt(const uint8_t * T, uint8_t * U, int32_t * A, int32_t n, int32_t fs, int32_t * freq);

    /**
    * Constructs the original string from a given burrows-wheeler transformed string with primary index.
    * @param T [0..n-1] The input string.
    * @param U [0..n-1] The output string (can be T).
    * @param A [0..n] The temporary array (NOTE, temporary array must be n + 1 size).
    * @param n The length of the given string.
    * @param freq [0..255] The input symbol frequency table (can be NULL).                	
    * @param i The primary index.
    * @return 0 if no error occurred, -1 or -2 otherwise.
    */
    int32_t libsais_unbwt(const uint8_t * T, uint8_t * U, int32_t * A, int32_t n, const int32_t * freq, int32_t i);

Example installation using CPM

CPMAddPackage(
  NAME libsais
  GITHUB_REPOSITORY IlyaGrebnov/libsais
  GIT_TAG v2.8.3
  OPTIONS
    "LIBSAIS_USE_OPENMP OFF"
    "LIBSAIS_BUILD_SHARED_LIB OFF"
)

target_link_libraries(<your target> libsais)

Algorithm description

The libsais uses the SA-IS (Suffix Array Induced Sorting) algorithm to construct both the suffix array and the Burrows-Wheeler transform through recursive decomposition and induced sorting:

  • Initially, the algorithm classifies each position in a string as either an S-type or an L-type, based on whether the suffix starting at that position is lexicographically smaller or larger than the suffix at the adjacent right position. Positions identified as S-type, which have an adjacent left L-type position, are further categorized as LMS-type (Leftmost S-type) positions. Next, the algorithm splits the input string into LMS substrings, which start at an LMS-type position and extend up to the next adjacent LMS-type position. These LMS substrings are then lexicographically sorted through induced sorting and subsequently replaced in the input string with their corresponding sorted ranks, thus forming a new, compacted string. This compacted string reduces the problem size, enabling the algorithm to perform a recursive decomposition in which it is reapplied to construct the suffix array for the compacted string. And at the end of the recursive call, the suffix array for the input string is constructed from the suffix array of the compacted string using another round of induced sorting.
  • The induced sorting is a core mechanic of the SA-IS algorithm and is employed twice during each recursive call: initially before the recursive call to establish the order of LMS substrings, and subsequently after the recursive call to finalize the order of the suffixes of the string. This process involves two sequential scans: a left-to-right scan that determines the order of L-type positions based on the LMS-type positions, followed by a right-to-left scan that establishes the order of S-type positions based on L-type positions. These scans efficiently extend the ordering from LMS-type positions to all positions in the string.

The SA-IS algorithm is quite elegant, yet implementing it efficiently presents multiple challenges. The primary challenge is that the SA-IS algorithm exhibits random memory access patterns, which can significantly decrease efficiency due to cache misses. Another significant challenge is that the SA-IS algorithm is not a lightweight construction algorithm; it requires additional memory to support positions classification, induced sorting, compacted string representations, and recursive decomposition. To circumvent this, the libsais implements careful optimizations that are worth highlighting:

  • The libsais is meticulously designed from the ground up to leverage the capabilities of modern microprocessors, aiming to minimize various stalls and enhance throughput through instruction-level parallelism. The library employs sophisticated techniques such as manual loop unrolling, software prefetching, and branch elimination to achieve this goal. Moreover, it strives to minimize the number of passes over the data by combining multiple operations into a single function. A prime example of these techniques could be observed in the initialization phase of the SA-IS algorithm. In this phase, the entire logic required to classify positions, count symbols into various buckets, and segment the string into LMS substrings is executed through a single, completely branch-less loop:
        for (i = m - 1, j = omp_block_start + prefetch_distance + 3; i >= j; i -= 4)
        {
            libsais_prefetchr(&T[i - 2 * prefetch_distance]);

            libsais_prefetchw(&buckets[BUCKETS_INDEX4(T[i - prefetch_distance - 0], 0)]);
            libsais_prefetchw(&buckets[BUCKETS_INDEX4(T[i - prefetch_distance - 1], 0)]);
            libsais_prefetchw(&buckets[BUCKETS_INDEX4(T[i - prefetch_distance - 2], 0)]);
            libsais_prefetchw(&buckets[BUCKETS_INDEX4(T[i - prefetch_distance - 3], 0)]);

            c1 = T[i - 0]; s = (s << 1) + (fast_uint_t)(c1 > (c0 - (fast_sint_t)(s & 1))); SA[m] = (sa_sint_t)(i + 1); m -= ((s & 3) == 1);
            buckets[BUCKETS_INDEX4((fast_uint_t)c0, s & 3)]++;

            c0 = T[i - 1]; s = (s << 1) + (fast_uint_t)(c0 > (c1 - (fast_sint_t)(s & 1))); SA[m] = (sa_sint_t)(i - 0); m -= ((s & 3) == 1);
            buckets[BUCKETS_INDEX4((fast_uint_t)c1, s & 3)]++;

            c1 = T[i - 2]; s = (s << 1) + (fast_uint_t)(c1 > (c0 - (fast_sint_t)(s & 1))); SA[m] = (sa_sint_t)(i - 1); m -= ((s & 3) == 1);
            buckets[BUCKETS_INDEX4((fast_uint_t)c0, s & 3)]++;

            c0 = T[i - 3]; s = (s << 1) + (fast_uint_t)(c0 > (c1 - (fast_sint_t)(s & 1))); SA[m] = (sa_sint_t)(i - 2); m -= ((s & 3) == 1);
            buckets[BUCKETS_INDEX4((fast_uint_t)c1, s & 3)]++;
        }
  • To sort LMS substrings lexicographically and compute their ranks, the libsais algorithm begins by gathering LMS-type positions as they appear in the string, placing them at the end of the suffix array. The library then employs two passes of induced sorting, which concludes with these same LMS-type positions ordered lexicographically at the beginning of the suffix array. Once all LMS-type positions are sorted, the ranks of the LMS substrings are computed by inspecting each pair of adjacent positions to determine if the corresponding LMS substrings are identical. If they are the same, they receive the same rank; otherwise, the rank is incremented by one.
    • The first challenge of induced sorting is that, during passes over the suffix array, we need to examine each value to determine if it represents a valid position or an empty space, whether this position is not the beginning of the string (and thus could induce another position), and if the induced position is going to be of the necessary type (for example, during a left-to-right scan, we are only inducing L-type positions). This process can cause branch mispredictions and corresponding microprocessor stalls. To address this challenge, libsais employs following techniques. Firstly, the library uses two pointers per induction bucket, each pointing to different sections of the suffix array depending on the type of positions these positions will be inducing next. This approach allows for the separation of LS-type (meaning S-type, which induces L-type; this is the same as LMS-type) and LL-type positions needed for the left-to-right scan from SL-type and SS-type positions needed for the right-to-left scan. Secondly, by understanding the distribution of symbols based on their position types and the types they induce (i.e., SS, SL, LS, LL), we can pre-calculate pointers for each bucket, leaving no empty spaces. And thirdly, by removing the first LMS position and all positions left of it from the initial gathering and distribution, we eliminate the need to check whether a position is not the beginning of the string. These techniques not only result in a completely branch-less loop for each induction sorting pass but also eliminate redundant scanning and the final gathering of LMS-type positions at the beginning of the suffix array.
    • The second challenge arises after induced sorting when we need to compute the ranks of LMS substrings. To accomplish this, we must first calculate and store the lengths of LMS substrings and then inspect each pair of adjacent LMS-type positions to determine if the corresponding LMS substrings are identical. This comparison starts with their lengths, and if they are the same, proceeds to compare the substrings themselves. Such operations exhibits random memory access patterns, which can significantly decrease efficiency due to cache misses. However, libsais avoids this inefficient logic by incorporating the ranking of LMS substrings as part of the induced sorting process itself. The library achieves this by marking the most significant bit (MSB) of positions in the suffix array that start new ranking groups. Each time a position is processed during induced sorting, the library checks the MSB and increments the current rank if the beginning of a new ranking group is encountered. Additionally, for each pointer in an induction bucket, the rank of the previous induced position is maintained. Whenever another position is induced, this previous rank is used to determine whether to mark the newly induced position as the beginning of a new rank group. All the logic to update the ranks and mark the beginnings of new ranking groups is implemented using bit manipulation and is completely branch-less.
    for (i = omp_block_start, j = omp_block_start + omp_block_size - 2 * prefetch_distance - 1; i < j; i += 2)
    {
        libsais_prefetchr(&SA[i + 3 * prefetch_distance]);

        libsais_prefetchr(&T[SA[i + 2 * prefetch_distance + 0] & SAINT_MAX] - 1);
        libsais_prefetchr(&T[SA[i + 2 * prefetch_distance + 0] & SAINT_MAX] - 2);
        libsais_prefetchr(&T[SA[i + 2 * prefetch_distance + 1] & SAINT_MAX] - 1);
        libsais_prefetchr(&T[SA[i + 2 * prefetch_distance + 1] & SAINT_MAX] - 2);

        sa_sint_t p0 = SA[i + prefetch_distance + 0] & SAINT_MAX; sa_sint_t v0 = BUCKETS_INDEX4(T[p0 - (p0 > 0)], 0); libsais_prefetchw(&buckets[v0]);
        sa_sint_t p1 = SA[i + prefetch_distance + 1] & SAINT_MAX; sa_sint_t v1 = BUCKETS_INDEX4(T[p1 - (p1 > 0)], 0); libsais_prefetchw(&buckets[v1]);

        sa_sint_t p2 = SA[i + 0]; d += (p2 < 0); p2 &= SAINT_MAX; sa_sint_t v2 = BUCKETS_INDEX4(T[p2 - 1], T[p2 - 2] >= T[p2 - 1]);
        SA[buckets[v2]++] = (p2 - 1) | ((sa_sint_t)(buckets[2 + v2] != d) << (SAINT_BIT - 1)); buckets[2 + v2] = d;

        sa_sint_t p3 = SA[i + 1]; d += (p3 < 0); p3 &= SAINT_MAX; sa_sint_t v3 = BUCKETS_INDEX4(T[p3 - 1], T[p3 - 2] >= T[p3 - 1]);
        SA[buckets[v3]++] = (p3 - 1) | ((sa_sint_t)(buckets[2 + v3] != d) << (SAINT_BIT - 1)); buckets[2 + v3] = d;
    }
  • In the SA-IS algorithm, after induced sorting, the ranks of LMS substrings are computed in suffix order. These ranks then need to be scattered to reorder them in string order before being gathered again to form the compacted string for recursion. At this point, some LMS substrings may be unique, meaning they don't share their rank with any other LMS substring. Being unique, these substrings are essentially already sorted, and their position relative to other LMS substrings is already determined. However, these unique LMS substrings may still be necessary for sorting other, non-unique LMS substrings during recursion-unless a unique LMS substring is immediately followed by another unique LMS substring in the string. In such cases, the rank of any subsequent unique LMS substrings becomes redundant in the compacted string, as it will not be utilized. Leveraging this insight, libsais employs a strategy to further reduce the size of the compacted string by omitting such redundant LMS substring ranks. This process involves a few steps. First, unique LMS substrings are identified by looking ahead while scanning LMS-positions in the suffix array during the ranking and scattering phase. When scattering LMS substring ranks to form the compacted string, the most significant bit (MSB) of the rank is used to mark that this rank is unique. Next, as the library scans the ranks in string order and detects tandems of unique ranks using the MSB, it then recalculates the MSB for ranks which are redundant, thus markign them for removal from the compacted string. Subsequently, the libsais rescans the LMS-positions in suffix order to recompute the ranks, now focusing only on the ranks of the remaining LMS substrings. The library also uses MSB of first symbol of LMS substrings to mark that LMS substring is removed from the compacted string. Finally, the library builds the compacted string based on the newly recalculated ranks for the remaining LMS substrings, while also saving the final positions for the removed LMS substrings before proceeding with recursion. This reduction process not only further decreases the size of the compacted string but also reduces the alphabet size of the reduced string and creates additional free space in the suffix array, which can be utilized during recursion.
  • The SA-IS algorithm, while robust for suffix array construction, is not considered lightweight due to its need for additional memory for tasks such as position classification, induced sorting, the creation of compacted string representations, and recursive decomposition. To mitigate this, libsais optimizes memory usage by not storing position classifications and striving to reuse the memory space allocated for the suffix array for induced sorting, compacted string representations, and recursive decomposition processes. Since position classifications are not stored, the library recalculates them as needed, typically involving checks of adjacent symbols for a given position. Although this approach may seem straightforward, it introduces the challenge of random memory access. Nevertheless, libsais manages these accesses in a manner that either avoids unnecessary memory fetches or minimizes cache penalties. In situations where avoiding cache penalties is unfeasible, the library leverages the most significant bit (MSB) bits for computations, as branch mispredictions on modern microprocessors generally incur lower penalties than cache misses. Memory reuse for the suffix array, despite appearing straightforward, also presents hidden challenges related to implementation complexity. In certain cases, the available space in the suffix array may not suffice for the most optimal algorithm implementation mentioned above. Although such instances are rare, the library aims to deliver optimal performance without additional memory allocation by resorting to a less efficient variant of induced sorting. To accommodate various scenarios, libsais includes four distinct implementations tailored to different breakpoints based on alphabet size (denoted by 'k'): 6k, 4k, 2k, and 1k, with each implementation optimized to ensure performance efficiency. Extensive efforts have been dedicated to refining these implementations, including significant time invested in using various sanitizers to confirm the correctness of the algorithms. Ultimately, while there are specific inputs under which libsais might require additional memory-most of which tend to be synthetic tests designed specifically to challenge the SA-IS algorithm-such instances are relatively rare. In these exceptional cases, the library is designed to allocate only the minimum necessary amount of memory while still delivering the best possible performance.
  • The libsais library, initially was developed for constructing suffix arrays, but has broadened its scope to include the calculation of the longest common prefix (LCP) and both the forward and inverse Burrows-Wheeler Transform (BWT) with considerable efforts has been dedicated to refining these algorithms to ensure they deliver maximum performance and maintain the correctness. An illustrative example is the forward BWT, which performance is nearly identical to that of its suffix array construction which is achieved by integrating a modified version of the induced sorting implementation within the final stage of the SA-IS algorithm. Rather than inducing suffix positions at this stage, the library induces the Burrows-Wheeler Transform directly. This approach also supports in-place transformation, maintaining a memory usage of 5n, making it an sutable for data compression applications. Similarly, the inverse BWT is fine-tuned to operate in-place, adhering to the same memory efficiency of 5n with an additional optimization of a bi-gram LF-mapping technique, which allows for the decoding of two symbols simultaneously effectively reduces the number of cache misses during the inversion of the Burrows-Wheeler Transform.

Benchmarks

Full list of benchmarks are moved to own Benchmarks.md file.

libsais's People

Contributors

atgoogat avatar ilyagrebnov avatar schafferde avatar tansy avatar zvezdochiot avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

libsais's Issues

intel compiler: libsais64 doesn't produce correct SuffixArray

Using the Intel compiler and calling the libsais64 does not result in correct suffix array.

How to reproduce

  • Checkout newest commit #d5e74eb93

  • create test.c file
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #include <libsais.h>
    #include <libsais64.h>
    
    int main() {
        // input data
        char *Text = "abracadabra";
        int n = strlen(Text);
        int i, j;
    
        // allocate
        int64_t *SA = (int64_t *)malloc(n * sizeof(int64_t));
    
        // sort
        libsais64((unsigned char *)Text, SA, n, 0, NULL);
    
        // output
        for(i = 0; i < n; ++i) {
            printf("SA[%2d] = %2ld: ", i, SA[i]);
            for(j = SA[i]; j < n; ++j) {
                printf("%c", Text[j]);
            }
            printf("$\n");
        }
    
        // deallocate
        free(SA);
    
        return 0;
    }
  • compiled with icx -O2 -DNDEBUG src/libsais64.c src/libsais.c test.c
  • run ./a.out

Output:

SA[ 0] = 10: a$                                                                                                                                                
SA[ 1] =  7: abra$                                                                                                                                             
SA[ 2] =  0: abracadabra$                                                                                                                                      
SA[ 3] =  3: acadabra$                                                                                                                                         
SA[ 4] =  0: abracadabra$                                                                                                                                      
SA[ 5] =  0: abracadabra$                                                                                                                                      
SA[ 6] =  3: acadabra$                                                                                                                                         
SA[ 7] =  0: abracadabra$                                                                                                                                      
SA[ 8] =  0: abracadabra$                                                                                                                                      
SA[ 9] =  0: abracadabra$                                                                                                                                      
SA[10] =  0: abracadabra$       

Expected Output:

SA[ 0] = 10: a$
SA[ 1] =  7: abra$
SA[ 2] =  0: abracadabra$
SA[ 3] =  3: acadabra$
SA[ 4] =  5: adabra$
SA[ 5] =  8: bra$
SA[ 6] =  1: bracadabra$
SA[ 7] =  4: cadabra$
SA[ 8] =  6: dabra$
SA[ 9] =  9: ra$
SA[10] =  2: racadabra$

Version

icx --version

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230622)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm/../bin/icx.cfg

Things I know:

  • using -O1 works, code works with -O2 under clang and gcc.
  • libsais works but libsais64 doesn't

Thank you for all the help!

Overlapping parameters are passed to memcpy

I was playing around with libsais and Silesia corpus when I encountered a possibility for overlapping parameters passed to memcpy. This can happen precisely at libsais_reconstruct_compacted_lms_suffixes_32s_2k_omp libsais/src/libsais.c:3912.

Whether this causes problems or not, seems to depend on environment. In an environment where calls to memcpy are redirected to memmove this does not seem to cause problems and is only detected with AddressSanitizer. However in an environment where memcpy is actually utilized, this causes either a corrupted SA and/or segmentation fault, and is detected also with Valgrind.

I was able to isolate the part of the payload causing this error and wrote a minimal reproducer for it. Payload is a 32kiB block extracted from concatenation of Silesia corpus contents in alphabetical order.

How to use reproducer:
Compile with gcc -o reproducer -g -fsanitize=address reproducer.c libsais.c
Run with ./reproducer payload

If there is some information that I could provide additionally, please feel free to ask.

Crash for file size close to 2GB

As an old user looking forward to switch from libdivsufsort, I noticed libsais would crash as file size approaches 2GB (with or without giving extra space), while divsufsort won't as long as file size is strictly under 2G (210241024*1024). I wonder what is the max size doable without switching to 64-bit version.

multiple vulnerabilities in bzip3

Hello,

I reported multiple vulnerabilities in bzip3, and some of these looks to be related to libsais.

bzip3 uses its own version of libsais, but you may want to see the report to understand if those issues affect your libsais as well.

As a side note:
If you have C tests that uses libsais to parse untrusted input, I can help discover bugs with fuzzing.

strange bug in combination with malloc_count

I am not sure where the root of this issue lies, but on some operating systems, a strange bug occurs when using libsais with OpenMP enabled in combination with malloc_count (https://github.com/bingmann/malloc_count) and gcc. I have not encountered any issues when using malloc_count in combination with any other library yet, so I suspected the issue might be linked to libsais and decided to post it here. I prepared a repository to illustrate it: https://github.com/LukasNalbach/libsais-malloc_count-bug

test-1.cpp uses malloc_count and libsais: Screenshot 2023-07-23 120025
test-2.cpp uses libsais without malloc_count and test-3.cpp uses malloc_count without libsais, no errors.

The bug does not occur when using clang. It also occurs in the sigle-thread version, as long as LIBSAIS_USE_OPENMP is set to ON. It seems to occurr regardless of the gcc and OpenMP versions used. For example, on Ubuntu 18.04 and 22.04 the bug occurs, but on 20.04 and 23.04 it does not. The issue does not occur when executing the program with valgrind with the options --tool=memcheck --leak-check=full.

Even just linking against (not including or calling) libsais when also using malloc_count can cause the same issue in a simple "#pragma omp parallel for" region, regardless of whether LIBSAIS_USE_OPENMP is set to ON. I have not been able to replicate this issue in the example repository, but for example, it occurs here: https://github.com/LukasNalbach/move-r/blob/b615972ca07ea4031416b7450bea65c762cc9e7d/include/move_r/algorithms/construction/modes/common.cpp#L12

edit: Setting THREAD_SAFE_GCC_INTRINSICS to 1 does not solve the issue.

segmentation fault when building the suffix array for some large texts

With some large texts I get a segmentation fault in libsais64 and libsais64_omp (see Screenshot 2023-04-13 100612). Here is one of the texts the issue occurs with: https://drive.google.com/file/d/1UJLAHUW9KZVrW7Y8zekAJLnVp6Au_epc/view?usp=sharing. I prepared an example repository "test_libsais" on my github to test this. I can confirm that the issue occurs on Ubuntu 20.04 in WSL2 (with i7-12700K and i7-1160G7) and on Ubuntu 18.04 (with 2x AMD EPYC 7452). I tested with GCC 9.4.0 and 11.1.0 and libomp-dev (5.0.1-1).

UB in libsais_unbwt

Hi! The following C code will trigger undefined behaviour in libsais' unbwt code:

unsigned char arr[] = { 0xFB, 0xB7, 0x46, 0xA8, 0x13, 0xBC, 0x88, 0xC8, 0x9B, 0xBC, 0x97, 0xCB, 0x1A, 0xA6, 0xAE, 0x96, 0xBC, 0xBD, 0x13, 0xB7, 0xA3, 0xE2, 0x95, 0x88, 0x9B, 0xB6, 0xC2, 0x87, 0x65, 0x77, 0xF7, 0xB8, 0x8E, 0xCE, 0xE1, 0xCB, 0x9F, 0x63, 0x9B, 0xF3, 0xCB, 0x63, 0x42, 0x26, 0x14, 0x2F, 0xC4, 0xCE, 0x43 };
int size = 49; int bwt_idx = 1;

int main(void) {
    s32 * A = (s32 *) malloc(sizeof(s32) * (size + 1));
    libsais_unbwt(arr, arr, A, size, NULL, bwt_idx);
    printf("%d\n", arr[0]);
}

The Valgrind output follows:

==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5C2: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5EA: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5A8: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Conditional jump or move depends on uninitialised value(s)
==2044781==    at 0x11D5CA: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D607: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)

After a closer investigation, this happens on the if statement below:

static void libsais_unbwt_decode_1(u8 * RESTRICT U, sa_uint_t * RESTRICT P, sa_uint_t * RESTRICT bucket2,
                                   u16 * RESTRICT fastbits, fast_uint_t shift, fast_uint_t * i0, fast_uint_t k) {
    u16 * RESTRICT U0 = (u16 *)(void *)U;

    fast_uint_t i, p0 = *i0;

    for (i = 0; i != k; ++i) {
        u16 c0 = fastbits[p0 >> shift];
        if (bucket2[c0] <= p0) {
            do {
                c0++;
            } while (bucket2[c0] <= p0);
        }
        p0 = P[p0];
        U0[i] = libsais_bswap16(c0);
    }

    *i0 = p0;
}

Feature request?

When dealing with large files, for space reasons I often need a partial suffix array, that is, the suffix array of every other (or every n-th) positions of the file. I am guessing it would be much faster than sorting the whole file. Would this be a sensible feature or it would require a considerate amount of work?

feature request: binary BWT

Hello. I'm grateful for your library.
However, I would really appreciate if you would add function to perform BWT over binary strings with 1/0 alphabet. Sometimes it's important for entropy analysis. But it require a lot of memory and computation time. Possibly, algorithm could be optimized for such case.

Constructing the suffix array of a string set

Thanks for the library. Very fast!

I would like to build the SA of a set of strings. There are several ways to define the SA of a string set. The most common is as follows. Let $\mathcal{T}=\{T_1,T_2,\ldots,T_n\}$ be a set of strings over $\Sigma$. Their concatenation is $T=T_1\$_1T_2\$_2\cdots T_n\$_n$ where $\$_1&lt;\$_2&lt;\cdots&lt;\$_n$ are smaller than all symbols in $\Sigma$. The SA of string set $\mathcal{T}$ is defined as the SA of string $T$. This way, you can easily retrieve the $i$-th sequence via LF-mapping.

With libsasi, which supports integer alphabets, we can convert $T$ to an integer array $X$:

$$X[k]=\left\{\begin{array}{ll} i & \mbox{if $T[k]=\$_i$} \\\ T[k]+n & \mbox{otherwise} \end{array}\right.$$

It is easy to see the SA of $X$ is identical to the SA of $T$. A problem with the solution is that we need to convert a 8-bit string to a 64-bit integer array. We need 16 bytes per symbol instead of 9 (1-byte symbol plus 8-byte SA).

I modified a 2008 version of Yuta Mori's sais such that during symbol comparisons, we use a special rule to compare sentinels. Briefly, during symbol comparisons, we implicitly replace a sentinel $\$_i$ with $j-|T|$ where $j$ is the offset of $\$_i$ in $T$. This works well. You can find the source code in this repo. I wonder if it is easy to implement a similar strategy in libsais.

Most huge datasets (the one I am working with has 600 billion symbols) logically consist of multiple strings, not a single string. It would be great if we could have a fast and lightweight SA construction algorithm for string sets. Nevertheless, I understand this is not your priority and I suspect it may be technically challenging to apply my old strategy to libsais. Please feel to say "no". Thanks for reading the long and complex request anyway.

Makefile missing

Hello, it looks like the Makefile was deleted in 3847654. Is this intentional? Is there a new preferred way to build this library now? Thanks.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.