Git Product home page Git Product logo

ilyagrebnov / libsais Goto Github PK

View Code? Open in Web Editor NEW
172.0 14.0 21.0 690 KB

libsais is a library for linear time suffix array, longest common prefix array and burrows wheeler transform construction based on induced sorting algorithm.

License: Apache License 2.0

C 99.91% CMake 0.09%
bwt burrows-wheeler-transform saca suffix-array sais divsufsort suffixarray lcp lcp-array longest-common-prefix parallel-suffix-array

libsais's Issues

Feature request?

When dealing with large files, for space reasons I often need a partial suffix array, that is, the suffix array of every other (or every n-th) positions of the file. I am guessing it would be much faster than sorting the whole file. Would this be a sensible feature or it would require a considerate amount of work?

Overlapping parameters are passed to memcpy

I was playing around with libsais and Silesia corpus when I encountered a possibility for overlapping parameters passed to memcpy. This can happen precisely at libsais_reconstruct_compacted_lms_suffixes_32s_2k_omp libsais/src/libsais.c:3912.

Whether this causes problems or not, seems to depend on environment. In an environment where calls to memcpy are redirected to memmove this does not seem to cause problems and is only detected with AddressSanitizer. However in an environment where memcpy is actually utilized, this causes either a corrupted SA and/or segmentation fault, and is detected also with Valgrind.

I was able to isolate the part of the payload causing this error and wrote a minimal reproducer for it. Payload is a 32kiB block extracted from concatenation of Silesia corpus contents in alphabetical order.

How to use reproducer:
Compile with gcc -o reproducer -g -fsanitize=address reproducer.c libsais.c
Run with ./reproducer payload

If there is some information that I could provide additionally, please feel free to ask.

Constructing the suffix array of a string set

Thanks for the library. Very fast!

I would like to build the SA of a set of strings. There are several ways to define the SA of a string set. The most common is as follows. Let $\mathcal{T}=\{T_1,T_2,\ldots,T_n\}$ be a set of strings over $\Sigma$. Their concatenation is $T=T_1\$_1T_2\$_2\cdots T_n\$_n$ where $\$_1<\$_2<\cdots<\$_n$ are smaller than all symbols in $\Sigma$. The SA of string set $\mathcal{T}$ is defined as the SA of string $T$. This way, you can easily retrieve the $i$-th sequence via LF-mapping.

With libsasi, which supports integer alphabets, we can convert $T$ to an integer array $X$:

$$X[k]=\left\{\begin{array}{ll} i & \mbox{if $T[k]=\$_i$} \\\ T[k]+n & \mbox{otherwise} \end{array}\right.$$

It is easy to see the SA of $X$ is identical to the SA of $T$. A problem with the solution is that we need to convert a 8-bit string to a 64-bit integer array. We need 16 bytes per symbol instead of 9 (1-byte symbol plus 8-byte SA).

I modified a 2008 version of Yuta Mori's sais such that during symbol comparisons, we use a special rule to compare sentinels. Briefly, during symbol comparisons, we implicitly replace a sentinel $\$_i$ with $j-|T|$ where $j$ is the offset of $\$_i$ in $T$. This works well. You can find the source code in this repo. I wonder if it is easy to implement a similar strategy in libsais.

Most huge datasets (the one I am working with has 600 billion symbols) logically consist of multiple strings, not a single string. It would be great if we could have a fast and lightweight SA construction algorithm for string sets. Nevertheless, I understand this is not your priority and I suspect it may be technically challenging to apply my old strategy to libsais. Please feel to say "no". Thanks for reading the long and complex request anyway.

UB in libsais_unbwt

Hi! The following C code will trigger undefined behaviour in libsais' unbwt code:

unsigned char arr[] = { 0xFB, 0xB7, 0x46, 0xA8, 0x13, 0xBC, 0x88, 0xC8, 0x9B, 0xBC, 0x97, 0xCB, 0x1A, 0xA6, 0xAE, 0x96, 0xBC, 0xBD, 0x13, 0xB7, 0xA3, 0xE2, 0x95, 0x88, 0x9B, 0xB6, 0xC2, 0x87, 0x65, 0x77, 0xF7, 0xB8, 0x8E, 0xCE, 0xE1, 0xCB, 0x9F, 0x63, 0x9B, 0xF3, 0xCB, 0x63, 0x42, 0x26, 0x14, 0x2F, 0xC4, 0xCE, 0x43 };
int size = 49; int bwt_idx = 1;

int main(void) {
    s32 * A = (s32 *) malloc(sizeof(s32) * (size + 1));
    libsais_unbwt(arr, arr, A, size, NULL, bwt_idx);
    printf("%d\n", arr[0]);
}

The Valgrind output follows:

==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5C2: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5EA: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D5A8: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Conditional jump or move depends on uninitialised value(s)
==2044781==    at 0x11D5CA: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==
==2044781== Use of uninitialised value of size 8
==2044781==    at 0x11D607: libsais_unbwt_decode_1 (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x11F638: libsais_unbwt_decode (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12016F: libsais_unbwt_decode_omp (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120205: libsais_unbwt_core (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x12031E: libsais_unbwt_main (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120601: libsais_unbwt_aux (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x120489: libsais_unbwt (in /home/palaiologos/Desktop/bzip3/unit)
==2044781==    by 0x10927E: main (in /home/palaiologos/Desktop/bzip3/unit)

After a closer investigation, this happens on the if statement below:

static void libsais_unbwt_decode_1(u8 * RESTRICT U, sa_uint_t * RESTRICT P, sa_uint_t * RESTRICT bucket2,
                                   u16 * RESTRICT fastbits, fast_uint_t shift, fast_uint_t * i0, fast_uint_t k) {
    u16 * RESTRICT U0 = (u16 *)(void *)U;

    fast_uint_t i, p0 = *i0;

    for (i = 0; i != k; ++i) {
        u16 c0 = fastbits[p0 >> shift];
        if (bucket2[c0] <= p0) {
            do {
                c0++;
            } while (bucket2[c0] <= p0);
        }
        p0 = P[p0];
        U0[i] = libsais_bswap16(c0);
    }

    *i0 = p0;
}

multiple vulnerabilities in bzip3

Hello,

I reported multiple vulnerabilities in bzip3, and some of these looks to be related to libsais.

bzip3 uses its own version of libsais, but you may want to see the report to understand if those issues affect your libsais as well.

As a side note:
If you have C tests that uses libsais to parse untrusted input, I can help discover bugs with fuzzing.

intel compiler: libsais64 doesn't produce correct SuffixArray

Using the Intel compiler and calling the libsais64 does not result in correct suffix array.

How to reproduce

  • Checkout newest commit #d5e74eb93

  • create test.c file
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #include <libsais.h>
    #include <libsais64.h>
    
    int main() {
        // input data
        char *Text = "abracadabra";
        int n = strlen(Text);
        int i, j;
    
        // allocate
        int64_t *SA = (int64_t *)malloc(n * sizeof(int64_t));
    
        // sort
        libsais64((unsigned char *)Text, SA, n, 0, NULL);
    
        // output
        for(i = 0; i < n; ++i) {
            printf("SA[%2d] = %2ld: ", i, SA[i]);
            for(j = SA[i]; j < n; ++j) {
                printf("%c", Text[j]);
            }
            printf("$\n");
        }
    
        // deallocate
        free(SA);
    
        return 0;
    }
  • compiled with icx -O2 -DNDEBUG src/libsais64.c src/libsais.c test.c
  • run ./a.out

Output:

SA[ 0] = 10: a$                                                                                                                                                
SA[ 1] =  7: abra$                                                                                                                                             
SA[ 2] =  0: abracadabra$                                                                                                                                      
SA[ 3] =  3: acadabra$                                                                                                                                         
SA[ 4] =  0: abracadabra$                                                                                                                                      
SA[ 5] =  0: abracadabra$                                                                                                                                      
SA[ 6] =  3: acadabra$                                                                                                                                         
SA[ 7] =  0: abracadabra$                                                                                                                                      
SA[ 8] =  0: abracadabra$                                                                                                                                      
SA[ 9] =  0: abracadabra$                                                                                                                                      
SA[10] =  0: abracadabra$       

Expected Output:

SA[ 0] = 10: a$
SA[ 1] =  7: abra$
SA[ 2] =  0: abracadabra$
SA[ 3] =  3: acadabra$
SA[ 4] =  5: adabra$
SA[ 5] =  8: bra$
SA[ 6] =  1: bracadabra$
SA[ 7] =  4: cadabra$
SA[ 8] =  6: dabra$
SA[ 9] =  9: ra$
SA[10] =  2: racadabra$

Version

icx --version

Intel(R) oneAPI DPC++/C++ Compiler 2023.2.0 (2023.2.0.20230622)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm
Configuration file: /opt/intel/oneapi/compiler/2023.2.0/linux/bin-llvm/../bin/icx.cfg

Things I know:

  • using -O1 works, code works with -O2 under clang and gcc.
  • libsais works but libsais64 doesn't

Thank you for all the help!

Why not compute the LCP Array during the Suffix Array's construction ?

Currently the path to computing the LCP Aray with libsais is:
libsais(T,...) ==> libsais_plcp(T, SA,...) ==> libsais_lcp(PLCP, SA,...)

This takes an unnecessary amount of operations since the LCP Array can be computed during the Suffix Array's construction thus avoiding the recomparison of entire adjacent overlapping strings in the Suffix Array:

This can be done because whenever we place two S- or L-suffixes Si−1 and Sj−1 at adjacent places k−1 and k in the final Suffix Array (steps 3 and 4 in the SAIS algorithm), the length of their longest common prefix can be induced from the longest common prefix of the suffixes Si and Sj.
Since the latter suffixes are exactly those that caused the inducing of Si−1 and Sj−1, we already know their LCP-value ℓ (by the order in which we fill the SA), and can hence set LCP[k] to ℓ + 1.

strange bug in combination with malloc_count

I am not sure where the root of this issue lies, but on some operating systems, a strange bug occurs when using libsais with OpenMP enabled in combination with malloc_count (https://github.com/bingmann/malloc_count) and gcc. I have not encountered any issues when using malloc_count in combination with any other library yet, so I suspected the issue might be linked to libsais and decided to post it here. I prepared a repository to illustrate it: https://github.com/LukasNalbach/libsais-malloc_count-bug

test-1.cpp uses malloc_count and libsais: Screenshot 2023-07-23 120025
test-2.cpp uses libsais without malloc_count and test-3.cpp uses malloc_count without libsais, no errors.

The bug does not occur when using clang. It also occurs in the sigle-thread version, as long as LIBSAIS_USE_OPENMP is set to ON. It seems to occurr regardless of the gcc and OpenMP versions used. For example, on Ubuntu 18.04 and 22.04 the bug occurs, but on 20.04 and 23.04 it does not. The issue does not occur when executing the program with valgrind with the options --tool=memcheck --leak-check=full.

Even just linking against (not including or calling) libsais when also using malloc_count can cause the same issue in a simple "#pragma omp parallel for" region, regardless of whether LIBSAIS_USE_OPENMP is set to ON. I have not been able to replicate this issue in the example repository, but for example, it occurs here: https://github.com/LukasNalbach/move-r/blob/b615972ca07ea4031416b7450bea65c762cc9e7d/include/move_r/algorithms/construction/modes/common.cpp#L12

edit: Setting THREAD_SAFE_GCC_INTRINSICS to 1 does not solve the issue.

Crash for file size close to 2GB

As an old user looking forward to switch from libdivsufsort, I noticed libsais would crash as file size approaches 2GB (with or without giving extra space), while divsufsort won't as long as file size is strictly under 2G (210241024*1024). I wonder what is the max size doable without switching to 64-bit version.

Makefile missing

Hello, it looks like the Makefile was deleted in 3847654. Is this intentional? Is there a new preferred way to build this library now? Thanks.

segmentation fault when building the suffix array for some large texts

With some large texts I get a segmentation fault in libsais64 and libsais64_omp (see Screenshot 2023-04-13 100612). Here is one of the texts the issue occurs with: https://drive.google.com/file/d/1UJLAHUW9KZVrW7Y8zekAJLnVp6Au_epc/view?usp=sharing. I prepared an example repository "test_libsais" on my github to test this. I can confirm that the issue occurs on Ubuntu 20.04 in WSL2 (with i7-12700K and i7-1160G7) and on Ubuntu 18.04 (with 2x AMD EPYC 7452). I tested with GCC 9.4.0 and 11.1.0 and libomp-dev (5.0.1-1).

feature request: binary BWT

Hello. I'm grateful for your library.
However, I would really appreciate if you would add function to perform BWT over binary strings with 1/0 alphabet. Sometimes it's important for entropy analysis. But it require a lot of memory and computation time. Possibly, algorithm could be optimized for such case.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.