Git Product home page Git Product logo

marisa-trie's People

Contributors

glebm avatar jmr avatar karry avatar mikepb avatar s-yata avatar superbobry avatar tony-mak avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

marisa-trie's Issues

Python binding no longer compatible with python 3.12 and later

The current python binding includes include imp instruction, which will raise error with Python 3.12 or later because the imp module has been removed with Python 3.12.

Downstream bug report: https://bugs.debian.org/1061809

When trying to regenerate the python binding with swig 4.2, the following warnings will appear:

-> % make swig-python          
swig -Wall -c++ -python -outdir python marisa-swig.i
marisa-swig.h:55: Warning 321: 'str' conflicts with a built-in name in python
marisa-swig.h:56: Warning 321: 'id' conflicts with a built-in name in python
marisa-swig.h:69: Warning 321: 'str' conflicts with a built-in name in python
marisa-swig.h:70: Warning 321: 'id' conflicts with a built-in name in python
mv marisa-swig_wrap.cxx python
cp marisa-swig.cxx marisa-swig.h python

Please revise the Python binding and fix the warnings accordingly. Thanks!

MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX

Hello Susumu,

We are using the Python Marisa trie wrapper (https://github.com/kmike/marisa-trie) which implements your library. The amount of data we've been placing in the trie has been increasing over time and the most recent trie generation caused the following overflow:

File "marisa_trie.pyx", line 422, in marisa_trie.BytesTrie.init (src/marisa_trie.cpp:7670)
File "marisa_trie.pyx", line 127, in marisa_trie.Trie._build (src/marisa_trie.cpp:2768)
RuntimeError: lib/marisa/grimoire/trie/tail.cc:192: MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX

If there's any more info you need please let me know!

Error with perl bindings

Marisa works well on my local machine (perl 5.22). But on a machine with CentOS6 (perl 5.10.1) I get the following error:
perl sample.pl
Can't load '/usr/local/lib64/perl5/auto/marisa/marisa.so' for module marisa: /usr/local/lib64/perl5/auto/marisa/marisa.so: undefined symbol: _ZTVN10__cxxabiv120__si_class_type_infoE at /usr/lib64/perl5/DynaLoader.pm line 200.
at /usr/local/lib64/perl5/marisa.pm line 11
Compilation failed in require at sample.pl line 1.
BEGIN failed--compilation aborted at sample.pl line 1.

Any suggestions?
(PS had to upgrade to newer versions of autoconf and automake in order to compile marisa on CentOS6).

ID Assigned according to the lexicographical order

Can the ID be assigned in ascending lexicographical order?
For example, I hope that the ID corresponding to "bbb" is larger than the ID corresponding to "a".
Hope to provide an interface to return ID greater/smaller than a certain string.
The purpose is to transform the comparison of strings into comparison of cooresponding IDs.

I haven't read the code yet, and I don't know if this requirement is feasible in the current architecture.

Looking forward to any feedback, thank you very much.

An error occur when "make" in OSX

warning: ignoring file ./.libs/libcmdopt.a, file was built for archive which is not the architecture being linked (x86_64)。

I installed automake, libtool by homebrew.
marisa version: 0.2.5

How can I do search?

I find the function _find, but I think it is an internal function and I don't know how to use it, is it right ? _find(key, 0, 0, len(key))

Any restrictions on making an alternate implementation

Hi, at first I would like to express my appreciation for this great work! It is amazing how much compression can be achieved by using this technology!!

I would like to make a completely different implementation of the same idea with several modifications for use in my project Unishox. Are there any restrictions on using this idea for a different implementation and document it indicating the differences from this work? I could not find a white paper for marisa. Has it been patented?

Keyset causes bottleneck for large data sets?

Hi,
I am currently trying to insert every consecutive k-mer (subsequence of length k) of the human chromosome 21 into your trie structure. Since DNA has only a character alphabet of {A, C, G, T} I thought that might be a good use case for tries. With k=32 we are talking about ~55 million "words" to be inserted.

It seems that the bottleneck here is the marisa::Keyset, since this data structure obviously gets way too big for the RAM. I never even make it to the building step of marisa::trie. Is there a way to avoid building the full keyset first? Any chance to store data of that size with this library? I use pretty much the code from your README in section "Library->How to use".

race condition in Mapper::open_()

Hello, I'm giving marisa a quick review as part of an Ubuntu Main Inclusion Request: https://bugs.launchpad.net/ubuntu/+source/marisa/+bug/1914808 This is just a quick review, nothing too deep.

Coverity reported a race condition in Mapper::open_():

void Mapper::open_(const char *filename) {

If another process were to swap out the file between the stat() call and the open() call, the size might refer to a different file.

This probably isn't a big deal, since a process that could do that could do other silly things to break marisa, but I figured it's worth reporting all the same.

Coverity gives a lot of results, some look like mistakes in Coverity, some look like harmless issues (std::rand() use in tests, exceptions bubbling up to main() in tests, etc.). I'll attach the whole Coverity report to the Ubuntu bug, in case you're curious.

Thanks

lib/marisa/grimoire/io/mapper.cc:141:
  1. fs_check_call: Calling function "stat" to perform check on "filename".
lib/marisa/grimoire/io/mapper.cc:141:
  2. path: Condition "!(stat(filename, &st) != 0)", taking true branch.
lib/marisa/grimoire/io/mapper.cc:142:
  3. path: Condition "!((marisa::UInt64)st.st_size > 18446744073709551615UL /* (size_t)~((size_t)0) */)", taking true branch.
lib/marisa/grimoire/io/mapper.cc:145:
  4. toctou: Calling function "open" that uses "filename" after a check function. This can cause a time-of-check, time-of-use race condition.

can marisa-build preserves the original weight of a special key?

hi @s-yata
Thank you for your excellent marisa-trie.

I have a question, hope you can give some help or hint:
The document on http://s-yata.github.io/marisa-trie/docs/readme.en.html has this description:

If an input line contains horizontal tabs, the last one serves as the delimiter between a key and its weight which is used to optimize the order of nodes. Estimated frequency of each key, given as the weight, may improve the search performance.

I'm wonder if marisa-trie can keep the original predefined weight of a special key? I mean not the updated weight.

while (trie.predictive_search(agent)) {
    const marisa::Key key = agent.key();
    key.weight(); // this `weight` is not the original weight
}

Can marisa-trie provide another api to get the original predefined weight? That will be very helpful for many use-cases, eg: a word with predefined frequency as the weight in a large words dictionary.

read/map v0.1.6 trie with v>=0.2.6?

It would be nice to be able to read and map (or mostly map and read the rest) data from 0.1.5, which I know uses a different format. Are the formats close enough that it would be a moderately small amount of extra code to read? At first glance, I see Trie and MarisaTrie have most fields in common, but 0.1.6 has labels_ and links_ while 0.2.6 has bases_ and extras_. I didn't look at whether the formats of BitVector or other classes changed.

Bug hunt: Pointer underflow behavior

Following up from #7 marisa::grimoire::trie::ReverseKey takes advantage of pointer underflow to simplify the code. Unfortunately, it seems that the pointer underflow behavior is undefined. Clang, for example, exhibits a different behavior than GCC.

A quick search suggests that relying on this behavior opens up marisa-trie to compiler-specific bugs and security exploits.

The purpose of this issue is to track these consequences and discuss possible fixes.

From Show at Stack Overflow on the behavior in C++:

The standard specifies in §5.7/4 that:

When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression. [...] Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behaviour is undefined.

A similar quote can be found for subtraction in §5.7/5. So given that overflow and underflow are special cases of pointers that exceed the bounds of object to which they originally point to, the behaviour would simply be undefined.

From rampion at Stack Overflow on the behavior in C99:

From section 6.5.6 of the combined C99 + TC1 + TC2 (pdf):

If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.

Some resources:

Segmentation fault with CXXFLAGS="-O3" or CXXFLAGS="-O1 -ftree-loop-vectorize"

Segmentation fault in marisa-test occurs if Marisa is configured with CXXFLAGS="-O3".
Problem does not occur for me with additional -march=westmere or -march=native (e.g. CXXFLAGS="-O3 -march=westmere").
I use GCC 10.2.0.

$ CXXFLAGS="-g -O3" ./configure
...
$ make
...
$ make check
...
make  check-TESTS
make[2]: Entering directory '/tmp/marisa-trie/tests'
make[3]: Entering directory '/tmp/marisa-trie/tests'
PASS: base-test
PASS: io-test
PASS: vector-test
PASS: trie-test
../test-driver: line 109: 14027 Segmentation fault      "$@" > $log_file 2>&1
FAIL: marisa-test
============================================================================
Testsuite summary for marisa 0.2.6
============================================================================
# TOTAL: 5
# PASS:  4
# SKIP:  0
# XFAIL: 0
# FAIL:  1
# XPASS: 0
# ERROR: 0
============================================================================
See tests/test-suite.log
Please report to [email protected]
============================================================================
make[3]: *** [Makefile:715: test-suite.log] Error 1
make[3]: Leaving directory '/tmp/marisa-trie/tests'
make[2]: *** [Makefile:823: check-TESTS] Error 2
make[2]: Leaving directory '/tmp/marisa-trie/tests'
make[1]: *** [Makefile:925: check-am] Error 2
make[1]: Leaving directory '/tmp/marisa-trie/tests'
make: *** [Makefile:456: check-recursive] Error 1
$ LD_LIBRARY_PATH="$(pwd)/lib/marisa/.libs" gdb tests/.libs/marisa-test
...
(gdb) r
Starting program: /tmp/marisa-trie/tests/.libs/marisa-test 
marisa-test.cc:13: TestEmptyTrie(): ok
marisa-test.cc:87: TestTinyTrie(): ok

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7fab3c5 in marisa::grimoire::algorithm::details::get_label<marisa::grimoire::trie::Entry> (depth=1, unit=...)
    at ../../../../lib/marisa/grimoire/algorithm/sort.h:19
19        return (depth < unit.length()) ? (int)(UInt8)unit[depth] : -1;
(gdb) bt
#0  0x00007ffff7fab3c5 in marisa::grimoire::algorithm::details::get_label<marisa::grimoire::trie::Entry> (depth=1, unit=...)
    at ../../../../lib/marisa/grimoire/algorithm/sort.h:19
#1  marisa::grimoire::algorithm::details::sort<marisa::grimoire::trie::Entry*> (l=l@entry=0x55555557c110, r=r@entry=0x55555557c660, depth=depth@entry=1)
    at ../../../../lib/marisa/grimoire/algorithm/sort.h:103
#2  0x00007ffff7fab4ea in marisa::grimoire::algorithm::details::sort<marisa::grimoire::trie::Entry*> (l=0x55555557aed0, r=0x55555557dce0, 
    depth=depth@entry=0) at ../../../../lib/marisa/grimoire/algorithm/sort.h:132
#3  0x00007ffff7fac2c3 in marisa::grimoire::algorithm::sort<marisa::grimoire::trie::Entry*> (end=<optimized out>, begin=<optimized out>)
    at ../../../../lib/marisa/grimoire/algorithm/sort.h:189
#4  marisa::grimoire::Algorithm::sort<marisa::grimoire::trie::Entry*> (end=<optimized out>, begin=<optimized out>, this=<optimized out>)
    at ../../../../lib/marisa/grimoire/algorithm.h:15
#5  marisa::grimoire::trie::Tail::build_ (this=0x7fffffffd130, entries=..., offsets=0x7fffffffd310, mode=MARISA_TEXT_TAIL) at tail.cc:161
#6  0x00007ffff7facde7 in marisa::grimoire::trie::Tail::build (this=this@entry=0x7fffffffd918, entries=..., offsets=offsets@entry=0x7fffffffd310, 
    mode=MARISA_TEXT_TAIL) at tail.cc:41
#7  0x00007ffff7fb569b in marisa::grimoire::trie::LoudsTrie::build_next_trie<marisa::grimoire::trie::Key> (this=0x7fffffffd630, keys=..., 
    terminals=0x7fffffffd310, config=..., trie_id=1) at ../../../../lib/marisa/grimoire/trie/config.h:35
#8  0x00007ffff7fbea57 in marisa::grimoire::trie::LoudsTrie::build_trie<marisa::grimoire::trie::Key> (this=this@entry=0x7fffffffd630, keys=..., 
    terminals=terminals@entry=0x7fffffffd4c0, config=..., trie_id=trie_id@entry=1) at louds-trie.cc:300
#9  0x00007ffff7fb5940 in marisa::grimoire::trie::LoudsTrie::build_ (this=0x7fffffffd630, keyset=..., config=...) at louds-trie.cc:257
#10 0x00007ffff7fb69e3 in marisa::grimoire::trie::LoudsTrie::build (this=this@entry=0x555555576bd0, keyset=..., flags=flags@entry=135169)
    at louds-trie.cc:27
#11 0x00007ffff7fa975a in marisa::Trie::build (this=this@entry=0x7fffffffdb38, keyset=..., config_flags=config_flags@entry=135169) at trie.cc:16
#12 0x000055555555ada4 in (anonymous namespace)::TestTrie (num_tries=1, tail_mode=MARISA_TEXT_TAIL, node_order=MARISA_WEIGHT_ORDER, keyset=...)
    at marisa-test.cc:268
#13 0x000055555555bd0a in (anonymous namespace)::TestTrie (tail_mode=MARISA_TEXT_TAIL, node_order=MARISA_WEIGHT_ORDER, keyset=...) at marisa-test.cc:357
#14 0x000055555555be2e in (anonymous namespace)::TestTrie (tail_mode=MARISA_TEXT_TAIL) at marisa-test.cc:367
#15 0x0000555555559b7e in (anonymous namespace)::TestTrie () at marisa-test.cc:372
#16 main () at marisa-test.cc:383
(gdb) 

Could Keyset (memory bottleneck) be bypassed by ordering input?

As already noted in (closed) issue #18 , marisa::Keyset poses an input size (memory) bottleneck for trie construction.
dawgdic has a DawgBuilder class which accepts input in (LC_ALL=C, i.e. memcmp) sorted order, and builds their data structure directly from that to avoid the need for an in-memory input record buffer (which is what I understand Marisa's marisa::Keyset to be).
Since we have good tools for scalable sorting (even out-of-core; e.g. GNU sort), is is possible a similar MarisaBuilder could accept input in some preprocessed (e.g. sorted) order, and avoid the need for a marisa::Keyset? I think some of us would be willing to externally preprocess our input data if so.

Add riscv64 support

Due to the fact riscv64 is not explicitly mentioned in include/marisa/base.h, MARISA_WORD_SIZE ends up being incorrectly set to 32 on this architecture. Fortunately, this is trivial to address - just add

|| (defined(__riscv) && (__riscv_xlen == 64))

to the end of the #if statement in base.h:31-34.

Confirmed to work on a physical rv64 Linux system.

References:

MARISA_IO_ERROR: file_ == INVALID_HANDLE_VALUE

We've managed to spot one more issue with Windows support in marisa-trie. Trie.write fails with

lib/marisa/grimoire/io\mapper.cc:127: MARISA_IO_ERROR: file_ == INVALID_HANDLE_VALUE

You can see the Python test case causing the error here.

Clarify license

README.md says:

marisa-trie/README.md

Lines 58 to 61 in 006020c

#### Source code license
* The BSD 2-clause License
* The LGPL 2.1 or any later version

It is not clear whether this means one must obey the terms of both:

  • BSD-2-clause and LGPL-2.1-or-later

or either:

  • BSD-2-clause or LGPL-2.1-or-later

COPYING.md says:

libmarisa and its command line tools are dual-licensed under the BSD 2-clause license and the LGPL.

I've often seen the term "dual-licensed" used to mean one can choose one license or the other, but it would be good to specify the interpretation you intended explicitly.

Error installing : Fatal Error

I am trying to install a package in windows 10 which requires marisa trie but i am facing below error. I tried installing as a standalone package as well but still throwing the same error.

> Searching for marisa_trie
> Reading https://pypi.org/simple/marisa_trie/
> Downloading https://files.pythonhosted.org/packages/20/95/d23071d0992dabcb61c948fb118a90683193befc88c23e745b050a29e7db/marisa-trie-0.7.5.tar.gz#sha256=c73bc25d868e8c4ea7aa7f1e19892db07bba2463351269b05340ccfa06eb2baf
> Best match: marisa-trie 0.7.5
> Processing marisa-trie-0.7.5.tar.gz
> Writing C:\Users\Gabriel\AppData\Local\Temp\easy_install-0neivqdq\marisa-trie-0.7.5\setup.cfg
> Running marisa-trie-0.7.5\setup.py -q bdist_egg --dist-dir C:\Users\Gabriel\AppData\Local\Temp\easy_install-0neivqdq\marisa-trie-0.7.5\egg-dist-tmp-5v_selap
> warning: no files found matching 'lib\COPYING'
> agent.cc
> keyset.cc
> trie.cc
> mapper.cc
> marisa-trie\lib\marisa/grimoire/io\mapper.cc(4): fatal error C1083: Cannot open include file: 'windows.h': No such file or directory
> error: Setup script exited with error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe' failed with exit status 2

Could you please check this

Warnings on 64-bit Windows

Hello,

I'm one of the maintainers of the Python bindings to marisa-trie. Recently we started testing the code on Windows and came across a known issue with the 64-bit version. The library compiles with a number of invalid-cast warnings, but most of the tests fail to pass, which may indicate that despite successful compilation, the library is broken.

Here's a sample of the compilation warnings:

lib/marisa/grimoire/vector\bit-vector.cc(771) : warning C4267: 'argument' : conversion from 'size_t' to 'const marisa::UInt32', possible loss of data
lib/marisa/grimoire/vector\bit-vector.cc(776) : warning C4267: 'argument' : conversion from 'size_t' to 'const marisa::UInt32', possible loss of data
lib/marisa/grimoire/vector\bit-vector.cc(815) : warning C4267: 'argument' : conversion from 'size_t' to 'const marisa::UInt32', possible loss of data

You can see full compilation log on AppVeyor.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.