s-yata / marisa-trie Goto Github PK
View Code? Open in Web Editor NEWMARISA: Matching Algorithm with Recursively Implemented StorAge
License: Other
MARISA: Matching Algorithm with Recursively Implemented StorAge
License: Other
marisa-trie/bindings/python/marisa.py
Line 12 in 006020c
The current python binding includes include imp
instruction, which will raise error with Python 3.12 or later because the imp
module has been removed with Python 3.12.
Downstream bug report: https://bugs.debian.org/1061809
When trying to regenerate the python binding with swig 4.2, the following warnings will appear:
-> % make swig-python
swig -Wall -c++ -python -outdir python marisa-swig.i
marisa-swig.h:55: Warning 321: 'str' conflicts with a built-in name in python
marisa-swig.h:56: Warning 321: 'id' conflicts with a built-in name in python
marisa-swig.h:69: Warning 321: 'str' conflicts with a built-in name in python
marisa-swig.h:70: Warning 321: 'id' conflicts with a built-in name in python
mv marisa-swig_wrap.cxx python
cp marisa-swig.cxx marisa-swig.h python
Please revise the Python binding and fix the warnings accordingly. Thanks!
Hello Susumu,
We are using the Python Marisa trie wrapper (https://github.com/kmike/marisa-trie) which implements your library. The amount of data we've been placing in the trie has been increasing over time and the most recent trie generation caused the following overflow:
File "marisa_trie.pyx", line 422, in marisa_trie.BytesTrie.init (src/marisa_trie.cpp:7670)
File "marisa_trie.pyx", line 127, in marisa_trie.Trie._build (src/marisa_trie.cpp:2768)
RuntimeError: lib/marisa/grimoire/trie/tail.cc:192: MARISA_SIZE_ERROR: buf.size() > MARISA_UINT32_MAX
If there's any more info you need please let me know!
Marisa works well on my local machine (perl 5.22). But on a machine with CentOS6 (perl 5.10.1) I get the following error:
perl sample.pl
Can't load '/usr/local/lib64/perl5/auto/marisa/marisa.so' for module marisa: /usr/local/lib64/perl5/auto/marisa/marisa.so: undefined symbol: _ZTVN10__cxxabiv120__si_class_type_infoE at /usr/lib64/perl5/DynaLoader.pm line 200.
at /usr/local/lib64/perl5/marisa.pm line 11
Compilation failed in require at sample.pl line 1.
BEGIN failed--compilation aborted at sample.pl line 1.
Any suggestions?
(PS had to upgrade to newer versions of autoconf and automake in order to compile marisa on CentOS6).
Can the ID be assigned in ascending lexicographical order?
For example, I hope that the ID corresponding to "bbb" is larger than the ID corresponding to "a".
Hope to provide an interface to return ID greater/smaller than a certain string.
The purpose is to transform the comparison of strings into comparison of cooresponding IDs.
I haven't read the code yet, and I don't know if this requirement is feasible in the current architecture.
Looking forward to any feedback, thank you very much.
Documentation (docs/readme.*.html
) must be updated for v0.2.5.
warning: ignoring file ./.libs/libcmdopt.a, file was built for archive which is not the architecture being linked (x86_64)。
I installed automake, libtool by homebrew.
marisa version: 0.2.5
I find the function _find
, but I think it is an internal function and I don't know how to use it, is it right ? _find(key, 0, 0, len(key))
Hi, at first I would like to express my appreciation for this great work! It is amazing how much compression can be achieved by using this technology!!
I would like to make a completely different implementation of the same idea with several modifications for use in my project Unishox. Are there any restrictions on using this idea for a different implementation and document it indicating the differences from this work? I could not find a white paper for marisa. Has it been patented?
Hi,
I am currently trying to insert every consecutive k-mer (subsequence of length k) of the human chromosome 21 into your trie structure. Since DNA has only a character alphabet of {A, C, G, T} I thought that might be a good use case for tries. With k=32 we are talking about ~55 million "words" to be inserted.
It seems that the bottleneck here is the marisa::Keyset
, since this data structure obviously gets way too big for the RAM. I never even make it to the building step of marisa::trie
. Is there a way to avoid building the full keyset first? Any chance to store data of that size with this library? I use pretty much the code from your README in section "Library->How to use".
Hello, I'm giving marisa a quick review as part of an Ubuntu Main Inclusion Request: https://bugs.launchpad.net/ubuntu/+source/marisa/+bug/1914808 This is just a quick review, nothing too deep.
Coverity reported a race condition in Mapper::open_()
:
marisa-trie/lib/marisa/grimoire/io/mapper.cc
Line 139 in 006020c
If another process were to swap out the file between the stat()
call and the open()
call, the size
might refer to a different file.
This probably isn't a big deal, since a process that could do that could do other silly things to break marisa, but I figured it's worth reporting all the same.
Coverity gives a lot of results, some look like mistakes in Coverity, some look like harmless issues (std::rand()
use in tests, exceptions bubbling up to main()
in tests, etc.). I'll attach the whole Coverity report to the Ubuntu bug, in case you're curious.
Thanks
lib/marisa/grimoire/io/mapper.cc:141:
1. fs_check_call: Calling function "stat" to perform check on "filename".
lib/marisa/grimoire/io/mapper.cc:141:
2. path: Condition "!(stat(filename, &st) != 0)", taking true branch.
lib/marisa/grimoire/io/mapper.cc:142:
3. path: Condition "!((marisa::UInt64)st.st_size > 18446744073709551615UL /* (size_t)~((size_t)0) */)", taking true branch.
lib/marisa/grimoire/io/mapper.cc:145:
4. toctou: Calling function "open" that uses "filename" after a check function. This can cause a time-of-check, time-of-use race condition.
hi @s-yata
Thank you for your excellent marisa-trie
.
I have a question, hope you can give some help or hint:
The document on http://s-yata.github.io/marisa-trie/docs/readme.en.html has this description:
If an input line contains horizontal tabs, the last one serves as the delimiter between a key and its weight which is used to optimize the order of nodes. Estimated frequency of each key, given as the weight, may improve the search performance.
I'm wonder if marisa-trie can keep the original predefined weight of a special key? I mean not the updated weight
.
while (trie.predictive_search(agent)) {
const marisa::Key key = agent.key();
key.weight(); // this `weight` is not the original weight
}
Can marisa-trie
provide another api to get the original predefined weight? That will be very helpful for many use-cases, eg: a word with predefined frequency as the weight in a large words dictionary.
It would be nice to be able to read and map (or mostly map and read the rest) data from 0.1.5, which I know uses a different format. Are the formats close enough that it would be a moderately small amount of extra code to read? At first glance, I see Trie
and MarisaTrie
have most fields in common, but 0.1.6 has labels_
and links_
while 0.2.6 has bases_
and extras_
. I didn't look at whether the formats of BitVector
or other classes changed.
Following up from #7 marisa::grimoire::trie::ReverseKey
takes advantage of pointer underflow to simplify the code. Unfortunately, it seems that the pointer underflow behavior is undefined. Clang, for example, exhibits a different behavior than GCC.
A quick search suggests that relying on this behavior opens up marisa-trie
to compiler-specific bugs and security exploits.
The purpose of this issue is to track these consequences and discuss possible fixes.
From Show at Stack Overflow on the behavior in C++:
The standard specifies in §5.7/4 that:
When an expression that has integral type is added to or subtracted from a pointer, the result has the type of the pointer operand. If the pointer operand points to an element of an array object, and the array is large enough, the result points to an element offset from the original element such that the difference of the subscripts of the resulting and original array elements equals the integral expression. [...] Moreover, if the expression P points to the last element of an array object, the expression (P)+1 points one past the last element of the array object, and if the expression Q points one past the last element of an array object, the expression (Q)-1 points to the last element of the array object. If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behaviour is undefined.
A similar quote can be found for subtraction in §5.7/5. So given that overflow and underflow are special cases of pointers that exceed the bounds of object to which they originally point to, the behaviour would simply be undefined.
From rampion at Stack Overflow on the behavior in C99:
From section 6.5.6 of the combined C99 + TC1 + TC2 (pdf):
If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.
Some resources:
Would be nice to tag the release in git.
Segmentation fault in marisa-test
occurs if Marisa is configured with CXXFLAGS="-O3"
.
Problem does not occur for me with additional -march=westmere
or -march=native
(e.g. CXXFLAGS="-O3 -march=westmere"
).
I use GCC 10.2.0.
$ CXXFLAGS="-g -O3" ./configure
...
$ make
...
$ make check
...
make check-TESTS
make[2]: Entering directory '/tmp/marisa-trie/tests'
make[3]: Entering directory '/tmp/marisa-trie/tests'
PASS: base-test
PASS: io-test
PASS: vector-test
PASS: trie-test
../test-driver: line 109: 14027 Segmentation fault "$@" > $log_file 2>&1
FAIL: marisa-test
============================================================================
Testsuite summary for marisa 0.2.6
============================================================================
# TOTAL: 5
# PASS: 4
# SKIP: 0
# XFAIL: 0
# FAIL: 1
# XPASS: 0
# ERROR: 0
============================================================================
See tests/test-suite.log
Please report to [email protected]
============================================================================
make[3]: *** [Makefile:715: test-suite.log] Error 1
make[3]: Leaving directory '/tmp/marisa-trie/tests'
make[2]: *** [Makefile:823: check-TESTS] Error 2
make[2]: Leaving directory '/tmp/marisa-trie/tests'
make[1]: *** [Makefile:925: check-am] Error 2
make[1]: Leaving directory '/tmp/marisa-trie/tests'
make: *** [Makefile:456: check-recursive] Error 1
$ LD_LIBRARY_PATH="$(pwd)/lib/marisa/.libs" gdb tests/.libs/marisa-test
...
(gdb) r
Starting program: /tmp/marisa-trie/tests/.libs/marisa-test
marisa-test.cc:13: TestEmptyTrie(): ok
marisa-test.cc:87: TestTinyTrie(): ok
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7fab3c5 in marisa::grimoire::algorithm::details::get_label<marisa::grimoire::trie::Entry> (depth=1, unit=...)
at ../../../../lib/marisa/grimoire/algorithm/sort.h:19
19 return (depth < unit.length()) ? (int)(UInt8)unit[depth] : -1;
(gdb) bt
#0 0x00007ffff7fab3c5 in marisa::grimoire::algorithm::details::get_label<marisa::grimoire::trie::Entry> (depth=1, unit=...)
at ../../../../lib/marisa/grimoire/algorithm/sort.h:19
#1 marisa::grimoire::algorithm::details::sort<marisa::grimoire::trie::Entry*> (l=l@entry=0x55555557c110, r=r@entry=0x55555557c660, depth=depth@entry=1)
at ../../../../lib/marisa/grimoire/algorithm/sort.h:103
#2 0x00007ffff7fab4ea in marisa::grimoire::algorithm::details::sort<marisa::grimoire::trie::Entry*> (l=0x55555557aed0, r=0x55555557dce0,
depth=depth@entry=0) at ../../../../lib/marisa/grimoire/algorithm/sort.h:132
#3 0x00007ffff7fac2c3 in marisa::grimoire::algorithm::sort<marisa::grimoire::trie::Entry*> (end=<optimized out>, begin=<optimized out>)
at ../../../../lib/marisa/grimoire/algorithm/sort.h:189
#4 marisa::grimoire::Algorithm::sort<marisa::grimoire::trie::Entry*> (end=<optimized out>, begin=<optimized out>, this=<optimized out>)
at ../../../../lib/marisa/grimoire/algorithm.h:15
#5 marisa::grimoire::trie::Tail::build_ (this=0x7fffffffd130, entries=..., offsets=0x7fffffffd310, mode=MARISA_TEXT_TAIL) at tail.cc:161
#6 0x00007ffff7facde7 in marisa::grimoire::trie::Tail::build (this=this@entry=0x7fffffffd918, entries=..., offsets=offsets@entry=0x7fffffffd310,
mode=MARISA_TEXT_TAIL) at tail.cc:41
#7 0x00007ffff7fb569b in marisa::grimoire::trie::LoudsTrie::build_next_trie<marisa::grimoire::trie::Key> (this=0x7fffffffd630, keys=...,
terminals=0x7fffffffd310, config=..., trie_id=1) at ../../../../lib/marisa/grimoire/trie/config.h:35
#8 0x00007ffff7fbea57 in marisa::grimoire::trie::LoudsTrie::build_trie<marisa::grimoire::trie::Key> (this=this@entry=0x7fffffffd630, keys=...,
terminals=terminals@entry=0x7fffffffd4c0, config=..., trie_id=trie_id@entry=1) at louds-trie.cc:300
#9 0x00007ffff7fb5940 in marisa::grimoire::trie::LoudsTrie::build_ (this=0x7fffffffd630, keyset=..., config=...) at louds-trie.cc:257
#10 0x00007ffff7fb69e3 in marisa::grimoire::trie::LoudsTrie::build (this=this@entry=0x555555576bd0, keyset=..., flags=flags@entry=135169)
at louds-trie.cc:27
#11 0x00007ffff7fa975a in marisa::Trie::build (this=this@entry=0x7fffffffdb38, keyset=..., config_flags=config_flags@entry=135169) at trie.cc:16
#12 0x000055555555ada4 in (anonymous namespace)::TestTrie (num_tries=1, tail_mode=MARISA_TEXT_TAIL, node_order=MARISA_WEIGHT_ORDER, keyset=...)
at marisa-test.cc:268
#13 0x000055555555bd0a in (anonymous namespace)::TestTrie (tail_mode=MARISA_TEXT_TAIL, node_order=MARISA_WEIGHT_ORDER, keyset=...) at marisa-test.cc:357
#14 0x000055555555be2e in (anonymous namespace)::TestTrie (tail_mode=MARISA_TEXT_TAIL) at marisa-test.cc:367
#15 0x0000555555559b7e in (anonymous namespace)::TestTrie () at marisa-test.cc:372
#16 main () at marisa-test.cc:383
(gdb)
how do u perform deletion on keys? or replacement of key-value?
As already noted in (closed) issue #18 , marisa::Keyset
poses an input size (memory) bottleneck for trie construction.
dawgdic has a DawgBuilder class which accepts input in (LC_ALL=C, i.e. memcmp) sorted order, and builds their data structure directly from that to avoid the need for an in-memory input record buffer (which is what I understand Marisa's marisa::Keyset
to be).
Since we have good tools for scalable sorting (even out-of-core; e.g. GNU sort), is is possible a similar MarisaBuilder could accept input in some preprocessed (e.g. sorted) order, and avoid the need for a marisa::Keyset
? I think some of us would be willing to externally preprocess our input data if so.
Due to the fact riscv64 is not explicitly mentioned in include/marisa/base.h, MARISA_WORD_SIZE ends up being incorrectly set to 32 on this architecture. Fortunately, this is trivial to address - just add
|| (defined(__riscv) && (__riscv_xlen == 64))
to the end of the #if statement in base.h:31-34.
Confirmed to work on a physical rv64 Linux system.
References:
cpp -dM
on an rv64 systemThese might be helpful for people unfamiliar with autoconf & automake.
We've managed to spot one more issue with Windows support in marisa-trie
. Trie.write
fails with
lib/marisa/grimoire/io\mapper.cc:127: MARISA_IO_ERROR: file_ == INVALID_HANDLE_VALUE
You can see the Python test case causing the error here.
README.md says:
Lines 58 to 61 in 006020c
It is not clear whether this means one must obey the terms of both:
or either:
COPYING.md says:
Line 3 in 006020c
I've often seen the term "dual-licensed" used to mean one can choose one license or the other, but it would be good to specify the interpretation you intended explicitly.
I am trying to install a package in windows 10 which requires marisa trie but i am facing below error. I tried installing as a standalone package as well but still throwing the same error.
> Searching for marisa_trie
> Reading https://pypi.org/simple/marisa_trie/
> Downloading https://files.pythonhosted.org/packages/20/95/d23071d0992dabcb61c948fb118a90683193befc88c23e745b050a29e7db/marisa-trie-0.7.5.tar.gz#sha256=c73bc25d868e8c4ea7aa7f1e19892db07bba2463351269b05340ccfa06eb2baf
> Best match: marisa-trie 0.7.5
> Processing marisa-trie-0.7.5.tar.gz
> Writing C:\Users\Gabriel\AppData\Local\Temp\easy_install-0neivqdq\marisa-trie-0.7.5\setup.cfg
> Running marisa-trie-0.7.5\setup.py -q bdist_egg --dist-dir C:\Users\Gabriel\AppData\Local\Temp\easy_install-0neivqdq\marisa-trie-0.7.5\egg-dist-tmp-5v_selap
> warning: no files found matching 'lib\COPYING'
> agent.cc
> keyset.cc
> trie.cc
> mapper.cc
> marisa-trie\lib\marisa/grimoire/io\mapper.cc(4): fatal error C1083: Cannot open include file: 'windows.h': No such file or directory
> error: Setup script exited with error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio 14.0\\VC\\BIN\\x86_amd64\\cl.exe' failed with exit status 2
Could you please check this
Hello,
I'm one of the maintainers of the Python bindings to marisa-trie
. Recently we started testing the code on Windows and came across a known issue with the 64-bit version. The library compiles with a number of invalid-cast warnings, but most of the tests fail to pass, which may indicate that despite successful compilation, the library is broken.
Here's a sample of the compilation warnings:
lib/marisa/grimoire/vector\bit-vector.cc(771) : warning C4267: 'argument' : conversion from 'size_t' to 'const marisa::UInt32', possible loss of data
lib/marisa/grimoire/vector\bit-vector.cc(776) : warning C4267: 'argument' : conversion from 'size_t' to 'const marisa::UInt32', possible loss of data
lib/marisa/grimoire/vector\bit-vector.cc(815) : warning C4267: 'argument' : conversion from 'size_t' to 'const marisa::UInt32', possible loss of data
You can see full compilation log on AppVeyor.
Hi,
Did the serialization format change between 0.1.5 and 0.2.4?
What were the changes?
Thanks, Gleb
For the 0.2.5 release you attached the archive as a release download. Great!
For the 0.2.6 release the archive is instead in the release description. This does not give it the same predictable URL as a release download. Could you please add it as a release download as well, and use release downloads for all future versions?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.