
Tongrams - Tons of N-Grams

Tongrams is a C++ library to index and query large language models in compressed space, as described in the papers [1] and [2] listed in the Bibliography below, by Giulio Ermanno Pibiri and Rossano Venturini. Please cite these papers if you use Tongrams.

NEWS!

  • The language model estimation library is available here.
  • A Rust implementation by kampersanda is available here.

Introduction

More specifically, the implemented data structures can be used to map N-grams to their corresponding (integer) frequency counts or to (floating point) probabilities and backoffs for backoff-interpolated Kneser-Ney models.

The library features a compressed trie data structure in which N-grams are assigned integer identifiers (IDs) and compressed with Elias-Fano so as to support efficient searches within compressed space. The context-based remapping of such identifiers makes it possible to encode a word following a context of fixed length k, i.e., its preceding k words, with an integer whose value is bounded by the number of words that follow that context, rather than by the size of the whole vocabulary (the number of uni-grams). In addition to the trie data structure, the library can build models based on minimal perfect hashing (MPH), for constant-time retrieval.
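
To see why remapping identifiers to a smaller universe pays off, the sketch below applies the standard Elias-Fano space bound of roughly n·⌈log2(u/n)⌉ + 2n bits for a monotone sequence of n integers over a universe of size u. This is a generic back-of-the-envelope illustration with made-up numbers, not part of the Tongrams API.

#include <cmath>
#include <cstdint>
#include <iostream>

// Approximate Elias-Fano space for n monotone integers in [0, u):
// each element costs ceil(log2(u/n)) "low" bits plus about 2 "high" bits.
uint64_t elias_fano_bits(uint64_t n, uint64_t u) {
    uint64_t low = (u > n) ? static_cast<uint64_t>(std::ceil(std::log2(double(u) / double(n)))) : 0;
    return n * (low + 2);
}

int main() {
    uint64_t n = 1000;  // hypothetical gram-ID sequence length
    std::cout << elias_fano_bits(n, 1000000) << " bits with IDs drawn from the whole vocabulary\n";
    std::cout << elias_fano_bits(n, 10000)   << " bits with remapped (context-bounded) IDs\n";
}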

When used to store frequency counts, the data structures support a lookup() operation that returns the number of occurrences of the specified N-gram. When used to store probabilities and backoffs, they instead implement a score() function that, given a text as input, computes the perplexity of the text.
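
For a flavor of the C++ API, here is a minimal sketch that loads a serialized count model and looks up an N-gram. It is assembled from calls that appear elsewhere on this page (tongrams::util::load and model.lookup with an stl_string_adaptor); the required headers and the concrete Model type are not documented on this page and are therefore left as assumptions. Perplexity scoring is normally exercised through the score executable described later.

// Sketch only: include the appropriate Tongrams headers and pick a concrete
// model type (both omitted here because they are not documented on this page).
#include <cstdint>
#include <string>

template <typename Model>
uint64_t count_of(const std::string& binary_file, const std::string& gram) {
    Model model;
    tongrams::util::load(model, binary_file);  // load a serialized data structure
    stl_string_adaptor adaptor;                // adapts std::string for lookup()
    return model.lookup(gram, adaptor);        // frequency count of the N-gram
}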

This guide is meant to provide a brief overview of the library and to illustrate its functionalities through some examples.

Building the code

The code has been tested on Linux Ubuntu with gcc 5.4.1, 7.3.0, 8.3.0, 9.0.0; Mac OS X El Capitan with clang 7.3.0; Mac OS X Mojave with clang 10.0.0.

The following dependencies are needed for the build: CMake and Boost.

If you have cloned the repository without --recursive, you will need to perform the following commands before building:

git submodule init
git submodule update

To build the code on Unix systems (see file CMakeLists.txt for the used compilation flags), it is sufficient to do the following.

mkdir build
cd build
cmake ..
make

You can enable parallel compilation by specifying the number of jobs, e.g., make -j4.

For best performance, compile as follows.

cmake .. -DCMAKE_BUILD_TYPE=Release  -DTONGRAMS_USE_SANITIZERS=OFF -DEMPHF_USE_POPCOUNT=ON -DTONGRAMS_USE_POPCNT=ON -DTONGRAMS_USE_PDEP=ON
make

For a debug environment, compile as follows instead.

cmake .. -DCMAKE_BUILD_TYPE=Debug -DTONGRAMS_USE_SANITIZERS=ON
make

Unless otherwise specified, for the rest of this guide we assume that the terminal commands of the following examples are typed from the newly created build directory.

Input data format

The N-gram counts files follow the Google format, i.e., one separate file for each distinct value of N (order) listing one gram per row. We enrich this format with a file header indicating the total number of N-grams in the file (rows):

<total_number_of_rows>
<gram1> <TAB> <count1>
<gram2> <TAB> <count2>
<gram3> <TAB> <count3>
...

Such N files must be named according to the following convention: <order>-grams, where <order> is a placeholder for the value of N. The files can be left unsorted if only MPH-based models have to be built, whereas they must be sorted in prefix order for trie-based data structures, according to the chosen vocabulary mapping, which should be represented by the uni-gram file (see Subsection 3.1 of [1]). Compressing the input files with standard utilities, such as gzip, is highly recommended. The utility sort_grams can be used to sort the N-gram counts files in prefix order. In summary, the data structures storing frequency counts are built from a directory containing the files

  • 1-grams.sorted.gz
  • 2-grams.sorted.gz
  • 3-grams.sorted.gz
  • ...

formatted as explained above.
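
For concreteness, a tiny (made-up) 2-grams file in this format could look as follows, with a TAB between each gram and its count; the header holds the number of rows. It is shown here in plain alphabetical order for readability, but the actual prefix order depends on the vocabulary mapping of the 1-grams file, and the sort_grams utility mentioned above can be used to produce it.

4
in the	28
of the	35
the cat	12
the dog	7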

The file listing N-gram probabilities and backoffs conforms, instead, to the ARPA file format. The N-grams in the ARPA file must be sorted in suffix order to build the reversed trie data structure. The utility sort_arpa can be used for that purpose.
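
For reference, an invocation of sort_arpa reported verbatim in the issues further down this page is shown below; the arguments appear to be the order, the input ARPA file, the uni-gram vocabulary file and the output file, but take that reading as an assumption rather than documentation.

./sort_arpa 5 ../test_data/arpa ../test_data/1-grams.sorted.gz ./arpa.sorted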

The directory test_data contains:

  • all N-gram counts files (for a total of 252,550 N-grams), for N going from 1 to 5, extracted from Agner Fog's manual Optimizing software in C++, sorted in prefix order and compressed with gzip;
  • the query file queries.random.5K comprising 5,000 N-grams (1,000 for each order, drawn at random);
  • the ARPA file arpa, which lists all N-grams sorted in suffix order so as to build backward tries efficiently;
  • the sample_text query file (6,075 sentences for a total of 153,583 words) used for the perplexity benchmark; its companion sample_text.LESSER file includes just the first 10 sentences.

For the following examples, we assume we are working with the sample data contained in test_data.

Building the data structures

The two executables build_trie and build_hash are used to build trie-based and (minimal perfect) hash-based language models, respectively. Run the executables without any arguments to see their expected usage.

We now show some examples.

Example 1

The command

./build_trie ef_trie 5 count --dir ../test_data --out ef_trie.count.bin

builds an Elias-Fano trie

  • of order 5;
  • that stores frequency counts;
  • from the N-gram counts files contained in the directory test_data;
  • with no context-based remapping (default);
  • whose counts ranks are encoded with the indexed codewords (IC) technique (default);
  • that is serialized to the binary file ef_trie.count.bin.

Example 2

The command

./build_trie pef_trie 5 count --dir ../test_data --remapping 1 --ranks PSEF  --out pef_trie.count.out

builds a partitioned Elias-Fano trie

  • of order 5;
  • that stores frequency counts;
  • from the N-gram counts files contained in the directory test_data;
  • with context-based remapping of order 1;
  • whose counts ranks are encoded with prefix sums (PS) + Elias-Fano (EF);
  • that is serialized to the binary file pef_trie.count.out.

Example 3

The command

./build_trie ef_trie 5 prob_backoff --remapping 2 --u -20.0 --p 8 --b 8 --arpa ../test_data/arpa --out ef_trie.prob_backoff.bin

builds an Elias-Fano trie

  • of order 5;
  • that stores probabilities and backoffs;
  • with context-based remapping of order 2;
  • with <unk> probability of -20.0 and using 8 bits for quantizing probabilities (--p) and backoffs (--b);
  • from the arpa file named arpa;
  • that is serialized to the binary file ef_trie.prob_backoff.bin.

Example 4

The command

./build_hash 5 8 count --dir ../test_data --out hash.bin

builds an MPH-based model

  • of order 5;
  • that uses 8 bytes per hash key;
  • that stores frequency counts;
  • from the N-gram counts files contained in the directory test_data;
  • that is serialized to the binary file hash.bin.

Tests

The test directory contains the unit tests of some of the fundamental building blocks used by the implemented data structures. As usual, running the executables without any arguments will show the list of their expected input parameters. Examples:

./test_compact_vector 10000 13
./test_fast_ef_sequence 1000000 128

The directory also contains the unit test for the data structures storing frequency counts, which validates the implementation by checking that each count stored in the data structure is the same as the one provided in the input files from which the data structure was previously built. Example:

./test_count_model ef_trie.count.bin ../test_data

where ef_trie.count.bin is the name of the data structure binary file (for example, built with the command shown in Example 1) and test_data is the name of the folder containing the input N-gram counts files.

Benchmarks

For the examples in this section, we used a desktop machine running Mac OS X Mojave, equipped with a 2.3 GHz Intel Core i5 processor (referred to as Desktop Mac). The code was compiled with Apple LLVM version 10.0.0 clang with all optimizations (see section Building the code). We additionally replicate some experiments with an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz, under Ubuntu 19.04, 64 bits (referred to as Server Linux). In this case the code was compiled with gcc 8.3.0.

For a data structure storing frequency counts, we can test the speed of lookup queries by using the benchmark program lookup_perf_test. In the following example, we show how to build and benchmark three different data structures: EF-Trie with no remapping, EF-RTrie with remapping order 1 and PEF-RTrie with remapping order 2 (we use the same names for the data structures as presented in [1]). Each experiment is repeated 1,000 times over the test query file queries.random.5K. The benchmark program lookup_perf_test will show mean time per run and mean time per query (along with the total number of N-grams, total bytes of the data structure and bytes per N-gram).

./build_trie ef_trie 5 count --dir ../test_data --out ef_trie.bin
./lookup_perf_test ef_trie.bin ../test_data/queries.random.5K 1000

./build_trie ef_trie 5 count --remapping 1 --dir ../test_data --out ef_trie.r1.bin
./lookup_perf_test ef_trie.r1.bin ../test_data/queries.random.5K 1000

./build_trie pef_trie 5 count --remapping 2 --dir ../test_data --out pef_trie.r2.bin
./lookup_perf_test pef_trie.r2.bin ../test_data/queries.random.5K 1000

The results of this (micro) benchmark are summarized in the following table.

| Data structure | Remapping order | Bytes × gram  | µs × query (Desktop Mac) | µs × query (Server Linux) |
|----------------|-----------------|---------------|--------------------------|---------------------------|
| EF-Trie        | 0               | 2.40          | 0.435                    | 0.316                     |
| EF-RTrie       | 1               | 1.93 (-19.7%) | 0.583                    | 0.428                     |
| PEF-RTrie      | 2               | 1.75 (-26.9%) | 0.595                    | 0.427                     |

For a data structure storing probabilities and backoffs, we can instead test the speed of scoring a text file by using the benchmark program score. A complete example follows.

./build_trie ef_trie 5 prob_backoff --u -10.0 --p 8 --b 8 --arpa ../test_data/arpa --out ef_trie.prob_backoff.8.8.bin
./score ef_trie.prob_backoff.8.8.bin ../test_data/sample_text

The first command will build the data structure, the second one will score the text file sample_text contained in test_data. The input text file must contain one sentence per line, with words separated by spaces. During the scoring of the file, we do not wrap each sentence with markers <s> and </s>.
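
As a reminder of what the reported perplexity means, the sketch below computes it from per-word log10 probabilities in the standard way (PPL is 10 raised to minus the average log10 probability). This is a generic illustration with toy numbers, not Tongrams' internal scoring code.

#include <cmath>
#include <iostream>
#include <vector>

// Standard perplexity from log10 word probabilities:
// PPL = 10 ^ ( - (sum of log10 p) / number of scored tokens )
double perplexity(const std::vector<double>& log10_probs) {
    double sum = 0.0;
    for (double lp : log10_probs) sum += lp;
    return std::pow(10.0, -sum / static_cast<double>(log10_probs.size()));
}

int main() {
    std::cout << perplexity({-1.2, -0.8, -2.5}) << "\n";  // toy example, prints ~31.6
}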

An example output could be (OOV stands for Out Of Vocabulary):

perplexity including OOVs = 493720.19
perplexity excluding OOVs = 1094.2574
OOVs = 55868
corpus tokens = 153583
corpus sentences = 6075
elapsed time: 0.037301 [sec]

Statistics

The executable print_stats can be used to gather useful statistics regarding the space usage of the various data structure components (e.g., gram-ID and pointer sequences for tries), as well as structural properties of the indexed N-gram dataset (e.g., number of unique counts, min/max range lengths, average gap of gram-ID sequences, etc.).

As an example, the following command:

./print_stats data_structure.bin

will show the statistics for the data structure serialized to the file data_structure.bin.

Python Wrapper

The directory python includes a simple Python wrapper with some examples. Check it out!

Authors

Bibliography

  • [1] Giulio Ermanno Pibiri and Rossano Venturini. Efficient Data Structures for Massive N-Gram Datasets. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017): 615-624.
  • [2] Giulio Ermanno Pibiri and Rossano Venturini. Handling Massive N-Gram Datasets Efficiently. ACM Transactions on Information Systems (TOIS) 37.2 (2019): 1-41.

Contributors

jermp, kolanich, nadavb, wincentbalin


tongrams's Issues

Trying build_trie with arpa file.

Given the following command:

 ./build_trie ef_trie 3 prob_backoff --remapping 2 --u -20.0 --p 8 --b 8 --arpa lmclean.arpa   --out ef_trie.prob_backoff.bin

Getting the error:

arpa file contains wrong data:
        'السعدي' should have been found within previous order grams

Where I am sure it exists:

grep "السعدي" lmclean.arpa
-5.097526       السعدي  -0.3184454

Does it not work with non-ASCII characters?

sort_grams - found the bug causing the exception

Apparently, sort_grams fails when the 3rd argument is a full path to a file.
So doing

./sort_grams 1-grams.sorted.gz 1-grams.sorted.gz 1-grams.sorted.again.txt

Works, but

./sort_grams 1-grams.sorted.gz 1-grams.sorted.gz /usr/tongrams/1-grams.sorted.done.txt

Fails with

terminate called without an active exception
Aborted (core dumped)

Remove dependency from boost

Currently, boost is used:

  • for the preprocessor's for_each;
  • for memory mapped files;
  • for iterating through gzipped files.

an error when I try python tongrams

Hi Giulio Ermanno Pibiri,

I installed the Python wrapper, but when I import tongrams, I get this error:

import tongrams
ImportError: /opt/conda/lib/python3.7/site-packages/tongrams.cpython-37m-x86_64-linux-gnu.so: undefined symbol: _ZN8tongrams12trie_prob_lmINS_18double_valued_mphtIN5emphf16jenkins64_hasherEEENS_15identity_mapperENS_29quantized_sequence_collectionENS_14compact_vectorENS_16fast_ef_sequenceENS_11ef_sequenceEE5scoreERNS_16prob_model_stateImEESt4pairIPKhSG_ERb

Could you help me solve this error?
Thanks in advance.

format for vocabulary file

Thank you for your amazing work @jermp.
I am trying to sort an ARPA file with the following command:
./sort_arpa 3 input.arpa vocab.txt output.arpa
It caused the following error:

Building vocabulary
terminate called after throwing an instance of 'std::runtime_error'
what(): first line must be non-empty and contain the number of lines.
Aborted (core dumped)

I think the problem is that my vocab.txt is not in the correct format. Could you give me an example vocab file?

Thank you in advance.

lookup() - Segmentation fault when ngram is not in data structure

I am trying to use Tongrams, and to eventually write a python wrapper for it. For now, I created an Eclipse CDT project from the cmake files using: cmake -G "Eclipse CDT4 - Unix Makefiles" ./

I created the data structure (pef_trie) from the test set. Now when I try to look up an ngram which is not found, I get a segmentation fault:

stl_string_adaptor adaptor;
uint64_t value1 = model.lookup("or compilation before it can", adaptor); // Works well
std::cout << value1 << std::endl;
uint64_t value2 = model.lookup("or compilation before it or", adaptor); // Segmentation Fault

Compile fails on gcc 4.9, Debian Jessie

Hi y'all,

thanks for open-sourcing this code. I am looking into using Tongrams to compress 34 million trigrams for Icelandic. However, when I try to compile the code using the standard gcc 4.9 that comes with Debian Jessie, I get multiple instances of the following error or similar ones:

(p357) villi@brandur:~/github/tongrams/build$ make -j4
[ 21%] [ 21%] [ 21%] [ 21%] Built target compact_vector_test
Building CXX object CMakeFiles/build_mph_lm.dir/build_mph_lm.cpp.o
Building CXX object CMakeFiles/build_trie_lm.dir/build_trie_lm.cpp.o
Building CXX object CMakeFiles/check_count_model.dir/test/check_count_model.cpp.o
[ 26%] Built target fast_ef_sequence_test
Scanning dependencies of target hash_compact_vector_test
[ 31%] Building CXX object CMakeFiles/hash_compact_vector_test.dir/test/hash_compact_vector_test.cpp.o
In file included from /home/villi/github/tongrams/test/../mph_count_lm.hpp:6:0,
                 from /home/villi/github/tongrams/test/../lm_types.hpp:11,
                 from /home/villi/github/tongrams/test/check_count_model.cpp:4:
/home/villi/github/tongrams/test/../utils/parsers.hpp: In constructor ‘tongrams::arpa_parser::arpa_parser(const char*)’:
/home/villi/github/tongrams/test/../utils/parsers.hpp:110:31: error: use of deleted function ‘std::basic_ifstream<char>::basic_ifstream(const std::basic_ifstream<char>&)’
             , m_cur_line_num(0)
                               ^

Any hints? Is this not supposed to work on gcc 4.9? Upgrading gcc in Jessie is a bit of a pain, BTW.

Can't load MPH-based models in Python

Created MPH-based model with

./build_hash  3 8 count --dir ./trie_data/ --out hash.bin

It works with the given executables, test_count_model and print_stats,
but when I try to load it in Python I get the following error:

==== tongrams binary format ====
library version: 1.0
data structure type: mph
hash_key_bytes: 8
value type: count
================================
Loading data structure type: mph64_count_lm
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Python Code:

import tongrams

count_model_filename = "./hash.bin"
count_model = tongrams.CountModel(count_model_filename)

sort_arpa can't work

Hello

I have run into a problem. When I run the sort_arpa test command
./sort_arpa 5 ../test_data/arpa ../test_data/1-grams.sorted.gz ./arpa.sorted

an error occurred:

Sorting with 100% of available RAM (8260710400/8260710400)
2022-05-10 11:00:28: Building vocabulary
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
#1 0x00007ffff72707f1 in __GI_abort () at abort.c:79
#2 0x00007ffff78c5957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff78cbae6 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff78cbb21 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff78cbd54 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff78cc2dc in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x000055555556ddcb in __gnu_cxx::new_allocator::allocate (this=0x7fffffffd418, __n=6608568320) at /usr/include/c++/7/ext/new_allocator.h:111
#8 std::allocator_traits<std::allocator >::allocate (__a=..., __n=6608568320) at /usr/include/c++/7/bits/alloc_traits.h:436
#9 std::_Vector_base<unsigned char, std::allocator >::_M_allocate (this=0x7fffffffd418, __n=6608568320) at /usr/include/c++/7/bits/stl_vector.h:172
#10 std::vector<unsigned char, std::allocator >::_M_allocate_and_copy<std::move_iterator<unsigned char*> > (this=0x7fffffffd418, __last=..., __first=..., __n=6608568320)
at /usr/include/c++/7/bits/stl_vector.h:1260
#11 std::vector<unsigned char, std::allocator >::reserve (__n=6608568320, this=0x7fffffffd418) at /usr/include/c++/7/bits/vector.tcc:73
#12 tongrams::grams_counts_pool::grams_counts_pool (num_bytes=6608568320, this=0x7fffffffd3f0) at /tongrams/include/utils/pools.hpp:104
#13 tongrams::build_vocabulary (vocab_filename=0x5555557a5670 "../test_data/1-grams.sorted.gz", vocab=..., bytes=6608568320) at /tongrams/include/sorters/sorter_common.hpp:12
#14 0x000055555556a692 in main (argc=, argv=) at /tongrams/src/sort_arpa.cpp:55

I need your help. Thanks!

Using Tongrams

Hi!
Can you list what operations this data structure provides?
Is the data structure internally sorted by the weights, so we can easily, in O(k), find the top-k next tokens of an ngram prefix?
Can this data structure be used from Python?

SIGABRT Crash

I'm getting this error while running the program

System Configuration:

  • Ubuntu : 20.04
  • gcc : g++ (Ubuntu 9.3.0-10ubuntu2) 9.3.0
  • boost : libboost-all-dev 1.71.0.0ubuntu2

Input:

./build_trie ef_trie 5 count --dir ../test_data --out ef_trie.count.bin

Output:

2020-07-15 10:47:41: Reading 1-grams counts
2020-07-15 10:47:41: Reading 2-grams counts
2020-07-15 10:47:41: Reading 3-grams counts
2020-07-15 10:47:41: Reading 4-grams counts
2020-07-15 10:47:41: Reading 5-grams counts
2020-07-15 10:47:41: Building vocabulary
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

GDB Backtrace Log:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7bcf859 in __GI_abort () at abort.c:79
#2  0x00007ffff7e55951 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff7e6147c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007ffff7e614e7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007ffff7e61799 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007ffff7e55562 in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#7  0x0000555555592e4f in __gnu_cxx::new_allocator<unsigned char>::allocate (this=0x7fffffffcd28, __n=33016750080)
    at /usr/include/c++/9/ext/new_allocator.h:102
#8  std::allocator_traits<std::allocator<unsigned char> >::allocate (__a=..., __n=33016750080)
    at /usr/include/c++/9/bits/alloc_traits.h:444
#9  std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_allocate (this=0x7fffffffcd28, __n=33016750080)
    at /usr/include/c++/9/bits/stl_vector.h:343
#10 std::vector<unsigned char, std::allocator<unsigned char> >::reserve (this=0x7fffffffcd28, __n=33016750080)
    at /usr/include/c++/9/bits/vector.tcc:78
#11 0x0000555555592edd in tongrams::grams_counts_pool::grams_counts_pool (this=0x7fffffffcd00, num_bytes=<optimized out>)
    at /home/ubuntu/tongrams/include/utils/pools.hpp:104
#12 0x00005555555e7ec3 in tongrams::trie_count_lm<tongrams::single_valued_mpht<tongrams::hash_compact_vector<unsigned long>, emphf::jenkins64_hasher>, tongrams::identity_mapper, tongrams::sequence_collection, tongrams::indexed_codewords_sequence, tongrams::fast_ef_sequence, tongrams::ef_sequence>::builder::build_vocabulary (this=0x7fffffffdee0, counts_builder=...)
    at /home/ubuntu/tongrams/include/trie_count_lm.hpp:120
#13 0x00005555555eb33e in tongrams::trie_count_lm<tongrams::single_valued_mpht<tongrams::hash_compact_vector<unsigned long>, emphf::jenkins64_hasher>, tongrams::identity_mapper, tongrams::sequence_collection, tongrams::indexed_codewords_sequence, tongrams::fast_ef_sequence, tongrams::ef_sequence>::builder::builder (this=0x7fffffffdee0, input_dir=<optimized out>, order=<optimized out>, 
    remapping_order=<optimized out>) at /usr/include/c++/9/ext/new_allocator.h:89
#14 0x0000555555585f77 in main (argc=<optimized out>, argv=<optimized out>) at /home/ubuntu/tongrams/src/build_trie.cpp:159

Sequence is not sorted

Got the following files:

ls trie_data/
1-grams.sorted.gz  2-grams.sorted.gz  3-grams.sorted.gz

I am trying the command:

./build_trie  ef_trie 3 count --dir ./trie_data/ --out ef_trie.count.bin

But getting the error

error at position 23/186616
360087 < 400844
terminate called after throwing an instance of 'std::runtime_error'
  what():  sequence is not sorted

I did use the sort_grams command on the N-gram files, but I am still getting the error.

Can't compile tongrams

Hi I get an error when trying to compile the code:

In file included from /SSD/pedros-corner/tongrams/include/utils/util.hpp:19:0,
                 from /SSD/pedros-corner/tongrams/test/test_count_model.cpp:3:
/SSD/pedros-corner/tongrams/include/../external/essentials/include/essentials.hpp: In member function ‘typename std::enable_if<std::is_pod<_Tp>::value>::type essentials::sizer::visit(T&)’:
/SSD/pedros-corner/tongrams/include/../external/essentials/include/essentials.hpp:361:62: error: must #include <typeinfo> before using typeid
    node n(pod_bytes(val), m_current->depth + 1, typeid(T).name());
/SSD/pedros-corner/tongrams/include/../external/essentials/include/essentials.hpp: In member function ‘typename std::enable_if<std::is_pod<_Tp>::value>::type essentials::sizer::visit(std::vector<_RealType>&)’:
/SSD/pedros-corner/tongrams/include/../external/essentials/include/essentials.hpp:374:37: error: must #include <typeinfo> before using typeid
    typeid(std::vector<T>).name());
/SSD/pedros-corner/tongrams/include/../external/essentials/include/essentials.hpp: In member function ‘typename std::enable_if<(! std::is_pod<_Tp>::value)>::type essentials::sizer::visit(std::vector<_RealType>&)’:
/SSD/pedros-corner/tongrams/include/../external/essentials/include/essentials.hpp:385:50: error: must #include <typeinfo> before using typeid
    node n(0, parent->depth + 1, typeid(T).name());
/SSD/pedros-corner/tongrams/include/../external/essentials/include/essentials.hpp: In function ‘void essentials::print_size(Data&)’:
/SSD/pedros-corner/tongrams/include/../external/essentials/include/essentials.hpp:437:30: error: must #include <typeinfo> before using typeid
    sizer visitor(typeid(Data).name());
In file included from /SSD/pedros-corner/tongrams/test/test_count_model.cpp:6:0:
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp: At global scope:
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:13:45: error: ‘empty’ declared as an ‘inline’ field
    inline static const std::string empty = "";
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:13:45: error: in-class initialization of static data member ‘const string cmd_line_parser::parser::empty’ of non-literal type
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:13:45: error: call to non-constexpr function ‘std::__cxx11::basic_string<_CharT, _Traits, _Alloc>::basic_string(const _CharT*, const _Alloc&) [with _CharT = char; _Traits = std::char_traits<char>; _Alloc = std::allocator<char>]’
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp: In member function ‘T cmd_line_parser::parser::parse(const string&) const’:
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:115:12: error: expected ‘(’ before ‘constexpr’
    if constexpr (std::is_same<T, std::string>::value) {
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:117:11: error: ‘else’ without a previous ‘if’
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:117:19: error: expected ‘(’ before ‘constexpr’
    } else if constexpr (std::is_same<T, char>::value or std::is_same<T, signed char>::value or
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:120:11: error: ‘else’ without a previous ‘if’
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:120:19: error: expected ‘(’ before ‘constexpr’
    } else if constexpr (std::is_same<T, unsigned int>::value or std::is_same<T, int>::value or
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:124:11: error: ‘else’ without a previous ‘if’
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:124:19: error: expected ‘(’ before ‘constexpr’
    } else if constexpr (std::is_same<T, unsigned long int>::value or
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:129:11: error: ‘else’ without a previous ‘if’
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:129:19: error: expected ‘(’ before ‘constexpr’
    } else if constexpr (std::is_same<T, float>::value or std::is_same<T, double>::value or
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:132:11: error: ‘else’ without a previous ‘if’
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:132:19: error: expected ‘(’ before ‘constexpr’
    } else if constexpr (std::is_same<T, bool>::value) {
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:144:5: error: expected ‘}’ at end of input
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp: In instantiation of ‘T cmd_line_parser::parser::parse(const string&) const [with T = std::__cxx11::basic_string<char>; std::__cxx11::string = std::__cxx11::basic_string<char>]’:
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:104:24: required from ‘T cmd_line_parser::parser::get(const string&) const [with T = std::__cxx11::basic_string<char>; std::__cxx11::string = std::__cxx11::basic_string<char>]’
/SSD/pedros-corner/tongrams/test/test_count_model.cpp:40:69: required from here
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:144:5: warning: no return statement in function returning non-void [-Wreturn-type]
/SSD/pedros-corner/tongrams/test/../external/cmd_line_parser/include/parser.hpp:114:32: warning: unused parameter ‘value’ [-Wunused-parameter]
    T parse(std::string const& value) const {
CMakeFiles/test_count_model.dir/build.make:62: recipe for target 'CMakeFiles/test_count_model.dir/test/test_count_model.cpp.o' failed
make[2]: *** [CMakeFiles/test_count_model.dir/test/test_count_model.cpp.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/test_count_model.dir/all' failed
make[1]: *** [CMakeFiles/test_count_model.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2

I'm not sure why this is happening. This is on Ubuntu 16.04, on a server. Any ideas? Thanks.

how to use tongram in a class

Hi @jermp

I want to add an n-gram model to a class, but I don't know how to instantiate the class from the main function.

I defined class A, which contains the n-gram model:

template <typename Model>
class A {
private:
    Model* model;

public:
    A(std::string model_path = "") {
        model = new Model;
        tongrams::util::load(*model, model_path);
    }
    void fool() {
        auto state = model->state();
    }
};

but in main(), I don't know how to call class A, since I need a specific type for the model.
Do you know how to solve this?

Thank you in advance.

Implement building ngrams storage via python

Hi. I have written an abstraction layer around multiple libraries doing word splitting (londonisacapitalofgreatbritain must become london is a capital of great britain). All the libs rely on preprocessed ngrams dicts, some on unigrams, some additionally on bigrams. All of them store them very inefficiently: as a text file, one line per n-gram. For bigrams this already causes duplication.

My middleware provides a unified interface to them, and also converts their ngrams formats to each other.

I'd like to support your lib format for ngrams storage too. But it'd require some way to convert other formats into your format and back.
