tongrams_estimation's Introduction

Tongrams Estimation

Modified Kneser-Ney language model estimation powered by Tongrams.

This C++ library implements the 1-Sort algorithm described in the paper Handling Massive N-Gram Datasets Efficiently by Giulio Ermanno Pibiri and Rossano Venturini, published in ACM TOIS, 2019 [1].

Compiling the code

git clone --recursive https://github.com/jermp/tongrams_estimation.git
mkdir -p build; cd build
cmake ..
make -j

Sample usage

After installing the dependencies and compiling the code, you can use the sample text in the directory test_data (the first 1M lines of the 1Billion corpus; see the paper for dataset information). The text is gzipped, so it must first be uncompressed.

cd build
gunzip ../test_data/1Billion.1M.gz

1. Estimation

You can then estimate a Kneser-Ney language model of order 5, using 25% of the available RAM and serializing the index to the file index.bin, as follows.

./estimate ../test_data/1Billion.1M 5 --tmp tmp_dir --ram 0.25 --out index.bin

2. Computing Perplexity

With the index built and serialized to index.bin, you can compute the perplexity score with:

./external/tongrams/score index.bin ../test_data/1Billion.1M

3. Counting N-Grams

You can also extract n-gram counts. The example below does so for 3-grams.

./count ../test_data/1Billion.1M 3 --tmp tmp_dir --ram 0.25 --out 3-grams

The output file 3-grams will list all extracted 3-grams sorted lexicographically in the following standard format:

<total_number_of_rows>
<gram1> <TAB> <count1>
<gram2> <TAB> <count2>
<gram3> <TAB> <count3>
...

where each <gram> is a sequence of words separated by a whitespace character.
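
For illustration only, the first rows of a 3-grams file in this format might look as follows (the grams and counts here are made up and not taken from the sample corpus):

4
in the middle <TAB> 7
in the morning <TAB> 12
of the day <TAB> 9
on the table <TAB> 3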

Dependencies

  1. boost
  2. sparsehash

Bibliography

[1] Pibiri, Giulio Ermanno, and Rossano Venturini. "Handling Massive N-Gram Datasets Efficiently." ACM Transactions on Information Systems (TOIS) 37.2 (2019): 1-41.

tongrams_estimation's People

Contributors

anooppoommen, jermp


tongrams_estimation's Issues

Count Feature Requests

Hello, this is a great tool, and having looked through the code, it seems to be really well made.

I was trying to figure out how to modify this code base to do a few more things.

  1. Count n-grams of 1 and 2 words. I realize you've likely disabled this because it produces huge files, but if it could optionally block counts lower than 1 or some threshold N, it could be extremely useful and keep file sizes manageable.

  2. Filter out low counts when counting n-grams. Basically, this would just make the tool better at extracting word patterns while ignoring single or low counts that don't amount to a pattern. Ideally the threshold would be an input on the command line. (A possible post-processing workaround is sketched below.)

Thanks for making this great open source tool!
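
Until such an option exists, one possible workaround is to post-process the counts file produced by ./count. The sketch below is a hypothetical standalone helper (not part of tongrams_estimation) that assumes the standard output format shown earlier and keeps only the rows whose count reaches a user-given threshold.

// filter_counts.cpp -- hypothetical post-processing helper, not part of tongrams_estimation.
// Reads an n-gram counts file in the standard format (<total_number_of_rows>,
// then one "<gram>\t<count>" row per line), keeps only the rows whose count is
// at least min_count, and writes the filtered file with an updated row count to stdout.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " <counts_file> <min_count>" << std::endl;
        return 1;
    }
    std::ifstream in(argv[1]);
    if (!in) {
        std::cerr << "cannot open " << argv[1] << std::endl;
        return 1;
    }
    uint64_t min_count = std::stoull(argv[2]);

    std::string header;
    std::getline(in, header);  // first line: total number of rows (recomputed below)

    std::vector<std::string> kept;
    std::string row;
    while (std::getline(in, row)) {
        auto tab = row.rfind('\t');
        if (tab == std::string::npos) continue;  // skip malformed rows
        uint64_t count = std::stoull(row.substr(tab + 1));
        if (count >= min_count) kept.push_back(row);
    }

    std::cout << kept.size() << "\n";
    for (auto const& r : kept) std::cout << r << "\n";
    return 0;
}

It could be compiled with, e.g., g++ -std=c++11 -O2 -o filter_counts filter_counts.cpp and run as ./filter_counts 3-grams 5 > 3-grams.filtered. Buffering the kept rows keeps the sketch short; for very large files one would instead stream the input twice, or patch the header afterwards.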

Cannot create estimations for model order other than 5

Using the estimate command for any other n-gram order results in a segfault:
./estimate large_lang.txt --tmp tmp_hi --ram 0.25 --out index.bin

estimating with 268435456/17179869184 bytes of RAM (1.5625%)
{"dataset":"2021-03-15", "order":4, "RAM":268435456, "threads":12, "counting": {block size = 2236962
sorting took 0.505281 [sec]
block size = 745603
sorting took 0.128015 [sec]
        counting_writer thread stats:
        flushed blocks: 2
        O time: 0.347683
        CPU time: 0.633296
        reader thread stats:
        CPU time: 1.42818 [sec]
        I time: 0.32765 [sec]
"CPU":1.42818, "I":0.32765, "O":0.347683, "total":1.97759}, "adjusting": {vocabulary size: 94357
merging 2 files
        using min. load size of 53687088 because not enough RAM is available
num_ngrams_per_block = 2236962 ngrams
MERGE DONE: 2916244 N-grams
        time waiting for disk = 1.9108e-05 [sec]
        smoothing time: 0.090805 [sec]
        adjusting_writer thread stats:
        flushed blocks: 2
        write time: 0.180041
number of ngrams:
1-grams: 94357
2-grams: 856047
3-grams: 0
4-grams: 5021288
total num. grams: 5971692
total num. tokens: 3360614
Here is the stats
"CPU":0.249942, "I":0.0514812, "O":0.180041, "total":0.410131}, "last": {processing 2 blocks
Segmentation fault: 11



From what I can see, the number of n-grams for the (N-1)-gram order somehow becomes 0 (in the log above, an order-4 model reports 0 3-grams).
