tongrams_estimation's Introduction

Tongrams Estimation

Modified Kneser-Ney language model estimation powered by Tongrams.

This C++ library implements the 1-Sort algorithm described in the paper Handling Massive N-Gram Datasets Efficiently by Giulio Ermanno Pibiri and Rossano Venturini, published in ACM TOIS, 2019 [1].

Compiling the code

git clone --recursive https://github.com/jermp/tongrams_estimation.git
mkdir -p build; cd build
cmake ..
make -j

Sample usage

After installing the dependencies and compiling the code, you can use the sample text in the directory test_data (the first 1M lines of the 1Billion corpus; see the paper for dataset information). The text is gzipped, so it must first be uncompressed.

cd build
gunzip ../test_data/1Billion.1M.gz

1. Estimation

You can then estimate a Kneser-Ney language model of order 5, using 25% of the available RAM and serializing the index to the file index.bin, as follows.

./estimate ../test_data/1Billion.1M 5 --tmp tmp_dir --ram 0.25 --out index.bin

2. Computing Perplexity

With the index built and serialized to index.bin, you can compute the perplexity score with:

./external/tongrams/score index.bin ../test_data/1Billion.1M

3. Counting N-Grams

You can also extract n-gram counts. The example below does so for 3-grams.

./count ../test_data/1Billion.1M 3 --tmp tmp_dir --ram 0.25 --out 3-grams

The output file 3-grams will list all extracted 3-grams sorted lexicographically in the following standard format:

<total_number_of_rows>
<gram1> <TAB> <count1>
<gram2> <TAB> <count2>
<gram3> <TAB> <count3>
...

where each <gram> is a sequence of words separated by a whitespace character.
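
For illustration only, the first rows of a 3-grams file in this format might look as follows (the grams and counts here are made up and not taken from the sample corpus):

4
in the middle <TAB> 7
in the morning <TAB> 12
of the day <TAB> 9
on the table <TAB> 3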

Dependencies

  1. boost
  2. sparsehash

Bibliography

[1] Pibiri, Giulio Ermanno, and Rossano Venturini. "Handling Massive N-Gram Datasets Efficiently." ACM Transactions on Information Systems (TOIS) 37.2 (2019): 1-41.

tongrams_estimation's People

Contributors

anooppoommen, jermp


tongrams_estimation's Issues

Count Feature Requests

Hello, this is a great tool, and having looked through the code, it seems to be really well made.

I was trying to figure out how to modify this code base to do a few more things.

  1. Count n-grams of 1 and 2 words. I realize you've likely disabled this because it produces huge files, but if it could optionally block counts lower than 1 or some threshold N, it could be extremely useful and keep file sizes manageable.

  2. Filter out low counts when counting n-grams. Basically, this would just make the tool better at extracting word patterns while ignoring single or low counts that don't amount to a pattern. Ideally the threshold would be an input on the command line. (A possible post-processing workaround is sketched below.)

Thanks for making this great open source tool!
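
Until such an option exists, one possible workaround is to post-process the counts file produced by ./count. The sketch below is a hypothetical standalone helper (not part of tongrams_estimation) that assumes the standard output format shown earlier and keeps only the rows whose count reaches a user-given threshold.

// filter_counts.cpp -- hypothetical post-processing helper, not part of tongrams_estimation.
// Reads an n-gram counts file in the standard format (<total_number_of_rows>,
// then one "<gram>\t<count>" row per line), keeps only the rows whose count is
// at least min_count, and writes the filtered file with an updated row count to stdout.
#include <cstdint>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main(int argc, char** argv) {
    if (argc != 3) {
        std::cerr << "usage: " << argv[0] << " <counts_file> <min_count>" << std::endl;
        return 1;
    }
    std::ifstream in(argv[1]);
    if (!in) {
        std::cerr << "cannot open " << argv[1] << std::endl;
        return 1;
    }
    uint64_t min_count = std::stoull(argv[2]);

    std::string header;
    std::getline(in, header);  // first line: total number of rows (recomputed below)

    std::vector<std::string> kept;
    std::string row;
    while (std::getline(in, row)) {
        auto tab = row.rfind('\t');
        if (tab == std::string::npos) continue;  // skip malformed rows
        uint64_t count = std::stoull(row.substr(tab + 1));
        if (count >= min_count) kept.push_back(row);
    }

    std::cout << kept.size() << "\n";
    for (auto const& r : kept) std::cout << r << "\n";
    return 0;
}

It could be compiled with, e.g., g++ -std=c++11 -O2 -o filter_counts filter_counts.cpp and run as ./filter_counts 3-grams 5 > 3-grams.filtered. Buffering the kept rows keeps the sketch short; for very large files one would instead stream the input twice, or patch the header afterwards.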

Cannot create estimations for model order other than 5

Using the estimate command for any other n-gram order results in a segfault:
./estimate large_lang.txt --tmp tmp_hi --ram 0.25 --out index.bin

estimating with 268435456/17179869184 bytes of RAM (1.5625%)
{"dataset":"2021-03-15", "order":4, "RAM":268435456, "threads":12, "counting": {block size = 2236962
sorting took 0.505281 [sec]
block size = 745603
sorting took 0.128015 [sec]
        counting_writer thread stats:
        flushed blocks: 2
        O time: 0.347683
        CPU time: 0.633296
        reader thread stats:
        CPU time: 1.42818 [sec]
        I time: 0.32765 [sec]
"CPU":1.42818, "I":0.32765, "O":0.347683, "total":1.97759}, "adjusting": {vocabulary size: 94357
merging 2 files
        using min. load size of 53687088 because not enough RAM is available
num_ngrams_per_block = 2236962 ngrams
MERGE DONE: 2916244 N-grams
        time waiting for disk = 1.9108e-05 [sec]
        smoothing time: 0.090805 [sec]
        adjusting_writer thread stats:
        flushed blocks: 2
        write time: 0.180041
number of ngrams:
1-grams: 94357
2-grams: 856047
3-grams: 0
4-grams: 5021288
total num. grams: 5971692
total num. tokens: 3360614
Here is the stats
"CPU":0.249942, "I":0.0514812, "O":0.180041, "total":0.410131}, "last": {processing 2 blocks
Segmentation fault: 11



From what I can see, the number of n-grams for the (N-1)-gram order somehow becomes 0 (in the log above, an order-4 model reports 0 3-grams).
